{"title": "Non-Stochastic Bandit Slate Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1054, "page_last": 1062, "abstract": "We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions of the problem, and give efficient algorithms which have regret O(sqrt(T)), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recommendations for slates in every round, and give algorithms with O(sqrt(T)) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms.", "full_text": "Non-Stochastic Bandit Slate Problems\n\nSatyen Kale\n\nYahoo! Research\nSanta Clara, CA\n\nGeorgia Inst. of Technology\n\nLev Reyzin\u2217\n\nAtlanta, GA\n\nRobert E. Schapire\u2020\nPrinceton University\n\nPrinceton, NJ\n\nskale@yahoo-inc.com\n\nlreyzin@cc.gatech.edu\n\nschapire@cs.princeton.edu\n\nAbstract\n\nWe consider bandit problems, motivated by applications in online advertising and\nnews story selection, in which the learner must repeatedly select a slate, that is,\na subset of size s from K possible actions, and then receives rewards for just the\nselected actions. The goal is to minimize the regret with respect to total reward of\nthe best slate computed in hindsight. We consider unordered and ordered versions\nof the problem, and give ef\ufb01cient algorithms which have regret O(\nT ), where the\nconstant depends on the speci\ufb01c nature of the problem. 
We also consider versions of the problem where we have access to a number of policies which make recommendations for slates in every round, and give algorithms with O(\u221aT) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms.\n\n1 Introduction\n\nIn traditional bandit models, the learner is presented with a set of K actions. On each of T rounds, an adversary (or the world) first chooses rewards for each action, and afterwards the learner decides which action it wants to take. The learner then receives the reward of its chosen action, but does not see the rewards of the other actions. In the standard bandit setting, the learner\u2019s goal is to compete with the best fixed arm in hindsight. In the more general \u201cexperts setting,\u201d each of N experts recommends an arm on each round, and the goal of the learner is to perform as well as the best expert in hindsight.\n\nThe bandit setting tackles many problems where a learner\u2019s decisions reflect not only how well it performs but also the data it learns from \u2014 a good algorithm will balance exploiting actions it already knows to be good and exploring actions for which its estimates are less certain. One such real-world problem appears in computational advertising, where publishers try to present their customers with relevant advertisements. In this setting, the actions correspond to advertisements, and choosing an action means displaying the corresponding ad. The rewards correspond to the payments from the advertiser to the publisher, and these rewards depend on the probability of users clicking on the ads.\n\nUnfortunately, many real-world problems, including the computational advertising problem, do not fit so nicely into the traditional bandit framework. 
Most of the time, advertisers have the ability to display more than one ad to users, and users can click on more than one of the ads displayed to them. To capture this reality, in this paper we define the slate problem. This setting is similar to the traditional bandit setting, except that here the advertiser selects a slate, or subset, of s actions.\n\nIn this paper we first consider the unordered slate problem, where the reward to the learning algorithm is the sum of the rewards of the chosen actions in the slate. This setting is applicable when all actions in a slate are treated equally.\n\n\u2217This work was done while Lev Reyzin was at Yahoo! Research, New York. This material is based upon work supported by the National Science Foundation under Grant #0937060 to the Computing Research Association for the Computing Innovation Fellowship program.\n\u2020This work was done while R. Schapire was visiting Yahoo! Research, New York.\n\nWhile this is a realistic assumption in certain settings, we also deal with the case when different positions in a slate have different importance. Going back to our computational advertising example, we can see not all ads are given the same treatment (i.e. an ad displayed higher in a list is more likely to be clicked on). One may plausibly assume that for every ad and every position that it can be shown in, there is a click-through rate associated with the (ad, position) pair, which specifies the probability that a user will click on the ad if it is displayed in that position. This is a very general user model used widely in practice in web search engines. To abstract this, we turn to the ordered slate problem, where for each action and position in the ordering, the adversary specifies a reward for using the action in that position. 
The reward to the learner then is the sum of the rewards of the (action, position) pairs in the chosen ordered slate.\u00b9 This setting is similar to that of Gy\u00f6rgy, Linder, Lugosi and Ottucs\u00e1k [10] in that the cost of all actions in the chosen slate are revealed, rather than just the total cost of the slate.\n\nFinally, we show how to tackle these problems in the experts setting, where instead of competing with the best slate in hindsight, the algorithm competes with the best expert, recommending different slates on different rounds.\n\nOne key idea appearing in our algorithms is to use a variant of the multiplicative weights expert algorithm for a restricted convex set of distributions. In our case, the restricted set of distributions over actions corresponds to the one defined by the stipulation that the learner choose a slate instead of individual actions. Our variant first finds the distribution generated by multiplicative weights and then chooses the closest distribution in the restricted subset using relative entropy as the distance metric \u2014 this is a type of Bregman projection, which has certain nice properties for our analysis.\n\nPrevious Work. The multi-armed bandit problem, first studied by Lai and Robbins [15], is a classic problem which has had wide application. In the stochastic setting, where the rewards of the arms are i.i.d., Lai and Robbins [15] and Auer, Cesa-Bianchi and Fischer [2] gave regret bounds of O(K ln(T)). In the non-stochastic setting, Auer et al. [3] gave regret bounds of O(\u221a(K ln(K)T)).\u00b2 This non-stochastic setting of the multi-armed bandit problem is exactly the specific case of our problem when the slate size is 1, and hence our results generalize those of Auer et al., which can be recovered by setting s = 1.\n\nOur problem is a special case of the more general online linear optimization with bandit feedback problem [1, 4, 5, 11]. 
Specializing the best result in this series to our setting, we get worse regret bounds of O(\u221a(T log(T))). The constant in the O(\u00b7) notation is also worse than our bounds. For a more specific comparison of regret bounds, see Section 2. Our algorithms, being specialized for the slates problem, are simpler to implement as well, avoiding the sophisticated self-concordant barrier techniques of [1].\n\nThis work also builds upon the algorithm in [18] to learn subsets of experts and the algorithm in [12] for learning permutations, both in the full information setting. Our work is also a special case of the Combinatorial Bandits setting of Cesa-Bianchi and Lugosi [9]; however, our algorithms obtain better regret bounds and are computationally more efficient.\n\nOur multiplicative weights algorithm also appears under the name Component Hedge in the independent work of Koolen, Warmuth and Kivinen [14]. Furthermore, the expertless, unordered slate problem is studied by Uchiya, Nakamura and Kudo [17], who obtain the same asymptotic bounds as appear in this paper, though using different techniques.\n\n2 Statement of the problem and main results\n\nNotation. For vectors x, y \u2208 R^K, x \u00b7 y denotes their inner product, viz. \u2211_i x_i y_i. For matrices X, Y \u2208 R^{s\u00d7K}, X \u2022 Y denotes their inner product considering them vectors in R^{sK}, viz.\n\n\u00b9The unordered slate problem is a special case of the ordered slate problem for which all positional factors are equal. However, the bound on the regret that we get when we consider the unordered slate problem separately is a factor of \u00d5(\u221as) better than when we treat it as a special case of the ordered slate problem.\n\u00b2The difference in the regret bounds can be attributed to the definition of regret in the stochastic and non-stochastic settings. 
In the stochastic setting, we compare the algorithm\u2019s expected reward to that of the arm with the largest expected reward, with the expectation taken over the reward distribution.\n\n\u2211_{ij} X_{ij} Y_{ij}. For a set S of actions, let 1_S be the indicator vector for that set. For two distributions p and q, let RE(p \u2016 q) denote their relative entropy, i.e. RE(p \u2016 q) = \u2211_i p_i ln(p_i/q_i).\n\nProblem Statement. In a sequence of rounds, for t = 1, 2, . . . , T, we are required to choose a slate from a base set A of K actions. An unordered slate is a subset S \u2286 A of s out of the K actions. An ordered slate is a slate together with an ordering over its s actions; thus, it is a one-to-one mapping \u03c0 : {1, 2, . . . , s} \u2192 A. Prior to the selection of the slate, the adversary chooses losses\u00b3 for the actions in the slates. Once the slate is chosen, the cost of only the actions in the chosen slate is revealed. This cost is defined in the following manner:\n\n\u2022 Unordered slate. The adversary chooses a loss vector \u2113(t) \u2208 R^K which specifies a loss \u2113_j(t) \u2208 [\u22121, 1] for every action j \u2208 A. For a chosen slate S, only the coordinates \u2113_j(t) for j \u2208 S are revealed, and the cost incurred for choosing S is \u2211_{j \u2208 S} \u2113_j(t).\n\n\u2022 Ordered slate. The adversary chooses a loss matrix L(t) \u2208 R^{s\u00d7K} which specifies a loss L_{ij}(t) \u2208 [\u22121, 1] for every action j \u2208 A and every position i, 1 \u2264 i \u2264 s, in the ordering on the slate. For a chosen slate \u03c0, the entries L_{i,\u03c0(i)}(t) for every position i are revealed, and the cost incurred for choosing \u03c0 is \u2211_{i=1}^{s} L_{i,\u03c0(i)}(t).\n\nIn the unordered slate problem, if slate S(t) is chosen in round t, for t = 1, 2, . . . , T, then the regret of the algorithm is defined to be\n\nRegret_T = \u2211_{t=1}^{T} \u2211_{j \u2208 S(t)} \u2113_j(t) \u2212 min_S \u2211_{t=1}^{T} \u2211_{j \u2208 S} \u2113_j(t).\n\nHere, the subscript S is used as a shorthand for ranging over all slates S. The regret for the ordered slate problem is defined analogously.\n\nOur goal is to design a randomized algorithm for online slate selection such that E[Regret_T] = o(T), where the expectation is taken over the internal randomization of the algorithm.\n\nCompeting with policies. Frequently in applications we have access to N policies, which are algorithms that recommend slates to use in every round. These policies might leverage extra information that we have about the losses in the next round. It is therefore beneficial to devise algorithms that have low regret with respect to the best policy in the pool in hindsight, where regret is defined as:\n\nRegret_T = \u2211_{t=1}^{T} \u2211_{j \u2208 S(t)} \u2113_j(t) \u2212 min_\u03c1 \u2211_{t=1}^{T} \u2211_{j \u2208 S_\u03c1(t)} \u2113_j(t).\n\nHere, \u03c1 ranges over all policies, S_\u03c1(t) is the recommendation of policy \u03c1 at time t, and S(t) is the algorithm\u2019s chosen slate. The regret is defined analogously for ordered slates. 
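Since the loss of an unordered slate is just the sum of its actions' losses, the best slate in hindsight is simply the set of s actions with the smallest cumulative loss. A small sketch of the regret computation above (function and variable names are our own, not the paper's):

```python
from itertools import combinations

def unordered_slate_regret(losses, chosen, s):
    """Regret of the chosen slates against the best fixed slate of size s
    in hindsight.  losses[t][j] is the loss of action j in round t, and
    chosen[t] is the slate (a set of s actions) played in round t."""
    T, K = len(losses), len(losses[0])
    incurred = sum(losses[t][j] for t in range(T) for j in chosen[t])
    # Slate losses are additive, so the best slate in hindsight is the
    # s actions with smallest total loss; brute force over all C(K, s)
    # slates agrees with the sorted-totals shortcut.
    totals = [sum(losses[t][j] for t in range(T)) for j in range(K)]
    best = min(sum(totals[j] for j in slate)
               for slate in combinations(range(K), s))
    assert best == sum(sorted(totals)[:s])
    return incurred - best
```

For instance, with two rounds, three actions, and slates of size two, the function compares the algorithm's accumulated loss to that of the two actions with smallest totals.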
More generally, we may allow policies to recommend distributions over slates, and our goal is to minimize the expected regret with respect to the best policy in hindsight, where the expectation is taken over the distribution recommended by the policy as well as the internal randomization of the algorithm.\n\nOur results. We are now able to formally state our main results:\n\nTheorem 2.1. There are efficient (running in poly(s, K) time in the no-policies case, and in poly(s, K, N) time with N policies) randomized algorithms achieving the following regret bounds:\n\n                Unordered slates                   Ordered slates\nNo policies     4\u221a(sK ln(K/s)T)  (Sec. 3.2)       4s\u221a(K ln(K)T)  (Sec. 3.3)\nN policies      4\u221a(sK ln(N)T)  (Sec. 4.1)         4s\u221a(K ln(N)T)  (Sec. 4.2)\n\nTo compare, the best bounds obtained for the no-policies case using the more general algorithms [1] and [9] are O(\u221a(s\u00b3K ln(K/s)T)) in the unordered slates problem, and O(s\u00b2\u221a(K ln(K)T)) in the ordered slates problem. It is also possible, in the no-policies setting, to devise algorithms that have regret bounded by O(\u221aT) with high probability, using the upper confidence bounds technique of [3]. We omit these algorithms in this paper for the sake of brevity.\n\n\u00b3Note that we switch to losses rather than rewards to be consistent with most recent literature on online learning. Since we allow negative losses, we can easily deal with rewards as well.\n\nAlgorithm MW(P)\nInitialization: An arbitrary probability distribution p(1) \u2208 P on the experts, and some \u03b7 > 0.\nFor t = 1, 2, . . . , T:\n1. Choose distribution p(t) over experts, and observe the cost vector \u2113(t).\n2. Compute the probability vector p\u0302(t + 1) using the following multiplicative update rule: for every expert i,\n\np\u0302_i(t + 1) = p_i(t) exp(\u2212\u03b7\u2113_i(t))/Z(t),    (1)\n\nwhere Z(t) = \u2211_i p_i(t) exp(\u2212\u03b7\u2113_i(t)) is the normalization factor.\n3. Set p(t + 1) to be the projection of p\u0302(t + 1) on the set P using RE as a distance function, i.e. p(t + 1) = arg min_{p \u2208 P} RE(p \u2016 p\u0302(t + 1)).\nFigure 1: The Multiplicative Weights Algorithm with Restricted Distributions\n\n3 Algorithms for the slate problems with no policies\n\n3.1 Main algorithmic ideas\n\nOur starting point is the Hedge algorithm for learning online with expert advice. In this setting, on each round t, the learner chooses a probability distribution p(t) over experts, each of which then suffers a (fully observable) loss represented by the vector \u2113(t). The learner\u2019s loss is then p(t) \u00b7 \u2113(t).\n\nThe main idea of our approach is to apply Hedge (and ideas from bandit variants of it, especially Exp3 [3]) by associating the probability distributions that it selects with mixtures of (ordered or unordered) slates, and thus with the randomized choice of a slate. However, this requires that the selected probability distributions have a particular form, which we describe shortly. We therefore need a special variant of Hedge which uses only distributions p(t) from some fixed convex subset P of the simplex of all distributions. The goal then is to minimize regret relative to an arbitrary distribution p \u2208 P. Such a version of Hedge is given in Figure 1, and a statement of its performance below. This algorithm is implicit in the work of [13, 18].\n\nTheorem 3.1. Assume that \u03b7 > 0 is chosen so that for all t and i, \u03b7\u2113_i(t) \u2265 \u22121. Then algorithm MW(P) generates distributions p(1), . . . , p(T) \u2208 P, such that for any p \u2208 P,\n\n\u2211_{t=1}^{T} \u2113(t) \u00b7 p(t) \u2212 \u2211_{t=1}^{T} \u2113(t) \u00b7 p \u2264 \u03b7 \u2211_{t=1}^{T} (\u2113(t))\u00b2 \u00b7 p(t) + RE(p \u2016 p(1))/\u03b7.\n\nHere, (\u2113(t))\u00b2 is the vector that is the coordinate-wise square of \u2113(t).\n\n3.2 Unordered slates with no policies\n\nTo apply the approach described above, we need a way to compactly represent the set of distributions over slates. We do this by embedding slates as points in some high-dimensional Euclidean space, and then giving a compact representation of the convex hull of the embedded points. Specifically, we represent an unordered slate S by its indicator vector 1_S \u2208 R^K, which is 1 for all coordinates j \u2208 S, and 0 for all others. The convex hull X of all such 1_S vectors can be succinctly described [18] as the convex polytope defined by the linear constraints \u2211_{j=1}^{K} x_j = s and 0 \u2264 x_j \u2264 1 for j = 1, . . . , K. An algorithm is given in [18] (Algorithm 2) to decompose any vector x \u2208 X into a convex combination of at most K indicator vectors 1_S. We embed the convex hull X of all the 1_S vectors in the simplex of distributions over the K actions simply by scaling down all coordinates by s so that they sum to 1. Let P be this scaled down version of X. Our algorithm is given in Figure 2.\n\nStep 3 of MW(P) requires us to compute arg min_{p \u2208 P} RE(p \u2016 p\u0302(t + 1)), which can be solved by convex programming. A linear time algorithm is given in [13], and a simple algorithm (from [18]) is the following: find the least index k such that clipping the largest k coordinates of p to 1/s and rescaling the rest of the coordinates to sum up to 1 \u2212 k/s ensures that all coordinates are at most 1/s, and output the probability vector thus obtained. 
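A minimal sketch of this clip-and-rescale projection (our own naming; it operates on a plain Python list and assumes, as in the algorithm, that the input is a probability vector):

```python
def project_capped(p, s):
    """Project a distribution p onto {q : sum(q) = 1, q_j <= 1/s for all j}
    via the scheme described above: clip the k largest coordinates to 1/s
    and rescale the rest to sum to 1 - k/s, for the least k that leaves
    every coordinate at most 1/s."""
    K, cap = len(p), 1.0 / s
    order = sorted(range(K), key=lambda j: -p[j])  # indices, largest first
    for k in range(K + 1):
        q = list(p)
        for j in order[:k]:              # clip the k largest coordinates
            q[j] = cap
        rest = order[k:]
        rest_mass = sum(p[j] for j in rest)
        if rest and rest_mass > 0:
            scale = (1.0 - k / s) / rest_mass
            for j in rest:               # rescale the rest to sum to 1 - k/s
                q[j] = p[j] * scale
        if max(q) <= cap + 1e-12:        # least such k: done
            return q
```

The dominant cost is the initial sort, matching the O(K log K) running time noted in the text.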
This can be implemented by sorting the coordinates, and so it takes O(K log(K)) time.\n\nBandit Algorithm for Unordered Slates\nInitialization: Start an instance of MW(P) with the uniform initial distribution p(1) = (1/K)1. Set \u03b7 = \u221a((1 \u2212 \u03b3)s ln(K/s)/(KT)) and \u03b3 = \u221a((K/s) ln(K/s)/T).\nFor t = 1, 2, . . . , T:\n1. Obtain the distribution p(t) from MW(P).\n2. Set p\u2032(t) = (1 \u2212 \u03b3)p(t) + (\u03b3/K)1_A.\n3. Note that p\u2032(t) \u2208 P. Decompose sp\u2032(t) as a convex combination of slate vectors 1_S corresponding to slates S as sp\u2032(t) = \u2211_S q_S 1_S, where q_S > 0 and \u2211_S q_S = 1.\n4. Choose a slate S to display with probability q_S, and obtain the loss \u2113_j(t) for all j \u2208 S.\n5. Set \u2113\u0302_j(t) = \u2113_j(t)/(sp\u2032_j(t)) if j \u2208 S, and 0 otherwise.\n6. Send \u2113\u0302(t) as the loss vector to MW(P).\nFigure 2: The Bandit Algorithm with Unordered Slates\n\nWe now prove the regret bound of Theorem 2.1. We use the notation E_t[X] to denote the expectation of a random variable X conditioned on all the randomness chosen by the algorithm up to round t, assuming that X is measurable with respect to this randomness. We note the following facts: E_t[\u2113\u0302_j(t)] = \u2211_{S \u220b j} q_S \u00b7 \u2113_j(t)/(sp\u2032_j(t)) = \u2113_j(t), since p\u2032_j(t) = \u2211_{S \u220b j} q_S \u00b7 (1/s). This immediately implies that E_t[\u2113\u0302(t) \u00b7 p(t)] = \u2113(t) \u00b7 p(t) and E[\u2113\u0302(t) \u00b7 p] = \u2113(t) \u00b7 p, for any fixed distribution p.\n\nNote that if we decompose a distribution p \u2208 P as a convex combination of (1/s)1_S vectors and randomly choose a slate S according to its weight in the combination, then the expected loss, averaged over the s actions chosen, is \u2113(t) \u00b7 p. We can bound the difference between the expected loss (averaged over the s actions) in round t suffered by the algorithm, \u2113(t) \u00b7 p\u2032(t), and \u2113(t) \u00b7 p(t) as follows:\n\n\u2113(t) \u00b7 p\u2032(t) \u2212 \u2113(t) \u00b7 p(t) = \u2211_j \u2113_j(t)(p\u2032_j(t) \u2212 p_j(t)) \u2264 \u2211_j \u2113_j(t) \u00b7 (\u03b3/K) \u2264 \u03b3.\n\nUsing this bound and Theorem 3.1, if S\u22c6 = arg min_S \u2211_t \u2113(t) \u00b7 (1/s)1_S, we have\n\nE[Regret_T]/s = \u2211_t \u2113(t) \u00b7 p\u2032(t) \u2212 \u2113(t) \u00b7 (1/s)1_{S\u22c6} \u2264 \u03b7 \u2211_t E[(\u2113\u0302(t))\u00b2 \u00b7 p(t)] + RE((1/s)1_{S\u22c6} \u2016 p(1))/\u03b7 + \u03b3T.\n\nWe note that the leading factor of 1/s on the expected regret is due to the averaging over the s positions. We now bound the terms on the RHS. First, we have\n\nE_t[(\u2113\u0302(t))\u00b2 \u00b7 p(t)] = \u2211_S q_S [\u2211_{j \u2208 S} (\u2113_j(t))\u00b2 p_j(t)/(sp\u2032_j(t))\u00b2] = \u2211_j [(\u2113_j(t))\u00b2 p_j(t)/(sp\u2032_j(t))\u00b2] \u00b7 \u2211_{S \u220b j} q_S = \u2211_j [(\u2113_j(t))\u00b2 p_j(t)/(sp\u2032_j(t))\u00b2] \u00b7 sp\u2032_j(t) \u2264 K/(s(1 \u2212 \u03b3)),\n\nbecause p_j(t)/p\u2032_j(t) \u2264 1/(1 \u2212 \u03b3), and all |\u2113_j(t)| \u2264 1. Finally, we have RE((1/s)1_{S\u22c6} \u2016 p(1)) = ln(K/s). Plugging these bounds in, we get the stated regret bound from Theorem 2.1:\n\nE[Regret_T] \u2264 \u03b7KT/(1 \u2212 \u03b3) + s ln(K/s)/\u03b7 + s\u03b3T \u2264 4\u221a(sK ln(K/s)T),\n\nby setting \u03b7 = \u221a((1 \u2212 \u03b3)s ln(K/s)/(KT)) and \u03b3 = \u221a((K/s) ln(K/s)/T). It remains to verify that \u03b7\u2113\u0302_j(t) \u2265 \u22121 for all j and t. We know that \u2113\u0302_j(t) \u2265 \u2212K/(s\u03b3), because p\u2032_j(t) \u2265 \u03b3/K, so all we need to check is that \u03b7 \u2264 s\u03b3/K, which is true for our choice of \u03b3.\n\nBandit Algorithm for Ordered Slates\nInitialization: Start an instance of MW(P) with the uniform initial distribution p(1) = (1/(sK))1. Set \u03b7 = \u221a((1 \u2212 \u03b3) ln(K)/(KT)) and \u03b3 = \u221a(K ln(K)/T).\nFor t = 1, 2, . . . , T:\n1. Obtain the distribution p(t) from MW(P).\n2. Set p\u2032(t) = (1 \u2212 \u03b3)p(t) + (\u03b3/(sK))1_A.\n3. Note that p\u2032(t) \u2208 P, and so sp\u2032(t) \u2208 M. Decompose sp\u2032(t) as a convex combination of M_\u03c0 matrices corresponding to ordered slates \u03c0 as sp\u2032(t) = \u2211_\u03c0 q_\u03c0 M_\u03c0, where q_\u03c0 > 0 and \u2211_\u03c0 q_\u03c0 = 1.\n4. Choose a slate \u03c0 to display w.p. q_\u03c0, and obtain the loss L_{i,\u03c0(i)}(t) for all 1 \u2264 i \u2264 s.\n5. Construct the loss matrix L\u0302(t) as follows: for 1 \u2264 i \u2264 s, set L\u0302_{i,\u03c0(i)}(t) = L_{i,\u03c0(i)}(t)/(sp\u2032_{i,\u03c0(i)}(t)), and all other entries are 0.\n6. Send L\u0302(t) as the loss vector to MW(P).\nFigure 3: Bandit Algorithm for Ordered Slates\n\n3.3 Ordered slates with no policies\n\nA similar approach can be used for ordered slates. Here, we represent an ordered slate \u03c0 by the subpermutation matrix M_\u03c0 \u2208 R^{s\u00d7K}, which is defined as follows: for i = 1, 2, . . . , s, we have M^\u03c0_{i,\u03c0(i)} = 1, and all other entries are 0. In [7, 16], it is shown that the convex hull M of all the M_\u03c0 matrices is the convex polytope defined by the linear constraints: \u2211_{j=1}^{K} M_{ij} = 1 for i = 1, . . . , s; \u2211_{i=1}^{s} M_{ij} \u2264 1 for j = 1, . . . , K; and M_{ij} \u2265 0 for i = 1, . . . , s and j = 1, . . . , K. Clearly, all subpermutation matrices M_\u03c0 \u2208 M. 
To complete the characterization of the convex hull, we can show (details omitted) that given any matrix M \u2208 M, we can efficiently decompose it into a convex combination of at most K\u00b2 subpermutation matrices.\n\nWe identify matrices in R^{s\u00d7K} with vectors in R^{sK} in the obvious way. We embed M in the simplex of distributions in R^{sK} simply by scaling all the entries down by s so that their sum equals one. Let P be this scaled down version of M. Our algorithm is given in Figure 3.\n\nThe projection in step 3 of MW(P) can be computed simply by solving the convex program. In practice, however, noticing that the relative entropy projection is a Bregman projection, the cyclic projections method of Bregman [6, 8] is likely to work faster. Adapted to the specific problem at hand, this method works as follows (see [8] for details): first, for every column j, initialize a dual variable \u03bb_j = 1. Then, alternate between row phases and column phases. In a row phase, iterate over all rows, and rescale them to make them sum to 1/s. The column phase is a little more complicated: first, for every column j, compute the scaling factor \u03b1 to make it sum to 1/s. Set \u03b1\u2032 = min{\u03bb_j, \u03b1}, scale the column by \u03b1\u2032, and update \u03bb_j \u2190 \u03bb_j/\u03b1\u2032. Repeat these alternating row and column phases until convergence to within the desired tolerance.\n\nThe regret bound analysis is similar to that of Section 3.2. We have E_t[L\u0302_{ij}(t)] = \u2211_{\u03c0:\u03c0(i)=j} q_\u03c0 \u00b7 L_{ij}(t)/(sp\u2032_{ij}(t)) = L_{ij}(t), and hence E_t[L\u0302(t) \u2022 p(t)] = L(t) \u2022 p(t) and E[L\u0302(t) \u2022 p] = L(t) \u2022 p. 
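The alternating row/column scheme just described can be sketched as follows (our own function name, a fixed iteration count in place of a convergence tolerance, and plain nested lists; a sketch for intuition rather than the paper's implementation):

```python
def project_cyclic(M, s, iters=500):
    """Approximate relative-entropy (Bregman) projection of a strictly
    positive s x K matrix onto the scaled polytope where every row sums
    to 1/s and every column sums to at most 1/s, via the alternating
    row/column phases with one dual variable per column described above."""
    rows, K = len(M), len(M[0])
    M = [row[:] for row in M]
    lam = [1.0] * K                      # dual variable per column constraint
    for _ in range(iters):
        for i in range(rows):            # row phase: make each row sum to 1/s
            r = sum(M[i])
            M[i] = [x / (s * r) for x in M[i]]
        for j in range(K):               # column phase
            c = sum(M[i][j] for i in range(rows))
            alpha = 1.0 / (s * c)        # factor that would make the sum 1/s
            a = min(lam[j], alpha)       # clip by the dual variable
            for i in range(rows):
                M[i][j] *= a
            lam[j] /= a
    return M
```

Only columns that were previously scaled down (dual variable above 1) are ever scaled back up, which is how the method enforces the inequality rather than equality constraint on column sums.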
We can show also that L(t) \u2022 p\u2032(t) \u2212 L(t) \u2022 p(t) \u2264 \u03b3.\n\nUsing this bound and Theorem 3.1, if \u03c0\u22c6 = arg min_\u03c0 \u2211_t L(t) \u2022 (1/s)M_\u03c0, we have\n\nE[Regret_T]/s = \u2211_t L(t) \u2022 p\u2032(t) \u2212 L(t) \u2022 (1/s)M_{\u03c0\u22c6} \u2264 \u03b7 \u2211_t E[(L\u0302(t))\u00b2 \u2022 p(t)] + RE((1/s)M_{\u03c0\u22c6} \u2016 p(1))/\u03b7 + \u03b3T.\n\nWe now bound the terms on the RHS. First, we have\n\nE_t[(L\u0302(t))\u00b2 \u2022 p(t)] = \u2211_\u03c0 q_\u03c0 [\u2211_{i=1}^{s} (L_{i,\u03c0(i)}(t))\u00b2 p_{i,\u03c0(i)}(t)/(sp\u2032_{i,\u03c0(i)}(t))\u00b2] = \u2211_{i=1}^{s} \u2211_{j=1}^{K} [(L_{ij}(t))\u00b2 p_{ij}(t)/(sp\u2032_{ij}(t))\u00b2] \u00b7 \u2211_{\u03c0:\u03c0(i)=j} q_\u03c0 = \u2211_{i=1}^{s} \u2211_{j=1}^{K} [(L_{ij}(t))\u00b2 p_{ij}(t)/(sp\u2032_{ij}(t))\u00b2] \u00b7 sp\u2032_{ij}(t) \u2264 K/(1 \u2212 \u03b3),\n\nbecause p_{ij}(t)/p\u2032_{ij}(t) \u2264 1/(1 \u2212 \u03b3), and all |L_{ij}(t)| \u2264 1. Finally, we have RE((1/s)M_{\u03c0\u22c6} \u2016 p(1)) = ln(K). Plugging these bounds into the bound of Theorem 3.1, we get the stated regret bound from Theorem 2.1:\n\nE[Regret_T] \u2264 \u03b7sKT/(1 \u2212 \u03b3) + s ln(K)/\u03b7 + s\u03b3T \u2264 4s\u221a(K ln(K)T),\n\nby setting \u03b7 = \u221a((1 \u2212 \u03b3) ln(K)/(KT)) and \u03b3 = \u221a(K ln(K)/T), which satisfy the necessary technical conditions.\n\n4 Competing with a set of policies\n\n4.1 Unordered Slates with N Policies\n\nIn each round, every policy \u03c1 recommends a distribution over slates \u03c6_\u03c1(t) \u2208 P, where P is X scaled down by s as in Section 3.2. Our algorithm is given in Figure 4.\n\nBandit Algorithm for Unordered Slates With Policies\nInitialization: Start an instance of MW with no restrictions over the set of distributions over the N policies, with the initial distribution r(1) = (1/N)1. Set \u03b7 = \u221a((1 \u2212 \u03b3)s ln(N)/(KT)) and \u03b3 = \u221a((K/s) ln(N)/T).\nFor t = 1, 2, . . . , T:\n1. Obtain the distribution over policies r(t) from MW, and the recommended distribution over slates \u03c6_\u03c1(t) \u2208 P for each policy \u03c1.\n2. Compute the distribution p(t) = \u2211_{\u03c1=1}^{N} r_\u03c1(t)\u03c6_\u03c1(t).\n3. Set p\u2032(t) = (1 \u2212 \u03b3)p(t) + (\u03b3/K)1.\n4. Note that p\u2032(t) \u2208 P. Decompose sp\u2032(t) as a convex combination of slate vectors 1_S corresponding to slates S as sp\u2032(t) = \u2211_S q_S 1_S, where q_S > 0 and \u2211_S q_S = 1.\n5. Choose a slate S to display with probability q_S, and obtain the loss \u2113_j(t) for all j \u2208 S.\n6. Set \u2113\u0302_j(t) = \u2113_j(t)/(sp\u2032_j(t)) if j \u2208 S, and 0 otherwise.\n7. Set the loss of policy \u03c1 to be \u03bb_\u03c1(t) = \u2113\u0302(t) \u00b7 \u03c6_\u03c1(t) in the MW algorithm.\nFigure 4: Bandit Algorithm for Unordered Slates With Policies\n\nAgain the regret bound analysis is along the lines of Section 3.2. We have for any j, E_t[\u2113\u0302_j(t)] = \u2211_{S \u220b j} q_S \u00b7 \u2113_j(t)/(sp\u2032_j(t)) = \u2113_j(t). Thus, E_t[\u03bb_\u03c1(t)] = \u2113(t) \u00b7 \u03c6_\u03c1(t), and hence E_t[\u03bb(t) \u00b7 r(t)] = \u2211_\u03c1 (\u2113(t) \u00b7 \u03c6_\u03c1(t))r_\u03c1(t) = \u2113(t) \u00b7 p(t). We can also show as before that \u2113(t) \u00b7 p\u2032(t) \u2212 \u2113(t) \u00b7 p(t) \u2264 \u03b3.\n\nUsing this bound and Theorem 3.1, if \u03c1\u22c6 = arg min_\u03c1 \u2211_t \u2113(t) \u00b7 \u03c6_\u03c1(t), we have\n\nE[Regret_T]/s = \u2211_t \u2113(t) \u00b7 p\u2032(t) \u2212 \u2113(t) \u00b7 \u03c6_{\u03c1\u22c6}(t) \u2264 \u03b7 \u2211_t E[(\u03bb(t))\u00b2 \u00b7 r(t)] + RE(e_{\u03c1\u22c6} \u2016 r(1))/\u03b7 + \u03b3T,\n\nwhere e_{\u03c1\u22c6} is the distribution (vector) that is concentrated entirely on policy \u03c1\u22c6. We now bound the terms on the RHS. First, we have\n\nE_t[(\u03bb(t))\u00b2 \u00b7 r(t)] = E_t[\u2211_\u03c1 \u03bb_\u03c1(t)\u00b2 r_\u03c1(t)] = E_t[\u2211_\u03c1 (\u2113\u0302(t) \u00b7 \u03c6_\u03c1(t))\u00b2 r_\u03c1(t)] \u2264 E_t[\u2211_\u03c1 ((\u2113\u0302(t))\u00b2 \u00b7 \u03c6_\u03c1(t))r_\u03c1(t)] = E_t[(\u2113\u0302(t))\u00b2 \u00b7 p(t)] \u2264 K/(s(1 \u2212 \u03b3)).\n\nBandit Algorithm for Ordered Slates with Policies\nInitialization: Start an instance of MW with no restrictions, over the set of distributions over the N policies, starting with r(1) = (1/N)1. Set \u03b7 = \u221a((1 \u2212 \u03b3) ln(N)/(KT)) and \u03b3 = \u221a(K ln(N)/T).\nFor t = 1, 2, . . . , T:\n1. Obtain the distribution over policies r(t) from MW, and the recommended distribution over ordered slates \u03c6_\u03c1(t) \u2208 P for each policy \u03c1.\n2. Compute the distribution p(t) = \u2211_{\u03c1=1}^{N} r_\u03c1(t)\u03c6_\u03c1(t).\n3. Set p\u2032(t) = (1 \u2212 \u03b3)p(t) + (\u03b3/(sK))1_A.\n4. Note that p\u2032(t) \u2208 P, and so sp\u2032(t) \u2208 M. Decompose sp\u2032(t) as a convex combination of M_\u03c0 matrices corresponding to ordered slates \u03c0 as sp\u2032(t) = \u2211_\u03c0 q_\u03c0 M_\u03c0, where q_\u03c0 > 0 and \u2211_\u03c0 q_\u03c0 = 1.\n5. Choose a slate \u03c0 to display w.p. q_\u03c0, and obtain the loss L_{i,\u03c0(i)}(t) for all 1 \u2264 i \u2264 s.\n6. 
Construct the loss matrix L\u0302(t) as follows: for 1 \u2264 i \u2264 s, set L\u0302_{i,\u03c0(i)}(t) = L_{i,\u03c0(i)}(t)/(sp\u2032_{i,\u03c0(i)}(t)), and all other entries are 0.\n7. Set the loss of policy \u03c1 to be \u03bb_\u03c1(t) = L\u0302(t) \u2022 \u03c6_\u03c1(t) in the MW algorithm.\nFigure 5: Bandit Algorithm for Ordered Slates with Policies\n\nThe first inequality above follows from Jensen\u2019s inequality, and the second one is proved exactly as in Section 3.2. Finally, we have RE(e_{\u03c1\u22c6} \u2016 r(1)) = ln(N). Plugging these bounds into the bound above, we get the stated regret bound from Theorem 2.1:\n\nE[Regret_T] \u2264 \u03b7KT/(1 \u2212 \u03b3) + s ln(N)/\u03b7 + s\u03b3T \u2264 4\u221a(sK ln(N)T),\n\nby setting \u03b7 = \u221a((1 \u2212 \u03b3)s ln(N)/(KT)) and \u03b3 = \u221a((K/s) ln(N)/T), which satisfy the necessary technical conditions.\n\n4.2 Ordered Slates with N Policies\n\nIn each round, every policy \u03c1 recommends a distribution over ordered slates \u03c6_\u03c1(t) \u2208 P, where P is M scaled down by s as in Section 3.3. Our algorithm is given in Figure 5.\n\nThe regret bound analysis is exactly along the lines of that in Section 4.1, with L(t) and L\u0302(t) playing the roles of \u2113(t) and \u2113\u0302(t) respectively, with the inequalities from Section 3.3. We omit the details for brevity. We get the stated regret bound from Theorem 2.1:\n\nE[Regret_T] \u2264 4s\u221a(K ln(N)T).\n\n5 Conclusions and Future Work\n\nIn this paper, we presented efficient algorithms for the unordered and ordered slate problems with regret bounds of O(\u221aT), in the presence and absence of policies, employing the technique of Bregman projections on a convex set representing the convex hull of slate vectors.\n\nPossible future work on this problem is in two directions. 
The first direction is to handle other user models for the loss matrices, such as models incorporating the following sort of interaction between the chosen actions: if two very similar ads are shown, and the user clicks on one, then the user is less likely to click on the other. Our current model essentially assumes no interaction.
The second direction is to derive high-probability $O(\sqrt{T})$ regret bounds for the slate problems in the presence of policies. The techniques of [3] only give such algorithms in the no-policies setting.

References
[1] ABERNETHY, J., HAZAN, E., AND RAKHLIN, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT (2008), pp. 263–274.
[2] AUER, P., CESA-BIANCHI, N., AND FISCHER, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235–256.
[3] AUER, P., CESA-BIANCHI, N., FREUND, Y., AND SCHAPIRE, R. E. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32, 1 (2002), 48–77.
[4] AWERBUCH, B., AND KLEINBERG, R. Online linear optimization and adaptive routing. J. Comput. Syst. Sci. 74, 1 (2008), 97–114.
[5] BARTLETT, P. L., DANI, V., HAYES, T. P., KAKADE, S., RAKHLIN, A., AND TEWARI, A. High-probability regret bounds for bandit online linear optimization. In COLT (2008), pp. 335–342.
[6] BREGMAN, L. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Mathematics and Mathematical Physics 7 (1967), 200–217.
[7] BRUALDI, R. A., AND LEE, G. M. On the truncated assignment polytope. Linear Algebra and its Applications 19 (1978), 33–62.
[8] CENSOR, Y., AND ZENIOS, S. Parallel optimization. Oxford University Press, 1997.
[9] CESA-BIANCHI, N., AND LUGOSI, G. Combinatorial bandits. In COLT (2009).
[10] GYÖRGY, A., LINDER, T., LUGOSI, G., AND OTTUCSÁK, G.
The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research 8 (2007), 2369–2403.
[11] HAZAN, E., AND KALE, S. Better algorithms for benign bandits. In SODA (2009), pp. 38–47.
[12] HELMBOLD, D. P., AND WARMUTH, M. K. Learning permutations with exponential weights. In COLT (2007), pp. 469–483.
[13] HERBSTER, M., AND WARMUTH, M. K. Tracking the best linear predictor. Journal of Machine Learning Research 1 (2001), 281–309.
[14] KOOLEN, W. M., WARMUTH, M. K., AND KIVINEN, J. Hedging structured concepts. In COLT (2010).
[15] LAI, T., AND ROBBINS, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1985), 4–22.
[16] MENDELSOHN, N. S., AND DULMAGE, A. L. The convex hull of sub-permutation matrices. Proceedings of the American Mathematical Society 9, 2 (Apr 1958), 253–254.
[17] UCHIYA, T., NAKAMURA, A., AND KUDO, M. Algorithms for adversarial bandit problems with multiple plays. In ALT (2010), pp. 375–389.
[18] WARMUTH, M. K., AND KUZMIN, D. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Proc. of NIPS (2006).