{"title": "Multiple-Play Bandits in the Position-Based Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1597, "page_last": 1605, "abstract": "Sequentially learning to place items in multi-position displays or lists is a task that can be cast into the multiple-play semi-bandit setting. However, a major concern in this context is when the system cannot decide whether the user feedback for each item is actually exploitable. Indeed, much of the content may have been simply ignored by the user. The present work proposes to exploit available information regarding the display position bias under the so-called Position-based click model (PBM). We first discuss how this model differs from the Cascade model and its variants considered in several recent works on multiple-play bandits. We then provide a novel regret lower bound for this model as well as computationally efficient algorithms that display good empirical and theoretical performance.", "full_text": "Multiple-Play Bandits in the Position-Based Model\n\nPaul Lagr\u00e9e\u2217\n\nLRI, Universit\u00e9 Paris Sud\nUniversit\u00e9 Paris Saclay\n\npaul.lagree@u-psud.fr\n\nClaire Vernade\u2217\n\nLTCI, CNRS, T\u00e9l\u00e9com ParisTech\n\nUniversit\u00e9 Paris Saclay\n\nvernade@enst.fr\n\nOlivier Capp\u00e9\nLTCI, CNRS\n\nT\u00e9l\u00e9com ParisTech\n\nUniversit\u00e9 Paris Saclay\n\nAbstract\n\nSequentially learning to place items in multi-position displays or lists is a task that\ncan be cast into the multiple-play semi-bandit setting. However, a major concern in\nthis context is when the system cannot decide whether the user feedback for each\nitem is actually exploitable. Indeed, much of the content may have been simply\nignored by the user. The present work proposes to exploit available information\nregarding the display position bias under the so-called Position-based click model\n(PBM). 
We first discuss how this model differs from the Cascade model and its variants considered in several recent works on multiple-play bandits. We then provide a novel regret lower bound for this model as well as computationally efficient algorithms that display good empirical and theoretical performance.

1 Introduction

During their browsing experience, users are constantly provided – without having asked for it – with clickable content spread over web pages. While users interact on a website, they send clicks to the system for only a very limited selection of the clickable content. Every unclicked item is thus left with an equivocal status: the system does not know whether the content was really deemed irrelevant or simply ignored. In contrast, in traditional multi-armed bandit (MAB) models, the learner takes actions and observes at each round the reward corresponding to the chosen action. In the so-called multiple-play semi-bandit setting, when users are presented with L items, they are assumed to provide feedback for each of those items.

Several variants of this basic setting have been considered in the bandit literature. The necessity for the user to provide feedback for each item has been called into question in the context of the so-called Cascade Model [8, 14, 6] and its extensions, such as the Dependent Click Model (DCM) [20]. Both models are particularly suited for search contexts, where the user is assumed to be looking for something related to a query. 
Consequently, the learner expects explicit feedback: in the Cascade Model, each valid observation sequence must be either all zeros or terminated by a one, so that no ambiguity is left on the evaluation of the presented items, while multiple clicks are allowed in the DCM, thus leaving some ambiguity on the last zeros of a sequence.

In the Cascade Model, the positions of the items are not taken into account in the reward process because the learner is assumed to obtain a click as long as the interesting item belongs to the list. Indeed, there are even clear indications that the optimal strategy in a learning context consists in showing the most relevant items at the end of the list in order to maximize the amount of observed feedback [14] – which is counter-intuitive in recommendation tasks.

To overcome these limitations, [6] introduces weights – to be defined by the learner – that are attributed to positions in the list, with a click on position l ∈ {1, . . . , L} providing a reward w_l, where the sequence (w_l)_l is decreasing to enforce the ranking behavior. However, no rule is given for setting the weights (w_l)_l that control the order of importance of the positions. The authors propose an algorithm based on KL-UCB [10] and prove a lower bound on the regret as well as an asymptotically optimal upper bound.

∗The two authors contributed equally.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Another way to address the limitations of the Cascade Model is to consider the DCM, as in [20]. Here, examination probabilities v_l are introduced for each position l: conditionally on the event that the user effectively scanned the list up to position l, he/she can choose to leave with probability v_l and, in that case, the learner is aware of his/her departure. 
This framework naturally induces the need to rank the items in the optimal order.

All previous models assume that a portion of the recommendation list is explicitly examined by the user and hence that the learning algorithm eventually has access to rewards corresponding to the unbiased user's evaluation of each item. In contrast, we propose to analyze multiple-play bandits in the Position-based model (PBM) [5]. In the PBM, each position in the list is also endowed with a binary Examination variable [8, 19] which is equal to one only when the user paid attention to the corresponding item. But this variable, which is independent of the user's evaluation of the item, is not observable. It makes it possible to model situations where the user is not explicitly looking for specific content, as in typical recommendation scenarios.

Compared to variants of the Cascade model, the PBM is challenging due to the censoring induced by the examination variables: the learning algorithm observes actual clicks, but non-clicks are always ambiguous. Thus, combining observations made at different positions becomes a non-trivial statistical task. Some preliminary ideas on how to address this issue appear in the supplementary material of [13]. In this work, we provide a complete statistical study of stochastic multiple-play bandits with semi-bandit feedback in the PBM.

We introduce the model and notation in Section 2 and provide the lower bound on the regret in Section 3. In Section 4, we present two optimistic algorithms as well as a theoretical analysis of their regret. In the last section, dedicated to experiments, those policies are compared to several benchmarks on both synthetic and realistic data.

2 Setting and Parameter Estimation

We consider the binary stochastic bandit model with K Bernoulli-distributed arms. The model parameters are the arm expectations θ = (θ_1, θ_2, . . . , θ_K), which lie in Θ = (0, 1)^K. 
We will denote by B(θ) the Bernoulli distribution with parameter θ and by d(p, q) := p log(p/q) + (1 − p) log((1 − p)/(1 − q)) the Kullback-Leibler divergence from B(p) to B(q). At each round t, the learner selects a list of L arms – referred to as an action – chosen among the K arms, which are indexed by k ∈ {1, . . . , K}. The set of actions is denoted by A and thus contains K!/(K − L)! ordered lists; the action selected at time t will be denoted A(t) = (A_1(t), . . . , A_L(t)).

The PBM is characterized by examination parameters (κ_l)_{1≤l≤L}, where κ_l is the probability that the user effectively observes the item in position l [5]. At round t, the selection A(t) is shown to the user and the learner observes the complete feedback – as in semi-bandit models – but the observation at position l, Z_l(t), is censored, being the product of two independent Bernoulli variables Y_l(t) and X_l(t), where Y_l(t) ∼ B(κ_l) is non-null when the user considered the item in position l – which is unknown to the learner – and X_l(t) ∼ B(θ_{A_l(t)}) represents the actual user feedback to the item shown in position l. The learner receives a reward r_{A(t)} = Σ_{l=1}^L Z_l(t), where Z(t) = (X_1(t)Y_1(t), . . . , X_L(t)Y_L(t)) denotes the vector of censored observations at step t.

In the following, we will assume, without loss of generality, that θ_1 > · · · > θ_K and κ_1 > · · · > κ_L > 0, in order to simplify the notation. The fact that the sequences (θ_l)_l and (κ_l)_l are decreasing implies that the optimal list is a∗ = (1, . . . , L). Denoting by R(T) = Σ_{t=1}^T (r_{a∗} − r_{A(t)}) the regret incurred by the learner up to time T, one has

E[R(T)] = Σ_{t=1}^T Σ_{l=1}^L κ_l (θ_{a∗_l} − E[θ_{A_l(t)}]) = Σ_{a∈A} (µ∗ − µ_a) E[N_a(T)] = Σ_{a∈A} ∆_a E[N_a(T)],   (1)

where µ_a = Σ_{l=1}^L κ_l θ_{a_l} is the expected reward of action a, µ∗ = µ_{a∗} is the best possible reward in average, ∆_a = µ∗ − µ_a the expected gap to optimality, and N_a(T) = Σ_{t=1}^T 1{A(t) = a} the number of times action a has been chosen up to time T.

In the following, we assume that the examination parameters (κ_l)_{1≤l≤L} are known to the learner. These can be estimated from historical data [5], using, for instance, the EM algorithm [9] (see also Section 5). In most scenarios, it is realistic to assume that the content (e.g., ads in on-line advertising) changes much more frequently than the layout (the web page design, for instance), making it possible to have a good knowledge of the click-through biases associated with the display positions.

The main statistical challenge associated with the PBM is that one needs to obtain estimates and confidence bounds for the components θ_k of θ from the available B(κ_l θ_k)-distributed draws corresponding to occurrences of arm k at various positions l = 1, . . . , L in the list. To this aim, we define the following statistics: S_{k,l}(t) = Σ_{s=1}^{t−1} Z_l(s) 1{A_l(s) = k}, S_k(t) = Σ_{l=1}^L S_{k,l}(t), N_{k,l}(t) = Σ_{s=1}^{t−1} 1{A_l(s) = k}, and N_k(t) = Σ_{l=1}^L N_{k,l}(t). We further require bias-corrected versions of the counts: Ñ_{k,l}(t) = Σ_{s=1}^{t−1} κ_l 1{A_l(s) = k} and Ñ_k(t) = Σ_{l=1}^L Ñ_{k,l}(t).

At time t, and conditionally on the past actions A(1) up to A(t − 1), the Fisher information for θ_k is given by I(θ_k) = Σ_{l=1}^L N_{k,l}(t) κ_l / (θ_k (1 − κ_l θ_k)) (see Appendix A). We cannot, however, estimate θ_k using the maximum likelihood estimator since it has no closed-form expression. 
Interestingly though, the simple pooled linear estimator

θ̂_k(t) = S_k(t) / Ñ_k(t),   (2)

considered in the supplementary material to [13], is unbiased and has a (conditional) variance of υ(θ_k) = (Σ_{l=1}^L N_{k,l}(t) κ_l θ_k (1 − κ_l θ_k)) / (Σ_{l=1}^L N_{k,l}(t) κ_l)², which is close to optimal given the Cramér-Rao lower bound. Indeed, υ(θ_k) I(θ_k) is recognized as a ratio of a weighted arithmetic mean to the corresponding weighted harmonic mean, which is known to be larger than one, but is upper bounded by 1/(1 − θ_k), irrespective of the values of the κ_l's. Hence, if, for instance, we can assume that all θ_k's are smaller than one half, the loss with respect to the best unbiased estimator is no more than a factor of two for the variance. Note that despite its simplicity, θ̂_k(t) cannot be written as a simple sum of conditionally independent increments divided by the number of terms and will thus require specific concentration results.

It can be checked that when θ_k gets very close to one, θ̂_k(t) is no longer close to optimal. This observation also has a Bayesian counterpart that will be discussed in Section 5. Nevertheless, it is always preferable to the "position-debiased" estimator (Σ_{l=1}^L S_{k,l}(t)/κ_l) / N_k(t), which gets very unreliable as soon as one of the κ_l's gets very small.

3 Lower Bound on the Regret

In this section, we consider the fundamental asymptotic limits of learning performance for online algorithms under the PBM. These cannot be deduced from earlier general results, such as those of [11, 7], due to the censoring in the feedback associated with each action. 
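Before turning to the lower-bound arguments, the censored feedback mechanism and the pooled estimator (2) can be illustrated with a short simulation. In the sketch below, the parameter values, the two alternating display lists, and the horizon are all hypothetical choices for illustration; only NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters (decreasing, as assumed in Section 2)
theta = np.array([0.9, 0.6, 0.4, 0.3, 0.2])  # arm expectations theta_k
kappa = np.array([1.0, 0.6, 0.3])            # examination probabilities kappa_l
K, L, T = len(theta), len(kappa), 200_000

# Alternate between two fixed displays so that each shown arm occupies
# several positions and the estimator really pools across positions.
actions = [np.array([0, 1, 2]), np.array([2, 0, 1])]

S = np.zeros(K)        # S_k(t): total clicks collected on arm k
N_tilde = np.zeros(K)  # bias-corrected counts: sum of kappa_l over draws of arm k

for t in range(T):
    A = actions[t % 2]
    Y = rng.random(L) < kappa      # examination variables (never observed)
    X = rng.random(L) < theta[A]   # user evaluations (seen only when examined)
    Z = X & Y                      # censored feedback, the only observation
    S[A] += Z
    N_tilde[A] += kappa

theta_hat = S[:3] / N_tilde[:3]    # pooled linear estimator (2)
print(theta_hat)                   # close to theta[:3] = [0.9, 0.6, 0.4]
```

Since E[Z_l(t)] = κ_l θ_{A_l(t)}, dividing the pooled click count by the κ-weighted draw count removes the position bias without ever observing the examination variables.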
We detail a simple and general proof scheme – using the results of [12] – that applies to the PBM, as well as to more general models.

Lower bounds on the regret rely on changes of measure: the question is how much we can mistake the true parameters of the problem for others when observing successive arms. With this in mind, we will subscript all expectations and probabilities by the parameter value and indicate explicitly that the quantities µ_a, a∗, µ∗, ∆_a, introduced in Section 2, also depend on the parameter. For ease of notation, we will still assume that θ is such that a∗(θ) = (1, . . . , L).

3.1 Existing results for multiple-play bandit problems

Lower bounds on the regret will be proved for uniformly efficient algorithms, in the sense of [16]:

Definition 1. An algorithm is said to be uniformly efficient if for any bandit model parameterized by θ and for all α ∈ (0, 1], its expected regret after T rounds is such that E_θ R(T) = o(T^α).

For the multiple-play MAB, [2] obtained the following bound:

lim inf_{T→∞} E_θ R(T) / log(T) ≥ Σ_{k=L+1}^K (θ_L − θ_k) / d(θ_k, θ_L).   (3)

For the "learning to rank" problem where rewards follow the weighted Cascade Model with decreasing weights (w_l)_{l=1,...,L}, [6] derived the following bound:

lim inf_{T→∞} E_θ R(T) / log T ≥ w_L Σ_{k=L+1}^K (θ_L − θ_k) / d(θ_k, θ_L).

Perhaps surprisingly, this lower bound does not show any additional term corresponding to the complexity of ranking the L optimal arms. Indeed, the errors are still asymptotically dominated by the need to discriminate irrelevant arms (θ_k)_{k>L} from the worst of the relevant arms, that is, θ_L.

3.2 Lower bound step by step

Step 1: Computing the expected log-likelihood ratio. Denoting by F_{s−1} the σ-algebra generated by the past actions and observations, we define the log-likelihood ratio for the two values θ and λ of the parameters by

ℓ(t) := Σ_{s=1}^t log [ p(Z(s); θ | F_{s−1}) / p(Z(s); λ | F_{s−1}) ].

Lemma 2. For each position l and each item k, define the local amount of information by

I_l(θ_k, λ_k) := E_θ [ log (p(Z_l(t); θ) / p(Z_l(t); λ)) | A_l(t) = k ],   (4)

and its cumulated sum over the L positions by I_a(θ, λ) := Σ_{l=1}^L Σ_{k=1}^K 1{a_l = k} I_l(θ_k, λ_k). The expected log-likelihood ratio is given by

E_θ[ℓ(t)] = Σ_{a∈A} I_a(θ, λ) E_θ[N_a(t)].   (5)

The next proposition is adapted from Theorem 17 in Appendix B of [12] and provides a lower bound on the expected log-likelihood ratio.

Proposition 3. Let B(θ) := {λ ∈ Θ | ∀l ≤ L, θ_l = λ_l and µ∗(θ) < µ∗(λ)} be the set of changes of measure that improve over θ without modifying the optimal arms. Assuming that the expectation of the log-likelihood ratio may be written as in (5), for any uniformly efficient algorithm one has

∀λ ∈ B(θ),  lim inf_{T→∞} (Σ_{a∈A} I_a(θ, λ) E_θ[N_a(T)]) / log(T) ≥ 1.

Step 2: Variational form of the lower bound. We are now ready to obtain the lower bound in a form similar to that originally given by [11].

Theorem 4. 
The expected regret of any uniformly efficient algorithm satisfies

lim inf_{T→∞} E_θ R(T) / log T ≥ f(θ), where f(θ) = inf_{c⪰0} Σ_{a∈A} ∆_a(θ) c_a,  s.t.  inf_{λ∈B(θ)} Σ_{a∈A} I_a(θ, λ) c_a ≥ 1.

Theorem 4 is a straightforward consequence of Proposition 3, combined with the expression of the expected regret given in (1). The vector c ∈ R_+^{|A|}, which satisfies the inequality Σ_{a∈A} I_a(θ, λ) c_a ≥ 1, represents the feasible values of E_θ[N_a(T)] / log(T).

Step 3: Relaxing the constraints. The bounds mentioned in Section 3.1 may be recovered from Theorem 4 by considering only the changes of measure that affect a single suboptimal arm.

Corollary 5.

f(θ) ≥ inf_{c⪰0} Σ_{a∈A} ∆_a(θ) c_a,  s.t.  Σ_{a∈A} Σ_{l=1}^L 1{a_l = k} I_l(θ_k, θ_L) c_a ≥ 1,  ∀k ∈ {L + 1, . . . , K}.

Corollary 5 is obtained by restricting the constraint set B(θ) of Theorem 4 to ∪_{k=L+1}^K B_k(θ), where

B_k(θ) := {λ ∈ Θ | ∀j ≠ k, θ_j = λ_j and µ∗(θ) < µ∗(λ)}.

3.3 Lower bound for the PBM

Theorem 6. For the PBM, the following lower bound holds for any uniformly efficient algorithm:

lim inf_{T→∞} E_θ R(T) / log T ≥ Σ_{k=L+1}^K min_{l∈{1,...,L}} ∆_{v_{k,l}}(θ) / d(κ_l θ_k, κ_l θ_L),   (6)

where v_{k,l} := (1, . . . , l − 1, k, l, . . . , L − 1).

Proof. First, note that for the PBM one has I_l(θ_k, λ_k) = d(κ_l θ_k, κ_l λ_k). 
To get the expression given in Theorem 6 from Corollary 5, we proceed as in [6], showing that the optimal coefficients (c_a)_{a∈A} can be non-zero only for the K − L actions that put the suboptimal arm k in the position l that reaches the minimum of ∆_{v_{k,l}}(θ) / d(κ_l θ_k, κ_l θ_L). Nevertheless, this position does not always coincide with L, the end of the displayed list, contrary to the case of [6] (see Appendix B for details).

The discrete minimization that appears in the r.h.s. of Theorem 6 corresponds to a fundamental trade-off in the PBM. When trying to discriminate a suboptimal arm k from the L optimal ones, it is desirable to put it higher in the list to obtain more information, as d(κ_l θ_k, κ_l θ_L) is an increasing function of κ_l. On the other hand, the gap ∆_{v_{k,l}}(θ) also increases as l gets closer to the top of the list. The fact that d(κ_l θ_k, κ_l θ_L) is not linear in κ_l (it is a strictly convex function of κ_l) renders the trade-off non-trivial. It is easily checked that when (θ_1 − θ_L) is very small, i.e. when all optimal arms are equivalent, the optimal exploratory position is l = 1. In contrast, it is equal to L when the gap (θ_L − θ_{L+1}) becomes very small. Note that by using that, for any suboptimal a ∈ A, ∆_a(θ) ≥ Σ_{k=L+1}^K Σ_{l=1}^L 1{a_l = k} κ_l (θ_L − θ_k), one can lower bound the r.h.s. of Theorem 6 by κ_L Σ_{k=L+1}^K (θ_L − θ_k) / d(κ_L θ_k, κ_L θ_L), which is not tight in general.

Remark 7. In the uncensored version of the PBM – i.e., if the Y_l(t) were observed – the expression of I_a(θ, λ) is simpler: it is equal to Σ_{l=1}^L Σ_{k=1}^K 1{a_l = k} κ_l d(θ_k, λ_k) and leads to a lower bound that coincides with (3). 
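The discrete minimization in Theorem 6 is easy to evaluate numerically. The following sketch computes, for each suboptimal arm k, the ratio ∆_{v_{k,l}}(θ)/d(κ_l θ_k, κ_l θ_L) at every position l and reports the minimizing exploratory position; the θ and κ values are hypothetical, chosen so that the optimal arms are nearly equivalent.

```python
import math

def kl_bern(p, q):
    """d(p, q): KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Hypothetical parameters: near-equivalent optimal arms (theta_1 - theta_L small)
theta = [0.9, 0.85, 0.8, 0.3, 0.1]
kappa = [1.0, 0.7, 0.4]
K, L = len(theta), len(kappa)

def mu(action):
    """Expected reward of an ordered list of (0-indexed) arms."""
    return sum(k_l * theta[a] for k_l, a in zip(kappa, action))

a_star = list(range(L))   # optimal action (arms 1, ..., L)
mu_star = mu(a_star)

bound, best_pos = 0.0, {}
for k in range(L, K):     # suboptimal arms
    ratios = []
    for l in range(L):
        # v_{k,l}: put arm k in position l+1, shift the others, drop arm L
        v = a_star[:l] + [k] + a_star[l:L - 1]
        gap = mu_star - mu(v)   # Delta_{v_{k,l}}(theta)
        ratios.append(gap / kl_bern(kappa[l] * theta[k], kappa[l] * theta[L - 1]))
    best_pos[k] = min(range(L), key=ratios.__getitem__)
    bound += min(ratios)
    print(f"arm {k}: optimal exploratory position {best_pos[k] + 1}")

print("lower-bound constant:", round(bound, 3))  # coefficient of log T in (6)
```

With these values, both suboptimal arms are best explored in position 1, in line with the observation above that near-equivalent optimal arms push the exploratory position to the top of the list.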
The uncensored PBM is actually statistically very close to the weighted Cascade model and can be addressed by algorithms that do not assume knowledge of the (κ_l)_l but only of their ordering.

4 Algorithms

In this section we introduce two algorithms for the PBM. The first one uses the CUCB strategy of [4] and requires a simple upper confidence bound for θ_k based on the estimator θ̂_k(t) defined in (2). The second algorithm is based on the Parsimonious Item Exploration scheme – PIE(L) – proposed in [6] and aims at reaching asymptotically optimal performance. For this second algorithm, termed PBM-PIE, it is also necessary to use a multi-position analog of the well-known KL-UCB index [10] that is inspired by a result of [17]. The analysis of PBM-PIE provided below confirms the relevance of the lower bound derived in Section 3.

PBM-UCB The first algorithm simply consists in sorting optimistic indices in decreasing order and pulling the corresponding first L arms [4]. To derive the expression of the required "exploration bonus", we use an upper confidence bound for θ̂_k(t) based on Hoeffding's inequality:

U_k^{UCB}(t, δ) = S_k(t)/Ñ_k(t) + √(N_k(t)/Ñ_k(t)) √(δ/(2 Ñ_k(t))),

for which a coverage bound is given by the next proposition, proven in Appendix C.

Proposition 8. Let k be any arm in {1, . . . , K}; then, for any δ > 0,

P(U_k^{UCB}(t, δ) ≤ θ_k) ≤ e δ log(t) e^{−δ}.

Following the ideas of [7], it is possible to obtain a logarithmic regret upper bound for this algorithm. The proof is given in Appendix D.

Theorem 9. Let C(κ) = min_{1≤l≤L} [(Σ_{j=1}^L κ_j)²/l + (Σ_{j=1}^l κ_j)²]/κ_L² and ∆ = min_{a∈σ(a∗)\a∗} ∆_a, where σ(a∗) denotes the permutations of the optimal action. Using PBM-UCB with δ = (1 + ε) log(t) for some ε > 0, there exists a constant C_0(ε), independent from the model parameters, such that the regret of PBM-UCB is bounded from above by

E[R(T)] ≤ C_0(ε) + 16 (1 + ε) C(κ) log T Σ_{k∉a∗} ( L/∆ + 1/(κ_L (θ_L − θ_k)) ).

The presence of the term L/∆ in the above expression is attributable to limitations of the mathematical analysis. On the other hand, the absence of the KL-divergence terms appearing in the lower bound (6) is due to the use of an upper confidence bound based on Hoeffding's inequality.

PBM-PIE We adapt the PIE(l) algorithm introduced by [6] for the Cascade Model to the PBM in Algorithm 1 below. At each round, the learner potentially explores at position L with probability 1/2, using the following upper confidence bound for each arm k:

U_k(t, δ) = sup{ q ∈ [θ_k^{min}(t), 1] : Σ_{l=1}^L N_{k,l}(t) d(S_{k,l}(t)/N_{k,l}(t), κ_l q) ≤ δ },   (7)

where θ_k^{min}(t) is the minimum of the convex function Φ : q ↦ Σ_{l=1}^L N_{k,l}(t) d(S_{k,l}(t)/N_{k,l}(t), κ_l q). In the other positions, l = 1, . . . , L − 1, PBM-PIE selects the arms with the largest estimates θ̂_k(t). The resulting algorithm is presented as Algorithm 1 below, denoting by L(t) the L-largest empirical estimates, referred to as the "leaders" at round t.

Algorithm 1 – PBM-PIE
Require: K, L, observation probabilities (κ_l)_l, ε > 0
  Initialization: first K rounds, play each arm at every position
  for t = K + 1, . . . , T do
    Compute θ̂_k(t) for all k
    L(t) ← top-L ordered arms by decreasing θ̂_k(t)
    A_l(t) ← L_l(t) for each position l < L
    B(t) ← {k | k ∉ L(t), U_k(t, (1 + ε) log(T)) ≥ θ̂_{L_L(t)}(t)}
    if B(t) = ∅ then
      A_L(t) ← L_L(t)
    else
      With probability 1/2, select A_L(t) uniformly at random from B(t); else A_L(t) ← L_L(t)
    end if
    Play action A(t) and observe feedback Z(t); update N_{k,l}(t + 1) and S_{k,l}(t + 1)
  end for

The U_k(t, δ) index defined in (7) aggregates observations from all positions – as in PBM-UCB – but makes it possible to build tighter confidence regions, as shown by the next proposition, proved in Appendix E.

Proposition 10. For all δ ≥ L + 1,

P(U_k(t, δ) < θ_k) ≤ e^{L+1} (⌈δ log(t)⌉ δ / L)^L e^{−δ}.

We may now state the main result of this section, which provides an upper bound on the regret of PBM-PIE.

Theorem 11. Using PBM-PIE with δ = (1 + ε) log(t) and ε > 0, for any η < min_k