{"title": "TopRank: A practical algorithm for online stochastic ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 3945, "page_last": 3954, "abstract": "Online learning to rank is a sequential decision-making problem where in each round the learning agent chooses a list of items and receives feedback in the form of clicks from the user. Many sample-efficient algorithms have been proposed for this problem that assume a specific click model connecting rankings and user behavior. We propose a generalized click model that encompasses many existing models, including the position-based and cascade models. Our generalization motivates a novel online learning algorithm based on topological sort, which we call TopRank. TopRank is (a) more natural than existing algorithms, (b) has stronger regret guarantees than existing algorithms with comparable generality, (c) has a more insightful proof that leaves the door open to many generalizations, (d) outperforms existing algorithms empirically.", "full_text": "TopRank: A Practical Algorithm for Online\n\nStochastic Ranking\n\nTor Lattimore\n\nDeepMind\n\nBranislav Kveton\n\nGoogle\n\nShuai Li\n\nThe Chinese University of Hong Kong\n\nCsaba Szepesv\u00e1ri\n\nDeepMind and University of Alberta\n\nAbstract\n\nOnline learning to rank is a sequential decision-making problem where in each\nround the learning agent chooses a list of items and receives feedback in the form\nof clicks from the user. Many sample-ef\ufb01cient algorithms have been proposed\nfor this problem that assume a speci\ufb01c click model connecting rankings and user\nbehavior. We propose a generalized click model that encompasses many existing\nmodels, including the position-based and cascade models. Our generalization\nmotivates a novel online learning algorithm based on topological sort, which we\ncall TopRank. 
TopRank is (a) more natural than existing algorithms, (b) has\nstronger regret guarantees than existing algorithms with comparable generality, (c)\nhas a more insightful proof that leaves the door open to many generalizations, and\n(d) outperforms existing algorithms empirically.\n\n1\n\nIntroduction\n\nLearning to rank is an important problem with numerous applications in web search and recommender\nsystems [11]. Broadly speaking, the goal is to learn an ordered list of K items from a larger collection\nof size L that maximizes the satisfaction of the user, often conditioned on a query. This problem has\ntraditionally been studied in the of\ufb02ine setting, where the ranking policy is learned from manually-\nannotated relevance judgments. It has been observed that the feedback of users can be used to\nsigni\ufb01cantly improve existing ranking policies [1, 16]. This is the main motivation for online learning\nto rank, where the goal is to adaptively maximize the user satisfaction.\nNumerous methods have been proposed for online learning to rank, both in the adversarial [12, 13]\nand stochastic settings. Our focus is on the stochastic setup where recent work has leveraged click\nmodels to mitigate the curse of dimensionality that arises from the combinatorial nature of the\naction-set. A click model is a model for how users click on items in rankings and is widely studied by\nthe information retrieval community [2]. One popular click model in learning to rank is the cascade\nmodel (CM), which assumes that the user scans the ranking from top to bottom, clicking on the \ufb01rst\nitem they \ufb01nd attractive [6, 3, 7, 18, 10, 5]. Another model is the position-based model (PBM), where\nthe probability that the user clicks on an item depends on its position and attractiveness, but not on\nthe surrounding items [8].\nThe cascade and position-based models have relatively few parameters, which is both a blessing and\na curse. 
On the positive side, a small model is easy to learn. More negatively, there is a danger that a\nsimplistic model will have a large approximation error. In fact, it has been observed experimentally\nthat no single existing click model captures the behavior of an entire population of users [4]. Zoghi\net al. [17] recently showed that under reasonable assumptions a single online learning algorithm can\nlearn the optimal list of items in a much larger class of click models that includes both the cascade\nand position-based models.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWe build on the work of Zoghi et al. [17] and generalize it non-trivially in multiple directions. First,\nwe propose a general model of user interaction where the problem of finding the most attractive list\ncan be posed as a sorting problem with noisy feedback. An interesting characteristic of our model\nis that the click probability does not factor into the examination probability of the position and the\nattractiveness of the item at that position. Second, we propose an online learning algorithm for finding\nthe most attractive list, which we call TopRank. The key idea in the design of the algorithm is to\nmaintain a partial order over the items that is refined as the algorithm observes more data. The new\nalgorithm is simpler and more principled than the algorithm of Zoghi et al. [17], and it outperforms\nthat algorithm empirically. We also provide an analysis of the cumulative regret of TopRank that is simple\nand insightful and that strengthens the results of Zoghi et al. [17], despite the weaker assumptions.\n\n2 Online learning to rank\n\nWe assume the total number of items L is larger than the number of available slots K and that the\ncollection of items is [L] = {1, 2, . . . , L}. A permutation on a finite set X is an invertible function\n\u03c3 : X \u2192 X and the set of all permutations on X is denoted by \u03a0(X). 
The set of actions A is the set\nof permutations \u03a0([L]), where for each a \u2208 A the value a(k) should be interpreted as the identity\nof the item placed at the kth position. Equivalently, item i is placed at position a^{-1}(i). The user\ndoes not observe items in positions k > K so the order of a(K + 1), . . . , a(L) is not important and is\nincluded only for notational convenience. We adopt the convention throughout that i and j represent\nitems while k represents a position.\nThe online ranking problem proceeds over n rounds. In each round t the learner chooses an action\nAt \u2208 A based on its observations so far and observes binary random variables Ct1, . . . , CtL where\nCti = 1 if the user clicked on item i. We assume a stochastic model where the probability that the\nuser clicks on position k in round t only depends on At and is given by\n\n$$P(C_{t A_t(k)} = 1 \\mid A_t = a) = v(a, k)$$\n\nwith v : A \u00d7 [L] \u2192 [0, 1] an unknown function. Another way of writing this is that the conditional\nprobability that the user clicks on item i in round t is P(Cti = 1 | At = a) = v(a, a^{-1}(i)).\nThe performance of the learner is measured by the expected cumulative regret, which is the deficit\nsuffered by the learner relative to the omniscient strategy that knows the optimal ranking in advance:\n\n$$R_n = n \\max_{a \\in A} \\sum_{k=1}^{K} v(a, k) - E\\left[ \\sum_{t=1}^{n} \\sum_{i=1}^{L} C_{ti} \\right] = \\max_{a \\in A} E\\left[ \\sum_{t=1}^{n} \\sum_{k=1}^{K} \\left( v(a, k) - v(A_t, k) \\right) \\right] .$$\n\nRemark 1. We do not assume that Ct1, . . . , CtL are independent or that the user can only click on\none item.\n\n3 Modeling assumptions\n\nIn previous work on online learning to rank it was assumed that v factors into v(a, k) =\n\u03b1(a(k))\u03c7(a, k) where \u03b1 : [L] \u2192 [0, 1] is the attractiveness function and \u03c7(a, k) is the probability\nthat the user examines position k given ranking a. 
Further restrictions are made on the examination\nfunction \u03c7. For example, in the document-based model it is assumed that \u03c7(a, k) = 1{k \u2264 K}. In\nthis work we depart from this standard by making assumptions directly on v. The assumptions are\nsufficiently relaxed that the model subsumes the document-based, position-based and cascade models,\nas well as the factored model studied by Zoghi et al. [17]. See the supplementary material for a proof\nof this. Our first assumption uncontroversially states that the user does not click on items they cannot\nsee.\nAssumption 1. v(a, k) = 0 for all k > K.\n\nAlgorithm 1 TopRank\n1: G1 \u2190 \u2205 and c \u2190 4 \u221a(2/\u03c0) / erf(\u221a2)\n2: for t = 1, . . . , n do\n3:   d \u2190 0\n4:   while [L] \\ \u222a_{c=1}^{d} Ptc \u2260 \u2205 do\n5:     d \u2190 d + 1\n6:     Ptd \u2190 minGt([L] \\ \u222a_{c=1}^{d\u22121} Ptc)\n7:   Choose At uniformly at random from A(Pt1, . . . , Ptd)\n8:   Observe click indicators Cti \u2208 {0, 1} for all i \u2208 [L]\n9:   for all (i, j) \u2208 [L]^2 do\n10:     Utij \u2190 Cti \u2212 Ctj if i, j \u2208 Ptd for some d, and Utij \u2190 0 otherwise\n11:     Stij \u2190 \u2211_{s=1}^{t} Usij and Ntij \u2190 \u2211_{s=1}^{t} |Usij|\n12:   Gt+1 \u2190 Gt \u222a {(j, i) : Stij \u2265 \u221a(2 Ntij log(c \u221a(Ntij) / \u03b4)) and Ntij > 0}\n\nAlthough we do not assume an explicit factorization of the click probability into attractiveness and\nexamination functions, we do assume there exists an unknown attractiveness function \u03b1 : [L] \u2192 [0, 1]\nthat satisfies the following assumptions. In all classical click models the optimal ranking is to sort the\nitems in order of decreasing attractiveness. Rather than deriving this from other assumptions, we will\nsimply assume that v satisfies this criterion. 
We call action a optimal if \u03b1(a(k)) = max_{k' \u2265 k} \u03b1(a(k'))\nfor all k \u2208 [K]. The optimal action need not be unique if \u03b1 is not injective, but the sequence\n\u03b1(a(1)), . . . , \u03b1(a(K)) is the same for all optimal actions.\n\nAssumption 2. Let a\u2217 \u2208 A be an optimal action. Then\n\n$$\\max_{a \\in A} \\sum_{k=1}^{K} v(a, k) = \\sum_{k=1}^{K} v(a^*, k) .$$\n\nThe next assumption asserts that if a is an action and i is more\nattractive than j, then exchanging the positions of i and j can\nonly decrease the likelihood of clicking on the item in slot a^{-1}(i).\nFig. 1 illustrates the two cases. The probability of clicking on\nthe second position is larger in a than in a'. On the other hand,\nthe probability of clicking on the fourth position is larger in a'\nthan in a. The assumption is actually slightly stronger than this\nbecause it also specifies a lower bound on the amount by which\none probability is larger than another in terms of the attractiveness\nfunction.\n\nFigure 1: The probability of clicking on the second position is larger in a than a'. The pattern reverses for the fourth position.\n\nAssumption 3. Let i and j be items with \u03b1(i) \u2265 \u03b1(j) and let \u03c3 : A \u2192 A be the permutation that\nexchanges i and j and leaves other items unchanged. Then for any action a \u2208 A,\n\n$$v(a, a^{-1}(i)) \\geq \\frac{\\alpha(i)}{\\alpha(j)} \\, v(\\sigma \\circ a, a^{-1}(i)) .$$\n\nOur final assumption is that for any action a with \u03b1(a(k)) = \u03b1(a\u2217(k)) the probability of clicking on\nthe kth position is at least as high as the probability of clicking on the kth position for the optimal\naction. This assumption makes sense if the user is scanning the items from the first position until\nthe last, clicking on items they find attractive until some level of satisfaction is reached. 
Under this\nassumption the user is least likely to examine position k under the optimal ranking.\nAssumption 4. For any action a and optimal action a\u2217 with \u03b1(a(k)) = \u03b1(a\u2217(k)) it holds that\nv(a, k) \u2265 v(a\u2217, k).\n\n4 Algorithm\nBefore we present our algorithm, we introduce some basic notation. Given a relation G \u2286 [L]^2 and\nX \u2286 [L], let minG(X) = {i \u2208 X : (i, j) \u2209 G for all j \u2208 X}. When X is nonempty and G does not\nhave cycles, then minG(X) is nonempty. Let P1, . . . , Pd be a partition of [L] so that \u222a_{c \u2264 d} Pc = [L]\nand Pc \u2229 Pc' = \u2205 for any c \u2260 c'. We refer to each subset in the partition, Pc for c \u2264 d, as a block.\nLet A(P1, . . . , Pd) be the set of actions a where the items in P1 are placed at the first |P1| positions,\nthe items in P2 are placed at the next |P2| positions, and so on. Specifically,\n\n$$A(P_1, \\ldots, P_d) = \\{ a \\in A : \\max_{i \\in P_c} a^{-1}(i) \\leq \\min_{i \\in P_{c+1}} a^{-1}(i) \\text{ for all } c \\in [d - 1] \\} .$$\n\nOur algorithm is presented in Algorithm 1. We call it TopRank, because it maintains a topological\norder of items in each round. The order is represented by relation Gt, where G1 = \u2205. In each round,\nTopRank computes a partition of [L] by iteratively peeling off minimum items according to Gt. Then\nit randomizes items in each block of the partition and maintains statistics on the relative number of\nclicks between pairs of items in the same block. A pair of items (j, i) is added to the relation once\nitem i receives sufficiently more clicks than item j during rounds where the items are in the same\nblock. The reader should interpret (j, i) \u2208 Gt as meaning that TopRank collected enough evidence\nup to round t to conclude that \u03b1(j) < \u03b1(i).\nRemark 2. 
The astute reader will notice that the algorithm is not well defined if Gt contains cycles.\nThe analysis works by proving that this occurs with low probability and the behavior of the algorithm\nmay be defined arbitrarily whenever a cycle is encountered. Assumption 1 means that items in\nposition k > K are never clicked. As a consequence, the algorithm never needs to actually compute\nthe blocks Ptd where min Itd > K because items in these blocks are never shown to the user.\nShortly we give an illustration of the algorithm, but first introduce the notation to be used in the\nanalysis. Let Itd be the slots of the ranking where items in Ptd are placed,\n\n$$I_{td} = [\\,|\\cup_{c \\leq d} P_{tc}|\\,] \\setminus [\\,|\\cup_{c < d} P_{tc}|\\,] .$$\n\nFurthermore, let Dti be the block with item i, so that i \u2208 P_{tD_{ti}}. Let Mt = max_{i \u2208 [L]} Dti be the\nnumber of blocks in the partition in round t.\n\nIllustration. Suppose L = 5 and K = 4 and in round\nt the relation is Gt = {(3, 1), (5, 2), (5, 3)}. This indicates\nthe algorithm has collected enough data to believe\nthat item 3 is less attractive than item 1 and that item\n5 is less attractive than items 2 and 3. The relation is\ndepicted in Fig. 2 where an arrow from j to i means\nthat (j, i) \u2208 Gt. In round t the first three positions in\nthe ranking will contain items from Pt1 = {1, 2, 4},\nbut with random order. The fourth position will be item\n3 and item 5 is not shown to the user. Note that Mt = 3\nhere and Dt2 = 1 and Dt5 = 3.\n\nFigure 2: Illustration of partition produced by topological sort. The blocks are Pt1 = {1, 2, 4}, Pt2 = {3} and Pt3 = {5}, which occupy slots It1 = {1, 2, 3}, It2 = {4} and It3 = {5}.\n\nRemark 3. TopRank is not an elimination algorithm. In the scenario described above, item 5 is not\nshown to the user, but it could happen that later (4, 2) and (4, 3) are added to the relation and then\nTopRank will start randomizing between items 4 and 5 for the fourth position.\n\n5 Regret analysis\nTheorem 1. 
Let function v satisfy Assumptions 1\u20134 and \u03b1(1) > \u03b1(2) > \u00b7\u00b7\u00b7 > \u03b1(L). Let \u2206ij =\n\u03b1(i) \u2212 \u03b1(j) and \u03b4 \u2208 (0, 1). Then the n-step regret of TopRank is bounded from above as\n\n$$R_n \\leq \\delta n K L^2 + \\sum_{j=1}^{L} \\sum_{i=1}^{\\min\\{K, j-1\\}} \\left( 1 + \\frac{6 (\\alpha(i) + \\alpha(j)) \\log\\left( \\frac{c \\sqrt{n}}{\\delta} \\right)}{\\Delta_{ij}} \\right) .$$\n\nFurthermore,\n\n$$R_n \\leq \\delta n K L^2 + K L + \\sqrt{4 K^3 L n \\log\\left( \\frac{c \\sqrt{n}}{\\delta} \\right)} .$$\n\nBy choosing \u03b4 = n^{-1} the theorem shows that the expected regret is at most\n\n$$R_n = O\\left( \\sum_{j=1}^{L} \\sum_{i=1}^{\\min\\{K, j-1\\}} \\frac{\\alpha(i) \\log(n)}{\\Delta_{ij}} \\right) \\quad \\text{and} \\quad R_n = O\\left( \\sqrt{K^3 L n \\log(n)} \\right) .$$\n\nThe algorithm does not make use of any assumed ordering on \u03b1(\u00b7), so the assumption is only used\nto allow for a simple expression for the regret. The only algorithm that operates under comparably\ngeneral assumptions is BatchRank, for which the problem-dependent regret is a factor of K^2 worse and\nthe dependence on the suboptimality gap is replaced by a dependence on the minimal suboptimality\ngap.\nThe core idea of the proof is to show that (a) if the algorithm is suffering regret as a consequence\nof misplacing an item, then it is gaining information about the relation of the items so that Gt\nwill gain elements and (b) once Gt is sufficiently rich the algorithm is playing optimally. Let\nFt = \u03c3(A1, C1, . . . , At, Ct) and Pt(\u00b7) = P(\u00b7 | Ft) and Et[\u00b7] = E [\u00b7 | Ft]. 
For each t \u2208 [n] let Ft\nbe the failure event that there exist i \u2260 j \u2208 [L] and s < t such that Nsij > 0 and\n\n$$\\left| S_{sij} - \\sum_{u=1}^{s} E_{u-1}[U_{uij} \\mid U_{uij} \\neq 0] \\, |U_{uij}| \\right| \\geq \\sqrt{2 N_{sij} \\log\\left( \\frac{c \\sqrt{N_{sij}}}{\\delta} \\right)} .$$\n\nLemma 1. Let i and j satisfy \u03b1(i) \u2265 \u03b1(j) and d \u2265 1. On the event that i, j \u2208 Psd and d \u2208 [Ms]\nand Usij \u2260 0, the following hold almost surely:\n\n(a) $E_{s-1}[U_{sij} \\mid U_{sij} \\neq 0] \\geq \\frac{\\Delta_{ij}}{\\alpha(i) + \\alpha(j)}$\n\n(b) $E_{s-1}[U_{sji} \\mid U_{sji} \\neq 0] \\leq 0$ .\n\nProof. For the remainder of the proof we focus on the event that i, j \u2208 Psd and d \u2208 [Ms] and\nUsij \u2260 0. We also discard the measure zero subset of this event where Ps\u22121(Usij \u2260 0) = 0.\nFrom now on we omit the \u2018almost surely\u2019 qualification on conditional expectations. Under these\ncircumstances the definition of conditional expectation shows that\n\n$$\\begin{aligned} E_{s-1}[U_{sij} \\mid U_{sij} \\neq 0] &= \\frac{P_{s-1}(C_{si} = 1, C_{sj} = 0) - P_{s-1}(C_{si} = 0, C_{sj} = 1)}{P_{s-1}(C_{si} \\neq C_{sj})} \\\\ &= \\frac{P_{s-1}(C_{si} = 1) - P_{s-1}(C_{sj} = 1)}{P_{s-1}(C_{si} \\neq C_{sj})} \\\\ &\\geq \\frac{P_{s-1}(C_{si} = 1) - P_{s-1}(C_{sj} = 1)}{P_{s-1}(C_{si} = 1) + P_{s-1}(C_{sj} = 1)} \\\\ &= \\frac{E_{s-1}[v(A_s, A_s^{-1}(i)) - v(A_s, A_s^{-1}(j))]}{E_{s-1}[v(A_s, A_s^{-1}(i)) + v(A_s, A_s^{-1}(j))]} , \\end{aligned} \\tag{1}$$\n\nwhere in the second equality we added and subtracted Ps\u22121(Csi = 1, Csj = 1). By the design of\nTopRank, the items in Ptd are placed into slots Itd uniformly at random. Let \u03c3 be the permutation\nthat exchanges the positions of items i and j. 
Then using Assumption 3,\n\n$$\\begin{aligned} E_{s-1}[v(A_s, A_s^{-1}(i))] &= \\sum_{a \\in A} P_{s-1}(A_s = a) \\, v(a, a^{-1}(i)) \\geq \\frac{\\alpha(i)}{\\alpha(j)} \\sum_{a \\in A} P_{s-1}(A_s = a) \\, v(\\sigma \\circ a, a^{-1}(i)) \\\\ &= \\frac{\\alpha(i)}{\\alpha(j)} \\sum_{a \\in A} P_{s-1}(A_s = \\sigma \\circ a) \\, v(\\sigma \\circ a, (\\sigma \\circ a)^{-1}(j)) = \\frac{\\alpha(i)}{\\alpha(j)} E_{s-1}[v(A_s, A_s^{-1}(j))] , \\end{aligned}$$\n\nwhere the second equality follows from the fact that a^{-1}(i) = (\u03c3 \u25e6 a)^{-1}(j) and the definition of the\nalgorithm ensuring that Ps\u22121(As = a) = Ps\u22121(As = \u03c3 \u25e6 a). The last equality follows from the fact\nthat \u03c3 is a bijection. Using this and continuing the calculation in Eq. (1) shows that\n\n$$\\frac{E_{s-1}[v(A_s, A_s^{-1}(i)) - v(A_s, A_s^{-1}(j))]}{E_{s-1}[v(A_s, A_s^{-1}(i)) + v(A_s, A_s^{-1}(j))]} = 1 - \\frac{2}{1 + E_{s-1}[v(A_s, A_s^{-1}(i))] / E_{s-1}[v(A_s, A_s^{-1}(j))]} \\geq 1 - \\frac{2}{1 + \\alpha(i)/\\alpha(j)} = \\frac{\\alpha(i) - \\alpha(j)}{\\alpha(i) + \\alpha(j)} = \\frac{\\Delta_{ij}}{\\alpha(i) + \\alpha(j)} .$$\n\nThe second part follows from the first since Usji = \u2212Usij.\n\nThe next lemma shows that the failure event occurs with low probability.\nLemma 2. It holds that P(Fn) \u2264 \u03b4L^2.\n\nProof. The proof follows immediately from Lemma 1, the definition of Fn, the union bound over all\npairs of actions, and a modification of the Azuma-Hoeffding inequality in Lemma 6.\n\nLemma 3. On the event F^c_t it holds that (i, j) \u2209 Gt for all i < j.\n\nProof. Let i < j so that \u03b1(i) \u2265 \u03b1(j). On the event F^c_t either Nsji = 0 or\n\n$$S_{sji} - \\sum_{u=1}^{s} E_{u-1}[U_{uji} \\mid U_{uji} \\neq 0] \\, |U_{uji}| < \\sqrt{2 N_{sji} \\log\\left( \\frac{c \\sqrt{N_{sji}}}{\\delta} \\right)} \\quad \\text{for all } s < t .$$\n\nWhen i and j are in different blocks in round u < t, then Uuji = 0 by definition. On the other hand,\nwhen i and j are in the same block, Eu\u22121[Uuji | Uuji \u2260 0] \u2264 0 almost surely by Lemma 1. Based\non these observations,\n\n$$S_{sji} < \\sqrt{2 N_{sji} \\log\\left( \\frac{c \\sqrt{N_{sji}}}{\\delta} \\right)} \\quad \\text{for all } s < t ,$$\n\nwhich by the design of TopRank implies that (i, j) \u2209 Gt.\nLemma 4. Let I\u2217td = min Ptd be the most attractive item in Ptd. Then on event F^c_t, it holds that\nI\u2217td \u2264 1 + \u2211_{c<d} |Ptc| for all d \u2208 [Mt].\n\nProof. Let i\u2217 = min \u222a_{c \u2265 d} Ptc. Then i\u2217 \u2264 1 + \u2211_{c<d} |Ptc| holds trivially for any Pt1, . . . , PtMt and\nd \u2208 [Mt]. Now consider two cases. Suppose that i\u2217 \u2208 Ptd. Then it must be true that i\u2217 = I\u2217td and\nour claim holds. On the other hand, suppose that i\u2217 \u2208 Ptc for some c > d. Then by Lemma 3 and the\ndesign of the partition, there must exist a sequence of items id, . . . , ic in blocks Ptd, . . . , Ptc such\nthat id < \u00b7\u00b7\u00b7 < ic = i\u2217. From the definition of I\u2217td, I\u2217td \u2264 id < i\u2217. This concludes our proof.\n\nLemma 5. On the event F^c_n and for all i < j it holds that\n\n$$S_{nij} \\leq 1 + \\frac{6 (\\alpha(i) + \\alpha(j))}{\\Delta_{ij}} \\log\\left( \\frac{c \\sqrt{n}}{\\delta} \\right) .$$\n\nProof. The result is trivial when Nnij = 0. Assume from now on that Nnij > 0. 
By the definition\nof the algorithm arms i and j are not in the same block once Stij grows too large relative to Ntij,\nwhich means that\n\n$$S_{nij} \\leq 1 + \\sqrt{2 N_{nij} \\log\\left( \\frac{c \\sqrt{N_{nij}}}{\\delta} \\right)} .$$\n\nOn the event F^c_n and part (a) of Lemma 1 it also follows that\n\n$$S_{nij} \\geq \\frac{\\Delta_{ij} N_{nij}}{\\alpha(i) + \\alpha(j)} - \\sqrt{2 N_{nij} \\log\\left( \\frac{c \\sqrt{N_{nij}}}{\\delta} \\right)} .$$\n\nCombining the previous two displays shows that\n\n$$\\frac{\\Delta_{ij} N_{nij}}{\\alpha(i) + \\alpha(j)} - \\sqrt{2 N_{nij} \\log\\left( \\frac{c \\sqrt{N_{nij}}}{\\delta} \\right)} \\leq S_{nij} \\leq 1 + \\sqrt{2 N_{nij} \\log\\left( \\frac{c \\sqrt{N_{nij}}}{\\delta} \\right)} \\leq (1 + \\sqrt{2}) \\sqrt{N_{nij} \\log\\left( \\frac{c \\sqrt{N_{nij}}}{\\delta} \\right)} . \\tag{2}$$\n\nUsing the fact that Nnij \u2264 n and rearranging the terms in the previous display shows that\n\n$$N_{nij} \\leq \\frac{(1 + 2\\sqrt{2})^2 (\\alpha(i) + \\alpha(j))^2}{\\Delta_{ij}^2} \\log\\left( \\frac{c \\sqrt{n}}{\\delta} \\right) .$$\n\nThe result is completed by substituting this into Eq. (2).\n\nProof of Theorem 1. The first step in the proof is an upper bound on the expected number of clicks\nin the optimal list a\u2217. Fix time t, block Ptd, and recall that I\u2217td = min Ptd is the most attractive\nitem in Ptd. Let k = At^{-1}(I\u2217td) be the position of item I\u2217td and \u03c3 be the permutation that exchanges\nitems k and I\u2217td. By Lemma 4, I\u2217td \u2264 k; and then from Assumptions 3 and 4, we have that\nv(At, k) \u2265 v(\u03c3 \u25e6 At, k) \u2265 v(a\u2217, k). 
Based on this result, the expected number of clicks on I\u2217td is\nbounded from below by those on items in a\u2217,\n\n$$\\begin{aligned} E_{t-1}[C_{t I^*_{td}}] &= \\sum_{k \\in I_{td}} P_{t-1}(A_t^{-1}(I^*_{td}) = k) \\, E_{t-1}[v(A_t, k) \\mid A_t^{-1}(I^*_{td}) = k] \\\\ &= \\frac{1}{|I_{td}|} \\sum_{k \\in I_{td}} E_{t-1}[v(A_t, k) \\mid A_t^{-1}(I^*_{td}) = k] \\geq \\frac{1}{|I_{td}|} \\sum_{k \\in I_{td}} v(a^*, k) , \\end{aligned}$$\n\nwhere we also used the fact that TopRank randomizes within each block to guarantee that\nPt\u22121(At^{-1}(I\u2217td) = k) = 1/|Itd| for any k \u2208 Itd. Using this and the design of TopRank,\n\n$$\\sum_{k=1}^{K} v(a^*, k) = \\sum_{d=1}^{M_t} \\sum_{k \\in I_{td}} v(a^*, k) \\leq \\sum_{d=1}^{M_t} |I_{td}| \\, E_{t-1}[C_{t I^*_{td}}] .$$\n\nTherefore, under event F^c_t, the conditional expected regret in round t is bounded by\n\n$$\\begin{aligned} \\sum_{k=1}^{K} v(a^*, k) - E_{t-1}\\left[ \\sum_{j=1}^{L} C_{tj} \\right] &\\leq E_{t-1}\\left[ \\sum_{d=1}^{M_t} |P_{td}| \\, C_{t I^*_{td}} - \\sum_{j=1}^{L} C_{tj} \\right] = E_{t-1}\\left[ \\sum_{d=1}^{M_t} \\sum_{j \\in P_{td}} (C_{t I^*_{td}} - C_{tj}) \\right] \\\\ &= \\sum_{d=1}^{M_t} \\sum_{j \\in P_{td}} E_{t-1}[U_{t I^*_{td} j}] \\leq \\sum_{j=1}^{L} \\sum_{i=1}^{\\min\\{K, j-1\\}} E_{t-1}[U_{tij}] . \\end{aligned} \\tag{3}$$\n\nThe last inequality follows by noting that $E_{t-1}[U_{t I^*_{td} j}] \\leq \\sum_{i=1}^{\\min\\{K, j-1\\}} E_{t-1}[U_{tij}]$. To see this use\npart (a) of Lemma 1 to show that Et\u22121[Utij] \u2265 0 for i < j and Lemma 4 to show that when I\u2217td > K,\nthen neither I\u2217td nor j is shown to the user in round t so that $U_{t I^*_{td} j} = 0$. Substituting the bound\nin Eq. (3) into the regret leads to\n\n$$R_n \\leq n K P(F_n) + \\sum_{j=1}^{L} \\sum_{i=1}^{\\min\\{K, j-1\\}} E[1\\{F_n^c\\} S_{nij}] , \\tag{4}$$\n\nwhere we used the fact that the maximum number of clicks over n rounds is nK. The proof of the\nfirst part is completed by using Lemma 2 to bound the first term and Lemma 5 to bound the second.\nThe problem independent bound follows from Eq. (4) and by stopping early in the proof of Lemma 5.\nThe details are given in the supplementary material.\n\nLemma 6. Let (Ft)_{t=0}^{n} be a filtration and X1, X2, . . . , Xn be a sequence of Ft-adapted random\nvariables with Xt \u2208 {\u22121, 0, 1} and \u00b5t = E[Xt | Ft\u22121, Xt \u2260 0]. Then with St = \u2211_{s=1}^{t} (Xs \u2212\n\u00b5s|Xs|) and Nt = \u2211_{s=1}^{t} |Xs|,\n\n$$P\\left( \\text{exists } t \\leq n : |S_t| \\geq \\sqrt{2 N_t \\log\\left( \\frac{c \\sqrt{N_t}}{\\delta} \\right)} \\text{ and } N_t > 0 \\right) \\leq \\delta , \\quad \\text{where } c = \\frac{4 \\sqrt{2/\\pi}}{\\mathrm{erf}(\\sqrt{2})} \\approx 3.43 .$$\n\nSee the supplementary material for the proof.\nWe also provide a minimax lower bound, the proof of which is deferred to the supplementary material.\nTheorem 2. Suppose that L = N K with N an integer and n \u2265 K and n \u2265 N and N \u2265 8. Then\nfor any algorithm there exists a ranking problem such that E[Rn] \u2265 \u221a(KLn) / (16 \u221a2).\n\nThe proof of this result only makes use of ranking problems in the document-based model. This also\ncorresponds to a lower bound for m-sets in online linear optimization with semi-bandit feedback.\nDespite the simple setup and abundant literature, we are not aware of any work where a lower bound\nof this form is presented for this unstructured setting.\n\n6 Experiments\n\nWe experiment with the Yandex dataset [15], a dataset of 167 million search queries. In each query,\nthe user is shown 10 documents at positions 1 to 10 and the search engine records the clicks of the\nuser. 
We select 60 frequent search queries from this dataset, and learn their CMs and PBMs using\nPyClick [2]. The parameters of the models are learned by maximizing the likelihood of observed\nclicks. Our goal is to rerank L = 10 most attractive items with the objective of maximizing the\nexpected number of clicks at the \ufb01rst K = 5 positions. This is the same experimental setup as in\nZoghi et al. [17]. This is a realistic scenario where the learning agent can only rerank highly attractive\nitems that are suggested by some production ranker [16].\nTopRank is compared to BatchRank [17] and CascadeKL-UCB [6]. We used the implementation of\nBatchRank by Zoghi et al. [17]. We do not compare to ranked bandits [12], because they have already\nbeen shown to perform poorly in stochastic click models, for instance by Zoghi et al. [17] and Katariya\net al. [5]. The parameter \u03b4 in TopRank is set as \u03b4 = 1/n, as suggested in Theorem 1. Fig. 3 illustrates\n\nFigure 3: The n-step regret of TopRank (red), CascadeKL-UCB (blue), and BatchRank (gray) in three problems.\nThe results are averaged over 10 runs. The error bars are the standard errors of our regret estimates.\n\nFigure 4: The n-step regret of TopRank (red), CascadeKL-UCB (blue), and BatchRank (gray) in two click\nmodels. The results are averaged over 60 queries and 10 runs per query. The error bars are the standard errors of\nour regret estimates.\n\nthe general trend on speci\ufb01c queries. In the cascade model, CascadeKL-UCB outperforms TopRank.\nThis should not come as a surprise because CascadeKL-UCB heavily exploits the knowledge of the\nmodel. Despite being a more general algorithm, TopRank consistently outperforms BatchRank\nin the cascade model. In the position-based model, CascadeKL-UCB learns very good policies in\nabout two thirds of queries, but suffers linear regret for the rest. In many of these queries, TopRank\noutperforms CascadeKL-UCB in as few as one million steps. 
In the position-based model, TopRank\ntypically outperforms BatchRank.\nThe average regret over all queries is reported in Fig. 4. We observe similar trends to those in Fig. 3.\nIn the cascade model, the regret of CascadeKL-UCB is about three times lower than that of TopRank,\nwhich is about three times lower than that of BatchRank. In the position-based model, the regret\nof CascadeKL-UCB is higher than that of TopRank after 4 million steps. The regret of TopRank is\nabout 30% lower than that of BatchRank. In summary, we observe that TopRank improves over\nBatchRank in both the cascade and position-based models. The worse performance of TopRank\nrelative to CascadeKL-UCB in the cascade model is offset by its robustness to multiple click models.\n\n[Figure 3 panels: CM on query 101524, PBM on query 101524, PBM on query 28658. Figure 4 panels: CM, PBM. Horizontal axis: Step n; vertical axis: Regret.]\n\n7 Conclusions\n\nWe introduced a new click model for online ranking that subsumes previous models. Despite\nthe increased generality, the new algorithm enjoys stronger regret guarantees, an easier and more\ninsightful proof and improved empirical performance. We hope the simplifications can inspire even\nmore interest in online ranking. We also proved a lower bound for combinatorial linear semi-bandits\nwith m-sets that improves on the bound by Uchiya et al. [14]. We do not currently have matching\nupper and lower bounds. The key to understanding minimax lower bounds is to identify what\nmakes a problem hard. In many bandit models there is limited flexibility, but our assumptions are\nso weak that the space of all v satisfying Assumptions 1\u20134 is quite large and we do not yet know\nwhat is the hardest case. 
This dif\ufb01culty is perhaps even greater if the objective is to prove instance-\ndependent or asymptotic bounds where the results usually depend on solving a regret/information\noptimization problem [9]. Ranking becomes increasingly dif\ufb01cult as the number of items grows. In\nmost cases where L is large, however, one would expect the items to be structured and this should\nthen be exploited. This has been done for the cascade model by assuming a linear structure [18, 10].\nInvestigating this possibility, but with more relaxed assumptions seems like an interesting future\ndirection.\n\nReferences\n[1] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user\nbehavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference,\npages 19\u201326, 2006.\n\n[2] A. Chuklin, I. Markov, and M. de Rijke. Click Models for Web Search. Morgan & Claypool\n\nPublishers, 2015.\n\n[3] R. Combes, S. Magureanu, A. Proutiere, and C. Laroche. Learning to rank: Regret lower\nbounds and ef\ufb01cient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International\nConference on Measurement and Modeling of Computer Systems, 2015.\n\n[4] A. Grotov, A. Chuklin, I. Markov, L. Stout, F. Xumara, and M. de Rijke. A comparative study\nof click models for web search. In Proceedings of the 6th International Conference of the CLEF\nAssociation, 2015.\n\n[5] S. Katariya, B. Kveton, Cs. Szepesv\u00e1ri, and Z. Wen. DCM bandits: Learning to rank with\nmultiple clicks. In Proceedings of the 33rd International Conference on Machine Learning,\npages 1215\u20131224, 2016.\n\n[6] B. Kveton, Cs. Szepesv\u00e1ri, Z. Wen, and A. Ashkan. Cascading bandits: Learning to rank in the\ncascade model. In Proceedings of the 32nd International Conference on Machine Learning,\n2015.\n\n[7] B. Kveton, Z. Wen, A. Ashkan, and Cs. Szepesv\u00e1ri. Combinatorial cascading bandits. 
In Advances in Neural Information Processing Systems 28, pages 1450\u20131458, 2015.\n\n[8] P. Lagree, C. Vernade, and O. Cappe. Multiple-play bandits in the position-based model. In\nAdvances in Neural Information Processing Systems 29, pages 1597\u20131605, 2016.\n\n[9] T. Lattimore and Cs. Szepesv\u00e1ri. The End of Optimism? An Asymptotic Analysis of Finite-Armed\nLinear Bandits. In A. Singh and J. Zhu, editors, Proceedings of the 20th International\nConference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine\nLearning Research, pages 728\u2013737, Fort Lauderdale, FL, USA, 20\u201322 Apr 2017. PMLR.\n\n[10] S. Li, B. Wang, S. Zhang, and W. Chen. Contextual combinatorial cascading bandits. In\nProceedings of the 33rd International Conference on Machine Learning, pages 1245\u20131253,\n2016.\n\n[11] T. Liu. Learning to Rank for Information Retrieval. Springer, 2011.\n\n[12] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed\nbandits. In Proceedings of the 25th International Conference on Machine Learning, pages\n784\u2013791, 2008.\n\n[13] A. Slivkins, F. Radlinski, and S. Gollapudi. Ranked bandits in metric spaces: Learning diverse\nrankings over large document collections. Journal of Machine Learning Research, 14(1):\n399\u2013436, 2013.\n\n[14] T. Uchiya, A. Nakamura, and M. Kudo. Algorithms for adversarial bandit\nproblems with multiple plays. In Proceedings of the 21st International Conference on Algorithmic\nLearning Theory, ALT\u201910, pages 375\u2013389, Berlin, Heidelberg, 2010. Springer-Verlag.\nISBN 3-642-16107-3.\n\n[15] Yandex. Yandex personalized web search challenge.\nhttps://www.kaggle.com/c/yandex-personalized-web-search-challenge, 2013.\n\n[16] M. Zoghi, T. Tunys, L. Li, D. Jose, J. Chen, C. Ming Chin, and M. de Rijke. Click-based hot\nfixes for underperforming torso queries.
In Proceedings of the 39th International ACM SIGIR\nConference on Research and Development in Information Retrieval, pages 195\u2013204, 2016.\n\n[17] M. Zoghi, T. Tunys, M. Ghavamzadeh, B. Kveton, Cs. Szepesv\u00e1ri, and Z. Wen. Online learning\nto rank in stochastic click models. In Proceedings of the 34th International Conference on\nMachine Learning, pages 4199\u20134208, 2017.\n\n[18] S. Zong, H. Ni, K. Sung, N. Rosemary Ke, Z. Wen, and B. Kveton. Cascading bandits for\nlarge-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty\nin Artificial Intelligence, 2016.\n", "award": [], "sourceid": 1947, "authors": [{"given_name": "Tor", "family_name": "Lattimore", "institution": "DeepMind"}, {"given_name": "Branislav", "family_name": "Kveton", "institution": "Google"}, {"given_name": "Shuai", "family_name": "Li", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Csaba", "family_name": "Szepesvari", "institution": "University of Alberta"}]}