{"title": "Copeland Dueling Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 307, "page_last": 315, "abstract": "A version of the dueling bandit problem is addressed in which a Condorcet winner may not exist. Two algorithms are proposed that instead seek to minimize regret with respect to the Copeland winner, which, unlike the Condorcet winner, is guaranteed to exist. The first, Copeland Confidence Bound (CCB), is designed for small numbers of arms, while the second, Scalable Copeland Bandits (SCB), works better for large-scale problems. We provide theoretical results bounding the regret accumulated by CCB and SCB, both substantially improving existing results. Such existing results either offer bounds of the form O(K log T) but require restrictive assumptions, or offer bounds of the form O(K^2 log T) without requiring such assumptions. Our results offer the best of both worlds: O(K log T) bounds without restrictive assumptions.", "full_text": "Copeland Dueling Bandits\n\nMasrour Zoghi\n\nInformatics Institute\n\nUniversity of Amsterdam, Netherlands\n\nm.zoghi@uva.nl\n\nShimon Whiteson\n\nDepartment of Computer Science\n\nUniversity of Oxford, UK\n\nshimon.whiteson@cs.ox.ac.uk\n\nZohar Karnin\nYahoo Labs\nNew York, NY\n\nzkarnin@yahoo-inc.com\n\nMaarten de Rijke\nInformatics Institute\n\nUniversity of Amsterdam\n\nderijke@uva.nl\n\nAbstract\n\nA version of the dueling bandit problem is addressed in which a Condorcet winner\nmay not exist. Two algorithms are proposed that instead seek to minimize regret\nwith respect to the Copeland winner, which, unlike the Condorcet winner, is guar-\nanteed to exist. The \ufb01rst, Copeland Con\ufb01dence Bound (CCB), is designed for\nsmall numbers of arms, while the second, Scalable Copeland Bandits (SCB),\nworks better for large-scale problems. We provide theoretical results bounding\nthe regret accumulated by CCB and SCB, both substantially improving existing\nresults. 
Such existing results either offer bounds of the form O(K log T) but require restrictive assumptions, or offer bounds of the form O(K^2 log T) without requiring such assumptions. Our results offer the best of both worlds: O(K log T) bounds without restrictive assumptions.\n\n1 Introduction\nThe dueling bandit problem [1] arises naturally in domains where feedback is more reliable when given as a pairwise preference (e.g., when it is provided by a human) and specifying real-valued feedback instead would be arbitrary or inefficient. Examples include ranker evaluation [2, 3, 4] in information retrieval, ad placement and recommender systems. As with other preference learning problems [5], feedback consists of a pairwise preference between a selected pair of arms, instead of scalar reward for a single selected arm, as in the K-armed bandit problem.\nMost existing algorithms for the dueling bandit problem require the existence of a Condorcet winner, which is an arm that beats every other arm with probability greater than 0.5. If such algorithms are applied when no Condorcet winner exists, no decision may be reached even after many comparisons. This is a key weakness limiting their practical applicability. For example, in industrial ranker evaluation [6], when many rankers must be compared, each comparison corresponds to a costly live experiment and thus the potential for failure if no Condorcet winner exists is unacceptable [7].\nThis risk is not merely theoretical. On the contrary, recent experiments on K-armed dueling bandit problems based on information retrieval datasets show that dueling bandit problems without Condorcet winners arise regularly in practice [8, Figure 1]. In addition, we show in Appendix C.1 in the supplementary material that there are realistic situations in ranker evaluation in information retrieval in which the probability that the Condorcet assumption holds decreases rapidly as the number of arms grows. 
Since the K-armed dueling bandit methods mentioned above do not provide regret bounds in the absence of a Condorcet winner, applying them remains risky in practice. Indeed, we demonstrate empirically the danger of applying such algorithms to dueling bandit problems that do not have a Condorcet winner (cf. Appendix A in the supplementary material).\nThe non-existence of the Condorcet winner has been investigated extensively in social choice theory, where numerous definitions have been proposed, without a clear contender for the most suitable resolution [9]. In the dueling bandit context, a few methods have been proposed to address this issue, e.g., SAVAGE [10], PBR [11] and RankEl [12], which use some of the notions proposed by social choice theorists, such as the Copeland score or the Borda score, to measure the quality of each arm, hence determining what constitutes the best arm (or more generally the top-k arms). In this paper, we focus on finding Copeland winners, which are arms that beat the greatest number of other arms, because the Copeland winner is a natural, conceptually simple extension of the Condorcet winner.\nUnfortunately, the methods mentioned above come with bounds of the form O(K^2 log T). In this paper, we propose two new K-armed dueling bandit algorithms for the Copeland setting with significantly improved bounds.\nThe first algorithm, called Copeland Confidence Bound (CCB), is inspired by the recently proposed Relative Upper Confidence Bound method [13], but modified and extended to address the unique challenges that arise when no Condorcet winner exists. We prove anytime high-probability and expected regret bounds for CCB of the form O(K^2 + K log T). Furthermore, the denominator of this result has much better dependence on the \u201cgaps\u201d arising from the dueling bandit problem than most existing results (cf. 
Sections 3 and 5.1 for the details).\nHowever, a remaining weakness of CCB is the additive O(K^2) term in its regret bounds. In applications with large K, this term can dominate for any experiment of reasonable duration. For example, at Bing, 200 experiments are run concurrently on any given day [14], in which case the duration of the experiment would need to be longer than the age of the universe in nanoseconds before K log T becomes significant in comparison to K^2.\nOur second algorithm, called Scalable Copeland Bandits (SCB), addresses this weakness by eliminating the O(K^2) term, achieving an expected regret bound of the form O(K log K log T). The price of SCB\u2019s tighter regret bounds is that, when two suboptimal arms are close to evenly matched, it may waste comparisons trying to determine which one wins in expectation. By contrast, CCB can identify that this determination is unnecessary, yielding better performance unless there are very many arms. CCB and SCB are thus complementary algorithms for finding Copeland winners.\nOur main contributions are as follows:\n1. We propose two algorithms that address the dueling bandit problem in the absence of a Condorcet winner, one designed for problems with small numbers of arms and the other scaling well with the number of arms.\n2. We provide regret bounds that bridge the gap between two groups of results: those of the form O(K log T) that make the Condorcet assumption, and those of the form O(K^2 log T) that do not make the Condorcet assumption. Our bounds are similar to those of the former but are as broadly applicable as the latter. Furthermore, the result for CCB has substantially better dependence on the gaps than the second group of results.\n3. We include an empirical evaluation of CCB and SCB using a real-life problem arising from information retrieval (IR). The experimental results mirror the theoretical ones.\n\n2 Problem Setting\nLet K \u2265 2. 
The K-armed dueling bandit problem [1] is a modification of the K-armed bandit problem [15]. The latter considers K arms {a1, . . . , aK} and at each time-step, an arm ai can be pulled, generating a reward drawn from an unknown stationary distribution with expected value \u00b5i.\nThe K-armed dueling bandit problem is a variation in which, instead of pulling a single arm, we choose a pair (ai, aj) and receive one of them as the better choice, with the probability of ai being picked equal to an unknown constant pij and that of aj being picked equal to pji = 1 \u2212 pij. A problem instance is fully specified by a preference matrix P = [pij], whose ij entry is equal to pij.\nMost previous work assumes the existence of a Condorcet winner [10]: an arm, which without loss of generality we label a1, such that p1i > 1/2 for all i > 1. In such work, regret is defined relative to the Condorcet winner. However, Condorcet winners do not always exist [8, 13]. In this paper, we consider a formulation of the problem that does not assume the existence of a Condorcet winner.\nInstead, we consider the Copeland dueling bandit problem, which defines regret with respect to a Copeland winner, which is an arm with maximal Copeland score. The Copeland score of ai, denoted Cpld(ai), is the number of arms aj for which pij > 0.5. The normalized Copeland score, denoted cpld(ai), is simply Cpld(ai)/(K \u2212 1). Without loss of generality, we assume that a1, . . . , aC are the Copeland winners, where C is the number of Copeland winners. We define regret as follows:\nDefinition 1. The regret incurred by comparing ai and aj is 2 cpld(a1) \u2212 cpld(ai) \u2212 cpld(aj).\nRemark 2. 
Since our results (see \u00a75) establish bounds on the number of queries to non-Copeland winners, they can also be applied to other notions of regret.\n\n3 Related Work\nNumerous methods have been proposed for the K-armed dueling bandit problem, including Interleaved Filter [1], Beat the Mean [3], Relative Confidence Sampling [8], Relative Upper Confidence Bound (RUCB) [13], Doubler and MultiSBM [16], and mergeRUCB [17], all of which require the existence of a Condorcet winner, and often come with bounds of the form O(K log T). However, as observed in [13] and Appendix C.1, real-world problems do not always have Condorcet winners.\nThere is another group of algorithms that do not assume the existence of a Condorcet winner, but have bounds of the form O(K^2 log T) in the Copeland setting: Sensitivity Analysis of VAriables for Generic Exploration (SAVAGE) [10], Preference-Based Racing (PBR) [11] and Rank Elicitation (RankEl) [12]. All three of these algorithms are designed to solve more general or more difficult problems, and they solve the Copeland dueling bandit problem as a special case.\nThis work bridges the gap between these two groups by providing algorithms that are as broadly applicable as the second group but have regret bounds comparable to those of the first group. Furthermore, in the case of the results for CCB, rather than depending on the smallest gap between arms ai and aj, \u0394min := min_{i>j} |pij \u2212 0.5|, as in the case of many results in the Copeland setting,1 our regret bounds depend on a larger quantity that results in a substantially lower upper bound, cf. \u00a75.1.\nIn addition to the above, bounds have been proven for other notions of winners, including Borda [10, 11, 12], Random Walk [11, 18], and very recently von Neumann [19]. 
The dichotomy discussed\nalso persists in the case of these results, which either rely on restrictive assumptions to obtain a linear\ndependence on K or are more broadly applicable, at the expense of a quadratic dependence on K. A\nnatural question for future work is whether the improvements achieved in this paper in the case of the\nCopeland winner can be obtained in the case of these other notions as well. We refer the interested\nreader to Appendix C.2 for a numerical comparison of these notions of winners in practice. More\ngenerally, there is a proliferation of notions of winners that the \ufb01eld of Social Choice Theory has put\nforth and even though each de\ufb01nition has its merits, it is dif\ufb01cult to argue for any single de\ufb01nition\nto be superior to all others.\nA related setting is that of partial monitoring games [20]. While a dueling bandit problem can be\nmodeled as a partial monitoring problem, doing so yields weaker results. In [21], the authors present\nproblem-dependent bounds from which a regret bound of the form O(K2 log T ) can be deduced for\nthe dueling bandit problem, whereas our work achieves a linear dependence in K.\n4 Method\nWe now present two algorithms that \ufb01nd Copeland winners.\n4.1 Copeland Con\ufb01dence Bound (CCB)\nCCB (see Algorithm 1) is based on the principle of optimism followed by pessimism: it maintains\noptimistic and pessimistic estimates of the preference matrix, i.e., matrices U and L (Line 6). It uses\nU to choose an optimistic Copeland winner ac (Lines 7\u20139 and 11\u201312), i.e., an arm that has some\nchance of being a Copeland winner. Then, it uses L to choose an opponent ad (Line 13), i.e., an arm\ndeemed likely to discredit the hypothesis that ac is indeed a Copeland winner.\nMore precisely, an optimistic estimate of the Copeland score of each arm ai is calculated using U\n(Line 7), and ac is selected from the set of top scorers, with preference given to those in a shortlist, Bt\n(Line 11). 
These are arms that have, roughly speaking, been optimistic winners throughout history. To maintain Bt, as soon as CCB discovers that the optimistic Copeland score of an arm is lower than the pessimistic Copeland score of another arm, it purges the former from Bt (Line 9B).\nThe mechanism for choosing the opponent ad is as follows. The matrices U and L define a confidence interval around pij for each i and j. In relation to ac, there are three types of arms: (1) arms aj s.t. the confidence region of pcj is strictly above 0.5, (2) arms aj s.t. the confidence region of pcj is strictly below 0.5, and (3) arms aj s.t. the confidence region of pcj contains 0.5. Note that an arm of type (1) or (2) at time t0 may become an arm of type (3) at time t > t0 even without queries to the corresponding pair, as the size of the confidence intervals increases as time goes on.\n\n1 Cf. [10, Equation 9 in \u00a74.1.1] and [11, Theorem 1].\n\nAlgorithm 1 Copeland Confidence Bound\nInput: A Copeland dueling bandit problem and an exploration parameter \u03b1 > 1/2.\n1: W = [wij] \u2190 0_{K\u00d7K} // 2D array of wins: wij is the number of times ai beat aj\n2: B1 = {a1, . . . , aK} // potential best arms\n3: B^i_1 = \u2205 for each i = 1, . . . , K // potential to beat ai\n4: LC = K // estimated max losses of a Copeland winner\n5: for t = 1, 2, . . . do\n6: U := [uij] = W/(W + W^T) + sqrt(\u03b1 ln t/(W + W^T)) and L := [lij] = W/(W + W^T) \u2212 sqrt(\u03b1 ln t/(W + W^T)) (entrywise), with uii = lii = 1/2 for all i\n7: Cpld_U(ai) = #{k | uik \u2265 1/2, k \u2260 i} and Cpld_L(ai) = #{k | lik \u2265 1/2, k \u2260 i} // optimistic and pessimistic Copeland scores\n8: Ct = {ai | Cpld_U(ai) = maxj Cpld_U(aj)}\n9: Set Bt \u2190 B_{t\u22121} and B^i_t \u2190 B^i_{t\u22121} and update as follows:\nA. Reset disproven hypotheses: If for any i and aj \u2208 B^i_t we have lij > 0.5, reset Bt, LC and B^k_t for all k (i.e., set them to their original values as in Lines 2\u20134 above).\nB. Remove non-Copeland winners: For each ai \u2208 Bt, if Cpld_U(ai) < Cpld_L(aj) holds for any j, set Bt \u2190 Bt \\ {ai}, and if |B^i_t| \u2260 LC + 1, then set B^i_t \u2190 {ak | uik < 0.5}. However, if Bt = \u2205, reset Bt, LC and B^k_t for all k.\nC. Add Copeland winners: For any ai \u2208 Ct with Cpld_U(ai) = Cpld_L(ai), set Bt \u2190 Bt \u222a {ai}, B^i_t \u2190 \u2205 and LC \u2190 K \u2212 1 \u2212 Cpld_U(ai). For each j \u2260 i, if |B^j_t| < LC + 1, set B^j_t \u2190 \u2205, and if |B^j_t| > LC + 1, randomly choose LC + 1 elements of B^j_t and remove the rest.\n10: With probability 1/4, sample (c, d) uniformly from the set {(i, j) | aj \u2208 B^i_t and 0.5 \u2208 [lij, uij]} (if it is non-empty) and skip to Line 14.\n11: If Bt \u2229 Ct \u2260 \u2205, then with probability 2/3, set Ct \u2190 Bt \u2229 Ct.\n12: Sample ac from Ct uniformly at random.\n13: With probability 1/2, choose the set B to be either B^c_t or {a1, . . . , aK} and then set d \u2190 arg max_{j \u2208 B, ljc \u2264 0.5} ujc. If there is a tie, d is not allowed to be equal to c.\n14: Compare arms ac and ad and increment wcd or wdc depending on which arm wins.\n15: end for\n\nCCB always chooses ad from arms of type (3) because comparing ac and a type (3) arm is most informative about the Copeland score of ac. Among arms of type (3), CCB favors those that have confidently beaten arm ac in the past (Line 13), i.e., arms that in some round t0 < t were of type (2). Such arms are maintained in a shortlist of \u201cformidable\u201d opponents (B^i_t) that are likely to confirm that ai is not a Copeland winner; these arms are favored when selecting ad (Lines 10 and 13).\nThe sets B^i_t are what speed up the elimination of non-Copeland winners, enabling regret bounds that scale asymptotically with K rather than K^2. Specifically, for a non-Copeland winner ai, the set B^i_t will eventually contain LC + 1 strong opponents for ai (Line 9C), where LC is the number of losses of each Copeland winner. Since LC is typically small (cf. 
Appendix C.3), asymptotically this leads to a bound of only O(log T) on the number of time-steps when ai is chosen as an optimistic Copeland winner, instead of a bound of O(K log T), which a more naive algorithm would produce.\n4.2 Scalable Copeland Bandits (SCB)\nSCB is designed to handle dueling bandit problems with large numbers of arms. It is based on an arm-identification algorithm, described in Algorithm 2, designed for a PAC setting, i.e., it finds an \u03b5-Copeland winner with probability 1 \u2212 \u03b4, although we are primarily interested in the case with \u03b5 = 0. Algorithm 2 relies on a reduction to a K-armed bandit problem where we have direct access to a noisy version of the Copeland score; the process of estimating the score of arm ai consists of comparing ai to a random arm aj until it becomes clear which arm beats the other. The sample complexity bound, which yields the regret bound, is achieved by combining a bound for K-armed bandits and a bound on the number of arms that can have a high Copeland score.\n\nAlgorithm 2 Approximate Copeland Bandit Solver\nInput: A Copeland dueling bandit problem with preference matrix P = [pij], failure probability \u03b4 > 0, and approximation parameter \u03b5 \u2265 0. Also, define [K] := {1, . . . , K}.\n1: Define a random variable reward(i) for i \u2208 [K] as the following procedure: pick a uniformly random j \u2260 i from [K]; query the pair (ai, aj) sufficiently many times in order to determine w.p. at least 1 \u2212 \u03b4/K^2 whether pij > 1/2; return 1 if pij > 0.5 and 0 otherwise.\n2: Invoke Algorithm 4, where in each of its calls to reward(i), the feedback is determined by the above stochastic process.\nReturn: The same output returned by Algorithm 4.\n\nAlgorithm 2 calls a K-armed bandit algorithm as a subroutine. 
To this end, we use the KL-based arm-elimination algorithm, a slight modification of Algorithm 2 in [22]: it implements an elimination tournament with confidence regions based on the KL-divergence between probability distributions. The interested reader can find the pseudo-code in Algorithm 4 contained in Appendix J.\nCombining this with the squaring trick, a modification of the doubling trick that reduces the number of partitions from log T to log log T, the SCB algorithm, described in Algorithm 3, repeatedly calls Algorithm 2 but force-terminates if an increasing threshold is reached. If it terminates early, then the identified arm is played against itself until the threshold is reached.\n\nAlgorithm 3 Scalable Copeland Bandits\nInput: A Copeland dueling bandit problem with preference matrix P = [pij]\n1: for all r = 1, 2, . . . do\n2: Set T = 2^{2^r} and run Algorithm 2 with failure probability log(T)/T in order to find an exact Copeland winner (\u03b5 = 0); force-terminate if it requires more than T queries.\n3: Let T0 be the number of queries used by invoking Algorithm 2, and let ai be the arm produced by it; query the pair (ai, ai) T \u2212 T0 times.\n4: end for\n\n5 Theoretical Results\nIn this section, we present regret bounds for both CCB and SCB. Assuming that the number of Copeland winners and the number of losses of each Copeland winner are bounded,2 CCB\u2019s regret bound takes the form O(K^2 + K log T), while SCB\u2019s is of the form O(K log K log T). Note that these bounds are not directly comparable. When there are relatively few arms, CCB is expected to perform better. By contrast, when there are many arms, SCB is expected to be superior. 
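The squaring-trick schedule that drives Algorithm 3 can be sketched in a few lines; this is a minimal illustration, not the paper's implementation, and `scb_outer_loop` and the `solver` callable (a stand-in for Algorithm 2) are hypothetical names introduced here:

```python
import math

def scb_outer_loop(solver, num_rounds=5):
    """Sketch of SCB's outer loop: run the PAC solver (Algorithm 2 in the
    paper) with horizons T = 2^(2^r), so only log log T epochs are needed
    to reach a total horizon of T. `solver(T, delta) -> (arm, queries_used)`
    is assumed to force-terminate after at most T queries."""
    epochs = []
    for r in range(1, num_rounds + 1):
        T = 2 ** (2 ** r)            # squaring trick: T = 2^(2^r)
        delta = math.log(T) / T      # failure probability log(T)/T
        arm, used = solver(T, delta)
        leftover = T - used          # play (arm, arm) for the remainder
        epochs.append((T, arm, leftover))
    return epochs

# Toy stand-in solver: always returns arm 0 after at most 10 queries.
result = scb_outer_loop(lambda T, delta: (0, min(10, T)))
print([T for T, _, _ in result])  # horizons 4, 16, 256, 65536, 4294967296
```

Compared with the usual doubling trick (T = 2^r), squaring the horizon each epoch keeps the overhead of restarting Algorithm 2 down to log log T restarts.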
Appendix A in the supplementary material provides empirical evidence to support these expectations.\nThroughout this section we impose the following condition on the preference matrix:\n\nA. There are no ties, i.e., for all pairs (ai, aj) with i \u2260 j, we have pij \u2260 0.5.\n\nThis assumption is not very restrictive in practice. For example, in the ranker evaluation setting from information retrieval, each arm corresponds to a ranker, a complex and highly engineered system, so it is unlikely that two rankers are indistinguishable. Furthermore, some of the results we present in this section actually hold under even weaker assumptions. However, for the sake of clarity, we defer a discussion of these nuanced differences to Appendix F in the supplementary material.\n5.1 Copeland Confidence Bound (CCB)\nIn this section, we provide a rough outline of our argument for the bound on the regret accumulated by Algorithm 1. For a more detailed argument, the interested reader is referred to Appendix E.\nConsider a K-armed Copeland bandit problem with arms a1, . . . , aK and preference matrix P = [pij], such that arms a1, . . . , aC are the Copeland winners, with C being the number of Copeland winners. Moreover, we define LC to be the number of arms to which a Copeland winner loses in expectation.\nUsing this notation, our expected regret bound for CCB takes the form\n\nO( (K^2 + (C + LC) K ln T) / \u0394^2 ).    (1)\n\nHere, \u0394 is a notion of gap defined in Appendix E, which is an improvement upon the smallest gap between any pair of arms.\nThis result is proven in two steps. First, we bound the number of comparisons involving non-Copeland winners, yielding a result of the form O(K^2 ln T). 
Second, Theorem 3 closes the gap between this bound and the one in (1) by showing that, beyond a certain time horizon, CCB selects non-Copeland winning arms as the optimistic Copeland winner very infrequently.\n\n2 See Appendix C.3 in the supplementary material for experimental evidence that this is the case in practice.\n\nTheorem 3. Given a Copeland bandit problem satisfying Assumption A and any \u03b4 > 0 and \u03b1 > 0.5, there exist constants A(1)_\u03b4 and A(2)_\u03b4 such that, with probability 1 \u2212 \u03b4, the regret accumulated by CCB is bounded by the following:\n\nA(1)_\u03b4 + A(2)_\u03b4 sqrt(ln T) + (2K(C + LC + 1)/\u0394^2) ln T.\n\nUsing the high-probability regret bound given in Theorem 3, we can deduce the expected regret result claimed in (1) for \u03b1 > 1, as a corollary obtained by integrating \u03b4 over the interval [0, 1].\n5.2 Scalable Copeland Bandits\nWe now turn to our regret result for SCB, which lowers the K^2 dependence in the additive constant of CCB\u2019s regret result to K log K. We begin by defining the relevant quantities:\nDefinition 4. Given a K-armed Copeland bandit problem and an arm ai, we define the following:\n1. Recall that cpld(ai) := Cpld(ai)/(K \u2212 1) is called the normalized Copeland score.\n2. ai is an \u03b5-Copeland-winner if 1 \u2212 cpld(ai) \u2264 (1 \u2212 cpld(a1)) (1 + \u03b5).\n3. \u0394i := max{cpld(a1) \u2212 cpld(ai), 1/(K \u2212 1)} and Hi := \u03a3_{j\u2260i} 1/\u0394ij^2, with H1 := maxi Hi.\n4. \u03b5_i := max{\u0394i, \u03b5 (1 \u2212 cpld(a1))}.\n\nWe now state our main scalability result:\nTheorem 5. 
Given a Copeland bandit problem satisfying Assumption A, the expected regret of SCB (Algorithm 3) is bounded by\n\nO( (1/K) \u03a3_{i=1}^{K} Hi (1 \u2212 cpld(ai)) / \u0394i^2 ) log(T),\n\nwhich in turn can be bounded by O( K (LC + log K) log T / \u0394min^2 ), where LC and \u0394min are as in Definition 10.\nRecall that SCB is based on Algorithm 2, an arm-identification algorithm that identifies a Copeland winner with high probability. As a result, Theorem 5 is an immediate corollary of Lemma 6, obtained by using the well-known squaring trick. As mentioned in Section 4.2, the squaring trick is a minor variation on the doubling trick that reduces the number of partitions from log T to log log T.\nLemma 6 is a result for finding an \u03b5-approximate Copeland winner (see Definition 4.2). Note that, for the regret setting, we are only interested in the special case with \u03b5 = 0, i.e., the problem of identifying the best arm.\nLemma 6. With probability 1 \u2212 \u03b4, Algorithm 2 finds an \u03b5-approximate Copeland winner by time\n\nO( (1/K) \u03a3_{i=1}^{K} Hi (1 \u2212 cpld(ai)) / (\u03b5_i)^2 ) log(1/\u03b4) \u2264 O( H1 (log(K) + min{\u03b5^{-2}, LC}) ) log(1/\u03b4),\n\nassuming3 \u03b4 = (K H1)^{-\u03a9(1)}. In particular, when there is a Condorcet winner (cpld(a1) = 1, LC = 0) or, more generally, cpld(a1) = 1 \u2212 O(1/K) and LC = O(1), an exact solution is found with probability at least 1 \u2212 \u03b4 by using an expected number of queries of at most O(H1 (LC + log K)) log(1/\u03b4).\nIn the remainder of this section, we sketch the main ideas underlying the proof of Lemma 6, detailed in Appendix I in the supplementary material. We first treat the simpler deterministic setting in which a single query suffices to determine which of a pair of arms beats the other. While a solution can easily be obtained using K(K \u2212 1)/2 many queries, we aim for one with query complexity linear in K. The main ingredients of the proof are as follows:\n1. cpld(ai) is the mean of a Bernoulli random variable defined as such: sample uniformly at random
an index j from the set {1, . . . , K} \\ {i} and return 1 if ai beats aj and 0 otherwise.\n2. Applying a KL-divergence based arm-elimination algorithm (Algorithm 4) to the K-armed bandit arising from the above observation, we obtain a bound by dividing the arms into two groups: those with Copeland scores close to that of the Copeland winners, and the rest. For the former, we use the result from Lemma 7 to bound the number of such arms; for the latter, the resulting regret is dealt with using Lemma 8, which exploits the possible distribution of Copeland scores.\n\n3 The exact expression requires replacing log(1/\u03b4) with log(KH1/\u03b4).\n\nFigure 1: Small-scale regret results for a 5-armed Copeland dueling bandit problem arising from ranker evaluation.\nLet us state the two key lemmas here:\nLemma 7. Let D \u2282 {a1, . . . , aK} be the set of arms for which cpld(ai) \u2265 1 \u2212 d/(K \u2212 1), that is, arms that are beaten by at most d arms. Then |D| \u2264 2d + 1.\nProof. Consider a fully connected directed graph whose node set is D and in which the arc (ai, aj) is in the graph if arm ai beats arm aj. By the definition of cpld, the in-degree of any node i is upper bounded by d. Therefore, the total number of arcs in the graph is at most |D| d. Now, the full connectivity of the graph implies that the total number of arcs in the graph is exactly |D| (|D| \u2212 1)/2. Thus, |D| (|D| \u2212 1)/2 \u2264 |D| d and the claim follows.\nLemma 8. The sum \u03a3_{i | cpld(ai) < 1} 1/(1 \u2212 cpld(ai)) is in O(K log K).\nProof. Follows from Lemma 7 via a careful partitioning of arms. 
Details are in Appendix I.\nGiven the structure of Algorithm 2, the stochastic case is similar to the deterministic case for the following reason: while the latter requires a single comparison between arms ai and aj to determine which arm beats the other, in the stochastic case we need roughly log(K log(1/\u0394ij)/\u03b4)/\u0394ij^2 comparisons between the two arms to correctly answer the same question with probability at least 1 \u2212 \u03b4/K^2.\n6 Experiments\nTo evaluate our methods CCB and SCB, we apply them to a Copeland dueling bandit problem arising from ranker evaluation in the field of information retrieval (IR) [23].\nWe follow the experimental approach in [3, 13] and use a preference matrix to simulate comparisons between each pair of arms (ai, aj) by drawing samples from Bernoulli random variables with mean pij. We compare our proposed algorithms against the state-of-the-art K-armed dueling bandit algorithms RUCB [13] and the Copeland variants of SAVAGE, PBR and RankEl. We include RUCB in order to verify our claim that K-armed dueling bandit algorithms that assume the existence of a Condorcet winner incur linear regret if applied to a Copeland dueling bandit problem without a Condorcet winner.\nMore specifically, we consider a 5-armed dueling bandit problem obtained from comparing five rankers, none of which beats the other four, i.e., there is no Condorcet winner. Due to lack of space, the details of the experimental setup have been included in Appendix B4. Figure 1 shows the regret accumulated by CCB, SCB, the Copeland variants of SAVAGE, PBR and RankEl, and RUCB on this problem. The horizontal time axis uses a log scale, while the vertical axis, which measures cumulative regret, uses a linear scale. 
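The simulation setup just described (Bernoulli comparisons drawn from a preference matrix) together with the Copeland regret of Definition 1 can be sketched as follows; the 3-arm cyclic preference matrix below is an illustrative stand-in chosen for brevity, not the actual 5-ranker matrix used in the experiments:

```python
import random

# Illustrative 3-arm preference matrix with a cycle (no Condorcet winner):
# arm 0 beats arm 1, arm 1 beats arm 2, and arm 2 beats arm 0.
P = [[0.5, 0.6, 0.4],
     [0.4, 0.5, 0.7],
     [0.6, 0.3, 0.5]]
K = len(P)

def cpld(i):
    """Normalized Copeland score: fraction of the other arms that a_i beats."""
    return sum(P[i][j] > 0.5 for j in range(K) if j != i) / (K - 1)

def regret(i, j):
    """Regret of Definition 1 for comparing a_i against a_j."""
    best = max(cpld(k) for k in range(K))
    return 2 * best - cpld(i) - cpld(j)

def duel(i, j, rng=random):
    """Simulate one comparison: a_i wins with probability P[i][j]."""
    return i if rng.random() < P[i][j] else j

# In this toy cycle every arm beats exactly one other arm, so all three
# are Copeland winners and any comparison incurs zero regret.
print([cpld(i) for i in range(K)])  # [0.5, 0.5, 0.5]
print(regret(0, 1))                 # 0.0
```

In a problem with a unique Copeland winner, the same `regret` function is positive whenever either queried arm has a sub-maximal Copeland score, which is what the cumulative-regret curves in Figure 1 accumulate.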
CCB outperforms all other algorithms in this 5-armed experiment.\nNote that three of the baseline algorithms under consideration here (i.e., SAVAGE, PBR and RankEl) require the horizon of the experiment as an input, either directly or through a failure probability \u03b4, which we set to 1/T (with T being the horizon), in order to obtain a finite-horizon regret algorithm, as prescribed in [3, 10]. Therefore, we ran independent experiments with varying horizons and recorded the accumulated regret: the markers on the curves corresponding to these algorithms represent these numbers. Consequently, the regret curves are not monotonically increasing. For instance, SAVAGE\u2019s cumulative regret at time 2 \u00d7 10^7 is lower than at time 10^7 because the runs that produced the former number were not continuations of those that resulted in the latter, but rather completely independent. Furthermore, RUCB\u2019s cumulative regret grows linearly, which is why the plot does not contain the entire curve.\n\n4 Sample code and the preference matrices used in the experiments can be found at http://bit.ly/nips15data.\n\nAppendix A contains further experimental results, including those of our scalability experiment.\n\n7 Conclusion\nIn many applications that involve learning from human behavior, feedback is more reliable when provided in the form of pairwise preferences. In the dueling bandit problem, the goal is to use such pairwise feedback to find the most desirable choice from a set of options. Most existing work in this area assumes the existence of a Condorcet winner, i.e., an arm that beats all other arms with probability greater than 0.5. 
Even though these results have the advantage that the bounds they\nprovide scale linearly in the number of arms, their main drawback is that in practice the Condorcet\nassumption is too restrictive. By contrast, other results that do not impose the Condorcet assumption\nachieve bounds that scale quadratically in the number of arms.\nIn this paper, we set out to solve a natural generalization of the problem, where instead of assuming\nthe existence of a Condorcet winner, we seek to \ufb01nd a Copeland winner, which is guaranteed to\nexist. We proposed two algorithms to address this problem: one for small numbers of arms, called\nCCB, and a more scalable one, called SCB, that works better for problems with large numbers of\narms. We provided theoretical results bounding the regret accumulated by each algorithm: these\nresults improve substantially over existing results in the literature, by \ufb01lling the gap that exists in the\ncurrent results, namely the discrepancy between results that make the Condorcet assumption and are\nof the form O(K log T ) and the more general results that are of the form O(K2 log T ).\nMoreover, we have included in the supplementary material empirical results on both a dueling bandit\nproblem arising from a real-life application domain and a large-scale synthetic problem used to test\nthe scalability of SCB. The results of these experiments show that CCB beats all existing Copeland\ndueling bandit algorithms, while SCB outperforms CCB on the large-scale problem.\nOne open question raised by our work is how to devise an algorithm that has the bene\ufb01ts of both\nCCB and SCB, i.e., the scalability of the latter together with the former\u2019s better dependence on the\ngaps. At this point, it is not clear to us how this could be achieved. 
Another interesting direction for future work is an extension of both CCB and SCB to problems with a continuous set of arms. Given the prevalence of cyclical preference relationships in practice, we hypothesize that the non-existence of a Condorcet winner is an even greater issue when dealing with an infinite number of arms. Given that both our algorithms utilize confidence bounds to make their choices, we anticipate that continuous-armed UCB-style algorithms like those proposed in [24, 25, 26, 27, 28, 29, 30] can be combined with our ideas to produce a solution to the continuous-armed Copeland bandit problem that does not rely on the convexity assumptions made by algorithms such as the one proposed in [31]. Finally, it is also interesting to expand our results to handle scores other than the Copeland score, such as an ε-insensitive variant of the Copeland score (as in [12]), or completely different notions of winners, such as the Borda, Random Walk or von Neumann winners (see, e.g., [32, 19]).

Acknowledgments

We would like to thank Nir Ailon and Ulle Endriss for helpful discussions. This research was supported by Amsterdam Data Science, the Dutch national program COMMIT, Elsevier, the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement nr 312827 (VOX-Pol), the ESF Research Network Program ELIAS, the Royal Dutch Academy of Sciences (KNAW) under the Elite Network Shifts project, the Microsoft Research Ph.D. program, the Netherlands eScience Center under project number 027.012.105, the Netherlands Institute for Sound and Vision, the Netherlands Organisation for Scientific Research (NWO) under project nrs 727.011.005, 612.001.116, HOR-11-10, 640.006.013, 612.066.930, CI-14-25, SH-322-15, the Yahoo! Faculty Research and Engagement Program, and Yandex.
All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

References

[1] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5), 2012.
[2] T. Joachims. Optimizing search engines using clickthrough data. In KDD, 2002.
[3] Y. Yue and T. Joachims. Beat the mean bandit. In ICML, 2011.
[4] K. Hofmann, S. Whiteson, and M. de Rijke. Balancing exploration and exploitation in listwise and pairwise online learning to rank for information retrieval. Information Retrieval, 16, 2013.
[5] J. Fürnkranz and E. Hüllermeier, editors. Preference Learning. Springer-Verlag, 2010.
[6] A. Schuth, F. Sietsma, S. Whiteson, D. Lefortier, and M. de Rijke. Multileaved comparisons for fast online evaluation. In CIKM, 2014.
[7] L. Li, J. Kim, and I. Zitouni. Toward predicting the outcome of an A/B experiment for search relevance. In WSDM, 2015.
[8] M. Zoghi, S. Whiteson, M. de Rijke, and R. Munos. Relative confidence sampling for efficient on-line ranker evaluation. In WSDM, 2014.
[9] M. Schulze. A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare, 36(2):267-303, 2011.
[10] T. Urvoy, F. Clerot, R. Féraud, and S. Naamane. Generic exploration and k-armed voting bandits. In ICML, 2013.
[11] R. Busa-Fekete, B. Szörényi, P. Weng, W. Cheng, and E. Hüllermeier. Top-k selection based on adaptive sampling of noisy preferences. In ICML, 2013.
[12] R. Busa-Fekete, B. Szörényi, and E. Hüllermeier. PAC rank elicitation through adaptive sampling of stochastic pairwise preferences. In AAAI, 2014.
[13] M. Zoghi, S. Whiteson, R. Munos, and M. de Rijke. Relative upper confidence bound for the K-armed dueling bandits problem. In ICML, 2014.
[14] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, and N. Pohlmann. Online controlled experiments at large scale. In KDD, 2013.
[15] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285-294, 1933.
[16] N. Ailon, Z. Karnin, and T. Joachims. Reducing dueling bandits to cardinal bandits. In ICML, 2014.
[17] M. Zoghi, S. Whiteson, and M. de Rijke. MergeRUCB: A method for large-scale online ranker evaluation. In WSDM, 2015.
[18] S. Negahban, S. Oh, and D. Shah. Iterative ranking from pair-wise comparisons. In NIPS, 2012.
[19] M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi. Contextual dueling bandits. In COLT, 2015.
[20] A. Piccolboni and C. Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In COLT, 2001.
[21] G. Bartók, N. Zolghadr, and C. Szepesvári. An adaptive algorithm for finite stochastic partial monitoring. In ICML, 2012.
[22] O. Cappé, A. Garivier, O. Maillard, R. Munos, G. Stoltz, et al. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3), 2013.
[23] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
[24] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In STOC, 2008.
[25] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. JMLR, 12, 2011.
[26] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.
[27] R. Munos. Optimistic optimization of a deterministic function without the knowledge of its smoothness. In NIPS, 2011.
[28] A. D. Bull. Convergence rates of efficient global optimization algorithms. JMLR, 12, 2011.
[29] N. de Freitas, A. Smola, and M. Zoghi. Exponential regret bounds for Gaussian process bandits with deterministic observations. In ICML, 2012.
[30] M. Valko, A. Carpentier, and R. Munos. Stochastic simultaneous optimistic optimization. In ICML, 2013.
[31] Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML, 2009.
[32] A. Altman and M. Tennenholtz. Axiomatic foundations for ranking systems. JAIR, 2008.