{"title": "Verification Based Solution for Structured MAB Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 145, "page_last": 153, "abstract": "We consider the problem of finding the best arm in a stochastic Multi-armed Bandit (MAB) game and propose a general framework based on verification that applies to multiple well-motivated generalizations of the classic MAB problem. In these generalizations, additional structure is known in advance, causing the task of verifying the optimality of a candidate to be easier than discovering the best arm. Our results are focused on the scenario where the failure probability $\\delta$ must be very low; we essentially show that in this high confidence regime, identifying the best arm is as easy as the task of verification. We demonstrate the effectiveness of our framework by applying it, and improving the state-of-the-art results in the problems of: Linear bandits, Dueling bandits with the Condorcet assumption, Copeland dueling bandits, Unimodal bandits and Graphical bandits.", "full_text": "Verification Based Solution for Structured MAB Problems\n\nZohar Karnin\nYahoo Research\nNew York, NY 10036\nzkarnin@ymail.com\n\nAbstract\n\nWe consider the problem of finding the best arm in a stochastic Multi-armed\nBandit (MAB) game and propose a general framework based on verification that\napplies to multiple well-motivated generalizations of the classic MAB problem. In\nthese generalizations, additional structure is known in advance, causing the task of\nverifying the optimality of a candidate to be easier than discovering the best arm.\nOur results are focused on the scenario where the failure probability \u03b4 must be very\nlow; we essentially show that in this high confidence regime, identifying the best\narm is as easy as the task of verification. 
We demonstrate the effectiveness of our\nframework by applying it, and matching or improving the state-of-the-art results in\nthe problems of: Linear bandits, Dueling bandits with the Condorcet assumption,\nCopeland dueling bandits, Unimodal bandits and Graphical bandits.\n\n1 Introduction\n\nThe Multi-Armed Bandit (MAB) game is one where in each round the player chooses an action,\nalso referred to as an arm, from a pre-determined set. The player then gains a reward associated\nwith the chosen arm and observes the reward while rewards associated with the other arms are not\nrevealed. In the stochastic setting, each arm x has a fixed associated value \u00b5(x) throughout all rounds,\nand the reward associated with the arm is a random variable, independent of the history, with an\nexpected value of \u00b5(x). In this paper we focus on the pure exploration task [9] in the stochastic\nsetting where our objective is to identify the arm maximizing \u00b5(x) with sufficiently high probability,\nwhile minimizing the required number of rounds, otherwise known as the query complexity. This\ntask, as opposed to the classic task of maximizing the sum of accumulated rewards, is motivated\nby numerous scenarios where exploration (i.e. trying multiple options) is only possible in an initial\ntesting phase, and not throughout the running time of the game.\nAs an example consider a company testing several variations of a (physical) product, and then, once\nthe best one is identified, moving to a production phase where the product is mass-produced and\nshipped to numerous vendors. It is very natural to require that the identified option is the best one\nwith very high probability, as a mistake can be very costly. Generally speaking, the vast majority of\nuse cases of pure exploration require the error probability \u03b4 to be very small, so much so that\neven a logarithmic dependence on \u03b4 is non-negligible. 
Another example to demonstrate this is that\nof explore-then-exploit type algorithms. There are many examples of papers providing a solution to a\nregret-based MAB problem where the first phase consists of identifying the best arm with probability\nat least 1 \u2212 1/T, and then using it in the remainder of the rounds. Here, \u03b4 = 1/T is often assumed to\nbe the only non-constant parameter.\nWe do not focus on the classic MAB problem but rather on several extensions of it for settings where\nwe are given as input some underlying structural properties of the reward function \u00b5. We elaborate\non the formal definitions and different scenarios in Section 2. Another extension we consider is that\nof Dueling Bandits where, informally, we do not query a single arm but rather a pair, and rather than\nobserving the reward of the arms we observe a hint as to the difference between their associated \u00b5\nvalues. Each extension we discuss is motivated by different scenarios which we elaborate on in the\nupcoming sections. In all of the cases mentioned, we focus on the regime of high confidence, meaning\nthat the failure probability \u03b4 is very small.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nNotice that due to the additional structure (that does not exist in the classic case), verifying that a candidate\narm is indeed the best arm can be a much easier task, at least conceptually, compared to that of\ndiscovering which arm is the best. This observation leads us to the following design: Explore the\narms and obtain a candidate arm that is the best arm w.p. 1 \u2212 \u03ba for some constant \u03ba, then verify it\nis indeed the best with confidence 1 \u2212 \u03b4. If the exploration procedure happened to be correct, the\nquery complexity of the problem will be composed of a sum of two quantities. 
One is that of the\nexploration algorithm, which is completely independent of \u03b4, and the other depends on \u03b4 but is\nthe query complexity of the easier verification task. The query complexity is either dominated by\nthat of the verification task, or by that of the original task with a constant failure probability. Either\nway, for small values of \u03b4 the savings are potentially huge. As it turns out, as discussed in Section 3,\na careful combination of an exploration and verification algorithm can achieve an expected query\ncomplexity of H_explore + H_verify where H_explore is the exploration query complexity, independent of \u03b4,\nand H_verify is the query complexity of the verification procedure with confidence 1 \u2212 \u03b4. Below, we\ndesign exploration and verification algorithms for the problems of: Dueling bandits (\u00a74), Linear bandits\n(\u00a75), Unimodal graphical bandits (\u00a76) and Graphical bandits\u00b9. In the corresponding sections we provide\nshort reviews of each MAB problem, and analyze their exploration and verification algorithms. Our\nresults improve upon the state-of-the-art results in each of the mentioned problems (see Table 1 for\na detailed comparison).\nRelated Works: We are aware of one attempt to capture multiple (stochastic) bandit problems\nin a single framework, given in [20]. The focus there is mostly on problems where the observed\nrandom variables do not necessarily reflect the reward, such as the dueling bandit problem, rather\nthan methods to exploit structure between the arms. 
For example, in the case of the dueling bandit\nproblem with the Condorcet assumption their algorithm does not take advantage of the structural\nproperties and the corresponding query complexity is larger than that obtained here (see Section 4.1).\nWe review the previous literature of each specific problem in the corresponding sections.\n\n2 Formulation of Bandit Problems\n\nThe pure exploration Multi-Armed Bandit (MAB) problem, in the stochastic setting, can be generally\nformalized as follows. Our input consists of a set K of arms, where each arm x is associated with\nsome reward \u00b5(x). In each round t we play an arm x_t and observe the outcome of a random variable\nwhose expected value is \u00b5(x_t). Non-stochastic settings also exist yet they are outside the scope of\nour paper; see [4] for a survey on bandit problems, including the stochastic and non-stochastic settings.\nThe objective in the best arm identification problem is to identify the arm\u00b2 x* = arg max \u00b5(x) while\nminimizing the expected number of queries to the reward values of the arms. Other than the classic\nMAB problem, where K is a finite set and \u00b5 is an arbitrary function, there exist other frameworks\nwhere some structure is assumed regarding the behavior of \u00b5 over the arms of K. An example of a\ncommon framework matching this formulation, which we analyze in detail in Section 5, is that\nof the linear MAB. Here, K is a compact subset of R^d, and the reward function \u00b5 is assumed to be\nlinear. Unlike the classic MAB case, an algorithm can take advantage of the structure of \u00b5 and obtain\na performance that is independent of the size of K. 
Yet another example, discussed in Section 6, is that\nof unimodal bandits, where we are given a graph whose vertices are the arms, and it is guaranteed\nthat the best arm is the unique arm having a maximal value among its neighbors in the graph.\n\n\u00b9 Due to space restrictions we defer the section on Graphical bandits [7] to the extended version.\n\u00b2 This objective is naturally extended in the PAC setting where we are interested in an arm that is approximately\nthe best arm. For simplicity we restrict our focus to the best arm identification problem. We note that our general\nframework of exploration and verification can be easily expanded to handle the PAC setting as well.\n\nThe above general framework captures many variants of the MAB problem, yet does not capture\nthe Dueling Multi-Armed Bandit (DMAB) problem. Here, the input as before consists of a set of\narms denoted by K yet we are not allowed to play a single arm in a round but rather a pair x, y \u2208 K.\nThe general definition of the observation from playing the pair x, y is a random variable whose\nexpected value is P(x, y) where P : K \u00d7 K \u2192 R. The original motivating example for the DMAB\nproblem [22] is that of information retrieval, where a query to a pair of arms is a presentation of the\ninterleaved results of two ranking algorithms. The output is 0 or 1, depending on the choice of the\nuser, i.e. whether she chose a result from one ranker or the other. The \u00b5 score here can be thought\nof as a quality score for a ranker, defined according to the P scores. We elaborate on the motivation\nfor the DMAB problem and the exact definition of the best arm in Section 4. In an extended version\nof this paper we discuss the problem of graphical bandits that is in some sense a generalization of\nthe dueling bandit problem. 
There, we are not allowed to query any pair but rather pairs from some\npredefined set E \u2286 K \u00d7 K.\n\n3 Boosting the Exploration Process with a Verification Policy\n\nIn what follows we present results for different variants of the MAB problem. We discuss two types\nof problems. The first is the well known pure exploration problem. Our input is the MAB instance,\nincluding the set of arms and possible structural information, and a confidence parameter \u03ba. The\nobjective is to find the best arm w.p. at least 1 \u2212 \u03ba while using a minimal number of queries. We\noften discuss variants of the exploration problem where in addition to finding the best arm, we wish\nto obtain some additional information about the problem such as an estimate of the gaps of the reward\nvalue of suboptimal arms from the optimal one, the identity of important arm pairs, etc. We refer\nto this additional information as an advice vector \u03b8, and our objective is to minimize queries while\nobtaining a sufficiently accurate advice vector and the true optimal arm with probability at least\n1 \u2212 \u03ba. For each MAB problem we describe an algorithm referred to as FindBestArm with a query\ncomplexity\u00b3 of H_explore \u00b7 log(1/\u03ba) that obtains an advice vector \u03b8 that is sufficiently accurate\u2074 w.p. at\nleast 1 \u2212 \u03ba.\nDefinition 1. Let FindBestArm be an algorithm that given the MAB problem and confidence parameter\n\u03ba > 0 has the following guarantees. (1) With probability at least 1 \u2212 \u03ba it outputs a correct best\narm and advice vector \u03b8. (2) Its expected query complexity is H_explore \u00b7 log(1/\u03ba), where H_explore is\nsome instance-specific complexity (that is not required to be known).\n\nThe second type of problem is that of verification. 
Here we are given as input not only the MAB\nproblem and confidence parameter \u03b4, but also an advice vector \u03b8, including the identity of a candidate\noptimal arm.\nDefinition 2. Let VerifyBestArm be an algorithm that given the MAB problem, confidence parameter\n\u03b4 > 0 and an advice vector \u03b8 including a proposed identity of the best arm, has the following\nguarantees. (1) If the candidate optimal arm is not the actual optimal arm, the output is \u2018fail\u2019 w.p. at\nleast 1 \u2212 \u03b4. (2) If the advice vector is sufficiently accurate, and in particular the candidate is indeed\nthe optimal arm, the output is \u2018success\u2019 w.p. at least 1 \u2212 \u03b4. (3) If the advice vector is sufficiently\naccurate the expected query complexity is H_verify log(1/\u03b4). Otherwise, it is H_explore log(1/\u03b4).\nIt is very common that H_verify \u226a H_explore as it is clearly an easier problem to simply verify the\nidentity of the optimal arm rather than discover it. Our main result is thus somewhat surprising as\nit essentially shows that in the regime of high confidence, the best arm identification problem is as\neasy as verifying the identity of a candidate. Specifically we provide a complexity that is additive in\nH_explore and log(1/\u03b4) rather than multiplicative. The formal result is as follows.\n\nAlgorithm 1 Explore-Verify Framework\nInput: Best arm identification problem, oracle access to FindBestArm and VerifyBestArm with\n  failure probability tuning, failure probability parameter \u03b4, parameter \u03ba.\nfor r = 1, 2, . . . do\n  Call FindBestArm with failure probability \u03ba, denote by \u03b8 its output.\n  Call VerifyBestArm with advice vector \u03b8, that includes a candidate best arm x\u0302, and failure\n  probability \u03b4/(2r\u00b2). If it succeeded, return x\u0302. Else, continue to the next iteration.\nend for\n\n\u00b3 The general form of such algorithms is in fact H_1 log(1/\u03ba) + H_0. 
For simplicity we state our results for\nthe form H log(1/\u03ba); the general statements are an easy modification.\n\u2074 The exact definition of sufficiently accurate is given per problem instance.\n\nTheorem 3. Assume that Algorithm 1 is given oracle access to FindBestArm and VerifyBestArm\nwith the above mentioned guarantees, and a confidence parameter \u03b4 < 1/3. For any \u03ba < 1/3, the\nalgorithm identifies the best arm with probability 1 \u2212 \u03b4 while using an expected number of queries of at most\n\nO(H_explore log(1/\u03ba) + (H_verify + \u03ba \u00b7 H_explore) log(1/\u03b4))\n\nThe following provides the guarantees for two suggested values of \u03ba. The first may not be known to\nus but can very often be estimated beforehand. The second depends only on \u03b4, hence is always known\nin advance.\nCorollary 4. By setting \u03ba = min{1/3, H_verify/H_explore}, Algorithm 1 has an expected number of at\nmost\n\nO(H_explore log(H_explore/H_verify) + H_verify log(1/\u03b4))\n\nqueries. By setting \u03ba = min{1/3, 1/log(1/\u03b4)}, Algorithm 1 has an expected query complexity of at\nmost\n\nO(H_explore log(log(1/\u03b4)) + H_verify log(1/\u03b4))\n\nNotice that by setting \u03ba to min{1/3, 1/log(1/\u03b4)}, for any practical use case, the dependence on \u03b4\nin the left summand is nonexistent. In particular, this default value for \u03ba provides a multiplicative\nsaving of either H_explore/H_verify, i.e. the ratio between the exploration and verification problem, or\nlog(1/\u03b4)/log(log(1/\u03b4)). Since log(1/\u03b4) is rarely a negligible term, and as we will see in what follows, neither is\nH_explore/H_verify, the savings are significant, hence the effectiveness of our result.\n\nProof of Theorem 3. In the analysis we often discuss the output of the sub-procedures in round r > 1,\neven if the algorithm terminated before round r. We note that these values are well-defined random\nvariables regardless of the fact that we may not reach the round. 
To prove the correctness of the\nalgorithm notice that since $\\sum_{r=1}^{\\infty} r^{-2} \\le 2$ we have with probability at least 1 \u2212 \u03b4 that all runs of\nVerifyBestArm do not err. Since we halt only when VerifyBestArm outputs \u2018success\u2019 our algorithm\nindeed outputs the best arm w.p. at least 1 \u2212 \u03b4.\nWe proceed to analyze the expected query complexity, and start with a simple observation. Let\nQC_single(r) denote the expected query complexity in round r, and let Y_r be the indicator variable for\nwhether the algorithm reached round r. Since Y_r is independent of the procedures running in round r\nand in particular of the number of queries required by them, we have that the total expected query\ncomplexity is\n\n$$E\\left[\\sum_{r=1}^{\\infty} Y_r QC_{single}(r)\\right] = \\sum_{r=1}^{\\infty} E[Y_r] \\cdot E[QC_{single}(r)]$$\n\nHence, we proceed to analyze E[QC_single(r)] and E[Y_r] separately. For E[QC_single(r)] we have\n\n$$E[QC_{single}(r)] \\le H_{explore} \\log(1/\\kappa) + ((1-\\kappa) H_{verify} + \\kappa H_{explore}) \\log\\left(\\frac{2r^2}{\\delta}\\right) \\le H_{explore} \\log(1/\\kappa) + (\\kappa H_{explore} + H_{verify}) \\log\\left(\\frac{2r^2}{\\delta}\\right)$$\n\nTo explain the first inequality, the first summand is the complexity of FindBestArm. The second\nsummand is that of VerifyBestArm, which is decomposed into the complexity in the scenario where\nFindBestArm succeeded vs. the scenario where it failed. To compute E[Y_r], we notice that Y_r is an\nindicator function hence E[Y_r] = Pr[Y_r = 1]. In order for Y_r to take the value of 1 we must have that\nfor all rounds r' < r either VerifyBestArm or FindBestArm have failed. Since the failure or success of\nthe algorithms at different rounds are independent we have\n\n\nPr[Yr = 1] \uf8ff Yr0 0.\nA general framework capturing both the MAB and DMAB scenarios is that of partial monitoring\ngames introduced by [18]. In this framework, when playing an arm x \u2208 K one obtains a reward \u00b5(x) yet\nobserves a different function h(x). 
Some connection between h and \u00b5 is known in advance and based\non it, one can design a strategy to discover the best arm or minimize regret. As we do not present\nresults regarding this framework we do not elaborate on it any further, but rather mention that our\nresults, in terms of query complexity, cannot be matched by the existing results there.\n\n\u2075 It is actually common to define the output of P as a number in [0, 1] and have P(x, y) = 1 \u2212 P(y, x), but\nboth definitions are equivalent up to a linear shift of P.\n\n4.1 Dueling Bandits with the Condorcet Assumption\n\nThe Condorcet assumption in the Dueling bandit setting asserts the existence of an arm x* that beats\nall other arms. In this section we discuss a solution for finding this arm under the assumption of its\nexistence. Recall that the observable input consists of a set of arms K of size K. There is assumed\nto exist some matrix P mapping each pair of arms x, y \u2208 K to a number p_xy \u2208 [\u22121, 1]; the matrix\nP has a zero diagonal, meaning p_xx = 0, and is anti-symmetric, meaning p_xy = \u2212p_yx. A query to the pair\n(x, y) gives an observation of a random Bernoulli variable with expected value (1 + p_xy)/2 and is\nconsidered as an outcome of a match between x and y. As we assume the existence of a Condorcet\nwinner, there exists some x* \u2208 K with p_{x*y} > 0 for all y \u2260 x*.\nThe Condorcet dueling bandit problem, as stated here and without any additional assumptions, was\ntackled in several papers [20, 26, 16]. The best guarantees to date are given by [16], which provides an\nasymptotically optimal regret bound for the problem, for the regime of a very large time horizon. This\nresult can be transformed into a best-arm identification algorithm, and the corresponding guarantee is\nlisted in Table 1. 
Loosely speaking, the result shows that it suffices to query each pair sufficiently\nmany times to separate the corresponding P_{x,y} from 0.5 with constant probability, and additionally\nonly K pairs must be queried sufficiently many times in order to separate the corresponding P_{x,y} from\n0.5 with probability 1 \u2212 \u03b4. We note that other improvements exist that achieve a better constant term\n(the additive term independent of \u03b4) [25, 24] or an overall improved result via imposing additional\nassumptions about P such as an induced total order, stochastic triangle inequality etc. [22, 23, 1].\nThese types of results however fall outside the scope of our paper.\nIn Appendix B.1 we provide an exploration and verification algorithm for the problem. The exploration\nalgorithm queries all pairs until finding, for each suboptimal arm x, an arm y with p_xy < 0;\nthe exploration algorithm provides as output not only the identity of the optimal arm, but for each\nsub-optimal arm x, the identity of an arm y(x) that (approximately) maximizes p_yx, meaning it beats\nx by the largest gap. The verification procedure is now straightforward. Given the above advice the\nalgorithm makes sure that for each allegedly sub-optimal x, the arm y(x) indeed beats it, meaning\np_{y(x)x} > 0. We obtain the following formal result.\nTheorem 5. Algorithm 1, along with the exploration and verification algorithms given in Appendix B.1,\nfinds the Condorcet winner w.p. at least 1 \u2212 \u03b4 while using an expected amount of at\nmost\n\n$$\\tilde{O}\\left(\\sum_{y \\neq x^*} p_{x^* y}^{-2} + \\sum_{x \\neq x^*} \\sum_{y \\neq x} \\min\\left\\{ p_{xy}^{-2},\\ \\min_{y' : p_{xy'} < 0} p_{xy'}^{-2} \\right\\}\\right) + O\\left(\\sum_{x \\neq x^*} \\min_{y : p_{xy} < 0} p_{xy}^{-2} \\ln\\left(K / (\\delta p_{xy}^{2})\\right)\\right)$$\n\nqueries, where x* is the Condorcet winner.\n\n5 Application to Linear Bandits\n\nThe linear bandit problem was originally introduced in [2]. It captures multiple problems where there\nis linear structure among the available options. 
Its pure exploration variant (as opposed to the regret\nsetting) was recently discussed in [19]. Recall that in the linear bandit problem the set of arms K is\na subset of R^d. The reward function associated with an arm x is a random variable with expected\nvalue \u00b5(x) = w^\u22a4 x, for some unknown w \u2208 R^d. For simplicity we assume that w and all vectors\nof K lie inside the Euclidean unit ball, and that the noise is sub-gaussian with variance 1 (hence\nconcentration bounds such as Hoeffding\u2019s inequality can be applied).\nThe results of [19] offer two approaches. The first is a static strategy that guarantees, for failure\nprobability \u03ba, a query complexity of d log(K/\u03ba)/\u0394_min^2 with x* being the best arm, \u0394_x = w^\u22a4(x* \u2212 x) for\nx \u2260 x* and \u0394_min = min_{x \u2260 x*} \u0394_x. The second is adaptive and provides better bounds in a specific\ncase where the majority of the hardship of the problem is in separating the best arm from the second\nbest arm.\nThe algorithms are based on tools from the area of optimal design of experiments where the high level\nidea is the following: Consider our set of vectors (arms) K and an additional set of vectors Y. We are\ninterested in querying a sequence of t arms from K that will minimize the maximum variance of the\nestimation of w^\u22a4 y, where the maximum is taken over all y \u2208 Y. Recall that via the Azuma-Hoeffding\ninequality, one can show that by querying a set of points x_1, . . . , x_t and solving the Ordinary Least\nSquares (OLS) problem, one obtains an unbiased estimator of w and the corresponding variance at a\npoint y is\n\n$$\\rho_{x_1,\\ldots,x_t}(y) = y^\\top \\left(\\sum_{i=1}^{t} x_i x_i^\\top\\right)^{-1} y$$\n\nHence, our formal problem statement is to obtain a sequence x_1, . . . , x_t that minimizes \u03c1_{x_1,...,x_t}(Y)\ndefined as \u03c1_{x_1,...,x_t}(Y) = max_{y \u2208 Y} \u03c1_{x_1,...,x_t}(y). Tools from the area of optimal design of experiments\n(see e.g. 
[21]) provide ways to obtain such sequences that achieve a multiplicative approximation\nof 1 + d(d + 1)/t of the optimal sequence. In particular it is shown that as t tends to infinity, t\ntimes the \u03c1 value of the optimal sequence of length t tends to\n\n$$\\rho^*(Y) = \\min_{p} \\max_{y \\in Y} y^\\top \\left(\\sum_{x \\in K} p_x x x^\\top\\right)^{-1} y$$\n\nwith p restricted to being a distribution over K. We elaborate on these in the extended version of the\npaper.\n[19] propose and analyze two different choices of the set Y. The first is the set Y = K; querying\npoints of K in order to minimize \u03c1_{x_1,...,x_t}(K) leads to a best arm identification algorithm with a query\ncomplexity of d log(K/\u03ba)/\u0394_min^2 for failure probability \u03ba. We use essentially the same approach for\nthe exploration procedure (given in the extended version), and with the same (asymptotic) query\ncomplexity we obtain not only a candidate best arm x\u0302 but also approximations of the different \u0394_x\nfor all x \u2260 x*. These are required for the verification procedure.\nThe second interesting set Y is the set Y = { (x* \u2212 x)/\u0394_x | x \u2208 K, x \u2260 x* }. Clearly this set is not known to\nus in advance, but it helps in [19] to define a notion of the \u2018true\u2019 complexity of the problem. Indeed,\none cannot discover the best arm without verifying that it is superior to the others, and the set Y\nprovides the best strategy to do so. The authors show that\u2076\n\nmax_{y \u2208 Y} \u2016y\u2016\u00b2 \u2264 \u03c1*(Y) \u2264 4d/\u0394_min^2\n\nand bring examples where each of the inequalities is tight. Notice that the multiplicative gap between\nthe bounding expressions can be huge (at least linear in the dimension d), hence an algorithm with a\nquery complexity depending on \u03c1*(Y) as opposed to d/\u0394_min^2 can potentially be much better than the\nabove mentioned algorithm. The bound on \u03c1*(Y) proves in particular that indeed querying w.r.t. Y is a\nbetter strategy than querying w.r.t. K. 
This immediately translates into a verification procedure. Given\nthe advice from our exploration procedure, we have access to a candidate best arm, and approximate \u0394\nvalues. Hence, we construct this set Y and query according to it. We show that given correct advice,\nthe query complexity for failure probability \u03b4 is at most O(\u03c1*(Y*) log(K\u03c1*(Y*)/\u03b4)). Combining\nthe exploration and verification algorithms, we get the following result.\nTheorem 6. Algorithm 1, along with the exploration and verification algorithms described above\n(we give the formal version only in the extended version of the paper), finds the best arm w.p. at\nleast 1 \u2212 \u03b4 while using an expected query complexity of\n\n$$O\\left(\\frac{d \\log(Kd/\\Delta_{min}^2)}{\\Delta_{min}^2} + \\rho^*(Y^*) \\log(1/\\delta)\\right)$$\n\n6 Application to Unimodal Bandits\n\nThe unimodal bandit problem consists of a MAB problem given unimodality information. We focus\non a graphical variant defined as follows: There exists some graph G whose vertex set is the set of\narms K and an arbitrary edge set E. For every sub-optimal arm x there exists some neighbor y in the\ngraph such that \u00b5(x) < \u00b5(y). In other words, the best arm x* is the unique arm having a superior\nreward compared to its immediate neighbors. The graphical unimodal bandit problem was introduced\nby\u2077 [13].\n\n\u2076 Under the assumption that all vectors in K lie in the Euclidean unit sphere.\n\u2077 Other variants of the unimodal bandit problem exist, e.g. one where the arms are the scalars in the interval\n[0, 1], yet we do not deal with them in this paper, as we focus on pure best arm identification problems and in\nthat scenario the regret setting is more common, and only a PAC algorithm is possible, translating to a T^{2/3}\nrather than \u221aT regret algorithm.\n\nDue to space constraints we limit the discussion here to a specific type of unimodal bandits in\nwhich the underlying graph is a line. 
The motivation here comes from a scenario where the point\nset K represents an \u03b5-net over the [0, 1] interval and the \u00b5 values come from some unimodal one-dimensional\nfunction. We discuss the more general graph scenario only in the extended version of\nthe paper. To review the existing results we introduce some notation. For an arm x let \u0393(x) denote\nthe set of its neighbors in the graph. For a suboptimal arm x we let \u0394^\u0393_x = max_{y \u2208 \u0393(x)} \u00b5(y) \u2212 \u00b5(x)\nbe the gap between the reward of x and its neighbors and let \u0394_x = \u00b5(x*) \u2212 \u00b5(x) be its gap from the\nbest arm x*. We denote by \u0394^\u0393_min the minimal value of \u0394^\u0393_x and by \u0394_min the minimal value of \u0394_x.\nNotice that in reasonable scenarios, for a typical arm x we have \u0394^\u0393_x \u226a \u0394_x since many arms are far\nfrom being optimal but have a close value to those of their two neighbors.\nThe state-of-the-art result to date, as far as we are aware, for the problem at hand is by [6], where\na method OSUB is proposed achieving an expected query complexity of (up to logarithmic terms\nindependent of \u03b4)\u2078\n\n$$O\\left(\\sum_{x \\neq x^*} (\\Delta^\\Gamma_x)^{-2} + \\sum_{x \\in \\Gamma(x^*)} \\Delta_x^{-2} \\log(1/\\delta)\\right)$$\n\nThey show that the summand with the logarithmic dependence on \u03b4 is optimal. In the context of a\nline graph we provide an algorithm whose exploration is a simple naive application of a best arm\nidentification algorithm that ignores the structure of the problem, e.g. Exponential Gap-Elimination\nby [15]. The verification algorithm requires only the identity of the candidate best arm as advice. It\nsimply applies a best arm identification algorithm over the candidate arm and its neighborhood. The\nfollowing provides our formal results.\nTheorem 7. Algorithm 1, along with the exploration of Exponential Gap-Elimination and the\nverification algorithm of Exponential Gap-Elimination, applied to the neighborhood of the candidate\nbest arm, finds the best arm w.p. 
at least 1 \u2212 \u03b4 while using an expected query complexity of\n\n$$O\\left(\\sum_{x \\neq x^*} \\Delta_x^{-2} \\log(K/\\Delta_{min}) + \\sum_{x \\in \\Gamma(x^*)} \\Delta_x^{-2} \\log(1/\\delta)\\right)$$\n\nThe improvement w.r.t. the results of [6] is in the constant term independent of \u03b4. The replacement\nof \u0394^\u0393_x with \u0394_x leads to a significant improvement for many reasonable unimodal functions. For\nexample, if the arms form an \u03b5-net over the [0, 1] interval, and the function is O(1)-Lipschitz, then\n\u2211_{x \u2260 x*} (\u0394^\u0393_x)^{\u22122} = \u03a9(\u03b5^{\u22123}) while \u2211_{x \u2260 x*} (\u0394_x)^{\u22122} can potentially be O(\u03b5^{\u22122}). Perhaps for this reason,\nexperiments in [6] showed that often, performing UCB on an \u03b5-net is superior to other algorithms.\n\n\u2078 The result of [6] is in fact tighter in the sense that it takes advantage of the variance of the estimators by\nusing confidence bounds based on KL-divergence. In the case of uniform variance however, the stated results\nhere are accurate. More importantly, the KL-divergence type techniques can be applied here to obtain the same\ntype of guarantees, at the expense of a slightly more technical analysis. For this reason we present the results for\nthe case of uniform variance.\n\n7 Conclusions\n\nWe presented a general framework for improving the performance of best-arm identification problems,\nfor the regime of high confidence. Our framework is based on the fact that in MAB problems with\nstructure, it is often easier to design an algorithm for verifying that a candidate arm is the best one, rather\nthan discovering the identity of the best arm. We demonstrated the effectiveness of our framework by\nimproving the state-of-the-art results in several MAB problems.\n\nReferences\n[1] Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In\nProceedings of the 31st International Conference on Machine Learning (ICML-14), pages 856\u2013864, 2014.\n\n[2] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. The Journal of Machine\nLearning Research, 3:397\u2013422, 2003.\n\n[3] Akshay Balsubramani, Zohar Karnin, Robert Schapire, and Masrour Zoghi. Instance-dependent regret\nbounds for dueling bandits. In Proceedings of The 29th Conference on Learning Theory, COLT 2016,\n2016.\n\n[4] S\u00e9bastien Bubeck and Nicol\u00f2 Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed\nbandit problems. Foundations and Trends in Machine Learning, 5(1):1\u2013122, 2012.\n\n[5] R\u00f3bert Busa-Fekete, Bal\u00e1zs Sz\u00f6r\u00e9nyi, and Eyke H\u00fcllermeier. PAC rank elicitation through adaptive sampling\nof stochastic pairwise preferences. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial\nIntelligence, AAAI, 2014.\n\n[6] Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms.\nIn Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 521\u2013529,\n2014.\n\n[7] Dotan Di Castro, Claudio Gentile, and Shie Mannor. Bandits with an edge. CoRR, abs/1109.2296, 2011.\n\n[8] Miroslav Dud\u00edk, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual\ndueling bandits. In Gr\u00fcnwald et al. [11], pages 563\u2013587.\n\n[9] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the\nmulti-armed bandit and reinforcement learning problems. The Journal of Machine Learning Research,\n7:1079\u20131105, 2006.\n\n[10] J. F\u00fcrnkranz and E. H\u00fcllermeier, editors. Preference Learning. Springer-Verlag, 2010.\n\n[11] Peter Gr\u00fcnwald, Elad Hazan, and Satyen Kale, editors. Proceedings of The 28th Conference on Learning\nTheory, COLT 2015, Paris, France, July 3-6, 2015, volume 40 of JMLR Proceedings. JMLR.org, 2015.\n\n[12] K. Hofmann, S. Whiteson, and M. de Rijke. Balancing exploration and exploitation in listwise and pairwise\nonline learning to rank for information retrieval. 
Information Retrieval, 16(1):63\u201390, 2013.\n\n[13] Jia Yuan Yu and Shie Mannor. Unimodal bandits. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 41\u201348, 2011.\n\n[14] T. Joachims. Optimizing search engines using clickthrough data. In KDD, 2002.\n\n[15] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1238\u20131246, 2013.\n\n[16] Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Gr\u00fcnwald et al. [11], pages 1141\u20131154.\n\n[17] Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Copeland dueling bandit problem: Regret lower bound, optimal algorithm, and computationally efficient algorithm, 2016.\n\n[18] A. Piccolboni and C. Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In Computational Learning Theory, pages 208\u2013223, 2001.\n\n[19] Marta Soare, Alessandro Lazaric, and R\u00e9mi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828\u2013836, 2014.\n\n[20] Tanguy Urvoy, Fabrice Clerot, Raphael F\u00e9raud, and Sami Naamane. Generic exploration and k-armed voting bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 91\u201399, 2013.\n\n[21] Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd international conference on Machine learning, pages 1081\u20131088. ACM, 2006.\n\n[22] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538\u20131556, September 2012.\n\n[23] Y. Yue and T. Joachims. Beat the mean bandit. 
In ICML, 2011.\n\n[24] Masrour Zoghi, Zohar Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307\u2013315, 2015.\n\n[25] Masrour Zoghi, Shimon Whiteson, and Maarten de Rijke. MergeRUCB: A method for large-scale online ranker evaluation. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 17\u201326. ACM, 2015.\n\n[26] Masrour Zoghi, Shimon Whiteson, R\u00e9mi Munos, and Maarten de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 10\u201318, 2014.", "award": [], "sourceid": 105, "authors": [{"given_name": "Zohar", "family_name": "Karnin", "institution": "Yahoo Research"}]}