{"title": "Thresholding Bandit with Optimal Aggregate Regret", "book": "Advances in Neural Information Processing Systems", "page_first": 11664, "page_last": 11673, "abstract": "We consider the thresholding bandit problem, whose goal is to find arms of mean rewards above a given threshold $\\theta$, with a fixed budget of $T$ trials. We introduce LSA, a new, simple and anytime algorithm that aims to minimize the aggregate regret (or the expected number of mis-classified arms). We prove that our algorithm is instance-wise asymptotically optimal. We also provide comprehensive empirical results to demonstrate the algorithm's superior performance over existing algorithms under a variety of different scenarios.", "full_text": "Thresholding Bandit with Optimal Aggregate Regret\n\nChao Tao\n\nComputer Science Department\n\nIndiana University at Bloomington\n\nSa\u00fal A. Blanco\n\nComputer Science Department\n\nIndiana University at Bloomington\n\nJian Peng\n\nComputer Science Department\n\nUniversity of Illinois at Urbana-Champaign\n\nComputer Science Department, Indiana University at Bloomington\n\nDepartment of ISE, University of Illinois at Urbana-Champaign\n\nYuan Zhou\n\nAbstract\n\nWe consider the thresholding bandit problem, whose goal is to \ufb01nd arms of mean\nrewards above a given threshold \u03b8, with a \ufb01xed budget of T trials. We introduce\nLSA, a new, simple and anytime algorithm that aims to minimize the aggregate\nregret (or the expected number of mis-classi\ufb01ed arms). We prove that our algo-\nrithm is instance-wise asymptotically optimal. 
We also provide comprehensive empirical results to demonstrate the algorithm's superior performance over existing algorithms under a variety of different scenarios.

1 Introduction

The stochastic Multi-Armed Bandit (MAB) problem has been extensively studied in the past decade (Auer, 2002; Audibert et al., 2010; Bubeck et al., 2009; Gabillon et al., 2012; Karnin et al., 2013; Jamieson et al., 2014; Garivier and Kaufmann, 2016; Chen et al., 2017). In the classical framework, at each trial of the game, a learner faces a set of K arms, pulls an arm, and receives an unknown stochastic reward. Of particular interest is the fixed-budget setting, in which the learner is only given a limited number of total pulls. Based on the received rewards, the learner recommends the best arm, i.e., the arm with the highest mean reward. In this paper, we study a variant of the MAB problem, called the Thresholding Bandit Problem (TBP). In TBP, instead of finding the best arm, we expect the learner to identify all the arms whose mean rewards θ_i (i ∈ {1, 2, . . . , K}) are greater than or equal to a given threshold θ. This is a very natural setting with direct real-world applications to active binary classification and anomaly detection (Locatelli et al., 2016; Steinwart et al., 2005).

In this paper, we propose to study TBP under the notion of aggregate regret, which is defined as the expected number of errors after T samples of the bandit game. Specifically, for a given algorithm A and a TBP instance I with K arms, if we use e_i to denote the probability that the algorithm makes an incorrect decision for arm i after T rounds of samples, the aggregate regret is defined as R^A(I; T) def= Σ_{i=1}^K e_i. In contrast, most previous works on TBP aim to minimize the simple regret, which is the probability that at least one of the arms is incorrectly labeled. 

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
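As a toy numerical illustration of the difference between the two objectives, consider hypothetical per-arm error probabilities (these numbers are illustrative, not from the paper), with the simple regret computed under an independence assumption across arms:

```python
# Hypothetical per-arm error probabilities (not from the paper's experiments).
# Aggregate regret sums the per-arm error probabilities; simple regret is the
# probability that at least one arm is mislabeled (here assuming the per-arm
# decisions are independent).
import math

e = [0.01, 0.02, 0.49]  # the third arm is nearly ambiguous

aggregate_regret = sum(e)
simple_regret = 1.0 - math.prod(1.0 - ei for ei in e)

print(round(aggregate_regret, 4))  # 0.52
print(round(simple_regret, 4))     # 0.5052
```

A single near-ambiguous arm already pushes the simple regret toward 1/2, while the aggregate regret still meaningfully reflects that the other arms are labeled accurately.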
Note that the definition of aggregate regret directly reflects the overall classification accuracy of the TBP algorithm, which is more meaningful than the simple regret in many real-world applications. For example, in the crowdsourced binary labeling problem, the learner faces K binary classification tasks, where each task i is associated with a latent true label z_i ∈ {0, 1} and a latent soft-label θ_i. The soft-label θ_i may be used to model the labeling difficulty/ambiguity of the task, in the sense that a θ_i fraction of the crowd will label task i as 1 and the rest will label task i as 0. The crowd is also assumed to be reliable, i.e., z_i = 1 if and only if θ_i ≥ 1/2. The goal of the crowdsourcing problem is to sequentially query a random worker from the large crowd about his/her label on task i, for a budget of T queries in total, and then label the tasks with as high (expected) accuracy as possible. If we treat each binary classification task as a Bernoulli arm with mean reward θ_i, then this crowdsourcing problem becomes aggregate regret minimization in TBP with θ = 1/2. If a few tasks are extremely ambiguous (i.e., θ_i → 1/2), the simple regret would trivially approach 1 (i.e., every algorithm would almost always fail to correctly label all tasks). In such cases, however, a good learner may turn to accurately labeling the less ambiguous tasks and still achieve a meaningful aggregate regret.

A new challenge arising in TBP with aggregate regret is how to balance the exploration of each arm given a fixed budget. Different from the exploration vs. exploitation trade-off in classical MAB problems, where exploration is aimed only at finding the best arm, TBP expects to maximize the classification accuracy over all arms. 
Let Δ_i def= |θ_i − θ| be the hardness parameter, or gap, of each arm i. An arm with a smaller Δ_i needs more samples to achieve the same classification confidence. A TBP learner thus faces the following dilemma: whether to allocate samples to determine the classification of one hard arm, or to use them to improve the accuracy on another, easier arm.

Related Work. Since we focus on the TBP problem in this paper, due to space limitations we are unable to include the significant number of references to other MAB variants.

In a previous work (Locatelli et al., 2016), the authors introduced the APT (Anytime Parameter-free Thresholding) algorithm with the goal of simple regret minimization. In this algorithm, a precision parameter ε is used to determine the tolerance of errors (a.k.a. the indifference zone), and the APT algorithm only attempts to correctly classify the arms with hardness gap Δ_i > ε. This variant goal of simple regret partly alleviates the trivialization problem, mentioned previously, that is caused by extremely hard arms. In detail, at any time t, APT selects the arm that minimizes √(T_i(t)) · Δ̂_i(t), where T_i(t) is the number of times arm i has been pulled until time t, Δ̂_i(t) is defined as |θ̂_i(t) − θ| + ε, and θ̂_i(t) is the empirical mean reward of arm i at time t. In their experiments, Locatelli et al. (2016) also adapted the UCBE algorithm from (Audibert et al., 2010) to the TBP problem and showed that APT performs better than UCBE.

When the goal is to minimize the aggregate regret, the APT algorithm also works better than UCBE. However, we notice that the choice of the precision parameter ε has a significant influence on the algorithm's performance. 
A large ε ensures that, when the sample budget is limited, the APT algorithm is not lured by a hard arm into spending overwhelmingly many samples on it without achieving a confident label. However, when the sample budget is ample, a large ε prevents the algorithm from making enough samples for the arms with hardness gap Δ_i < ε. Theoretically, the optimal selection of this precision parameter ε may differ significantly across instances, and it also depends on the budget T. In this work, we propose an algorithm that does not require such a precision parameter and demonstrates improved robustness in practice. Furthermore, a simple corollary of our main theorem (Theorem 1) shows that, for the simple regret with no indifference zone (ε = 0), our LSA algorithm achieves optimality up to a ln K factor in the budget T compared with APT(0). We attach experimental results in Appendix F.1 to show that LSA performs better than APT(0) for the simple regret objective.

Another natural approach to TBP is the uniform sampling method, where the learner plays each arm the same number of times (about T/K times). In Appendix C, we show that the uniform sampling approach may need Ω(K) times more budget than the optimal algorithm to achieve the same aggregate regret.

Finally, Chen et al. (2015) proposed the optimistic knowledge gradient heuristic algorithm for budget allocation in crowdsourced binary classification with Beta priors, which is closely related to the TBP problem in the Bayesian setting.

Our Results and Contributions. Let R^A(I; T) denote the aggregate regret of algorithm A on an instance I after T time steps. Given a sequence of hardness parameters Δ_1, Δ_2, . . . 
, Δ_K, assume that I_{Δ_1,...,Δ_K} is the class of all K-arm instances in which the gap between the mean θ_i of the i-th arm and the threshold θ is Δ_i, and let

    OPT({Δ_i}_{i=1}^K, T) def= inf_A sup_{I ∈ I_{Δ_1,...,Δ_K}} R^A(I; T)    (1)

be the minimum possible aggregate regret that any algorithm can achieve among all instances with the given set of gap parameters. We say that an algorithm A is instance-wise asymptotically optimal if for every T, any set of gap parameters {Δ_i}_{i=1}^K, and any instance I ∈ I_{Δ_1,...,Δ_K}, it holds that

    R^A(I; T) ≤ O(1) · OPT({Δ_i}_{i=1}^K, Ω(T)).    (2)

While it may appear that a constant factor multiplied to T can affect the regret if the optimal regret is an exponential function of T, we note that our definition aligns with major multi-armed bandit literature (e.g., fixed-budget best arm identification (Gabillon et al., 2012; Carpentier and Locatelli, 2016) and thresholding bandit with simple regret (Locatelli et al., 2016)). Indeed, according to our definition, if the universal optimal algorithm requires a budget of T to achieve ε regret, an asymptotically optimal algorithm requires a budget of only T times some constant to achieve the same order of regret. On the other hand, if one wishes to pin down the optimal constant before T, even in the single-arm case, the question boils down to the optimal (and distribution-dependent) constant in the exponent of existing concentration bounds such as the Chernoff Bound, Hoeffding's Inequality, and Bernstein Inequalities, which is beyond the scope of this paper.

We address the challenges mentioned previously and introduce a simple and elegant algorithm, the Logarithmic-Sample Algorithm (LSA). 
LSA has a very similar form as the APT algorithm in\n(Locatelli et al., 2016) but introduces an additive term that is proportional to the logarithm of the\nnumber of samples made to each arm in order to more carefully allocate the budget among the arms\n(see Line 4 of Algorithm 1). This logarithmic term arises from the optimal sample allocation scheme\nof an of\ufb02ine algorithm when the gap parameters are known beforehand. The log-sample additive term\nof LSA can be interpreted as an incentive to encourage the samples for arms with bigger gaps and/or\nless explored arms, which boasts a similar idea as the incentive term in the famous Upper Con\ufb01dence\nBound (UCB) type of algorithms that date back to (Lai and Robbins, 1985; Agrawal, 1995; Auer,\n2002), while interestingly the mathematical forms of the two incentive terms are very different.\nAs the main theoretical result of this paper, we analyze the aggregate regret upper bound of LSA in\nTheorem 1. We complement the upper bound result with a lower bound theorem (Theorem 20) for\nany online algorithm. In Remark 2, we compare the upper and lower bounds and show that LSA is\ninstance-wise asymptotically optimal.\nWe now highlight the technical contributions made in our regret upper bound analysis at a very high\nlevel. Please refer to Section 4 for more detailed explanations. In our proof of the upper bound\ntheorem, we \ufb01rst de\ufb01ne a global class of events {FC} (in (14)) which serves as a measurement of\nhow well the arms are explored. Our analysis then goes by two steps. In the \ufb01rst step, we show that\nFC happens with high probability, which intuitively means that all arms are \u201cwell explored\u201d. In the\nsecond step, we show the quantitative upper bound on the mis-classi\ufb01cation probability for each\narm, when conditioned on FC. The \ufb01nal regret bound follows by summing up the mis-classi\ufb01cation\nprobability for each arm via linearity of expectation. 
Using this approach, we successfully bypass the analysis that involves pairs of (or even more) arms, which usually brings in union bound arguments and extra ln K terms. Indeed, such a ln K slack appears between the upper and lower bounds proved in (Locatelli et al., 2016). In contrast, our LSA algorithm is asymptotically optimal, without any super-constant slack.

Another important technical ingredient that is crucial to the asymptotic optimality analysis is a new concentration inequality for the empirical mean of an arm that holds uniformly over all time periods, which we refer to as the Variable Confidence Level Bound. This new inequality helps to remove an extra ln ln T factor in the upper bound. It is also a strict improvement of Hoeffding's celebrated Maximal Inequality, which might be useful in many other problems.

Finally, we highlight that LSA is anytime, i.e., it does not need to know the time horizon T beforehand. LSA does use a universal tuning parameter; however, this parameter does not depend on the instances. As we will show in Section 5, the choice of the parameter is quite robust, and the natural parameter setting leads to superior performance of LSA on a set of very different instances, while APT may suffer from poor performance if the precision parameter is not chosen well for an instance.

Organization. The rest of the paper is organized as follows. In Section 2 we provide the necessary notation and definitions. We then present the details of the LSA algorithm in Section 3 and upper bound its aggregate regret in Section 4. In Section 5, we present experiments establishing the empirical advantages of LSA over other algorithms. The instance-wise aggregate regret lower bound theorem is deferred to Appendix E.

2 Problem Formulation and Notation

Given an integer K > 1, we let S = [K] def= {1, 2, . . . 
, K} be the set of K arms in an instance I. Each arm i ∈ S is associated with a distribution D_i supported on [0, 1] with an unknown mean θ_i. We are interested in the following dynamic game setting: at any round t ≥ 1, the learner chooses to pull an arm i_t from S and receives an i.i.d. reward sampled from D_{i_t}.

We let T, with T ≥ K, be the time horizon, or the budget, of the game, which is not necessarily known beforehand. We furthermore let θ ∈ (0, 1) be the threshold of the game. After T rounds, the learner A has to determine, for every arm i ∈ S, whether or not its mean reward is greater than or equal to θ. The learner thus outputs a vector (d_1, . . . , d_K) ∈ {0, 1}^K, where d_i = 0 if and only if A decides that θ_i < θ. The goal of the Thresholding Bandit Problem (TBP) in this paper is to maximize the expected number of correct labels after T rounds of the game.

More specifically, for any algorithm A, we use E^A_i(T) to denote the event that A's decision for arm i is correct after T rounds of the game. The goal of the TBP algorithm is to minimize the aggregate regret, which is the expected number of incorrect classifications over the K arms, i.e.,

    R^A(T) = R^A(I; T) def= E[ Σ_{i=1}^K 1{Ē^A_i(T)} ],    (3)

where Ē denotes the complement of event E and 1{condition} denotes the indicator function.

Let X_{i,t} denote the random variable representing the sample received by pulling arm i for the t-th time. We further write

    θ̂_{i,t} def= (1/t) Σ_{s=1}^t X_{i,s}  and  Δ̂_{i,t} def= |θ̂_{i,t} − θ|    (4)

to denote the empirical mean and the empirical gap of arm i after being pulled t times, respectively. For a given algorithm A, let T^A_i(t) and θ̂^A_i(t) denote the number of times arm i is pulled and the empirical mean reward of arm i after t rounds of the game, respectively. For each arm i ∈ S, we use Δ̂^A_i(t) def= |θ̂^A_i(t) − θ| to denote the empirical gap after t rounds of the game. We will omit the reference to A when it is clear from the context.

3 Our Algorithm

We now motivate our Logarithmic-Sample Algorithm by first designing an optimal but unrealistic algorithm under the assumption that the hardness gaps {Δ_i}_{i∈S} are known beforehand. Consider the following algorithm O: it pulls arm i a total of x_i times and makes a decision based on the empirical mean θ̂_{i,x_i}: if θ̂_{i,x_i} ≥ θ, the algorithm decides that θ_i ≥ θ, and it decides θ_i < θ otherwise. Note that this is all an algorithm can do when the gaps Δ_i are known. We upper bound the aggregate regret of this algorithm by

    R^O(T) = Σ_{i=1}^K P(Ē^O_i(T)) ≤ Σ_{i=1}^K P(|θ̂_{i,x_i} − θ_i| ≥ Δ_i) ≤ Σ_{i=1}^K 2 exp(−2 x_i Δ_i^2),    (5)

where the last inequality follows from the Chernoff-Hoeffding Inequality (Proposition 5). We would now like to minimize the right-hand side of (5), and hence upper bound the aggregate regret of the optimal algorithm O by

    2 · min_{x_1+···+x_K = T, x_1,...,x_K ∈ N} Σ_{i=1}^K exp(−2 x_i Δ_i^2) = 2 P*_2({Δ_i}_{i∈S}, T).

Here, for every c > 0, we define

    P*_c({Δ_i}_{i∈S}, T) def= min_{x_1+···+x_K = T, x_1,...,x_K ∈ N} Σ_{i=1}^K exp(−c x_i Δ_i^2).    (6)

We naturally introduce the following continuous relaxation of the program P*_c, by defining

    P_c({Δ_i}_{i∈S}, T) def= min_{x_1+···+x_K = T, x_1,...,x_K ≥ 0} Σ_{i=1}^K exp(−c x_i Δ_i^2).    (7)

P_c approximates P*_c well, as it is straightforward to see that

    P_c({Δ_i}_{i∈S}, T) ≤ P*_c({Δ_i}_{i∈S}, T) ≤ P_c({Δ_i}_{i∈S}, T − K).    (8)

We apply the Karush-Kuhn-Tucker (KKT) conditions to the optimization problem P_2({Δ_i}_{i∈S}, T) and find that the optimal solution satisfies

    x_i Δ_i^2 + ln Δ_i^{−1} ≥ Φ,  for i ∈ S,    (9)

where Φ def= max{x : Σ_{i=1}^K max{(x − ln Δ_i^{−1})/Δ_i^2, 0} ≤ T} is independent of i ∈ S. Furthermore, since Σ_{i=1}^K max{(x − ln Δ_i^{−1})/Δ_i^2, 0} is an increasing continuous function of x, Φ is indeed well-defined. 
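The water-filling characterization in (9) suggests a simple way to compute the relaxed-optimal allocation numerically. The helper below is our own sketch (not from the paper's appendix): it finds the level Φ by bisection on the increasing budget function and then reads off the allocation x_i = max{(Φ − ln Δ_i^{−1})/Δ_i^2, 0}:

```python
# Our own illustrative helper (not the paper's code): compute the
# water-filling level Phi from (9) by bisection, then the relaxed-optimal
# allocation x_i = max{(Phi - ln(1/Delta_i)) / Delta_i^2, 0} for P_2.
import math

def budget_used(phi, gaps):
    # Total budget consumed at level phi; increasing and continuous in phi.
    return sum(max((phi - math.log(1.0 / d)) / d**2, 0.0) for d in gaps)

def optimal_allocation(gaps, T, iters=100):
    lo = 0.0
    # An upper bracket: at this level every arm's term is at least T.
    hi = max(math.log(1.0 / d) for d in gaps) + max(d**2 for d in gaps) * T
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if budget_used(mid, gaps) <= T:
            lo = mid
        else:
            hi = mid
    phi = lo
    return [max((phi - math.log(1.0 / d)) / d**2, 0.0) for d in gaps]

gaps = [0.3, 0.1, 0.05]   # illustrative gap parameters
x = optimal_allocation(gaps, T=1000)
# The allocation exhausts the budget, and harder arms (smaller gaps)
# receive more samples.
```

Note how the ln Δ_i^{−1} offset in (9) shifts samples away from large-gap arms rather than equalizing x_i Δ_i^2 exactly.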
Please\nrefer to Lemma 10 of Appendix B for the details of the relevant calculations.\nIn light of (8) and (9), the following algorithm O(cid:48) (still, with the unrealistic assumption of the\nknowledge of the gaps {\u2206i}i\u2208S) incrementally solves Pc and approximates the algorithm O \u2013 at\neach time t, the algorithm selects the arm i that minimizes Ti(t \u2212 1)\u22062\nprecise gap quantities, we use the empirical estimates (cid:98)\u22062\nOur proposed algorithm is very close to O(cid:48). Since in reality the algorithm does not have access to the\nthe logarithmic term, if we also use ln(cid:98)\u2206\ni term. For\nof ln(cid:98)\u2206\n, we may encounter extremely small\nempirical estimates when the arm is not suf\ufb01ciently sampled, which would lead to unbounded values\nterm). In light of this, we use(cid:112)\n, and render the arm almost impossible to be sampled in future. To solve this problem,\nwe note that O(cid:48) tries to maintain Ti(t \u2212 1)\u22062\ni to be roughly the same across the arms (if ignoring\n\u22121\n\u22121\nthe ln \u2206\n. This\nTi(t \u2212 1) to roughly estimate the order of \u2206\nTo summarize, at each time t, our algorithm selects the arm i that minimizes \u03b1 \u00b7 Ti(t \u2212 1)((cid:98)\u2206i(t \u2212\ni\ni\nencourages the exploration of both the arms with larger gaps and the ones with fewer trials.\n\n1))2 + 0.5 ln Ti(t \u2212 1), where \u03b1 > 0 is a universal tuning parameter, and plays the arm. 
The details of the algorithm are presented in Algorithm 1.

Algorithm 1 Logarithmic-Sample Algorithm, LSA(S, θ)
1: Input: a set of arms S = [K], threshold θ
2: Initialization: pull each arm once
3: for t = K + 1 to T do
4:   Pull arm i_t = argmin_{i∈S} ( α T_i(t − 1) (Δ̂_i(t − 1))^2 + 0.5 ln T_i(t − 1) )
5: For each arm i ∈ S, let d_i ← 1 if θ̂_i(T) ≥ θ and d_i ← 0 otherwise
6: Output: (d_1, . . . , d_K)

4 Regret Upper Bound for LSA

In this section, we show the upper bound on the aggregate regret of Algorithm 1. Let x = Λ be the solution to the following equation:

    Σ_{i=1}^K ( 1{x ≤ ln Δ_i^{−1}} · exp(2x) + 1{x > ln Δ_i^{−1}} · (x − ln Δ_i^{−1} + α)/(α Δ_i^2) ) = T / max{40/α + 1, 40}.    (10)

Notice that the left-hand side of (10) is a strictly increasing, continuous function of x ≥ 0 that equals K at x = 0 and goes to infinity as x → ∞. Hence Λ is guaranteed to exist and is uniquely defined when T is large. Furthermore, for any i ∈ S, we let

    λ_i def= 1{Λ ≤ ln Δ_i^{−1}} · exp(2Λ) + 1{Λ > ln Δ_i^{−1}} · (Λ − ln Δ_i^{−1} + α)/(α Δ_i^2).    (11)

We note that {λ_i}_{i∈S} is the optimal solution to P_{2α}({max{Δ_i, exp(−Λ)}}_{i∈S}, T / max{40/α + 1, 40}). 
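Since the left-hand side of (10) is continuous and strictly increasing in x, Λ can be computed numerically by bisection; the sketch below is our own (with illustrative gaps and budget, not the paper's code):

```python
# Our own illustrative sketch (not from the paper): solve equation (10)
# for Lambda by bisection, using that its left-hand side is strictly
# increasing and continuous in x.
import math

def lhs(x, gaps, alpha):
    total = 0.0
    for d in gaps:
        thr = math.log(1.0 / d)
        if x <= thr:
            total += math.exp(2.0 * x)
        else:
            total += (x - thr + alpha) / (alpha * d * d)
    return total

def solve_lambda(gaps, T, alpha, iters=200):
    target = T / max(40.0 / alpha + 1.0, 40.0)
    lo, hi = 0.0, 1.0
    while lhs(hi, gaps, alpha) < target:  # grow a bracket; lhs -> infinity
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if lhs(mid, gaps, alpha) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

lam = solve_lambda([0.3, 0.1, 0.05], T=100000, alpha=1.35)
# lhs(lam) matches T / max{40/alpha + 1, 40} up to bisection tolerance,
# and lam feeds directly into the definition of lambda_i in (11).
```

The condition T ≥ max{40/α + 1, 40} · K from Theorem 1 guarantees the target is at least K, so the bisection bracket starting at x = 0 is valid.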
Please refer to Lemma 11 of Appendix B for the detailed calculations.

The goal of this section is to prove the following theorem.

Theorem 1. Let R^LSA(T) be the aggregate regret incurred by Algorithm 1. When 0 < α ≤ 8 and T ≥ max{40/α + 1, 40} · K, we have

    R^LSA(T) ≤ Υ(α) · Σ_{i∈S} exp(−λ_i Δ_i^2 / 10),    (12)

where Υ(α) = 9.3 · 8^{α√2} / (8^{α√2} − 1) · exp((2.1α − ln α − 0.5)/(4α)) is a constant that only depends on the universal tuning parameter α.

Remark 2. If we set α = 1/20, then the right-hand side of (12) would be at most O(Σ_{i∈S} exp(−λ_i Δ_i^2 / 10)). One can verify that

    Σ_{i∈S} exp(−λ_i Δ_i^2 / 10) ≤ O(P_{1/10}({max{Δ_i, exp(−Λ)}}_{i∈S}, T/801)) = O(P_{16}({max{Δ_i, exp(−Λ)}}_{i∈S}, T/128160)) ≤ O(P_{16}({Δ_i}_{i∈S}, T/128160)),

where the first inequality is due to Lemma 12 of Appendix B and the equality is because of Lemma 13 of Appendix B. This matches the lower bound stated in Theorem 20 up to constant factors.¹

The rest of this section is devoted to the proof of Theorem 1. Before proceeding, we note that the analysis of the APT algorithm (Locatelli et al., 2016) crucially depends on a favorable event stating that the empirical mean of any arm at any time does not deviate too much from the true mean. This requires a union bound that introduces extra factors such as ln K and ln ln T. Our analysis adopts a novel approach that does not need a union bound over all arms, and hence avoids the extra ln K factor. 
In the second step of our analysis, we introduce the new Variable Confidence Level Bound to remove the extra doubly logarithmic term in T.

We now dive into the details of the proof. Let B def= {i ∈ S : ln Δ_i^{−1} < Λ}. Intuitively, B contains the arms that can be well classified by the ideal algorithm O (described in Section 3), while even the ideal algorithm O suffers Ω(1) regret for each arm in S \ B. In light of this, the key of the proof is to upper bound the regret incurred by the arms in B.

Let R^LSA_B(T) denote the regret incurred by arms in B. Note that Υ(α) · exp(−λ_i Δ_i^2 / 10) ≥ 1 for every arm i ∈ S \ B, and the regret incurred by each arm is at most 1. Therefore, to establish (12), we only need to show that

    R^LSA_B(T) ≤ Υ(α) · Σ_{i∈B} exp(−λ_i Δ_i^2 / 10).    (13)

We set up a few notations to facilitate the proof of (13). We define ξ_i(t) def= α T_i(t) (Δ̂_i(t))^2 + 0.5 ln T_i(t) to be the expression inside the argmin(·) operator in Line 4 of the algorithm, for arm i and at time t. We also define ξ_{i,t} def= α t (Δ̂_{i,t})^2 + 0.5 ln t.

Intuitively, when ξ_i(t) is large, we usually have a larger value of T_i(t), and arm i is better explored. Therefore, ξ_i(t) can be used as a measure of how well arm i is explored, which directly relates to the probability of mis-classifying the arm. We say that arm i is C-well explored at time T if there exists T′ ≤ T such that ξ_i(T′) > C. For any C > 0, we also define the event F_C to be

    F_C def= {∃T′ ≤ T : ∀i ∈ S, ξ_i(T′) > C}.    (14)

When F_C happens, we know that all arms are C-well explored.

At a high level, the proof of (13) goes in two steps. First, we show that for C almost as large as Λ, F_C happens with high probability, which means that every arm is C-well explored. Second, we quantitatively relate being C-well explored to the probability of mis-classifying each arm, which can be used to further deduce a regret upper bound given the event F_C.

We start by revealing more details about the first step. The following Lemma 3 gives a lower bound on the probability of the event F_C.

Lemma 3. P(F_{Λ−k}) ≥ 1 − exp(−40k/α) for 0 ≤ k < Λ.

We now introduce the high-level ideas for proving Lemma 3 and defer the formal proofs to Appendix D.2. 

¹While the constants may seem large, we emphasize that i) we make no effort to optimize the constants in asymptotic bounds, ii) most of the constants come from the lower bound, while the constant factor in our upper bound is 10, and iii) we believe that the actual constant of our algorithm is quite small, as the experimental evaluation in a later section demonstrates that our algorithm performs very well in practice.
For any arm i ∈ S and C > 0, let τ_{i,C} be the random variable representing the smallest positive integer such that ξ_{i,τ_{i,C}} > C (i.e., ξ_{i,t} ≤ C for all 1 ≤ t < τ_{i,C}). Intuitively, τ_{i,C} denotes the first time arm i becomes C-well explored. We first show that the distribution of τ_{i,C} has an exponential tail; hence the sum of these variables, with the same C, also has an exponential tail. Next, we show that with high probability Σ_{i=1}^K τ_{i,Λ−k} ≤ T, where the failure probability vanishes exponentially as k increases. In the last step, thanks to the design of the algorithm, we are able to argue that Σ_{i=1}^K τ_{i,Λ−k} ≤ T implies F_{Λ−k}.

We now proceed to the second step of the proof of (13). The following lemma (whose proof is deferred to Appendix D.3) gives an upper bound on the regret incurred by arms in B conditioned on F_C.

Lemma 4. If k ≥ 0.1α, then conditioned on F_{Λ−k},

    R^LSA_B(T) ≤ 9 · 8^{α√2} / (8^{α√2} − 1) · Σ_{i∈B} exp(−λ_i Δ_i^2 / 10 + (k + α − ln α − 0.5)/(4α)).

As mentioned before, the key to proving Lemma 4 is to pin down the quantitative relation between the event F_C and the probability of mis-classifying an arm conditioned on F_C; the expected regret upper bound can then be obtained by summing up the mis-classification probability over all arms in B.

A key technical challenge in our analysis is to design a concentration bound for the empirical mean of an arm (namely arm i) that holds uniformly over all time periods. A typical method is to let the length of the confidence band scale linearly with √(1/t), where t is the number of samples made for the arm. However, this would worsen the failure probability and lead to an extra ln ln T factor in the regret upper bound. To remove the iterated logarithmic factor, we introduce a novel uniform concentration bound in which the ratio between the length of the confidence band and √(1/t) is almost constant for large t, but becomes larger for smaller t. Since this ratio is related to the confidence level of the corresponding confidence band, we refer to this new concentration inequality as the Variable Confidence Level Bound. More specifically, in Appendix D.3.1, we prove the following lemma.

Lemma 19 (Variable Confidence Level Bound, pre-stated). Let X_1, . . . , X_L be i.i.d. random variables supported on [0, 1] with mean μ. For any a > 0 and b > 0, it holds that

    P( ∀t ∈ [1, L], |(1/t) Σ_{i=1}^t X_i − μ| ≤ √((a + b ln(L/t))/t) ) ≥ 1 − (2^{b/2+2} / (2^{b/2} − 1)) · exp(−a/2).

This new inequality greatly helps the analysis of our algorithm, where the intuition is that, conditioned on the event F_C, it is much less likely that only a few samples are made for arm i, and therefore we can afford a less accurate (i.e., wider) confidence band for its mean value.

It is notable that a similar idea is also adopted in the analysis of the MOSS algorithm (Audibert and Bubeck, 2009), which gives a minimax optimal regret bound for ordinary multi-armed bandits. However, our Variable Confidence Level Bound is more general. It can replace the usage of Hoeffding's Maximal Inequality in the analysis of MOSS and may find other applications.

Figure 1: Average aggregate regret on a logarithmic scale for different settings.

We additionally remark that in Hoeffding's celebrated Maximal Inequality, the confidence level also changes with time. 
However, the blow-up factor applied to the confidence level in our inequality is only the logarithm of that of Hoeffding's Maximal Inequality. Therefore, if constant factors are ignored, our inequality strictly improves Hoeffding's Maximal Inequality.

The formal proof of Theorem 1 involves a few technical tricks to combine Lemma 3 and Lemma 4 to deduce the final regret bound, and is deferred to Appendix D.1. The lower bound theorem (Theorem 20) that complements Theorem 1 is deferred to Appendix E due to space constraints.

5 Experiments

In our experiments, we assume that each arm follows an independent Bernoulli distribution with its own mean. To guarantee a fair comparison, we vary the total number of samples T and compare the empirical average aggregate regret on a logarithmic scale, averaged over 5000 independent runs. We consider three different choices of {θ_i}_{i∈S}:

1. (arithmetic progression I). K = 10; θ_{1:4} = 0.2 + (0:3) · 0.05, θ_5 = 0.45, θ_6 = 0.55, and θ_{7:10} = 0.65 + (0:3) · 0.05 (see Setup 1 in Figure 1).
2. (arithmetic progression II). K = 20; θ_{1:20} = 0.405 + (i − 1)/100 (see Setup 2 in Figure 1).
3. (two-group setting). K = 10; θ_{1:5} = 0.45 and θ_{6:10} = 0.505 (see Setup 3 in Figure 1).

In our experiments, we fix θ = 0.5. We notice that the choice of α in our LSA is quite robust (see Appendix F.4 for experimental results). To illustrate the performance, we fix α = 1.35 in LSA and compare it with four existing algorithms for the TBP problem under a variety of settings. 
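For concreteness, the three mean configurations can be written out explicitly; the sketch below is our own, under the reading that "(0:3)" denotes the offsets 0, 1, 2, 3:

```python
# Our own sketch of the three experimental mean configurations, reading
# "0.2 + (0:3) * 0.05" as the offsets 0, 1, 2, 3 (our interpretation).
theta_setup1 = ([0.2 + j * 0.05 for j in range(4)] + [0.45, 0.55]
                + [0.65 + j * 0.05 for j in range(4)])
theta_setup2 = [0.405 + (i - 1) / 100 for i in range(1, 21)]
theta_setup3 = [0.45] * 5 + [0.505] * 5

theta = 0.5
# Number of arms at or above the threshold in each setup: 5, 10, and 5.
above = [sum(m >= theta for m in th)
         for th in (theta_setup1, theta_setup2, theta_setup3)]
```

Setups 2 and 3 place many means within 0.005 to 0.095 of the threshold, which is what makes them substantially harder than Setup 1.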
Now we discuss these algorithms and their parameter settings in more detail.

• Uniform: Given the budget T, this method pulls each arm sequentially from 1 to K until the budget T is reached, so that each arm is sampled roughly T/K times. Then it outputs θi ≥ θ when the empirical estimate satisfies θ̂i ≥ θ.

• APT(ε): Introduced and analyzed in (Locatelli et al., 2016), this algorithm aims to output a set of arms ({i ∈ S : μ̂i ≥ θ}) serving as an estimate of the set of arms with means over θ + ε. The natural adaptation of the APT algorithm to our problem corresponds to changing the output: it outputs θi ≥ θ if θ̂i ≥ θ and θi < θ otherwise. In the experiments, we test the following choices of ε: 0, 0.025, 0.05, and 0.1.

• UCBE(b): Introduced and analyzed in (Audibert and Bubeck, 2010), this algorithm aims to identify the best arm (the arm with the largest mean reward). A natural adaptation of this algorithm to TBP is, at each time t, to pull argmin_{i∈S} (Δ̂i − √(a/Ti(t − 1))), where a is a tuning parameter. In (Audibert and Bubeck, 2010), it has been proved optimal when a = (25/36) · (T − K)/H, where H = Σ_{i∈S} 1/Δi². Here we set a = 4^b · (T − K)/H and test three different choices of b: −1, 0, and 4.

• Opt-KG(a, b): Introduced in (Chen et al., 2015), this algorithm also aims to minimize the aggregate regret. It models TBP as a Bayesian Markov decision process where {θi}i∈S is assumed to be drawn from a known Beta prior Beta(a, b). Here we choose two different priors: Beta(1, 1) (the uniform prior) and Beta(0.5, 0.5) (the Jeffreys prior).

[Figure 1: three panels (Setups 1–3) plotting ln(regret/K) against the budget T for APT, UCBE, Opt-KG, Uniform, and LSA.]

Comparisons. In Setup 1, which is a relatively easy setting, LSA works best among all choices of the budget T. With the right choice of parameter, APT and Opt-KG also achieve satisfactory performance. Though the performance gaps appear to be small, two-tailed paired t-tests of the aggregate regrets indicate that LSA is significantly better than most of the other methods, except APT(.05) and APT(.025) (see Table 1 in Appendix F.2).

In Setups 2 and 3, where ambiguous arms close to the threshold θ are present, the performance difference between LSA and the other methods is more noticeable. LSA consistently outperforms the other methods in both settings over almost all choices of the budget T with statistical significance. It is worth noting that, though APT also works reasonably well in Setup 2 when T is small, the best parameter ε there differs from that for larger T and for the other setups. On the other hand, the parameters chosen in LSA are fixed across all setups, indicating that our algorithm is more robust.

We perform additional experiments that, due to space limitations, are included in Appendix F.3. In all setups, LSA outperforms its competitors with various parameter choices.

6 Conclusion

In this paper we introduce an algorithm that minimizes the aggregate regret for the thresholding bandit problem.
Our algorithm LSA makes use of a novel approach inspired by the optimal allocation scheme of the budget when the reward gaps are known ahead of time. When compared to APT, LSA uses an additional term, similar in spirit to the UCB-type algorithms though mathematically different, that encourages the exploration of arms that have bigger gaps and/or that have not been sufficiently explored. Moreover, LSA is anytime and robust, while the precision parameter ε needed in the APT algorithm is highly sensitive and hard to choose. Besides showing empirically that LSA performs better than APT for different values of ε and other algorithms in a variety of settings, we also employ novel proof ideas that eliminate the logarithmic terms usually brought in by the straightforward union bound argument, design the new Variable Confidence Level Bound that strictly improves Hoeffding's celebrated Maximal Inequality, and prove that LSA achieves instance-wise asymptotically optimal aggregate regret.

References

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), pages 41–53, 2010.

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory (ALT), pages 23–37, 2009.

Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems (NIPS), pages 3212–3220, 2012.

Zohar Shay Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits.
In International Conference on Machine Learning (ICML), pages 1238–1246, 2013.

Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory (COLT), pages 423–439, 2014.

Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory (COLT), pages 998–1027, 2016.

Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory (COLT), pages 535–592, 2017.

Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In International Conference on Machine Learning (ICML), pages 1690–1698, 2016.

Ingo Steinwart, Don Hush, and Clint Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6(Feb):211–232, 2005.

Xi Chen, Qihang Lin, and Dengyong Zhou. Statistical decision making for optimal budget allocation in crowd labeling. Journal of Machine Learning Research, 16(1):1–46, 2015.

Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory (COLT), pages 590–604, 2016.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Rajeev Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory (COLT), 2009.

Jean-Yves Audibert and Sébastien Bubeck.
Best arm identification in multi-armed bandits. In Conference on Learning Theory (COLT), 2010.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

Svante Janson. Tail bounds for sums of geometric and exponential variables. Statistics & Probability Letters, 135:1–6, 2018.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.

Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Preprint, 2018.

Jiří Matoušek and Bernd Gärtner. Understanding and Using Linear Programming. Universitext. Springer, 2006.