{"title": "Pure Exploration with Multiple Correct Answers", "book": "Advances in Neural Information Processing Systems", "page_first": 14591, "page_last": 14600, "abstract": "We determine the sample complexity of pure exploration bandit problems with multiple good answers. We derive a lower bound using a new game equilibrium argument. We show how continuity and convexity properties of single-answer problems ensure that the existing Track-and-Stop algorithm has asymptotically optimal sample complexity. However, that convexity is lost when going to the multiple-answer setting. We present a new algorithm which extends Track-and-Stop to the multiple-answer case and has asymptotic sample complexity matching the lower bound.", "full_text": "Pure Exploration with Multiple Correct Answers\n\nR\u00e9my Degenne\n\nCentrum Wiskunde & Informatica\nScience Park 123, Amsterdam, NL\n\nremy.degenne@cwi.nl\n\nWouter M. Koolen\n\nCentrum Wiskunde & Informatica\nScience Park 123, Amsterdam, NL\n\nwmkoolen@cwi.nl\n\nAbstract\n\nWe determine the sample complexity of pure exploration bandit problems with\nmultiple good answers. We derive a lower bound using a new game equilibrium\nargument. We show how continuity and convexity properties of single-answer\nproblems ensure that the existing Track-and-Stop algorithm has asymptotically\noptimal sample complexity. However, that convexity is lost when going to the\nmultiple-answer setting. We present a new algorithm which extends Track-and-\nStop to the multiple-answer case and has asymptotic sample complexity matching\nthe lower bound.\n\n1\n\nIntroduction\n\nIn pure exploration aka active testing problems the learning system interacts with its environment\nby sequentially performing experiments to quickly and reliably identify the answer to a particular\npre-speci\ufb01ed question. Practical applications range from simple queries for cost-constrained physical\nregimes, i.e. 
clinical drug testing, to complex queries in structured environments bottlenecked\nby computation, i.e. simulation-based planning. The theory of pure exploration is studied in the\nmulti-armed bandit framework. The scienti\ufb01c challenge is to develop tools for characterising the\nsample complexity of new pure exploration problems, and methodologies for developing (matching)\nalgorithms. With the aim of understanding power and limits of existing methodology, we study an\nextended problem formulation where each instance may have multiple correct answers. We \ufb01nd\nthat multiple-answer problems present a phase transition in complexity, and require a change in our\nthinking about algorithms.\nThe existing methodology for developing asymptotically instance-optimal algorithms, Track-and-\nStop by Garivier and Kaufmann [2016], exploits the so-called oracle weights. These probability\ndistributions on arms naturally arise in sample complexity lower bounds, and dictate the optimal\nsampling proportions for an \u201coracle\u201d algorithm that needs to be sample ef\ufb01cient only on exactly the\ncurrent problem instance. The main idea is to track the oracle weights computed at a converging\nestimate of the instance. The analysis of Track-and-Stop requires continuity of the oracle weights as a\nfunction of the bandit model. For the core Best Arm Identi\ufb01cation problem, discontinuity only occurs\nat degenerate instances where the sample complexity explodes. So this assumption seemed harmless.\n\nOur contribution We show that the oracle weights may be non-unique, even for single-answer\nproblems, and hence need to be regarded as a set-valued mapping. We show this mapping is always\n(upper hemi-)continuous. We show that each instance maps to a convex set for single-answer\nproblems, and this allows us to extend the Track-and-Stop methodology to all such problems. 
At instances with non-singleton set-valued oracle weights more care is needed: of the two classical tracking schemes, "C" converges to the convex set, while "D" may fail entirely.
We show that for multiple-answer problems convexity is violated. There are instances where two distinct oracle weights are optimal, while no mixture is. Unmodified tracking converges in law (experimentally) to a distribution on the full convex hull, and suffers as a result. We propose a "sticky" modification to stabilise the approach, and show that it then converges to the corners only. We prove that Sticky Track-and-Stop is asymptotically optimal.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Related work  Multi-armed bandits have been the subject of intense study in their role as a model for medical testing and reinforcement learning. For the objective of reward maximisation [Berry and Fristedt, 1985, Lai and Robbins, 1985, Auer et al., 2002, Bubeck and Cesa-Bianchi, 2012] the main challenge is balancing exploration and exploitation. The field of pure exploration (active testing) focuses on generalisation vs sample complexity, in the fixed confidence, fixed budget and simple regret scalarisations. It took off in machine learning with the multiple-answer problem of (ε, δ)-Best Arm Identification (BAI) [Even-Dar et al., 2002]. Early results focused on worst-case sample complexity guarantees in sub-Gaussian bandits. 
Successful approaches include Upper and Lower con\ufb01dence\nbounds [Bubeck et al., 2011, Kalyanakrishnan et al., 2012, Gabillon et al., 2012, Kaufmann and\nKalyanakrishnan, 2013, Jamieson et al., 2014], Racing or Successive Rejects/Eliminations [Maron\nand Moore, 1997, Even-Dar et al., 2006, Audibert et al., 2010, Kaufmann and Kalyanakrishnan, 2013,\nKarnin et al., 2013].\nFundamental information-theoretic barriers [Castro, 2014, Kaufmann et al., 2016, Garivier and\nKaufmann, 2016] for each speci\ufb01c problem instance re\ufb01ned the worst-case picture, and sparked\nthe development of instance-optimal algorithms for single-answer problems based on Track-and-\nStop [Garivier and Kaufmann, 2016] and Thompson Sampling [Russo, 2016]. For multiple-answer\nproblems the elegant KL-contraction-based lower bound is not sharp, and new techniques were\ndeveloped by Garivier and Kaufmann [2019].\nRecent years also saw a surge of interest in pure exploration with complex queries and structured\nenvironments. Kalyanakrishnan and Stone [2010] identify the top-M set, Locatelli et al. [2016]\nthe arm closest to a threshold, and Chen et al. [2014], Gabillon et al. [2016] the optimiser over an\narbitrary combinatorial class. For arms arranged in a matrix Katariya et al. [2017] study BAI under a\nrank-one assumption, while Zhou et al. [2017] seek to identify a Nash equilibrium. For arms arranged\nin a minimax tree there is signi\ufb01cant interest in \ufb01nding the optimal move at the root Teraoka et al.\n[2014], Garivier et al. [2016], Huang et al. [2017], Kaufmann and Koolen [2017], Kaufmann et al.\n[2018], as a theoretical model for studying Monte Carlo Tree search (which is a planning sub-module\nof many advanced reinforcement learning systems).\n\n2 Notations\n\nWe work in a given one-parameter one-dimensional canonical exponential family, with mean pa-\nrameter in an open interval O \u2286 R. 
We denote by d(μ, λ) the KL divergence from the distribution with mean μ to that with mean λ. A K-armed bandit model is identified by its vector μ ∈ O^K of mean parameters. We denote by M ⊆ O^K the set of possible mean parameters in a specific bandit problem. Excluding parts of O^K from M may be used to encode a known structure of the problem. We assume that there is a finite domain I of answers, and that the correct answer for each bandit model is specified by a set-valued function i∗ : M → 2^I.
Example 1. As our running example, we will use the Any Low Arm multiple-answer problem. Given a threshold γ ∈ O, the goal is to return the index k of any arm with μk < γ if such an arm exists, or to return "no" otherwise. Formally, we have possible answers I = [K] ∪ {no}, and correct answers

    i∗(μ) = {k | μk ≤ γ}   if mink μk < γ ,
    i∗(μ) = {no}           if mink μk > γ .

We exclude the case mink μk = γ from M (as such instances require infinite sample complexity). Additional examples of multiple-answer identification problems are visualised in Table 1 in Appendix B.
Algorithms. A learning strategy is parametrised by a stopping rule τδ ∈ N depending on a parameter δ ∈ [0, 1], a sampling rule A1, A2, . . . ∈ [K], and a recommendation rule ı̂ ∈ I. When a learning strategy meets a bandit model μ, they interactively generate a history A1, X1, . . . , Aτ, Xτ, ı̂, where Xt ∼ μAt. We allow the possibility of non-termination τδ = ∞, in which case there is no recommendation ı̂. At stage t ∈ N, we denote by Nt = (Nt,1, . . . , Nt,K) the number of samples (or "pulls") of the arms, and by μ̂t the vector of empirical means of the samples of each arm.
Confidence and sample complexity. For confidence parameter δ ∈ (0, 1), we say that a strategy is δ-correct (or δ-PAC) for bandit model μ if it recommends a correct answer with high probability, i.e. Pμ(τδ < ∞ and ı̂ ∈ i∗(μ)) ≥ 1 − δ. We call a given strategy δ-correct if it is δ-correct for every μ ∈ M. We measure the statistical efficiency of a strategy on a bandit model μ by its sample complexity Eμ[τδ]. We are interested in δ-correct algorithms minimizing sample complexity.
Divergences. For any answer i ∈ I, we define the alternative to i, denoted ¬i, to be the set of bandit models on which answer i is incorrect, i.e.

    ¬i := {μ ∈ M | i ∉ i∗(μ)} .

We define the (w-weighted) divergence from μ ∈ M to λ ∈ M or Λ ⊆ M by

    D(w, μ, λ) = Σk wk d(μk, λk) ,       D(w, μ, Λ) = inf_{λ∈Λ} D(w, μ, λ) ,
    D(μ, Λ) = sup_{w∈△K} D(w, μ, Λ) ,    D(μ) = max_{i∈I} D(μ, ¬i) .

Note that D(w, μ, Λ) = 0 whenever μ ∈ Λ. We denote by iF(μ) the argmax (set of maximisers) of i ↦ D(μ, ¬i), and we call iF(μ) the oracle answer(s) at μ. We write w∗(μ, ¬i) for the maximisers of w ↦ D(w, μ, ¬i), and call these the oracle weights for answer i at μ. 
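These divergence quantities are concrete to compute for the running Any Low Arm example. The sketch below is illustrative only: it assumes unit-variance Gaussian arms, so that d(μ, λ) = (μ − λ)²/2, and uses the closed forms derived in the continuation of Example 1 below.

```python
def d(mu, lam):
    # KL divergence between unit-variance Gaussian arms (illustrative assumption)
    return (mu - lam) ** 2 / 2

def D_weighted(w, mu, k, gamma):
    # D(w, mu, ¬k) for a low arm k: the closest model on which k is not low
    # raises arm k to the threshold gamma, at cost w_k * d(mu_k, gamma)
    return w[k] * d(mu[k], gamma)

def D_no(w, mu, gamma):
    # D(w, mu, ¬no): the closest model on which some arm is low
    # lowers the cheapest arm to gamma
    return min(w[k] * d(mu[k], gamma) for k in range(len(mu)))

def oracle_weights_no(mu, gamma):
    # sup_w min_k w_k d(mu_k, gamma) equalises the products, so w_k ∝ 1/d(mu_k, gamma)
    inv = [1.0 / d(m, gamma) for m in mu]
    return [v / sum(inv) for v in inv]
```

For instance, with μ = (1, 2) and γ = 0 the equalising weights are (0.8, 0.2), and D(μ, ¬no) = 1/(1/d(μ1, γ) + 1/d(μ2, γ)) = 0.4, the harmonic-mean form appearing in Example 1.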
We write w∗(μ) = ∪_{i∈iF(μ)} w∗(μ, ¬i) for the set of oracle weights among all oracle answers. We include expressions for the divergence when i∗ is generated by half-spaces, minima and spheres in Appendix H.
Example 1 (Continued). Consider an Any Low Arm instance μ with mink μk < γ. For any arm i ∈ i∗(μ), we have D(w, μ, ¬i) = wi d(μi, γ) and D(μ, ¬i) = d(μi, γ). Hence D(μ) = d(mini μi, γ), and iF(μ) = argmini μi. On the other hand, when mink μk > γ, we have i∗(μ) = {no}, so that D(w, μ, ¬no) = mink wk d(μk, γ) and

    D(μ, ¬no) = D(μ) = 1 / ( Σ_{k=1}^K 1/d(μk, γ) ) .

The function iF(μ) = {i ∈ I : D(μ, ¬i) = D(μ)} is set-valued, as is w∗. They are singletons with continuous value over some connected subsets of M, and are multi-valued on common boundaries of two or more such sets. Both iF and w∗ can be thought of as having single values, unless μ sits on such a boundary, in which case we will prove that they are equal to the union (or convex hull of the union) of their values in the neighbouring regions.

3 Lower Bound

We show a lower bound for any algorithm for multiple-answer problems. Our lower bound extends the single-answer result of Garivier and Kaufmann [2016]. We are further inspired by Garivier and Kaufmann [2019], who analyse the ε-BAI problem. They prove lower bounds for algorithms with a sampling rule independent of δ, imposing the further restriction that either K = 2 or that the algorithm ensures that Nt,k/t converges as t → ∞. The new ingredient in this section is a game-theoretic equilibrium argument, which allows us to analyse any δ-correct algorithm in any multiple-answer problem. Our main lower bound is the following.
Theorem 1. 
Any δ-correct algorithm verifies

    liminf_{δ→0} Eμ[τδ] / log(1/δ) ≥ T∗(μ) := D(μ)^{−1}  where  D(μ) = max_{i∈i∗(μ)} max_{w∈△K} inf_{λ∈¬i} Σ_{k=1}^K wk d(μk, λk) ,

for any multiple-answer instance μ with sub-Gaussian arm distributions.

The proof is in Appendix C, where we also discuss how the convenient sub-Gaussian assumption can be relaxed. We would like to point out one salient feature here. To show sample complexity lower bounds at μ, one needs to find problems that are hard to distinguish from it statistically, yet require a different answer. We obtain these problems by means of a minimax result.
Lemma 2. For any answer i ∈ I, the divergence from μ to ¬i equals

    D(μ, ¬i) = inf_P max_{k∈[K]} E_{λ∼P}[d(μk, λk)] ,

where the infimum ranges over probability distributions on ¬i supported on (at most) K points.

The proof of Theorem 1 then challenges any algorithm for μ by obtaining a witness P for D(μ) = maxi D(μ, ¬i) from Lemma 2, sampling a model λ ∼ P, and showing that if the algorithm stops early, it must make a mistake w.h.p. on at least one model from the support. The equilibrium property of P allows us to control a certain likelihood-ratio martingale regardless of the sampling strategy employed by the algorithm.
We discuss the novel aspect of Theorem 1 and its lessons for the design of optimal algorithms. First of all, for single-answer instances |i∗(μ)| = 1 we recover the known asymptotic lower bound [Garivier and Kaufmann, 2016, Remark 2]. For multiple-answer instances the bound says the following. The optimal sample complexity hinges on the oracle answers iF(μ). 
That is, for an oracle answer i_f ∈ iF(μ), the complexity of problem μ is governed by the difficulty of discriminating μ from the set of models on which answer i_f is incorrect.
Is the bound tight? We argue yes. Consider the following oracle strategy, which is specifically designed to be very good at μ. First, it computes a pair (i, w) witnessing the two outer maxima in D(μ). The algorithm samples according to w. It stops when it can statistically discriminate μ̂t from ¬i, and outputs ı̂ = i. This algorithm will indeed have expected sample complexity equal to D(μ)^{−1} at μ, and it will be correct.
The above oracle viewpoint presents an idea for designing algorithms, following Garivier and Kaufmann [2016] and Chen et al. [2017]. Perform a lower-order amount of forced exploration of all arms to ensure μ̂t → μ. Then at each time point compute the empirical mean vector μ̂t and oracle weights wt ∈ w∗(μ̂t). Then sample according to wt. This approach is successful for single-answer bandits with unique and continuous oracle weights. We argue in Section 4.3 below that it extends to points of discontinuity by exploiting upper hemicontinuity and convexity of w∗.
For multiple-answer bandits, we argue that the set of maximisers w∗(μ) is no longer convex when iF(μ) is not a singleton. It can then happen that μ̂t → μ, while at the same time w∗(μ̂t) keeps oscillating. If the algorithm tracks w∗(μ̂t), its sampling proportions will end up in the convex hull of w∗(μ). However, as w∗(μ) is not convex itself, these proportions will not be optimal. We present empirical evidence for that effect in Appendix D. The lesson here is that the oracle needs to pick an answer and "stick with it". 
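The oscillation phenomenon can be reproduced in a few lines. The following is a hypothetical simulation (it is not the experiment of Appendix D): two Gaussian arms tied below the threshold, oracle weights recomputed at a noisy estimate each round, and their running average drifting to the midpoint of the two corners, whose performance is only half of D(μ).

```python
import random

def d(mu, lam):
    return (mu - lam) ** 2 / 2  # unit-variance Gaussian KL (illustrative assumption)

def oracle_w(mu_hat):
    # Any Low Arm with both arms below gamma: the oracle puts all weight
    # on the (empirically) lowest arm, w* = e_argmin
    k = min(range(len(mu_hat)), key=lambda i: mu_hat[i])
    w = [0.0] * len(mu_hat)
    w[k] = 1.0
    return w

random.seed(0)
mu, gamma = [0.0, 0.0], 1.0   # tied instance: both corners (1,0) and (0,1) are optimal
T = 10000
avg = [0.0, 0.0]
for t in range(T):
    mu_hat = [m + random.gauss(0, 1) for m in mu]  # estimate oscillates around the tie
    w = oracle_w(mu_hat)
    avg = [a + x / T for a, x in zip(avg, w)]

# The tracked average lands near the midpoint of the two corners, whose
# performance is only half of the optimal D(mu) = d(0, gamma):
D_avg = max(avg[0] * d(mu[0], gamma), avg[1] * d(mu[1], gamma))
```

Here `avg` ends near (1/2, 1/2), so `D_avg` is close to d(0, γ)/2 rather than d(0, γ).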
This will be the basis of our algorithm design in Section 5.\n\n4 Properties of the Optimal Allocation Sets\n\nThe Track-and-Stop sampling strategy aims at ensuring that the sampling proportions converge to\noracle weights. In the case of a singleton-valued oracle weights set w\u2217(\u00b5) for single answer problems,\nthat convergence was proven in [Garivier and Kaufmann, 2016]. We study properties of that set\nwith the double aim of extending Track-and-Stop to points \u00b5 where w\u2217(\u00b5) is not a singleton and of\nhighlighting what properties hold only for the single-answer case, but not in general.\n\n4.1 Continuity\nWe \ufb01rst prove continuity properties of D and w\u2217. We show how the convergence of \u02c6\u00b5t to \u00b5 translates\ninto properties of the divergences from \u02c6\u00b5t to the alternative sets.\nFor a set B, let S(B) = 2B \\ {\u2205} be the set of all non-empty subsets of B.\nDe\ufb01nition 3 (Upper hemicontinuity). A set-valued function \u0393 : A \u2192 S(B) is upper hemicontinuous\nat a \u2208 A if for any open neighbourhood V of \u0393(a) there exists a neighbourhood U of a such that for\nall x \u2208 U, \u0393(x) is a subset of V .\nTheorem 4. For all i \u2208 I,\n\n1. the function (w, \u00b5) (cid:55)\u2192 D(w, \u00b5,\u00aci) is continuous on (cid:52)K \u00d7 M,\n2. \u00b5 (cid:55)\u2192 D(\u00b5,\u00aci) and \u00b5 (cid:55)\u2192 D(\u00b5) are continuous on M,\n3. \u00b5 (cid:55)\u2192 w\u2217(\u00b5,\u00aci), \u00b5 (cid:55)\u2192 w\u2217(\u00b5) and \u00b5 (cid:55)\u2192 iF (\u00b5) are upper hemicontinuous on M with\n\nnon-empty and compact values,\n\nThe proof is in Appendix F. It uses Berge\u2019s maximum theorem and a modi\ufb01cation thereof due to\n[Feinberg et al., 2014]. Related continuity results using this type of arguments, but restricted to\nsingle-valued functions, appeared for the regret minimization problem in [Combes et al., 2017].\n\n4\n\n\f4.2 Convexity\n\nNext we establish convexity.\nProposition 5. 
For each i ∈ I, for all μ ∈ M the set w∗(μ, ¬i) is convex.
This is a consequence of the concavity of w ↦ D(w, μ, ¬i) (which is an infimum of linear functions). In single-answer problems, we obtain that the oracle weights set w∗(μ) is convex everywhere. This is however not the case in general for multiple-answer problems, as illustrated by the next example.
Example 1 (Continued). Consider a K = 2-arm Any Low Arm instance μ with μ1 < γ and μ2 < γ, so that both answers 1 and 2 are correct. Recall that D(μ) = maxk=1,2 d(μk, γ). Now for μ1 < μ2 < γ, w∗(μ) = {(1, 0)}, and symmetrically for μ2 < μ1 < γ, w∗(μ) = {(0, 1)}. However, for μ1 = μ2 < γ, w∗(μ) = {(1, 0), (0, 1)}, which is not convex. Playing intermediate weights w = (α, 1 − α) results in strictly sub-optimal D(w, μ) = max{α, 1 − α} d(μ1, γ) < d(μ1, γ) = D(μ).
This example also illustrates the upper hemicontinuity of w∗(μ): since μ of the form (μ, μ) is the limit of a sequence (μt)t∈N with μt,1 < μt,2, we obtain that {(1, 0)} ⊆ w∗(μ). Similarly, using a sequence with μt,1 > μt,2, {(0, 1)} ⊆ w∗(μ).
The example scales up to K arms, and shows that the sample complexity guarantee for vanilla Track-and-Stop (Theorem 9) may exceed by a factor K the optimal complexity, which is matched by our new method (Theorem 11).

4.3 Consequences for Track-and-Stop
The original analysis of Track-and-Stop excludes the mean vectors μ ∈ M for which w∗(μ) is not a singleton. 
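The non-convexity in the continued example above is easy to verify numerically; a minimal sketch, again assuming unit-variance Gaussian arms so that d(μ, λ) = (μ − λ)²/2:

```python
def d(mu, lam):
    return (mu - lam) ** 2 / 2  # unit-variance Gaussian KL (illustrative assumption)

def D_of_w(w, mu, gamma):
    # With both arms below gamma, answer i is certified at rate w_i * d(mu_i, gamma);
    # the best answer for allocation w attains the larger of the two rates.
    return max(w[0] * d(mu[0], gamma), w[1] * d(mu[1], gamma))

mu, gamma = [0.0, 0.0], 1.0
D_star = d(mu[0], gamma)  # value attained at the optimal corners (1, 0) and (0, 1)
# every strict mixture of the two optimal corners is strictly suboptimal:
gaps = [D_star - D_of_w([a, 1 - a], mu, gamma) for a in (0.25, 0.5, 0.75)]
```

Every entry of `gaps` is strictly positive, confirming that no mixture of the two optimal corners is optimal.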
We show that the upper hemicontinuity and convexity properties of w∗(μ) allow us to extend that analysis to all μ with a single oracle answer (in particular all single-answer bandit problems), at least for one of the two Track-and-Stop variants. Indeed, that algorithm was introduced with two possible subroutines, dubbed C-tracking and D-tracking [Garivier and Kaufmann, 2016]. Both variants compute oracle weights wt at the point μ̂t, but the arm pulled differs.
C-tracking: compute the projection wt^{εt} of wt on △K^{εt} = {w ∈ △K : ∀k ∈ [K], wk ≥ εt}, where εt > 0. Pull the arm with index kt = argmin_{k∈[K]} Nt,k − Σ_{s=1}^t ws,k^{εs} .
D-tracking: if there is an arm j with Nt,j ≤ √t − K/2, then pull kt = j. Otherwise, pull the arm kt = argmin_{k∈[K]} Nt,k − t wt,k .
The proof of the optimal sample complexity of Track-and-Stop for C-tracking remains essentially unchanged, but we replace Proposition 9 of [Garivier and Kaufmann, 2016] by the following lemma, proved in Appendix G.3.
Lemma 6. Let a sequence (μ̂t)t∈N verify lim_{t→+∞} μ̂t = μ. For all t ≥ 0, let wt ∈ w∗(μ̂t) be arbitrary oracle weights for μ̂t. If w∗(μ) is convex, then

    lim_{t→+∞} inf_{w∈w∗(μ)} ‖ (1/t) Σ_{s=1}^t ws − w ‖∞ = 0 .

The average of oracle weights for μ̂t converges to the set of oracle weights for μ. C-tracking then ensures that the proportion of pulls Nt/t is close to that average by Lemma 7 of [Garivier and Kaufmann, 2016], hence Nt/t gets close to oracle weights.
Theorem 7. 
For all μ ∈ M such that iF(μ) is a singleton (in particular all single-answer problems), Track-and-Stop with C-tracking is δ-correct with asymptotically optimal sample complexity.

Proof in Appendix G.6. We encourage the reader to first proceed to Section 5, since the proof considers the result as a special case of the multiple-answers setting.
Remark 8. If w∗(μ) is not a singleton, Track-and-Stop using D-tracking may fail to converge to w∗(μ), even when it is convex.
While we do not prove that D-tracking fails to converge to w∗(μ) on a specific example of a bandit, we provide empirical evidence in Appendix E. The reason for the failure of D-tracking is that it does not in general converge to the convex hull of the points it tracks. Suppose that wt = w(1) = (1/2, 1/2, 0) for t odd and wt = w(2) = (1/2, 0, 1/2) for t even. Then D-tracking verifies lim_{t→+∞} Nt/t = (1/3, 1/3, 1/3). This limit is outside of the convex hull of {w(1), w(2)}.

Figure 1: Sticky Track-and-Stop: the two main ideas, illustrated on the Any Low Arm problem. (a) Stopping rule: does the conservative confidence region Dt exclude the alternative ¬i to any answer i? (b) Sampling rule: find the least (in sticky order) oracle answer in the aggressive confidence region Ct, and track its oracle weights at μ̂t.

5 Algorithms for the Multiple-Answers Setting

We can prove for Track-and-Stop the following suboptimal upper bound on the sample complexity, based on the fact that it ensures convergence of Nt/t to the convex hull of the oracle weight set.
Theorem 9. Let conv(A) be the convex hull of a set A. 
For all μ ∈ M in a multi-answer problem, Track-and-Stop with C-tracking is δ-correct and verifies

    lim_{δ→0} Eμ[τδ] / log(1/δ) ≤ max_{w∈conv(w∗(μ))} 1 / D(w, μ) .

5.1 Sticky Track-and-Stop

The cases of multiple-answer problems for which Track-and-Stop is inadequate are μ ∈ M with iF(μ) of cardinality greater than 1. When convexity does not hold, w∗(μ) is the union of the convex sets (w∗(μ, ¬i))_{i∈iF(μ)}. If an algorithm can a priori select i_f ∈ iF(μ) and track allocations wt in w∗(μ̂t, ¬i_f), then using Track-and-Stop on that restricted problem will ensure that Nt/t converges to the oracle weights. Our proposed algorithm, Sticky Track-and-Stop, which we display in Algorithm 1, uses a confidence region around the current estimate μ̂t to determine which i ∈ I can be the oracle answer for μ. It selects one of these answers according to an arbitrary total order on I and does not change it (sticks to it) until no point in the confidence region has the chosen answer in its set of oracle answers.

Algorithm 1 Sticky Track-and-Stop.
Input: δ > 0, strict total order on I. 
Set t = 1, μ̂0 = 0, N0 = 0.
while not stopped do
    Let Ct = {μ′ ∈ M : D(Nt−1, μ̂t−1, μ′) ≤ log(f(t − 1))} .    // small conf. reg.
    Compute It = ∪_{μ′∈Ct} iF(μ′) .
    Pick the first alternative it ∈ It in the order on I.
    Compute wt ∈ w∗(μ̂t−1, ¬it).
    Pull an arm at according to the C-tracking rule and receive Xt ∼ ν_{at}.
    Set Nt = Nt−1 + e_{at} and μ̂t = μ̂t−1 + (1/Nt,at)(Xt − μ̂t−1,at) e_{at}.
    Let Dt = {μ′ ∈ M : D(Nt, μ̂t, μ′) ≤ β(t, δ)} .              // large conf. reg.
    if there exists i ∈ I such that Dt ∩ ¬i = ∅ then
        stop and return i.
    end
    t ← t + 1.
end

Theorem 10. For β(t, δ) = log(Ct²/δ), with C such that C ≥ e Σ_{t=1}^{+∞} (e/K)^K (log²(Ct²) log(t))^K / t² , Sticky Track-and-Stop is δ-correct.

That result is a consequence of Proposition 12 of [Garivier and Kaufmann, 2016].

5.2 Sample Complexity
Theorem 11. Sticky Track-and-Stop is asymptotically optimal, i.e. it verifies for all μ ∈ M,

    lim_{δ→0} Eμ[τδ] / log(1/δ) = 1 / D(μ) .

Let iμ = min iF(μ) in the arbitrary order on answers. 
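The answer-selection ("sticky") step of Algorithm 1 can be sketched in isolation. In this simplification the confidence region Ct is stood in for by a finite list of candidate models, and `iF` is a caller-supplied (hypothetical) function returning the oracle-answer set of a model:

```python
def sticky_pick(candidate_models, iF, order):
    # It = union of the oracle-answer sets over the confidence region;
    # stick to the least element of It in the fixed total order.
    It = set()
    for m in candidate_models:
        It |= set(iF(m))
    return min(It, key=order.index)

# Toy Any Low Arm situation: the region still contains models that disagree
# on which arm is lowest, so It = {1, 2} and the order breaks the tie.
order = [1, 2, "no"]
pick = sticky_pick([(0.1, 0.4), (0.4, 0.1)],
                   lambda m: {1 + m.index(min(m))},  # argmin arm, 1-indexed
                   order)
```

As the region shrinks around the true model, `It` eventually equals iF(μ) and the picked answer stops changing, which is exactly the behaviour exploited in the analysis below.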
For ε, ξ > 0, we define C∗_{ε,ξ}(μ), the minimal value of D(w′, μ′, ¬iμ) for w′ and μ′ in ε- and ξ-neighbourhoods of w∗(μ, ¬iμ) and μ:

    C∗_{ε,ξ}(μ) = inf { D(w′, μ′, ¬iμ) : ‖μ′ − μ‖∞ ≤ ξ ,  inf_{w∈w∗(μ,¬iμ)} ‖w′ − w‖∞ ≤ 3ε } .

Our proof strategy is to show that under a concentration event defined below, for t big enough, (μ̂t, Nt/t) belongs to that (ξ, ε)-neighbourhood of (μ, w∗(μ, ¬iμ)). From that fact, we obtain D(Nt, μ̂t, ¬iμ) ≥ t C∗_{ε,ξ}(μ). Furthermore, if the algorithm does not stop at stage t, we also get an upper bound on D(Nt, μ̂t, ¬iμ) from the stopping condition. We obtain an upper bound on the stopping time, a function of δ and C∗_{ε,ξ}(μ). By continuity of (w, μ) ↦ D(w, μ, ¬iμ) (from Theorem 4), we have lim_{ε→0, ξ→0} C∗_{ε,ξ}(μ) = D(μ, ¬iμ) = D(μ).

Two concentration events. Let E_T = ∩_{t=h(T)}^T {μ ∈ Ct} be the event that the small confidence region contains the true parameter vector μ for t ≥ h(T). The function h : N → R, positive, increasing and going to +∞, makes sure that each event {μ ∈ Ct} appears in finitely many E_T, which will be essential in the concentration results. We will eventually use h(T) = √T.
In order to define the second event, we first highlight a consequence of Theorem 4.
Corollary 12. 
For all ε > 0, for all μ ∈ M, for all i ∈ I, there exists ξ > 0 such that

    ‖μ′ − μ‖∞ ≤ ξ  ⇒  ∀w′ ∈ w∗(μ′, ¬i) ∃w ∈ w∗(μ, ¬i), ‖w′ − w‖∞ ≤ ε .

Let E′_T = ∩_{t=h(T)}^T {‖μ̂t − μ‖∞ ≤ ξ} be the event that the empirical parameter vector is close to μ, where ξ is chosen as in the previous corollary for i = iμ. The analysis of Sticky Track-and-Stop consists of two parts: first show that E_T^c and E′_T^c happen rarely enough to lead only to a finite term in Eμ[τδ]; then show that under E_T ∩ E′_T there is an upper bound on τδ.
Lemma 13. Suppose that there exists T0 such that for T ≥ T0, E_T ∩ E′_T ⊆ {τδ ≤ T}. Then

    Eμ[τδ] ≤ T0 + Σ_{T=T0}^{+∞} Pμ(E_T^c) + Σ_{T=T0}^{+∞} Pμ(E′_T^c) .        (1)

Proof. Since τδ is a non-negative integer-valued random variable, Eμ[τδ] = Σ_{T=0}^{+∞} Pμ{τδ > T}. For T ≥ T0, Pμ{τδ > T} ≤ Pμ(E_T^c ∪ E′_T^c) ≤ Pμ(E_T^c) + Pμ(E′_T^c).
The sums depending on the events E_T and E′_T in (1) are finite for well chosen h(T) and f(t).
Lemma 14. For h(T) = √T and f(t) = exp(β(t, 1/t⁵)) = Ct¹⁰ in the definition of the confidence region Ct, the sum Σ_{T=0}^{+∞} Pμ(E_T^c) + Pμ(E′_T^c) is finite.
The proof of the Lemma can be found in Appendix G.1. The remainder of the proof is concerned with finding a suitable T0. 
First, we show that if μ̂t and Nt/t are in a (ξ, ε)-neighbourhood of μ and w∗(μ, ¬iμ), then such an upper bound T0 on τδ can be obtained.
Lemma 15. Let t1 be an integer and suppose that for all t ≥ t1, D(Nt, μ̂t, ¬iμ) ≥ t C∗_{ε,ξ}(μ). Let T_β = inf{t : t > β(t, δ)/C∗_{ε,ξ}(μ)}. Then τδ ≤ max(t1, T_β).
Proof. Take t ≥ t1. If τδ > t then by hypothesis and the stopping condition, t ≤ D(Nt, μ̂t, ¬iμ)/C∗_{ε,ξ}(μ) ≤ β(t, δ)/C∗_{ε,ξ}(μ). Conversely, for t ≥ t1, if t > β(t, δ)/C∗_{ε,ξ}(μ) then τδ ≤ t. We obtain that τδ ≤ max(t1, inf{t : t > β(t, δ)/C∗_{ε,ξ}(μ)}).
The oracle answer it becomes constant. Due to the forced exploration present in the C-tracking procedure, the confidence region Ct shrinks. After some time, when concentration holds, the set of possible oracle answers It becomes constant over t and equal to iF(μ).
Lemma 16. If an algorithm guarantees that for all k ∈ [K] and all t ≥ 1, Nt,k ≥ n(t) > 0 with lim_{t→+∞} n(t)/log(f(t)) = +∞, then there exists T_Δ such that under the event E_T, for t ≥ max(h(T), T_Δ), It = iF(μ) and min It = iμ = min iF(μ).
Proof in Appendix G.4. Note that Lemma 16 depends only on the amount of forced exploration and not on other details of the algorithm. Any algorithm using C-tracking verifies the hypothesis with n(t) = √(t + K²) − 2K by Lemma 34 [Garivier and Kaufmann, 2016, Lemma 7].
Convergence to the neighbourhood of (μ, w∗(μ, ¬iμ)). Once it = iμ, we fall back to tracking points from a convex set of oracle weights. 
The estimate $\hat\mu_t$ and the proportions $N_t/t$ both converge, to $\mu$ and to the set $w^*(\mu, \neg i_\mu)$ respectively. The Lemma below is proved in Appendix G.5.

Lemma 17. Let $T_\Delta$ be defined as in Lemma 16. For $T$ such that $h(T) \ge T_\Delta$, it holds that on $E_T \cap E'_T$, Sticky Track-and-Stop with C-Tracking verifies
$$\forall t \ge h(T), \quad \|\hat\mu_t - \mu\|_\infty \le \xi \,,$$
and
$$\forall t \ge \frac{4K^2}{\varepsilon^2} + \frac{3h(T)}{\varepsilon}, \quad \inf_{w \in w^*(\mu, \neg i_\mu)} \Big\|\frac{N_t}{t} - w\Big\|_\infty \le 3\varepsilon \,.$$

Remainder of the proof. Suppose that the event $E_T \cap E'_T$ holds. Let $T_\Delta$ be defined as in Lemma 16 and $T$ be such that $h(T) \ge T_\Delta$. Let $\eta(T) = 4K^2/\varepsilon^2 + 3h(T)/\varepsilon$. For all $t \ge \eta(T)$ we have $D(N_t, \hat\mu_t, \neg i_\mu) \ge t\, C^*_{\varepsilon,\xi}(\mu)$ by Lemma 17. For $h(T)$ bigger than some $T_\eta$ we have $\eta(T) \le T$. We suppose $h(T) \ge \max(T_\Delta, T_\eta)$. We apply Lemma 15 with $t_1 = \eta(T)$. We obtain that $\tau_\delta \le \max(\eta(T), T_\beta) \le \max(T, T_\beta)$. Conclusion: for $T \ge T_0 = \max(h^{-1}(T_\Delta), h^{-1}(T_\eta), T_\beta)$, under the concentration event, $\tau_\delta \le T$ and we can apply Lemma 13.

Note that $\lim_{\delta \to 0} T_0/\log(1/\delta) = 1/C^*_{\varepsilon,\xi}(\mu)$. Taking $\varepsilon \to 0$ (hence $\xi \to 0$ as well), $\lim_{\varepsilon \to 0} C^*_{\varepsilon,\xi}(\mu) = D(\mu)$, and we obtain
$$\lim_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} = \frac{1}{D(\mu)} \,.$$
We proved Theorem 11.

6 Conclusion

We characterized the complexity of multiple-answer pure exploration bandit problems, showing a lower bound and exhibiting an algorithm with asymptotically matching sample complexity on all such problems.
This study could be extended in several interesting directions, and we now list a few.

• The computational complexity of Track-and-Stop is an important issue: it would be desirable to design a pure exploration algorithm with optimal sample complexity which does not need to solve a min-max problem at each step. Furthermore, the same would need to be done for the sticky selection of an answer in the multiple-answer setting.

• Both lower bounds and upper bounds in this paper are asymptotic. In the upper bound case, only the forced exploration rounds are considered when evaluating the convergence of $\hat\mu_t$ to $\mu$, giving rise to potentially sub-optimal lower-order terms. A finite-time analysis with reasonably small $o(\log(1/\delta))$ terms for an optimal algorithm is desirable. In addition, while selecting one of the oracle answers to stick to has no asymptotic cost, it could have a lower-order effect on the sample complexity and appear in a refined lower bound.

• Current tools in the theory of Brownian motion are insufficient to characterise the asymptotic distribution of the proportions induced by tracking, even for two arms. Without tracking the Arcsine law arises, so this slightly more challenging problem holds the promise of similarly elegant results.

• Finally, the multiple-answer pure exploration setting can be extended in various ways. Making $\mathcal{I}$ continuous leads to regression problems. The parametric assumption that the arms are in one-parameter exponential families could also be relaxed.

References

J.-Y. Audibert, S. Bubeck, and R. Munos. Best Arm Identification in Multi-armed Bandits. In Proceedings of the 23rd Conference on Learning Theory, 2010.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

D. A. Berry and B. Fristedt. Bandit Problems. Sequential allocation of experiments.
Chapman and Hall, 1985.

D. Blackwell and M. A. Girshick. Theory of games and statistical decisions. Wiley, 1954.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. ISBN 0521833787.

S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

S. Bubeck, R. Munos, and G. Stoltz. Pure Exploration in Finitely Armed and Continuous Armed Bandits. Theoretical Computer Science, 412:1832–1852, 2011.

R. M. Castro. Adaptive sensing performance lower bounds for sparse signal detection and support estimation. Bernoulli, 20(4):2217–2246, 2014.

L. Chen, A. Gupta, J. Li, M. Qiao, and R. Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 482–534. PMLR, July 2017.

S. Chen, T. Lin, I. King, M. Lyu, and W. Chen. Combinatorial Pure Exploration of Multi-Armed Bandits. In Advances in Neural Information Processing Systems, 2014.

R. Combes, S. Magureanu, and A. Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pages 1763–1771, 2017.

E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Computational Learning Theory, 15th Annual Conference on Computational Learning Theory, COLT 2002, Sydney, Australia, July 8-10, 2002, Proceedings, volume 2375 of Lecture Notes in Computer Science, pages 255–270. Springer, 2002. ISBN 3-540-43836-X.

E. Even-Dar, S. Mannor, and Y. Mansour. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. Journal of Machine Learning Research, 7:1079–1105, 2006.

E. A. Feinberg, P. O.
Kasyanov, and M. Voorneveld. Berge's maximum theorem for noncompact image sets. Journal of Mathematical Analysis and Applications, 413(2):1040–1046, 2014.

V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence. In Advances in Neural Information Processing Systems, 2012.

V. Gabillon, A. Lazaric, M. Ghavamzadeh, R. Ortner, and P. L. Bartlett. Improved learning complexity in combinatorial pure exploration bandits. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, volume 51 of JMLR Workshop and Conference Proceedings, pages 1004–1012. JMLR.org, 2016.

A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference On Learning Theory (COLT), 2016.

A. Garivier and E. Kaufmann. Non-asymptotic sequential tests for overlapping hypotheses and application to near optimal arm identification in bandit models. arXiv preprint arXiv:1905.03495, 2019.

A. Garivier, E. Kaufmann, and W. M. Koolen. Maximin action identification: A new bandit framework for games. In Proceedings of the 29th Annual Conference on Learning Theory (COLT), pages 1028–1050, June 2016.

R. Huang, M. M. Ajallooeian, C. Szepesvári, and M. Müller. Structured best arm identification with fixed confidence. In International Conference on Algorithmic Learning Theory, ALT 2017, 15-17 October 2017, Kyoto University, Kyoto, Japan, volume 76 of Proceedings of Machine Learning Research, pages 593–616. PMLR, 2017.

K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck. lil'UCB: an Optimal Exploration Algorithm for Multi-Armed Bandits. In Proceedings of the 27th Conference on Learning Theory, 2014.

S. Kalyanakrishnan and P. Stone.
Ef\ufb01cient Selection in Multiple Bandit Arms: Theory and Practice.\n\nIn International Conference on Machine Learning (ICML), 2010.\n\nS. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed\n\nbandits. In International Conference on Machine Learning (ICML), 2012.\n\nZ. Karnin, T. Koren, and O. Somekh. Almost optimal Exploration in multi-armed bandits.\n\nInternational Conference on Machine Learning (ICML), 2013.\n\nIn\n\nS. Katariya, B. Kveton, C. Szepesv\u00e1ri, C. Vernade, and Z. Wen. Stochastic rank-1 bandits. In\nProceedings of the 20th International Conference on Arti\ufb01cial Intelligence and Statistics, AISTATS\n2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning\nResearch, pages 392\u2013401. PMLR, 2017.\n\nE. Kaufmann and S. Kalyanakrishnan.\n\nProceeding of the 26th Conference On Learning Theory., 2013.\n\nInformation complexity in bandit subset selection.\n\nIn\n\nE. Kaufmann and W. M. Koolen. Monte-Carlo tree search by best arm identi\ufb01cation. In Advances in\n\nNeural Information Processing Systems (NeurIPS) 30, pages 4904\u20134913, Dec. 2017.\n\nE. Kaufmann, O. Capp\u00e9, and A. Garivier. On the Complexity of Best Arm Identi\ufb01cation in Multi-\n\nArmed Bandit Models. Journal of Machine Learning Research, 17(1):1\u201342, 2016.\n\nE. Kaufmann, W. M. Koolen, and A. Garivier. Sequential test for the lowest mean: From Thompson\nto Murphy sampling. In Advances in Neural Information Processing Systems (NeurIPS) 31, pages\n6333\u20136343. Curran Associates, Inc., Dec. 2018.\n\nT. L. Lai and H. Robbins. Asymptotically ef\ufb01cient adaptive allocation rules. Advances in Applied\n\nMathematics, 6(1):4\u201322, 1985.\n\nA. Locatelli, M. Gutzeit, and A. Carpentier. 
An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1690–1698. JMLR.org, 2016.

S. Magureanu, R. Combes, and A. Proutière. Lipschitz bandits: Regret lower bound and optimal algorithms. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, volume 35 of JMLR Workshop and Conference Proceedings, pages 975–999. JMLR.org, 2014.

O. Maron and A. Moore. The Racing algorithm: Model selection for Lazy learners. Artificial Intelligence Review, 11(1-5):113–131, 1997.

D. Russo. Simple Bayesian algorithms for best arm identification. CoRR, abs/1602.08448, 2016.

K. Teraoka, K. Hatano, and E. Takimoto. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions, 97-D(3):392–398, 2014.

Y. Zhou, J. Li, and J. Zhu. Identify the Nash equilibrium in static games with random payoffs. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4160–4169. PMLR, Aug. 2017.