{"title": "Non-Asymptotic Pure Exploration by Solving Games", "book": "Advances in Neural Information Processing Systems", "page_first": 14492, "page_last": 14501, "abstract": "Pure exploration (aka active testing) is the fundamental task of sequentially gathering information to answer a query about a stochastic environment. Good algorithms make few mistakes and take few samples. Lower bounds (for multi-armed bandit models with arms in an exponential family) reveal that the sample complexity is determined by the solution to an optimisation problem. The existing state of the art algorithms achieve asymptotic optimality by solving a plug-in estimate of that optimisation problem at each step. We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners to obtain the first finite confidence guarantees that are adapted to the exponential family and which apply to any pure exploration query and bandit structure. Moreover, our algorithms only use a best response oracle instead of fully solving the optimisation problem.", "full_text": "Non-Asymptotic Pure Exploration by Solving Games\n\nR\u00e9my Degenne\n\nCentrum Wiskunde & Informatica\n\nScience Park 123, 1098 XG Amsterdam\n\nWouter M. Koolen\n\nCentrum Wiskunde & Informatica\n\nScience Park 123, 1098 XG Amsterdam\n\nremy.degenne@cwi.nl\n\nwmkoolen@cwi.nl\n\nPierre M\u00e9nard\n\nInria Lille\n\n40 Avenue Halley, 59650 Villeneuve-d\u2019Ascq\n\nmenardprr@gmail.com\n\nAbstract\n\nPure exploration (aka active testing) is the fundamental task of sequentially gather-\ning information to answer a query about a stochastic environment. Good algorithms\nmake few mistakes and take few samples.\nLower bounds (for multi-armed bandit models with arms in an exponential family)\nreveal that the sample complexity is determined by the solution to an optimisation\nproblem. 
The existing state-of-the-art algorithms achieve asymptotic optimality by solving a plug-in estimate of that optimisation problem at each step. We interpret the optimisation problem as an unknown game, and propose sampling rules based on iterative strategies to estimate and converge to its saddle point. We apply no-regret learners to obtain the first finite confidence guarantees that are adapted to the exponential family and which apply to any pure exploration query and bandit structure. Moreover, our algorithms only use a best response oracle instead of fully solving the optimisation problem.

1 Introduction

We study fundamental trade-offs arising in sequential interactive learning. We adopt the framework of Pure Exploration, in which the learning system interacts with its environment by performing a sequence of experiments, with the goal of maximising information gain. We aim to design general, efficient systems that can answer a given query with few experiments yet few mistakes.
As usual, we model the environment by a multi-armed bandit model with exponential family arms, and work in the fixed confidence (δ-PAC) setting. Information-theoretic lower bounds [13] show that a certain number of samples is unavoidable to reach a certain confidence. Moreover, algorithms are developed [13] that match these lower bounds asymptotically, in the small confidence δ → 0 regime.
Our contribution is a framework for obtaining efficient algorithms with non-asymptotic guarantees. The main object of study is the "Pure Exploration Game" [9], a two-player zero-sum game that is central to lower bounds as well as to the widely used GLRT-based stopping rules. We develop iterative methods that provably converge to saddle-point behaviour. The game itself is not known to the learner, and has to be explored and estimated on the fly.
Our methods are based on pairs of low-regret algorithms, combined with optimism and tracking. We prove sample complexity guarantees for several combinations of algorithms, and discuss their computational and statistical trade-offs.
The rest of the introduction provides more detail on pure exploration problems, the pure exploration game, the connection between them, and expands on our contribution. We also review related work.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our model for the environment is a K-armed bandit, i.e. distributions (ν_1, ..., ν_K) on R. We assume throughout that these distributions come from a one-dimensional exponential family, and we denote by d(µ, λ) the relative entropy (Kullback-Leibler divergence) from the distribution with mean µ to that with mean λ. A pure exploration problem is parameterised by a set M of K-armed bandit models (the possible environments), a finite set I of candidate answers and a correct-answer function i* : M → I. We focus on Best Arm Identification, for which i*(µ) = argmax_i µ_i, and the Minimum Threshold problem, which is defined for any fixed threshold γ by i*(µ) = 1{min_i µ_i < γ}. The goal of the learner is to learn i*(µ) confidently and efficiently by means of sequentially sampling from the arms of µ, no matter which µ ∈ M it faces. When an algorithm sequentially interacts with µ, we denote by N^k_t and µ̂^k_t the sample count and empirical mean estimate (these form a sufficient statistic) for each arm k after t rounds. We write τ_δ for the time at which the algorithm stops and î for the answer it recommends. The algorithm is correct (on a particular run) if it recommends î = i*(µ), the correct answer for µ.
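For concreteness, the divergence d(µ, λ) has a simple closed form in the two families highlighted above. A minimal sketch (function names are ours, not from the paper):

```python
import math

def kl_gaussian(mu, lam, sigma2=1.0):
    """d(mu, lam) for Gaussian arms with known variance sigma2."""
    return (mu - lam) ** 2 / (2 * sigma2)

def kl_bernoulli(mu, lam, eps=1e-12):
    """d(mu, lam) for Bernoulli arms; lam is clamped away from {0, 1}."""
    lam = min(max(lam, eps), 1.0 - eps)
    if mu <= 0.0:
        return math.log(1.0 / (1.0 - lam))
    if mu >= 1.0:
        return math.log(1.0 / lam)
    return mu * math.log(mu / lam) + (1.0 - mu) * math.log((1.0 - mu) / (1.0 - lam))
```

The Bernoulli family is 1/4-sub-Gaussian, so kl_bernoulli(µ, λ) ≥ 2(µ − λ)², an instance of the sub-Gaussian inequality used later in Section 2.1.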
An algorithm is δ-PAC (or δ-correct) if P_µ(î ≠ i*(µ)) ≤ δ for each µ ∈ M. Among δ-PAC algorithms, we are interested in those minimising the sample complexity E_µ[τ_δ]. As it turns out, what can be achieved, and how, is captured by a certain game.
For each µ ∈ M, [9] define the two-player zero-sum simultaneous-move Pure Exploration Game: MAX plays an arm k ∈ [K], MIN plays an "alternative" bandit model λ ∈ M with a different correct answer i*(λ) ≠ i*(µ). We denote the set of such alternatives to answer i by ¬i = {λ ∈ M : i*(λ) ≠ i}. MAX then receives payoff d(µ^k, λ^k) from MIN. As the payoff is neither concave in k (since discrete) nor convex in λ (both domain and divergence are problematic), we will analyse the game by sequencing the moves and considering a mixed strategy for the player moving first. With MAX moving first and playing a mixed strategy k ∼ w ∈ △_K (we identify distributions over [K] and the simplex △_K), the value of the game is

    D_µ := sup_{w ∈ △_K} D_µ(w)   where   D_µ(w) := inf_{λ ∈ M : i*(λ) ≠ i*(µ)} Σ_{k=1}^K w^k d(µ^k, λ^k).   (1)

We denote a maximiser of w ↦ D_µ(w) by w*(µ) and call it an oracle allocation. The analogue where MIN plays first using a mixed strategy λ ∼ q ∈ P(¬i*(µ)) (distributions over that set) is proposed and analysed in [9]. Despite the baroque domain of λ in (1), there always exist minimax q supported on ≤ K points due to dimension constraints.
The Pure Exploration Game is essential both to characterising the complexity of learning and to algorithm design.
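For Gaussian arms in Best Arm Identification, the inner infimum of (1) has a well-known closed form: the best response pinches the best arm and one challenger together at their w-weighted mean. A minimal sketch of evaluating D_µ(w) this way (names are ours):

```python
import numpy as np

def best_response_value(mu, w, istar=None, sigma2=1.0):
    """D_mu(w) for Gaussian Best Arm Identification: minimise over the
    challenger j != istar, pinching arms istar and j at their weighted mean."""
    mu, w = np.asarray(mu, float), np.asarray(w, float)
    if istar is None:
        istar = int(np.argmax(mu))
    vals = []
    for j in range(len(mu)):
        if j == istar:
            continue
        wi, wj = max(w[istar], 1e-12), max(w[j], 1e-12)
        m = (wi * mu[istar] + wj * mu[j]) / (wi + wj)
        vals.append((wi * (mu[istar] - m) ** 2 + wj * (mu[j] - m) ** 2) / (2 * sigma2))
    return min(vals)
```

The oracle allocation w*(µ) then maximises this value over the simplex; the sketch only evaluates the inner infimum for a given w.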
Namely, first, any δ-correct algorithm has sample complexity for each bandit µ ∈ M at least E_µ[τ_δ] ≥ kl(δ, 1 − δ)/D_µ ≈ ln(1/δ)/D_µ, and matching this rate requires sampling proportions E_µ[N_{τ_δ}]/E_µ[τ_δ] converging to w*(µ) as δ → 0 [see 13]. Moreover, second, the general approach for obtaining δ-correct algorithms is based on the Generalised Likelihood Ratio Test (GLRT) statistic Z_t := t D_{µ̂_t}(N_t/t). There are universal thresholds β(t, δ) ≈ ln(1/δ) + (K/2) ln ln t [see e.g. 12, 13, 19, 23] such that P_µ{∃t : Z_t ≥ β(δ, t)} ≤ δ for any µ ∈ M. Hence stopping when Z_t ≥ β(t, δ) and recommending î = i*(µ̂_t) is δ-correct for any sampling rule. Maximising the GLRT to stop as early as possible is achieved by the sampling proportions N_t/t = w*(µ̂_t).
These considerations show that any successful Pure Exploration agent needs to (approximately) solve the Pure Exploration Game D_µ. The Track-and-Stop approach, pioneered by [13], ensures that µ̂_t → µ using forced exploration, and N_t/t → w*(µ̂_t) using tracking. Continuity of w* and D_µ then yields that Z_t ≈ t D_µ(w*(µ)) = t D_µ. The GLRT stopping rule triggers when t = β(δ, t)/D_µ ≈ ln(1/δ)/D_µ, meeting the lower bound in the asymptotic regime δ → 0.

Our contributions. We explore methods to solve the Pure Exploration game D_µ associated with the unknown bandit model µ, and discuss their statistical and computational trade-offs. We look at solving the game iteratively, by instantiating a low-regret online learner for each player. In particular for the k-player we use a self-tuning instance of Exponentiated Gradient called AdaHedge [8].
The λ-player needs to play a distribution to deal with non-convexity; we consider Follow the Perturbed Leader as well as an ensemble of Online Gradient Descent experts. We show how a combination of optimistic gradient estimates, concentration of measure arguments and regret guarantees delivers the first non-asymptotic sample complexity guarantees (which retain asymptotic optimality for δ → 0). The advantage of this approach is that it only requires a best response oracle (1, right) instead of a computationally more costly max-min oracle (1, left) employed by Track-and-Stop. Going the other extreme, we also develop Optimistic Track-and-Stop based on a max-max-min oracle (the outer max implementing optimism over a confidence region for µ), which trades increased computation for tighter sample complexity guarantees with simpler proofs.
Our cocktail sheds new light on the trade-offs involved in the design of pure exploration algorithms. We show how "big-hammer" forced exploration can be refined using problem-adapted optimism. We show how tracking is unnecessary when the k-player goes second. We show how computational complexity can be traded off using oracles of various sophistication. And finally, we validate our approach empirically in benchmark experiments at practical δ, and find that our algorithms are either competitive with Track-and-Stop (dense w*) or dominate it (sparse w*).

Related work. Besides maximising information gain, there is a vast literature on maximising reward in multi-armed bandit models, for which a good starting point is [21]. The canonical Pure Exploration problem is Best Arm Identification [10, 3], which is actively studied in the fixed confidence, fixed budget and simple regret settings [21, Ch. 33].
Its sample complexity as a function of the confidence level δ has been analysed very thoroughly in the (sub)-Gaussian case, where we have a rather complete picture, even including lower order terms [5]. [18] initiated the quest for correct instance-dependent constants for arms from any exponential family. [26] stresses the importance of the "moderate confidence" regime δ ≫ 0. Although it is not the focus here, we do believe that it is crucial to obtain the right problem dependence not only in ln(1/δ) but also in K and other structural parameters, as the latter may in practice dominate the sample complexity.
Pure Exploration queries beyond Best Arm include Top-M [15], Thresholding [22], Minimum Threshold [20], Combinatorial Bandits [6], pure-strategy Nash equilibria [29] and Monte-Carlo Tree Search [27]. There is also significant interest in these problems in structured bandit models, including Rank-one [17], Lipschitz [23], Monotonic [14], Unimodal [7] and Unit-Sum [26]. Our framework applies to all these cases. Problems with multiple correct answers were recently considered by [9]. Existing learning strategies do not work unmodified; some fail and others need to be generalised.
Optimism is ubiquitous in bandit optimisation since [1], and was adapted to pure exploration by [16]. We are not aware of optimism being used to solve unknown min-max problems. Optimism was employed in the UCB Frank-Wolfe method by [2] for maximising an unknown smooth function faster. We do not currently know how to make use of such fast rate results. For games the best response value is a non-smooth function of the action.
Using a pair of independent no-regret learners to solve a fixed and known game goes back to [11]. More recently game dynamics were used to explain (Nesterov) acceleration in offline optimisation [28].
Ensuring faster convergence with coordinating learners is an active area of research [25]. Unfortunately, we currently do not know how to obtain an advantage in this way, as our main learning overhead comes from concentration, not regret.

2 Algorithms with finite confidence sample complexity bounds

We introduce a family of algorithms, presented as Algorithm 1, with sample complexity bounds for non-asymptotic confidence δ. It uses the following ingredients: the GLRT stopping rule, a saddle point algorithm (possibly formed by two regret minimization algorithms) and optimistic loss estimates.

2.1 Model and assumption: sub-Gaussian exponential families.

We suppose that the arm distributions belong to a known one-parameter exponential family. That is, there is a reference measure ν_0 and parameters η_1, ..., η_K ∈ R such that the distribution of arm k ∈ [K] is defined by dν_k/dν_0(x) ∝ e^{η_k x}. Examples include Gaussians with a given variance or Bernoulli with means in (0, 1). All results can be extended to arms each in a possibly different known exponential family. Let Θ be the open interval of possible means of such distributions. A distribution ν is said to be σ²-sub-Gaussian if for all u ∈ R, log E_{X∼ν} e^{u(X − E_{X∼ν}[X])} ≤ (σ²/2) u². An exponential family has all distributions sub-Gaussian with constant σ² iff for all µ, λ ∈ Θ, it verifies d(µ, λ) ≥ (µ − λ)²/(2σ²).
Assumption 1. The arm distributions belong to sub-Gaussian exponential families with constant σ².
Assumption 2. There exists a closed interval [µ_min, µ_max] ⊂ Θ such that M ⊆ [µ_min, µ_max]^K.

Algorithm 1 Pure exploration meta-algorithm.
Require: Algorithms A^k and A^λ, stopping threshold β(t, δ) and exploration bonus f(t).
1: Sample each arm once and form estimate µ̂_K.
2: for t = K + 1, ... do
3:   For k ∈ [K], let [α^k_t, β^k_t] = {ξ : N^k_{t−1} d(µ̂^k_{t−1}, ξ) ≤ f(t−1)}.   ▷ KL confidence intervals
4:   Let µ̃_{t−1} = argmin_{λ ∈ M ∩ ×_{k=1}^K [α^k_t, β^k_t]} Σ_{k=1}^K N^k_{t−1} d(µ̂^k_{t−1}, λ^k).   ▷ = µ̂_{t−1} if µ̂_{t−1} ∈ M
5:   Let i_t = i*(µ̃_{t−1}).
6:   Stop and output î = i_t if inf_{λ ∈ ¬i_t} Σ_{k=1}^K N^k_{t−1} d(µ̂^k_{t−1}, λ^k) > β(t, δ).   ▷ GLRT stopping rule
7:   Get w_t and q_t from A^k_{i_t} and A^λ_{i_t}.
8:   For k ∈ [K], let U^k_t = max_{ξ ∈ {α^k_t, β^k_t}} E_{λ∼q_t} d(ξ, λ^k) + √(f(t−1)/N^k_{t−1}).   ▷ Optimism
9:   Feed A^k_{i_t} the loss ℓ^w_t(w) = −Σ_{k=1}^K w^k U^k_t.
10:  Feed A^λ_{i_t} the loss ℓ^λ_t(q) = E_{λ∼q} Σ_{k=1}^K w^k_t d(µ̂^k_{t−1}, λ^k).
11:  Pick arm k_t = argmin_k N^k_{t−1} / Σ_{s=1}^t w^k_s.   ▷ Cumulative tracking
12:  Observe sample X_t ∼ ν_{k_t}. Update µ̂_t.
13: end for

As a consequence of Assumption 2, there exist L, D > 0 such that for all y ∈ [µ_min, µ_max], the function x ↦ d(x, y) is L-Lipschitz on [µ_min, µ_max] and d(x, y) ≤ D. Assumption 1 is implied by Assumption 2. Both are discussed in Appendix F.
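To make the moving parts of Algorithm 1 concrete, here is a minimal runnable sketch of the loop for Gaussian Best Arm Identification, with a plain exponential-weights learner standing in for A^k, a best-response λ-player, a stylised threshold, and no optimism bonus. All names, constants and simplifications are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def best_response(mu_hat, w, istar, sigma2=1.0):
    """Closed-form best response for Gaussian Best Arm: the infimum over
    alternatives pinches arms istar and j at their w-weighted mean."""
    best_val, best_lam = np.inf, None
    for j in range(len(mu_hat)):
        if j == istar:
            continue
        wi, wj = max(w[istar], 1e-12), max(w[j], 1e-12)
        m = (wi * mu_hat[istar] + wj * mu_hat[j]) / (wi + wj)
        v = (wi * (mu_hat[istar] - m) ** 2 + wj * (mu_hat[j] - m) ** 2) / (2 * sigma2)
        if v < best_val:
            lam = np.array(mu_hat, dtype=float)
            lam[istar] = lam[j] = m
            best_val, best_lam = v, lam
    return best_val, best_lam

def run(mu, delta=0.1, sigma2=1.0, horizon=100000):
    K = len(mu)
    counts = np.ones(K)
    sums = rng.normal(mu, np.sqrt(sigma2))           # one sample per arm
    gains = np.zeros(K)                              # exp-weights statistics
    cum_w = np.zeros(K)                              # for cumulative tracking
    for t in range(K + 1, horizon):
        mu_hat = sums / counts
        istar = int(np.argmax(mu_hat))
        Z, _ = best_response(mu_hat, counts, istar, sigma2)   # GLRT statistic
        if Z > np.log((1 + np.log(t)) / delta):               # stylised threshold
            return istar, int(counts.sum())
        w = np.exp(gains - gains.max()); w /= w.sum()         # k-player plays w_t
        _, lam = best_response(mu_hat, w, istar, sigma2)      # lambda-player responds
        U = (mu_hat - lam) ** 2 / (2 * sigma2)                # per-arm payoffs
        gains += 0.5 * U                                      # fixed learning rate
        cum_w += w
        k = int(np.argmin(counts / np.maximum(cum_w, 1e-12)))  # cumulative tracking
        counts[k] += 1
        sums[k] += rng.normal(mu[k], np.sqrt(sigma2))
    return int(np.argmax(sums / counts)), int(counts.sum())
```

This omits the projection onto M, the per-answer learner instances and the optimism bonus of line 8; it only illustrates how stopping, the two players and tracking interlock.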
In particular, Assumption 2 can often be relaxed. L and D will appear in the sample complexity bounds but none of our algorithms use them explicitly. Everywhere below, µ̂_t denotes the orthogonal projection of the empirical mean onto [µ_min, µ_max]^K, with one possible exception: the GLRT stopping rule may use it either projected or not, indifferently.

2.2 Algorithmic ingredients

Stopping and recommendation rules. The algorithm stops if any one of |I| many GLRT tests succeeds [13]. Let L_µ denote the likelihood under the model parametrized by µ. The generalized log-likelihood ratio between a set Λ and the whole parameter space Θ^K is

    GLR^{Θ^K}_t(Λ) = log [ sup_{µ̃ ∈ Θ^K} L_µ̃(X_1, ..., X_t) / sup_{λ ∈ Λ} L_λ(X_1, ..., X_t) ] = inf_{λ ∈ Λ} Σ_{k ∈ [K]} N^k_t d(µ̂^k_t, λ^k).

By concentration of measure arguments, we may find β(t, δ) such that with probability greater than 1 − δ, for all t ∈ N, GLR^{Θ^K}_t({µ}) ≤ β(t, δ) [see 12, 13, 19, 23]. Test i ∈ I succeeds if GLR^{Θ^K}_t(¬i) > β(t, δ). If the algorithm stops because of test i, it recommends î = i (if several tests succeed at the same time, it chooses arbitrarily among these).
Theorem 1. Any algorithm using the GLRT stopping and recommendation rules with threshold β(t, δ) such that P_µ{GLR^{Θ^K}_t({µ}) > β(t, δ)} ≤ δ is δ-correct.

A game with two players. An algorithm is unable to stop at time t if the stopping condition is not met, i.e.

    β(t, δ) ≥ inf_{λ ∈ ¬i*(µ̂_t)} Σ_{k ∈ [K]} N^k_t d(µ̂^k_t, λ^k).

In order to stop early, the right hand side has to be maximized, i.e. made close to t sup_{w ∈ △_K} inf_{λ ∈ ¬i*(µ̂_t)} Σ_{k ∈ [K]} w^k d(µ̂^k_t, λ^k) = t D_{µ̂_t} ≈ t D_µ. Then with β(t, δ) ≈ log(1/δ) + o(t) we obtain t ≤ log(1/δ)/D_µ up to lower order terms, i.e. the stopping time is close to optimality.
We propose to approach that max-min saddle-point by implementing two iterative algorithms, A^k and A^λ, for the k-player and a λ-player. Our sample complexity bound is a function of two quantities R^k_t and R^λ_t, regret bounds of algorithms A^k and A^λ when used for t steps on appropriate losses.
One player of our choice goes first. The second player can see the action of the first, see the corresponding loss function and use an algorithm with zero regret (e.g. Best-Response or Be-The-Leader). One of the players has to play distributions on its action set. We have one of the following:

1. λ-player plays first and uses a distribution in P(¬i_t). The k-player plays k_t ∈ [K].
2. k-player plays first and uses w_t ∈ △_K (distribution over [K]). The λ-player plays λ_t ∈ ¬i_t.
3. Both players play distributions and go in any order, or concurrently.

Algorithm 1 presents two players playing concurrently but can be modified: if for example λ plays second, then it gets to see ℓ^λ_t(q) before computing q_t.
The sampling rule at stage t first computes the most likely answer i_t for µ̂_{t−1}. If the set over which the algorithm optimizes at line 4 is empty, i_t is arbitrary. The k-player plays w_t coming from A^k_{i_t}, an instance of A^k running only on the rounds on which the selected answer is i_t. The λ-player similarly uses an instance A^λ_{i_t} of A^λ.

Tracking.
Since a single arm has to be pulled, if the k-player plays w ∈ △_K, an additional procedure is needed to translate that play into a sampling rule. We use a so-called tracking procedure, k_t = argmin_{k ∈ [K]} N^k_{t−1} / Σ_{s=1}^t w^k_s, which ensures that Σ_{s=1}^t w^k_s − (K − 1) ≤ N^k_t ≤ Σ_{s=1}^t w^k_s + 1.

Optimism in face of uncertainty. Existing algorithms for general pure exploration use forced exploration to ensure convergence of µ̂_t to µ, making sure that every arm is sampled more than e.g. √t times. We replace that method by the "optimism in face of uncertainty" principle, which gives a more adaptive exploration scheme. While that heuristic is widely used in the bandit literature, this work is its first successful implementation for general pure exploration. In Algorithm 1, the k-player algorithm gets an optimistic loss depending on w_t and q_t. The λ-player gets a non-optimistic loss.

2.3 Proof scheme and sample complexity result

In order to bound the sample complexity, we introduce a sequence of concentration events E_t = {∀s ≤ t, ∀k ∈ [K], d(µ̂^k_s, µ^k) ≤ W̄((1+a) log(t)) / N^k_s} for a > 0 and W̄(x) = x + log x + 1/2. It verifies Σ_{t=3}^{+∞} P_µ(E^c_t) ≤ 2eK/a² (see Appendix B for a proof). The concentration intervals used in Algorithm 1 are a function of f(t) = W̄((1+a)(1+b) log t) for b > 0.

Lemma 1. Let E_t be an event and T_0(δ) ∈ N be such that for t ≥ T_0(δ), E_t ⊆ {τ_δ ≤ t}. Then

    E_µ[τ_δ] = Σ_{t=1}^{+∞} P{τ_δ > t} ≤ T_0(δ) + Σ_{t=T_0(δ)}^{+∞} P_µ(E^c_t).

We now present briefly the steps of the proof for the stopping time upper bound before stating our main theorem on the sample complexity of Algorithm 1.
These steps are inexact and should be regarded as a guideline and not as rigorous computations. A full proof of our results can be found in the appendices (Appendix B for concentration results, C for tracking and D for the main sample complexity proof). We simplify the presentation by supposing that i_t = i*(µ) throughout (the main proof will show this may fail only o(t) rounds). For t < τ_δ, under concentration event E_t,

    β(t, δ) ≥ inf_{λ ∈ ¬i*(µ)} Σ_{k ∈ [K]} N^k_t d(µ̂^k_t, λ^k)   (stopping condition)
           ≥ inf_{λ ∈ ¬i*(µ)} Σ_{s ∈ [t]} Σ_{k ∈ [K]} w^k_s d(µ̂^k_t, λ^k) − KD   (tracking)
           ≥ inf_{λ ∈ ¬i*(µ)} Σ_{s ∈ [t]} Σ_{k ∈ [K]} w^k_s d(µ̂^k_{s−1}, λ^k) − O(√(t log t)).   (concentration)

The first term is now the infimum of a sum of losses, inf_{λ ∈ ¬i*(µ)} Σ_{s ∈ [t]} ℓ^λ_s(λ).
We use the regret property of the λ-player's algorithm on those losses, then we introduce optimistic values U^k_s such that for ξ^k ∈ {µ^k, µ̂^k_{s−1}} we have E_{λ∼q_s} d(ξ^k, λ^k) ≤ U^k_s ≤ E_{λ∼q_s} d(ξ^k, λ^k) + O(√(1/s)). Then

    inf_{λ ∈ ¬i*(µ)} Σ_{s ∈ [t]} Σ_{k ∈ [K]} w^k_s d(µ̂^k_{s−1}, λ^k)
        ≥ Σ_{s ∈ [t]} E_{λ∼q_s} Σ_{k ∈ [K]} w^k_s d(µ̂^k_{s−1}, λ^k) − R^λ_t   (regret λ)
        ≥ Σ_{s ∈ [t]} Σ_{k ∈ [K]} w^k_s U^k_s − O(√t) − R^λ_t   (optimism)
        ≥ max_{k ∈ [K]} Σ_{s ∈ [t]} U^k_s − R^k_t − O(√t) − R^λ_t   (regret w)
        ≥ max_{k ∈ [K]} Σ_{s ∈ [t]} E_{λ∼q_s} d(µ^k, λ^k) − R^k_t − O(√t) − R^λ_t.   (optimism)

Finally, (1/t) Σ_{s ∈ [t]} E_{λ∼q_s} is itself the expectation of another distribution on P(¬i*(µ)). Hence

    max_{k ∈ [K]} Σ_{s ∈ [t]} E_{λ∼q_s} d(µ^k, λ^k) ≥ t inf_q max_{k ∈ [K]} E_{λ∼q} d(µ^k, λ^k) = t D_µ.

Putting these inequalities together, we finally get an inequality on such a t < τ_δ. The exact result we obtain is the following theorem, proved in Appendix D.

Theorem 2. Under Assumption 2, the sample complexity of Algorithm 1 on model µ ∈ M is

    E_µ[τ_δ] ≤ T_0(δ) + 2eK/a²,   with T_0(δ) = max{t ∈ N : t ≤ β(t, δ)/D_µ + C_µ(R^λ_t + R^k_t + h(t))},

where C_µ depends on µ and M and h(t) = O(√(t log t)). See Appendix D for an exact definition.

The forms of h(t) and of T_0(δ) depend on the particular algorithm, but we now show how an inequality of that type translates into T_0(δ). The next lemma is a consequence of the concavity of t ↦ √(t log t).

Lemma 2. Suppose that t ∈ R verifies the inequality t − C√(t log t) ≤ log(1/δ)/D_µ. Then for T*_δ = log(1/δ)/D_µ,

    t ≤ (log(1/δ)/D_µ) · (1 + C √(log T*_δ / T*_δ)) · 1/(1 − C (1 + log T*_δ)/(2 √(T*_δ log T*_δ))).

3 Practical Implementations

Next we discuss instantiating no-regret learners. We consider a hierarchy of computational oracles:

1. Min aka Best-Response oracle: obtain for any i ∈ I, w ∈ △_K and ξ ∈ Θ^K a minimizer in ¬i of λ ↦ Σ_{k ∈ [K]} w^k d(ξ^k, λ^k).
2. Max-min aka Game-Solving oracle: obtain for any i ∈ I and ξ ∈ Θ^K a vector w* ∈ △_K such that there is a Nash equilibrium (w*, q*) ∈ △_K × P(¬i) for the zero-sum game with reward d(ξ^k, λ^k) with the k-player using the mixed strategy w*.
3. Max-max-min oracle: for any confidence region C = [a_1, b_1] × ... × [a_K, b_K], obtain (µ⁺, i⁺, w⁺) with (µ⁺, i⁺) = argmax_{ξ ∈ C, i ∈ I} sup_{w ∈ △_K} inf_{λ ∈ ¬i} Σ_{k=1}^K w^k d(ξ^k, λ^k) and w⁺ a k-player strategy of a Nash equilibrium of the game with reward d(µ⁺ᵏ, λ^k).

For Minimum Threshold all oracles can be evaluated in closed form in O(K) time, and the same is true for Best Response in Best Arm Identification. Max-min for Best Arm requires binary search [13] and Max-max-min requires O(K) max-min calls.
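Fixed-point quantities of the form T_0(δ) = max{t : t ≤ β(t, δ)/D_µ} from Theorem 2 are easy to evaluate numerically: since β grows sublinearly, the inequality, once violated, stays violated, so doubling followed by binary search suffices. A sketch under that assumption (names and the stylised threshold are ours):

```python
import math

def largest_consistent_t(beta, D):
    """Largest integer t with t <= beta(t) / D, assuming the inequality,
    once violated for growing t, stays violated (beta sublinear)."""
    t = 1
    if t > beta(t) / D:
        return 0
    while t <= beta(t) / D:       # doubling phase
        t *= 2
    lo, hi = t // 2, t            # predicate holds at lo, fails at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid <= beta(mid) / D:
            lo = mid
        else:
            hi = mid
    return lo

# stylised threshold from the experiments section, at delta = 0.1
beta = lambda t: math.log((1 + math.log(t)) / 0.1)
```

For example, with D_µ = 0.05 this threshold gives a characteristic time of a few dozen rounds.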
See [24] for run-time data on Track-and-Stop (max-min oracle) and gradient ascent (min oracle) for Best Arm. Our approach also extends naturally to min-max and max-min-max oracles, which we plan to incorporate in full detail in our future work.

3.1 A Learning Algorithm for the k-Player vs Best-Response for the λ-Player

In this section the k-player plays first, employing a regret minimization algorithm for linear losses on the simplex to produce w_t ∈ △_K at time t. We pick AdaHedge of [8], which runs in O(K) per round and adapts to the scale of the losses. The λ-player goes second and can use a zero-regret algorithm: Best-Response. It plays q_t, a Dirac at λ_t ∈ argmin_{λ ∈ ¬i_t} Σ_{k ∈ [K]} w^k_t d(µ̂^k_{t−1}, λ^k).

Lemma 3. AdaHedge has regret R^k_t ≤ √(Σ_{s≤t} b_s² ln K) + max_{s≤t} b_s ((4/3) ln K + 2), where b_s = max_k U^k_s − min_k U^k_s ≤ max{D, f(s)} is the loss scale in round s, so that R^k_t = O(√(t ln K ln t)).

Best-Response has no regret, R^λ_t ≤ 0. The sample complexity is bounded per Theorem 2. We expect that in practice the scale converges to b_s → D_µ after a transitory startup phase.
Computational complexity: one best-response oracle call per time step.

3.2 Learning Algorithms for the λ-Player vs Best Response for the k-Player

Using a learner for the λ-player removes the need for a tracking procedure. In this section the k-player goes second and uses Best-Response, with zero regret, i.e. k_t = argmax_{k ∈ [K]} U^k_t (see Algorithm 1). After playing q_t ∈ P(¬i_t), the λ-player suffers loss E_{λ∼q_t} d(µ̂^{k_t}_{t−1}, λ^{k_t}).
Most existing regret minimization algorithms do not apply since the function λ ↦ d(µ, λ) is not convex in general and the action set ¬i_t is also not convex.
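A compact rendition of the AdaHedge update may clarify the self-tuning mentioned above: the learning rate is set from the cumulative mixability gap accrued so far. This is our own simplified sketch, not the authors' implementation:

```python
import numpy as np

def adahedge(losses):
    """AdaHedge over K experts: eta_t = ln(K) / Delta_{t-1}, where Delta is
    the cumulative mixability gap. Returns the (T, K) sequence of plays."""
    T, K = losses.shape
    L = np.zeros(K)              # cumulative expert losses
    Delta = 1e-12                # tiny init avoids the eta = inf corner case
    plays = np.empty((T, K))
    for t in range(T):
        eta = np.log(K) / Delta
        w = np.exp(-eta * (L - L.min()))
        w /= w.sum()
        plays[t] = w
        ell = losses[t]
        h = float(w @ ell)                                   # Hedge's loss
        s = float(ell.min())                                 # for stability
        m = s - np.log(w @ np.exp(-eta * (ell - s))) / eta   # mix loss
        Delta += max(h - m, 0.0)                             # mixability gap
        L += ell
    return plays
```

In Algorithm 1 the losses fed to this learner would be the negated optimistic payoffs −U^k_t; the sketch keeps the learner generic.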
The challenge is to come up with an algorithm able to play distributions with only access to a best-response oracle.

Follow-The-Perturbed-Leader. Follow-The-Perturbed-Leader can sample points from a distribution on P(¬i) by only using best-response oracle calls on ¬i. The version we use here incorporates all the information available to the λ-player: the loss of λ ∈ ¬i_t will be d(µ̂^{k_t}_{t−1}, λ^{k_t}), where the only unknown quantity is k_t. Let σ_t ∈ R^K be a random vector with independent exponentially distributed coordinates. The idea is that the distribution q_t played by the λ-player should be the distribution of

    argmin_{λ ∈ ¬i_t} Σ_{s=1}^{t−1} d(µ̂^{k_s}_{s−1}, λ^{k_s}) + Σ_{k=1}^K σ^k_t d(µ̂^k_{t−1}, λ^k).

We show in Appendix E.2 that this argmin can be computed by a single best-response oracle call. However, the k-player has to be able to compute the best response to q_t. Since we cannot get the above distribution exactly, we instead take for q_t an empirical distribution from t samples. A regret bound R^λ_t = O(√(t log t)) for that algorithm is in Appendix E.2. The sample complexity is then bounded by Theorem 2.
Computational complexity: t best-response oracle calls at time step t.

Online Gradient Descent. While the learning problem for λ is hard in general, in several common cases the sets ¬i have a simple structure. If these sets are unions of a finite number J of convex sets and λ ↦ d(µ, λ) is convex (i.e. for Gaussian or Bernoulli arm distributions), then we can use off-the-shelf regret algorithms. One gradient descent learner can be used on each convex set, and these J experts are then aggregated by an exponential weights algorithm. This procedure would have O(√t) regret.
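To illustrate the perturbed-leader idea on a concrete case, here is a simplified sampler for the λ-player in Gaussian Best Arm Identification. We freeze the empirical means and summarise the past losses by pull counts, a simplification of the per-round losses above; every name and shortcut here is ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def br_oracle(mu_hat, c, istar, sigma2=1.0):
    """Weighted best response on the Best Arm alternative set (Gaussian arms):
    the minimiser pinches istar and some challenger j at their c-weighted mean."""
    best_val, best_lam = np.inf, None
    for j in range(len(mu_hat)):
        if j == istar:
            continue
        m = (c[istar] * mu_hat[istar] + c[j] * mu_hat[j]) / (c[istar] + c[j])
        v = (c[istar] * (mu_hat[istar] - m) ** 2
             + c[j] * (mu_hat[j] - m) ** 2) / (2 * sigma2)
        if v < best_val:
            lam = np.array(mu_hat, dtype=float)
            lam[istar] = lam[j] = m
            best_val, best_lam = v, lam
    return best_lam

def ftpl_sample(mu_hat, pull_counts, istar, n=200):
    """Approximate q_t by n perturbed-leader draws: add i.i.d. exponential
    noise to the cumulative weights and call the best-response oracle."""
    draws = []
    for _ in range(n):
        sigma = rng.exponential(size=len(mu_hat))
        draws.append(br_oracle(mu_hat, np.asarray(pull_counts, float) + sigma, istar))
    return draws
```

Each draw costs one oracle call, matching the "t best-response oracle calls at time step t" accounting when n grows with t.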
The computational complexity is J (convex) best-response oracle calls per time step.

3.3 Optimistic Track-and-Stop.

At stage t, this algorithm computes (µ⁺, i_t) = argmax_{ξ, i} sup_{w ∈ △_K} inf_{λ ∈ ¬i} Σ_{k=1}^K w^k d(ξ^k, λ^k), where ξ ranges over all points of Θ^K in a confidence region around µ̂_{t−1} and i ∈ I. Then, the k-player plays w_t such that there exists a Nash equilibrium (w_t, q_t) of the game with reward d(µ⁺ᵏ, λ^k). The proof of its sample complexity bound proceeds slightly differently from the sketch of part 2.3, although the ingredients are still the GLRT, concentration, optimism and game-solving. The proof of the following lemma can be found in Appendix E.2.

Lemma 4. Take b = 1 in the definition of f(t). Let h(t) = 2√(t D_µ) + 3L√(2σ² f(t)) (K² + (2√2 + 1/3)√(Kt)) + f(t)(K² + 2K log(t/K)) + KD. Then the expected sample complexity is at most T_0(δ) + 2eK/a², where T_0(δ) is the maximal t ∈ N such that t ≤ (β(t, δ) + h(t))/D_µ.

Note: the K² factors are due to the tracking. We conjecture that they should be K log K instead.
Computational complexity: one max-max-min oracle call per time step.

(a) Best Arm for Bernoulli bandit model µ = (0.3, 0.21, 0.2, 0.19, 0.18). The oracle weights are w* = (0.34, 0.25, 0.18, 0.13, 0.10).
(b) Minimum Threshold for Gaussian bandit model µ = (0.5, 0.6) with threshold γ = 0.6, w* = e_1. Note the excessive sample complexity of T-C/T-D.

Figure 1: Selected experiments. In both cases δ = 0.1.
Plots based on 3000 and 5000 runs.

This algorithm is the most computationally expensive but has the best sample complexity upper bound, has a simpler proof and works well in experiments where computing the max-max-min oracle is feasible, like the Best Arm and Minimum Threshold problems (see Section 4).

4 Experiments

The goal of our experiments is to empirically validate Algorithm 1 on benchmark problems for practical δ. We use the stylised stopping threshold β(δ, t) = ln((1 + ln t)/δ) and exploration bonus f(t) = ln t. Both are unlicensed by theory yet conservative in practice (the error frequency is far below δ). We use the following letter coding to designate sampling rules: D for AdaHedge vs Best-Response as advocated in Section 3.1, T for Track-and-Stop of [13], M for the Gradient Ascent algorithm of [24], O for Optimistic Track-and-Stop from Section 3.3, RR for uniform, and opt for following the oracle proportions w*(µ). We append -C or -D to indicate whether cumulative (N_t ≈ ∑_{s≤t} w_s) or direct (N_t ≈ t w_t) tracking [13] is employed. We also ran all our experiments on a simplification of D that uses a single learner instead of partitioning the rounds according to i_t; we omit it from the results, as it was always within a few percent of D. We finally note that we tune the learning rate of M in terms of (the unknown) D_µ.

We perform two series of experiments, one on Best Arm instances from [13, 24], and one on Minimum Threshold instances from [20]. Two selected experiments are shown in Figure 1, the others are included in Appendix G.
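For concreteness, the stylised threshold above, together with the "practical" complexity proxy solving t = β(t, δ)/D_µ, can be computed as follows (a hypothetical sketch; here D_µ, the value of the optimisation problem at µ, is assumed known, and the function names are ours):

```python
import math

def beta(t, delta):
    """Stylised stopping threshold beta(delta, t) = ln((1 + ln t) / delta);
    a heuristic choice, conservative in practice but unlicensed by theory."""
    return math.log((1.0 + math.log(t)) / delta)

def practical_complexity(D_mu, delta, iters=100):
    """Fixed-point iteration for the time t solving t = beta(t, delta) / D_mu,
    i.e. roughly the first time the GLRT statistic t * D_mu crosses beta.
    The map is a strong contraction for moderate D_mu, so it converges fast."""
    t = 2.0
    for _ in range(iters):
        t = beta(t, delta) / D_mu
    return t
```

For instance, `practical_complexity(D_mu, 0.1)` gives the time indicated by the "practical" marker in Figure 1 for a problem of difficulty D_µ.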
We contrast the empirical sample complexity with the lower bound kl(δ, 1 − δ)/D_µ, and with a more "practical" version, which indicates the time t for which t = β(t, δ)/D_µ; this is, approximately, the first time at which the GLRT stopping rule crosses the threshold β.

We see in Figures 1(a) and 1(b) that direct tracking -D has the advantage over cumulative tracking -C across the board, and that uniform sampling RR is sub-optimal, as expected. In Figure 1(a) we see that T performs best, closely followed by M and O. Sampling from the oracle weights opt performs surprisingly poorly (as also observed in [26, Table 1]). The main message of Figure 1(b) is that T can be highly sub-optimal. We comment on the reason in Appendix G.2. Asymptotic optimality of T implies that this effect disappears as δ → 0. However, for this example this kicks in excruciatingly slowly: Figure 5 shows that T is still not competitive at δ = 10⁻²⁰. On the other hand, O performs best, closely followed by M and then D. Practically, we recommend using O if its computational cost is acceptable, M if an estimate of the problem scale is available for tuning, and D otherwise.

The gap between opt and T (or O) shows that Track-and-Stop outperforms its design motivation. It is an exciting open problem to understand exactly why, and to optimise for stopping early (N_t/t ≈ w*(ˆµ_t)) while ensuring optimality (E_µ[N_τ]/E_µ[τ] ≈ w*(µ)).

5 Conclusion

We leveraged the game point of view of the pure exploration problem, together with the use of the optimism principle, to derive algorithms with sample complexity guarantees for non-asymptotic confidence.
Varying the flavours of optimism and saddle-point strategies leads to procedures with diverse tradeoffs between sample and computational complexities. Our sample complexity bounds attain asymptotic optimality while offering guarantees for moderate confidence, and the obtained algorithms are empirically sound. Our bounds however most probably do not depend optimally on the problem parameters, like the number of arms K. For BAI and the Top-K arms problems, lower bounds with lower-order terms as well as matching algorithms were derived by [26]. A generalization of such lower bounds to the general pure exploration problem could shed light upon the optimal complexity across the full confidence spectrum.
The richness of existing saddle-point iterative algorithms may bring improved performance over our relatively simple choices. A smart algorithm could possibly take advantage of the stochastic nature of the losses instead of treating them as completely adversarial.

Acknowledgements

We are grateful to Zakaria Mhammedi and Emilie Kaufmann for multiple generous discussions. Travel funding was provided by INRIA Associate Team 6PAC. The experiments were carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

[2] Quentin Berthet and Vianney Perchet. Fast rates for bandit optimization with upper-confidence Frank-Wolfe. In Advances in Neural Information Processing Systems (NeurIPS), pages 2222–2231, 2017.

[3] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely armed and continuous armed bandits. Theoretical Computer Science, 412:1832–1852, 2011.

[4] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games.
Cambridge University Press, 2006.

[5] Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Proceedings of the 2017 Conference on Learning Theory (COLT), volume 65 of Proceedings of Machine Learning Research, pages 535–592. PMLR, 2017.

[6] S. Chen, T. Lin, I. King, M. Lyu, and W. Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems (NeurIPS), 2014.

[7] Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning (ICML), pages 521–529, 2014.

[8] Steven de Rooij, Tim van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, Hedge if you must. Journal of Machine Learning Research, 15:1281–1316, April 2014.

[9] Rémy Degenne and Wouter M. Koolen. Pure exploration with multiple correct answers. In Advances in Neural Information Processing Systems (NeurIPS), December 2019.

[10] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In 15th Annual Conference on Learning Theory (COLT), volume 2375 of Lecture Notes in Computer Science, pages 255–270. Springer, 2002.

[11] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.

[12] Aurélien Garivier. Informational confidence bounds for self-normalized averages and applications. In 2013 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2013.

[13] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory (COLT), pages 998–1027, 2016.

[14] Aurélien Garivier, Pierre Ménard, and Laurent Rossi.
Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv preprint arXiv:1711.04454, 2017.

[15] S. Kalyanakrishnan and P. Stone. Efficient selection in multiple bandit arms: Theory and practice. In International Conference on Machine Learning (ICML), 2010.

[16] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In International Conference on Machine Learning (ICML), 2012.

[17] Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, and Zheng Wen. Stochastic rank-1 bandits. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 392–401. PMLR, 2017.

[18] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1–42, 2016.

[19] Emilie Kaufmann and Wouter M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. Preprint, October 2018.

[20] Emilie Kaufmann, Wouter M. Koolen, and Aurélien Garivier. Sequential test for the lowest mean: From Thompson to Murphy sampling. In Advances in Neural Information Processing Systems (NeurIPS) 31, pages 6333–6343, December 2018.

[21] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2019.

[22] Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33rd International Conference on Machine Learning (ICML), volume 48 of JMLR Workshop and Conference Proceedings, pages 1690–1698.
JMLR.org, 2016.

[23] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Conference on Learning Theory (COLT), pages 975–999, 2014.

[24] Pierre Ménard. Gradient ascent for active exploration in bandit problems. arXiv preprint arXiv:1905.08165, 2019.

[25] Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems (NeurIPS), pages 3066–3074, 2013.

[26] Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Conference on Learning Theory (COLT), pages 1794–1834, 2017.

[27] Kazuki Teraoka, Kohei Hatano, and Eiji Takimoto. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions, 97-D(3):392–398, 2014.

[28] Jun-Kun Wang and Jacob D. Abernethy. Acceleration through optimistic no-regret dynamics. In Advances in Neural Information Processing Systems (NeurIPS), pages 3828–3838, 2018.

[29] Yichi Zhou, Jialian Li, and Jun Zhu. Identify the Nash equilibrium in static games with random payoffs. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 4160–4169. PMLR, 2017.