{"title": "Algorithms for Infinitely Many-Armed Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 1729, "page_last": 1736, "abstract": "We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a newly selected arm which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous works. We describe algorithms based on upper-confidence-bounds applied to a restricted set of randomly selected arms and provide upper-bounds on the resulting expected regret. We also derive a lower-bound which matches (up to logarithmic factors) the upper-bound in some cases.", "full_text": "Algorithms for Infinitely Many-Armed Bandits

Yizao Wang*
yizwang@umich.edu
Department of Statistics - University of Michigan
437 West Hall, 1085 South University, Ann Arbor, MI, 48109-1107, USA

Jean-Yves Audibert
audibert@certis.enpc.fr
Université Paris Est, Ecole des Ponts, ParisTech, Certis
& Willow - ENS / INRIA, Paris, France

Rémi Munos
remi.munos@inria.fr
INRIA Lille - Nord Europe, SequeL project,
40 avenue Halley, 59650 Villeneuve d'Ascq, France

Abstract

We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a newly selected arm which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous works. We describe algorithms based on upper-confidence-bounds applied to a restricted set of randomly selected arms and provide upper-bounds on the resulting expected regret.
We also derive a lower-bound which matches (up to a logarithmic factor) the upper-bound in some cases.

1 Introduction

Multi-armed bandit problems describe typical situations where learning and optimization should be balanced in order to achieve good cumulative performance. Usual multi-armed bandit problems (see e.g. [8]) consider a finite number of possible actions (or arms) from which the learner may choose at each iteration. The number of arms is typically much smaller than the number of experiments allowed, so exploration of all possible options is usually performed and combined with exploitation of the apparently best ones.

In this paper, we investigate the case when the number of arms is infinite (or larger than the available number of experiments), which makes the exploration of all the arms an impossible task: if no additional assumption is made, it may be arbitrarily hard to find a near-optimal arm. Here we consider a stochastic assumption on the mean-reward of any newly selected arm. When a new arm k is pulled, its mean-reward µ_k is assumed to be an independent sample from a fixed distribution. Moreover, given the mean-reward µ_k of any arm k, the distribution of the reward is only required to be uniformly bounded and non-negative, without any further assumption. Our assumptions essentially characterize the probability of pulling near-optimal arms. That is, given µ* ∈ [0, 1] the best possible mean-reward and β ≥ 0 a parameter of the mean-reward distribution, the probability that a new arm is ε-optimal is of order ε^β for small ε, i.e. P(µ_k ≥ µ* − ε) = Θ(ε^β) for ε → 0.
Note that we write f(ε) = Θ(g(ε)) for ε → 0 when there exist c_1, c_2, ε_0 > 0 such that for all ε ≤ ε_0, c_1 g(ε) ≤ f(ε) ≤ c_2 g(ε).

*The major part of this work was completed during a research internship at Certis and INRIA SequeL.

Like in multi-armed bandits, this setting exhibits a trade-off between exploitation (selection of the arms that are believed to perform well) and exploration. The exploration takes two forms here: discovery (pulling a new arm that has never been tried before) and sampling (pulling an arm already discovered in order to gain information about its actual mean-reward).

Numerous applications can be found, e.g., in [5]. They include labor markets (a worker has many opportunities for jobs), mining for valuable resources (such as gold or oil) when there are many areas available for exploration (the miner can move to another location or continue in the same location, depending on results), and path planning under uncertainty, in which the path planner has to decide among a route that has proved to be efficient in the past (exploitation), a known route that has not been explored many times (sampling), or a brand new route that has never been tried before (discovery).

Let k_t denote the arm selected by our algorithm at time t. We define the regret up to time n as R_n = n µ* − Σ_{t=1}^n µ_{k_t}. From the tower rule, E R_n is the expectation of the difference between the rewards we would have obtained by drawing an optimal arm (an arm having a mean-reward equal to µ*) and the rewards we did obtain during the time steps 1, ..., n. Our goal is to design an arm-pulling strategy so as to minimize this regret.

Overview of our results: We write v_n = Õ(u_n) when, for some n_0, C > 0, v_n ≤ C u_n (log(u_n))^2 for all n ≥ n_0. We assume that the rewards of the arms lie in [0, 1].
Our regret bounds depend on whether µ* = 1 or µ* < 1. For µ* = 1, our algorithms are such that E R_n = Õ(n^{β/(1+β)}). For µ* < 1, we have E R_n = Õ(n^{β/(1+β)}) if β > 1, and (only) E R_n = Õ(n^{1/2}) if β ≤ 1. Moreover, we derive the lower-bound: for any β > 0 and µ* ≤ 1, any algorithm satisfies E R_n ≥ C n^{β/(1+β)} for some C > 0. Finally, we propose an algorithm having the anytime property, which is based on an arm-increasing rule.

Our algorithms essentially consist in pulling K different randomly chosen arms, where K is of order n^{β/2} if µ* < 1 and β < 1, and of order n^{β/(1+β)} otherwise, and then running a variant of the UCB (Upper Confidence Bound) algorithm ([3], [2]) on this set of K arms which takes into account the empirical variance of the rewards. This last point is crucial to get the proposed rate for µ* = 1 and β < 1, i.e. in cases where there are many arms with small variance.

Previous works on many-armed bandits: In [5], a specific setting of an infinitely many-armed bandit is considered, namely that the rewards are Bernoulli random variables with parameter p, where p follows a uniform law over a given interval [0, µ*]. All mean-rewards are therefore in [0, µ*]. They proposed three algorithms. (1) The 1-failure strategy plays an arm as long as 1s are received; when a 0 is received, a new arm is played, and this rule is repeated forever. (2) The m-run strategy uses the 1-failure strategy until either m consecutive 1s are received (from the same arm) or m different arms have been played. In the first case, the current arm is played forever. In the second case, the arm that gave the most wins is played for the remaining rounds.
Finally, (3) the m-learning strategy uses the 1-failure strategy during the first m rounds, and for the remaining rounds it chooses the arm that gave the most 1s during the first m rounds.

For µ* = 1, the authors of [5] have shown that the 1-failure strategy, the √n-run strategy, and the log(n)√n-learning strategy have a regret E R_n ≤ 2√n. They also provided a lower bound on the regret of any strategy: E R_n ≥ √(2n). For µ* < 1, the corresponding optimal strategies are the √(nµ*)-run strategy and the √(nµ*) log(nµ*)-learning strategy. All these algorithms require the knowledge of the horizon n of the game. In many applications, it is important to design algorithms having the anytime property, that is, such that the upper bounds on the expected regret E R_n have the same order for all n. Under the same Bernoulli assumption on the reward distributions, such algorithms have been obtained in [9].

In comparison to their setting (the uniform distribution corresponds to β = 1), our upper- and lower-bounds are also of order √n up to a logarithmic factor, and we do not assume that we know exactly the distribution of the mean-reward. However, it is worth noting that the algorithms proposed in [5, 9] heavily depend on the Bernoulli assumption on the rewards and are not easily transposable to general distributions.
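For intuition, the 1-failure strategy of [5] takes only a few lines. The sketch below is ours (not the authors' code), for Bernoulli arms whose means are produced by a caller-supplied sampler:

```python
import random

def one_failure(n, mu_sampler, seed=0):
    """1-failure strategy of [5]: play the current arm while it returns 1s;
    on the first 0, discard it and draw a fresh arm (discovery).

    mu_sampler(rng) returns the (hidden) mean of a freshly drawn Bernoulli arm.
    Returns the total reward collected over n plays.
    """
    rng = random.Random(seed)
    total = 0
    mu = mu_sampler(rng)          # current arm's mean-reward
    for _ in range(n):
        x = 1 if rng.random() < mu else 0
        total += x
        if x == 0:
            mu = mu_sampler(rng)  # failure: move to a brand new arm
    return total
```

With `mu_sampler = lambda rng: rng.random()` (the uniform case, µ* = 1), the realized regret n − total stays of order √n, in line with the 2√n bound quoted above.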
Note also that the Bernoulli assumption is not appropriate for the real problems mentioned above, where the outcomes may take several possible values.

Thus an important aspect of our work, compared to previous many-armed bandits, is that our setting allows general reward distributions for the arms, under a simple assumption on the mean-reward.

2 Main results

In our framework, each arm of a bandit is characterized by the distribution of the rewards (obtained by drawing that arm), and the essential parameter of this distribution is its expectation. Another parameter of interest is the standard deviation: with low variance, poor arms will be easier to spot, while good arms will have a higher probability of not being disregarded at the beginning due to unlucky trials. Drawing an arm is equivalent to drawing a distribution ν of rewards. Let µ = ∫ w ν(dw) and σ² = ∫ (w − µ)² ν(dw) denote the expectation and variance of ν. The quantities µ and σ are random variables. Our assumptions are the following:

(A) Rewards are uniformly bounded: without loss of generality, we assume all rewards are in [0, 1].

(B) The expected reward of a randomly drawn arm satisfies: there exist µ* ∈ (0, 1] and β > 0 such that

P{µ > µ* − ε} = Θ(ε^β), for ε → 0.    (1)

(C) There is a function V : [0, 1] → R such that P{σ² ≤ V(µ* − µ)} = 1.

The key assumption here is (B). It gives us (the order of) the number of arms that need to be drawn before finding an arm that is ε-close to the optimum¹ (i.e., an arm for which µ ≥ µ* − ε).
Assumption (B) implies that there exist positive constants c_1 and c_2 such that for any ε ∈ [0, µ*], we have²

c_1 ε^β ≤ P{µ > µ* − ε} ≤ P{µ ≥ µ* − ε} ≤ c_2 ε^β.    (2)

For example, the uniform distribution on (0, µ*) satisfies Condition (1) with β = 1.

Assumption (C) always holds for V(u) = µ*(1 − µ* + u) (since Var W ≤ E W (1 − E W) when W ∈ [0, 1]). However, it is convenient when the near-optimal arms have low variance (for instance, this happens when µ* = 1).

Let X_{k,1}, X_{k,2}, ... denote the rewards obtained when pulling arm k. These are i.i.d. random variables with common expected value denoted µ_k. Let X̄_{k,s} ≜ (1/s) Σ_{j=1}^s X_{k,j} and V_{k,s} ≜ (1/s) Σ_{j=1}^s (X_{k,j} − X̄_{k,s})² be the empirical mean and variance associated with the first s draws of arm k. Let T_k(t) denote the number of times arm k is chosen by the policy during the first t plays.

We will use as a subroutine of our algorithms the following version of the UCB (Upper Confidence Bound) algorithm introduced in [2]. Let (E_t)_{t≥0} be a nondecreasing sequence of nonnegative real numbers. It will be referred to as the exploration sequence, since the larger it is, the more UCB explores. For any arm k and nonnegative integers s, t, introduce

B_{k,s,t} ≜ X̄_{k,s} + √(2 V_{k,s} E_t / s) + 3 E_t / s,    (3)

with the convention 1/0 = +∞. Define the UCB-V (for Variance estimate) policy:

UCB-V policy for a set K of arms:
At time t, play an arm in K maximizing B_{k,T_k(t−1),t}.

From [2, Theorem 1], the main property of B_{k,s,t} is that, with probability at least 1 − 5(log t)e^{−E_t/2}, for any s ∈ [0, t] we have µ_k ≤ B_{k,s,t}. So provided that E_t is large, B_{k,T_k(t−1),t} is an observable quantity at time t which upper bounds µ_k with high probability.
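As a concrete illustration, the index (3) can be written in a few lines of Python (a minimal sketch; the function name and example values are ours, not from the paper):

```python
import math

def ucb_v_index(emp_mean, emp_var, s, E_t):
    """UCB-V index B_{k,s,t} = mean + sqrt(2*V*E_t/s) + 3*E_t/s, cf. Eq. (3).

    s is the number of draws of the arm so far.  With the convention
    1/0 = +infinity, an arm that has never been drawn has an infinite index,
    so it is always preferred to an already-sampled arm.
    """
    if s == 0:
        return math.inf
    return emp_mean + math.sqrt(2.0 * emp_var * E_t / s) + 3.0 * E_t / s
```

Note that an arm with zero empirical variance is still explored through the 3E_t/s term; it is the variance-dependent first bonus term that shrinks fast for low-variance arms, which is what makes the index profitable when near-optimal arms have small variance.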
We consider a nondecreasing sequence (E_t) in order that these bounds hold with probability increasing with time. This ensures that the low-probability event that the algorithm concentrates its draws on suboptimal arms has a probability decreasing with time.

2.1 UCB revisited for the infinitely many-armed bandit

When the number of arms of the bandit is greater than the total number of plays, it makes no sense to apply the UCB-V algorithm (or other variants of UCB [3]), since its first step is to draw each arm once (to have B_{k,T_k(t−1),t} finite). A more meaningful and natural approach is to decide at the beginning that only K arms will be investigated in the entire experiment. The number K should be sufficiently small with respect to n (the total number of plays): in this way we have fewer plays on bad arms, and most of the plays will be on the best of the K arms. The number K should not be too small either, since we want the best of the K arms to have an expected reward close to that of the best possible arm.

¹ Precise computations lead to a number of order ε^{−β}, up to possibly a logarithmic factor.
² Indeed, (1) implies that for some 0 < c'_1 < c'_2, there exists 0 < ε_0 < µ* such that for any ε ≤ ε_0, c'_1 ε^β ≤ P{µ > µ* − ε} ≤ P{µ ≥ µ* − ε} ≤ c'_2 ε^β. Then one may take c_1 = c'_1 ε_0^β and c_2 = max(ε_0^{−β}, c'_2).

It is shown in [2, Theorem 4] that in the multi-armed bandit, taking too small an exploration sequence (e.g. such that E_t ≤ (1/2) log t) might lead to polynomial regret (instead of logarithmic for e.g. E_t = 2 log t) in a simple 2-armed bandit problem.
However, we will show that this is not the case in the infinitely many-armed bandit, where one may (and should) take much smaller exploration sequences (typically of order log log t). The reason for this phenomenon is that in this setting there are typically many near-optimal arms, so the subroutine UCB-V may miss some good arms (by unlucky trials) without being hurt: there are many other near-optimal arms to discover! This illustrates a trade-off between the two aspects of exploration: sampling the current, not well-known, arms or discovering new arms.

We will start our analysis by considering the following UCB-V(∞) algorithm:

UCB-V(∞) algorithm: given the parameter K and the exploration sequence (E_t):
• Randomly choose K arms,
• Run the UCB-V policy on the set of the K selected arms.

Theorem 1 If the exploration sequence satisfies 2 log(10 log t) ≤ E_t ≤ log t, then for n ≥ 2 and K ≥ 2 the expected regret of the UCB-V(∞) algorithm satisfies

E R_n ≤ C { n (log K) K^{−1/β} + K (log n) E[ (V(Δ)/Δ + 1) ∧ (nΔ) ] },    (4)

where Δ = µ* − µ, with µ the random variable corresponding to the expected reward of an arm sampled from the pool, and where C is a positive constant depending only on c_1 and β (see (2)).

Proof: The UCB-V(∞) algorithm has two steps: randomly choose K arms and run a UCB subroutine on the selected arms. The first part of the proof studies what happens during the UCB subroutine, that is, conditionally on the arms that have been randomly chosen during the first step of the algorithm. In particular, we consider in the following that µ_1, ..., µ_K are fixed. From the equality (obtained using Wald's theorem)

E R_n = Σ_{k=1}^K E[T_k(n)] Δ_k,    (5)

with Δ_k = µ* − µ_k, it suffices to bound E T_k(n). The proof is inspired by those of Theorems 2 and 3 in [2].
The novelty of the following lemma is to include the product of probabilities in the last term of the right-hand side. This enables us to incorporate the idea that if there are a lot of near-optimal arms, it is very unlikely that suboptimal arms are often drawn.

Lemma 1 For any real number τ and any positive integer u, we have

E T_k(n) ≤ u + Σ_{t=u+1}^n Σ_{s=u}^t P(B_{k,s,t} > τ) + Σ_{t=u+1}^n Π_{k'≠k} P(∃ s' ∈ [0, t], B_{k',s',t} ≤ τ),    (6)

where the expectations and probabilities are conditioned on the set of selected arms.

Proof: We have T_k(n) − u ≤ Σ_{t=u+1}^n Z_k(u, t), where Z_k(u, t) = 1_{I_t = k; T_k(t) > u}. We have

Z_k(u, t) ≤ 1_{∀ k'≠k, B_{k,T_k(t−1),t} ≥ B_{k',T_{k'}(t−1),t}; T_k(t−1) ≥ u}
≤ 1_{∃ s ∈ [u, t], B_{k,s,t} > τ} + 1_{∀ k'≠k, ∃ s' ∈ [0, t], B_{k',s',t} ≤ τ},

where the last inequality holds since, if the two terms in the last sum are equal to zero, then there exists k' ≠ k such that for any s' ∈ [0, t] and any s ∈ [u, t], B_{k',s',t} > τ ≥ B_{k,s,t}. Taking the expectation of both sides, using a union bound and the independence between rewards obtained from different arms, we obtain Lemma 1. □

Now we use Inequality (6) with τ = (µ* + µ_k)/2 = µ_k + Δ_k/2 = µ* − Δ_k/2, and u the smallest integer larger than 32(σ_k²/Δ_k² + 1/Δ_k) log n. These choices are made to ensure that the probabilities in the r.h.s. of (6) are small.
Precisely, for any s ≥ u and t ≤ n, we have

√(2[σ_k² + Δ_k/4] E_t / s) + 3 E_t / s ≤ √(2[σ_k² + Δ_k/4](log n)/u) + 3(log n)/u
≤ (Δ_k/4) [ √((σ_k² + Δ_k/4)/(σ_k² + Δ_k)) + (3/8) Δ_k/(σ_k² + Δ_k) ] ≤ Δ_k/4,

where the last inequality holds since it is equivalent to (x − 1)² ≥ 0 for x = √((σ_k² + Δ_k/4)/(σ_k² + Δ_k)). Thus:

P(B_{k,s,t} > τ) ≤ P( X̄_{k,s} + √(2 V_{k,s} E_t / s) + 3 E_t / s > µ_k + Δ_k/2 )
≤ P( X̄_{k,s} + √(2[σ_k² + Δ_k/4] E_t / s) + 3 E_t / s > µ_k + Δ_k/2 ) + P( V_{k,s} ≥ σ_k² + Δ_k/4 )
≤ P( X̄_{k,s} − µ_k > Δ_k/4 ) + P( (1/s) Σ_{j=1}^s (X_{k,j} − µ_k)² − σ_k² ≥ Δ_k/4 )
≤ 2 e^{−s Δ_k²/(32σ_k² + 8Δ_k/3)},    (7)

where in the last step we used Bernstein's inequality twice. Summing up, we obtain

Σ_{s=u}^t P(B_{k,s,t} > τ) ≤ 2 Σ_{s=u}^∞ e^{−s Δ_k²/(32σ_k² + 8Δ_k/3)} = 2 e^{−u Δ_k²/(32σ_k² + 8Δ_k/3)} / (1 − e^{−Δ_k²/(32σ_k² + 8Δ_k/3)})
≤ (80σ_k²/Δ_k² + 7/Δ_k) e^{−u Δ_k²/(32σ_k² + 8Δ_k/3)} ≤ (80σ_k²/Δ_k² + 7/Δ_k) n^{−1},    (8)

where we have used that 1 − e^{−x} ≥ 4x/5 for 0 ≤ x ≤ 3/8. Now let us bound the product of probabilities in (6).
Since τ = µ* − Δ_k/2, we have

Π_{k'≠k} P(∃ s ∈ [0, t], B_{k',s,t} ≤ τ) ≤ Π_{k': µ_{k'} > µ* − Δ_k/2} P(∃ s ∈ [0, t], B_{k',s,t} < µ_{k'}).

Now from [2, Theorem 1], with probability at least 1 − 5(log t)e^{−E_t/2}, for any s ∈ [0, t] we have µ_k ≤ B_{k,s,t}. For E_t ≥ 2 log(10 log t), this gives P(∃ s ∈ [0, t], B_{k',s,t} < µ_{k'}) ≤ 1/2. Putting together the bounds on the different terms of (6) leads to

E T_k(n) ≤ 1 + 32(σ_k²/Δ_k² + 1/Δ_k) log n + (80σ_k²/Δ_k² + 7/Δ_k) + n 2^{−N_{Δ_k}},    (9)

with N_{Δ_k} the cardinality of {k' ∈ {1, ..., K} : µ_{k'} > µ* − Δ_k/2}. Since Δ_k ≤ µ* ≤ 1 and T_k(n) ≤ n, the previous inequality can be simplified into

E T_k(n) ≤ { [50(σ_k²/Δ_k² + 1/Δ_k) log n] ∧ n } + n 2^{−N_{Δ_k}}.

Here, for the sake of simplicity, we are not interested in having tight constants. From here on, we will take the expectations with respect to all sources of randomness, that is, including the one coming from the first step of UCB-V(∞). The quantities Δ_1, ..., Δ_K are i.i.d. random variables satisfying 0 ≤ Δ_k ≤ µ* and P(Δ_k ≤ ε) = Θ(ε^β). The quantities σ_1, ..., σ_K are i.i.d. random variables satisfying almost surely σ_k² ≤ V(Δ_k).
From (5) and (9), we have

E R_n = K E{T_1(n) Δ_1} ≤ K E{ [50(V(Δ_1)/Δ_1 + 1) log n] ∧ (nΔ_1) + n Δ_1 2^{−N_{Δ_1}} }.    (10)

Let p denote the probability that the expected reward µ of a randomly drawn arm satisfies µ > µ* − δ/2 for a given δ. Conditioning on Δ_1 = δ, the quantity N_{Δ_1} follows a binomial distribution with parameters K − 1 and p, hence E(2^{−N_{Δ_1}} | Δ_1 = δ) = (1 − p + p/2)^{K−1}. By using (2), we get

E{Δ_1 2^{−N_{Δ_1}}} = E{Δ_1 (1 − P(µ > µ* − Δ_1/2)/2)^{K−1}} ≤ E χ(Δ_1),

with χ(u) = u(1 − c_3 u^β)^{K−1} and c_3 = c_1/2^β. We have χ'(u) = (1 − c_3 u^β)^{K−2} [1 − c_3(1 + (K−1)β) u^β], so that χ(u) ≤ χ(u_0) with u_0 = [c_3(1 + (K−1)β)]^{−1/β} and χ(u_0) = (1 − 1/(1 + (K−1)β))^{K−1} [c_3(1 + (K−1)β)]^{−1/β} ≤ C' K^{−1/β}, for C' a positive constant depending only on c_1 and β. For any u_1 ∈ [u_0, µ*], we have

E χ(Δ_1) ≤ χ(u_0) P(Δ_1 ≤ u_1) + χ(u_1) P(Δ_1 > u_1) ≤ χ(u_0) P(Δ_1 ≤ u_1) + χ(u_1).

Let us take u_1 = C''(log K / K)^{1/β}, for C'' a positive constant depending on c_1 and β sufficiently large to ensure u_1 ≥ u_0 and χ(u_1) ≤ K^{−1−1/β}. We obtain E χ(Δ_1) ≤ C K^{−1/β} (log K)/K for an appropriate constant C depending on c_1 and β. Putting this into (10), we obtain the result of Theorem 1. □

The r.h.s. of Inequality (4) contains two terms. The first term is the bias: when we randomly draw K arms, the expected reward of the best drawn arm is Õ(K^{−1/β})-optimal.
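The maximization of χ used in the proof above can be sanity-checked numerically. The snippet below is our own check (the values of c_3, β and K are arbitrary illustrative choices, not from the paper):

```python
def chi(u, c3, beta, K):
    """chi(u) = u * (1 - c3*u^beta)^(K-1), the bound on E[Delta_1 2^(-N_Delta_1)]."""
    return u * (1.0 - c3 * u ** beta) ** (K - 1)

def u_star(c3, beta, K):
    """Stationary point u_0 = [c3*(1 + (K-1)*beta)]^(-1/beta), where chi'(u_0) = 0."""
    return (c3 * (1.0 + (K - 1) * beta)) ** (-1.0 / beta)
```

For c_3 = 0.5, β = 1, K = 10, the stationary point is u_0 = 1/(0.5 · 10) = 0.2, and a grid search confirms χ(u) ≤ χ(u_0) on (0, 1).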
So the best algorithm, once the K arms are fixed, will yield a regret Õ(nK^{−1/β}). The second term is the estimation term: it indicates the difference between the UCB subroutine's performance and that of the best drawn arm.

2.2 Strategy for fixed play number

Assume that we know in advance the total number of plays n and the value of β. In this case, one can use the UCB-V(∞) algorithm with parameter K of the order of the minimizer of the r.h.s. of Inequality (4). This leads to the following UCB-F (for Fixed horizon) algorithm.

UCB-F (fixed horizon): given the total number of plays n, and the parameters µ* and β of (1):
• Choose K arms, with K of order n^{β/2} if β < 1 and µ* < 1, and of order n^{β/(β+1)} otherwise (i.e. if µ* = 1 or β ≥ 1),
• Run the UCB-V algorithm on the K chosen arms with an exploration sequence satisfying

2 log(10 log t) ≤ E_t ≤ log t.    (11)

Theorem 2 For any n ≥ 2, the expected regret of the UCB-F algorithm satisfies

E R_n ≤ C(log n)√n   if β < 1 and µ* < 1,
E R_n ≤ C(log n)²√n   if β = 1 and µ* < 1,
E R_n ≤ C(log n) n^{β/(1+β)}   otherwise (i.e. if µ* = 1 or β > 1),    (12)

with C a constant depending only on c_1, c_2 and β (see (2)).

Proof: The result follows from Theorem 1 by bounding the expectation E = E[(V(Δ)/Δ + 1) ∧ (nΔ)]. First, as mentioned before, Assumption (C) is satisfied for V(Δ) = µ*(1 − µ* + Δ). So for µ* = 1 and this choice of function V, we have E ≤ 2. For µ* < 1, since Δ ≤ µ*, we have E ≤ E Ψ(Δ) with Ψ(t) = (2µ*/t) ∧ (nt). The function Ψ is continuous and differentiable by parts.
Using Fubini\u2019s\ntheorem and Inequality (2), we have\nE\u03a8(\u2206) = \u03a8(\u00b5\u2217) \u2212 ER \u00b5\u2217\n\n1\u2212\u03b2\n\n2\n\n1\u2212\u03b2\n\n0 \u03a80(t)P(\u2206 \u2264 t)dt\nif \u03b2 < 1\nif \u03b2 = 1\nif \u03b2 > 1\n\n.\n\nPutting these bounds in Theorem 1, we get\n\n2\n\n2 + 2(1+\u03b2)/2c2\nn\n2 + c2 log(n/2)\n2 + 2c2\n\u03b2\u22121\n\n\u2206 \u03a80(t)dt = \u03a8(\u00b5\u2217) \u2212R \u00b5\u2217\nt2 c2t\u03b2dt \u2264\uf8f1\uf8f4\uf8f2\n\u2264 2 +R 1\u221a2/n\n\uf8f4\uf8f3\nCn(log K)nK \u22121/\u03b2 + (log n)Kn\nCn(log K)nK \u22121/\u03b2 + (log n)2Ko\nCn(log K)nK \u22121/\u03b2 + (log n)Ko\n\n1\u2212\u03b2\n\n2 o if \u03b2 < 1 and \u00b5\u2217 < 1\n\nif \u03b2 = 1 and \u00b5\u2217 < 1\n\notherwise: \u00b5\u2217 = 1 or \u03b2 > 1\n\nERn \u2264\n\n\uf8f1\uf8f4\uf8f4\uf8f4\uf8f2\n\uf8f4\uf8f4\uf8f4\uf8f3\n\nwith C a constant only depending on c1, c2 and \u03b2. The number K of selected arms in UCB-F is\ntaken of the order of the minimizer of these bounds up to a logarithmic factor. (cid:3)\n\nTheorem 2 makes no difference between a logarithmic exploration sequence and an iterated loga-\nrithmic exploration sequence. However in practice, it is clearly better to take an iterated logarithmic\nexploration sequence, for which the algorithm spends much less time on exploring all suboptimal\narms. 
For the sake of simplicity, we have fixed the constants in (11). It is easy to check that for E_t = ζ log t and ζ ≥ 1, Inequality (12) still holds, but with a constant C depending linearly on ζ. Theorem 2 shows that when µ* = 1 or β ≥ 1, the bandit subroutine takes no time in spotting near-optimal arms (the use of the UCB-V algorithm with its variance estimate is crucial for this), whereas for β < 1 and µ* < 1, which means a lot of near-optimal arms with possibly high variances, the bandit subroutine has difficulties in achieving low regret.

The next theorem shows that our regret upper bounds are optimal up to logarithmic terms, except in the case β < 1 and µ* < 1. We do not know whether the rate O(n^{1/2} log n) for β < 1 and µ* < 1 is improvable. This remains an open problem.

Theorem 3 For any β > 0 and µ* ≤ 1, any algorithm suffers a regret larger than c n^{β/(1+β)} for some small enough constant c depending on c_2 and β.

Sketch of proof: If we want a regret smaller than c n^{β/(1+β)}, most draws must be on an arm having an individual regret smaller than ε = c n^{−1/(1+β)}. To find such an arm, we need to try a number of arms larger than C'ε^{−β} = C'c^{−β} n^{β/(1+β)} for some C' > 0 depending on c_2 and β. Since these arms are drawn at least once, and since most of these arms give a constant regret, this leads to a regret larger than C''c^{−β} n^{β/(1+β)}, with C'' depending on c_2 and β. For c small enough, this contradicts the assumption that the regret is smaller than c n^{β/(1+β)}. So it is not possible to improve on the n^{β/(1+β)} rate. □

2.3 Strategy for unknown play number

To apply the UCB-F algorithm, we need to know the total number of plays n, and we choose the corresponding K arms before starting.
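As an illustration of the fixed-horizon strategy, here is a minimal self-contained simulation of UCB-V run on K randomly drawn arms, in the uniform case of [5] (mean-rewards uniform on (0, 1), so µ* = 1 and β = 1, and K is taken of order n^{1/2}), with Bernoulli rewards and an iterated-logarithmic exploration sequence. This is our own sketch, not the authors' code:

```python
import math
import random

def ucb_v_inf(n, K, seed=0):
    """UCB-V(inf)/UCB-F sketch: draw K random arms, run UCB-V for n plays.

    Arms: mean-reward mu_k ~ Uniform(0, 1) (so beta = 1, mu* = 1),
    rewards Bernoulli(mu_k).  Returns the pseudo-regret n*mu* - sum mu_{k_t}.
    """
    rng = random.Random(seed)
    mus = [rng.random() for _ in range(K)]   # discovery step
    counts = [0] * K                         # T_k(t)
    sums = [0.0] * K                         # cumulated rewards per arm
    sumsq = [0.0] * K                        # cumulated squared rewards per arm
    pseudo_reward = 0.0
    for t in range(1, n + 1):
        E_t = 2.0 * math.log(10.0 * math.log(max(t, 2)))  # iterated-log exploration
        best_k, best_b = 0, -math.inf
        for k in range(K):                   # arm maximizing B_{k,T_k(t-1),t}
            if counts[k] == 0:
                b = math.inf                 # convention 1/0 = +inf
            else:
                m = sums[k] / counts[k]
                v = max(sumsq[k] / counts[k] - m * m, 0.0)
                b = m + math.sqrt(2.0 * v * E_t / counts[k]) + 3.0 * E_t / counts[k]
            if b > best_b:
                best_k, best_b = k, b
        x = 1.0 if rng.random() < mus[best_k] else 0.0
        counts[best_k] += 1
        sums[best_k] += x
        sumsq[best_k] += x * x
        pseudo_reward += mus[best_k]
    return n * 1.0 - pseudo_reward
```

For n = 2000 and K = 45 ≈ √n, the pseudo-regret stays far below what a uniformly random policy would suffer (about n/2 here), in line with the Õ(√n) rate of Theorem 2 for µ* = 1, β = 1.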
When n is unknown ahead of time, we propose here an anytime algorithm with a simple and reasonable way of choosing K: adding a new arm from time to time to the set of sampled arms. Let K_n denote the number of arms played up to time n. We set K_0 = 0. We define the UCB-AIR (for Arm-Increasing Rule) algorithm:

UCB-AIR (Arm-Increasing Rule): given the parameters µ* and β of (1):
• At time n, try a new arm if
  K_{n−1} < n^{β/2}   if β < 1 and µ* < 1,
  K_{n−1} < n^{β/(β+1)}   otherwise (i.e. if µ* = 1 or β ≥ 1),
• Otherwise, apply UCB-V on the K_{n−1} drawn arms with an exploration sequence satisfying

2 log(10 log t) ≤ E_t ≤ log t.    (13)

This arm-increasing rule makes our algorithm applicable to the anytime problem. This is a more reasonable approach in practice than restarting-based algorithms like the ones using the doubling trick (see e.g. [4, Section 5.3]). Our second main result shows that the UCB-AIR algorithm has the same properties as the UCB-F algorithm (proof omitted from this extended abstract).

Theorem 4 For any horizon n ≥ 2, the expected regret of the UCB-AIR algorithm satisfies

E R_n ≤ C(log n)²√n   if β < 1 and µ* < 1,
E R_n ≤ C(log n)² n^{β/(1+β)}   otherwise (i.e. if µ* = 1 or β ≥ 1),

with C a constant depending only on c_1, c_2 and β (see (2)).

3 Comparison with continuum-armed bandits and conclusion

In continuum-armed bandits (see e.g. [1, 6, 4]), an infinity of arms is also considered. The arms lie in some Euclidean (or metric) space, and their mean-reward is a deterministic and smooth (e.g. Lipschitz) function of the arms. This setting is different from ours, since our assumption is stochastic and does not consider regularities of the mean-reward with respect to the arms.
However, if we choose an arm-pulling strategy which consists in selecting the arms randomly, then our setting encompasses continuum-armed bandits. For example, consider the domain [0, 1]^d and a mean-reward function µ assumed to be locally equivalent to a Hölder function (of order α ∈ [0, +∞)) around any maximum x* (the number of maxima is assumed to be finite), i.e.

µ(x*) − µ(x) = Θ(‖x* − x‖^α) when x → x*.    (14)

Pulling an arm X randomly according to the Lebesgue measure on [0, 1]^d, we have P(µ(X) > µ* − ε) = Θ(P(‖X − x*‖^α < ε)) = Θ(ε^{d/α}) for ε → 0. Thus our assumption (1) holds with β = d/α, and our results say that if µ* = 1, we have E R_n = Õ(n^{β/(1+β)}) = Õ(n^{d/(α+d)}).

For d = 1, under the assumption that µ is α-Hölder (i.e. |µ(x) − µ(y)| ≤ c‖x − y‖^α for 0 < α ≤ 1), [6] provides upper- and lower-bounds on the regret: R_n = Θ(n^{(α+1)/(2α+1)}). Our results give E R_n = Õ(n^{1/(α+1)}), which is better for all values of α. The reason for this apparent contradiction is that the lower bound in [6] is obtained by the construction of a very irregular function, which actually does not satisfy our local assumption (14).

Now, under assumption (14) for any α > 0 (around a finite set of maxima), [4] provides the rate E R_n = Õ(√n). Our result gives the same rate when µ* < 1, but in the case µ* = 1 we obtain the improved rate E R_n = Õ(n^{1/(α+1)}), which is better whenever α > 1 (because we are able to exploit the low variance of the good arms). Note that, like our algorithm, the algorithms in [4] and [6] do not make an explicit use (in the procedure) of the smoothness of the function.
These algorithms just use a 'uniform' discretization of the domain.

On the other hand, the zooming algorithm of [7] adapts to the smoothness of µ (more arms are sampled in areas where µ is high). For any dimension d, they obtain ERn = Õ(n^{(d0+1)/(d0+2)}), where d0 ≤ d is their 'zooming dimension'. Under assumption (14), we deduce d0 = ((α − 1)/α) d using the Euclidean distance as metric, thus their regret is ERn = Õ(n^{(d(α−1)+α)/(d(α−1)+2α)}). For locally quadratic functions (i.e. α = 2), their rate is Õ(n^{(d+2)/(d+4)}), whereas ours is Õ(n^{d/(2+d)}). Again, we have a smaller regret, although we do not use the smoothness of µ in our algorithm. Here the reason is that the zooming algorithm does not make full use of the fact that the function is locally quadratic (it considers a Lipschitz property only). However, in the case α < 1, our rates are worse than those of algorithms specifically designed for continuum-armed bandits.

Hence, the comparison between the many-armed and continuum-armed bandit settings is not easy because of the difference in nature of the basic assumptions. Our setting is an alternative to the continuum-armed bandit setting which does not require the existence of an underlying metric space in which the mean-reward function would be smooth. Our assumption (1) naturally deals with possibly very complicated functions whose maxima may be located in any part of the space.
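The rate comparison with the zooming algorithm reduces to comparing exponents, which can be checked mechanically (the helper names below are ours, not from the paper; the zooming formula is used only for α ≥ 1, where d0 ≥ 0):

```python
def zooming_exponent(d, alpha):
    """Exponent e in the zooming algorithm's rate ~O(n ** e) under (14),
    using d0 = ((alpha - 1) / alpha) * d (valid for alpha >= 1)."""
    d0 = (alpha - 1.0) / alpha * d
    return (d0 + 1.0) / (d0 + 2.0)

def our_exponent(d, alpha):
    """Exponent e in our rate ~O(n ** e) for mu* = 1:
    beta = d / alpha and e = beta / (1 + beta) = d / (alpha + d)."""
    return d / (alpha + d)
```

For α = 2 and any d ≥ 1 these evaluate to (d+2)/(d+4) versus d/(d+2), and the latter is strictly smaller, as claimed in the comparison above.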
For continuum-armed bandit problems where there are relatively many near-optimal arms, our algorithm will also be competitive with the specifically designed continuum-armed bandit algorithms. This result matches the intuition that in such cases a random selection strategy will perform well.

To conclude, our contributions are: (i) Compared to previous results on many-armed bandits, our setting allows general mean-reward distributions for the arms, under a simple assumption on the probability of pulling near-optimal arms. (ii) We show that, for infinitely many-armed bandits, we need much less exploration of each arm than for finite-armed bandits (the log term may be replaced by log log). (iii) Our variant of the UCB algorithm, making use of the variance estimate, enables us to obtain higher rates in cases where the variance of the near-optimal arms is small. (iv) We propose the UCB-AIR algorithm, which is anytime, taking advantage of an arm-increasing rule. (v) We provide a lower-bound matching the upper-bound (up to a logarithmic factor) in the case β ≥ 1 or µ∗ = 1.

References

[1] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926–1951, 1995.

[2] J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning bandit algorithms in stochastic environments. In M. Hutter, R. A. Servedio, and E. Takimoto, editors, ALT, volume 4754 of Lecture Notes in Computer Science, pages 150–165. Springer, 2007.

[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002.

[4] P. Auer, R. Ortner, and C. Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th COLT, San Diego, CA, USA, 2007.

[5] D. A. Berry, R. W. Chen, A. Zame, D. C. Heath, and L. A. Shepp. Bandit problems with infinitely many arms.
The Annals of Statistics, 25(5):2103–2116, 1997.

[6] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In NIPS, 2004.

[7] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandit problems in metric spaces. In Proceedings of the 40th ACM Symposium on Theory of Computing, 2008.

[8] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.

[9] O. Teytaud, S. Gelly, and M. Sebag. Anytime many-armed bandit. Conférence francophone sur l'Apprentissage automatique (CAp), Grenoble, France, 2007.