{"title": "Optimistic Gittins Indices", "book": "Advances in Neural Information Processing Systems", "page_first": 3153, "page_last": 3161, "abstract": "Starting with the Thomspon sampling algorithm, recent years have seen a resurgence of interest in Bayesian algorithms for the Multi-armed Bandit (MAB) problem. These algorithms seek to exploit prior information on arm biases and while several have been shown to be regret optimal, their design has not emerged from a principled approach. In contrast, if one cared about Bayesian regret discounted over an infinite horizon at a fixed, pre-specified rate, the celebrated Gittins index theorem offers an optimal algorithm. Unfortunately, the Gittins analysis does not appear to carry over to minimizing Bayesian regret over all sufficiently large horizons and computing a Gittins index is onerous relative to essentially any incumbent index scheme for the Bayesian MAB problem. The present paper proposes a sequence of 'optimistic' approximations to the Gittins index. We show that the use of these approximations in concert with the use of an increasing discount factor appears to offer a compelling alternative to a variety of index schemes proposed for the Bayesian MAB problem in recent years. In addition, we show that the simplest of these approximations yields regret that matches the Lai-Robbins lower bound, including achieving matching constants.", "full_text": "Optimistic Gittins Indices\n\nOperations Research Center, MIT\n\nMIT Sloan School of Management\n\nVivek F. Farias\n\nCambridge, MA 02142\n\nvivekf@mit.edu\n\nEli Gutin\n\nCambridge, MA 02142\n\ngutin@mit.edu\n\nAbstract\n\nStarting with the Thomspon sampling algorithm, recent years have seen a resur-\ngence of interest in Bayesian algorithms for the Multi-armed Bandit (MAB) prob-\nlem. These algorithms seek to exploit prior information on arm biases and while\nseveral have been shown to be regret optimal, their design has not emerged from a\nprincipled approach. 
In contrast, if one cared about Bayesian regret discounted over an infinite horizon at a fixed, pre-specified rate, the celebrated Gittins index theorem offers an optimal algorithm. Unfortunately, the Gittins analysis does not appear to carry over to minimizing Bayesian regret over all sufficiently large horizons and computing a Gittins index is onerous relative to essentially any incumbent index scheme for the Bayesian MAB problem.\n\nThe present paper proposes a sequence of ‘optimistic’ approximations to the Gittins index. We show that the use of these approximations in concert with the use of an increasing discount factor appears to offer a compelling alternative to state-of-the-art index schemes proposed for the Bayesian MAB problem in recent years by offering substantially improved performance with little to no additional computational overhead. In addition, we prove that the simplest of these approximations yields frequentist regret that matches the Lai-Robbins lower bound, including achieving matching constants.\n\n1 Introduction\n\nThe multi-armed bandit (MAB) problem is perhaps the simplest example of a learning problem that exposes the tension between exploration and exploitation. Recent years have seen a resurgence of interest in Bayesian MAB problems wherein we are endowed with a prior on arm rewards, and a number of policies that exploit this prior have been proposed and/or analyzed. These include Thompson Sampling [20], Bayes-UCB [12], KL-UCB [9], and Information Directed Sampling [19]. The ultimate motivation for these algorithms appears to be two-fold: superior empirical performance and light computational burden. The strongest performance results available for these algorithms establish regret upper bounds that match the Lai-Robbins lower bound [15]. 
Even among this set of recently proposed algorithms, there is a wide spread in empirically observed performance.\n\nInterestingly, the design of the index policies referenced above has been somewhat ad-hoc as opposed to having emerged from a principled analysis of the underlying Markov Decision process. Now if, in contrast to requiring ‘small’ regret for all sufficiently large time horizons, we cared about minimizing Bayesian regret over an infinite horizon, discounted at a fixed, pre-specified rate (or equivalently, maximizing discounted infinite horizon rewards), the celebrated Gittins index theorem provides an optimal, efficient solution. Importing this celebrated result to the fundamental problem of designing algorithms that achieve low regret (either frequentist or Bayesian) simultaneously over all sufficiently large time horizons runs into two substantial challenges:\n\nHigh-Dimensional State Space: Even minor ‘tweaks’ to the discounted infinite horizon objective render the corresponding Markov Decision problem for the Bayesian MAB problem intractable. For instance, it is known that a Gittins-like index strategy is sub-optimal for a fixed horizon [5], let alone the problem of minimizing regret over all sufficiently large horizons.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nComputational Burden: Even in the context of the discounted infinite horizon problem, the computational burden of calculating a Gittins index is substantially larger than that required for any of the index schemes for the multi-armed bandit discussed thus far.\n\nThe present paper attempts to make progress on these challenges. Specifically, we make the following contribution:\n\n• We propose a class of ‘optimistic’ approximations to the Gittins index that can be computed with significantly less effort. 
In fact, the computation of the simplest of these approximations is no more burdensome than the computation of indices for the Bayes UCB algorithm, and several orders of magnitude faster than the nearest competitor, IDS.\n\n• We establish that an arm selection rule that is greedy with respect to the simplest of these optimistic approximations achieves optimal regret in the sense of meeting the Lai-Robbins lower bound (including matching constants) provided the discount factor is increased at a certain rate.\n\n• We show empirically that even the simplest optimistic approximation to the Gittins index proposed here outperforms the state-of-the-art incumbent schemes discussed in this introduction by a non-trivial margin. We view this as our primary contribution; the Bayesian MAB problem is fundamental, making the performance improvements we demonstrate important.\n\nLiterature review Thompson Sampling [20] was proposed as a heuristic for the MAB problem in 1933, but was largely ignored until the last decade. An empirical study by Chapelle and Li [7] highlighted Thompson Sampling’s superior performance and led to a series of strong theoretical guarantees for the algorithm being proved in [2, 3, 12] (for specific cases when Gaussian and Beta priors are used). Recently, these proofs were generalized to the 1D exponential family of distributions in [13]. A few decades after Thompson Sampling was introduced, Gittins [10] showed that an index policy was optimal for the infinite horizon discounted MAB problem. Several different proofs of the optimality of the Gittins index were given in [21, 22, 23, 6]. Inspired by this breakthrough, Lai and Robbins [15, 14], while ignoring the original MDP formulation, proved an asymptotic lower bound on achievable (non-discounted) regret and suggested policies that attained it.\n\nSimple and efficient UCB algorithms were later developed by Agrawal and Auer et al. 
[1, 4], with finite time regret bounds. These were followed by the KL-UCB [9] and Bayes UCB [12] algorithms. The Bayes UCB paper drew attention to how well Bayesian algorithms performed in the frequentist setting. In that paper, the authors also demonstrated that a policy using indices similar to Gittins’ had the lowest regret. The use of Bayesian techniques for bandits was explored further in [19], where the authors propose Information Directed Sampling, an algorithm that exploits complex information structures arising from the prior. There is also a very recent paper, [16], which also focuses on regret minimization using approximated Gittins indices. However, in that paper, the time horizon is assumed to be known and fixed, which is different from the focus in this paper on finding a policy that has low regret over all sufficiently long horizons.\n\n2 Model and Preliminaries\n\nWe consider a multi-armed bandit problem with a finite set of arms A = {1, . . . , A}. Arm i ∈ A, if pulled at time t, generates a stochastic reward X_{i,N_i(t)} where N_i(t) denotes the cumulative number of pulls of arm i up to and including time t. (X_{i,s}, s ∈ N) is an i.i.d. sequence of random variables, each distributed according to p_{θ_i}(·) where θ_i ∈ Θ is a parameter. Denote by θ the tuple of all θ_i. The expected reward from the ith arm is denoted by μ_i(θ_i) := E[X_{i,1} | θ_i]. We denote by μ^*(θ) the maximum expected reward across arms, μ^*(θ) := max_i μ_i(θ_i), and let i^* be an optimal arm. The present paper will focus on the Bayesian setting, and so we suppose that each θ_i is an independent draw from some prior distribution q over Θ. All random variables are defined on a common probability space (Ω, F, P). We define a policy, π := (π_t, t ∈ N), to be a stochastic process taking values in A. 
We require that π be adapted to the filtration F_t generated by the history of arm pulls and their corresponding rewards up to and including time t − 1.\n\nOver time, the agent accumulates rewards, and we denote by\n\nV(π, T, θ) := E[ Σ_{t ≤ T} X_{π_t, N_{π_t}(t)} | θ ]\n\nthe reward accumulated up to time T when using policy π. We write V(π, T) := E[V(π, T, θ)]. The regret of a policy over T time periods, for a specific realization θ ∈ Θ^A, is the expected shortfall against always pulling the optimal arm, namely\n\nRegret(π, T, θ) := T μ^*(θ) − V(π, T, θ).\n\nIn a seminal paper, [15], Lai and Robbins established a lower bound on achievable regret. They considered the class of policies under which for any choice of θ and positive constant a, any policy in the class achieves o(n^a) regret. They showed that for any policy π in this class, and any θ with a unique maximum, we must have\n\nliminf_T Regret(π, T, θ) / log T ≥ Σ_{i ≠ i^*} (μ^*(θ) − μ_i(θ_i)) / d_KL(p_{θ_i}, p_{θ_{i^*}})   (1)\n\nwhere d_KL is the Kullback-Leibler divergence. The Bayes’ risk (or Bayesian regret) is simply the expected regret over draws of θ according to the prior q:\n\nRegret(π, T) := T E[μ^*(θ)] − V(π, T).\n\nIn yet another landmark paper, [14] showed that for a restricted class of priors q a similar class of algorithms to those found to be regret optimal in [15] were also Bayes optimal. Interestingly, however, this class of algorithms ignores information about the prior altogether. 
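To make the bound (1) concrete in the Bernoulli case, where d_KL(p_{θ_i}, p_{θ_{i^*}}) = θ_i log(θ_i/θ_{i^*}) + (1 − θ_i) log((1 − θ_i)/(1 − θ_{i^*})), the following small sketch evaluates the constant multiplying log T. (This illustration and its function names are ours, not part of the paper.)

```python
import math

def bernoulli_kl(p, q):
    """d_KL(Ber(p), Ber(q)) for p in [0, 1], q in (0, 1)."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

def lai_robbins_constant(theta):
    """Coefficient of log T in the Lai-Robbins lower bound (1):
    the sum over suboptimal arms i of (mu* - mu_i) / d_KL(p_i, p_*).
    Assumes a unique maximizer (float comparison suffices for a sketch)."""
    theta_star = max(theta)
    return sum((theta_star - t) / bernoulli_kl(t, theta_star)
               for t in theta if t != theta_star)

# Two arms with means 0.5 and 0.6: any policy in the Lai-Robbins class
# incurs regret of at least roughly lai_robbins_constant([0.5, 0.6]) * log T.
```

Note how the constant blows up as arm means get closer: harder-to-distinguish arms force more exploration.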
A number of algorithms that do exploit prior information have in recent years received a good deal of attention; these include Thompson Sampling [20], Bayes-UCB [12], KL-UCB [9], and Information Directed Sampling [19].\n\nThe Bayesian setting endows us with the structure of a (high dimensional) Markov Decision process. An alternative objective to minimizing Bayes risk is the maximization of the cumulative reward discounted over an infinite horizon. Specifically, for any positive discount factor γ < 1, define\n\nV_γ(π) := E_q[ Σ_{t=1}^{∞} γ^{t−1} X_{π_t, N_{π_t}(t)} ].\n\nThe celebrated Gittins index theorem provides an optimal, efficient solution to this problem that we will describe in greater detail shortly; unfortunately, as alluded to earlier, even a minor ‘tweak’ to the objective above, such as maximizing cumulative expected reward over a finite horizon, renders the Gittins index sub-optimal [17].\n\nAs a final point of notation, every scheme we consider will maintain a posterior on the mean of an arm at every point in time. We denote by q_{i,s} the posterior on the mean of the ith arm after s − 1 pulls of that arm; q_{i,1} := q. Since our prior on θ_i will frequently be conjugate to the distribution of the reward X_i, q_{i,s} will permit a succinct description via a sufficient statistic we will denote by y_{i,s}; denote the set of all such sufficient statistics Y. We will thus use q_{i,s} and y_{i,s} interchangeably and refer to the latter as the ‘state’ of the ith arm after s − 1 pulls.\n\n3 Gittins Indices and Optimistic Approximations\n\nOne way to compute the Gittins Index is via the so-called retirement value formulation [23]. The Gittins Index for arm i in state y is the value of λ that solves\n\nλ/(1 − γ) = sup_{τ > 1} E[ Σ_{t=1}^{τ−1} γ^{t−1} X_{i,t} + γ^{τ−1} λ/(1 − γ) | y_{i,1} = y ].   (2)\n\nWe denote this quantity by v_γ(y). 
If one thought of the notion of retiring as receiving a deterministic reward λ in every period, then the value of λ that solves the above equation could be interpreted as the per-period retirement reward that makes us indifferent between retiring immediately and the option of continuing to play arm i with the potential of retiring at some future time. The Gittins index policy can thus succinctly be stated as follows: at time t, play an arm in the set arg max_i v_γ(y_{i,N_i(t)}).\n\nIgnoring computational considerations, we cannot hope for a scheme such as the one above to achieve acceptable regret or Bayes risk. Specifically, denoting the Gittins policy by π^{G,γ}, we have\n\nLemma 3.1. There exists an instance of the multi-armed bandit problem with |A| = 2 for which\n\nRegret(π^{G,γ}, T) = Ω(T)\n\nfor any γ ∈ (0, 1).\n\nThe above result is expected. If the posterior means on the two arms are sufficiently far apart, the Gittins index policy will pick the arm with the larger posterior mean. The threshold beyond which the Gittins policy ‘exploits’ depends on the discount factor, and with a fixed discount factor there is a positive probability that the superior arm is never explored sufficiently so as to establish that it is, in fact, the superior arm. Fixing this issue then requires that the discount factor employed increase over time. Consider then employing discount factors that increase at roughly the rate 1 − 1/t; specifically, consider setting\n\nγ_t = 1 − 1/2^{⌊log₂ t⌋+1}\n\nand consider using the policy that at time t picks an arm from the set arg max_i v_{γ_t}(y_{i,N_i(t)}). Denote this policy by π^D. The following proposition shows that this ‘doubling’ policy achieves Bayes risk that is within a factor of log T of the optimal Bayes risk. Specifically, we have:\n\nProposition 3.1. Regret(π^D, T) = O(log³ T),\n\nwhere the constant in the big-Oh term depends on the prior q and A. 
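For concreteness, the doubling schedule just described can be written as a one-line helper (a sketch; the function name is ours). The factor is constant on each epoch [2^k, 2^{k+1}) and satisfies 1 − γ_t ≤ 1/t, so it approaches 1 at roughly the rate 1 − 1/t.

```python
import math

def doubling_discount(t):
    """Discount factor gamma_t = 1 - 2^(-(floor(log2 t) + 1)) for t >= 1.

    Piecewise constant on the epochs [2^k, 2^(k+1)): within an epoch the
    policy behaves like a fixed-discount Gittins rule, and the effective
    horizon 1/(1 - gamma_t) doubles from one epoch to the next."""
    if t < 1:
        raise ValueError("time index must be at least 1")
    return 1.0 - 2.0 ** -(math.floor(math.log2(t)) + 1)
```

The epoch structure is what makes the "doubling trick" in the proof of Proposition 3.1 go through.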
The proof of this simple result (Appendix A.1) relies on showing that the finite horizon regret achieved by using a Gittins index with an appropriate fixed discount factor is within a constant factor of the optimal finite horizon regret. The second ingredient is a doubling trick.\n\nWhile increasing discount factors does not appear to get us to the optimal Bayes risk (the achievable lower bound being log² T; see [14]), we conjecture that this is in fact a deficiency in our analysis for Proposition 3.1. In any case, the policy π^D is not the primary subject of the paper but merely a motivation for the discount factor schedule proposed. Putting aside this issue, one is still left with the computational burden associated with π^D, which is clearly onerous relative to any of the incumbent index rules discussed in the introduction.\n\n3.1 Optimistic Approximations to The Gittins Index\n\nThe retirement value formulation makes clear that computing a Gittins index is equivalent to solving a discounted, infinite horizon stopping problem. Since the state space Y associated with this problem is typically at least countable, solving this stopping problem, although not necessarily intractable, is a non-trivial computational task. Consider the following alternative stopping problem that requires as input the parameter λ (which has the same interpretation as it did before) and K, an integer limiting the number of steps that we need to look ahead. For an arm in state y (recall that the state specifies sufficient statistics for the current prior on the arm reward), let R(y) be a random variable drawn from the prior on expected arm reward specified by y. 
Define the retirement value R_{λ,K}(s, y) according to\n\nR_{λ,K}(s, y) = λ, if s < K + 1; max(λ, R(y)), otherwise.\n\nFor a given K, the Optimistic Gittins Index for arm i in state y is now defined as the value of λ that solves\n\nλ/(1 − γ) = sup_{1 < τ ≤ K+1} E[ Σ_{s=1}^{τ−1} γ^{s−1} X_{i,s} + γ^{τ−1} R_{λ,K}(τ, y_{i,τ})/(1 − γ) | y_{i,1} = y ].   (3)\n\nWe denote the solution to this equation by v^K_γ(y). The problem above admits a simple, attractive interpretation: nature reveals the true mean reward for the arm at time K + 1 should we choose to not retire prior to that time, which enables the decision maker to then instantaneously decide whether to retire at time K + 1 or else never retire. In this manner one is better off than in the stopping problem inherent to the definition of the Gittins index, so that we use the moniker optimistic. Since we need to look ahead at most K steps in solving the stopping problem implicit in the definition above, the computational burden in index computation is limited. The following Lemma formalizes this intuition.\n\nLemma 3.2. For all discount factors γ and states y ∈ Y, we have v^K_γ(y) ≥ v_γ(y) for all K.\n\nProof. See Appendix A.2.\n\nIt is instructive to consider the simplest version of the approximation proposed here, namely the case where K = 1. There, equation (3) simplifies to\n\nλ = μ̂(y) + (γ/(1 − γ)) E[(R(y) − λ)^+]   (4)\n\nwhere μ̂(y) := E[R(y)] is the mean reward under the prior given by y. The equation for λ above can also be viewed as an upper confidence bound on an arm’s expected reward. Solving equation (4) is often simple in practice, and we list a few examples to illustrate this:\n\nExample 3.1 (Beta). In this case y is the pair (a, b), which specifies a Beta prior distribution. 
The 1-step Optimistic Gittins Index is the value of λ that solves\n\nλ = a/(a + b) + (γ/(1 − γ)) E[(Beta(a, b) − λ)^+] = a/(a + b) + (γ/(1 − γ)) ( (a/(a + b))(1 − F_{a+1,b}(λ)) − λ(1 − F_{a,b}(λ)) )\n\nwhere F_{a,b} is the CDF of a Beta distribution with parameters a, b.\n\nExample 3.2 (Gaussian). Here y = (μ, σ²), which specifies a Gaussian prior, and the corresponding equation is\n\nλ = μ + (γ/(1 − γ)) E[(N(μ, σ²) − λ)^+] = μ + (γ/(1 − γ)) ( (μ − λ)Φ((μ − λ)/σ) + σφ((μ − λ)/σ) )\n\nwhere φ and Φ denote the standard Gaussian PDF and CDF, respectively.\n\nNotice that in both the Beta and Gaussian examples, the equations for λ are in terms of distribution functions. Therefore it is straightforward to compute a derivative for these equations (which would be in terms of the density and CDF of the prior), which makes finding a solution, using a method such as Newton-Raphson, simple and efficient.\n\nWe summarize the Optimistic Gittins Index (OGI) algorithm succinctly as follows. Assume the state of arm i at time t is given by y_{i,t}, and let γ_t = 1 − 1/t. Play an arm\n\ni^* ∈ arg max_i v^K_{γ_t}(y_{i,t}),\n\nand update the posterior on the arm based on the observed reward.\n\n4 Analysis\n\nWe establish a regret bound for Optimistic Gittins Indices when the algorithm is given the parameter K = 1, the prior distribution q is uniform and arm rewards are Bernoulli. The result shows that the algorithm, in that case, meets the Lai-Robbins lower bound and is thus asymptotically optimal, in both a frequentist and Bayesian sense. After stating the main theorem, we briefly discuss two generalizations of the algorithm. In the sequel, whenever x, y ∈ (0, 1), we will simplify notation and let d(x, y) := d_KL(Ber(x), Ber(y)). Also, we will refer to the Optimistic Gittins Index policy simply as π^{OG}, with the understanding that this refers to the case when K, the ‘look-ahead’ parameter, equals 1 and a flat beta prior is used. 
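Returning to the Gaussian example of Section 3.1: since only Gaussian distribution functions appear, the 1-step index can be computed with a few lines of standard-library code. The sketch below is our own (the paper suggests Newton-Raphson; we use bisection for simplicity, and the function names are ours); it solves λ = μ + (γ/(1 − γ)) E[(R − λ)^+] for R ∼ N(μ, σ²).

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ogi_1step_gaussian(mu, sigma, gamma, tol=1e-10):
    """1-step Optimistic Gittins Index for an N(mu, sigma^2) prior on the
    arm mean: the unique lambda >= mu solving
        lambda = mu + (gamma / (1 - gamma)) * E[(R - lambda)^+],
    with R ~ N(mu, sigma^2). The right-hand side minus lambda is strictly
    decreasing in lambda, so bisection on [mu, mu + c * E[(R - mu)^+]]
    brackets the root and converges."""
    c = gamma / (1.0 - gamma)

    def excess(lam):
        # Closed form for E[(R - lam)^+] when R ~ N(mu, sigma^2).
        z = (mu - lam) / sigma
        return (mu - lam) * norm_cdf(z) + sigma * norm_pdf(z)

    lo, hi = mu, mu + c * excess(mu)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mu + c * excess(mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As a sanity check, the index always exceeds the posterior mean and grows with the discount factor, matching the upper-confidence-bound reading of equation (4).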
Moreover, we will denote the Optimistic Gittins Index of the ith arm as v_{i,t} := v^1_{1−1/t}(y_{i,t}). Now we state the main result:\n\nTheorem 1. Let ε > 0. For the multi-armed bandit problem with Bernoulli rewards and any parameter vector θ ∈ [0, 1]^A, there exist T^* = T^*(ε, θ) and C = C(ε, θ) such that for all T ≥ T^*,\n\nRegret(π^{OG}, T, θ) ≤ Σ_{i=1,...,A; i ≠ i^*} (1 + ε)² (θ^* − θ_i) / d(θ_i, θ^*) · log T + C(ε, θ)   (5)\n\nwhere C(ε, θ) is a constant that is only determined by ε and the parameter θ.\n\nProof. Because we prove frequentist regret, the first few steps of the proof will be similar to those for UCB and Thompson Sampling. Assume w.l.o.g. that arm 1 is uniquely optimal, and therefore θ^* = θ₁. Fix an arbitrary suboptimal arm, which for convenience we will say is arm 2. Let j_t and k_t denote the number of pulls of arms 1 and 2, respectively, by (but not including) time t. Finally, we let s_t and s′_t be the corresponding integer reward accumulated from arms 1 and 2, respectively. That is,\n\ns_t = Σ_{s=1}^{j_t} X_{1,s},  s′_t = Σ_{s=1}^{k_t} X_{2,s}.\n\nTherefore, by definition, j₁ = k₁ = s₁ = s′₁ = 0. Let η₁, η₂, η₃ ∈ (θ₂, θ₁) be chosen such that η₁ < η₂ < η₃, d(η₁, η₃) = d(θ₂, θ₁)/(1 + ε) and d(η₂, η₃) = d(η₁, η₃)/(1 + ε). We upper bound the expected number of pulls of the second arm as follows. 
Next, we define L(T) := log T / d(η₂, η₃), so that\n\nE[k_T] ≤ L(T) + Σ_{t=⌊L(T)⌋+1}^{T} P(π^{OG}_t = 2, k_t ≥ L(T))\n≤ L(T) + Σ_{t=1}^{T} P(v_{1,t} < η₃) + Σ_{t=1}^{T} P(π^{OG}_t = 2, v_{1,t} ≥ η₃, k_t ≥ L(T))\n≤ L(T) + Σ_{t=1}^{T} P(v_{1,t} < η₃) + Σ_{t=1}^{T} P(π^{OG}_t = 2, v_{2,t} ≥ η₃, k_t ≥ L(T))\n≤ (1 + ε)² log T / d(θ₂, θ₁) + Σ_{t=1}^{∞} P(v_{1,t} < η₃) [term A] + Σ_{t=1}^{∞} P(π^{OG}_t = 2, v_{2,t} ≥ η₃, k_t ≥ L(T)) [term B]   (6)\n\nAll that remains is to show that terms A and B are bounded by constants. These bounds are given in Lemmas 4.1 and 4.2, whose proofs we describe at a high level with the details in the Appendix.\n\nLemma 4.1 (Bound on term A). For any η < θ₁, the following bound holds for some constant C₁ = C₁(ε, θ₁): Σ_{t=1}^{∞} P(v_{1,t} < η) ≤ C₁.\n\nProof outline. The goal is to bound P(v_{1,t} < η) by an expression that decays fast enough in t so that the series converges. To prove this, we shall express the event {v_{1,t} < η} in the form {W_t < 1/t} for some sequence of random variables W_t. It turns out that for large enough t, P(W_t < 1/t) ≤ P(c U^{1/(1+h)} < 1/t) where U is a uniform random variable and c, h > 0, and therefore P(v_{1,t} < η) = O(1/t^{1+h}). The full proof is in Appendix A.4.\n\nWe remark that the core technique in the proof of Lemma 4.1 is the use of the Beta CDF. As such, our analysis can, in some sense, improve the result for Bayes UCB. In the main theorem of [12], the authors state that the quantile in their algorithm is required to be 1 − 1/(t log^c T) for some parameter c ≥ 5; however, they show simulations with the quantile 1 − 1/t and suggest that, in practice, it should be used instead. 
By utilizing techniques in our analysis, it is possible to prove that the use of the quantile 1 − 1/t in Bayes UCB would lead to the same optimal regret bound. Therefore the ‘scaling’ by log^c T is unnecessary.\n\nLemma 4.2 (Bound on term B). There exist T^* = T^*(ε, θ) sufficiently large and a constant C₂ = C₂(ε, θ₁, θ₂) so that for any T ≥ T^*, we have\n\nΣ_{t=1}^{T} P(π^{OG}_t = 2, v_{2,t} ≥ η₃, k_t ≥ L(T)) ≤ C₂.\n\nProof outline. This relies on a concentration of measure result and the assumption that the 2nd arm was sampled at least L(T) times. The full proof is given in Appendix A.5.\n\nLemmas 4.1 and 4.2, together with (6), imply that\n\nE[k_T] ≤ (1 + ε)² log T / d(θ₂, θ₁) + C₁ + C₂,\n\nfrom which the regret bound follows.\n\n4.1 Generalizations and a tuning parameter\n\nThere is an argument in Agrawal and Goyal [2] which shows that any algorithm optimal for the Bernoulli bandit problem can be modified to yield an algorithm that has O(log T) regret with general bounded stochastic rewards. Therefore Optimistic Gittins Indices is an effective and practical alternative to policies such as Thompson Sampling and UCB. We also suspect that the proof of Theorem 1 can be generalized to all lookahead values (K > 1) and to a general exponential family of distributions.\n\nAnother important observation is that the discount factor for Optimistic Gittins Indices does not have to be exactly 1 − 1/t. In fact, a tuning parameter α > 0 can be added to make the discount factor γ_t = 1 − 1/(t + α) instead. An inspection of the proofs of Lemmas 4.1 and 4.2 shows that the result in Theorem 1 would still hold were one to use such a tuning parameter. In practice, performance is remarkably robust to our choice of K and α.\n\n5 Experiments\n\nOur goal is to benchmark Optimistic Gittins Indices (OGI) against state-of-the-art algorithms in the Bayesian setting. 
Specifically, we compare ourselves against Thompson Sampling, Bayes UCB, and IDS. Each of these algorithms has in turn been shown to substantially dominate other extant schemes. We consider the OGI algorithm for two values of the lookahead parameter K (1 and 3), and, in one experiment included for completeness, the case of exact Gittins indices (K = ∞). We used a common discount factor schedule in all experiments, setting γ_t = 1 − 1/(100 + t). The choice of α = 100 is second order and our conclusions remain unchanged (and actually appear to improve in an absolute sense) with other choices (we show this in a second set of experiments).\n\nA major consideration in running the experiments is that the CPU time required to execute IDS (the closest competitor) based on the current suggested implementation is orders of magnitude greater than that of the index schemes or Thompson Sampling. The main bottleneck is that IDS uses numerical integration, requiring the calculation of a CDF over, at least, hundreds of iterations. By contrast, the version of OGI with K = 1 uses 10 iterations of the Newton-Raphson method. In the remainder of this section, we discuss the results.\n\nGaussian This experiment (Table 1) replicates one in [19]. Here the arms generate Gaussian rewards X_{i,t} ∼ N(θ_i, 1) where each θ_i is independently drawn from a standard Gaussian distribution. We simulate 1000 independent trials with 10 arms and 1000 time periods. The implementation of OGI in this experiment uses K = 1. It is difficult to compute exact Gittins indices in this setting, but a classical approximation for Gaussian bandits does exist; see [18], Chapter 6.1.3. We term the use of that approximation ‘OGI(∞) Approx’. 
In addition to regret, we show the average CPU time taken, in seconds, to execute each trial.\n\nAlgorithm | OGI(1) | OGI(∞) Approx. | IDS | TS | Bayes UCB\nMean Regret | 49.19 | 47.64 | 55.83 | 67.40 | 60.30\nS.D. | 51.07 | 50.59 | 65.88 | 47.38 | 45.35\n1st quartile | 17.49 | 16.88 | 18.61 | 37.46 | 31.41\nMedian | 41.72 | 40.99 | 40.79 | 63.06 | 57.71\n3rd quartile | 73.24 | 72.26 | 78.76 | 94.52 | 86.40\nCPU time (s) | 0.02 | 0.01 | 11.18 | 0.01 | 0.02\n\nTable 1: Gaussian experiment. OGI(1) denotes OGI with K = 1, while OGI Approx. uses the approximation to the Gaussian Gittins index from [18].\n\nThe key feature of the results here is that OGI offers an approximately 10% improvement in regret over its nearest competitor IDS, and larger improvements (20 and 40% respectively) over Bayes UCB and Thompson Sampling. The best performing policy is OGI with the specialized Gaussian approximation, since it gives a closer approximation to the Gittins index. At the same time, OGI is essentially as fast as Thompson Sampling, and three orders of magnitude faster than its nearest competitor (in terms of regret).\n\nBernoulli In this experiment regret is simulated over 1000 periods, with 10 arms each having a uniformly distributed Bernoulli parameter, over 1000 independent trials (Table 2). We use the same setup as in [19] for consistency.\n\nAlgorithm | OGI(1) | OGI(3) | OGI(∞) | IDS | TS | Bayes UCB\nMean Regret | 18.12 | 18.00 | 17.52 | 19.03 | 27.39 | 22.71\n1st quartile | 6.26 | 5.60 | 4.45 | 5.85 | 14.62 | 10.09\nMedian | 15.08 | 14.84 | 12.06 | 14.06 | 23.53 | 18.52\n3rd quartile | 27.63 | 27.74 | 24.93 | 26.48 | 36.11 | 30.58\nCPU time (s) | 0.19 | 0.89 | (?) hours | 8.11 | 0.01 | 0.05\n\nTable 2: Bernoulli experiment. OGI(K) denotes the OGI algorithm with a K-step approximation and tuning parameter α = 100. 
OGI(∞) is the algorithm that uses exact Gittins indices.\n\nEach version of OGI outperforms the other algorithms, and the one that uses (actual) Gittins indices has the lowest mean regret. Perhaps unsurprisingly, when OGI looks ahead 3 steps it performs marginally better than with a single step. Nevertheless, looking ahead 1 step is a reasonably close approximation to the Gittins index in the Bernoulli problem. In fact the approximation error, when using an optimistic 1-step approximation, is around 15%, and if K is increased to 3, the error drops to around 4%.\n\nFigure 1: Bayesian regret: (a) Gaussian experiment; (b) Bernoulli experiment. In the legend, OGI(K)-α is the format used to indicate parameters K and α. The OGI Approx policy uses the approximation to the Gittins index from [18].\n\nLonger Horizon and Robustness For this experiment, we simulate the earlier Bernoulli and Gaussian bandit setups with a longer horizon of 5000 steps and with 3 arms. The arms’ parameters are drawn at random in the same manner as in the previous two experiments and regret is averaged over 100,000 independent trials. Results are shown in Figures 1a and 1b. In the Bernoulli experiment of this section, due to the computational cost, we are only able to simulate OGI with K = 1. In addition, to show robustness with respect to the choice of tuning parameter α, we show results for α = 50, 100, 150. The message here is essentially the same as in the earlier experiments: the OGI scheme offers a non-trivial performance improvement at a tiny fraction of the computational effort required by its nearest competitor. We omit Thompson Sampling and Bayes UCB from the plots in order to more clearly see the difference between OGI and IDS. The complete graphs can be found in Appendix A.6.\n\nReferences\n\n[1] AGRAWAL, R. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability (1995), 1054–1078.\n\n[2] AGRAWAL, S., AND GOYAL, N. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. In Proceedings of The 25th Conference on Learning Theory (2012), pp. 39.1–39.26.\n\n[3] AGRAWAL, S., AND GOYAL, N. Further Optimal Regret Bounds for Thompson Sampling. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (2013), pp. 99–107.\n\n[4] AUER, P., CESA-BIANCHI, N., AND FISCHER, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235–256.\n\n[5] BERRY, D. A., AND FRISTEDT, B. Bandit Problems: Sequential Allocation of Experiments (Monographs on Statistics and Applied Probability). Springer, 1985.\n\n[6] BERTSIMAS, D., AND NIÑO-MORA, J. Conservation laws, extended polymatroids and multiarmed bandit problems; a polyhedral approach to indexable systems. Mathematics of Operations Research 21, 2 (1996), 257–306.\n\n[7] CHAPELLE, O., AND LI, L. An empirical evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems (2011), pp. 2249–2257.\n\n[8] COVER, T. M., AND THOMAS, J. A. Elements of Information Theory. John Wiley & Sons, 2012.\n\n[9] GARIVIER, A. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT (2011).\n\n[10] GITTINS, J. C. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological) (1979), 148–177.\n\n[11] JOGDEO, K., AND SAMUELS, S. M. Monotone convergence of binomial probabilities and a generalization of Ramanujan’s equation. The Annals of Mathematical Statistics (1968), 1191–1195.\n\n[12] KAUFMANN, E., KORDA, N., AND MUNOS, R. Thompson Sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory (2012), Springer, pp. 199–213.\n\n[13] KORDA, N., KAUFMANN, E., AND MUNOS, R. 
Thompson Sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems (2013), pp. 1448–1456.\n\n[14] LAI, T. L. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics (1987), 1091–1114.\n\n[15] LAI, T. L., AND ROBBINS, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6, 1 (1985), 4–22.\n\n[16] LATTIMORE, T. Regret Analysis of the Finite-Horizon Gittins Index Strategy for Multi-Armed Bandits. In Proceedings of The 29th Conference on Learning Theory (2016), pp. 1–32.\n\n[17] NIÑO-MORA, J. Computing a classic index for finite-horizon bandits. INFORMS Journal on Computing 23, 2 (2011), 254–267.\n\n[18] POWELL, W. B., AND RYZHOV, I. O. Optimal Learning, vol. 841. John Wiley & Sons, 2012.\n\n[19] RUSSO, D., AND VAN ROY, B. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems (2014), pp. 1583–1591.\n\n[20] THOMPSON, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika (1933), 285–294.\n\n[21] TSITSIKLIS, J. N. A short proof of the Gittins index theorem. The Annals of Applied Probability (1994), 194–199.\n\n[22] WEBER, R. On the Gittins index for multi-armed bandits. The Annals of Applied Probability 2, 4 (1992), 1024–1033.\n\n[23] WHITTLE, P. Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society. Series B (Methodological) (1980), 143–149.", "award": [], "sourceid": 1572, "authors": [{"given_name": "Eli", "family_name": "Gutin", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Vivek", "family_name": "Farias", "institution": "Massachusetts Institute of Technology"}]}