{"title": "Bounded Regret for Finite-Armed Structured Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 550, "page_last": 558, "abstract": "We study a new type of K-armed bandit problem where the expected return of one arm may depend on the returns of other arms. We present a new algorithm for this general class of problems and show that under certain circumstances it is possible to achieve finite expected cumulative regret. We also give problem-dependent lower bounds on the cumulative regret showing that at least in special cases the new algorithm is nearly optimal.", "full_text": "Bounded Regret for Finite-Armed Structured Bandits\n\nTor Lattimore\n\nDepartment of Computing Science\n\nUniversity of Alberta, Canada\ntlattimo@ualberta.ca\n\nR\u00b4emi Munos\n\nINRIA\n\nLille, France1\n\nremi.munos@inria.fr\n\nAbstract\n\nWe study a new type of K-armed bandit problem where the expected return of one\narm may depend on the returns of other arms. We present regret analysis for this\ngeneral class of problems and show that under certain circumstances it is possible\nto achieve \ufb01nite expected cumulative regret. We also give problem-dependent\nlower bounds on the cumulative regret showing that at least in special cases the\nresults are nearly optimal.\n\n1\n\nIntroduction\n\n\u00b5\n\n(c)\n\n0\n\n(a)\n\n(b)\n\n\u22121\n\n\u22121\n\n1\n\n0\n\nThe multi-armed bandit problem is a reinforcement learning problem with K actions. At each time-\nstep a learner must choose an action i after which it receives a reward distributed with mean \u00b5i. The\ngoal is to maximise the cumulative reward. 
This is perhaps the simplest setting in which the well-known exploration/exploitation dilemma becomes apparent, with a learner being forced to choose between exploring arms about which she has little information, and exploiting by choosing the arm that currently appears optimal.

We consider a general class of K-armed bandit problems where the expected return of each arm may be dependent on other arms. This model has already been considered when the dependencies are linear [19] and also in the general setting studied here [13, 1]. Let Θ ∋ θ∗ be an arbitrary parameter space and define the expected return of arm i by µi(θ∗) ∈ R. The learner is permitted to know the functions µ1, ..., µK, but not the true parameter θ∗. The unknown parameter θ∗ determines the mean reward for each arm. The performance of a learner is measured by the (expected) cumulative regret, which is the difference between the expected return of the optimal policy and the (expected) return of the learner's policy,

Rn := n max_{i∈{1,...,K}} µi(θ∗) − Σ_{t=1}^n µ_{It}(θ∗),

where It is the arm chosen at time-step t.

A motivating example is as follows. Suppose a long-running company must decide each week whether or not to purchase some new form of advertising with unknown expected returns. The problem may be formulated using the new setting by letting K = 2 and Θ = [−∞, ∞]. We assume the base-line performance without purchasing the advertising is known and so define µ1(θ) = 0 for all θ. The expected return of choosing to advertise is µ2(θ) = θ (see Figure 1(b)).

Our main contribution is regret analysis for an algorithm based on UCB [6] for the structured bandit problem with strong problem-dependent guarantees on the regret.
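As a concrete illustration (our own sketch, not code from the paper), the advertising model and the regret definition above can be written out directly; the names below are hypothetical:

```python
# Two-armed structured bandit for the advertising example:
# arm 0 (no advertising) has known return 0, arm 1 returns theta.
mus = [lambda theta: 0.0,      # mu_1: baseline, independent of theta
       lambda theta: theta]    # mu_2: unknown advertising return

def cumulative_regret(theta_star, pulls):
    """R_n = n * max_i mu_i(theta*) - sum_t mu_{I_t}(theta*)."""
    best = max(mu(theta_star) for mu in mus)
    return len(pulls) * best - sum(mus[i](theta_star) for i in pulls)

# If theta* = 0.5 the advertising arm is optimal; pulling arm 0 once
# out of four rounds costs 0.5 regret.
print(cumulative_regret(0.5, [0, 1, 1, 1]))   # -> 0.5
```

Note that for θ∗ ≤ 0 the baseline arm is optimal and a policy that never advertises suffers zero regret.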
While the setting, objective and analysis are different, the algorithm is essentially the same as "model UCB" introduced in [8]. The key improvement over UCB is that the algorithm enjoys finite regret in many cases while UCB suffers logarithmic regret unless all arms have the same return. For example, in Figures 1(a) and 1(c) we show that finite regret is possible for all θ∗, while in the advertising problem finite regret is attainable if θ∗ ≥ 0. The improved algorithm exploits the known structure and so avoids the famous negative results by Lai and Robbins [18]. One insight from this work is that knowing the return of the optimal arm and a bound on the minimum gap is not the only information that leads to the possibility of finite regret. In the examples given above neither quantity is known, but the assumed structure is nevertheless sufficient for finite regret.

[Figure 1: Examples. Three structured bandits plotting the mean functions µ against θ ∈ [−1, 1] in panels (a), (b) and (c).]

¹ Current affiliation: Google DeepMind.

Despite the enormous literature on bandits, as far as we are aware this is the first time this setting has been considered with the aim of achieving finite regret. There has been substantial work on exploiting various kinds of structure to reduce an otherwise impossible problem to one where sub-linear (or even logarithmic) regret is possible [20, 4, 11, and references therein], but the focus is usually on efficiently dealing with large action spaces rather than sub-logarithmic/finite regret. The most comparable previous work studies the case where both the return of the best arm and a bound on the minimum gap between the best arm and some sub-optimal arm are known [12, 10], which extended the permutation bandits studied by Lai and Robbins [17] and more general results by the same authors [16]. Also relevant is the paper by Agrawal et al.
[1], which studied a similar setting, but where Θ was finite. Graves and Lai [13] extended the aforementioned contribution to continuous parameter spaces (and also to MDPs). Their work differs from ours in a number of ways. Most notably, their objective is to compute exactly the asymptotically optimal regret in the case where finite regret is not possible. In the case where finite regret is possible they prove only that the optimal regret is sub-logarithmic, and do not present any explicit bounds on the actual regret. Aside from this, their results depend on the parameter space being a metric space and they assume that the optimal policy is locally constant about the true parameter.

2 Notation

General. Most of our notation is common with [9]. The indicator function is denoted by 1{expr} and is 1 if expr is true and 0 otherwise. We use log for the natural logarithm. Logical and/or are denoted by ∧ and ∨ respectively. Define the function ω(x) = min{y ∈ N : z ≥ x log z, ∀z ≥ y}, which satisfies log ω(x) ∈ O(log x). In fact, lim_{x→∞} log(ω(x))/log(x) = 1.

Bandits. Let Θ be a set. A K-armed structured bandit is characterised by a set of functions µk : Θ → R where µk(θ) is the expected return of arm k ∈ A := {1, ..., K} given unknown parameter θ. We define the mean of the optimal arm by the function µ∗ : Θ → R with µ∗(θ) := max_i µi(θ). The true unknown parameter that determines the means is θ∗ ∈ Θ. The best arm is i∗ := arg max_i µi(θ∗). The arm chosen at time-step t is denoted by It while Xi,s is the sth reward obtained when sampling from arm i. We denote the number of times arm i has been chosen at time-step t by Ti(t). The empirical estimate of the mean of arm i based on the first s samples is µ̂i,s.
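As a small numerical illustration of the threshold function ω defined above (our own helper, not part of the paper), note that z − x log z is convex in z, so all violations of z ≥ x log z lie in a bounded range and can simply be scanned:

```python
import math

def omega(x):
    """omega(x) = min{y in N : z >= x*log(z) for all z >= y}.

    z - x*log(z) is convex with its minimum at z = x, so once the
    inequality holds for some z >= x it holds for all larger z; we
    scan safely past the larger root of z = x*log(z).
    """
    horizon = int(4 * x * math.log(x + 3)) + 10
    worst = 0  # largest z violating z >= x*log(z)
    for z in range(1, horizon + 1):
        if z < x * math.log(z):
            worst = z
    return worst + 1

# z >= 4*log(z) fails for z = 2,...,8 but holds from z = 9 onwards.
print(omega(4))   # -> 9
print(omega(1))   # -> 1 (z >= log(z) holds for every z >= 1)
```

Evaluating omega at increasing x also lets one observe the stated limit log(ω(x))/log(x) → 1 numerically, although the convergence is slow.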
We define the gap between the means of the best arm and arm i by Δi := µ∗(θ∗) − µi(θ∗). The set of sub-optimal arms is A′ := {i ∈ A : Δi > 0}. The minimum gap is Δmin := min_{i∈A′} Δi while the maximum gap is Δmax := max_{i∈A} Δi. The cumulative regret is defined by

Rn := Σ_{t=1}^n (µ∗(θ∗) − µ_{It}(θ∗)) = Σ_{t=1}^n Δ_{It}.

Note that quantities like Δi and i∗ depend on θ∗, which is omitted from the notation. As is rather common we assume that the returns are sub-gaussian, which means that if X is the return sampled from some arm, then ln E exp(λ(X − EX)) ≤ λ²σ²/2. As usual we assume that σ² is known and does not depend on the arm. If X1, ..., Xn are sampled independently from some arm with mean µ and Sn = Σ_{t=1}^n Xt, then the following maximal concentration inequality is well-known:

P{ max_{1≤t≤n} |St − tµ| ≥ ε } ≤ 2 exp(−ε²/(2nσ²)).

A straight-forward corollary is that P{|µ̂i,n − µi| ≥ ε} ≤ 2 exp(−ε²n/(2σ²)).

It is an important point that Θ is completely arbitrary. The classic multi-armed bandit can be obtained by setting Θ = R^K and µk(θ) = θk, which removes all dependencies between the arms. The setting where the optimal expected return is known to be zero and a bound Δi ≥ ε on the gaps is known can be regained by choosing Θ = (−∞, −ε]^K × {1, ..., K} and µk(θ1, ..., θK, i) = θk·1{k ≠ i}.
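As an illustrative sketch (ours, not the authors'), both encodings just described fit the same interface of mean functions over an arbitrary Θ:

```python
# Classic K-armed bandit inside the structured model:
# Theta = R^K, mu_k(theta) = theta_k (no dependencies between arms).
def classic_mu(k):
    return lambda theta: theta[k]

# Known optimal return 0 and gap >= eps:
# Theta = (-inf, -eps]^K x {1,...,K}, mu_k(theta, i) = theta_k * 1{k != i}.
def known_optimum_mu(k):
    return lambda theta, i: theta[k] if k != i else 0.0

mus = [classic_mu(k) for k in range(3)]
theta = [0.1, 0.7, 0.3]
print(max(mu(theta) for mu in mus))  # best mean: 0.7

theta_gap, best = [-0.5, -0.2, -0.9], 1   # all sub-optimal means <= -eps
print([known_optimum_mu(k)(theta_gap, best) for k in range(3)])  # [-0.5, 0.0, -0.9]
```

In the second encoding the designated best arm always has mean exactly 0 and every other arm has mean at most −ε, exactly as required.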
We do not demand that µk : Θ → R be continuous, or even that Θ be endowed with a topology.

3 Structured UCB

The algorithm is a straight-forward modification of UCB [6], but where the known structure of the problem is exploited. Note that the algorithm is essentially the same as model UCB used by [8], but here the confidence sets are based on the known structure rather than a finite set of models. At each time-step it constructs a confidence interval about the mean of each arm. From this a subspace Θ̃t ⊆ Θ is constructed, which contains the true parameter θ∗ with high probability. The algorithm takes the optimistic action over all θ̃ ∈ Θ̃t.

Algorithm 1 UCB-S
1: Input: functions µ1, ..., µK : Θ → [0, 1]
2: for t ∈ 1, ..., ∞ do
3:   Define confidence set Θ̃t ← { θ̃ : ∀i, |µi(θ̃) − µ̂i,Ti(t−1)| < √(2ασ² log t / Ti(t−1)) }
4:   if Θ̃t = ∅ then
5:     Choose arm arbitrarily
6:   else
7:     Choose optimistic arm i ← arg max_i sup_{θ̃∈Θ̃t} µi(θ̃)

Remark 1. The choice of arm when Θ̃t = ∅ does not affect the regret bounds in this paper. In practice, it is possible to simply increase t without taking an action, but this complicates the analysis.

In many cases the true parameter θ∗ is never identified in the sense that we do not expect that Θ̃t → {θ∗}. The computational complexity of UCB-S depends on the difficulty of computing Θ̃t and computing the optimistic arm within this set.
This is efficient in simple cases, like when µk is piecewise linear, but may be intractable for complex functions.

4 Theorems

We present two main theorems bounding the regret of the UCB-S algorithm. The first is for arbitrary θ∗, which leads to a logarithmic bound on the regret comparable to that obtained for UCB by [6]. The analysis is slightly different because UCB-S maintains upper and lower confidence bounds and selects its actions optimistically from the model class, rather than by maximising the upper confidence bound as UCB does.

Theorem 2. If α > 2 and θ∗ ∈ Θ, then the algorithm UCB-S suffers an expected regret of at most

ERn ≤ 2ΔmaxK(α − 1)/(α − 2) + Σ_{i∈A′} 8ασ² log n / Δi + Σ_{i∈A′} Δi.

If the samples from the optimal arm are sufficient to learn the optimal action, then finite regret is possible. In Section 6 we give something of a converse by showing that if knowing the mean of the optimal arm is insufficient to act optimally, then logarithmic regret is unavoidable.

Theorem 3. Let α = 4 and assume there exists an ε > 0 such that

(∀θ ∈ Θ)  |µi∗(θ∗) − µi∗(θ)| < ε  ⟹  ∀i ≠ i∗, µi∗(θ) > µi(θ).    (1)

Then

ERn ≤ Σ_{i∈A′} ( 32σ² log ω∗ / Δi + Δi ) + 3ΔmaxK + ΔmaxK³/ω∗,

with ω∗ := max{ ω(8σ²αK/ε²), ω(8σ²αK/Δ²min) }.

Remark 4.
For small \u03b5 and large n the expected regret looks like ERn \u2208 O\n(for small n the regret is, of course, even smaller).\n\n3\n\n(cid:33)\n\n(cid:1)\n\n(cid:32) K(cid:88)\n\ni=1\n\nlog(cid:0) 1\n\n\u03b5\n\n\u2206i\n\n\fThe explanation of the bound is as follows.\nIf at some time-step t it holds that all con\ufb01dence\nintervals contain the truth and the width of the con\ufb01dence interval about i\u2217 drops below \u03b5, then by\nthe condition in Equation (1) it holds that i\u2217 is the optimistic arm within \u02dc\u0398t. In this case UCB-S\nsuffers no regret at this time-step. Since the number of samples of each sub-optimal arm grows at\nmost logarithmically by the proof of Theorem 2, the number of samples of the best arm must grow\nlinearly. Therefore the number of time-steps before best arm has been pulled O(\u03b5\u22122) times is also\nO(\u03b5\u22122). After this point the algorithm suffers only a constant cumulative penalty for the possibility\nthat the con\ufb01dence intervals do not contain the truth, which is \ufb01nite for suitably chosen values of \u03b1.\nNote that Agrawal et. al. [1] had essentially the same condition to achieve \ufb01nite regret as (1), but\nspeci\ufb01ed to the case where \u0398 is \ufb01nite.\nAn interesting question is raised by comparing the bound in Theorem 3 to those given by Bubeck\net. al. [12] where if the expected return of the best arm is known and \u03b5 is a known bound on the\nminimum gap, then a regret bound of\n\n(cid:32)(cid:88)\n\n(cid:32)\n\nO\n\nlog(cid:0) 2\u2206i\n\n\u03b5\n\n(cid:1)\n\n(cid:18)\n\ni\u2208A(cid:48)\n\n\u2206i\n\n(cid:19)(cid:33)(cid:33)\n\n1\n\u03b5\n\n1 + log log\n\n(2)\n\nlog n\u22062\n\ni\u2208A(cid:48) 1\n\u2206i\n\nexpected regret of O((cid:80)\n\nis achieved. If \u03b5 is close to \u2206i, then this bound is an improvement over the bound given by Theorem\n3, although our theorem is more general. The improved UCB algorithm [7] enjoys a bound on the\ni ). 
If we follow the same reasoning as above we obtain a bound comparable to (2). Unfortunately though, the extension of the improved UCB algorithm to the structured setting is rather challenging with the main obstruction being the extreme growth of the phases used by improved UCB. Refining the phases leads to super-logarithmic regret, a problem we ultimately failed to resolve. Nevertheless we feel that there is some hope of obtaining a bound like (2) in this setting.

Before the proofs of Theorems 2 and 3 we give some example structured bandits and indicate the regions where the conditions for Theorem 3 are (not) met. Areas where Theorem 3 can be applied to obtain finite regret are unshaded while those with logarithmic regret are shaded.

[Figure 2: Examples. Six structured bandits (a)-(f) plotting the mean functions µ1, µ2, µ3 against θ ∈ [−1, 1] (panel (f) is a permutation bandit on θ ∈ {1, ..., 6}); the figure also contains a hidden message.]

(a) The conditions for Theorem 3 are met for all θ ≠ 0, but for θ = 0 the regret strictly vanishes for all policies, which means that the regret is bounded by ERn ∈ O( 1{θ∗ ≠ 0} (1/|θ∗|) log(1/|θ∗|) ).

(b) Action 2 is uninformative and not globally optimal so Theorem 3 does not apply for θ < 1/2 where this action is optimal.
For \u03b8 > 0 the optimal action is 1, when the conditions are met and\n\ufb01nite regret is again achieved.\nERn \u2208 O\n\n1{\u03b8\u2217 < 0} log n\n\n(cid:18)\n\n(cid:19)\n\n|\u03b8\u2217| + 1{\u03b8\u2217 > 0} log 1\n\u03b8\u2217\n\u03b8\u2217\n\n.\n\n(c) The conditions for Theorem 3 are again met for all non-zero \u03b8\u2217, which leads as in (a) to a regret\n\nof ERn \u2208 O(1{\u03b8\u2217 (cid:54)= 0} 1|\u03b8\u2217| log 1|\u03b8\u2217| ).\n\n4\n\n\fExamples (d) and (e) illustrate the potential complexity of the regions in which \ufb01nite regret is pos-\nsible. Note especially that in (e) the regret for \u03b8\u2217 = 1\n2 is logarithmic in the horizon, but \ufb01nite for \u03b8\u2217\narbitrarily close. Example (f) is a permutation bandit with 3 arms where it can be clearly seen that\nthe conditions of Theorem 3 are satis\ufb01ed.\n\n5 Proof of Theorems 2 and 3\n\nWe start by bounding the probability that some mean does not lie inside the con\ufb01dence set.\nLemma 5. P{Ft = 1} \u2264 2Kt exp(\u2212\u03b1 log(t)) where\n\n(cid:40)\n\n(cid:115)\n\n(cid:41)\n\n.\n\nProof. We use the concentration guarantees:\n\n2\u03b1\u03c32 log t\nTi(t \u2212 1)\n\n(cid:41)\n\n(cid:40)\n\nFt = 1\n\n\u2203i : |\u02c6\u00b5i,Ti(t\u22121) \u2212 \u00b5i| \u2265\n(cid:115)\n(cid:41)\n(cid:41) (d)\u2264 K(cid:88)\n\n\u2203i :(cid:12)(cid:12)\u00b5i(\u03b8\u2217) \u2212 \u02c6\u00b5i,Ti(t\u22121)\n(cid:115)\n(cid:12)(cid:12) \u2265\n(cid:114)\n\n(cid:40)(cid:12)(cid:12)\u00b5i(\u03b8\u2217) \u2212 \u02c6\u00b5i,Ti(t\u22121)\n(cid:40)\nt(cid:88)\n\n|\u00b5i(\u03b8\u2217) \u2212 \u02c6\u00b5i,s| \u2265\n\n2\u03b1\u03c32 log t\nTi(t \u2212 1)\n\n(cid:12)(cid:12) \u2265\n\n2\u03b1\u03c32 log t\n\nP\n\nP\n\n2\u03b1\u03c32 log t\nTi(t \u2212 1)\n\ns\n\ni=1\n\ns=1\n\nP{Ft = 1} (a)\n= P\n\n(b)\u2264 K(cid:88)\n(c)\u2264 K(cid:88)\n\ni=1\n\ni=1\n\ns=1\n\nt(cid:88)\n\n2 exp(\u2212\u03b1 log t)\n\n(e)\n\n= 2Kt1\u2212\u03b1\n\nwhere (a) follows from the de\ufb01nition of Ft. 
(b) by the union bound. (c) also follows from the union bound and is the standard trick to deal with the random variable Ti(t − 1). (d) follows from the concentration inequalities for sub-gaussian random variables. (e) is trivial.

Proof of Theorem 2. Let i be an arm with Δi > 0 and suppose that It = i. Then either Ft is true or

Ti(t − 1) < ⌈ 8σ²α log n / Δ²i ⌉ =: ui(n).    (3)

Note that if Ft does not hold then the true parameter lies within the confidence set, θ∗ ∈ Θ̃t. Suppose on the contrary that Ft and (3) are both false. Then

max_{θ̃∈Θ̃t} µi∗(θ̃) (a)≥ µ∗(θ∗) (b)= µi(θ∗) + Δi (c)> Δi + µ̂i,Ti(t−1) − √(2σ²α log t / Ti(t−1)) (d)≥ µ̂i,Ti(t−1) + √(2ασ² log t / Ti(t−1)) (e)≥ max_{θ̃∈Θ̃t} µi(θ̃),

where (a) follows since θ∗ ∈ Θ̃t. (b) is the definition of the gap. (c) since Ft is false. (d) is true because (3) is false. (e) follows from the definition of Θ̃t. Therefore arm i is not taken. We now bound the expected number of times that arm i is played within the first n time-steps by

ETi(n) (a)= E Σ_{t=1}^n 1{It = i} (b)≤ ui(n) + E Σ_{t=ui+1}^n 1{It = i ∧ (3) is false} (c)≤ ui(n) + E Σ_{t=ui+1}^n 1{Ft = 1 ∧ It = i},

where (a) follows from the linearity of expectation and definition of Ti(n). (b) by Equation (3) and the definition of ui(n) and expectation. (c) is true by recalling that playing arm i at time-step t implies that either Ft or (3) must be true.
Therefore

ERn ≤ Σ_{i∈A′} Δi ( ui(n) + E Σ_{t=ui+1}^n 1{Ft = 1 ∧ It = i} ) ≤ Σ_{i∈A′} Δi ui(n) + Δmax E Σ_{t=1}^n 1{Ft = 1}.    (4)

Bounding the second summation,

E Σ_{t=1}^n 1{Ft = 1} (a)= Σ_{t=1}^n P{Ft = 1} (b)≤ Σ_{t=1}^n 2Kt^{1−α} (c)≤ 2K(α − 1)/(α − 2),

where (a) follows by exchanging the expectation and sum and because the expectation of an indicator function can be written as the probability of the event. (b) by Lemma 5 and (c) is trivial. Substituting into (4) leads to

ERn ≤ 2ΔmaxK(α − 1)/(α − 2) + Σ_{i∈A′} 8ασ² log n / Δi + Σ_{i∈A′} Δi.

Before the proof of Theorem 3 we need a high-probability bound on the number of times arm i is pulled, which is proven along the lines of similar results by [5].

Lemma 6. Let i ∈ A′ be some sub-optimal arm. If z > ui(n), then P{Ti(n) > z} ≤ 2Kz^{2−α}/(α − 2).

Proof. As in the proof of Theorem 2, if t ≤ n and Ft is false and Ti(t − 1) > ui(n) ≥ ui(t), then arm i is not chosen. Therefore

P{Ti(n) > z} ≤ Σ_{t=z+1}^n P{Ft = 1} (a)≤ Σ_{t=z+1}^n 2Kt^{1−α} (b)≤ 2K ∫_z^n t^{1−α} dt (c)≤ 2Kz^{2−α}/(α − 2),

where (a) follows from Lemma 5 and (b) and (c) are trivial.

Lemma 7. Assume the conditions of Theorem 3 and additionally that Ti∗(t − 1) ≥ ⌈ 8ασ² log t / ε² ⌉ and Ft is false. Then It = i∗.

Proof.
Since Ft is false, for θ̃ ∈ Θ̃t we have

|µi∗(θ̃) − µi∗(θ∗)| (a)≤ |µi∗(θ̃) − µ̂i∗,Ti∗(t−1)| + |µ̂i∗,Ti∗(t−1) − µi∗(θ∗)| (b)< 2√(2σ²α log t / Ti∗(t−1)) (c)≤ ε,

where (a) is the triangle inequality. (b) follows by the definition of the confidence interval and because Ft is false. (c) by the assumed lower bound on Ti∗(t − 1). Therefore by (1), for all θ̃ ∈ Θ̃t it holds that the best arm is i∗. Finally, since Ft is false, θ∗ ∈ Θ̃t, which means that Θ̃t ≠ ∅. Therefore It = i∗ as required.

Proof of Theorem 3. Let ω∗ be some constant to be chosen later. Then the regret may be written as

ERn ≤ E Σ_{t=1}^{ω∗} Σ_{i=1}^K Δi 1{It = i} + Δmax E Σ_{t=ω∗+1}^n 1{It ≠ i∗}.    (5)

The first summation is bounded as in the proof of Theorem 2 by

E Σ_{t=1}^{ω∗} Σ_{i∈A} Δi 1{It = i} ≤ Σ_{i∈A′} ( 8ασ² log ω∗ / Δi + Δi ) + Δmax Σ_{t=1}^{ω∗} P{Ft = 1}.    (6)

We now bound the second sum in (5) and choose ω∗. By Lemma 6, if n/K > ui(n), then

P{ Ti(n) > n/K } ≤ (2K/(α − 2)) (K/n)^{α−2}.    (7)

Suppose t ≥ ω∗ := max{ ω(8σ²αK/ε²), ω(8σ²αK/Δ²min) }. Then t/K > ui(t) for all i ≠ i∗ and t/K ≥ 8σ²α log t / ε².
By the union bound,

P{ Ti∗(t) < 8σ²α log t / ε² } (a)≤ P{ Ti∗(t) < t/K } (b)≤ P{ ∃i ≠ i∗ : Ti(t) > t/K } (c)≤ (2K²/(α − 2)) (K/t)^{α−2},    (8)

where (a) is true since t/K ≥ 8σ²α log t/ε². (b) since Σ_{i=1}^K Ti(t) = t. (c) by the union bound and (7). Now if Ti∗(t) ≥ 8σ²α log t/ε² and Ft is false, then by Lemma 7 the chosen arm is i∗. Therefore

E Σ_{t=ω∗+1}^n 1{It ≠ i∗} ≤ Σ_{t=ω∗+1}^n P{Ft = 1} + Σ_{t=ω∗+1}^n P{ Ti∗(t − 1) < 8σ²α log t / ε² }
(a)≤ Σ_{t=ω∗+1}^n P{Ft = 1} + Σ_{t=ω∗+1}^n (2K²/(α − 2)) (K/t)^{α−2}
(b)≤ Σ_{t=ω∗+1}^n P{Ft = 1} + (2K²/((α − 2)(α − 3))) (K/ω∗)^{α−3},    (9)

where (a) follows from (8) and (b) by straight-forward calculus.
Therefore by combining (5), (6) and (9) we obtain

ERn ≤ Σ_{i∈A′} ( 8σ²α log ω∗ / Δi + Δi ) + (2ΔmaxK²/((α − 2)(α − 3))) (K/ω∗)^{α−3} + Δmax Σ_{t=1}^n P{Ft = 1}
≤ Σ_{i∈A′} ( 8σ²α log ω∗ / Δi + Δi ) + (2ΔmaxK²/((α − 2)(α − 3))) (K/ω∗)^{α−3} + 2ΔmaxK(α − 1)/(α − 2).

Setting α = 4 leads to

ERn ≤ Σ_{i∈A′} ( 32σ² log ω∗ / Δi + Δi ) + 3ΔmaxK + ΔmaxK³/ω∗.

6 Lower Bounds and Ambiguous Examples

We prove lower bounds for two illustrative examples of structured bandits. Some previous work is also relevant. The famous paper by Lai and Robbins [18] shows that the bound of Theorem 2 cannot in general be greatly improved. Many of the techniques here are borrowed from Bubeck et al. [12]. Given a fixed algorithm and varying θ we denote the regret and expectation by Rn(θ) and Eθ respectively. Returns are assumed to be sampled from a normal distribution with unit variance, so that σ² = 1. The proofs of the following theorems may be found in the supplementary material.

[Figure 3: Counter-examples. Four two-armed structured bandits (a)-(d) plotting µ1 and µ2 against θ ∈ [−1, 1], with the gap Δ marked in panel (b); the figure again contains a hidden message.]

Theorem 8. Given the structured bandit depicted in Figure 3(a) or Figure 2(c), for all θ > 0 and all algorithms the regret satisfies max{E−θRn(−θ), EθRn(θ)} ≥ 1/(8θ) for sufficiently large n.

Theorem 9.
Let \u0398,{\u00b51, \u00b52} be a structured bandit where returns are sampled from a normal dis-\ntribution with unit variance. Assume there exists a pair \u03b81, \u03b82 \u2208 \u0398 and constant \u2206 > 0 such that\n\u00b51(\u03b81) = \u00b51(\u03b82) and \u00b51(\u03b81) \u2265 \u00b52(\u03b81) + \u2206 and \u00b52(\u03b82) \u2265 \u00b51(\u03b82) + \u2206. Then the following hold:\n(1) E\u03b81 Rn(\u03b81) \u2265 1+log 2n\u22062\n(2) E\u03b82 Rn(\u03b82) \u2265 n\u2206\nA natural example where the conditions are satis\ufb01ed is depicted in Figure 3.(b) and by choosing \u03b81 =\n\u22121, \u03b82 = 1. We know from Theorem 3 that UCB-S enjoys \ufb01nite regret of E\u03b82Rn(\u03b82) \u2208 O( 1\n\u2206 log 1\n\u2206 )\nand logarithmic regret E\u03b81Rn(\u03b81) \u2208 O( 1\n\u2206 log n). Part 1 of Theorem 9 shows that if we demand\n\ufb01nite regret E\u03b82Rn(\u03b82) \u2208 O(1), then the regret E\u03b81 Rn(\u03b81) is necessarily logarithmic. On the other\n\n2 exp (\u22124E\u03b81Rn(\u03b81)\u2206) \u2212 E\u03b81 Rn(\u03b81)\n\nE\u03b82 Rn(\u03b82)\n\n\u2212 1\n\n8\u2206\n\n2\n\n7\n\n\fhand, part 2 shows that if we demand E\u03b81Rn(\u03b81) \u2208 o(log(n)), then the regret E\u03b82Rn(\u03b82) \u2208 \u2126(n).\nTherefore the trade-off made by UCB-S essentially cannot be improved.\n\nDiscussion of Figure 3.(c/d). In both examples there is an ambiguous region for which the lower\nbound (Theorem 9) does not show that logarithmic regret is unavoidable, but where Theorem 3\ncannot be applied to show that UCB-S achieves \ufb01nite regret. We managed to show that \ufb01nite regret\nis possible in both cases by using a different algorithm. For (c) we could construct a carefully\ntuned algorithm for which the regret was at most O(1) if \u03b8 \u2264 0 and O( 1\n\u03b8 ) otherwise. This\nresult contradicts a claim by Bubeck et. al. [12, Thm. 8]. 
Additional discussion of the ambiguous case in general, as well as this specific example, may be found in the supplementary material. One observation is that unbridled optimism is the cause of the failure of UCB-S in these cases. This is illustrated by Figure 3(d) with θ ≤ 0. No matter how narrow the confidence interval about µ1, if the second action has not been taken sufficiently often, then there will still be some belief that θ > 0 is possible where the second action is optimistic, which leads to logarithmic regret. Adapting the algorithm to be slightly risk averse solves this problem.

7 Experiments

We tested Algorithm 1 on a selection of structured bandits depicted in Figure 2 and compared to UCB [6, 9]. Rewards were sampled from normal distributions with unit variances. For UCB we chose α = 2, while we used the theoretically justified α = 4 for Algorithm 1. All code is available in the supplementary material. Each data-point is the average of 500 independent samples with the blue crosses and red squares indicating the regret of UCB-S and UCB respectively.

[Two plots of the average regret ÊθRn(θ) for K = 2, µ1(θ) = θ, µ2(θ) = −θ (see Figure 2(a)): regret against θ ∈ [−0.2, 0.2] with n = 50 000, and regret against n ≤ 10⁵ with θ = 0.04.]

The results show that Algorithm 1 typically out-performs regular UCB. The exception is the top right experiment where UCB performs slightly better for θ < 0. This is not surprising, since in this case the structured version of UCB cannot exploit the additional structure and suffers due to worse constant factors.
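A minimal re-implementation in this spirit (our own sketch, not the authors' released code): for the two-arm family µ1(θ) = θ, µ2(θ) = −θ the confidence set Θ̃t is an intersection of two intervals, so the optimistic arm can be computed in closed form. The radius used for "vanilla UCB" below is an assumption for illustration, not the paper's exact tuning:

```python
import math, random

def pull(theta_star, arm):
    mean = theta_star if arm == 0 else -theta_star
    return random.gauss(mean, 1.0)  # unit-variance normal rewards

def run(theta_star, n, structured, alpha):
    counts, sums = [0, 0], [0.0, 0.0]
    for t in range(1, n + 1):
        if min(counts) == 0:
            arm = counts.index(0)            # play each arm once first
        else:
            rad = [math.sqrt(2 * alpha * math.log(t) / counts[i]) for i in (0, 1)]
            m = [sums[i] / counts[i] for i in (0, 1)]
            if structured:
                # Confidence set for theta: intersection of the intervals
                # implied by |theta - m0| < rad0 and |-theta - m1| < rad1.
                lo = max(m[0] - rad[0], -m[1] - rad[1])
                hi = min(m[0] + rad[0], -m[1] + rad[1])
                if lo >= hi:
                    arm = 0                        # empty set: arbitrary choice
                else:
                    # sup mu_1 over the set is hi, sup mu_2 is -lo
                    arm = 0 if hi >= -lo else 1    # optimistic arm
            else:
                arm = 0 if m[0] + rad[0] >= m[1] + rad[1] else 1  # vanilla UCB
        counts[arm] += 1
        sums[arm] += pull(theta_star, arm)
    gap = 2 * abs(theta_star)
    bad = 1 if theta_star > 0 else 0
    return gap * counts[bad]                       # realised pseudo-regret

random.seed(0)
n, theta = 5000, 0.3
print(run(theta, n, structured=True, alpha=4))   # UCB-S: regret should plateau
print(run(theta, n, structured=False, alpha=2))  # UCB: regret grows with log n
```

Averaging such runs over many seeds and sweeping θ or n reproduces the qualitative shape of the plots described above, with the separation between the two algorithms widening as the horizon grows.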
On the other hand, if θ > 0, then UCB endures logarithmic regret and performs significantly worse than its structured counterpart. The superiority of Algorithm 1 would be accentuated in the top left and bottom right experiments by increasing the horizon.

[Two further plots of ÊθRn(θ) against θ ∈ [−1, 1] with n = 50 000: K = 2, µ1(θ) = 0, µ2(θ) = θ (see Figure 2(b)), and K = 2, µ1(θ) = θ·1{θ > 0}, µ2(θ) = −θ·1{θ < 0} (see Figure 2(c)).]

8 Conclusion

The limitation of the new approach is that the proof techniques and algorithm are most suited to the case where the number of actions is relatively small. Generalising the techniques to large action spaces is therefore an important open problem. There is still a small gap between the upper and lower bounds, and the lower bounds have only been proven for special examples. Proving a general problem-dependent lower bound is an interesting question, but probably extremely challenging given the flexibility of the setting. We are also curious to know if there exist problems for which the optimal regret is somewhere between finite and logarithmic. Another question is that of how to define Thompson sampling for structured bandits. Thompson sampling has recently attracted a great deal of attention [14, 2, 15, 3, 10], but so far we are unable even to define an algorithm resembling Thompson sampling for the general structured bandit problem.

Acknowledgements. Tor Lattimore was supported by the Google Australia Fellowship for Machine Learning and the Alberta Innovates Technology Futures, NSERC. The majority of this work was completed while Rémi Munos was visiting Microsoft Research, New England. This research was partially supported by the European Community's Seventh Framework Programme under grant agreements no.
270327 (project CompLACS).

References

[1] Rajeev Agrawal, Demosthenis Teneketzis, and Venkatachalam Anantharam. Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space. IEEE Transactions on Automatic Control, 34(12):1249–1259, 1989.

[2] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory, 2012.

[3] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, volume 31, pages 99–107, 2013.

[4] Kareem Amin, Michael Kearns, and Umar Syed. Bandits, query learning, and the haystack dimension. Journal of Machine Learning Research - Proceedings Track, 19:87–106, 2011.

[5] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Variance estimates and exploration function in multi-armed bandit. Technical Report 07-31, Certis - Ecole des Ponts, 2007.

[6] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.

[7] Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

[8] Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Sequential transfer in multi-armed bandit with finite set of models. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2220–2228. Curran Associates, Inc., 2013.

[9] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning.
Now Publishers Incorporated, 2012.

[10] Sébastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems, pages 638–646, 2013.

[11] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online optimization in X-armed bandits. In NIPS, pages 201–208, 2008.

[12] Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Annual Conference on Learning Theory, 2013.

[13] Todd L Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715–743, 1997.

[14] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory, pages 199–213. Springer, 2012.

[15] Nathaniel Korda, Emilie Kaufmann, and Rémi Munos. Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, pages 1448–1456, 2013.

[16] Tze Leung Lai and Herbert Robbins. Asymptotically optimal allocation of treatments in sequential experiments. In T. J. Santner and A. C. Tamhane, editors, Design of Experiments: Ranking and Selection, pages 127–142. 1984.

[17] Tze Leung Lai and Herbert Robbins. Optimal sequential sampling from two populations. Proceedings of the National Academy of Sciences, 81(4):1284–1286, 1984.

[18] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[19] Adam J Mersereau, Paat Rusmevichientong, and John N Tsitsiklis. A structured multiarmed bandit problem and the greedy policy.
IEEE Transactions on Automatic Control, 54(12):2787–2802, 2009.

[20] Dan Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.