{"title": "Improved Algorithms for Linear Stochastic Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 2312, "page_last": 2320, "abstract": "We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer\u2019s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales.", "full_text": "Improved Algorithms for Linear Stochastic Bandits\n\nYasin Abbasi-Yadkori\nabbasiya@ualberta.ca\nDept. of Computing Science\n\nUniversity of Alberta\n\nDávid Pál\n\ndpal@google.com\n\nDept. of Computing Science\n\nUniversity of Alberta\n\nCsaba Szepesvári\n\nszepesva@ualberta.ca\nDept. of Computing Science\n\nUniversity of Alberta\n\nAbstract\n\nWe improve the theoretical analysis and empirical performance of algorithms for\nthe stochastic multi-armed bandit problem and the linear stochastic multi-armed\nbandit problem. In particular, we show that a simple modification of Auer\u2019s\nUCB algorithm (Auer, 2002) achieves with high probability constant regret.\nMore importantly, we modify and, consequently, improve the analysis of the\nalgorithm for the linear stochastic bandit problem studied by Auer (2002),\nDani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al.
(2010).\nOur modi\ufb01cation improves the regret bound by a logarithmic factor, though\nexperiments show a vast improvement.\nIn both cases, the improvement stems\nfrom the construction of smaller con\ufb01dence sets. For their construction we use a\nnovel tail inequality for vector-valued martingales.\n\n1\n\nIntroduction\n\nLinear stochastic bandit problem is a sequential decision-making problem where in each time step\nwe have to choose an action, and as a response we receive a stochastic reward, expected value of\nwhich is an unknown linear function of the action. The goal is to collect as much reward as possible\nover the course of n time steps. The precise model is described in Section 1.2.\nSeveral variants and special cases of the problem exist differing on what the set of available\nactions is in each round. For example, the standard stochastic d-armed bandit problem, introduced\nby Robbins (1952) and then studied by Lai and Robbins (1985), is a special case of linear stochastic\nbandit problem where the set of available actions in each round is the standard orthonormal basis of\nRd. Another variant, studied by Auer (2002) under the name \u201clinear reinforcement learning\u201d, and\nlater in the context of web advertisement by Li et al. (2010), Chu et al. (2011), is a variant when the\nset of available actions changes from time step to time step, but has the same \ufb01nite cardinality in\neach step. Another variant dubbed \u201csleeping bandits\u201d, studied by Kleinberg et al. (2008), is the case\nwhen the set of available actions changes from time step to time step, but it is always a subset of the\nstandard orthonormal basis of Rd. Another variant, studied by Dani et al. (2008), Abbasi-Yadkori\net al. (2009), Rusmevichientong and Tsitsiklis (2010), is the case when the set of available actions\ndoes not change between time steps but the set can be an almost arbitrary, even in\ufb01nite, bounded\nsubset of a \ufb01nite-dimensional vector space. 
Related problems were also studied by Abe et al.\n(2003), Walsh et al. (2009), Dekel et al. (2010).\nIn all these works, the algorithms are based on the same underlying idea\u2014the optimism-in-the-\nface-of-uncertainty (OFU) principle. This is not surprising since they are solving almost the same\nproblem. The OFU principle elegantly solves the exploration-exploitation dilemma inherent in the\nproblem. The basic idea of the principle is to maintain a con\ufb01dence set for the vector of coef\ufb01cients\nof the linear function.\nIn every round, the algorithm chooses an estimate from the con\ufb01dence\nset and an action so that the predicted reward is maximized, i.e., estimate-action pair is chosen\noptimistically. We give details of the algorithm in Section 2.\n\n1\n\n\fThus the problem reduces to the construction of con\ufb01dence sets for the vector of coef\ufb01cients of the\nlinear function based on the action-reward pairs observed in the past time steps. This is not an easy\nproblem, because the future actions are not independent of the actions taken in the past (since the\nalgorithm\u2019s choices of future actions depend on the random con\ufb01dence set constructed from past\ndata). In fact, several authors (Auer, 2000, Li et al., 2010, Walsh et al., 2009) fell victim of making\na mistake because they did not recognize this issue. Correct solutions require new martingale\ntechniques which we provide here.\nThe smaller con\ufb01dence sets one is able to construct, the better regret bounds one obtains for\nthe resulting algorithm, and, more importantly, the better the algorithm performs empirically.\nWith our new technique, we vastly reduce the size of the con\ufb01dence sets of Dani et al. (2008)\nand Rusmevichientong and Tsitsiklis (2010). 
First, our con\ufb01dence sets are valid uniformly over all\ntime steps, which immediately saves log(n) factor by avoiding the otherwise needed union bound.\nSecond, our con\ufb01dence sets are \u201cmore empirical\u201d in the sense that some worst-case quantities from\nthe old bounds are replaced by empirical quantities that are always smaller, sometimes substantially.\nAs a result, our experiments show an order-of-magnitude improvement over the CONFIDENCEBALL\nalgorithm of Dani et al. (2008). To construct our con\ufb01dence sets, we prove a new martingale tail\ninequality. The new inequality is derived using techniques from the theory of self-normalized\nprocesses (de la Pe\u02dcna et al., 2004, 2009).\nUsing our con\ufb01dence sets, we modify the UCB algorithm (Auer, 2002) for the d-armed bandit prob-\nlem and show that with probability 1 , the regret of the modi\ufb01ed algorithm is O(d log(1/)/)\nwhere is the difference between the expected rewards of the best and the second best action.\nIn particular, note that the regret does not depend on n. This seemingly contradicts the result\nof Lai and Robbins (1985) who showed that the expected regret of any algorithm is at least\n(Pi6=i\u21e4\n) o(1)) log n where pi\u21e4 and pi are the reward distributions of the optimal arm\nand arm i respectively and D is the Kullback-Leibler divergence. However, our algorithm receives\n as an input, and thus its expected regret depends on . With = 1/n our algorithm has the same\nexpected regret bound, O((d log n)/), as Auer (2002) has shown for UCB.\nFor the general linear stochastic bandit problem, we improve regret of the CONFIDENCEBALL\n\n1/D(pj | pi\u21e4\n\nalgorithm of Dani et al. (2008). They showed that its regret is at most O(d log(n)pn log(n/))\nwith probability at least 1 . 
We modify their algorithm so that it uses our new confidence sets and we show that its regret is at most $O(d \log(n) \sqrt{n} + \sqrt{d n \log(n/\delta)})$, which is roughly a multiplicative factor $\sqrt{\log(n)}$ improvement. Dani et al. (2008) also prove a problem dependent regret bound. Namely, they show that the regret of their algorithm is $O(\frac{d^2}{\Delta} \log(n/\delta) \log^2(n))$ where $\Delta$ is the \u201cgap\u201d as defined in (Dani et al., 2008). For our modified algorithm we prove an improved $O(\frac{\log(1/\delta)}{\Delta} (\log(n) + d \log\log n)^2)$ bound.\n\n1.1 Notation\nWe use $\|x\|_p$ to denote the $p$-norm of a vector $x \in \mathbb{R}^d$. For a positive definite matrix $A \in \mathbb{R}^{d \times d}$, the weighted 2-norm of a vector $x \in \mathbb{R}^d$ is defined by $\|x\|_A = \sqrt{x^\top A x}$. The inner product is denoted by $\langle \cdot, \cdot \rangle$ and the weighted inner product by $x^\top A y = \langle x, y \rangle_A$. We use $\lambda_{\min}(A)$ to denote the minimum eigenvalue of the positive definite matrix $A$. For any sequence $\{a_t\}_{t=0}^{\infty}$ we denote by $a_{i:j}$ the sub-sequence $a_i, a_{i+1}, \dots, a_j$.\n\n1.2 The Learning Model\nIn each round $t$, the learner is given a decision set $D_t \subseteq \mathbb{R}^d$ from which he has to choose an action $X_t$. Subsequently he observes reward $Y_t = \langle X_t, \theta^* \rangle + \eta_t$ where $\theta^* \in \mathbb{R}^d$ is an unknown parameter and $\eta_t$ is a random noise satisfying $\mathbb{E}[\eta_t \mid X_{1:t}, \eta_{1:t-1}] = 0$ and some tail-constraints, to be specified soon.\nThe goal of the learner is to maximize his total reward $\sum_{t=1}^{n} \langle X_t, \theta^* \rangle$ accumulated over the course of $n$ rounds. Clearly, with the knowledge of $\theta^*$, the optimal strategy is to choose in round $t$ the point $x_t^* = \mathrm{argmax}_{x \in D_t} \langle x, \theta^* \rangle$ that maximizes the reward. This strategy would accumulate total reward $\sum_{t=1}^{n} \langle x_t^*, \theta^* \rangle$. It is thus natural to evaluate the learner relative to this optimal strategy. The difference of the learner\u2019s total reward and the total reward of the optimal strategy is called the\n\n\ffor t := 1, 2, . . .
do\n\n(Xt,e\u2713t) = argmax(x,\u2713)2Dt\u21e5Ct1 hx, \u2713i\n\nPlay Xt and observe reward Yt\nUpdate Ct\n\nend for\n\nFigure 1: OFUL ALGORITHM\n\npseudo-regret (Audibert et al., 2009) of the algorithm and it can be formally written as\n\nRn = nXt=1\n\nhx\u21e4t ,\u2713 \u21e4i! nXt=1\n\nhXt,\u2713 \u21e4i! =\n\nnXt=1\n\nhx\u21e4t Xt,\u2713 \u21e4i .\n\nAs compared to the regret, the pseudo-regret has the same expected value, but lower variance\nbecause the additive noise \u2318t is removed. However, the omitted quantity is uncontrollable, hence\nwe have no interest in including it in our results (the omitted quantity would also cancel, if \u2318t was a\nsequence which is independently selected of X1:t.) In what follows, for simplicity we use the word\nregret instead of the more precise pseudo-regret in connection to Rn.\nThe goal of the algorithm is to keep the regret Rn as low as possible. As a bare minimum, we\nrequire that the algorithm is Hannan consistent, i.e., Rn/n ! 0 with probability one.\nIn order to obtain meaningful upper bounds on the regret, we will place assumptions on {Dt}1t=1,\n\u2713\u21e4 and the distribution of {\u2318t}1t=1. Roughly speaking, we will need to assume that {Dt}1t=1 lies in\na bounded set. We elaborate on the details of the assumptions later in the paper.\nHowever, we state the precise assumption on the noise sequence {\u2318t}1t=1 now. We will assume that\n\u2318t is conditionally R-sub-Gaussian where R 0 is a \ufb01xed constant. Formally, this means that\n\n8 2 R\n\nE\u21e5e\u2318t | X1:t,\u2318 1:t1\u21e4 \uf8ff exp\u2713 2R2\n2 \u25c6 .\n\nThe sub-Gaussian condition automatically implies that E[\u2318t | X1:t,\u2318 1:t1] = 0. Furthermore, it\nalso implies that Var[\u2318t | Ft] \uf8ff R2 and thus we can think of R2 as the (conditional) variance of\nthe noise. 
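As a quick numerical illustration (ours, not from the paper), the sub-Gaussian condition can be checked by Monte Carlo for a bounded zero-mean noise; a uniform noise on $[-R, R]$ is $R$-sub-Gaussian by Hoeffding's lemma, and the chosen $R$ and sample size below are arbitrary:

```python
import numpy as np

# Monte Carlo check (illustrative only): for eta ~ Uniform[-R, R],
# E[exp(lambda * eta)] <= exp(lambda^2 R^2 / 2) for every real lambda.
rng = np.random.default_rng(0)
R = 0.5
eta = rng.uniform(-R, R, size=1_000_000)   # zero-mean noise bounded in [-R, R]

for lam in (-4.0, -1.0, 0.5, 2.0):
    mgf = np.exp(lam * eta).mean()         # empirical moment generating function
    bound = np.exp(lam ** 2 * R ** 2 / 2)  # sub-Gaussian bound with this R
    assert mgf <= bound
```

The same check fails for heavier-tailed noise, which is exactly what the tail-constraint rules out.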
An example of R-sub-Gaussian \u2318t is a zero-mean Gaussian noise with variance at most\nR2, or a bounded zero-mean noise lying in an interval of length at most 2R.\n\n2 Optimism in the Face of Uncertainty\n\nA natural and successful way to design an algorithm is the optimism in the face of uncertainty\nprinciple (OFU). The basic idea is that the algorithm maintains a con\ufb01dence set Ct1 \u2713 Rd\nfor the parameter \u2713\u21e4.\nIt is required that Ct1 can be calculated from X1, X2, . . . , Xt1 and\nY1, Y2, . . . , Yt1 and \u201cwith high probability\u201d \u2713\u21e4 lies in Ct1. The algorithm chooses an optimistic\n\nestimatee\u2713t = argmax\u27132Ct1(maxx2Dt hx, \u2713i) and then chooses action Xt = argmaxx2DtDx,e\u2713tE\nwhich maximizes the reward according to the estimatee\u2713t. Equivalently, and more compactly, the\n\nalgorithm chooses the pair\n\nwhich jointly maximizes the reward. We call the resulting algorithm the OFUL ALGORITHM for\n\u201coptimism in the face of uncertainty linear bandit algorithm\u201d. Pseudo-code of the algorithm is given\nin Figure 1.\nThe crux of the problem is the construction of the con\ufb01dence sets Ct. This construction is the\nsubject of the next section.\n\n(Xt,e\u2713t) = argmax\n\n(x,\u2713)2Dt\u21e5Ct1 hx, \u2713i ,\n\n3 Self-Normalized Tail Inequality for Vector-Valued Martingales\nSince the decision sets {Dt}1t=1 can be arbitrary, the sequence of actions Xt 2 Dt is arbitrary as\nwell. Even if {Dt}1t=1 is \u201cwell-behaved\u201d, the selection rule that OFUL uses to choose Xt 2 Dt\n\n3\n\n\fgenerates a sequence {Xt}1t=1 with complicated stochastic dependencies that are hard to handle.\nTherefore, for the purpose of deriving con\ufb01dence sets it is easier to drop any assumptions on\n{Xt}1t=1 and pursue a more general result.\nIf we consider the -algebra Ft = (X1, X2, . . . , Xt+1,\u2318 1,\u2318 2, . . . ,\u2318 t) then Xt becomes Ft1-\nmeasurable and \u2318t becomes Ft-measurable. 
Relaxing this a little bit, we can assume that {Ft}1t=0 is\nany \ufb01ltration of -algebras such that for any t 1, Xt is Ft1-measurable and \u2318t is Ft-measurable\nand therefore Yt = hXt,\u2713 \u21e4i + \u2318t is Ft-measurable. This is the setup we consider for derivation of\nthe con\ufb01dence sets.\nThe sequence {St}1t=0, St =Pt\ns=1 \u2318tXt, is a martingale with respect {Ft}1t=0 which happens to\nbe crucial for the construction of the con\ufb01dence sets for \u2713\u21e4. The following theorem shows that with\nhigh probability the martingale stays close to zero. Its proof is given in Appendix A\nTheorem 1 (Self-Normalized Bound for Vector-Valued Martingales). Let {Ft}1t=0 be a \ufb01ltration.\nLet {\u2318t}1t=1 be a real-valued stochastic process such that \u2318t is Ft-measurable and \u2318t is conditionally\nR-sub-Gaussian for some R 0 i.e.\n\n8 2 R\n\nE\u21e5e\u2318t | Ft1\u21e4 \uf8ff exp\u2713 2R2\n2 \u25c6 .\n\nLet {Xt}1t=1 be an Rd-valued stochastic process such that Xt is Ft1-measurable. Assume that V\nis a d \u21e5 d positive de\ufb01nite matrix. For any t 0, de\ufb01ne\n\nV t = V +\n\nThen, for any > 0, with probability at least 1 , for all t 0,\n\n\u2318sXs .\n\nSt =\n\nXsX>s\n\ntXs=1\ntXs=1\n\u25c6 .\nt \uf8ff 2R2 log\u2713 det(V t)1/2 det(V )1/2\n\n\n\nkStk2\n\nV 1\n\nNote that the deviation of the martingale kStk2\nV 1\nt which is itself derived from the martingale, hence the name \u201cself-normalized bound\u201d.\n\nis measured by the norm weighted by the matrix\n\nV 1\n\nt\n\n4 Construction of Con\ufb01dence Sets\n\n(1)\n\nb\u2713t = (X>1:tX1:t + I)1X>1:tY1:t\n\nwhere X1:t is the matrix whose rows are X>1 , X>2 , . . . , X>t and Y1:t = (Y1, . . . , Yt)>. The\n\nLetb\u2713t be the `2-regularized least-squares estimate of \u2713\u21e4 with regularization parameter > 0:\nfollowing theorem shows that \u2713\u21e4 lies with high probability in an ellipsoid with center atb\u2713t. 
Its proof\ncan be found in Appendix B.\nTheorem 2 (Con\ufb01dence Ellipsoid). Assume the same as in Theorem 1, let V = I, > 0, de\ufb01ne\nYt = hXt,\u2713 \u21e4i + \u2318t and assume that k\u2713\u21e4k2 \uf8ff S. Then, for any > 0, with probability at least\n1 , for all t 0, \u2713\u21e4 lies in the set\n\u25c6 + 1/2 S9=;\nCt =8<:\n\u2713 2 Rd : b\u2713t \u2713V t \uf8ff Rs2 log\u2713 det(V t)1/2 det(I)1/2\nC0t =(\u2713 2 Rd : b\u2713t \u2713V t \uf8ff Rsd log\u2713 1 + tL2/\n\u25c6 + 1/2 S) .\n\nFurthermore, if for all t 1, kXtk2 \uf8ff L then with probability at least 1 , for all t 0, \u2713\u21e4 lies\nin the set\n\n\n\n\n\n.\n\n4\n\n\f8\n3\n\n(2)\n\nThe above bound could be compared with a similar bound of Dani et al. (2008) whose bound, under\nidentical conditions, states that (with appropriate initialization) with probability 1 ,\nlog\u2713 t2\nfor all t large enough\n\nb\u2713t \u2713\u21e4V t \uf8ff R max(s128 d log(t) log\u2713 t2\n\u25c6,\n\n\u25c6) ,\nwhere large enough means that t satis\ufb01es 0 << t 2e1/16. Denote bypt() the right-hand side\nin the last bound. The restriction on t comes from the fact that t() 2d(1 + 2 log(t)) is needed\nin the proof of the last inequality of their Theorem 5.\nOn the other hand, Rusmevichientong and Tsitsiklis (2010) proved that for any \ufb01xed t 2, for any\n0 << 1, with probability at least 1 ,\nwhere \uf8ff = p3 + 2 log((L2 + trace(V ))/. To get a uniform bound one can use a union bound\nwith t = /t2. ThenP1t=2 t = ( \u21e12\n6 1) \uf8ff . This thus gives that for any 0 << 1, with\nprobability at least 1 ,\n\nb\u2713t \u2713\u21e4V t \uf8ff 2 \uf8ff2Rplog tpd log(t) + log(1/) + 1/2S ,\nb\u2713t \u2713\u21e4V t \uf8ff 2\uf8ff2Rplog tpd log(t) + log(t2/) + 1/2S ,\n\nThis is tighter than (2), but is still lagging behind the result of Theorem 2. 
Note that the new con\ufb01-\ndence set seems to require the computation of a determinant of a matrix, a potentially expensive step.\nHowever, one can speed up the computation by using the matrix determinant lemma, exploiting that\nthe matrix whose determinant is needed is obtained via a rank-one update (cf. the proof of Lemma 11\nin the Appendix). This way, the determinant can be kept up-to-date with linear time computation.\n\n8t 2,\n\n5 Regret Analysis of the OFUL ALGORITHM\n\nWe now give a bound on the regret of the OFUL algorithm when run with con\ufb01dence sets Cn\nconstructed in Theorem 2 in the previous section. We will need to assume that expected rewards\nare bounded. We can view this as a bound on \u2713\u21e4 and the bound on the decision sets Dt. The next\ntheorem states a bound on the regret of the algorithm. Its proof can be found in Appendix C.\nTheorem 3 (The regret of the OFUL algorithm). Assume that for all t and all x 2 Dt,\nhx, \u2713\u21e4i 2 [1, 1]. Then, with probability at least 1 , the regret of the OFUL algorithm satis\ufb01es\n\n8n 0, Rn \uf8ff 4pnd log( + nL/d)\u21e31/2S + Rp2 log(1/) + d log(1 + nL/(d))\u2318 .\n\nFigure 2 shows the experiments with the new con\ufb01dence set. The regret of OFUL is signi\ufb01cantly\nbetter compared to the regret of CONFIDENCEBALL of Dani et al. (2008). The \ufb01gure also shows\na version of the algorithm that has a similar regret to the algorithm with the new bound, but spends\nabout 350 times less computation in this experiment. Next, we explain how we can achieve this\ncomputation saving.\n\n5.1 Saving Computation\n\nIn this section, we show that we essentially need to recomputee\u2713t only O(log n) times up to time\nn and hence saving computations.1 The idea is to recomputee\u2713t whenever det(Vt) increases by a\nconstant factor (1 + C). We call the resulting algorithm the RARELY SWITCHING OFUL algorithm\nand its pseudo-code is given in Figure 3. 
As the next theorem shows its regret bound is essentially\nthe same as the regret for OFUL.\nTheorem 4. Under the same assumptions as in Theorem 3, with probability at least 1 , for all\nn 0, the regret of the RARELY SWITCHING OFUL ALGORITHM satis\ufb01es\nd\u25c6 + 2 log\n\nRn \uf8ff 4s(1 + C)nd log\u2713 +\n\nd \u25c6(pS + Rsd log\u27131 +\n\n) + 4rd log\n\nn\nd\n\nnL\n\nnL\n\n1\n\n.\n\n1Note this is very different than the common \u201cdoubling trick\u201d in online learning literature. The doubling is\nused to cope with a different problem. Namely, the problem when the time horizon n is unknown ahead of time.\n\n5\n\n\ft\n\ne\nr\ng\ne\nR\n\n3000\n\n2500\n\n2000\n\n1500\n\n1000\n\n500\n\n \n\n0\n0\n\nNew bound\nOld bound\nNew bound with rare switching\n\n \n\n2000\n\n4000\n\nTime\n\n6000\n\n8000\n\n10000\n\nFigure 2: The application of the new bound to a linear bandit problem. A 2-dimensional linear\nbandit, where the parameters vector and the actions are from the unit ball. The regret of OFUL is\nsigni\ufb01cantly better compared to the regret of CONFIDENCEBALL of Dani et al. (2008). The noise\nis a zero mean Gaussian with standard deviation = 0.1. The probability that con\ufb01dence sets fail\nis = 0.0001. The experiments are repeated 10 times.\n\nInput: Constant C > 0\n\nfor t := 1, 2, . . . do\n\nend for\n\nif det(Vt) > (1 + C) det(V\u2327 ) then\n\n\u2327 = 1 {This is the last time step that we changede\u2713t}\n(Xt,e\u2713t) = argmax(x,\u2713)2Dt\u21e5Ct1 h\u2713, xi.\nXt = argmaxx2DtDe\u2713\u2327 , xE.\n\nPlay Xt and observe reward Yt.\n\n\u2327 = t.\n\nend if\n\nFigure 3: The RARELY SWITCHING OFUL ALGORITHM\n\n0.2\n\nt\n\ne\nr\ng\ne\nr\n \ne\ng\na\nr\ne\nv\nA\n\n0.15\n\n0.1\n\n0.05\n\n0\n\n0\n\n0.2\n\n0.4\n\n0.6\n\n0.8\n\n1\n\nC\n\nFigure 4: Regret against computation. We \ufb01xed the number of times the algorithm is allowed to\nupdate its action in OFUL. 
For larger values of C, the algorithm changes action less frequently,\nhence, will play for a longer time period. The \ufb01gure shows the average regret obtained during the\ngiven time periods for the different values of C. Thus, we see that by increasing C, one can actually\nlower the average regret per time step for a given \ufb01xed computation budget.\n\nThe proof of the theorem is given in Appendix D. Figure 4 shows a simple experiment with the\nRARELY SWITCHING OFUL ALGORITHM.\n\n5.2 Problem Dependent Bound\n\nLet t be the \u201cgap\u201d at time step t as de\ufb01ned in (Dani et al., 2008). (Intuitively, t is the difference\nbetween the rewards of the best and the \u201csecond best\u201d action in the decision set Dt.) We consider\n\n6\n\n\fthe smallest gap \u00afn = min1\uf8fft\uf8ffn t. This includes the case when the set Dt is the same polytope\nin every round or the case when Dt is \ufb01nite.\nThe regret of OFUL can be upper bounded in terms of ( \u00afn)n as follows.\nTheorem 5. Assume that 1 and k\u2713\u21e4k2 \uf8ff S where S 1. With probability at least 1 , for\nall n 1, the regret of the OFUL satis\ufb01es\n\u21e3log(Ln) + (d 1) log\n\n64R2S2L\n\n16R2S2\n\nRn \uf8ff\n\n\u00afn\n\n+ 2(d 1) log\u2713d log\n\n\u00af2\nn\nd + nL2\n\nd\n\n+ 2 log(1/)\u2318 + 2 log(1/)\u25c62\n\n.\n\nThe proof of the theorem can be found in the Appendix E.\nThe problem dependent regret of (Dani et al., 2008) scales like O( d2\n (log2 n + d log n + d2 log log n)), where = inf n \u00afn.\nscales like O( 1\n\n log3 n), while our bound\n\n6 Multi-Armed Bandit Problem\n\nIn this section we show that a modi\ufb01ed version of UCB has with high probability constant regret.\nLet \u00b5i be the expected reward of action i = 1, 2, . . . , d. Let \u00b5\u21e4 = max1\uf8ffi\uf8ffd \u00b5i be the expected\nreward of the best arm, and let i = \u00b5\u21e4 \u00b5i, i = 1, 2, . . . , d, be the \u201cgaps\u201d with respect to the\nbest arm. 
We assume that if we choose action It in round t we obtain reward \u00b5It + \u2318t. Let Ni,t\ndenote the number of times that we have played action i up to time t, and X i,t denote the average\nof the rewards received by action i up to time t. We construct con\ufb01dence intervals for the expected\nrewards \u00b5i based on X i,t in the following lemma. (The proof can be found in the Appendix F.)\nLemma 6 (Con\ufb01dence Intervals). Assuming that the noise \u2318t is conditionally 1-sub-Gaussian. With\nprobability at least 1 ,\n\n8i 2{ 1, 2, . . . , d}, 8t 0\n\n|X i,t \u00b5i|\uf8ff ci,t ,\n\nwhere\n\nci,t =s (1 + Ni,t)\n\nN 2\ni,t\n\n\u27131 + 2 log\u2713 d(1 + Ni,t)1/2\n\n\n\n\u25c6\u25c6 .\n\n(3)\n\nUsing these con\ufb01dence intervals, we modify the UCB algorithm of Auer et al. (2002) and change\nthe action selection rule accordingly. Hence, at time t, we choose the action\n\nIt = argmax\n\ni\n\nX i,t + ci,t.\n\n(4)\n\nWe call this algorithm UCB().\nThe main difference between UCB() and UCB is that the length of con\ufb01dence interval ci,t depends\nneither on n, nor on t. This allows us to prove the following result that the regret of UCB() is\nconstant. (The proof can be found in the Appendix G.)\nTheorem 7 (Regret of UCB()). Assume that the noise \u2318t is conditionally 1-sub-Gaussian, with\nprobability at least 1 , the total regret of the UCB() is bounded as\ni\u25c6 .\n\nRn \uf8ff Xi:i>0\u27133i +\n\n16\ni\n\nlog\n\n2d\n\nLai and Robbins (1985) prove that for any suboptimal arm j,\n\nwhere, p\u21e4 and pj are the reward density of the optimal arm and arm j respectively, and D is the\nKL-divergence. 
This lower bound does not contradict Theorem 7, as Theorem 7 only states a high\n\nE Ni,t \n\nlog t\n\n,\n\nD(pj, p\u21e4)\n\n7\n\n\f800\n\n600\n\n400\n\n200\n\nt\n\ne\nr\ng\ne\nR\n\n \n\n0\n0\n\nNew bound\nOld bound\n\n \n\n2000\n\n4000\n\nTime\n\n6000\n\n8000\n\n10000\n\nFigure 5: The regret of UCB() against-time when it uses either the con\ufb01dence bound based on\nHoeffding\u2019s inequality, or the bound in (3). The results are shown for a 10-armed bandit problem,\nwhere the mean value of each arm is \ufb01xed to some values in [0, 1]. The regret of UCB() is\nimproved with the new bound. The noise is a zero-mean Gaussian with standard deviation = 0.1.\nThe value of is set to 0.0001. The experiments are repeated 10 times and the average is shown,\ntogether with the error bars.\n\nprobability upper bound for the regret. Note that UCB() takes delta as its input. Because with\nprobability , the regret in time t can be t, on expectation, the algorithm might have a regret of t.\nNow if we select = 1/t, then we get O(log t) upper bound on the expected regret.\nIf one is interested in an average regret result, then, with slight modi\ufb01cation of the proof technique\none can obtain an identical result to what Auer et al. (2002) proves.\nFigure 5 shows the regret of UCB() when it uses either the con\ufb01dence bound based on Hoeffding\u2019s\ninequality, or the bound in (3). As can be seen, the regret of UCB() is improved with the new bound.\nCoquelin and Munos (2007), Audibert et al. (2009) prove similar high-probability constant regret\nbounds for variations of the UCB algorithm. Compared to their bounds, our bound is tighter thanks\nto that with the new self-normalized tail inequality we can avoid one union bound. 
The improve-\nment can also be seen in experiment as the curve that we get for the performance of the algorithm\nof Coquelin and Munos (2007) is almost exactly the same as the curve that is labeled \u201cOld Bound\u201d\nin Figure 5.\n\n7 Conclusions\n\nIn this paper, we showed how a novel tail inequality for vector-valued martingales allows one to\nimprove both the theoretical analysis and empirical performance of algorithms for various stochastic\nbandit problems. In particular, we show that a simple modi\ufb01cation of Auer\u2019s UCB algorithm (Auer,\n2002) achieves with high probability constant regret. Further, we modify and improve the analysis\nof the algorithm for the for linear stochastic bandit problem studied by Auer (2002), Dani et al.\n(2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modi\ufb01cation improves the\nregret bound by a logarithmic factor, though experiments show a vast improvement, stemming\nfrom the construction of smaller con\ufb01dence sets. To our knowledge, ours is the \ufb01rst, theoretically\nwell-founded algorithm, whose performance is practical for this latter problem. We also proposed\na novel variant of the algorithm with which we can save a large amount of computation without\nsacri\ufb01cing performance.\nWe expect that the novel tail inequality will also be useful in a number of other situations thanks\nto its self-normalized form and that it holds for stopped martingales and thus can be used to derive\nbounds that hold uniformly in time. In general, the new inequality can be used to improve deviation\nbounds which use a union bound (over time). Since many modern machine learning techniques\nrely on having tight high-probability bounds, we expect that the new inequality will \ufb01nd many\napplications. Just to mention a few examples, the new inequality could be used to improve the\ncomputational complexity of the HOO algorithm Bubeck et al. 
(2008) (when it is used with a \ufb01xed\n, by avoiding union bounds, or the need to know the horizon, or the doubling trick) or to improve\nthe bounds derived by Garivier and Moulines (2008) for UCB for changing environments, or the\nstopping rules and racing algorithms of Mnih et al. (2008).\n\n8\n\n\fReferences\nY. Abbasi-Yadkori, A. Antos, and Cs. Szepesv\u00b4ari. Forced-exploration based algorithms for playing\nin stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, 2009.\nN. Abe, A. W. Biermann, and P. M. Long. Reinforcement learning with immediate rewards and\n\nlinear hypotheses. Algorithmica, 37:263293, 2003.\n\nA. Antos, V. Grover, and Cs. Szepesv\u00b4ari. Active learning in heteroscedastic noise. Theoretical\n\nComputer Science, 411(29-30):2712\u20132728, 2010.\n\nJ.-Y. Audibert, R. Munos, and Csaba Szepesv\u00b4ari. Exploration-exploitation tradeoff using variance\n\nestimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876\u20131902, 2009.\n\nP. Auer. Using upper con\ufb01dence bounds for online learning. In FOCS, pages 270\u2013279, 2000.\nP. Auer. Using con\ufb01dence bounds for exploitation-exploration trade-offs. JMLR, 2002.\nP. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem.\n\nMachine Learning, 47(2-3):235\u2013256, 2002.\n\nS. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesv\u00b4ari. Online optimization in X-armed bandits. In\n\nNIPS, pages 201\u2013208, 2008.\n\nN. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. 2006.\nW. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In\n\nAISTATS, 2011.\n\nP.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In UAI, 2007.\nV. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In\n\nRocco Servedio and Tong Zhang, editors, COLT, pages 355\u2013366, 2008.\n\nV. H. de la Pe\u02dcna, M. J. Klass, and T. L. Lai. 
Self-normalized processes: exponential inequalities,\n\nmoment bounds and iterated logarithm laws. Annals of Probability, 32(3):1902\u20131933, 2004.\n\nV. H. de la Pe\u02dcna, T. L. Lai, and Q.-M. Shao. Self-normalized processes: Limit theory and Statistical\n\nApplications. Springer, 2009.\n\nO. Dekel, C. Gentile, and K. Sridharan. Robust selective sampling from single and multiple\n\nteachers. In COLT, 2010.\n\nD. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3(1):100\u2013118,\n\n1975.\n\nA. Garivier and E. Moulines. On upper-con\ufb01dence bound policies for non-stationary bandit\n\nproblems. Technical report, LTCI, 2008.\n\nR. Kleinberg, A. Niculescu-Mizil, and Y. Sharma. Regret bounds for sleeping experts and bandits.\n\nMachine learning, pages 1\u201328, 2008.\n\nT. L. Lai and H. Robbins. Asymptotically ef\ufb01cient adaptive allocation rules. Advances in Applied\n\nMathematics, 6:4\u201322, 1985.\n\nT. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applications\nto identi\ufb01cation and control of dynamic systems. The Annals of Statistics, 10(1):154\u2013166, 1982.\nT. L. Lai, H. Robbins, and C. Z. Wei. Strong consistency of least squares estimates in multiple\n\nregression. Proceedings of the National Academy of Sciences, 75(7):3034\u20133036, 1979.\n\nL. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news\nIn Proceedings of the 19th International Conference on World Wide\n\narticle recommendation.\nWeb (WWW 2010), pages 661\u2013670. ACM, 2010.\n\nV. Mnih, Cs. Szepesv\u00b4ari, and J.-Y. Audibert. Empirical Bernstein stopping. pages 672\u2013679, 2008.\nH. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American\n\nMathematical Society, 58:527\u2013535, 1952.\n\nP. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of\n\nOperations Research, 35(2):395\u2013411, 2010.\n\nG. W. 
Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.\nT. J. Walsh, I. Szita, C. Diuk, and M. L. Littman. Exploring compact reinforcement-learning\n\nrepresentations with linear regression. In UAI, pages 591\u2013598. AUAI Press, 2009.\n\n9\n\n\f", "award": [], "sourceid": 1243, "authors": [{"given_name": "Yasin", "family_name": "Abbasi-yadkori", "institution": null}, {"given_name": "D\u00e1vid", "family_name": "P\u00e1l", "institution": null}, {"given_name": "Csaba", "family_name": "Szepesv\u00e1ri", "institution": null}]}