{"title": "Prior-free and prior-dependent regret bounds for Thompson Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 638, "page_last": 646, "abstract": "We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit than the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by $14 \\sqrt{n K}$. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by $\\frac{1}{20} \\sqrt{n K}$. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.", "full_text": "Prior-free and prior-dependent regret bounds for\n\nThompson Sampling\n\nS\u00b4ebastien Bubeck, Che-Yu Liu\n\nDepartment of Operations Research and Financial Engineering,\n\nsbubeck@princeton.edu, cheliu@princeton.edu\n\nPrinceton University\n\nAbstract\n\nWe consider the stochastic multi-armed bandit problem with a prior distribution\non the reward distributions. 
We are interested in studying prior-free and prior-dependent regret bounds, very much in the same spirit as the usual distribution-free and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by 14√(nK). This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by (1/20)√(nK). We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors.\n\n1 Introduction\n\nIn this paper we are interested in the Bayesian multi-armed bandit problem which can be described as follows. Let π0 be a known distribution over some set Θ, and let θ be a random variable distributed according to π0. For i ∈ [K], let (Xi,s)s≥1 be identically distributed random variables taking values in [0, 1] and which are independent conditionally on θ. Denote μi(θ) := E(Xi,1|θ). Consider now an agent facing K actions (or arms). At each time step t = 1, . . . , n, the agent pulls an arm It ∈ [K]. The agent receives the reward Xi,s when he pulls arm i for the sth time. The arm selection is based only on past observed rewards and potentially on an external source of randomness. More formally, let (Us)s≥1 be an i.i.d. sequence of random variables uniformly distributed on [0, 1], and let Ti(s) = Σ_{t=1}^s 1{It = i}; then It is a random variable measurable with respect to σ(I1, X1,1, . . . , It−1, X_{I_{t−1},T_{I_{t−1}}(t−1)}, Ut). 
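As a concrete illustration of the interaction protocol just described, the following is a minimal simulation sketch. It is not from the paper: the choice of prior (independent uniform means over Bernoulli arms), the uniform-random baseline policy, and all function names (sample_prior, play, uniform_policy) are ours.

```python
import random

def sample_prior(K, rng):
    # theta ~ pi_0: here, a toy prior with independent uniform means on [0, 1]
    return [rng.random() for _ in range(K)]

def play(K, n, policy, rng):
    """Run one interaction of length n; return the realized regret for this theta."""
    theta = sample_prior(K, rng)
    best = max(theta)
    regret = 0.0
    history = []  # past (arm, reward) pairs; the policy may only use these
    for _ in range(n):
        i = policy(K, history, rng)                    # arm selection
        x = 1.0 if rng.random() < theta[i] else 0.0    # reward X_{i,s} in [0, 1]
        history.append((i, x))
        regret += best - theta[i]
    return regret

def uniform_policy(K, history, rng):
    # baseline agent: ignores the history entirely
    return rng.randrange(K)

# The Bayesian regret BR_n is the expectation of the realized regret over
# theta ~ pi_0, the rewards, and the policy's randomness; a Monte Carlo
# average over repeated runs estimates it:
rng = random.Random(0)
estimate = sum(play(3, 100, uniform_policy, rng) for _ in range(200)) / 200
```

Replacing uniform_policy by a smarter agent (e.g. the sampling rule discussed below) is the object of the paper's bounds.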
We measure the performance of the agent through the Bayesian regret defined as\n\nBRn = E Σ_{t=1}^n ( max_{i∈[K]} μi(θ) − μIt(θ) ),\n\nwhere the expectation is taken with respect to the parameter θ, the rewards (Xi,s)s≥1, and the external source of randomness (Us)s≥1. We will also be interested in the individual regret Rn(θ) which is defined similarly except that θ is fixed (instead of being integrated over π0). When it is clear from the context we drop the dependency on θ in the various quantities defined above.\n\nGiven a prior π0 the problem of finding an optimal strategy to minimize the Bayesian regret BRn is a well defined optimization problem and as such it is merely a computational problem. On the other hand the point of view initially developed in Robbins [1952] leads to a learning problem. In this latter view the agent's strategy must have a low regret Rn(θ) for any θ ∈ Θ. Both formulations of the problem have a long history and we refer the interested reader to Bubeck and Cesa-Bianchi [2012] for a survey of the extensive recent literature on the learning setting. In the Bayesian setting a major breakthrough was achieved in Gittins [1979] where it was shown that when the prior distribution takes a product form an optimal strategy is given by the Gittins indices (which are relatively easy to compute). The product assumption on the prior means that the reward processes (Xi,s)s≥1 are independent across arms. In the present paper we are precisely interested in the situations where this assumption is not satisfied. 
Indeed we believe that one of the strengths of the Bayesian setting is that one can incorporate prior knowledge on the arms in a very transparent way. A prototypical example that we shall consider later on in this paper is when one knows the distributions of the arms up to a permutation, in which case the reward processes are strongly dependent.\n\nIn general without the product assumption on the prior it seems hopeless (from a computational perspective) to look for the optimal Bayesian strategy. Thus, despite being in a Bayesian setting, it makes sense to view it as a learning problem and to evaluate the agent's performance through its Bayesian regret. In this paper we are particularly interested in studying the Thompson Sampling strategy which was proposed in the very first paper on the multi-armed bandit problem, Thompson [1933]. This strategy can be described very succinctly: let πt be the posterior distribution on θ given the history Ht = (I1, X1,1, . . . , It−1, X_{I_{t−1},T_{I_{t−1}}(t−1)}) of the algorithm up to the beginning of round t. Then Thompson Sampling first draws a parameter θ(t) from πt (independently from the past given πt) and it pulls It ∈ argmax_{i∈[K]} μi(θ(t)).\n\nRecently there has been a surge of interest in this simple policy, mainly because of its flexibility to incorporate prior knowledge on the arms, see for example Chapelle and Li [2011]. For a long time the theoretical properties of Thompson Sampling remained elusive. The specific case of binary rewards with a Beta prior is now very well understood thanks to the papers Agrawal and Goyal [2012a], Kaufmann et al. [2012], Agrawal and Goyal [2012b]. However, as we pointed out above, here we are interested in proving regret bounds for the more realistic scenario where one runs Thompson Sampling with a hand-tuned prior distribution, possibly very different from a Beta prior. 
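The sampling rule described above can be sketched in a few lines. The sketch below uses the classical Beta-Bernoulli special case only because it keeps the posterior in closed form; the paper allows arbitrary, possibly dependent priors, and the function names here are ours.

```python
import random

def thompson_step(successes, failures, rng):
    """Draw theta(t) from the posterior pi_t and pull an arm maximizing mu_i(theta(t)).

    With Bernoulli rewards and independent Beta(1, 1) priors, the posterior of
    arm i's mean after s successes and f failures is Beta(1 + s, 1 + f).
    """
    samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def run(theta, n, rng):
    K = len(theta)
    succ, fail, pulls = [0] * K, [0] * K, [0] * K
    for _ in range(n):
        i = thompson_step(succ, fail, rng)
        if rng.random() < theta[i]:
            succ[i] += 1
        else:
            fail[i] += 1
        pulls[i] += 1
    return pulls

rng = random.Random(42)
pulls = run([0.2, 0.8], 2000, rng)  # arm 1 has the larger mean and should dominate
```

The only algorithm-specific ingredient is the posterior sampling in thompson_step; with a hand-tuned, dependent prior the Beta posterior would be replaced by whatever posterior that prior induces.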
The first result in that spirit was obtained very recently by Russo and Van Roy [2013] who showed that for any prior distribution π0 Thompson Sampling always satisfies BRn ≤ 5√(nK log n). A similar bound was proved in Agrawal and Goyal [2012b] for the specific case of Beta priors¹. Our first contribution is to show in Section 2 that the extraneous logarithmic factor in these bounds can be removed by using ideas reminiscent of the MOSS algorithm of Audibert and Bubeck [2009].\n\nOur second contribution is to show that Thompson Sampling can take advantage of the properties of some non-trivial priors to attain much better regret guarantees. More precisely in Sections 3 and 4 we consider the setting of Bubeck et al. [2013] (which we call the BPR setting) where μ∗ and ε > 0 are known values such that for any θ ∈ Θ, first there is a unique best arm {i∗(θ)} = argmax_{i∈[K]} μi(θ), and furthermore\n\nμ_{i∗(θ)}(θ) = μ∗, and Δi(θ) := μ_{i∗(θ)}(θ) − μi(θ) ≥ ε, ∀i ≠ i∗(θ).\n\nIn other words the value of the best arm is known as well as a non-trivial lower bound on the gap between the values of the best and second best arms. For this problem a new algorithm was proposed in Bubeck et al. [2013] (which we call the BPR policy), and it was shown that the BPR policy satisfies\n\nRn(θ) = O( Σ_{i≠i∗(θ)} (log(Δi(θ)/ε)/Δi(θ)) log log(1/ε) ), ∀θ ∈ Θ, ∀n ≥ 1.\n\nThus the BPR policy attains a regret uniformly bounded over time in the BPR setting, a feature that standard bandit algorithms such as UCB of Auer et al. [2002] cannot achieve. 
It is natural to view the assumptions of the BPR setting as a prior over the reward distributions and to ask what regret guarantees Thompson Sampling attains in that situation. More precisely we consider Thompson Sampling with Gaussian reward distributions and a uniform prior over the possible range of parameters. We then prove individual regret bounds for any sub-Gaussian distributions (similarly to Bubeck et al. [2013]). We obtain that Thompson Sampling uses the prior information optimally in the sense that its regret is also uniformly bounded over time. Furthermore as an added bonus we remove the extraneous log-log factor of the BPR policy's regret bound.\n\n¹Note however that the result of Agrawal and Goyal [2012b] applies to the individual regret Rn(θ) while the result of Russo and Van Roy [2013] only applies to the integrated Bayesian regret BRn.\n\nThe results presented in Sections 3 and 4 can be viewed as a first step towards a better understanding of prior-dependent regret bounds for Thompson Sampling. Generalizing these results to arbitrary priors is a challenging open problem which is beyond the scope of our current techniques.\n\n2 Optimal prior-free regret bound for Thompson Sampling\n\nIn this section we prove the following result.\n\nTheorem 1 For any prior distribution π0 over reward distributions in [0, 1], Thompson Sampling satisfies\n\nBRn ≤ 14√(nK).\n\nRemark that the above result is unimprovable in the sense that there exist prior distributions π0 such that for any algorithm one has Rn ≥ (1/20)√(nK) (see e.g. [Theorem 3.5, Bubeck and Cesa-Bianchi [2012]]). This theorem also implies an optimal rate of identification of the best arm, see Bubeck et al. [2009] for more details on this.\n\nProof We decompose the proof into three steps. 
We denote i∗(θ) ∈ argmax_{i∈[K]} μi(θ); in particular one has It = i∗(θ(t)).\n\nStep 1: rewriting of the Bayesian regret in terms of upper confidence bounds. This step is given by [Proposition 1, Russo and Van Roy [2013]] which we reprove for the sake of completeness. Let Bi,t be a random variable measurable with respect to σ(Ht). Note that by definition θ(t) and θ are identically distributed conditionally on Ht. This implies by the tower rule:\n\nE B_{i∗(θ),t} = E B_{i∗(θ(t)),t} = E B_{It,t}.\n\nThus we obtain:\n\nE( μ_{i∗(θ)}(θ) − μIt(θ) ) = E( μ_{i∗(θ)}(θ) − B_{i∗(θ),t} ) + E( B_{It,t} − μIt(θ) ).\n\nInspired by the MOSS strategy of Audibert and Bubeck [2009] we will now take\n\nBi,t = μ̂_{i,Ti(t−1)} + √( log₊( n/(K Ti(t−1)) ) / Ti(t−1) ),\n\nwhere μ̂_{i,s} = (1/s) Σ_{t=1}^s Xi,t, and log₊(x) = log(x) 1{x ≥ 1}. In the following we denote δ0 = 2√(K/n). From now on we work conditionally on θ and thus we drop all the dependency on θ.\n\nStep 2: control of E( μ_{i∗(θ)}(θ) − B_{i∗(θ),t} | θ ). By a simple integration of the deviations one has\n\nE( μ_{i∗} − B_{i∗,t} ) ≤ δ0 + ∫_{δ0}^1 P( μ_{i∗} − B_{i∗,t} ≥ u ) du.\n\nNext we extract the following inequality from Audibert and Bubeck [2010] (see p2683–2684): for any i ∈ [K],\n\nP( μi − Bi,t ≥ u ) ≤ (4K/(nu²)) log( √(n/K) u ) + 1/( nu²/K − 1 ).\n\nA direct computation gives\n\n∫_{δ0}^1 (4K/(nu²)) log( √(n/K) u ) du = [ −(4K/(nu)) log( e√(n/K) u ) ]_{δ0}^1 ≤ (4K/(nδ0)) log( e√(n/K) δ0 ) = 2(1 + log 2) √(K/n),\n\nand\n\n∫_{δ0}^1 du/( nu²/K − 1 ) = [ (1/2)√(K/n) log( (√(n/K) u − 1)/(√(n/K) u + 1) ) ]_{δ0}^1 ≤ (1/2)√(K/n) log( (√(n/K) δ0 + 1)/(√(n/K) δ0 − 1) ) = ((log 3)/2) √(K/n).\n\nThus we proved:\n\nE( μ_{i∗(θ)}(θ) − B_{i∗(θ),t} | θ ) ≤ ( 2 + 2(1 + log 2) + (log 3)/2 ) √(K/n) ≤ 6√(K/n).\n\nStep 3: control of Σ_{t=1}^n E( B_{It,t} − μIt(θ) | θ ). We start again by integrating the deviations:\n\nE Σ_{t=1}^n ( B_{It,t} − μIt ) ≤ δ0 n + ∫_{δ0}^{+∞} Σ_{t=1}^n P( B_{It,t} − μIt ≥ u ) du.\n\nIt is easy to see that one has:\n\nΣ_{t=1}^n 1{ B_{It,t} − μIt ≥ u } ≤ Σ_{i=1}^K Σ_{s=1}^n 1{ μ̂_{i,s} + √( log₊(n/(Ks))/s ) − μi ≥ u },\n\nwhich implies\n\nΣ_{t=1}^n P( B_{It,t} − μIt ≥ u ) ≤ Σ_{i=1}^K Σ_{s=1}^n P( μ̂_{i,s} + √( log₊(n/(Ks))/s ) − μi ≥ u ).\n\nNow for u ≥ δ0 let s(u) = ⌈ 3 log( nu²/K ) / u² ⌉, where ⌈x⌉ is the smallest integer larger than x, and let c = 1 − 1/√3. It is easy to see that one has:\n\nΣ_{s=1}^n P( μ̂_{i,s} + √( log₊(n/(Ks))/s ) − μi ≥ u ) ≤ 3 log( nu²/K ) / u² + Σ_{s=s(u)}^n P( μ̂_{i,s} − μi ≥ cu ).\n\nUsing an integration already done in Step 2 we have\n\n∫_{δ0}^{+∞} 3 log( nu²/K ) / u² du ≤ 3(1 + log 2) √(n/K) ≤ 5.1 √(n/K).\n\nNext using Hoeffding's inequality and the fact that the rewards are in [0, 1] one has for u ≥ δ0:\n\nΣ_{s=s(u)}^n P( μ̂_{i,s} − μi ≥ cu ) ≤ Σ_{s=s(u)}^n exp( −2sc²u² ) 1{u ≤ 1/c} ≤ ( exp( −12c² log 2 ) / ( 1 − exp( −2c²u² ) ) ) 1{u ≤ 1/c}.\n\nNow using that 1 − exp(−x) ≥ x − x²/2 for x ≥ 0 one obtains\n\n∫_{δ0}^{1/c} du/( 1 − exp(−2c²u²) ) = ∫_{δ0}^{1/(2c)} du/( 1 − exp(−2c²u²) ) + ∫_{1/(2c)}^{1/c} du/( 1 − exp(−2c²u²) ) ≤ ∫_{δ0}^{1/(2c)} du/( 2c²u² − 2c⁴u⁴ ) + 1/( 2c(1 − exp(−1/2)) ) ≤ ∫_{δ0}^{1/(2c)} 2/(3c²u²) du + 1/( 2c(1 − exp(−1/2)) ).\n\nNow an elementary integration gives\n\n∫_{δ0}^{1/(2c)} 2/(3c²u²) du = 2/(3c²δ0) − 4/(3c),\n\nand thus\n\n∫_{δ0}^{1/c} du/( 1 − exp(−2c²u²) ) ≤ 2/(3c²δ0) − 4/(3c) + 1/( 2c(1 − exp(−1/2)) ) ≤ 1.9 √(n/K).\n\nPutting the pieces together we proved\n\nE Σ_{t=1}^n ( B_{It,t} − μIt ) ≤ 7.6 √(nK),\n\nwhich concludes the proof together with the results of Step 1 and Step 2.\n\n3 Thompson Sampling in the two-armed BPR setting\n\nFollowing [Section 2, Bubeck et al. 
[2013]] we consider here the two-armed bandit problem with sub-Gaussian reward distributions (that is they satisfy E e^{λ(X−μ)} ≤ e^{λ²/2} for all λ ∈ R) and such that one reward distribution has mean μ∗ and the other one has mean μ∗ − Δ, where μ∗ and Δ are known values.\n\nIn order to derive the Thompson Sampling strategy for this problem we further assume that the reward distributions are in fact Gaussian with variance 1. In other words let Θ = {θ1, θ2}, π0(θ1) = π0(θ2) = 1/2, and under θ1 one has X1,s ∼ N(μ∗, 1) and X2,s ∼ N(μ∗ − Δ, 1) while under θ2 one has X2,s ∼ N(μ∗, 1) and X1,s ∼ N(μ∗ − Δ, 1). Then a straightforward computation (using Bayes rule and induction) shows that one has for some normalizing constant c > 0:\n\nπt(θ1) = c exp( −(1/2) Σ_{s=1}^{T1(t−1)} (μ∗ − X1,s)² − (1/2) Σ_{s=1}^{T2(t−1)} (μ∗ − Δ − X2,s)² ),\n\nπt(θ2) = c exp( −(1/2) Σ_{s=1}^{T1(t−1)} (μ∗ − Δ − X1,s)² − (1/2) Σ_{s=1}^{T2(t−1)} (μ∗ − X2,s)² ).\n\nRecall that Thompson Sampling draws θ(t) from πt and then pulls the best arm for the environment θ(t). Observe that under θ1 the best arm is arm 1 and under θ2 the best arm is arm 2. In other words Thompson Sampling draws It at random with the probabilities given by the posterior πt. This leads to a general algorithm for the two-armed BPR setting with sub-Gaussian reward distributions that we summarize in Figure 1. The next result shows that it attains optimal performance in this setting up to a numerical constant (see Bubeck et al. 
[2013] for lower bounds), for any sub-Gaussian reward distribution (not necessarily Gaussian) with largest mean μ∗ and gap Δ.\n\nFor rounds t ∈ {1, 2}, select arm It = t.\nFor each round t = 3, 4, . . . play It at random from pt where\n\npt(1) = c exp( −(1/2) Σ_{s=1}^{T1(t−1)} (μ∗ − X1,s)² − (1/2) Σ_{s=1}^{T2(t−1)} (μ∗ − Δ − X2,s)² ),\n\npt(2) = c exp( −(1/2) Σ_{s=1}^{T1(t−1)} (μ∗ − Δ − X1,s)² − (1/2) Σ_{s=1}^{T2(t−1)} (μ∗ − X2,s)² ),\n\nand c > 0 is such that pt(1) + pt(2) = 1.\n\nFigure 1: Policy inspired by Thompson Sampling for the two-armed BPR setting.\n\nTheorem 2 The policy of Figure 1 has regret bounded as Rn ≤ Δ + 578/Δ, uniformly in n.\n\n[Figure 2: two panels (μ∗ = 0, Δ = 0.2 and μ∗ = 0, Δ = 0.05) plotting the rescaled regret Rn against time n for Policy 1 from Bubeck et al. [2013] and the policy of Figure 1.]\n\nFigure 2: Empirical comparison of the policy of Figure 1 and Policy 1 of Bubeck et al. [2013] on Gaussian reward distributions with variance 1.\n\nNote that we did not try to optimize the numerical constant in the above bound. Figure 2 shows an empirical comparison of the policy of Figure 1 with Policy 1 of Bubeck et al. [2013]. 
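The probabilities in Figure 1 can be computed directly from the observed rewards. The sketch below is our illustration, not code from the paper; the log-sum-exp normalization (computing the quadratic sums in log-space before exponentiating) is an implementation choice of ours to avoid underflow when the sums grow with t, and the function names are hypothetical.

```python
import math

def bpr2_probs(mu_star, delta, rewards1, rewards2):
    """Return (p_t(1), p_t(2)) of Figure 1 given the rewards seen on each arm."""
    def neg_half_sq(xs, mean):
        # -(1/2) * sum over observed rewards of (mean - x)^2
        return -0.5 * sum((mean - x) ** 2 for x in xs)

    # log-weight of "arm 1 is optimal": arm 1 has mean mu*, arm 2 has mean mu* - delta
    l1 = neg_half_sq(rewards1, mu_star) + neg_half_sq(rewards2, mu_star - delta)
    # log-weight of "arm 2 is optimal": the roles are swapped
    l2 = neg_half_sq(rewards1, mu_star - delta) + neg_half_sq(rewards2, mu_star)
    m = max(l1, l2)  # log-sum-exp shift for numerical stability
    w1, w2 = math.exp(l1 - m), math.exp(l2 - m)
    return w1 / (w1 + w2), w2 / (w1 + w2)

# After the two forced rounds t = 1, 2, round 3 randomizes with these odds:
p1, p2 = bpr2_probs(mu_star=0.0, delta=0.2, rewards1=[0.1], rewards2=[-0.3])
```

Here arm 1's single observation is close to μ∗ while arm 2's is close to μ∗ − Δ, so the policy should favor arm 1 (p1 > p2).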
Note in particular that a regret bound of order 16/Δ was proved for the latter algorithm, and the (limited) numerical simulation presented here suggests that Thompson Sampling outperforms this strategy.\n\nProof Without loss of generality we assume that arm 1 is the optimal arm, that is μ1 = μ∗ and μ2 = μ∗ − Δ. Let μ̂_{i,s} = (1/s) Σ_{t=1}^s Xi,t, γ̂_{1,s} = μ1 − μ̂_{1,s} and γ̂_{2,s} = μ̂_{2,s} − μ2. Note that large (positive) values of γ̂_{1,s} or γ̂_{2,s} might mislead the algorithm into bad decisions, and we will need to control what happens in various regimes for these γ coefficients. We decompose the proof into three steps.\n\nStep 1. This first step will be useful in the rest of the analysis, it shows how the probability ratio of a bad pull over a good pull evolves as a function of the γ coefficients introduced above. One has:\n\npt(2)/pt(1) = exp( −(1/2) Σ_{s=1}^{T1(t−1)} [ (μ2 − X1,s)² − (μ1 − X1,s)² ] − (1/2) Σ_{s=1}^{T2(t−1)} [ (μ1 − X2,s)² − (μ2 − X2,s)² ] )\n= exp( −(T1(t−1)/2) [ μ2² − μ1² − 2(μ2 − μ1) μ̂_{1,T1(t−1)} ] − (T2(t−1)/2) [ μ1² − μ2² − 2(μ1 − μ2) μ̂_{2,T2(t−1)} ] )\n= exp( −(T1(t−1)/2) [ Δ² − 2Δ( μ1 − μ̂_{1,T1(t−1)} ) ] − (T2(t−1)/2) [ Δ² − 2Δ( μ̂_{2,T2(t−1)} − μ2 ) ] )\n= exp( −tΔ²/2 + T1(t−1) Δ γ̂_{1,T1(t−1)} + T2(t−1) Δ γ̂_{2,T2(t−1)} ).\n\nStep 2. 
We decompose the regret Rn as follows:\n\nRn/Δ = 1 + E Σ_{t=3}^n 1{It = 2}\n= 1 + E Σ_{t=3}^n 1{ γ̂_{2,T2(t−1)} > Δ/4, It = 2 } + E Σ_{t=3}^n 1{ γ̂_{2,T2(t−1)} ≤ Δ/4, γ̂_{1,T1(t−1)} ≤ Δ/4, It = 2 } + E Σ_{t=3}^n 1{ γ̂_{2,T2(t−1)} ≤ Δ/4, γ̂_{1,T1(t−1)} > Δ/4, It = 2 }.\n\nWe use Hoeffding's inequality to control the first term:\n\nE Σ_{t=3}^n 1{ γ̂_{2,T2(t−1)} > Δ/4, It = 2 } ≤ E Σ_{s=1}^n 1{ γ̂_{2,s} > Δ/4 } ≤ Σ_{s=1}^n exp( −sΔ²/32 ) ≤ 32/Δ².\n\nFor the second term, using the rewriting of Step 1 as an upper bound on pt(2), one obtains:\n\nE Σ_{t=3}^n 1{ γ̂_{2,T2(t−1)} ≤ Δ/4, γ̂_{1,T1(t−1)} ≤ Δ/4, It = 2 } = Σ_{t=3}^n E( pt(2) 1{ γ̂_{2,T2(t−1)} ≤ Δ/4, γ̂_{1,T1(t−1)} ≤ Δ/4 } ) ≤ Σ_{t=3}^n exp( −tΔ²/4 ) ≤ 4/Δ².\n\nThe third term is more difficult to control, and we further decompose the corresponding event as follows:\n\n{ γ̂_{2,T2(t−1)} ≤ Δ/4, γ̂_{1,T1(t−1)} > Δ/4, It = 2 } ⊂ { γ̂_{1,T1(t−1)} > Δ/4, T1(t−1) > t/4 } ∪ { γ̂_{2,T2(t−1)} ≤ Δ/4, It = 2, T1(t−1) ≤ t/4 }.\n\nThe cumulative probability of the first event in the above decomposition is easy to control thanks to Hoeffding's maximal inequality², which states that for any m ≥ 1 and x > 0 one has\n\nP( ∃ 1 ≤ s ≤ m s.t. sγ̂_{1,s} ≥ x ) ≤ exp( −x²/(2m) ).\n\nIndeed this implies\n\nP( γ̂_{1,T1(t−1)} > Δ/4, T1(t−1) > t/4 ) ≤ P( ∃ 1 ≤ s ≤ t s.t. sγ̂_{1,s} > Δt/16 ) ≤ exp( −tΔ²/512 ),\n\nand thus\n\nE Σ_{t=3}^n 1{ γ̂_{1,T1(t−1)} > Δ/4, T1(t−1) > t/4 } ≤ 512/Δ².\n\nIt only remains to control the term\n\nE Σ_{t=3}^n 1{ γ̂_{2,T2(t−1)} ≤ Δ/4, It = 2, T1(t−1) ≤ t/4 } = Σ_{t=3}^n E( pt(2) 1{ γ̂_{2,T2(t−1)} ≤ Δ/4, T1(t−1) ≤ t/4 } ) ≤ Σ_{t=3}^n E exp( −tΔ²/4 + Δ max_{1≤s≤t/4} sγ̂_{1,s} ),\n\nwhere the last inequality follows from Step 1. The last step is devoted to bounding from above this last term.\n\nStep 3. By integrating the deviations and using again Hoeffding's maximal inequality one obtains\n\nE exp( Δ max_{1≤s≤t/4} sγ̂_{1,s} ) ≤ 1 + ∫_1^{+∞} P( max_{1≤s≤t/4} sγ̂_{1,s} ≥ (log x)/Δ ) dx ≤ 1 + ∫_1^{+∞} exp( −2(log x)²/(Δ²t) ) dx.\n\nNow, straightforward computation gives\n\nΣ_{t=3}^n exp( −tΔ²/4 ) ( 1 + ∫_1^{+∞} exp( −2(log x)²/(Δ²t) ) dx ) ≤ Σ_{t=3}^n exp( −tΔ²/4 ) ( 1 + √(πΔ²t/2) exp( tΔ²/8 ) ) ≤ 4/Δ² + ∫_0^{+∞} √(πΔ²t/2) exp( −tΔ²/8 ) dt ≤ 4/Δ² + (16√π/Δ²) ∫_0^{+∞} √u exp(−u) du ≤ 30/Δ²,\n\nwhich concludes the proof by putting this together with the results of the previous step.\n\n²It is an easy exercise to verify that Azuma-Hoeffding holds for martingale differences with sub-Gaussian increments, which implies Hoeffding's maximal inequality for sub-Gaussian distributions.\n\n4 Optimal strategy for the BPR setting inspired by Thompson Sampling\n\nIn this section we consider the general BPR setting. That is the reward distributions are sub-Gaussian (they satisfy E e^{λ(X−μ)} ≤ e^{λ²/2} for all λ ∈ R), one reward distribution has mean μ∗, and all the other means are smaller than μ∗ − ε, where μ∗ and ε are known values.\n\nSimilarly to the previous section we assume that the reward distributions are Gaussian with variance 1 for the derivation of the Thompson Sampling strategy (but we do not make this assumption for the analysis of the resulting algorithm). Then the set of possible parameters is described as follows:\n\nΘ = ∪_{i=1}^K Θi where Θi = { θ ∈ R^K s.t. θi = μ∗ and θj ≤ μ∗ − ε for all j ≠ i }.\n\nAssuming a uniform prior over the index of the best arm, and a prior λ over the mean of a suboptimal arm, one obtains by Bayes rule the probability density function of the posterior given below. Now remark that with Thompson Sampling arm i is played at time t if and only if θ(t) ∈ Θi. 
dπt(θ) ∝ exp( −(1/2) Σ_{j=1}^K Σ_{s=1}^{Tj(t−1)} (Xj,s − θj)² ) Π_{j=1, j≠i∗(θ)}^K dλ(θj).\n\nIn other words It is played at random from probability pt where\n\npt(i) = πt(Θi) ∝ exp( −(1/2) Σ_{s=1}^{Ti(t−1)} (Xi,s − μ∗)² ) Π_{j≠i} [ ∫_{−∞}^{μ∗−ε} exp( −(1/2) Σ_{s=1}^{Tj(t−1)} (Xj,s − v)² ) dλ(v) ]\n∝ exp( −(1/2) Σ_{s=1}^{Ti(t−1)} (Xi,s − μ∗)² ) / ∫_{−∞}^{μ∗−ε} exp( −(1/2) Σ_{s=1}^{Ti(t−1)} (Xi,s − v)² ) dλ(v).\n\nTaking inspiration from the above calculation we consider the following policy, where λ is the Lebesgue measure and we assume a slightly larger value for the variance (this is necessary for the proof).\n\nFor rounds t ∈ [K], select arm It = t.\nFor each round t = K + 1, K + 2, . . . play It at random from pt where\n\npt(i) = c exp( −(1/3) Σ_{s=1}^{Ti(t−1)} (Xi,s − μ∗)² ) / ∫_{−∞}^{μ∗−ε} exp( −(1/3) Σ_{s=1}^{Ti(t−1)} (Xi,s − v)² ) dv,\n\nand c > 0 is such that Σ_{i=1}^K pt(i) = 1.\n\nFigure 3: Policy inspired by Thompson Sampling for the BPR setting.\n\nThe following theorem shows that this policy attains the best known performance for the BPR setting, shaving off a log-log term in the regret bound of the BPR policy.\n\nTheorem 3 The policy of Figure 3 has regret bounded as Rn ≤ Σ_{i:Δi>0} ( Δi + (80 + log(Δi/ε))/Δi ), uniformly in n.\n\nThe proof of this result is fairly technical and it is deferred to the supplementary material.\n\nReferences\n\nS. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. 
In Proceedings of the 25th Annual Conference on Learning Theory (COLT), 2012a.\n\nS. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling, 2012b. arXiv:1209.3353.\n\nJ.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 2009.\n\nJ.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2635–2686, 2010.\n\nP. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 47(2-3):235–256, 2002.\n\nS. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.\n\nS. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory (ALT), 2009.\n\nS. Bubeck, V. Perchet, and P. Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), 2013.\n\nO. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NIPS), 2011.\n\nJ.C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B, 41:148–177, 1979.\n\nE. Kaufmann, N. Korda, and R. Munos. Thompson sampling: an asymptotically optimal finite-time analysis. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory (ALT), 2012.\n\nH. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.\n\nD. Russo and B. Van Roy. Learning to optimize via posterior sampling, 2013. arXiv:1301.2609.\n\nW. Thompson. 
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.\n", "award": [], "sourceid": 386, "authors": [{"given_name": "Sebastien", "family_name": "Bubeck", "institution": "Princeton University"}, {"given_name": "Che-Yu", "family_name": "Liu", "institution": "Princeton University"}]}