{"title": "Connections Between Mirror Descent, Thompson Sampling and the Information Ratio", "book": "Advances in Neural Information Processing Systems", "page_first": 11973, "page_last": 11982, "abstract": "The information-theoretic analysis by Russo and Van Roy [2014] in combination with minimax duality has proved a powerful tool for the analysis of online learning algorithms in full and partial information settings. In most applications there is a tantalising similarity to the classical analysis based on mirror descent. We make a formal connection, showing that the information-theoretic bounds in most applications are derived from existing techniques from online convex optimisation. Besides this, we improve the best known regret guarantees for $k$-armed adversarial bandits, online linear optimisation on $\ell_p$-balls and bandits with graph feedback.", "full_text": "Connections Between Mirror Descent, Thompson Sampling and the Information Ratio

Julian Zimmert
DeepMind, London / University of Copenhagen
zimmert@di.ku.dk

Tor Lattimore
DeepMind, London
lattimore@google.com

Abstract

The information-theoretic analysis by Russo and Van Roy [25] in combination with minimax duality has proved a powerful tool for the analysis of online learning algorithms in full and partial information settings. In most applications there is a tantalising similarity to the classical analysis based on mirror descent.
We make a formal connection, showing that the information-theoretic bounds in most applications can be derived from existing techniques for online convex optimisation. Besides this, for k-armed adversarial bandits we provide an efficient algorithm with regret that matches the best information-theoretic upper bound, and we improve the best known regret guarantees for online linear optimisation on ℓ_p-balls and bandits with graph feedback.

1 Introduction

The combination of minimax duality and the information-theoretic machinery by Russo and Van Roy [25] has proved a powerful tool in the analysis of online learning algorithms. This has led to short and insightful analyses for k-armed bandits, linear bandits, convex bandits and partial monitoring, all improving on prior best known results. The downside is that the approach is non-constructive. The application of minimax duality demonstrates the existence of an algorithm with a given bound in the adversarial setting, but provides no way of constructing that algorithm.

The fundamental quantity in the information-theoretic analysis is the 'information ratio' in round t, which informally is

    information ratio_t = (expected regret in round t)² / (expected information gain in round t),

where the information gain is either measured using the mutual information [25] or a generalisation based on a Bregman divergence [21]. Proving the information ratio is small corresponds to showing that either the learner is suffering small regret in round t or gaining information, which ultimately leads to a bound on the cumulative regret. The aforementioned generalisation by Lattimore and Szepesvári [21] (restated in the supplementary) leads to a short analysis for k-armed adversarial bandits that is minimax optimal except for small constant factors.
The authors speculated that the new idea should lead to improved bounds for a range of online learning problems and suggested a number of applications, including bandits with graph feedback [3] and linear bandits on ℓ_p-balls [11]. We started to follow this plan, successfully improving existing minimax bounds for bandits with graph feedback and online linear optimisation for ℓ_p-balls with full information (the bandit setting remains a mystery). Along the way, however, we noticed a striking connection between the analysis techniques for bounding the information ratio and controlling the stability of online stochastic mirror descent (OSMD), which is a classical algorithm for online convex optimisation. A connection was already hypothesised by Lattimore and Szepesvári [21], who noticed a similarity between the bounds obtained. Notably, why does using the negentropy potential in the information-theoretic analysis lead to almost identical bounds for k-armed bandits as Exp3? Why does this continue to hold with the Tsallis entropy and the INF strategy [6]?

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Contribution  Our main contribution is a formal connection between the information-theoretic analysis and OSMD. Specifically, we show how tools for analysing OSMD can be applied to a modified version of Thompson sampling that uses the same sampling strategy as OSMD, but replaces the mirror descent update with a Bayesian update.
This contribution is valuable for several reasons: (a) it explains the similarity between the information-theoretic and OSMD style analysis, (b) it allows for the transfer of techniques for OSMD to Bayesian regret analysis and (c) it opens the possibility of a constructive transfer of ideas from Bayesian regret analysis to the adversarial framework, as we illustrate in the next contribution.

A curiosity in the Bayesian analysis of adversarial k-armed bandits is that the resulting bound was always a factor of 2 smaller than the corresponding bound for OSMD. This was true in the original analysis [25] and its generalisation [21]. Our new theorem entirely explains the difference and, indeed, allows us to improve the bounds for OSMD. This leads to an efficient algorithm for adversarial k-armed bandits with regret R_n ≤ √(2kn) + O(k), matching the information-theoretic upper bound except for small lower-order terms.

Finally, we improve the regret guarantees for two online learning problems. First, for bandits with graph feedback we improve the minimax regret in the 'easy' setting by a log(n) factor, matching the lower bound up to a factor of log^{3/2}(k). Second, for online linear optimisation over the ℓ_p-balls we improve existing bounds by arbitrarily large constant factors. At first we had proved these results using the information-theoretic tools and minimax duality, but here we present the unified view and consequently the analysis also applies to OSMD, for which we have efficient algorithms.

Related work  The information-theoretic Bayesian regret analysis was introduced by [24, 25, 26]. The focus in these papers is on the analysis of Bayesian algorithms in the stochastic setting, a line of work continued recently by [15].
[10] noticed that the stochastic assumption is not required and that the results continue to hold in a Bayesian adversarial setting where the prior is over arbitrary sequences of losses, rather than over (parametric) distributions as is usual in Bayesian statistics. The idea to use minimax duality to derive minimax regret bounds is due to [1] and has been applied and generalised by a number of authors [10, 17, 21, 9]. Mirror descent was developed by [22] and [23] for optimization. As far as we know its first application to bandits was by [2], which precipitated a flood of papers as summarised in the books by [8, 20]. We work in the partial monitoring framework, which goes back to [27]. Most of the focus since then has been on classifying the growth of the regret with the horizon for finite partial monitoring games [13, 16, 5, 7, 19]. Bandits with graph feedback are a special kind of partial monitoring problem and have been studied extensively [3, 14, 4, and others], with a monograph on the subject by [28]. Online linear optimisation is an enormous subject by itself. We refer the reader to the books by [12, 18].

Notation  The reader will find omitted proofs in the appendix. Let [n] = {1, 2, . . . , n} and B^d_p = {x ∈ R^d : ‖x‖_p ≤ 1} be the standard ℓ_p-ball. For positive definite A we write ‖x‖²_A = x⊤Ax. Given a topological space X, let int(X) be its interior and Δ(X) be the space of probability measures on X with the Borel σ-algebra. We write X° = {y ∈ R^d : sup_{x∈X} |⟨x, y⟩| ≤ 1} for the functional analyst's polar and co(X) for the convex hull of X. The domain of a convex function F : R^d → R ∪ {∞} is dom(F) = {x : F(x) < ∞}.
For x, y ∈ dom(F) the Bregman divergence between x and y with respect to F is D_F(x, y) = F(x) − F(y) − ∇_{x−y}F(y), where ∇_v F(x) is the directional derivative of F at x in the direction v. The diameter of X with respect to F is diam_F(X) = sup_{x,y∈X} F(x) − F(y). We abuse notation by writing ∇^{−2}F(x) = (∇²F(x))^{−1}. For x, y ∈ R^d we let [x, y] = co({x, y}) be the convex hull of x and y, which is the set of points on the chord between x and y.

Linear partial monitoring  Our results are most easily expressed in a linear version of the partial monitoring framework, which extends the standard adversarial linear bandit framework to general feedback structures. Let A be the action space and L the loss space, which are subsets of R^d with A compact. The convex hull of A is X = co(A). When A is finite we let k = |A|. The signal function is a known function Φ : A × L → Σ for some observation space Σ. An adversary and learner interact over n rounds. First the adversary secretly chooses (ℓ_t)_{t=1}^n with ℓ_t ∈ L for all t. In each round t the learner samples an action A_t ∈ A from a distribution depending on observations A_1, Φ_1, . . . , A_{t−1}, Φ_{t−1}, where Φ_s = Φ(A_s, ℓ_s) is the observation in round s. The regret of policy π in environment (ℓ_t)_{t=1}^n is

    R_n(π, (ℓ_t)_{t=1}^n) = max_{a∈A} E[ Σ_{t=1}^n ⟨A_t − a, ℓ_t⟩ ],

where the expectation is with respect to the randomness in the actions. The regret depends on a policy and the losses. The minimax regret is

    R*_n = inf_π sup_{(ℓ_t)_{t=1}^n} R_n(π, (ℓ_t)_{t=1}^n),

where the infimum is over all policies and the supremum over all loss sequences in L^n.
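As a concrete illustration of the Bregman divergence defined above, the following minimal sketch (our own illustration, not code from the paper) computes D_F for the unnormalised negentropy F(x) = Σ_i (x_i log(x_i) − x_i), which appears again later in Example 8, both directly from the definition and via its closed form as a generalised KL divergence.

```python
import math

def F(x):
    # Unnormalised negentropy potential.
    return sum(xi * math.log(xi) - xi for xi in x)

def grad_F(x):
    # Gradient of F: dF/dx_i = log(x_i).
    return [math.log(xi) for xi in x]

def bregman(x, y):
    # Definition: D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>.
    g = grad_F(y)
    return F(x) - F(y) - sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))

def bregman_closed_form(x, y):
    # For the negentropy this is sum_i x_i log(x_i / y_i) - x_i + y_i.
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

x = [0.2, 0.3, 0.5]
y = [0.5, 0.25, 0.25]
assert math.isclose(bregman(x, y), bregman_closed_form(x, y))
assert bregman(x, y) > 0      # F strictly convex, so D_F(x, y) > 0 for x != y
assert abs(bregman(y, y)) < 1e-12
```

For differentiable F the directional derivative in the definition reduces to the usual inner product with the gradient, which is what the code uses.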
From here on the dependence of R_n on the policy and loss sequence is omitted.

Examples  The standard k-armed bandit is recovered when A = {e_1, . . . , e_k}, L = [0, 1]^k and Φ(a, ℓ) = ⟨a, ℓ⟩ ∈ Σ = [0, 1]. For linear bandits the set A is an arbitrary compact set and L is typically A°. Bandits with graph feedback have a richer signal function, as we explain in Section 4.

Bayesian setting  In the Bayesian setting the sequence of losses (ℓ_t)_{t=1}^n is sampled from a known prior probability measure ν on L^n and subsequently the learner interacts with the sampled losses as normal. The optimal action is now a random variable A* = arg min_{a∈A} Σ_{t=1}^n ⟨a, ℓ_t⟩ and the Bayesian regret is

    BR_n = E[ Σ_{t=1}^n ⟨A_t − A*, ℓ_t⟩ ].

Finally, define P_t(·) = P(·|F_t) and E_t[·] = E[·|F_t] with F_t = σ(A_1, Φ_1, . . . , A_t, Φ_t), and Δ_t = ⟨A_t − A*, ℓ_t⟩. A crucial piece of notation is X_t = E_{t−1}[A_t] ∈ X, which is the conditional expected action played in round t.

2 Mirror descent, Thompson sampling and the information ratio

We now develop the connection between OSMD and the information-theoretic Bayesian regret analysis. Specifically we show that instances of OSMD can be transformed into an algorithm similar to Thompson sampling (TS) for which the Bayesian regret can be bounded in the same way as the regret of the original algorithm. The similarity to TS is important. Any instance of OSMD with a uniform bound on the adversarial regret enjoys the same bound on the Bayesian regret for any prior without modification. Our result has a different flavour because we prove a bound for a variant of OSMD that replaces the mirror descent update with a Bayesian update.

Algorithm 1: OSMD
  Input: A = (P, E, F) and η
  Initialize X_1 = arg min_{a∈X} F(a)
  for t = 1, . . . , n do
    Sample A_t ∼ P_{X_t} and observe Φ_t
    Construct: ℓ̂_t = E(X_t, A_t, Φ_t)
    Update: X_{t+1} = f_t(X_t, A_t)

OSMD is a modular algorithm that depends on defining three components: (1) a sampling scheme that determines how the algorithm explores, (2) a method for estimating the unobserved loss vectors, and (3) a convex 'potential' and learning rate that determine how the algorithm updates its iterates. The following definition makes this more precise.

Definition 1. An instance of OSMD is determined by a tuple A = (P, F, E) and learning rate η > 0 such that

(a) The sampling scheme is a collection P = {P_x : x ∈ X} of probability measures in Δ(A) such that E_{A∼P_x}[A] = x for all x ∈ X.

(b) The potential is a Legendre function F : R^d → R ∪ {∞} with dom(F) ∩ X ≠ ∅, and η > 0 is the learning rate.

(c) The estimation function is E : X × A × Σ → R^d, which we assume satisfies E_{A∼P_x}[E(x, A, Φ(A, ℓ))] = ℓ for all ℓ ∈ L and x ∈ X.

The assumptions on the mean of P_x and that E is unbiased are often relaxed in minor ways, but for simplicity we maintain the strict definition. For the remainder we fix A = (P, F, E) and η > 0 and abbreviate

    E_t(x, a) = E(x, a, Φ(a, ℓ_t))   and   ℓ̂_t = E(X_t, A_t, Φ_t).

You should think of E_t(x, a) as the estimated loss vector when the learner plays action a while sampling from P_x and of ℓ̂_t as the realisation of this estimate in round t. OSMD starts by initialising X_1 as the minimiser of F constrained to X.
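To make Definition 1 concrete, here is a minimal sketch (our own illustration, with hypothetical names) of the sampling scheme and estimation function for the k-armed bandit: P_x plays arm i with probability x_i and the estimation function is importance weighting. The check at the end verifies property (c), unbiasedness of E, exactly rather than by simulation.

```python
k = 3
x = [0.2, 0.3, 0.5]          # a point in the simplex, so P_x(e_i) = x_i
loss = [0.9, 0.4, 0.7]       # the hidden loss vector l_t

def signal(i, loss):
    # Bandit signal function: playing arm i reveals only loss[i].
    return loss[i]

def estimate(x, i, obs):
    # Importance-weighted estimation function E(x, a, sigma):
    # E(x, e_i, sigma)_j = sigma * 1(j == i) / x_j.
    return [obs / x[j] if j == i else 0.0 for j in range(len(x))]

# Exact unbiasedness check: sum_i P_x(e_i) * E(x, e_i, Phi(e_i, l)) = l.
mean = [0.0] * k
for i in range(k):
    est = estimate(x, i, signal(i, loss))
    mean = [m + x[i] * e for m, e in zip(mean, est)]

assert all(abs(m - l) < 1e-12 for m, l in zip(mean, loss))
```

The potential and learning rate, the third component of Definition 1, only enter through the update f_t, which is spelled out next.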
Subsequently it samples A_t ∼ P_{X_t} and updates

    X_{t+1} = arg min_{y∈X} η⟨y, ℓ̂_t⟩ + D_F(y, X_t).

A useful notation is to let (f_t)_{t=1}^n and (g_t)_{t=1}^n be sequences of functions from X × A to R^d with

    f_t(x, a) = arg min_{y∈X} (η⟨y, E_t(x, a)⟩ + D_F(y, x))   and
    g_t(x, a) = arg min_{y∈int(dom(F))} (η⟨y, E_t(x, a)⟩ + D_F(y, x)),

which means that X_{t+1} = f_t(X_t, A_t), while g_t is the same as f_t, but without the constraint to X. The complete algorithm is summarised in Algorithm 1. The next theorem is well known [20, §28].

Theorem 2 (OSMD REGRET BOUND). The regret of OSMD satisfies

    R_n ≤ diam_F(X)/η + (η/2) E[ Σ_{t=1}^n stab_t(X_t; η) ],

where

    stab_t(x; η) = (2/η) E_{A∼P_x}[ ⟨x − f_t(x, A), E_t(x, A)⟩ − D_F(f_t(x, A), x)/η ].

The random variable stab_t(X_t; η) measures the stability of the algorithm relative to the learning rate and is usually almost surely bounded. The diameter term depends on how fast the algorithm can move from the starting point to optimal, which is large when the learning rate is small. In this sense the learning rate is tuned to balance the stability of the algorithm and the requirement that (X_t) can tend towards an optimal point. Note that stab_t(x) depends on P, E, F, η and the loss vector ℓ_t, which means that in the Bayesian setting the stability function is random. The next lemma is also known and is often useful for bounding the stability function.

Lemma 3. Suppose that F is twice differentiable on int(dom(F)). Then

    stab_t(x; η) ≤ E_{A∼P_x}[ sup_{z∈[x, f_t(x,A)]} ‖E_t(x, A)‖²_{∇^{−2}F(z)} ].

Furthermore, provided that g_t(x, a) exists for all a in the support of P_x, then

    stab_t(x; η) ≤ E_{A∼P_x}[ sup_{z∈[x, g_t(x,A)]} ‖E_t(x, A)‖²_{∇^{−2}F(z)} ].

Bayesian analysis  Modified Thompson sampling (MTS) is a variant of TS summarised in Algorithm 2 that depends on a prior distribution ν and a sampling scheme P. The algorithm differs from Algorithm 1 in the computation of X_t. Rather than using the mirror descent update, it uses the Bayesian expected optimal action conditioned on the observations. Expectations in this subsection are with respect to both the prior and the actions, which means that (ℓ_t)_{t=1}^n are randomly distributed according to ν and consequently the functions f_t, g_t and stab_t are random. Our main theorem is the following bound on the Bayesian regret of MTS.

Algorithm 2: MTS
  Input: Prior ν and P
  Initialize X_1 = E[A*]
  for t = 1, . . . , n do
    Sample A_t ∼ P_{X_t} and observe Φ_t
    Update: X_{t+1} = E_t[A*]

Theorem 4. MTS satisfies

    BR_n ≤ diam_F(X)/η + (η/2) E[ Σ_{t=1}^n stab_t(X_t; η) ].

Remark 5. The stability function depends on A = (P, F, E) and η, while Algorithm 2 only uses P. In this sense Theorem 4 shows that MTS satisfies the given bound for all E, F and η. MTS is the same as TS when sampling from the posterior is the same as sampling from P_{X_t}. A fundamental case where this always holds is when A = {e_1, . . . , e_d}, because each x ∈ X is uniquely represented as a linear combination of elements of A and hence P_x is unique.

Proof of Theorem 4.
Beginning with the definition of the per-step regret,

    E_{t−1}[Δ_t] = ⟨X_t, E_{t−1}[ℓ_t]⟩ − E_{t−1}[⟨A*, ℓ_t⟩]
                 = ⟨X_t, E_{t−1}[ℓ̂_t]⟩ − E_{t−1}[⟨A*, ℓ̂_t⟩]                                               (1)
                 = ⟨X_t, E_{t−1}[ℓ̂_t]⟩ − E_{t−1}[⟨E_{t−1}[A* | A_t, Φ_t], ℓ̂_t⟩]                           (2)
                 = E_{t−1}[⟨X_t − X_{t+1}, ℓ̂_t⟩]                                                          (3)
                 ≤ E_{t−1}[⟨X_t − f_t(X_t, A_t), ℓ̂_t⟩ − (1/η) D_F(f_t(X_t, A_t), X_t) + (1/η) D_F(X_{t+1}, X_t)]   (4)
                 ≤ E_{t−1}[(η/2) stab_t(X_t; η) + (1/η) D_F(X_{t+1}, X_t)].                                (5)

Eq. (1) uses that the loss estimators are unbiased. Eq. (2) follows using the tower rule for conditional expectations and the fact that ℓ̂_t is a measurable function of X_t, A_t and Φ_t, so that

    E_{t−1}[⟨A*, ℓ̂_t⟩] = E_{t−1}[E_{t−1}[⟨A*, ℓ̂_t⟩ | A_t, Φ_t]] = E_{t−1}[⟨E_{t−1}[A* | A_t, Φ_t], ℓ̂_t⟩] = E_{t−1}[⟨X_{t+1}, ℓ̂_t⟩].

Eq. (3) uses the definition of X_{t+1}. Eq. (4) follows from the definition of f_t, which implies that

    ⟨f_t(X_t, A_t), ℓ̂_t⟩ + (1/η) D_F(f_t(X_t, A_t), X_t) ≤ ⟨X_{t+1}, ℓ̂_t⟩ + (1/η) D_F(X_{t+1}, X_t).

Finally, Eq. (5) follows from the definition of stab_t.
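The inequality used for Eq. (4) is just the statement that f_t minimises the mirror descent objective over X. For the negentropy potential on the simplex the constrained minimiser has the closed form of a multiplicative weights update, so the argmin property can be spot-checked numerically; the following is our own sketch, not code from the paper.

```python
import math, random

def bregman_negentropy(x, y):
    # D_F(x, y) for F(x) = sum_i x_i log x_i - x_i.
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

def objective(y, x, loss_hat, eta):
    # The mirror descent objective: eta * <y, l_hat> + D_F(y, x).
    return eta * sum(yi * li for yi, li in zip(y, loss_hat)) + bregman_negentropy(y, x)

def md_update(x, loss_hat, eta):
    # Closed-form constrained minimiser for the negentropy on the simplex:
    # multiplicative weights followed by normalisation.
    w = [xi * math.exp(-eta * li) for xi, li in zip(x, loss_hat)]
    z = sum(w)
    return [wi / z for wi in w]

random.seed(0)
x = [0.2, 0.3, 0.5]
loss_hat = [2.0, 0.0, 0.5]
eta = 0.3
y_star = md_update(x, loss_hat, eta)

# The update must beat every other feasible point, which is what Eq. (4) uses
# with X_{t+1} in the role of the competing feasible point.
for _ in range(1000):
    raw = [random.random() for _ in range(3)]
    y = [r / sum(raw) for r in raw]
    assert objective(y_star, x, loss_hat, eta) <= objective(y, x, loss_hat, eta) + 1e-12
```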
The proof is completed by summing over the per-step regret, noting that (X_t)_{t=1}^n is an (F_t)_t-adapted martingale and by [21, Theorem 3],

    E[ Σ_{t=1}^n D_F(X_{t+1}, X_t) ] ≤ E[F(X_{n+1})] − F(X_1) ≤ diam_F(X).

The stability coefficient  The only difference between Theorems 2 and 4 is the trajectory of (X_t)_{t=1}^n and the randomness of the stability function. In most analyses of OSMD the final bound is obtained via a uniform bound on stab_t(x; η) that holds regardless of the losses, and in this case the trajectory X_t is irrelevant. This is formalised in the following definition and corollary. Define the stability coefficients by

    stab(A; η) = sup_{x∈X} max_{t∈[n]} stab_t(x; η)   and   stab(A) = sup_{η>0} stab(A; η).

Corollary 6. The regret of Algorithm 1 for an appropriately tuned learning rate is bounded by

    R_n ≤ √(2 diam_F(X) stab(A) n).

The Bayesian regret of Algorithm 2 is bounded by BR_n ≤ √(2 diam_F(X) ess sup(stab(A)) n).

The essential supremum is needed because the stability coefficient depends on the losses (ℓ_t)_{t=1}^n, which are random in the Bayesian setting. Generally speaking, however, bounds on the stability coefficient are proven in a manner that is independent of the losses.

Remark 7. Often stab(A; η) ≤ a + bη for constants a, b ≥ 0 and stab(A) = ∞.
Nevertheless, the same argument shows that the regret of Algorithm 1 is bounded by

    R_n ≤ √(2a diam_F(X) n) + b diam_F(X)/a,

and similarly for the Bayesian regret of Algorithm 2.

Stability and the information ratio  The generalised information-theoretic analysis by [21] starts by assuming there exists a constant α > 0 such that the following bound on the information ratio holds almost surely:

    information ratio_t = E_{t−1}[Δ_t]² / E_{t−1}[D_F(X_{t+1}, X_t)] ≤ α.                      (6)

Then [21, Theorem 3] shows that

    BR_n ≤ √(α n diam_F(X)).                                                                  (7)

The proof of Theorem 4 directly provides a bound on the information ratio in terms of the stability coefficient. To see this, notice that Eq. (5) holds for all measurable η and let

    η = √(2 E_{t−1}[D_F(X_{t+1}, X_t)] / ess sup(stab(A))).                                   (8)

Then by Eq. (5) and the definition of stab(A) it follows that

    E_{t−1}[Δ_t]² / E_{t−1}[D_F(X_{t+1}, X_t)] ≤ 2 ess sup(stab(A))   a.s.

In other words, the usual methods for bounding the stability coefficient in the analysis of OSMD can be used to bound the information ratio in the information-theoretic analysis.

Example 8. To make the abstraction more concrete, consider the k-armed bandit problem where L = [0, 1]^k and A = {e_1, . . . , e_k}. In this case there is a unique sampling scheme, defined by P_x(a) = ⟨x, a⟩. The standard loss estimation function is importance-weighting, which leads to Eq. (9). A commonly used potential is the unnormalised negentropy F(x) = Σ_{i=1}^k (x_i log(x_i) − x_i), which satisfies ∇^{−2}F(x) = diag(x).
The importance-weighted estimator is

    E_t(x, a)_i = ℓ_{ti} 1(a = e_i) / x_i.                                                    (9)

The instance of OSMD resulting from these choices is called Exp3, for which an explicit form for X_t is well known:

    X_{ti} = exp(−η Σ_{s=1}^{t−1} ℓ̂_{si}) / Σ_{j=1}^k exp(−η Σ_{s=1}^{t−1} ℓ̂_{sj}).

A short calculation shows that g_t(x, a)_i = x_i exp(−η ℓ̂_{ti}) ≤ x_i. The stability function is bounded using the second part of Lemma 3 by

    stab_t(x; η) ≤ E_{A∼P_x}[ sup_{z∈[x, g_t(x,A)]} ‖E_t(x, A)‖²_{∇^{−2}F(z)} ]
                = E_{A∼P_x}[ sup_{z∈[x, g_t(x,A)]} Σ_{i=1}^k z_i 1(A = e_i) ℓ²_{ti} / x_i² ]
                ≤ E_{A∼P_x}[ Σ_{i=1}^k 1(A = e_i) ℓ²_{ti} / x_i ]
                = Σ_{i=1}^k ℓ²_{ti} ≤ k.

Finally, the diameter of the probability simplex X with respect to the unnormalised negentropy is diam_F(X) = log(k). Applying Theorem 2 shows that the regret of OSMD and the Bayesian regret of MTS satisfy

    R_n ≤ √(2nk log(k))   (OSMD)   and   BR_n ≤ √(2nk log(k))   (MTS).

Remark 9. Theorems 2 and 4 are vacuous when diam_F(X) = ∞. The most straightforward resolution is to restrict X_t to a subset of X on which the diameter is bounded and then control the additive error. This idea also works in the Bayesian setting as described by [21]. We omit a detailed discussion to avoid technicalities.

3 Bandits

The best known bound on the minimax regret for k-armed bandits is R_n ≤ √(2kn) by [21]. They let F(x) = −2 Σ_{i=1}^k √(x_i) be the 1/2-Tsallis entropy and prove that

    E_{t−1}[Δ_t]² / E_{t−1}[D_F(X_{t+1}, X_t)] ≤ √k.

By Cauchy-Schwarz diam_F(X) ≤ 2√k, and then Eq. (7) shows that BR_n ≤ √(2nk) for all priors ν. Minimax duality is used to conclude that R*_n ≤ √(2kn). Meanwhile, using the importance-weighted estimator in Eq. (9) leads to a bound on the stability coefficient of stab(A) ≤ 2√k, and then Theorem 2 yields a bound of R_n ≤ √(8nk).

The discrepancy between these methods is entirely explained by the naive choice of importance-weighted estimator. The approach based on bounding the information ratio is effectively shifting the losses, which can be achieved in the OSMD framework by shifting the importance-weighted estimators (see Fig. 1). This idea reduces the worst-case variance of the importance-weighted estimators by a factor of 4.

Lemma 10. If the loss estimator in Example 8 with F(x) = −2 Σ_{i=1}^k √(x_i) is replaced by

    E_t(x, a)_i = (ℓ_{ti} − c_{ti}) 1(a = e_i) / x_i + c_{ti},   where c_{ti} = (1/2)(1 − 1(X_{ti} < η²)),

then the stability coefficient for any η ≤ 1/2 is bounded by stab(A; η) ≤ k^{1/2}/2 + 12kη.

Theorem 11. The regret of OSMD with the loss estimator of Lemma 10 and an appropriately tuned learning rate satisfies R_n ≤ √(2kn) + 48k.

[Figure 1: Comparison of INF with and without shifted loss estimators. The x-axis is the number of time-steps and the y-axis the empirical regret estimation. η is tuned to the horizon and all experiments use Bernoulli losses with E[ℓ_t] = (0.45, 0.55, . . . , 0.55)⊤ (k = 5). We repeat the experiment 100 times with error bars indicating three standard deviations. The empirical result matches our theoretical improvement of a factor 2.]

4 Bandits with graph feedback

In bandits with graph feedback the action set is A = {e_1, . . . , e_k} and L = [0, 1]^k.
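The variance reduction behind Lemma 10 is easy to verify directly: both the plain importance-weighted estimator of Eq. (9) and its shifted version are unbiased, but shifting by c = 1/2 shrinks the conditional variance of coordinate i from ℓ_{ti}²(1/x_i − 1) to (ℓ_{ti} − 1/2)²(1/x_i − 1), a factor of 4 for losses near 1. A small exact computation (our own sketch; for simplicity it uses a constant shift and ignores the η²-threshold in c_{ti}):

```python
k = 3
x = [0.5, 0.3, 0.2]       # sampling distribution P_x over the arms
loss = [1.0, 0.9, 0.8]

def moments(i, c):
    # Exact mean and variance of the shifted estimator's i-th coordinate,
    # est_i = (loss_i - c) * 1(A = e_i) / x_i + c, under A ~ P_x.
    values = [((loss[i] - c) / x[i] + c, x[i]),   # arm i was played
              (c, 1.0 - x[i])]                    # some other arm was played
    mean = sum(v * p for v, p in values)
    var = sum((v - mean) ** 2 * p for v, p in values)
    return mean, var

for i in range(k):
    m_plain, v_plain = moments(i, 0.0)   # Eq. (9)
    m_shift, v_shift = moments(i, 0.5)   # Lemma 10 style shift
    assert abs(m_plain - loss[i]) < 1e-12 and abs(m_shift - loss[i]) < 1e-12
    assert v_shift <= v_plain            # shifting never hurts for losses in [1/2, 1]
```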
Let E ⊆ [k] × [k] be a set of directed edges over the vertex set [k] so that G = ([k], E) is a directed graph. The signal function is Φ(e_i, ℓ) = {(j, ℓ_j) : j ∈ N(i)}. The standard bandit framework is recovered when E = {(i, i) : i ∈ [k]}, while the full information setup corresponds to E = [k] × [k]. Of course there are settings between and beyond these extremes. The difficulty of the graph feedback problem is determined by the connectivity of the graph. For example, when E = ∅, the learner has no way to estimate the losses and the regret is linear in the worst case. Like finite partial monitoring, graph feedback problems can be classified into one of four regimes for which:

    R*_n ∈ {O(1), Θ̃(n^{1/2}), Θ(n^{2/3}), Ω(n)}.

Our focus is on graph feedback problems that fit in the second category, which is the most challenging to analyse.

Definition 12. G is called strongly observable if for every vertex i ∈ [k] at least one of the following holds: (a) i ∈ N(j) for all j ≠ i, or (b) i ∈ N(i).

Alon et al. [3] prove the minimax regret for bandits with graph feedback is Θ̃(n^{1/2}) if and only if k > 1 and G is strongly observable. They also prove the following theorem upper and lower bounding the dependence of the minimax regret on the horizon, the number of actions and a graph functional called the independence number.

Theorem 13 ([3]). Let G_ind be the independence number of G, which is the cardinality of the largest subset of vertices such that no two distinct vertices are connected by an edge. Suppose k > 1 and G is strongly observable. Then R*_n = O(√(G_ind n) log(kn)) and R*_n = Ω(√(G_ind n)).

The logarithmic dependence on n in the proof of Theorem 13 appears quite naturally, which raises the question of whether or not the upper or lower bound is tight. In fact, as n tends to infinity the upper bound in Theorem 13 could be improved to O(√(nk)) by using a finite-armed algorithm that ignores the feedback except for the played action. Perhaps the independence number is not as fundamental as first thought? The following theorem shows the upper bound can be improved.

Theorem 14. Let A = (P, E, F) be a triple defining OSMD with P_x(a) = ⟨a, x⟩ and

    F(x) = (1/(α(1 − α))) Σ_{i=1}^k x_i^α,   where α = 1 − 1/log(k).

Finally, define the unbiased loss estimation function E by

    E_t(x, a)_i = ℓ_{ti} 1(a ∈ N(i)) / Σ_{b∈N(i)} x_b   for i ∉ I_t,   and
    E_t(x, a)_i = (ℓ_{ti} − 1) 1(a ≠ i) / (1 − x_i) + 1   otherwise,

where I_t = {i ∈ [k] : i ∉ N(i) and X_{ti} > 1/2}. Then for any k ≥ 8 and an appropriately tuned learning rate the regret of OSMD with A satisfies R_n = O(√(G_ind n log³(k))).

5 Online linear optimisation over ℓ_p-balls

We now consider full information online linear optimisation on the ℓ_p-balls with p ∈ [1, 2], which is modelled in our framework by choosing A = B^d_p and L = B^d_q with 1/p + 1/q = 1 and Φ(a, ℓ) = ℓ. Table 1 summarises the known results.

    Table 1: Known results for ℓ_p-balls
    Algorithm     | p     | Regret
    Hedge         | p = 1 | √(n log(d))
    [12, §11.5]   | p > 1 | √(n/(p − 1))
    OGD [18]      | p ≥ 1 | √(d^{2/p−1} n)

When p = 1 the situation is unambiguous, with matching upper and lower bounds. For p ∈ (1, 2] there exist algorithms for which the regret is dimension free, but with constants that become arbitrarily large as p tends to 1. Known results for online gradient descent (OGD) prove the blowup in terms of p is avoidable, but at a price that is polynomial in the dimension.

Theorem 15. For any p ∈ [1, 2], let h be the following convex and twice continuously differentiable function:

    h(x) = ((2 − p)/(p − 1)) d^{(p−1)/(p−2)} |x| + (d/2) x²   if |x| ≤ d^{1/(p−2)},
    h(x) = |x|^p/(p(p − 1)) + ((2 − p)/(2p)) d^{p/(p−2)}      otherwise.

Then for OSMD using potential F(x) = Σ_{i=1}^d h(x_i), loss estimator E(x, a, σ) = σ, an arbitrary exploration scheme and an appropriately tuned learning rate,

    R_n = O(√(min{1/(p − 1), log(d)} n)).

Furthermore, the Bayesian regret of TS is bounded by the same quantity.

Remark 16. In the full information setting the loss estimation is independent of the action, which explains the arbitrariness of the exploration scheme. The intuitive justification for the slightly cryptic potential function is provided in the supplementary material.

6 Discussion

We demonstrated a connection between the information-theoretic analysis and OSMD. For k-armed bandits, we explained the factor of two difference between the regret analysis using information-theoretic and convex-analytic machinery and improved the bound for the latter. For graph bandits we improved the regret by a factor of log(n). Finally, we designed a new potential for which the regret for online linear optimisation over the ℓ_p-balls improves the previously best known bound by arbitrarily large constant factors.

Open problems  The main open problem is whether or not we can 'close the circle' and use the information-theoretic analysis to directly construct OSMD algorithms. Another direction is to try and relax the assumption that the loss is linear. The leading constant in the new bandit analysis now matches the best known information-theoretic bound [21].
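The two branches of h in Theorem 15 (as reconstructed above from the extracted fragments) are stitched together so that the value, first derivative and second derivative all agree at the threshold |x| = d^{1/(p−2)}. A quick numerical check of this matching, our own sketch rather than code from the paper:

```python
import math

d, p = 4, 1.5
x0 = d ** (1 / (p - 2))                            # threshold between the branches
b = (2 - p) / (p - 1) * d ** ((p - 1) / (p - 2))   # coefficient of |x|
kappa = (2 - p) / (2 * p) * d ** (p / (p - 2))     # additive constant

def inner(x):
    # Branch for |x| <= x0: quadratic with curvature d plus a linear term.
    return b * abs(x) + d / 2 * x * x

def outer(x):
    # Branch for |x| > x0: the |x|^p potential shifted by kappa.
    return abs(x) ** p / (p * (p - 1)) + kappa

# Value, first and second derivative all match at the threshold.
assert math.isclose(inner(x0), outer(x0))
assert math.isclose(b + d * x0, x0 ** (p - 1) / (p - 1))   # h' from both sides
assert math.isclose(d, x0 ** (p - 2))                      # h'' from both sides
```

The same algebra shows why the constant ((2 − p)/(2p)) d^{p/(p−2)} appears: it is exactly the offset needed for the two branches to meet continuously.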
There is still a constant lower-order term, which presently seems challenging to eliminate. In bandits with graph feedback one can ask whether the log(k) dependency can be improved. Lower bounds are still needed for ℓp-balls, and extending the idea to the bandit setting is an obvious follow-up. Finally, the best known algorithms for finite partial monitoring also use the information-theoretic machinery. Understanding how to borrow these ideas for OSMD remains a challenge.

References

[1] J. Abernethy, A. Agarwal, P. L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

[2] J. D. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory, pages 263–274. Omnipress, 2008.

[3] N. Alon, N. Cesa-Bianchi, O. Dekel, and T. Koren. Online learning with feedback graphs: Beyond bandits. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 23–35, Paris, France, 2015. PMLR.

[4] N. Alon, N. Cesa-Bianchi, C. Gentile, S. Mannor, Y. Mansour, and O. Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.

[5] A. Antos, G. Bartók, D. Pál, and Cs. Szepesvári. Toward a classification of finite partial-monitoring games. Theoretical Computer Science, 473:77–99, 2013.

[6] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Proceedings of Conference on Learning Theory (COLT), pages 217–226, 2009.

[7] G. Bartók, D. P. Foster, D. Pál, A. Rakhlin, and Cs. Szepesvári.
Partial monitoring—classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967–997, 2014.

[8] S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers Incorporated, 2012.

[9] S. Bubeck and M. Sellke. First-order regret analysis of Thompson sampling. arXiv preprint arXiv:1902.00681, 2019.

[10] S. Bubeck, O. Dekel, T. Koren, and Y. Peres. Bandit convex optimization: √T regret in one dimension. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 266–278, Paris, France, 2015. PMLR.

[11] S. Bubeck, M. Cohen, and Y. Li. Sparsity, variance and curvature in multi-armed bandits. In F. Janoos, M. Mohri, and K. Sridharan, editors, Proceedings of Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research, pages 111–127. PMLR, 2018.

[12] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[13] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31:562–580, 2006.

[14] A. Cohen, T. Hazan, and T. Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819, 2016.

[15] S. Dong and B. Van Roy. An information-theoretic analysis for Thompson sampling with many actions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4157–4165. Curran Associates, Inc., 2018.

[16] D. Foster and A. Rakhlin. No internal regret via neighborhood watch. In N. D. Lawrence and M.
Girolami, editors, Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 382–390, La Palma, Canary Islands, 2012. PMLR.

[17] N. Gravin, Y. Peres, and B. Sivan. Towards optimal algorithms for prediction with expert advice. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 528–547. SIAM, 2016.

[18] E. Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

[19] T. Lattimore and Cs. Szepesvári. Cleaning up the neighbourhood: A full classification for adversarial partial monitoring. In International Conference on Algorithmic Learning Theory, 2019.

[20] T. Lattimore and Cs. Szepesvári. Bandit Algorithms. Cambridge University Press (preprint), 2019.

[21] T. Lattimore and Cs. Szepesvári. An information-theoretic approach to minimax regret in partial monitoring. In Conference on Learning Theory, 2019.

[22] A. S. Nemirovsky. Efficient methods for large-scale convex optimization problems. Ekonomika i Matematicheskie Metody, 15, 1979.

[23] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

[24] D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1583–1591. Curran Associates, Inc., 2014.

[25] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

[26] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(1):2442–2471, 2016.

[27] A.
Rustichini. Minimizing regret: The general case. Games and Economic Behavior, 29(1):224–243, 1999.

[28] M. Valko. Bandits on graphs and structures, 2016.