{"title": "Adaptation to Easy Data in Prediction with Limited Advice", "book": "Advances in Neural Information Processing Systems", "page_first": 2909, "page_last": 2918, "abstract": "We derive an online learning algorithm with improved regret guarantees for ``easy'' loss sequences. We consider two types of ``easiness'': (a) stochastic loss sequences and (b) adversarial loss sequences with small effective range of the losses. While a number of algorithms have been proposed for exploiting small effective range in the full information setting, Gerchinovitz and Lattimore [2016] have shown the impossibility of regret scaling with the effective range of the losses in the bandit setting. We show that just one additional observation per round is sufficient to circumvent the impossibility result. The proposed Second Order Difference Adjustments (SODA) algorithm requires no prior knowledge of the effective range of the losses, $\\varepsilon$, and achieves an $O(\\varepsilon \\sqrt{KT \\ln K}) + \\tilde{O}(\\varepsilon K \\sqrt[4]{T})$ expected regret guarantee, where $T$ is the time horizon and $K$ is the number of actions. The scaling with the effective loss range is achieved under significantly weaker assumptions than those made by Cesa-Bianchi and Shamir [2018] in an earlier attempt to circumvent the impossibility result. We also provide a regret lower bound of $\\Omega(\\varepsilon\\sqrt{T K})$, which almost matches the upper bound. In addition, we show that in the stochastic setting SODA achieves an $O\\left(\\sum_{a:\\Delta_a>0} \\frac{K\\varepsilon^2}{\\Delta_a}\\right)$ pseudo-regret bound that holds simultaneously with the adversarial regret guarantee. 
In other words, SODA is safe against an unrestricted oblivious adversary and provides improved regret guarantees for at least two different types of ``easiness'' simultaneously.", "full_text": "Adaptation to Easy Data in Prediction with Limited Advice\n\nTobias Sommer Thune\nDepartment of Computer Science\nUniversity of Copenhagen\ntobias.thune@di.ku.dk\n\nYevgeny Seldin\nDepartment of Computer Science\nUniversity of Copenhagen\nseldin@di.ku.dk\n\nAbstract\n\nWe derive an online learning algorithm with improved regret guarantees for \u201ceasy\u201d\nloss sequences. We consider two types of \u201ceasiness\u201d: (a) stochastic loss sequences\nand (b) adversarial loss sequences with small effective range of the losses. While\na number of algorithms have been proposed for exploiting small effective range\nin the full information setting, Gerchinovitz and Lattimore [2016] have shown\nthe impossibility of regret scaling with the effective range of the losses in the\nbandit setting. We show that just one additional observation per round is sufficient\nto circumvent the impossibility result. The proposed Second Order Difference\nAdjustments (SODA) algorithm requires no prior knowledge of the effective range\nof the losses, $\\varepsilon$, and achieves an $O(\\varepsilon \\sqrt{KT \\ln K}) + \\tilde{O}(\\varepsilon K \\sqrt[4]{T})$ expected regret\nguarantee, where $T$ is the time horizon and $K$ is the number of actions. The scaling\nwith the effective loss range is achieved under significantly weaker assumptions\nthan those made by Cesa-Bianchi and Shamir [2018] in an earlier attempt to\ncircumvent the impossibility result. We also provide a regret lower bound of\n$\\Omega(\\varepsilon\\sqrt{TK})$, which almost matches the upper bound. In addition, we show that in\nthe stochastic setting SODA achieves an $O\\left(\\sum_{a:\\Delta_a>0} \\frac{K\\varepsilon^2}{\\Delta_a}\\right)$ pseudo-regret bound\nthat holds simultaneously with the adversarial regret guarantee. 
In other words,\nSODA is safe against an unrestricted oblivious adversary and provides improved\nregret guarantees for at least two different types of \u201ceasiness\u201d simultaneously.\n\n1 Introduction\n\nOnline learning algorithms with both worst-case regret guarantees and refined guarantees for \u201ceasy\u201d\nloss sequences have come into research focus in recent years. In our work we consider prediction\nwith limited advice games [Seldin et al., 2014], which are an interpolation between full information\ngames [Vovk, 1990, Littlestone and Warmuth, 1994, Cesa-Bianchi and Lugosi, 2006] and games with\nlimited (a.k.a. bandit) feedback [Auer et al., 2002b, Bubeck and Cesa-Bianchi, 2012].$^1$ In prediction\nwith limited advice the learner faces $K$ unobserved sequences of losses $\\{\\ell_t^a\\}_{t,a}$, where $a$ indexes\nthe sequence number and $t$ indexes the elements within the $a$-th sequence. At each round $t$ of the\ngame the learner picks a sequence $A_t \\in \\{1, \\dots, K\\}$ and suffers the loss $\\ell_t^{A_t}$, which is then observed.\nAfter that, the learner is allowed to observe the losses of $M$ additional sequences in the same round $t$,\nwhere $0 \\le M \\le K - 1$. For $M = K - 1$ the setting is equivalent to a full information game and for\n$M = 0$ it becomes a bandit game.\n\n$^1$There exists an orthogonal interpolation between full information and bandit games through the use of\nfeedback graphs [Alon et al., 2017], which is different and incomparable with prediction with limited advice, see\nSeldin et al. 
[2014] for a discussion.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFor a practical motivation behind prediction with limited advice imagine that the loss sequences\ncorrespond to losses of $K$ different algorithms for solving some problem, or $K$ different parametrizations\nof one algorithm, or $K$ different experts. If we had the opportunity we would execute all\nthe algorithms or query all the experts before making a prediction. This would correspond to a full\ninformation game. But in reality we may be constrained by time, computational power, or monetary\nbudget. In such a case we are forced to select algorithms or experts to query. Being able to query just\none expert or algorithm per prediction round corresponds to a bandit game, but we may have time\nor money to get a bit more, even though not all of it. This is the setting modeled by prediction with\nlimited advice.\nOur goal is to derive an algorithm for prediction with limited advice that is robust in the worst case and\nprovides improved regret guarantees in \u201ceasy\u201d cases. There are multiple ways to define \u201ceasiness\u201d of\nloss sequences. Among them are loss sequences generated by i.i.d. sources, as in the classical stochastic\nbandit model [Robbins, 1952, Lai and Robbins, 1985, Auer et al., 2002a], and adversarial sequences\nwith bounded effective range of the losses within each round [Cesa-Bianchi et al., 2007]. For the\nformer a simple calculation shows that in the full information setting the basic Hedge algorithm\n[Vovk, 1990, Littlestone and Warmuth, 1994] achieves an improved \u201cconstant\u201d (independent of time\nhorizon) pseudo-regret guarantee without sacrificing the worst-case guarantee. 
Much more work is\nrequired to achieve adaptation to this form of easiness in the bandit setting if we want to keep the\nadversarial regret guarantee simultaneously [Bubeck and Slivkins, 2012, Seldin and Slivkins, 2014,\nAuer and Chiang, 2016, Seldin and Lugosi, 2017, Wei and Luo, 2018, Zimmert and Seldin, 2018].\nAn algorithm that adapts to the second form of easiness in the full information setting was first\nproposed by Cesa-Bianchi et al. [2007] and a number of variations have followed [Gaillard et al.,\n2014, Koolen and van Erven, 2015, Luo and Schapire, 2015, Wintenberger, 2017]. However, a recent\nresult by Gerchinovitz and Lattimore [2016] has shown that such adaptation is impossible in the\nbandit setting. Cesa-Bianchi and Shamir [2018] proposed a way to circumvent the impossibility result\nby either assuming that the ranges of the individual losses are provided to the algorithm in advance or\nassuming that the losses are smooth and an \u201canchor\u201d loss of one additional arm is provided to the\nalgorithm. The latter assumption has so far only led to a substantial improvement when the \u201canchor\u201d\nloss is always the smallest loss in the corresponding round.\nWe consider adaptation to both types of easiness in prediction with limited advice. We show that\n$M = 1$ (just one additional observation per round) is sufficient to circumvent the impossibility result\nof Gerchinovitz and Lattimore [2016]. This assumption is weaker than the assumptions in\nCesa-Bianchi and Shamir [2018]. We propose an algorithm that achieves improved regret guarantees\nboth when the effective loss range is small and when the losses are stochastic (generated i.i.d.). The\nalgorithm is inspired by the BOA algorithm of Wintenberger [2017], but instead of working with\nexponential weights of the cumulative losses and their second moment corrections it uses estimates\nof the loss differences. 
The algorithm achieves an $O(\\varepsilon \\sqrt{KT \\ln K}) + \\tilde{O}(\\varepsilon K \\sqrt[4]{T})$ expected regret\nguarantee with no prior knowledge of the effective loss range $\\varepsilon$ or time horizon $T$. We also provide\na regret lower bound of $\\Omega(\\varepsilon\\sqrt{KT})$, which matches the upper bound up to logarithmic terms and\nsmaller order factors. Furthermore, we show that in the stochastic setting the algorithm achieves an\n$O\\left(\\sum_{a:\\Delta_a>0} \\frac{K\\varepsilon^2}{\\Delta_a}\\right)$ pseudo-regret guarantee. The improvement in the stochastic setting is achieved\nwithout compromising the adversarial regret guarantee.\nThe paper is structured in the following way. In Section 2 we lay out the problem setting. In Section 3\nwe present the algorithm and in Section 4 the main results about the algorithm. Proofs of the main\nresults are presented in Section 5.\n\n2 Problem Setting\n\nWe consider sequential games defined by $K$ infinite sequences of losses $\\{\\ell_1^a, \\ell_2^a, \\dots\\}_{a\\in\\{1,\\dots,K\\}}$,\nwhere $\\ell_t^a \\in [0, 1]$ for all $a$ and $t$. At each round $t \\in \\{1, 2, \\dots\\}$ of the game the learner selects an\naction (a.k.a. \u201carm\u201d) $A_t \\in [K] := \\{1, \\dots, K\\}$ and then suffers and observes the corresponding loss\n$\\ell_t^{A_t}$. Additionally, the learner is allowed to choose a second arm, $B_t$, and observe $\\ell_t^{B_t}$. The loss of\nthe second arm, $\\ell_t^{B_t}$, is not suffered by the learner. (This is analogous to the full information setting,\nwhere the losses of all arms $a \\neq A_t$ are observed, but not suffered). 
It is assumed that $\\ell_t^{B_t}$ is observed\nafter $A_t$ has been selected, but other relative timing of events within a round is unimportant for our\nanalysis.\nThe performance of the learner up to round $T$ is measured by expected regret defined as\n\n$$R_T := \\mathbb{E}\\left[\\sum_{t=1}^T \\ell_t^{A_t}\\right] - \\min_{a\\in[K]} \\mathbb{E}\\left[\\sum_{t=1}^T \\ell_t^a\\right], \\tag{1}$$\n\nwhere the expectation is taken with respect to potential randomization of the loss generation process\nand potential randomization of the algorithm. We note that in the adversarial setting the losses\nare considered deterministic and the second expectation can be omitted, whereas in the stochastic\nsetting the definition coincides with the definition of pseudo-regret [Bubeck and Cesa-Bianchi, 2012,\nSeldin and Lugosi, 2017]. In some literature $R_T$ is termed excess of cumulative predictive risk\n[Wintenberger, 2017].\nBelow we define adversarial and stochastic loss generation models and effective range of loss\nsequences.\n\nAdversarial losses\n\nIn the adversarial setting the loss sequences are selected arbitrarily by an adversary. We restrict\nourselves to the oblivious model, where the losses are fixed before the start of the game and do not\ndepend on the actions of the learner.\n\nStochastic losses\n\nIn the stochastic setting the losses are drawn i.i.d., so that $\\mathbb{E}[\\ell_t^a] = \\mu_a$ independently of $t$. Since we\nhave a finite number of arms, there exists a best arm $a^\\star$ (not necessarily unique) such that $\\mu_{a^\\star} \\le \\mu_a$\nfor all $a$. 
We further define the suboptimality gaps by\n\n$$\\Delta_a := \\mu_a - \\mu_{a^\\star} \\ge 0.$$\n\nIn the stochastic setting the expected regret can be rewritten as\n\n$$R_T = \\sum_{a\\in[K]:\\Delta_a>0} \\Delta_a\\, \\mathbb{E}\\left[\\sum_{t=1}^T \\mathbf{1}(A_t = a)\\right], \\tag{2}$$\n\nwhere $\\mathbf{1}$ is the indicator function.\n\nEffective loss range\n\nFor both the adversarial and stochastic losses, we define the effective loss range as the smallest\nnumber $\\varepsilon$, such that for all $t \\in [T]$ and $a, a' \\in [K]$:\n\n$$|\\ell_t^a - \\ell_t^{a'}| \\le \\varepsilon \\quad \\text{almost surely.} \\tag{3}$$\n\nSince we have assumed that $\\ell_t^a \\in [0, 1]$, we have $\\varepsilon \\le 1$, where $\\varepsilon = 1$ corresponds to an unrestricted\nsetting.\n\n3 Algorithm\n\nWe introduce the Second Order Difference Adjustments (SODA) algorithm, summarized in\nAlgorithm 1. SODA belongs to the general class of exponential weights algorithms. The algorithm\nhas two important distinctions from the common members of this class. First, it uses cumulative\nloss difference estimators instead of cumulative loss estimators for the exponential weights updates.\nInstantaneous loss difference estimators at round $t$ are defined by\n\n$$\\widetilde{\\Delta\\ell}_t^a = (K - 1)\\,\\mathbf{1}(B_t = a)\\left(\\ell_t^{B_t} - \\ell_t^{A_t}\\right). \\tag{4}$$\n\nSODA samples the \u201csecondary\u201d action $B_t$ (the additional observation) uniformly from $K - 1$ arms,\nall except $A_t$, and the $(K - 1)$ term above corresponds to importance weighting with respect to the\nsampling of $B_t$. The loss difference estimators scale with the effective range of the losses and they\ncan be positive and negative. Both of these properties are distinct from the traditional loss estimators.\nThe second difference is that we are using a second order adjustment in the weighting inspired by\nWintenberger [2017]. 
We define the cumulative loss difference estimator and its second moment by\n\n$$D_t(a) := \\sum_{s=1}^t \\widetilde{\\Delta\\ell}_s^a, \\qquad S_t(a) := \\sum_{s=1}^t \\left(\\widetilde{\\Delta\\ell}_s^a\\right)^2. \\tag{5}$$\n\nWe then have the distribution $p_t$ for selecting the primary action $A_t$ defined by\n\n$$p_t^a = \\frac{\\exp\\left(-\\eta_t D_{t-1}(a) - \\eta_t^2 S_{t-1}(a)\\right)}{\\sum_{a'=1}^K \\exp\\left(-\\eta_t D_{t-1}(a') - \\eta_t^2 S_{t-1}(a')\\right)}, \\tag{6}$$\n\nwhere $\\eta_t$ is a learning rate scheme, defined as\n\n$$\\eta_t = \\min\\left\\{\\sqrt{\\frac{\\ln K}{\\max_a S_{t-1}(a) + (K - 1)^2}},\\ \\frac{1}{2(K - 1)}\\right\\}. \\tag{7}$$\n\nThe learning rate satisfies $\\eta_t \\le 1/(2\\varepsilon(K - 1))$ for all $t$, which is required for the subsequent analysis.\nThe algorithm is summarized below:\n\nInitialize $p_1 \\leftarrow (1/K, \\dots, 1/K)$.\nfor $t = 1, 2, \\dots$ do\n    Draw $A_t$ according to $p_t$;\n    Draw $B_t$ uniformly at random from the remaining actions $[K] \\setminus \\{A_t\\}$;\n    Observe $\\ell_t^{A_t}$, $\\ell_t^{B_t}$ and suffer $\\ell_t^{A_t}$;\n    Construct $\\widetilde{\\Delta\\ell}_t^a$ by equation (4);\n    Update $D_t(a)$, $S_t(a)$ by (5);\n    Define $p_{t+1}$ by (6);\nend\n\nAlgorithm 1: Second Order Difference Adjustments (SODA)\n\n4 Main Results\n\nWe are now ready to present the regret bounds for SODA. We start with regret upper and lower\nbounds in the adversarial regime and then show that the algorithm simultaneously achieves an improved\nregret guarantee in the stochastic regime.\n\n4.1 Regret Upper Bound in the Adversarial Regime\n\nFirst we provide an upper bound for the expected regret of SODA against oblivious adversaries that\nproduce loss sequences with effective loss range bounded by $\\varepsilon$. Note that this result does not depend\non prior knowledge of the effective loss range $\\varepsilon$ or time horizon $T$.\nTheorem 1. 
The expected regret of SODA against an oblivious adversary satisfies\n\n$$R_T \\le 4\\varepsilon\\sqrt{(K - 1)\\ln K}\\,\\sqrt{T + (K - 1)\\sqrt{T}\\left(2 + \\sqrt{\\ln\\left(\\sqrt{T}(K - 1)\\right)/2}\\right)} + 4(K - 1)\\ln K.$$\n\nA proof of this theorem is provided in Section 5.1.$^2$ The upper bound scales as $O(\\varepsilon \\sqrt{KT \\ln K}) + \\tilde{O}(\\varepsilon K \\sqrt[4]{T})$,\nwhich nearly matches the lower bound provided below.\n\n$^2$It is straightforward to extend the analysis to time-varying ranges, $\\varepsilon_t : |\\ell_t^a - \\ell_t^{a'}| \\le \\varepsilon_t$ for all $a, a'$ a.s.,\nwhich leads to an $O\\left(\\sqrt{\\sum_{t=1}^T \\varepsilon_t^2\\, K \\ln K}\\right) + \\tilde{O}\\left(K \\sqrt[4]{\\sum_{t=1}^T \\varepsilon_t^2}\\right)$ regret bound. For the sake of clarity we\nrestrict the presentation to a constant $\\varepsilon$.\n\n4.2 Regret Lower Bound in the Adversarial Regime\n\nWe show that in the worst case the regret must scale linearly with the effective loss range $\\varepsilon$.\nTheorem 2. In prediction with limited advice with $M = 1$ (one additional observation per round or,\nequivalently, two observations per round in total), for loss sequences with effective loss range $\\varepsilon$, we\nhave for $T \\ge 3K/32$:\n\n$$\\inf \\sup R_T \\ge 0.02\\,\\varepsilon\\sqrt{KT},$$\n\nwhere the infimum is with respect to the choices of the algorithm and the supremum is over all\noblivious loss sequences with effective loss range bounded by $\\varepsilon$.\n\nThe theorem is proven by adaptation of the $\\Omega(\\sqrt{KT})$ lower bound by Seldin et al. [2014] for\nprediction with limited advice with unrestricted losses in $[0, 1]$ and one extra observation. We provide\nthe proof in Appendix A. Note that the upper bound in Theorem 1 matches the lower bound up to logarithmic\nterms and lower order additive factors. 
In particular, changing the selection strategy for the second\narm, $B_t$, from uniform to anything more sophisticated is not expected to yield significant benefits in\nthe adversarial regime.\n\n4.3 Regret Upper Bound in the Stochastic Regime\n\nFinally, we show that SODA enjoys constant expected regret in the stochastic regime. This is achieved\nwithout sacrificing the adversarial regret guarantee.\nTheorem 3. The expected regret of SODA applied to stochastic loss sequences with gaps $\\Delta_a$ satisfies\n\n$$R_T \\le \\sum_{a:\\Delta_a>0} \\left[\\left(\\frac{8K}{\\ln K} + 16\\right)\\frac{\\varepsilon^2}{\\Delta_a} + 4K + \\frac{\\Delta_a}{K}\\right]. \\tag{8}$$\n\nA brief sketch of a proof of this theorem is given in Section 5.2, with the complete proof provided in\nAppendix C.\nNote that $\\varepsilon$ is the effective range of realizations of the losses, whereas the gaps $\\Delta_a$ are based on\nthe expected losses. Naturally, $\\Delta_a \\le \\varepsilon$. For example, if the losses are Bernoulli then the range is\n$\\varepsilon = 1$, but the gaps are based on the distances between the biases of the Bernoulli variables. When\nthe losses are not in $\\{0, 1\\}$, but confined to a smaller range $\\varepsilon$, Theorem 3 yields a tighter regret bound.\nThe scaling of the regret bound in $K$ is suboptimal and it is currently unknown whether it could be\nimproved without compromising the worst-case guarantee. Perhaps changing the selection strategy\nfor $B_t$ could help here. We leave this improvement for future work.\nTo summarize, SODA achieves an adversarial regret guarantee that scales with the effective loss range\nand almost matches the lower bound and simultaneously has an improved regret guarantee in the\nstochastic regime.\n\n5 Proofs\n\nThis section contains the proof of Theorem 1 and a proof sketch for Theorem 3. 
The proof of\nTheorem 2 is provided in Appendix A.\n\n5.1 Proof of Theorem 1\n\nThe proof of the theorem is prefaced by two lemmas, but first we show some properties of the loss\ndifference estimators. We use $\\mathbb{E}_{B_t}$ to denote expectation with respect to selection of $B_t$ conditioned\non all random outcomes prior to this selection. For oblivious adversaries, the expected cumulative\nloss difference estimators are equal to the negative expected regret against the corresponding arm $a$:\n\n$$\\mathbb{E}\\left[\\sum_{t=1}^T \\widetilde{\\Delta\\ell}_t^a\\right] = \\mathbb{E}\\left[\\sum_{t=1}^T \\mathbb{E}_{B_t}\\left[\\widetilde{\\Delta\\ell}_t^a\\right]\\right] = \\mathbb{E}\\left[\\sum_{t=1}^T \\left(\\ell_t^a - \\ell_t^{A_t}\\right)\\right] = \\sum_{t=1}^T \\ell_t^a - \\mathbb{E}\\left[\\sum_{t=1}^T \\ell_t^{A_t}\\right] =: -R_T^a,$$\n\nwhere we have used the fact that $\\widetilde{\\Delta\\ell}_t^a$ is an unbiased estimate of $\\ell_t^a - \\ell_t^{A_t}$ due to importance weighting\nwith respect to the choice of $B_t$. Similarly, we have\n\n$$\\mathbb{E}\\left[\\sum_{t=1}^T \\left(\\widetilde{\\Delta\\ell}_t^a\\right)^2\\right] = (K - 1)\\, \\mathbb{E}\\left[\\sum_{t=1}^T \\left(\\ell_t^a - \\ell_t^{A_t}\\right)^2\\right]. \\tag{9}$$\n\nSimilar to the analysis of the anytime version of EXP3 in Bubeck and Cesa-Bianchi [2012], which\nbuilds on Auer et al. [2002b], we consider upper and lower bounds on the expectation of the\nincremental update. 
This is captured by the following lemma:\nLemma 1. With a learning rate scheme $\\eta_t$ for $t = 1, 2, \\dots$, where $\\eta_t \\le 1/(2\\varepsilon(K - 1))$, SODA fulfills:\n\n$$-\\sum_{t=1}^T \\widetilde{\\Delta\\ell}_t^a \\le \\frac{\\ln K}{\\eta_T} + \\eta_T \\sum_{t=1}^T \\left(\\widetilde{\\Delta\\ell}_t^a\\right)^2 - \\sum_{t=1}^T \\mathbb{E}_{a\\sim p_t}\\left[\\widetilde{\\Delta\\ell}_t^a\\right] + \\sum_{t=1}^T \\left(\\Phi_t(\\eta_{t+1}) - \\Phi_t(\\eta_t)\\right) \\tag{10}$$\n\nfor all $a$, where we define the potential\n\n$$\\Phi_t(\\eta) := \\frac{1}{\\eta} \\ln\\left(\\frac{1}{K} \\sum_{a=1}^K \\exp\\left(-\\eta D_t(a) - \\eta^2 S_t(a)\\right)\\right). \\tag{11}$$\n\nNote that unlike in the analysis of EXP3, here the learning rates $\\eta_t$ do not have to be non-increasing.\nA proof of this lemma is based on modification of standard arguments and is found in Appendix B.1.\nThe second lemma is a technical one and is proven in Appendix B.2.\nLemma 2. Let $\\sigma_t$ with $t \\in \\mathbb{N}$ be an increasing positive sequence with bounded differences such that\n$\\sigma_t - \\sigma_{t-1} \\le c$ for a finite constant $c$. Let further $\\sigma_0 = 0$. Then\n\n$$\\sum_{t=1}^T \\sigma_t \\left(\\frac{1}{\\sqrt{\\sigma_{t-1} + c}} - \\frac{1}{\\sqrt{\\sigma_t + c}}\\right) \\le 2\\sqrt{\\sigma_{T-1} + c}.$$\n\nProof of Theorem 1 We apply Lemma 1, which leads to the following inequality for any learning\nrate scheme $\\eta_t$ for $t = 1, 2, \\dots$
, where $\\eta_t \\le 1/(2\\varepsilon(K - 1))$:\n\n$$-\\sum_{t=1}^T \\widetilde{\\Delta\\ell}_t^a \\le \\underbrace{\\frac{\\ln K}{\\eta_T}}_{\\text{1st}} + \\underbrace{\\eta_T \\sum_{t=1}^T \\left(\\widetilde{\\Delta\\ell}_t^a\\right)^2}_{\\text{2nd}} - \\underbrace{\\sum_{t=1}^T \\mathbb{E}_{a\\sim p_t}\\left[\\widetilde{\\Delta\\ell}_t^a\\right]}_{\\text{3rd}} + \\underbrace{\\sum_{t=1}^T \\left(\\Phi_t(\\eta_{t+1}) - \\Phi_t(\\eta_t)\\right)}_{\\text{4th}}. \\tag{12}$$\n\nNote that in expectation, the left hand side of (12) is the regret against arm $a$. We are thus interested\nin bounding the expectation of the terms on the right hand side, where we note that the third term\nvanishes in expectation. We first consider the case where $\\eta_t = \\sqrt{\\ln K/(\\max_a S_{t-1}(a) + (K - 1)^2)}$,\npostponing the initial value for now.\nThe first term becomes:\n\n$$\\frac{\\ln K}{\\eta_T} = \\sqrt{\\ln K}\\sqrt{\\max_a S_{T-1}(a) + (K - 1)^2}. \\tag{13}$$\n\nThe second term becomes:\n\n$$\\eta_T S_T(a) = \\frac{\\sqrt{\\ln K}}{\\sqrt{\\max_a S_{T-1}(a) + (K - 1)^2}}\\, S_T(a) \\le \\sqrt{\\ln K}\\sqrt{\\max_a S_{T-1}(a) + (K - 1)^2}, \\tag{14}$$\n\nwhere we use that $S_t(a) \\le S_{t-1}(a) + (K - 1)^2$ for all $t$ by design.\nFinally, for the fourth term in equation (12), we need to consider the potential differences. Unlike in\nthe anytime analysis of EXP3, where this term is negative [Bubeck and Cesa-Bianchi, 2012], in our\ncase it turns out to be related to the second moment of the loss difference estimators. 
We let\n\n$$q_t^\\eta(a) = \\frac{\\exp\\left(-\\eta D_t(a) - \\eta^2 S_t(a)\\right)}{\\sum_{a'=1}^K \\exp\\left(-\\eta D_t(a') - \\eta^2 S_t(a')\\right)} \\tag{15}$$\n\ndenote the exponential update using the loss estimators up to $t$, but with a free learning rate $\\eta$. We\nfurther suppress some indices for readability, such that $D_a = D_t(a)$ and $S_a = S_t(a)$ in the following.\nWe have\n\n$$\\begin{aligned}\\Phi_t'(\\eta) &= -\\frac{1}{\\eta^2} \\ln\\left(\\frac{1}{K}\\sum_a \\exp\\left(-\\eta D_a - \\eta^2 S_a\\right)\\right) + \\frac{1}{\\eta}\\,\\frac{\\sum_a \\exp\\left(-\\eta D_a - \\eta^2 S_a\\right)\\cdot(-D_a - 2\\eta S_a)}{\\sum_a \\exp\\left(-\\eta D_a - \\eta^2 S_a\\right)} \\\\ &= \\frac{1}{\\eta^2 \\sum_a \\exp\\left(-\\eta D_a - \\eta^2 S_a\\right)} \\sum_a \\exp\\left(-\\eta D_a - \\eta^2 S_a\\right)\\cdot\\left(-\\eta D_a - 2\\eta^2 S_a - \\ln\\left(\\frac{1}{K}\\sum_{a'} \\exp\\left(-\\eta D_{a'} - \\eta^2 S_{a'}\\right)\\right)\\right).\\end{aligned}$$\n\nBy using $-\\eta D_a - 2\\eta^2 S_a = \\ln\\left(\\exp(-\\eta D_a - \\eta^2 S_a)\\exp(-\\eta^2 S_a)\\right)$ the above becomes\n\n$$\\Phi_t'(\\eta) = \\frac{1}{\\eta^2}\\, \\mathbb{E}_{a\\sim q_t^\\eta}\\left[\\ln\\left(\\frac{q_t^\\eta(a)}{1/K}\\exp(-\\eta^2 S_a)\\right)\\right] = \\frac{1}{\\eta^2}\\, \\mathrm{KL}\\left(q_t^\\eta \\,\\|\\, 1/K\\right) - \\mathbb{E}_{a\\sim q_t^\\eta}\\left[S_t(a)\\right], \\tag{16}$$\n\nwhere we have used that $1/K$ is the pmf of the uniform distribution over $K$ arms. 
Since the\nKL-divergence is always non-negative, we can bound the potential differences as\n\n$$\\begin{aligned}\\Phi_t(\\eta_{t+1}) - \\Phi_t(\\eta_t) &= -\\int_{\\eta_{t+1}}^{\\eta_t} \\Phi_t'(\\eta)\\,d\\eta \\le \\int_{\\eta_{t+1}}^{\\eta_t} \\mathbb{E}_{a\\sim q_t^\\eta}\\left[S_t(a)\\right] d\\eta \\le \\int_{\\eta_{t+1}}^{\\eta_t} \\max_a S_t(a)\\,d\\eta \\\\ &= \\sqrt{\\ln K}\\,\\max_a S_t(a)\\left(\\frac{1}{\\sqrt{\\max_a S_{t-1}(a) + (K - 1)^2}} - \\frac{1}{\\sqrt{\\max_a S_t(a) + (K - 1)^2}}\\right).\\end{aligned}$$\n\nBy Lemma 2 we then have\n\n$$\\sum_{t=1}^T \\left(\\Phi_t(\\eta_{t+1}) - \\Phi_t(\\eta_t)\\right) \\le 2\\sqrt{\\ln K}\\sqrt{\\max_a S_{T-1}(a) + (K - 1)^2}. \\tag{17}$$\n\nCollecting the terms (13), (14) and (17) and noting that these bounds hold for all $a$, by taking\nexpectations and using Jensen\u2019s inequality we get\n\n$$R_T \\le \\mathbb{E}\\left[4\\sqrt{\\ln K}\\sqrt{\\max_a S_{T-1}(a) + (K - 1)^2}\\right] \\le 4\\sqrt{\\ln K}\\sqrt{\\mathbb{E}\\left[\\max_a S_{T-1}(a)\\right] + (K - 1)^2}. \\tag{18}$$\n\nThe remainder of the proof is to bound this inner expectation:\n\n$$\\mathbb{E}\\left[\\max_a S_{T-1}(a)\\right] \\le (K - 1)^2\\varepsilon^2\\, \\mathbb{E}\\left[\\max_a \\sum_{t=1}^{T-1} \\mathbf{1}[B_t = a]\\right].$$\n\nLet $Z_t^a = \\sum_{s=1}^t \\mathbf{1}[B_s = a]$ and note that $Z_{T-1}^a \\le T - 1$. 
We now consider a partitioning of the\nprobability for a cutoff $\\alpha > 0$:\n\n$$\\mathbb{E}\\left[\\max_a Z_{T-1}^a\\right] \\le \\alpha\\, \\mathbb{P}\\left\\{\\max_a Z_{T-1}^a \\le \\alpha\\right\\} + (T - 1)\\, \\mathbb{P}\\left\\{\\max_a Z_{T-1}^a > \\alpha\\right\\} \\le \\alpha + (T - 1)K\\, \\mathbb{P}\\left\\{Z_{T-1}^a > \\alpha\\right\\},$$\n\nusing a union bound for the final inequality. To continue we need to address the fact that the $B_t$\u2019s are\nnot independent. We can however note that $\\mathbb{P}\\{B_t = a\\} \\le (K - 1)^{-1}$ for all $t$ and $a$. By letting $x_t^a$\nbe Bernoulli with parameter $(K - 1)^{-1}$ and $X_T^a = \\sum_{t=1}^T x_t^a$ we then get\n\n$$\\mathbb{P}\\left\\{Z_{T-1}^a > \\alpha\\right\\} \\le \\mathbb{P}\\left\\{X_{T-1}^a > \\alpha\\right\\}. \\tag{19}$$\n\nIn the upper bound we can thus substitute $X_{T-1}^a$ for $Z_{T-1}^a$ and exploit the fact that the $x_t^a$\u2019s are\nindependent by construction. Note further that $\\mathbb{E}[X_{T-1}^a] = \\frac{T-1}{K-1}$, so by choosing $\\alpha = \\frac{T-1}{K-1} + \\delta$ for\n$\\delta > 0$, we obtain by Hoeffding\u2019s inequality:\n\n$$\\mathbb{E}\\left[\\max_a Z_{T-1}^a\\right] \\le \\frac{T - 1}{K - 1} + \\delta + (T - 1)K\\, \\mathbb{P}\\left\\{X_{T-1}^a - \\frac{T - 1}{K - 1} > \\delta\\right\\} \\le \\frac{T - 1}{K - 1} + \\delta + (T - 1)K \\exp\\left(-\\frac{2\\delta^2}{T - 1}\\right).$$\n\nWe now choose $\\delta = \\sqrt{\\frac{T}{2}\\ln\\left(\\sqrt{T}(K - 1)\\right)}$, which gives us\n\n$$\\mathbb{E}\\left[\\max_a Z_{T-1}^a\\right] \\le \\frac{T - 1}{K - 1} + \\sqrt{\\frac{T}{2}\\ln\\left(\\sqrt{T}(K - 1)\\right)} + 2\\sqrt{T}.$$\n\nInserting this in (18) gives us the desired bound.\nFor the case where the learning rate at $T$ is instead given by $1/(2(K - 1))$, implying $4(K - 1)^2 \\ln K \\ge \\max_a S_{T-1}(a) + (K - 1)^2$,\nthe first term is $\\frac{\\ln K}{\\eta_T} = 2(K - 1)\\ln K$, and the second term is\n\n$$\\eta_T S_T(a) = \\frac{1}{2(K - 1)} S_T(a) \\le \\frac{S_{T-1}(a) + (K - 1)^2}{2(K - 1)} \\le \\frac{4(K - 1)^2 \\ln K}{2(K - 1)} \\le 2(K - 1)\\ln K.$$\n\nSince the learning rate is constant the potential differences vanish, completing the proof. $\\square$\n\n5.2 Proof sketch of Theorem 3\n\nHere we present the key ideas used to prove Theorem 3. The complete proof is provided in\nAppendix C.\nRecall that the expected regret in the stochastic setting is given by (2), where $\\mathbb{E}[\\mathbf{1}(A_t = a)] = \\mathbb{E}[p_t^a]$.\nThus, we need to bound $\\mathbb{E}\\left[\\sum_t p_t^a\\right]$. The first step is to bound this as\n\n$$\\mathbb{E}[p_t^a] \\le \\sigma + \\mathbb{P}\\{p_t^a > \\sigma\\} \\le \\sigma + \\mathbb{P}\\left\\{K e^{-\\eta_t \\sum_{i=1}^{t-1} X_i} > \\sigma\\right\\} \\tag{20}$$\n\nfor a positive threshold $\\sigma$, where we show that $p_t^a \\le K e^{-\\eta_t \\sum_{i=1}^{t-1} X_i}$ for $X_i := \\widetilde{\\Delta\\ell}_i^a - \\widetilde{\\Delta\\ell}_i^{a^\\star}$. This\napproach is motivated by the fact that $\\mathbb{E}_{B_i}\\left[\\widetilde{\\Delta\\ell}_i^a - \\widetilde{\\Delta\\ell}_i^{a^\\star}\\right] \\propto \\Delta_a$, where the expectation is with respect\nto selection of $B_i$ and the loss generation, conditioned on all prior randomness.\nThe next step is to tune $\\sigma \\propto \\exp\\left(\\sum \\mathbb{E}_i[X_i]\\right)$, which allows us to bound the second term using\nAzuma\u2019s inequality and balance the two terms. Finally, this bound is summed over $t$ using a technical\nlemma for the limit of this sum.\n\n6 Discussion\n\nWe have presented the SODA algorithm for prediction with limited advice with two observations\nper round (the \u201cprimary\u201d observation of the loss of the action that was played and one additional\nobservation). We have shown that the algorithm adapts to two types of simplicity of loss sequences\nsimultaneously: (a) it provides improved regret guarantees for adversarial sequences with bounded\neffective range of the losses and (b) for stochastic loss sequences. 
In both cases the regret scales\nlinearly with the effective range and the knowledge of the range is not required. In the adversarial\ncase we achieve an $O(\\varepsilon \\sqrt{KT \\ln K}) + \\tilde{O}(\\varepsilon K \\sqrt[4]{T})$ regret guarantee and in the stochastic case we achieve\nan $O\\left(\\sum_{a:\\Delta_a>0} \\frac{K\\varepsilon^2}{\\Delta_a}\\right)$ regret guarantee. Our result demonstrates that just one extra observation per\nround is sufficient to circumvent the impossibility result of Gerchinovitz and Lattimore [2016] and\nsignificantly relaxes the assumptions made by Cesa-Bianchi and Shamir [2018] to achieve the same\ngoal.\nThere are a number of open questions and interesting directions for future research. One is to improve\nthe regret guarantee in the stochastic regime. Another is to extend the results to bandits with limited\nadvice in the spirit of Seldin et al. [2013], Kale [2014].\n\nAcknowledgments\n\nThe authors thank Julian Zimmert for valuable input and discussions.\n\nReferences\n\nNoga Alon, Nicol\u00f2 Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir.\nNonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing,\n46(6):1785\u20131826, 2017.\n\nPeter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic\nand adversarial bandits. In Proceedings of the International Conference on Computational Learning\nTheory (COLT), 2016.\n\nPeter Auer, Nicol\u00f2 Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit\nproblem. Machine Learning, 47, 2002a.\n\nPeter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed\nbandit problem. SIAM Journal on Computing, 32(1), 2002b.\n\nS\u00e9bastien Bubeck and Nicol\u00f2 Cesa-Bianchi. Regret analysis of stochastic and nonstochastic\nmulti-armed bandit problems. Foundations and Trends in Machine Learning, 5, 2012.\n\nS\u00e9bastien Bubeck and Aleksandrs Slivkins. The best of both worlds: stochastic and adversarial\nbandits. 
In Proceedings of the International Conference on Computational Learning Theory\n(COLT), 2012.\n\nNicol\u00f2 Cesa-Bianchi and G\u00e1bor Lugosi. Prediction, Learning, and Games. Cambridge University\nPress, 2006.\n\nNicol\u00f2 Cesa-Bianchi and Ohad Shamir. Bandit regret scaling with the effective loss range. In\nProceedings of the International Conference on Algorithmic Learning Theory (ALT), 2018.\n\nNicol\u00f2 Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction\nwith expert advice. Machine Learning, 66, 2007.\n\nPierre Gaillard, Gilles Stoltz, and Tim van Erven. A second-order bound with excess losses. In\nProceedings of the International Conference on Computational Learning Theory (COLT), 2014.\n\nS\u00e9bastien Gerchinovitz and Tor Lattimore. Refined lower bounds for adversarial bandits. In Advances\nin Neural Information Processing Systems (NIPS), 2016.\n\nSatyen Kale. Multiarmed bandits with limited expert advice. In Proceedings of the International\nConference on Computational Learning Theory (COLT), 2014.\n\nWouter M. Koolen and Tim van Erven. Second-order quantile methods for experts and combinatorial\ngames. In Proceedings of the International Conference on Computational Learning Theory (COLT),\n2015.\n\nTze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in\nApplied Mathematics, 6, 1985.\n\nNick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and\nComputation, 108, 1994.\n\nHaipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. In\nProceedings of the International Conference on Computational Learning Theory (COLT), 2015.\n\nHerbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American\nMathematical Society, 1952.\n\nYevgeny Seldin and G\u00e1bor Lugosi. 
An improved parametrization and analysis of the EXP3++\nalgorithm for stochastic and adversarial bandits. In Proceedings of the International Conference\non Computational Learning Theory (COLT), 2017.\n\nYevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial\nbandits. In Proceedings of the International Conference on Machine Learning (ICML), 2014.\n\nYevgeny Seldin, Koby Crammer, and Peter L. Bartlett. Open problem: Adversarial multiarmed\nbandits with limited advice. In Proceedings of the International Conference on Computational\nLearning Theory (COLT), 2013.\n\nYevgeny Seldin, Peter L. Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori. Prediction with\nlimited advice and multiarmed bandits with paid observations. In Proceedings of the International\nConference on Machine Learning (ICML), 2014.\n\nVladimir Vovk. Aggregating strategies. In Proceedings of the International Conference on Computational\nLearning Theory (COLT), 1990.\n\nChen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Proceedings of\nthe International Conference on Computational Learning Theory (COLT), 2018.\n\nOlivier Wintenberger. Optimal learning with Bernstein online aggregation. Machine Learning, 106,\n2017.\n\nJulian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits.\nTechnical report, https://arxiv.org/abs/1807.07623, 2018.", "award": [], "sourceid": 1521, "authors": [{"given_name": "Tobias Sommer", "family_name": "Thune", "institution": "University of Copenhagen"}, {"given_name": "Yevgeny", "family_name": "Seldin", "institution": "University of Copenhagen"}]}