{"title": "Learning the Learning Rate for Prediction with Expert Advice", "book": "Advances in Neural Information Processing Systems", "page_first": 2294, "page_last": 2302, "abstract": "Most standard algorithms for prediction with expert advice depend on a parameter called the learning rate. This learning rate needs to be large enough to fit the data well, but small enough to prevent overfitting. For the exponential weights algorithm, a sequence of prior work has established theoretical guarantees for higher and higher data-dependent tunings of the learning rate, which allow for increasingly aggressive learning. But in practice such theoretical tunings often still perform worse (as measured by their regret) than ad hoc tuning with an even higher learning rate. To close the gap between theory and practice we introduce an approach to learn the learning rate. Up to a factor that is at most (poly)logarithmic in the number of experts and the inverse of the learning rate, our method performs as well as if we would know the empirically best learning rate from a large range that includes both conservative small values and values that are much higher than those for which formal guarantees were previously available. Our method employs a grid of learning rates, yet runs in linear time regardless of the size of the grid.", "full_text": "Learning the Learning Rate for\nPrediction with Expert Advice\n\nWouter M. Koolen\n\nTim van Erven\n\nQueensland University of Technology and UC Berkeley\n\nLeiden University, the Netherlands\n\nwouter.koolen@qut.edu.au\n\ntim@timvanerven.nl\n\nLeiden University and Centrum Wiskunde & Informatica, the Netherlands\n\nPeter D. Gr\u00a8unwald\n\npdg@cwi.nl\n\nAbstract\n\nMost standard algorithms for prediction with expert advice depend on a parameter\ncalled the learning rate. This learning rate needs to be large enough to \ufb01t the data\nwell, but small enough to prevent over\ufb01tting. 
For the exponential weights algorithm, a sequence of prior work has established theoretical guarantees for higher and higher data-dependent tunings of the learning rate, which allow for increasingly aggressive learning. But in practice such theoretical tunings often still perform worse (as measured by their regret) than ad hoc tuning with an even higher learning rate. To close the gap between theory and practice we introduce an approach to learn the learning rate. Up to a factor that is at most (poly)logarithmic in the number of experts and the inverse of the learning rate, our method performs as well as if we would know the empirically best learning rate from a large range that includes both conservative small values and values that are much higher than those for which formal guarantees were previously available. Our method employs a grid of learning rates, yet runs in linear time regardless of the size of the grid.

1 Introduction

Consider a learner who in each round t = 1, 2, ... specifies a probability distribution w_t on K experts, before being told a vector ℓ_t ∈ [0, 1]^K with their losses and consequently incurring loss h_t := w_t · ℓ_t. Losses are summed up over trials and after T rounds the learner's cumulative loss H_T = Σ_{t=1}^T h_t is compared to the cumulative losses L^k_T = Σ_{t=1}^T ℓ^k_t of the experts k = 1, ..., K. This is essentially the framework of prediction with expert advice [1, 2], in particular the standard Hedge setting [3]. Ideally, the learner's predictions would not be much worse than those of the best expert, who has cumulative loss L*_T = min_k L^k_T, so that the regret R_T = H_T − L*_T is small.

Follow-the-Leader (FTL) is a natural strategy for the learner. In any round t, it predicts with a point mass on the expert k with minimum loss L^k_{t−1}, i.e. the expert that was best on the previous t − 1 rounds. 
However, in the standard game-theoretic analysis, the experts' losses are assumed to be generated by an adversary, and then the regret for FTL can grow linearly in T [4], which means that it is not learning. To do better, the predictions need to be less outspoken, which can be accomplished by replacing FTL's choice of the expert with minimal cumulative loss by the soft minimum w^k_t ∝ e^{−η L^k_{t−1}}, which is known as the exponential weights or Hedge algorithm [3]. Here η > 0 is a regularisation parameter that is called the learning rate. As η → ∞ the soft minimum approaches the exact minimum and exponential weights converges to FTL. In contrast, the lower η, the more the soft minimum resembles a uniform distribution and the more conservative the learner.

Let R^η_T denote the regret for exponential weights with learning rate η. To obtain guarantees against adversarial losses, several tunings of η have been proposed in the literature. Most of these may be understood by starting with the bound

R^η_T ≤ ln(K)/η + Σ_{t=1}^T δ^η_t,  (1)

which holds for any sequence of losses. Here δ^η_t ≥ 0 is the approximation error (called mixability gap by [5]) when the loss of the learner in round t is approximated by the so-called mix loss, which is a certain η-exp-concave lower bound (see Section 2.1). The analysis then proceeds by giving an upper bound b_t(η) ≥ δ^η_t and choosing η to balance the two terms ln(K)/η and Σ_t b_t(η). 
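In code, the exponential weights update and its FTL limit can be sketched in a few lines of Python (a minimal illustration with our own variable names, not the implementation used in the paper):

```python
import math

def hedge_weights(cum_losses, eta):
    """Exponential weights: w^k proportional to exp(-eta * L^k_{t-1})."""
    m = min(cum_losses)  # subtract the minimum for numerical stability
    w = [math.exp(-eta * (L - m)) for L in cum_losses]
    s = sum(w)
    return [x / s for x in w]

L = [3.0, 1.0, 2.0]            # cumulative losses of K = 3 experts
print(hedge_weights(L, 0.01))  # small eta: close to uniform, conservative
print(hedge_weights(L, 50.0))  # large eta: point mass on the leader, i.e. FTL
```

Sweeping η between these extremes interpolates between a uniform distribution and Follow-the-Leader.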
In particular, the bound δ^η_t ≤ η/8 results in the most conservative tuning η = √(8 ln(K)/T), for which the regret is always bounded by O(√(T ln(K))); the same guarantee can still be achieved even if the horizon T is unknown in advance by using, for instance, the so-called doubling trick [4]. It is possible though to learn more aggressively by using a bound on δ^η_t that depends on the data. The first such improvement can be obtained by using δ^η_t ≤ e^η w_t · ℓ_t and choosing η = ln(1 + √(2 ln(K)/L*_T)) ≈ √(2 ln(K)/L*_T), where again the doubling trick can be used if L*_T is unknown in advance, which leads to a bound of O(√(L*_T ln(K)) + ln K) [6, 4]. Since L*_T ≤ T this is never worse than the conservative tuning, and it can be better if the best expert has very small losses (a case sometimes called the "low noise condition"). A further improvement has been proposed by Cesa-Bianchi et al. [7], who bound δ^η_t by a constant times the variance v^η_t of ℓ^k_t when k is distributed according to w_t, such that v^η_t = w_t · (ℓ_t − h_t)^2. Rather than using a constant learning rate, at time t they play the Hedge weights w_t based on a time-varying learning rate η_t that is approximately tuned as √(ln(K)/V_{t−1}) with V_t = Σ_{s≤t} v^{η_s}_s. This leads to a so-called second-order bound on the regret of the form

R_T = O(√(V_T ln(K)) + ln K),  (2)

which, as Cesa-Bianchi et al. show, implies

R_T = O(√(L*_T(T − L*_T)/T · ln(K)) + ln K),  (3)

and is therefore always better than the tuning in terms of L*_T (note though that (2) can be much stronger than (3) on data for which the exponential weights rapidly concentrate on a single expert, see also [8]). The general pattern that emerges is that the better the bound on δ^η_t, the higher η can be chosen and the more aggressive the learning. De Rooij et al. [5] take this approach to its extreme and do not bound δ^η_t at all. In their AdaHedge algorithm they tune η_t = ln(K)/Δ_{t−1}, where Δ_t = Σ_{s≤t} δ^{η_s}_s, which is very similar to the second-order tuning of Cesa-Bianchi et al. and indeed also satisfies (2) and (3). Thus, this sequence of prior works appears to have reached the limit of what is possible based on improving the bound on δ^η_t. Unfortunately, however, if the data are not adversarial, then even second-order bounds do not guarantee the best possible tuning of η for the data at hand. (See the experiments that study the influence of η in [5].) In practice, selecting η_t to be the best-performing learning rate so far (that is, running FTL at the meta-level) appears to work well [9], but this approach requires a computationally intensive grid search over learning rates [9] and formal guarantees can only be given for independent and identically distributed (IID) data [10]. A new technique based on speculatively trying out different η was therefore introduced in the FlipFlop algorithm [5]. 
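The AdaHedge tuning just described, η_t = ln(K)/Δ_{t−1} with Δ_t the running sum of mixability gaps, can be sketched as follows (an illustrative re-implementation under our own naming; the start-up case η = ∞ is approximated by a large finite cap, and the mix loss is the one defined formally in Section 2.1):

```python
import math

def adahedge(losses):
    """Sketch of AdaHedge-style tuning: eta_t = ln(K) / Delta_{t-1}, where
    Delta is the running sum of mixability gaps delta_t = h_t - m_t."""
    K = len(losses[0])
    L = [0.0] * K                 # cumulative expert losses
    Delta, H = 0.0, 0.0
    for ell in losses:
        eta = math.log(K) / Delta if Delta > 0 else 1e6   # cap instead of eta = inf
        m0 = min(L)
        w = [math.exp(-eta * (Lk - m0)) for Lk in L]
        s = sum(w); w = [x / s for x in w]
        h = sum(wk * lk for wk, lk in zip(w, ell))        # learner's dot-product loss
        # mix loss m_t = -(1/eta) ln sum_k w_k exp(-eta * ell_k)
        m = -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, ell))) / eta
        Delta += h - m                                    # mixability gap, always >= 0
        H += h
        L = [Lk + lk for Lk, lk in zip(L, ell)]
    return H - min(L)             # regret R_T

losses = [[0.0, 1.0], [1.0, 0.0]] * 50   # alternating losses for K = 2 experts
print(adahedge(losses))
```

On such balanced adversarial-looking data the resulting η_t stays small, which is exactly the conservative behaviour the worst-case bounds require.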
By alternating learning rates η_t = ∞ and η_t that are very similar to those of AdaHedge, FlipFlop is both able to satisfy the second-order bounds (2) and (3), and to guarantee that its regret is never much worse than the regret R^∞_T for Follow-the-Leader:

R_T = O(R^∞_T).  (4)

Thus FlipFlop covers two extremes: on the one hand it is able to compete with η that are small enough to deal with the worst case, and on the other hand it can compete with η = ∞ (FTL).

Main Contribution We generalise the FlipFlop approach to cover a large range of η in between. As before, let R^η_T denote the regret of exponential weights with fixed learning rate η. We introduce the learning the learning rate (LLR) algorithm, which satisfies (2), (3) and (4) and in addition guarantees a regret satisfying

R_T = O( ln(K) (ln(1/η))^{1+ε} R^η_T )  for all η ∈ [η^ah_{t*}, 1]  (5)

for any ε > 0. Thus, LLR performs almost as well as the learning rate η̂_T ∈ [η^ah_{t*}, 1] that is optimal with hindsight. Here the lower end-point η^ah_{t*} ≥ (1 − o(1))√(ln(K)/T) (as follows from (28) below) is a data-dependent value that is sufficiently conservative (i.e. small) to provide second-order guarantees and consequently worst-case optimality. The upper end-point 1 is an artefact of the analysis, which we introduce because, for general losses in [0, 1]^K, we do not have a guarantee in terms of R^η_T for 1 < η < ∞. 
For the special case of binary losses ℓ_t ∈ {0, 1}^K, however, we can say a bit more: as shown in Appendix B of the supplementary material, in this special case the LLR algorithm guarantees regret bounded by R_T = O(K R^η_T) for all η ∈ [1, ∞].

The additional factor ln(K) ln^{1+ε}(1/η) in (5) comes from a prior on an exponentially spaced grid of η. It is logarithmic in the number of experts K, and its dependence on 1/η grows slower than ln^{1+ε}(1/η) ≤ ln^{1+ε}(1/η^ah_{t*}) = O(ln^{1+ε}(T)) for any ε > 0. For the optimally tuned η̂_T, we have in mind regret that grows like R^{η̂_T}_T = O(T^α) for some α ∈ [0, 1/2], so an additional polylog factor seems a small price to pay to adapt to the right exponent α.

Although the learning rates η ≥ η^ah_{t*} appear to be most important, the regret for LLR can also be related to R^η_T for lower η:

R_T = O( ln(K)/η )  for all η < η^ah_{t*},  (6)

which is not in terms of R^η_T, but still improves on the standard bound (1) because δ^η_t ≥ 0 for all η.

The LLR algorithm takes two parameters, which determine the trade-off between constants in the bounds (2)-(6) above. Normally we would propose to set these parameters to moderate values, but if we do let them approach various limits, LLR becomes essentially the same as FlipFlop, AdaHedge or FTL (see Section 2).

We emphasise that we do not just have a bound on LLR that is unavailable for earlier methods; there also exist actual losses for which the optimal learning rate with hindsight η̂_T is fundamentally in between the robust learning rates chosen by AdaHedge and the aggressive choice η = ∞ of FTL. On such data, Hedge with fixed learning rate η̂_T performs significantly better than both these extremes; see Figure 1. 
In Appendix A we describe the data used to generate Figure 1 and explain why the regret obtained by LLR is significantly smaller than the regret of AdaHedge, FTL and all other tunings described above.

Computational Efficiency Although LLR employs a grid of η, it does not have to search over this grid. Instead, in each time step it only has to do computations for the single η that is active, and, as a consequence, it runs as fast as using exponential weights with a single fixed η, which is linear in K and T. LLR, as presented here, does store information about all the grid points, which requires O(ln(K) ln(T)) storage, but we describe a simple approximation that runs equally fast and only requires a constant amount of storage.

[Figure 1: Example data (details in Appendix A) on which Hedge/exponential weights with intermediate learning rate (global minimum) performs much better than both the worst-case optimal learning rate (local minimum on the left) and large learning rates (plateau on the right). We also show the performance of the algorithms mentioned in the introduction. The plot shows regret as a function of the learning rate η for the worst-case bound, Hedge(η), AdaHedge, FlipFlop, and LLR with η^ah_{t*}.]

Outline The paper is organized as follows. In Section 2 we define the LLR algorithm and in Section 3 we make precise how it satisfies (2), (3), (4), (5) and (6). Section 4 provides a discussion. Finally, the appendix contains a description of the data in Figure 1 and most of the proofs.

2 The Learning the Learning Rate Algorithm

In this section we describe the LLR algorithm, which is a particular strategy for choosing a time-varying learning rate in exponential weights. 
We start by formally describing the setting and then explain how LLR chooses its learning rates.

2.1 The Hedge Setting

At the start of each round t = 1, 2, ... the learner produces a probability distribution w_t = (w^1_t, ..., w^K_t) on K ≥ 2 experts. Then the experts incur losses ℓ_t = (ℓ^1_t, ..., ℓ^K_t) ∈ [0, 1]^K and the learner's loss h_t = w_t · ℓ_t = Σ_k w^k_t ℓ^k_t is the expected loss under w_t. After T rounds, the learner's cumulative loss is H_T = Σ_{t=1}^T h_t and the cumulative losses for the experts are L^k_T = Σ_{t=1}^T ℓ^k_t. The goal is to minimize the regret R_T = H_T − L*_T with respect to the cumulative loss L*_T = min_k L^k_T of the best expert. We consider strategies for the learner that play the exponential weights distribution

w^k_t = e^{−η_t L^k_{t−1}} / Σ_{j=1}^K e^{−η_t L^j_{t−1}}

for a choice of learning rate η_t that may depend on all losses before time t. To analyse such methods, it is common to approximate the learner's loss h_t by the mix loss m_t = −(1/η_t) ln Σ_k w^k_t e^{−η_t ℓ^k_t}, which appears under a variety of names in e.g. [7, 4, 11, 5]. The resulting approximation error or mixability gap δ_t = h_t − m_t is always non-negative and cannot exceed 1. This, and some other basic properties of the mix loss are listed in Lemma 1 of De Rooij et al. [5], which we reproduce as Lemma C.1 in the additional material.

As will be explained in the next section, LLR does not monitor the regrets of all learning rates directly. Instead, it tracks their cumulative mixability gaps, which provide a convenient lower bound on the regret that is monotonically increasing with the number of rounds T, in contrast to the regret itself. 
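The two basic properties quoted above, that the mixability gap is non-negative and at most 1 for losses in [0, 1], are easy to check numerically; a small sketch with our own helper name:

```python
import math, random

def mixability_gap(w, ell, eta):
    """delta = h - m, with h = w . ell and m = -(1/eta) ln sum_k w_k exp(-eta ell_k)."""
    h = sum(wk * lk for wk, lk in zip(w, ell))
    m = -math.log(sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, ell))) / eta
    return h - m

random.seed(0)
for _ in range(1000):
    K = random.randint(2, 5)
    raw = [random.random() + 1e-12 for _ in range(K)]
    w = [x / sum(raw) for x in raw]
    ell = [random.random() for _ in range(K)]
    eta = 10 ** random.uniform(-3, 2)
    d = mixability_gap(w, ell, eta)
    assert -1e-9 <= d <= 1 + 1e-9   # 0 <= delta_t <= 1 (cf. Lemma C.1)
print("mixability gap within [0, 1] on all samples")
```

Non-negativity is Jensen's inequality applied to the convex function x → e^{−ηx}; the upper bound of 1 follows because h_t ≤ 1 and m_t ≥ 0 for losses in [0, 1].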
To show this, let R^η_T denote the regret of the exponential weights strategy with fixed learning rate η_t = η, and similarly let M^η_T = Σ_{t=1}^T m^η_t and Δ^η_T = Σ_{t=1}^T δ^η_t denote its cumulative mix loss and mixability gap.

Lemma 2.1. For any fixed learning rate η ∈ (0, ∞], the regret of exponential weights satisfies

R^η_T ≥ Δ^η_T.  (7)

Proof. Apply property 3 in Lemma C.1 to the regret decomposition R^η_T = M^η_T − L*_T + Δ^η_T.

We will use the following notational conventions. Lower-case letters indicate instantaneous quantities like m_t, δ_t and w_t, whereas uppercase letters denote cumulative quantities like M_T, Δ_T and R_T. In the absence of a superscript the learning rates present in any such quantity are those chosen by LLR. In contrast, the superscript η refers to using the same fixed learning rate η throughout.

2.2 LLR's Choice of Learning Rate

The LLR algorithm is a member of the exponential weights family of algorithms. Its defining property is its adaptive and non-monotonic selection of the learning rate η_t, which is specified in Algorithm 1 and explained next. The LLR algorithm works in regimes in which it speculatively tries out different strategies for η_t. Almost all of these strategies consist of choosing a fixed η from the following grid:

η_1 = ∞,  η_i = α^{2−i}  for i = 2, 3, ...,  (8)

where the exponential base

α = 1 + 1/log_2(K)  (9)

is chosen to ensure that the grid is dense enough so that η_i for i ≥ 2 is representative for all η ∈ [η_{i+1}, η_i] (this is made precise in Lemma 3.3). We also include the special value η_1 = ∞, because it corresponds to FTL, which works well for IID data and data with a small number of leader changes, as discussed by De Rooij et al. [5].

Algorithm 1 LLR(π^ah, π^∞). The grid η_1, η_2, ... and weights π_1, π_2, ... are defined in (8) and (12).
  Initialise b_0 := 0; Δ^ah_0 := 0; Δ^i_0 := 0 for all i ≥ 1.
  for t = 1, 2, ... do
    if all active indices and ah are b_{t−1}-full then
      Increase b_t := φΔ^ah_{t−1}/π^ah (with φ as defined in (14))
    else
      Keep b_t := b_{t−1}
    end if
    Let i be the least non-b_t-full index.
    if i is active then
      Play η_i.
      Update Δ^i_t := Δ^i_{t−1} + δ^i_t. Keep Δ^j_t := Δ^j_{t−1} for j ≠ i and Δ^ah_t := Δ^ah_{t−1}.
    else
      Play η^ah_t as defined in (10).
      Update Δ^ah_t := Δ^ah_{t−1} + δ^ah_t. Keep Δ^j_t := Δ^j_{t−1} for all j ≥ 1.
    end if
  end for

For each index i = 1, 2, ... in the grid, let A^i_t ⊆ {1, ..., t} denote the set of rounds up to trial t in which the LLR algorithm plays η_i. Then LLR keeps track of the performance of η_i by storing the sum of mixability gaps δ^i_t ≡ δ^{η_i}_t for which η_i is responsible:

Δ^i_t = Σ_{s ∈ A^i_t} δ^i_s.

In addition to the grid in (8), LLR considers one more strategy, which we will call the AdaHedge strategy, because it is very similar to the learning rate chosen by the AdaHedge algorithm [5]. In the AdaHedge strategy, LLR plays η_t equal to

η^ah_t = ln(K)/Δ^ah_{t−1},  (10)

where Δ^ah_t = Σ_{s ∈ A^ah_t} δ^ah_s is the sum of mixability gaps δ^ah_t ≡ δ^{η^ah_t}_t during the rounds A^ah_t ⊆ {1, ..., t} in which LLR plays the AdaHedge strategy. The only difference to the original AdaHedge is that the latter sums the mixability gaps over all s ∈ {1, ..., t}, not just those in A^ah_t. Note that, in our variation, η^ah_t does not change during rounds outside A^ah_t.

The AdaHedge learning rate η^ah_t is non-increasing with t, and (as we will show in Theorem 3.6 below) it is small enough to guarantee the worst-case bound (3), which is optimal for adversarial data. We therefore focus on η > η^ah_t and call an index i in the grid active in round t if η_i > η^ah_t. Let i_max ≡ i_max(t) be the number of grid indices that are active at time t, such that η_{i_max(t)} ≈ η^ah_t. Then LLR cyclically alternates grid learning rates and the AdaHedge learning rate, in a way that approximately maintains

Δ^1_t/π_1 ≈ Δ^2_t/π_2 ≈ ... ≈ Δ^{i_max}_t/π_{i_max} ≈ Δ^ah_t/π^ah  for all t,  (11)

where π^ah > 0 and π_1, π_2, ... > 0 are fixed weights that control the relative importance of AdaHedge and the grid points (higher weight = more important). The LLR algorithm takes as parameters π^ah and π^∞, where π^ah only has to be positive, but π^∞ is restricted to (0, 1). We then choose

π_1 = π^∞,  π_i = (1 − π^∞)ρ(i − 1)  for i ≥ 2,  (12)

where ρ is a prior probability distribution on {1, 2, ...}. It follows that Σ_{i=1}^∞ π_i = 1, so that π_i may be interpreted as a prior probability mass on grid index i. For ρ, we require a distribution with very heavy tails (meaning ρ(i) not much smaller than 1/i), and we fix the convenient choice

ρ(i) = ∫_{(i−1)/ln K}^{i/ln K} dx/((x + e) ln^2(x + e)) = 1/ln((i−1)/ln K + e) − 1/ln(i/ln K + e).  (13)

We cannot guarantee that the invariant (11) holds exactly, and our algorithm incurs overhead for changing learning rates, so we do not want to change learning rates too often. LLR therefore uses an exponentially increasing budget b and tries grid indices and the AdaHedge strategy in sequence until they exhaust the budget. To make this precise, we say that an index i is b-full in round t if Δ^i_{t−1}/π_i > b and similarly that AdaHedge is b-full in round t if Δ^ah_{t−1}/π^ah > b. Let b_t be the budget at time t, which LLR chooses as follows: first it initialises b_0 = 0 and then, for t ≥ 1, it tests whether all active indices and AdaHedge are b_{t−1}-full. If this is the case, LLR approximately increases the budget by a factor φ > 1 by setting b_t = φΔ^ah_{t−1}/π^ah > φb_{t−1}, otherwise it just keeps the budget the same: b_t = b_{t−1}. In particular, we will fix budget multiplier

φ = 1 + √(π^ah),  (14)

which minimises the constants in our bounds. Now if, at time t, there exists an active index that is not b_t-full, then LLR plays the first such index. And if all active indices are b_t-full, LLR plays the AdaHedge strategy, which cannot be b_t-full in this case by definition of b_t. 
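The grid (8), the prior weights (12)-(13) and the b-full test can be combined into a small bookkeeping sketch (schematic only, with our own names; the per-round weight and mixability-gap updates from Section 2.1 are omitted, as is the budget increase by the factor φ):

```python
import math

def make_grid_and_prior(K, pi_inf, i_max):
    """Grid eta_i = alpha^(2-i) with eta_1 = infinity, and prior weights pi_i."""
    alpha = 1 + 1 / math.log2(K)                        # exponential base (9)
    eta = [math.inf] + [alpha ** (2 - i) for i in range(2, i_max + 1)]
    f = lambda x: 1 / math.log(x / math.log(K) + math.e)
    rho = lambda i: f(i - 1) - f(i)                     # heavy-tailed prior (13)
    pi = [pi_inf] + [(1 - pi_inf) * rho(i - 1) for i in range(2, i_max + 1)]
    return eta, pi

def next_strategy(Delta, pi, b, active):
    """Play the least active grid index that is not b-full (Delta_i / pi_i <= b);
    if all active indices are b-full, fall back to the AdaHedge strategy."""
    for i in active:
        if Delta[i] / pi[i] <= b:
            return ("grid", i)
    return ("ah", None)

eta, pi = make_grid_and_prior(K=4, pi_inf=0.5, i_max=6)
print(eta)  # eta_1 = inf, eta_2 = 1.0, then decreasing geometrically (alpha = 1.5)
print(next_strategy([0.2, 0.0], pi, b=0.3, active=[0, 1]))
```

Because only the currently played strategy advances its cumulative mixability gap, each round touches a single entry of Delta, which is what makes the per-round cost independent of the grid size.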
This guarantees that all ratios Δ^i_T/π_i are approximately within a factor φ of each other for all i that are active at time t*, which we define to be the last time t ≤ T that LLR plays AdaHedge:

t* = max A^ah_T.  (15)

Whenever LLR plays AdaHedge it is possible, however, that a new index i becomes active and it then takes a while for this index's cumulative mixability gap Δ^i_T to also grow up to the budget. Since AdaHedge is not played while the new index is catching up, the ratio guarantee always still holds for all indices that were active at time t*.

2.3 Choosing the LLR Parameters

LLR has several existing strategies as sub-cases. For π^ah → ∞ it essentially becomes AdaHedge. For π^∞ → 1 it becomes FlipFlop. For π^∞ → 1 and π^ah → 0 it becomes FTL. Intermediate values for π^ah and π^∞ retain the benefits of these algorithms, but in addition allow LLR to compete with essentially all learning rates ranging from worst-case safe to extremely aggressive.

2.4 Run time and storage

LLR, as presented here, runs in constant time per round. This is because, in each round, it only needs to compute the weights and update the corresponding cumulative mixability gap for a single learning rate strategy. If the current strategy exceeds its budget (becomes b_t-full), LLR proceeds to the next.¹ The memory requirement is dominated by the storage of Δ^1_t, ..., Δ^{i_max(t)}_t, which, following the discussion below (5), is at most

i_max(T) = 2 + ln(1/η_{i_max(T)})/ln α ≤ 2 + log_α(1/η^ah_T) = O(ln(K) ln(T)).

However, a minor approximation reduces the memory requirement down to a constant: At any point in time the grid strategies considered by LLR split in three. 
Let us say that η_i is played at time t. Then all preceding η_j for j ≤ i are already at (or slightly past) the budget. And all succeeding η_j for i < j ≤ i_max are still at (or slightly past) the previous budget. So we can approximate their cumulative mixability gaps by simply ignoring these slight overshoots. It then suffices to store only the cumulative mixability gap for the currently advancing η_i, and the current and previous budget.

¹In the early stages it may happen that the next strategy is already over the budget and needs to be skipped, but this start-up effect quickly disappears when the budget exceeds 1, as the weighted increment δ^i_t/π_i = O((η_i/8) ln^{1+ε}(1/η_i)) is bounded for all 0 ≤ η_i ≤ 1.

3 Analysis of the LLR algorithm

In this section we analyse the regret of LLR. We first show that for each loss sequence the regret is bounded in terms of the cumulative mixability gaps Δ^i_T and Δ^ah_T incurred by the active learning rates (Lemma 3.1). As LLR keeps the cumulative mixability gaps approximately balanced according to (11), we can then further bound the regret in terms of each of the individual learning rates in the grid (Lemma 3.2). The next step is to deal with learning rates between grid points, by showing that their cumulative mixability gap Δ^η_T relates to Δ^i_T for the nearest higher grid point η_i ≥ η (Lemma 3.3). In Lemma 3.4 we put all these steps together. As the cumulative mixability gap Δ^η_T does not exceed the regret R^η_T for fixed learning rates (Lemma 2.1), we can then derive the bounds (2) through (6) from the introduction in Theorems 3.5 and 3.6.

We start by showing that the regret of LLR is bounded by the cumulative mixability gaps of the learning rates that it plays. The proof, which appears in Section C.4, is a generalisation of Lemma 12 in [5]. 
It crucially uses the fact that the lowest learning rate played by LLR is the AdaHedge rate η^ah_t, which relates to Δ^ah_t via (10).

Lemma 3.1. On any sequence of losses, the regret of the LLR algorithm with parameters π^ah > 0 and π^∞ ∈ (0, 1) is bounded by

R_T ≤ (φ/(φ−1) + 2) Δ^ah_T + Σ_{i=1}^{i_max} Δ^i_T,

where i_max is the largest i such that η_i is active in round T and φ is defined in (14).

The LLR budgeting scheme keeps the cumulative mixability gaps from Lemma 3.1 approximately balanced according to (11). The next result, proved in Section C.5, makes this precise.

Lemma 3.2. Fix t* as in (15). Then for each index i that was active at time t* and arbitrary j ≠ i:

Δ^j_T ≤ φ(π_j/π_i)Δ^i_T + min{1, η_j/8},  (16a)
Δ^j_T ≤ φ(π_j/π^ah)Δ^ah_T + π_j/π^ah + min{1, η_j/8},  (16b)
Δ^ah_T ≤ (π^ah/π_i)Δ^i_T + 1.  (16c)

LLR employs an exponentially spaced grid of learning rates that are evaluated using, and played proportionally to, their cumulative mixability gaps. In the next step (which is restated and proved as Lemma C.7 in the additional material) we show that the mixability gap of a learning rate between grid-points cannot be much smaller than that of its next higher grid neighbour. This establishes in particular that an exponential grid is sufficiently fine.

Lemma 3.3. 
For γ ≥ 1 and for any sequence of losses with values in [0, 1]:

δ^{γη}_t ≤ γ e^{(γ−1)(ln K + η)} δ^η_t.

The preceding results now allow us to bound the regret of LLR in terms of the cumulative mixability gap of any fixed learning rate (which does not exceed its regret by Lemma 2.1) and in terms of the cumulative mixability gap of AdaHedge (which we will use to establish worst-case optimality).

Lemma 3.4. Suppose the losses take values in [0, 1], let π^ah > 0 and π^∞ ∈ (0, 1) be the parameters of the LLR algorithm, and abbreviate B = (φ/(φ−1) + 2)π^ah + φ. Then the regret of the LLR algorithm is bounded by

R_T ≤ Bα e^{(α−1)(ln K + 1)} Δ^η_T/π_{i(η)} + (α/(8(α−1)) + φ/π^ah + φ/(φ−1) + 3)

for all η ∈ [η^ah_{t*}, 1], where i(η) = 2 + ⌊log_α(1/η)⌋ is the index of the nearest grid point above η, and by

R_T ≤ B Δ^∞_T/π^∞ + (α/(8(α−1)) + φ/π^ah + φ/(φ−1) + 3)

for η = ∞. In addition

R_T ≤ B Δ^ah_T/π^ah + α/(8(α−1)) + 1,

and for any η < η^ah_{t*}

Δ^ah_T ≤ ln(K)/η + 1.

The proof appears in additional material Section C.6.

We are now ready for our main result, which is proved in Section C.7. It shows that LLR competes with the regret of any learning rate above the worst-case safe rate and below 1 modulo a mild factor. In addition, LLR also performs well on all data favoured by Follow-the-Leader.

Theorem 3.5. 
Suppose the losses take values in [0, 1], let π^ah > 0 and π^∞ ∈ (0, 1) be the parameters of the LLR algorithm, and introduce the constants B = 1 + 2√(π^ah) + 3π^ah and C_K = (log_2 K + 1)/8 + B/π^ah + 1. Then the regret of LLR is simultaneously bounded by

R_T ≤ (4Be/(1 − π^∞)) (log_2 K + 1) ln(7/η) ln^2(2 log_2(5/η)) R^η_T + C_K  for all η ∈ [η^ah_{t*}, 1],

where ln(7/η) ln^2(2 log_2(5/η)) = O(ln^{1+ε}(1/η)) for any ε > 0, and by

R_T ≤ (B/π^∞) R^∞_T + C_K  for η = ∞.

In addition

R_T ≤ (B/π^ah) ln(K)/η + C_K  for any η < η^ah_{t*}.

To interpret the theorem, we recall from the introduction that ln(1/η) is better than O(ln T) for all η ≥ η^ah_{t*}.

We finally show that LLR is robust to the worst case. We do this by showing something much stronger, namely that LLR guarantees a so-called second-order bound (a concept introduced in [7]). The bound is phrased in terms of the cumulative variance V_T = Σ_{t=1}^T v_t, where v_t = Var_{k∼w_t}[ℓ^k_t] is the variance of ℓ^k_t for k distributed according to w_t. See Section C.8 for the proof.

Theorem 3.6. Suppose the losses take values in [0, 1], let π^ah > 0 and π^∞ ∈ (0, 1) be the parameters of the LLR algorithm, and introduce the constants B = (φ/(φ−1) + 2)π^ah + φ and C_K = (log_2 K + 1)/8 + B/π^ah + 1. 
Then the regret of LLR is bounded by
\[
R_T \;\le\; \frac{B}{\pi^{\mathrm{ah}}} \sqrt{V_T \ln K} + \Bigl(C_K + \frac{2B \ln K}{3\pi^{\mathrm{ah}}}\Bigr)
\]
and consequently by
\[
R_T \;\le\; \frac{B}{\pi^{\mathrm{ah}}} \sqrt{\frac{L_T^*(T - L_T^*)}{T} \ln K} + 2\Bigl(C_K + \frac{2B \ln K}{3\pi^{\mathrm{ah}}} + \frac{B^2 \ln K}{(\pi^{\mathrm{ah}})^2}\Bigr).
\]

4 Discussion

We have shown that our new LLR algorithm is able to recover the same second-order bounds as previous methods, which guard against worst-case data by picking a small learning rate if necessary. What LLR adds is that, at the cost of a (poly)logarithmic overhead factor, it is also able to learn a range of higher learning rates η, which can potentially achieve much smaller regret (see Figure 1). This is accomplished by covering this range with a grid of sufficient granularity. The overhead factor depends on a prior on the grid, for which we have fixed a particular choice with a heavy tail. However, the algorithm would also work with any other prior, so if it were known a priori that certain values in the grid were of special importance, they could be given larger prior mass. Consequently, a more advanced analysis demonstrating that only a subset of learning rates could potentially be optimal (in the sense of minimizing the regret \(R_T^{\eta}\)) would directly lead to factors of improvement in the algorithm. Thus we raise the open question: what is the smallest subset E of learning rates such that, for any data, the minimum of the regret over this subset, \(\min_{\eta \in E} R_T^{\eta}\), is approximately the same as the minimum \(\min_{\eta} R_T^{\eta}\) over all, or a large range of, learning rates?

References

[1] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[2] V. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, 1998.

[3] Y. Freund and R. E. 
Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.

[4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[5] S. de Rooij, T. van Erven, P. D. Grünwald, and W. M. Koolen. Follow the leader if you can, Hedge if you must. Journal of Machine Learning Research, 15:1281–1316, 2014.

[6] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64:48–75, 2002.

[7] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2/3):321–352, 2007.

[8] T. van Erven, P. Grünwald, W. M. Koolen, and S. de Rooij. Adaptive hedge. In Advances in Neural Information Processing Systems 24 (NIPS), 2011.

[9] M. Devaine, P. Gaillard, Y. Goude, and G. Stoltz. Forecasting electricity consumption by aggregating specialized experts; a review of the sequential aggregation of specialized experts, with an application to Slovakian and French country-wide one-day-ahead (half-)hourly predictions. Machine Learning, 90(2):231–260, 2013.

[10] P. Grünwald. The safe Bayesian: learning the learning rate via the mixability gap. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory (ALT). Springer, 2012.

[11] V. Vovk. Competitive on-line statistics. International Statistical Review, 69:213–248, 2001.

[12] T. M. Cover and J. A. Thomas. Elements of Information Theory. 
Wiley, 1991.
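As a concrete illustration of the quantities appearing in the bounds of Section 3, the following is a minimal numerical sketch, not code from the paper: the function name `hedge_quantities` and the uniform-prior Hedge loop are our own illustration. It runs exponential weights at a fixed learning rate η on random losses, computes the mixability gap \(\delta_t^{\eta} = h_t - m_t\), its cumulative sum \(\Delta_T^{\eta}\), and the cumulative variance \(V_T = \sum_t \mathbb{V}_{k \sim w_t}[\ell_t^k]\), and spot-checks the gap-comparison inequality \(\delta_t^{\gamma\eta} \le \gamma e^{(\gamma-1)(\ln K + \eta)} \delta_t^{\eta}\) round by round.

```python
import math
import random

def hedge_quantities(losses, eta):
    """Run exponential weights (Hedge) with fixed learning rate eta on a
    T x K list of losses in [0, 1], starting from the uniform prior.
    Returns the per-round mixability gaps delta_t = h_t - m_t and the
    cumulative variance V_T = sum_t Var_{k ~ w_t}(loss_t^k)."""
    K = len(losses[0])
    L = [0.0] * K                       # cumulative expert losses
    deltas, V = [], 0.0
    for loss in losses:
        # weights proportional to exp(-eta * L_k), normalized
        w = [math.exp(-eta * Lk) for Lk in L]
        Z = sum(w)
        w = [x / Z for x in w]
        h = sum(wk * lk for wk, lk in zip(w, loss))   # learner's dot loss h_t
        m = -math.log(sum(wk * math.exp(-eta * lk)
                          for wk, lk in zip(w, loss))) / eta  # mix loss m_t
        deltas.append(h - m)            # mixability gap, nonnegative by Jensen
        V += sum(wk * (lk - h) ** 2 for wk, lk in zip(w, loss))  # v_t
        L = [Lk + lk for Lk, lk in zip(L, loss)]
    return deltas, V

random.seed(0)
K, T, eta, gamma = 4, 200, 0.5, 2.0
losses = [[random.random() for _ in range(K)] for _ in range(T)]

d_eta, V = hedge_quantities(losses, eta)          # gaps at rate eta
d_geta, _ = hedge_quantities(losses, gamma * eta)  # gaps at rate gamma * eta

# Spot-check delta_t^{gamma*eta} <= gamma * e^{(gamma-1)(ln K + eta)} * delta_t^{eta}
c = gamma * math.exp((gamma - 1) * (math.log(K) + eta))
assert all(dg <= c * d + 1e-9 for dg, d in zip(d_geta, d_eta))
print(f"Delta_T^eta = {sum(d_eta):.3f}, V_T = {V:.1f}")
```

Note how the gaps also obey the standard Hoeffding bound \(\delta_t^{\eta} \le \eta/8\) for losses in [0, 1], which is what makes small learning rates safe; the point of the paper's grid is to also exploit rates for which the gaps happen to be much smaller on the actual data.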