{"title": "Adaptive Hedge", "book": "Advances in Neural Information Processing Systems", "page_first": 1656, "page_last": 1664, "abstract": "Most methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods.", "full_text": "Adaptive Hedge\n\nTim van Erven\nDepartment of Mathematics\nVU University\nDe Boelelaan 1081a\n1081 HV Amsterdam, the Netherlands\ntim@timvanerven.nl\n\nWouter M. Koolen\nCWI and Department of Computer Science\nRoyal Holloway, University of London\nEgham Hill, Egham, Surrey\nTW20 0EX, United Kingdom\nwouter@cs.rhul.ac.uk\n\nPeter Grünwald\nCentrum Wiskunde & Informatica (CWI)\nScience Park 123, P.O. Box 94079\n1090 GB Amsterdam, the Netherlands\npdg@cwi.nl\n\nSteven de Rooij\nCentrum Wiskunde & Informatica (CWI)\nScience Park 123, P.O. Box 94079\n1090 GB Amsterdam, the Netherlands\ns.de.rooij@cwi.nl\n\nAbstract\n\nMost methods for decision-theoretic online learning are based on the Hedge algorithm, which takes a parameter called the learning rate. 
In most previous analyses the learning rate was carefully tuned to obtain optimal worst-case performance, leading to suboptimal performance on easy instances, for example when there exists an action that is significantly better than all others. We propose a new way of setting the learning rate, which adapts to the difficulty of the learning problem: in the worst case our procedure still guarantees optimal performance, but on easy instances it achieves much smaller regret. In particular, our adaptive method achieves constant regret in a probabilistic setting, when there exists an action that on average obtains strictly smaller loss than all other actions. We also provide a simulation study comparing our approach to existing methods.\n\n1 Introduction\n\nDecision-theoretic online learning (DTOL) is a framework to capture learning problems that proceed in rounds. It was introduced by Freund and Schapire [1] and is closely related to the paradigm of prediction with expert advice [2, 3, 4]. In DTOL an agent is given access to a fixed set of K actions, and at the start of each round must make a decision by assigning a probability to every action. Then all actions incur a loss from the range [0, 1], and the agent's loss is the expected loss of the actions under the probability distribution it produced. Losses add up over rounds and the goal for the agent is to minimize its regret after T rounds, which is the difference in accumulated loss between the agent and the action that has accumulated the least amount of loss.\n\nThe most commonly studied strategy for the agent is called the Hedge algorithm [1, 5]. Its performance crucially depends on a parameter η called the learning rate. Different ways of tuning the learning rate have been proposed, which all aim to minimize the regret for the worst possible sequence of losses the actions might incur. 
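To make the protocol concrete, the DTOL interaction loop and the regret computation can be sketched as follows. This is a minimal illustration with function and variable names of our own choosing; the paper itself provides no code.

```python
import numpy as np

def play_dtol(agent, losses):
    """Run the DTOL protocol: `losses` is a (T, K) array of losses in [0, 1].

    `agent` maps the history of past loss vectors to a probability vector
    over the K actions. Returns the agent's regret after T rounds.
    """
    T, K = losses.shape
    agent_loss = 0.0
    for t in range(T):
        w = agent(losses[:t])        # probability vector for round t
        agent_loss += w @ losses[t]  # expected loss under w
    best_action_loss = losses.sum(axis=0).min()  # cumulative loss of the best action
    return agent_loss - best_action_loss         # regret after T rounds

# A trivial agent that always plays uniform weights:
uniform = lambda past: np.full(past.shape[1], 1.0 / past.shape[1])
```

Any of the strategies discussed below (Hedge, Follow-the-Leader, AdaHedge) fits the `agent` interface: a map from the observed loss history to a weight vector.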
If T is known to the agent, then the learning rate may be tuned to achieve worst-case regret bounded by √(T ln(K)/2), which is known to be optimal as T and K become large [4]. Nevertheless, by slightly relaxing the problem, one can obtain better guarantees. Suppose for example that the cumulative loss L*_T of the best action is known to the agent beforehand. Then, if the learning rate is set appropriately, the regret is bounded by √(2 L*_T ln(K)) + ln(K) [4], which has the same asymptotics as the previous bound in the worst case (because L*_T ≤ T) but may be much better when L*_T turns out to be small. Similarly, Hazan and Kale [6] obtain a bound of 8√(VARmax_T ln(K)) + 10 ln(K) for a modification of Hedge if the cumulative empirical variance VARmax_T of the best expert is known. In applications it may be unrealistic to assume that T or (especially) L*_T or VARmax_T is known beforehand, but at the cost of slightly worse constants such problems may be circumvented using either the doubling trick (setting a budget on the unknown quantity and restarting the algorithm with a double budget when the budget is depleted) [4, 7, 6], or a variable learning rate that is adjusted each round [4, 8].\n\nBounding the regret in terms of L*_T or VARmax_T is based on the idea that worst-case performance is not the only property of interest: such bounds give essentially the same guarantee in the worst case, but a much better guarantee in a plausible favourable case (when L*_T or VARmax_T is small). In this paper, we pursue the same goal for a different favourable case. To illustrate our approach, consider the following simplistic example with two actions: let 0 < a < b < 1 be such that b − a > 2ε. Then in odd rounds the first action gets loss a + ε and the second action gets loss b − ε; in even rounds the actions get losses a − ε and b + ε, respectively. Informally, this seems like a very easy instance of DTOL, because the cumulative losses of the actions diverge and it is easy to see from the losses which action is the best one. In fact, the Follow-the-Leader strategy, which puts all probability mass on the action with smallest cumulative loss, gives a regret of at most 1 in this case; the worst-case bound O(√(L*_T ln(K))) is very loose by comparison, and so is O(√(VARmax_T ln(K))), which is of the same order √(T ln(K)). On the other hand, for Follow-the-Leader one cannot guarantee sublinear regret for worst-case instances. (For example, if one out of two actions yields losses 1/2, 0, 1, 0, 1, . . . and the other action yields losses 0, 1, 0, 1, 0, . . ., its regret will be at least T/2 − 1.) To get the best of both worlds, we introduce an adaptive version of Hedge, called AdaHedge, that automatically adapts to the difficulty of the problem by varying the learning rate appropriately. As a result we obtain constant regret for the simplistic example above and other 'easy' instances of DTOL, while at the same time guaranteeing O(√(L*_T ln(K))) regret in the worst case.\n\nIt remains to characterise what we consider easy problems, which we will do in terms of the probabilities produced by Hedge. As explained below, these may be interpreted as a generalisation of Bayesian posterior probabilities. We measure the difficulty of the problem in terms of the speed at which the posterior probability of the best action converges to one. 
In the previous example, this happens at an exponential rate, whereas for worst-case instances the posterior probability of the best action does not converge to one at all.\n\nOutline.  In the next section we describe a new way of tuning the learning rate, and show that it yields essentially optimal performance guarantees in the worst case. To construct the AdaHedge algorithm, we then add the doubling trick to this idea in Section 3, and analyse its worst-case regret. In Section 4 we show that AdaHedge in fact incurs much smaller regret on easy problems. We compare AdaHedge to other instances of Hedge by means of a simulation study in Section 5. The proof of our main technical lemma is postponed to Section 6, and open questions are discussed in the concluding Section 7. Finally, longer proofs are only available as Additional Material in the full version at arXiv.org.\n\n2 Tuning the Learning Rate\n\nSetting.  Let the available actions be indexed by k ∈ {1, . . . , K}. At the start of each round t = 1, 2, . . . the agent A is to assign a probability w^k_t to each action k by producing a vector w_t = (w^1_t, . . . , w^K_t) with nonnegative components that sum up to 1. Then every action k incurs a loss ℓ^k_t ∈ [0, 1], which we collect in the loss vector ℓ_t = (ℓ^1_t, . . . , ℓ^K_t), and the loss of the agent is w_t · ℓ_t = Σ_{k=1}^K w^k_t ℓ^k_t. After T rounds action k has accumulated loss L^k_T = Σ_{t=1}^T ℓ^k_t, and the agent's regret is\n\nR_A(T) = Σ_{t=1}^T w_t · ℓ_t − L*_T,\n\nwhere L*_T = min_{1≤k≤K} L^k_T is the cumulative loss of the best action.\n\nHedge.  The Hedge algorithm chooses the weights w^k_{t+1} proportional to e^{−η L^k_t}, where η > 0 is the learning rate. 
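The Hedge weight update just described takes only a few lines. The sketch below is our own illustration; shifting by the minimum cumulative loss before exponentiating is a numerical precaution of ours (it leaves the normalized weights unchanged) and is not part of the paper.

```python
import numpy as np

def hedge_weights(cum_losses, eta):
    """Hedge weights w_{t+1}^k proportional to exp(-eta * L_t^k).

    `cum_losses` is the length-K vector of cumulative losses L_t.
    Subtracting the minimum before exponentiating avoids underflow
    for large eta * L_t without changing the normalized weights.
    """
    z = -eta * (cum_losses - cum_losses.min())
    w = np.exp(z)
    return w / w.sum()
```

For example, equal cumulative losses give uniform weights, and as one action's cumulative loss falls behind the others, its weight tends to one.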
As is well-known, these weights may essentially be interpreted as Bayesian posterior probabilities on actions, relative to a uniform prior and pseudo-likelihoods P^k_t = e^{−η L^k_t} = Π_{s=1}^t e^{−η ℓ^k_s} [9, 10, 4]:\n\nw^k_{t+1} = (1/K) P^k_t / B_t = e^{−η L^k_t} / Σ_{k'} e^{−η L^{k'}_t},    where    B_t = Σ_k (1/K) P^k_t = Σ_k (1/K) e^{−η L^k_t}    (1)\n\nis a generalisation of the Bayesian marginal likelihood. And like the ordinary marginal likelihood, B_t factorizes into sequential per-round contributions:\n\nB_t = Π_{s=1}^t w_s · e^{−η ℓ_s}.    (2)\n\nWe will sometimes write w_t(η) and B_t(η) instead of w_t and B_t in order to emphasize the dependence of these quantities on η.\n\nThe Learning Rate and the Mixability Gap.  A key quantity in our and previous [4] analyses is the gap between the per-round loss of the Hedge algorithm and the per-round contribution to the negative logarithm of the “marginal likelihood” B_T, which we call the mixability gap:\n\nδ_t(η) = w_t(η) · ℓ_t − ( −(1/η) ln(w_t(η) · e^{−η ℓ_t}) ).\n\nIn the setting of prediction with expert advice, the subtracted term coincides with the loss incurred by the Aggregating Pseudo-Algorithm (APA) which, by allowing the losses of the actions to be mixed with optimal efficiency, provides an idealised lower bound for the actual loss of any prediction strategy [9]. The mixability gap measures how closely we approach this ideal. 
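The mixability gap is straightforward to evaluate numerically. The following sketch (our own illustration, with names of our choosing) computes δ_t(η) for a given weight vector and loss vector:

```python
import numpy as np

def mixability_gap(w, losses, eta):
    """delta_t(eta) = w . l_t - ( -(1/eta) * ln(w . exp(-eta * l_t)) ).

    `w` is the Hedge weight vector for the round, `losses` the loss
    vector l_t; the subtracted term is the APA "mix loss".
    """
    hedge_loss = w @ losses                              # per-round Hedge loss
    mix_loss = -np.log(w @ np.exp(-eta * losses)) / eta  # idealised mix loss
    return hedge_loss - mix_loss
```

Consistent with Lemma 1 below, the returned value always lies in [0, η/8] for losses in [0, 1].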
As the same interpretation still holds in the more general DTOL setting of this paper, we can measure the difficulty of the problem, and tune η, in terms of the cumulative mixability gap:\n\nΔ_T(η) = Σ_{t=1}^T δ_t(η) = Σ_{t=1}^T w_t(η) · ℓ_t + (1/η) ln B_T(η).\n\nWe proceed to list some basic properties of the mixability gap. First, it is nonnegative and bounded above by a constant that depends on η:\n\nLemma 1. For any t and η > 0 we have 0 ≤ δ_t(η) ≤ η/8.\n\nProof. The lower bound follows by applying Jensen's inequality to the concave function ln, the upper bound from Hoeffding's bound on the cumulant generating function [4, Lemma A.1].\n\nFurther, the cumulative mixability gap Δ_T(η) can be related to L*_T via the following upper bound, proved in the Additional Material:\n\nLemma 2. For any T and η ∈ (0, 1] we have Δ_T(η) ≤ (η L*_T + ln(K)) / (e − 1).\n\nThis relationship will make it possible to provide worst-case guarantees similar to what is possible when η is tuned in terms of L*_T. However, for easy instances of DTOL this inequality is very loose, in which case we can prove substantially better regret bounds. We could now proceed by optimizing the learning rate η given the rather awkward assumption that Δ_T(η) is bounded by a known constant b for all η, which would be the natural counterpart to an analysis that optimizes η when a bound on L*_T is known. However, as Δ_T(η) varies with η and is unknown a priori anyway, it makes more sense to turn the analysis on its head and start by fixing η. 
We can then simply run the Hedge algorithm until the smallest T such that Δ_T(η) exceeds an appropriate budget b(η), which we set to\n\nb(η) = (1/η + 1/(e − 1)) ln(K).    (3)\n\nWhen at some point the budget is depleted, i.e. Δ_T(η) ≥ b(η), Lemma 2 implies that\n\nη ≥ √((e − 1) ln(K)/L*_T),    (4)\n\nso that, up to a constant factor, the learning rate used by AdaHedge is at least as large as the learning rates proportional to √(ln(K)/L*_T) that are used in the literature. On the other hand, it is not too large, because we can still provide a bound of order O(√(L*_T ln(K))) on the worst-case regret:\n\nTheorem 3. Suppose the agent runs Hedge with learning rate η ∈ (0, 1], and after T rounds has just used up the budget (3), i.e. b(η) ≤ Δ_T(η) < b(η) + η/8. Then its regret is bounded by\n\nR_Hedge(η)(T) < √(4/(e − 1) · L*_T ln(K)) + (1/(e − 1)) ln(K) + 1/8.\n\nProof. The cumulative loss of Hedge is bounded by\n\nΣ_{t=1}^T w_t · ℓ_t = Δ_T(η) − (1/η) ln B_T < b(η) + η/8 − (1/η) ln B_T ≤ (1/(e − 1)) ln(K) + 1/8 + (2/η) ln(K) + L*_T,    (5)\n\nwhere we have used the bound B_T ≥ (1/K) e^{−η L*_T}. Plugging in (4) completes the proof.\n\n3 The AdaHedge Algorithm\n\nWe now introduce the AdaHedge algorithm by adding the doubling trick to the analysis of the previous section. The doubling trick divides the rounds in segments i = 1, 2, . . ., and on each segment restarts Hedge with a different learning rate η_i. For AdaHedge we set η_1 = 1 initially, and scale down the learning rate by a factor of φ > 1 for every new segment, such that η_i = φ^{1−i}. 
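Combining the budget (3) with this segment scheme gives the AdaHedge procedure (Algorithm 1 in the next section). The sketch below is our own rendering in code, with function and variable names of our choosing; it restarts Hedge with a smaller learning rate whenever the cumulative mixability gap of the current segment exceeds its budget.

```python
import math

def adahedge(loss_stream, K, phi=2.0):
    """AdaHedge sketch: yields the weight vector used in each round.

    `loss_stream` yields length-K lists of losses in [0, 1]. A new
    segment starts when the segment's cumulative mixability gap `delta`
    reaches the budget b(eta) = (1/eta + 1/(e-1)) ln(K); the learning
    rate is then divided by `phi`. Note the loss for round t is only
    used after the round-t weights have been yielded.
    """
    eta = phi  # so that the first segment uses eta = phi / phi = 1
    delta, budget, w = 0.0, -1.0, None
    for losses in loss_stream:
        if w is None or delta >= budget:
            # start a new segment with a smaller learning rate
            eta /= phi
            budget = (1.0 / eta + 1.0 / (math.e - 1.0)) * math.log(K)
            delta, w = 0.0, [1.0 / K] * K
        yield list(w)
        # mixability gap of this round, then the posterior update
        mix = sum(wk * math.exp(-eta * lk) for wk, lk in zip(w, losses))
        delta += sum(wk * lk for wk, lk in zip(w, losses)) + math.log(mix) / eta
        w = [wk * math.exp(-eta * lk) / mix for wk, lk in zip(w, losses)]
```

On an easy instance where one action is clearly best, the weights concentrate on it and the budget is never exhausted, matching the constant-regret behaviour analysed in Section 4.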
We monitor Δ_t(η_i), measured only on the losses in the i-th segment, and when it exceeds its budget b_i = b(η_i) a new segment is started. The factor φ is a parameter of the algorithm. Theorem 5 below suggests setting its value to the golden ratio φ = (1 + √5)/2 ≈ 1.62 or simply to φ = 2.\n\nAlgorithm 1 AdaHedge(φ)    ▷ Requires φ > 1\n  η ← φ\n  for t = 1, 2, . . . do\n    if t = 1 or Δ ≥ b then\n      ▷ Start a new segment\n      η ← η/φ;  b ← (1/η + 1/(e − 1)) ln(K)\n      Δ ← 0;  w = (w^1, . . . , w^K) ← (1/K, . . . , 1/K)\n    end if\n    ▷ Make a decision\n    Output probabilities w for round t\n    Actions receive losses ℓ_t\n    ▷ Prepare for the next round\n    Δ ← Δ + w · ℓ_t + (1/η) ln(w · e^{−η ℓ_t})\n    w ← (w^1 · e^{−η ℓ^1_t}, . . . , w^K · e^{−η ℓ^K_t}) / (w · e^{−η ℓ_t})\n  end for\n\nThe regret of AdaHedge is determined by the number of segments it creates: the fewer segments there are, the smaller the regret.\n\nLemma 4. Suppose that after T rounds, the AdaHedge algorithm has started m new segments. Then its regret is bounded by\n\nR_AdaHedge(T) < 2 ln(K) · (φ^m − 1)/(φ − 1) + m · ((1/(e − 1)) ln(K) + 1/8).\n\nProof. The regret per segment is bounded as in (5). Summing over all m segments, and plugging in Σ_{i=1}^m 1/η_i = Σ_{i=0}^{m−1} φ^i = (φ^m − 1)/(φ − 1) gives the required inequality.\n\nUsing (4), one can obtain an upper bound on the number of segments that leads to the following guarantee for AdaHedge:\n\nTheorem 5. Suppose the agent runs AdaHedge for T rounds. 
Then its regret is bounded by\n\nR_AdaHedge(T) ≤ (φ√(φ² − 1)/(φ − 1)) · √(4/(e − 1) · L*_T ln(K)) + O(ln(L*_T + 2) ln(K)).\n\nFor details see the proof in the Additional Material. The value for φ that minimizes the leading factor is the golden ratio φ = (1 + √5)/2, for which φ√(φ² − 1)/(φ − 1) ≈ 3.33, but simply taking φ = 2 leads to a very similar factor of φ√(φ² − 1)/(φ − 1) ≈ 3.46.\n\n4 Easy Instances\n\nWhile the previous sections reassure us that AdaHedge performs well for the worst possible sequence of losses, we are also interested in its behaviour when the losses are not maximally antagonistic. We will characterise such sequences in terms of convergence of the Hedge posterior probability of the best action:\n\nw*_t(η) = max_{1≤k≤K} w^k_t(η).\n\n(Recall that w^k_t is proportional to e^{−η L^k_{t−1}}, so w*_t corresponds to the posterior probability of the action with smallest cumulative loss.) Technically, this is expressed by the following refinement of Lemma 1, which is proved in Section 6.\n\nLemma 6. For any t and η ∈ (0, 1] we have δ_t(η) ≤ (e − 2) η (1 − w*_t(η)).\n\nThis lemma, which may be of independent interest, is a variation on Hoeffding's bound on the cumulant generating function. While Lemma 1 leads to a bound on Δ_T(η) that grows linearly in T, Lemma 6 shows that Δ_T(η) may grow much slower. In fact, if the posterior probabilities w*_t converge to 1 sufficiently quickly, then Δ_T(η) is bounded, as shown by the following lemma. Recall that L*_T = min_{1≤k≤K} L^k_T.\n\nLemma 7. Let α and β be positive constants, and let τ ∈ Z+. Suppose that for t = τ, τ + 1, . . . 
, T there exists a single action k* that achieves minimal cumulative loss L^{k*}_t = L*_t, and for k ≠ k* the cumulative losses diverge as L^k_t − L*_t ≥ α t^β. Then for all η > 0\n\nΣ_{t=τ}^T (1 − w*_{t+1}(η)) ≤ C_K η^{−1/β},\n\nwhere C_K = (K − 1) α^{−1/β} Γ(1 + 1/β) is a constant that does not depend on η, τ or T.\n\nThe lemma is proved in the Additional Material. Together with Lemmas 1 and 6, it gives an upper bound on Δ_T(η), which may be used to bound the number of segments started by AdaHedge. This leads to the following result, whose proof is also delegated to the Additional Material.\n\nLet s(m) denote the round in which AdaHedge starts its m-th segment, and let L^k_r(m) = L^k_{s(m)+r−1} − L^k_{s(m)−1} denote the cumulative loss of action k in that segment.\n\nLemma 8. Let α > 0 and β > 1/2 be constants, and let C_K be as in Lemma 7. Suppose there exists a segment m* ∈ Z+ started by AdaHedge, such that τ := ⌊8 ln(K) φ^{(m*−1)(2−1/β)} − 8(e − 2)C_K + 1⌋ ≥ 1 and for some action k* the cumulative losses in segment m* diverge as\n\nL^k_r(m*) − L^{k*}_r(m*) ≥ α r^β    for all r ≥ τ and k ≠ k*.    (6)\n\nThen AdaHedge starts at most m* segments, and hence by Lemma 4 its regret is bounded by a constant:\n\nR_AdaHedge(T) = O(1).\n\nIn the simplistic example from the introduction, we may take α = b − a − 2ε and β = 1, such that (6) is satisfied for any τ ≥ 1. Taking m* large enough to ensure that τ ≥ 1, we find that AdaHedge never starts more than m* = 1 + ⌈log_φ((e − 2)/(α ln(2)) + 1/(8 ln(2)))⌉ segments. 
Let us also give an example of a probabilistic setting in which Lemma 8 applies:\n\nTheorem 9. Let α > 0 and δ ∈ (0, 1] be constants, and let k* be a fixed action. Suppose the loss vectors ℓ_t are independent random variables such that the expected differences in loss satisfy\n\nmin_{k≠k*} E[ℓ^k_t − ℓ^{k*}_t] ≥ 2α    for all t ∈ Z+.    (7)\n\nThen, with probability at least 1 − δ, AdaHedge starts at most\n\nm* = 1 + ⌈log_φ( (K − 1)(e − 2)/(α ln(K)) + ln(2K/(α²δ))/(4α² ln(K)) + 1/(8 ln(K)) )⌉    (8)\n\nsegments and consequently its regret is bounded by a constant:\n\nR_AdaHedge(T) = O(K + log(1/δ)).\n\nThis shows that the probabilistic setting of the theorem is much easier than the worst case, for which only a bound on the regret of order O(√(T ln(K))) is possible, and that AdaHedge automatically adapts to this easier setting. The proof of Theorem 9 is in the Additional Material. It verifies that the conditions of Lemma 8 hold with sufficient probability for β = 1, and α and m* as in the theorem.\n\n5 Experiments\n\nWe compare AdaHedge to other hedging algorithms in two experiments involving simulated losses.\n\n5.1 Hedging Algorithms\n\nFollow-the-Leader. This algorithm is included because it is simple and very effective if the losses are not antagonistic, although as mentioned in the introduction its regret is linear in the worst case.\n\nHedge with fixed learning rate. We also include Hedge with a fixed learning rate\n\nη = √(2 ln(K)/L*_T),    (9)\n\nwhich achieves the regret bound √(2 ln(K) L*_T) + ln(K)¹. Since η is a function of L*_T, the agent needs to use post-hoc knowledge to use this strategy.\n\nHedge with doubling trick. 
The common way to apply the doubling trick to L*_T is to set a budget on L*_T and multiply it by some constant φ′ at the start of each new segment, after which η is optimized for the new budget [4, 7]. Instead, we proceed the other way around and with each new segment first divide η by φ = 2 and then calculate the new budget such that (9) holds when Δ_t(η) reaches the budget. This way we keep the same invariant (η is never larger than the right-hand side of (9), with equality when the budget is depleted), and the frequency of doubling remains logarithmic in L*_T with a constant determined by φ, so both approaches are equally valid. However, controlling the sequence of values of η allows for easier comparison to AdaHedge.\n\nAdaHedge (Algorithm 1). Like in the previous algorithm, we set φ = 2. Because of how we set up the doubling, both algorithms now use the same sequence of learning rates 1, 1/2, 1/4, . . . ; the only difference is when they decide to start a new segment.\n\nHedge with variable learning rate. Rather than using the doubling trick, this algorithm, described in [8], changes the learning rate each round as a function of L*_t. This way there is no need to relearn the weights of the actions in each block, which leads to a better worst-case bound and potentially better performance in practice. Its behaviour on easy problems, as we are currently interested in, has not been studied.\n\n5.2 Generating the Losses\n\nIn both experiments we choose losses in {0, 1}. The experiments are set up as follows.\n\n¹Cesa-Bianchi and Lugosi use η = ln(1 + √(2 ln K/L*_T)) [4], but the same bound can be obtained for the simplified expression we use.\n\nFigure 1: Simulation results. (a) I.I.D. losses. (b) Correlated losses.\n\nI.I.D. losses. 
In the first experiment, all T = 10 000 losses for all K = 4 actions are independent, with distribution depending only on the action: the probabilities of incurring loss 1 are 0.35, 0.4, 0.45 and 0.5, respectively. The results are then averaged over 50 repetitions of the experiment.\n\nCorrelated losses. In the second experiment, the T = 10 000 loss vectors are still independent, but no longer identically distributed. In addition there are dependencies within the loss vectors ℓ_t, between the losses for the K = 2 available actions: each round is hard with probability 0.3, and easy otherwise. If round t is hard, then action 1 yields loss 1 with probability 1 − 0.01/t and action 2 yields loss 1 with probability 1 − 0.02/t. If the round is easy, then the probabilities are flipped and the actions yield loss 0 with the same probabilities. The results are averaged over 200 repetitions.\n\n5.3 Discussion and Results\n\nFigure 1 shows the results of the experiments above. We plot the regret (averaged over repetitions of the experiment) as a function of the number of rounds, for each of the considered algorithms.\n\nI.I.D. Losses. In the first considered regime, the accumulated losses for each action diverge linearly with high probability, so that the regret of Follow-the-Leader is bounded. Based on Theorem 9 we expect AdaHedge to incur bounded regret also; this is confirmed in Figure 1(a). Hedge with a fixed learning rate shows much larger regret. This happens because the learning rate, while it optimizes the worst-case bound, is much too small for this easy regime. In fact, if we would include more rounds, the learning rate would be set to an even smaller value, clearly showing the need to determine the learning rate adaptively. 
The doubling trick provides one way to adapt the learning rate; indeed, we observe that the regret of Hedge with the doubling trick is initially smaller than the regret of Hedge with fixed learning rate. However, unlike AdaHedge, the algorithm never detects that its current value of η is working well; instead it keeps exhausting its budget, which leads to a sequence of clearly visible bumps in its regret. Finally, it appears that the Hedge algorithm with variable learning rate also achieves bounded regret. This is surprising, as the existing theory for this algorithm only considers its worst-case behaviour, and the algorithm was not designed to do specifically well in easy regimes.\n\nCorrelated Losses. In the second simulation we investigate the case where the mean cumulative loss of two actions is extremely close, within O(log t) of one another. If the losses of the actions were independent, such a small difference would be dwarfed by random fluctuations in the cumulative losses, which would be of order O(√t). Thus the two actions can only be distinguished because we have made their losses dependent. Depending on the application, this may actually be a more natural scenario than complete independence as in the first simulation; for example, we can think of the losses as mistakes of two binary classifiers, say, two naive Bayes classifiers with different smoothing parameters. In such a scenario, losses will be dependent, and the difference in cumulative loss will be much smaller than O(√t). 
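For reproducibility, the correlated loss process of Section 5.2 can be generated as follows. This is a minimal sketch with our own function and variable names; the paper provides no code.

```python
import random

def correlated_losses(T, seed=0):
    """Generate the correlated-loss sequence of Section 5.2 (K = 2).

    Each round is 'hard' with probability 0.3. In hard rounds action 1
    yields loss 1 w.p. 1 - 0.01/t and action 2 w.p. 1 - 0.02/t; in easy
    rounds the probabilities are flipped and refer to loss 0 instead.
    Returns a list of T pairs (loss of action 1, loss of action 2).
    """
    rng = random.Random(seed)
    out = []
    for t in range(1, T + 1):
        p1, p2 = 1 - 0.01 / t, 1 - 0.02 / t
        if rng.random() < 0.3:  # hard round: loss 1 w.p. p1 resp. p2
            out.append((int(rng.random() < p1), int(rng.random() < p2)))
        else:                   # easy round: loss 0 w.p. p2 resp. p1
            out.append((int(rng.random() >= p2), int(rng.random() >= p1)))
    return out
```

Note that in every round action 1 has the (slightly) larger probability of loss, so action 2 is the better action, with an expected cumulative gap of only O(log t).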
In the previous experiment, the posterior weights of the actions converged relatively quickly for a large range of learning rates, so that the exact value of the learning rate was most important at the start (e.g., from 3000 rounds onward Hedge with fixed learning rate does not incur much additional regret any more). In this second setting, using a high learning rate remains important throughout. This explains why in this case Hedge with variable learning rate can no longer keep up with Follow-the-Leader. The results for AdaHedge are also interesting: although Theorem 9 does not apply in this case, we may still hope that Δ_t(η) grows slowly enough that the algorithm does not start too many segments. This turns out to be the case: over the 200 repetitions of the experiment, AdaHedge started only 2.265 segments on average, which explains its excellent performance in this simulation.\n\n6 Proof of Lemma 6\n\nOur main technical tool is Lemma 6. Its proof requires the following intermediate result:\n\nLemma 10. For any η > 0 and any time t, the function f(ℓ_t) = ln(w_t · e^{−η ℓ_t}) is convex.\n\nThis may be proved by observing that f is the convex conjugate of the Kullback-Leibler divergence. An alternative proof based on log-convexity is provided in the Additional Material.\n\nProof of Lemma 6. We need to bound δ_t = w_t(η) · ℓ_t + (1/η) ln(w_t(η) · e^{−η ℓ_t}), which is a convex function of ℓ_t by Lemma 10. 
As a consequence, its maximum is achieved when ℓ_t lies on the boundary of its domain, such that the losses ℓ^k_t are either 0 or 1 for all k, and in the remainder of the proof we will assume (without loss of generality) that this is the case. Now let α_t = w_t · ℓ_t be the posterior probability of the actions with loss 1. Then\n\nδ_t = α_t + (1/η) ln((1 − α_t) + α_t e^{−η}) = α_t + (1/η) ln(1 + α_t(e^{−η} − 1)).\n\nUsing ln x ≤ x − 1 and e^{−η} ≤ 1 − η + (1/2)η², we get δ_t ≤ (1/2) α_t η, which is tight for α_t near 0. For α_t near 1, rewrite\n\nδ_t = α_t − 1 + (1/η) ln(e^η(1 − α_t) + α_t)\n\nand use ln x ≤ x − 1 and e^η ≤ 1 + η + (e − 2)η² for η ≤ 1 to obtain δ_t ≤ (e − 2)(1 − α_t)η. Combining the bounds, we find\n\nδ_t ≤ (e − 2) η min{α_t, 1 − α_t}.\n\nNow, let k* be an action such that w*_t = w^{k*}_t. Then ℓ^{k*}_t = 0 implies α_t ≤ 1 − w*_t. On the other hand, if ℓ^{k*}_t = 1, then α_t ≥ w*_t so 1 − α_t ≤ 1 − w*_t. Hence, in both cases min{α_t, 1 − α_t} ≤ 1 − w*_t, which completes the proof.\n\n7 Conclusion and Future Work\n\nWe have presented a new algorithm, AdaHedge, that adapts to the difficulty of the DTOL learning problem. This difficulty was characterised in terms of convergence of the posterior probability of the best action. For hard instances of DTOL, for which the posterior does not converge, it was shown that the regret of AdaHedge is of the optimal order O(√(L*_T ln(K))); for easy instances, for which the posterior converges sufficiently fast, the regret was bounded by a constant. This behaviour was confirmed in a simulation study, where the algorithm outperformed existing versions of Hedge.\n\nA surprising observation in the experiments was the good performance of Hedge with a variable learning rate on some easy instances. It would be interesting to obtain matching theoretical guarantees, like those presented here for AdaHedge. A starting point might be to consider how fast the posterior probability of the best action converges to one, and plug that into Lemma 6.\n\nAcknowledgments\n\nThe authors would like to thank Wojciech Kotłowski for useful discussions. This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886, and by NWO Rubicon grant 680-50-1010. This publication only reflects the authors' views.\n\nReferences\n\n[1] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.\n\n[2] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.\n\n[3] V. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, 1998.\n\n[4] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.\n\n[5] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.\n\n[6] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 57–67, 2008.\n\n[7] N. 
Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.\n\n[8] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64:48–75, 2002.\n\n[9] V. Vovk. Competitive on-line statistics. International Statistical Review, 69(2):213–248, 2001.\n\n[10] D. Haussler, J. Kivinen, and M. K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906–1925, 1998.\n\n[11] A. N. Shiryaev. Probability. Springer-Verlag, 1996.\n", "award": [], "sourceid": 944, "authors": [{"given_name": "Tim", "family_name": "Erven", "institution": null}, {"given_name": "Wouter", "family_name": "Koolen", "institution": null}, {"given_name": "Steven", "family_name": "Rooij", "institution": null}, {"given_name": "Peter", "family_name": "Gr\u00fcnwald", "institution": null}]}