{"title": "A Drifting-Games Analysis for Online Learning and Applications to Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 1368, "page_last": 1376, "abstract": "We provide a general mechanism to design online learning algorithms based on a minimax analysis within a drifting-games framework. Different online learning settings (Hedge, multi-armed bandit problems and online convex optimization) are studied by converting into various kinds of drifting games. The original minimax analysis for drifting games is then used and generalized by applying a series of relaxations, starting from choosing a convex surrogate of the 0-1 loss function. With different choices of surrogates, we not only recover existing algorithms, but also propose new algorithms that are totally parameter-free and enjoy other useful properties. Moreover, our drifting-games framework naturally allows us to study high probability bounds without resorting to any concentration results, and also a generalized notion of regret that measures how good the algorithm is compared to all but the top small fraction of candidates. Finally, we translate our new Hedge algorithm into a new adaptive boosting algorithm that is computationally faster as shown in experiments, since it ignores a large number of examples on each round.", "full_text": "A Drifting-Games Analysis for Online Learning and Applications to Boosting

Haipeng Luo
Department of Computer Science
Princeton University
Princeton, NJ 08540
haipengl@cs.princeton.edu

Robert E. Schapire*
Department of Computer Science
Princeton University
Princeton, NJ 08540
schapire@cs.princeton.edu

Abstract

We provide a general mechanism to design online learning algorithms based on a minimax analysis within a drifting-games framework. 
Different online learning settings (Hedge, multi-armed bandit problems and online convex optimization) are studied by converting them into various kinds of drifting games. The original minimax analysis for drifting games is then used and generalized by applying a series of relaxations, starting from choosing a convex surrogate of the 0-1 loss function. With different choices of surrogates, we not only recover existing algorithms, but also propose new algorithms that are totally parameter-free and enjoy other useful properties. Moreover, our drifting-games framework naturally allows us to study high probability bounds without resorting to any concentration results, and also a generalized notion of regret that measures how good the algorithm is compared to all but the top small fraction of candidates. Finally, we translate our new Hedge algorithm into a new adaptive boosting algorithm that is computationally faster as shown in experiments, since it ignores a large number of examples on each round.

1 Introduction

In this paper, we study online learning problems within a drifting-games framework, with the aim of developing a general methodology for designing learning algorithms based on a minimax analysis. To solve an online learning problem, it is natural to consider game-theoretically optimal algorithms which find the best solution even in worst-case scenarios. This is possible for some special cases ([7, 1, 3, 21]) but difficult in general. On the other hand, many other efficient algorithms with optimal regret rate (but not exactly minimax optimal) have been proposed for different learning settings (such as the exponential weights algorithm [14, 15], and follow the perturbed leader [18]). However, it is not always clear how to come up with these algorithms. Recent work by Rakhlin et al. 
[26] built a bridge between these two classes of methods by showing that many existing algorithms can indeed be derived from a minimax analysis followed by a series of relaxations.
In this paper, we provide a parallel way to design learning algorithms by first converting online learning problems into variants of drifting games, and then applying a minimax analysis and relaxations. Drifting games [28] (reviewed in Section 2) generalize Freund's "majority-vote game" [13] and subsume some well-studied boosting and online learning settings. A nearly minimax optimal algorithm is proposed in [28]. It turns out the connections between drifting games and online learning go far beyond what has been discussed previously. To show that, we consider variants of drifting games that capture different popular online learning problems. We then generalize the minimax analysis in [28] based on one key idea: relax a 0-1 loss function by a convex surrogate. Although this idea has been applied widely elsewhere in machine learning, we use it here in a new way to obtain a very general methodology for designing and analyzing online learning algorithms. Using this general idea, we not only recover existing algorithms, but also design new ones with special useful properties. A somewhat surprising result is that our new algorithms are totally parameter-free, which is usually not the case for algorithms derived from a minimax analysis. Moreover, a generalized notion of regret (ε-regret, defined in Section 3) that measures how good the algorithm is compared to all but the top ε fraction of candidates arises naturally in our drifting-games framework. Below we summarize our results for a range of learning settings.

*R. Schapire is currently at Microsoft Research in New York City.

Hedge Settings: (Section 3) The Hedge problem [14] investigates how to cleverly bet across a set of actions. 
We show an algorithmic equivalence between this problem and a simple drifting game (DGv1). We then show how to relax the original minimax analysis step by step to reach a general recipe for designing Hedge algorithms (Algorithm 3). Three examples of appropriate convex surrogates of the 0-1 loss function are then discussed, leading to the well-known exponential weights algorithm and two other new ones, one of which (NormalHedge.DT in Section 3.3) bears some similarities with the NormalHedge algorithm [10] and enjoys a similar ε-regret bound simultaneously for all ε and horizons. However, our regret bounds do not depend on the number of actions, and thus can be applied even when there are infinitely many actions. Our analysis is also arguably simpler and more intuitive than the one in [10] and easy to generalize to more general settings. Moreover, our algorithm is more computationally efficient since it does not require a numerical search step as in NormalHedge. Finally, we also derive high probability bounds for the randomized Hedge setting as a simple side product of our framework without using any concentration results.
Multi-armed Bandit Problems: (Section 4) The multi-armed bandit problem [6] is a classic example of learning with incomplete information where the learner can only obtain feedback for the actions taken. To capture this problem, we study a quite different drifting game (DGv2) where randomness and variance constraints are taken into account. Again the minimax analysis is generalized and the EXP3 algorithm [6] is recovered. Our results could be seen as a preliminary step toward answering the open question [2] on exact minimax optimal algorithms for the multi-armed bandit problem.
Online Convex Optimization: (Section 4) Based on the theory of convex optimization, online convex optimization [31] has been the foundation of modern online learning theory. 
The corresponding drifting game formulation is a continuous-space variant (DGv3). Fortunately, it turns out that all results from the Hedge setting are ready to be used here, recovering the continuous EXP algorithm [12, 17, 24] and also generalizing our new algorithms to this general setting. Besides the usual regret bounds, we also generalize the ε-regret, which, as far as we know, is the first time it has been explicitly studied. Again, we emphasize that our new algorithms are adaptive in ε and the horizon.
Boosting: (Section 4) Realizing that every Hedge algorithm can be converted into a boosting algorithm ([29]), we propose a new boosting algorithm (NH-Boost.DT) by converting NormalHedge.DT. The adaptivity of NormalHedge.DT is then translated into training error and margin distribution bounds that previous analysis in [29] using nonadaptive algorithms does not show. Moreover, our new boosting algorithm ignores a great many examples on each round, which is an appealing property useful for speeding up the weak learning algorithm. This is confirmed by our experiments.
Related work: Our analysis makes use of potential functions. Similar concepts have appeared widely in the literature [8, 5], but unlike our work, they are not related to any minimax analysis and might be hard to interpret. The existence of parameter-free Hedge algorithms for an unknown number of actions was shown in [11], but no concrete algorithms were given there. Boosting algorithms that ignore some examples on each round were studied in [16], where a heuristic was used to ignore examples with small weights and no theoretical guarantee is provided.

2 Reviewing Drifting Games

We consider a simplified version of drifting games similar to the one described in [29, chap. 13] (also called chip games). This game proceeds through T rounds, and is played between a player and an adversary who controls N chips on the real line. 
The positions of these chips at the end of round t are denoted by s_t ∈ R^N, with each coordinate s_{t,i} corresponding to the position of chip i. Initially, all chips are at position 0 so that s_0 = 0. On every round t = 1, ..., T: the player first chooses a distribution p_t over the chips, then the adversary decides the movements of the chips z_t so that the new positions are updated as s_t = s_{t−1} + z_t. Here, each z_{t,i} has to be picked from a prespecified set B ⊆ R, and more importantly, satisfy the constraint p_t · z_t ≥ β for some fixed constant β. At the end of the game, each chip is associated with a nonnegative loss defined by L(s_{T,i}) for some nonincreasing function L mapping from the final position of the chip to R_+. The goal of the player is to minimize the chips' average loss (1/N)∑_{i=1}^N L(s_{T,i}) after T rounds. So intuitively, the player aims to "push" the chips to the right by assigning appropriate weights on them so that the adversary has to move them to the right by β in a weighted average sense on each round. This game captures many learning problems. For instance, binary classification via boosting can be translated into a drifting game by treating each training example as a chip (see [28] for details).
We regard a player's strategy D as a function mapping from the history of the adversary's decisions to a distribution that the player is going to play with, that is, p_t = D(z_{1:t−1}) where z_{1:t−1} stands for z_1, ..., z_{t−1}. The player's worst case loss using this algorithm is then denoted by L_T(D). 
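The protocol just described is simple enough to simulate directly. Below is a minimal sketch (our own illustration, not code from the paper); `player` and `adversary` are hypothetical callables, and the loss is taken to be L(s) = 1{s ≤ 0}:

```python
import numpy as np

# A minimal sketch of the drifting-game protocol (our own illustration).
# `player` maps positions to a distribution; `adversary` picks movements
# in B = [-1, 1] subject to the constraint p_t . z_t >= beta.

def play_drifting_game(player, adversary, N=5, T=10, beta=0.0):
    """Run T rounds and return the average final loss with L(s) = 1{s <= 0}."""
    s = np.zeros(N)                              # chip positions, s_0 = 0
    for t in range(T):
        p = player(s)                            # distribution over chips
        z = adversary(p, s)                      # movements of the chips
        assert np.all((-1 <= z) & (z <= 1)) and p @ z >= beta - 1e-9
        s = s + z                                # s_t = s_{t-1} + z_t
    return float(np.mean(s <= 0))                # (1/N) sum_i L(s_{T,i})

# Example: uniform player against an adversary that never moves any chip
# (legal when beta = 0); every chip stays at 0, so every chip suffers loss 1.
uniform_player = lambda s: np.ones(len(s)) / len(s)
lazy_adv = lambda p, s: np.zeros(len(s))
print(play_drifting_game(uniform_player, lazy_adv))  # -> 1.0
```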
The minimax optimal loss of the game is computed by the following expression: min_D L_T(D) = min_{p_1∈Δ_N} max_{z_1∈Z_{p_1}} · · · min_{p_T∈Δ_N} max_{z_T∈Z_{p_T}} (1/N)∑_{i=1}^N L(∑_{t=1}^T z_{t,i}), where Δ_N is the N-dimensional simplex and Z_p = B^N ∩ {z : p · z ≥ β} is assumed to be compact. A strategy D* that realizes the minimum in min_D L_T(D) is called a minimax optimal strategy. A nearly optimal strategy and its analysis is originally given in [28], and a derivation by directly tackling the above minimax expression can be found in [29, chap. 13]. Specifically, a sequence of potential functions of a chip's position is defined recursively as follows:

Φ_T(s) = L(s),   Φ_{t−1}(s) = min_{w∈R_+} max_{z∈B} (Φ_t(s + z) + w(z − β)).    (1)

Let w_{t,i} be the weight that realizes the minimum in the definition of Φ_{t−1}(s_{t−1,i}), that is, w_{t,i} ∈ argmin_w max_z (Φ_t(s_{t−1,i} + z) + w(z − β)). Then the player's strategy is to set p_{t,i} ∝ w_{t,i}. The key property of this strategy is that it assures that the sum of the potentials over all the chips never increases, connecting the player's final loss with the potential at time 0 as follows:

(1/N)∑_{i=1}^N L(s_{T,i}) ≤ (1/N)∑_{i=1}^N Φ_T(s_{T,i}) ≤ (1/N)∑_{i=1}^N Φ_{T−1}(s_{T−1,i}) ≤ · · · ≤ (1/N)∑_{i=1}^N Φ_0(s_{0,i}) = Φ_0(0).    (2)

It has been shown in [28] that this upper bound on the loss is optimal in a very strong sense. Moreover, in some cases the potential functions have nice closed forms and thus the algorithm can be efficiently implemented. For example, in the boosting setting, B is simply {−1, +1}, and one can verify Φ_t(s) = (1/2)Φ_{t+1}(s − 1) + (1/2)Φ_{t+1}(s + 1) and w_{t,i} = (1/2)(Φ_{t+1}(s_{t−1,i} − 1) − Φ_{t+1}(s_{t−1,i} + 1)). With the loss function L(s) being 1{s ≤ 0}, these can be further simplified and eventually give exactly the boost-by-majority algorithm [13].

3 Online Learning as a Drifting Game

The connection between drifting games and some specific settings of online learning has been noticed before ([28, 23]). We aim to find deeper connections or even an equivalence between variants of drifting games and more general settings of online learning, and provide insights on designing learning algorithms through a minimax analysis. We start with a simple yet classic Hedge setting.

3.1 Algorithmic Equivalence

In the Hedge setting [14], a player tries to earn as much as possible (or lose as little as possible) by cleverly spreading a fixed amount of money to bet on a set of actions on each day. Formally, the game proceeds for T rounds, and on each round t = 1, ..., T: the player chooses a distribution p_t over N actions, then the adversary decides the actions' losses ℓ_t (i.e. action i incurs loss ℓ_{t,i} ∈ [0, 1]) which are revealed to the player. The player suffers a weighted average loss p_t · ℓ_t at the end of this round. The goal of the player is to minimize his "regret", which is usually defined as the difference between his total loss and the loss of the best action. Here, we consider an even more general notion of regret studied in [20, 19, 10, 11], which we call ε-regret. Suppose the actions are ordered according to their total losses after T rounds (i.e. ∑_{t=1}^T ℓ_{t,i}) from smallest to largest, and let i_ε be the index of the action that is the ⌈Nε⌉-th element in the sorted list (0 < ε ≤ 1). Now, ε-regret is defined as R^ε_T(p_{1:T}, ℓ_{1:T}) = ∑_{t=1}^T p_t · ℓ_t − ∑_{t=1}^T ℓ_{t,i_ε}. In other words, ε-regret measures the difference between the player's loss and the loss of the ⌈Nε⌉-th best action (recovering the usual regret with ε ≤ 1/N), and sublinear ε-regret implies that the player's loss is almost as good as all but the top ε fraction of actions. Similarly, R^ε_T(H) denotes the worst case ε-regret for a specific algorithm H. For convenience, when ε ≤ 0 or ε > 1, we define ε-regret to be ∞ or −∞ respectively.

Input: A Hedge Algorithm H
for t = 1 to T do
  Query H: p_t = H(ℓ_{1:t−1}).
  Set: D_R(z_{1:t−1}) = p_t.
  Receive movements z_t from the adversary.
  Set: ℓ_{t,i} = z_{t,i} − min_j z_{t,j}, ∀i.
Algorithm 1: Conversion of a Hedge Algorithm H to a DGv1 Algorithm D_R

Input: A DGv1 Algorithm D_R
for t = 1 to T do
  Query D_R: p_t = D_R(z_{1:t−1}).
  Set: H(ℓ_{1:t−1}) = p_t.
  Receive losses ℓ_t from the adversary.
  Set: z_{t,i} = ℓ_{t,i} − p_t · ℓ_t, ∀i.
Algorithm 2: Conversion of a DGv1 Algorithm D_R to a Hedge Algorithm H

Next we discuss how Hedge is highly related to drifting games. Consider a variant of drifting games where B = [−1, 1], β = 0 and L(s) = 1{s ≤ −R} for some constant R. Additionally, we impose an extra restriction on the adversary: |z_{t,i} − z_{t,j}| ≤ 1 for all i and j. In other words, the difference between any two chips' movements is at most 1. We denote this specific variant of drifting games by DGv1 (summarized in Appendix A) and a corresponding algorithm by D_R to emphasize the dependence on R. The reductions in Algorithms 1 and 2 and Theorem 1 show that DGv1 and the Hedge problem are algorithmically equivalent (note that both conversions are valid). The proof is straightforward and deferred to Appendix B. 
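The conversion of Algorithm 2 is mechanical enough to sketch in code. The following is our own illustration (names are hypothetical): a drifting-game strategy, given as a callable from the movement history z_{1:t−1} to a distribution p_t, is wrapped as a Hedge player:

```python
import numpy as np

# A sketch of Algorithm 2 (our own illustration): the drifting-game strategy
# `dg_player` maps the history of movements z_{1:t-1} to a distribution p_t,
# and the movements z_{t,i} = l_{t,i} - p_t . l_t are fed back as history.

def hedge_from_dg(dg_player, losses):
    """Play Hedge on a (T x N) loss matrix; return the total weighted loss."""
    z_hist, total = [], 0.0
    for l in losses:
        p = dg_player(z_hist)        # p_t = D_R(z_{1:t-1})
        total += p @ l               # player suffers p_t . l_t
        z_hist.append(l - p @ l)     # z_{t,i} = l_{t,i} - p_t . l_t
    return total

# With a uniform drifting-game player and identical rounds, the weighted
# loss is just the per-round average loss times T.
uniform_dg = lambda hist: np.ones(3) / 3
losses = np.array([[0.0, 0.5, 1.0]] * 4)   # T = 4 rounds, N = 3 actions
print(hedge_from_dg(uniform_dg, losses))   # -> 2.0
```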
By Theorem 1, it is clear that the minimax optimal algorithm for one setting is also minimax optimal for the other under these conversions.

Theorem 1. DGv1 and the Hedge problem are algorithmically equivalent in the following sense: (1) Algorithm 1 produces a DGv1 algorithm D_R satisfying L_T(D_R) ≤ i/N where i ∈ {0, ..., N} is such that R^{(i+1)/N}_T(H) < R ≤ R^{i/N}_T(H). (2) Algorithm 2 produces a Hedge algorithm H with R^ε_T(H) < R for any R such that L_T(D_R) < ε.

3.2 Relaxations

From now on we only focus on the direction of converting a drifting game algorithm into a Hedge algorithm. In order to derive a minimax Hedge algorithm, Theorem 1 tells us it suffices to derive minimax DGv1 algorithms. Exact minimax analysis is usually difficult, and appropriate relaxations seem to be necessary. To make use of the existing analysis for standard drifting games, the first obvious relaxation is to drop the additional restriction in DGv1, that is, |z_{t,i} − z_{t,j}| ≤ 1 for all i and j. Doing this will lead to the exact setting discussed in [23] where a near optimal strategy is proposed using the recipe in Eq. (1). It turns out that this relaxation is reasonable and does not give too much more power to the adversary. To see this, first recall that results from [23], written in our notation, state that min_{D_R} L_T(D_R) ≤ (1/2^T)∑_{j=0}^{⌊(T−R)/2⌋} C(T+1, j), which, by Hoeffding's inequality, is upper bounded by 2 exp(−(R+1)²/(2(T+1))). Second, statement (2) in Theorem 1 clearly remains valid if the input of Algorithm 2 is a drifting game algorithm for this relaxed version of DGv1. Therefore, by setting ε > 2 exp(−(R+1)²/(2(T+1))) and solving for R, we have R^ε_T(H) ≤ O(√(T ln(1/ε))), which is the known optimal regret rate for the Hedge problem, showing that we lose little due to this relaxation.
However, the algorithm proposed in [23] is not computationally efficient since the potential functions Φ_t(s) do not have closed forms. To get around this, we would want the minimax expression in Eq. (1) to be easily solved, just like the case when B = {−1, 1}. It turns out that convexity would allow us to treat B = [−1, 1] almost as B = {−1, 1}. Specifically, if each Φ_t(s) is a convex function of s, then due to the fact that the maximum of a convex function is always realized at the boundary of a compact region, we have

min_{w∈R_+} max_{z∈[−1,1]} (Φ_t(s + z) + wz) = min_{w∈R_+} max_{z∈{−1,1}} (Φ_t(s + z) + wz) = (Φ_t(s − 1) + Φ_t(s + 1))/2,    (3)

with w = (Φ_t(s − 1) − Φ_t(s + 1))/2 realizing the minimum. Since the 0-1 loss function L(s) is not convex, this motivates us to find a convex surrogate of L(s). Fortunately, relaxing the equality constraints in Eq. (1) does not affect the key property of Eq. (2) as we will show in the proof of Theorem 2. "Compiling out" the input of Algorithm 2, we thus have our general recipe (Algorithm 3) for designing Hedge algorithms with the following regret guarantee.

Input: A convex, nonincreasing, nonnegative function Φ_T(s).
for t = T down to 1 do
  Find a convex function Φ_{t−1}(s) s.t. ∀s, Φ_t(s − 1) + Φ_t(s + 1) ≤ 2Φ_{t−1}(s).
Set: s_0 = 0.
for t = 1 to T do
  Set: H(ℓ_{1:t−1}) = p_t s.t. p_{t,i} ∝ Φ_t(s_{t−1,i} − 1) − Φ_t(s_{t−1,i} + 1).
  Receive losses ℓ_t and set s_{t,i} = s_{t−1,i} + ℓ_{t,i} − p_t · ℓ_t, ∀i.
Algorithm 3: A General Hedge Algorithm H

Theorem 2. For Algorithm 3, if R and ε are such that Φ_0(0) < ε and Φ_T(s) ≥ 1{s ≤ −R} for all s ∈ R, then R^ε_T(H) < R.

Proof. It suffices to show that Eq. (2) holds so that the theorem follows by a direct application of statement (2) of Theorem 1. Let w_{t,i} = (Φ_t(s_{t−1,i} − 1) − Φ_t(s_{t−1,i} + 1))/2. Then ∑_i Φ_t(s_{t,i}) ≤ ∑_i (Φ_t(s_{t−1,i} + z_{t,i}) + w_{t,i} z_{t,i}) since p_{t,i} ∝ w_{t,i} and p_t · z_t ≥ 0. On the other hand, by Eq. (3), we have Φ_t(s_{t−1,i} + z_{t,i}) + w_{t,i} z_{t,i} ≤ min_{w∈R_+} max_{z∈[−1,1]} (Φ_t(s_{t−1,i} + z) + wz) = (1/2)(Φ_t(s_{t−1,i} − 1) + Φ_t(s_{t−1,i} + 1)), which is at most Φ_{t−1}(s_{t−1,i}) by Algorithm 3. This shows ∑_i Φ_t(s_{t,i}) ≤ ∑_i Φ_{t−1}(s_{t−1,i}) and Eq. (2) follows.

Theorem 2 tells us that if solving Φ_0(0) < ε for R gives R > R̄ for some value R̄, then the regret of Algorithm 3 is less than any value that is greater than R̄, meaning the regret is at most R̄.

3.3 Designing Potentials and Algorithms

Now we are ready to recover existing algorithms and develop new ones by choosing an appropriate potential Φ_T(s) as Algorithm 3 suggests. We will discuss three different algorithms below, and summarize these examples in Table 1 (see Appendix C).

Exponential Weights (EXP) Algorithm. Exponential loss is an obvious choice for Φ_T(s) as it has been widely used as the convex surrogate of the 0-1 loss function in the literature. It turns out that this will lead to the well-known exponential weights algorithm [14, 15]. Specifically, we pick Φ_T(s) to be exp(−η(s + R)) which exactly upper bounds 1{s ≤ −R}. To compute Φ_t(s) for t ≤ T, we simply let Φ_t(s − 1) + Φ_t(s + 1) ≤ 2Φ_{t−1}(s) hold with equality. Indeed, direct computations show that all Φ_t(s) share a similar form: Φ_t(s) = ((e^η + e^{−η})/2)^{T−t} · exp(−η(s + R)). Therefore, according to Algorithm 3, the player's strategy is to set

p_{t,i} ∝ Φ_t(s_{t−1,i} − 1) − Φ_t(s_{t−1,i} + 1) ∝ exp(−η s_{t−1,i}),

which is exactly the same as EXP (note that R becomes irrelevant after normalization). To derive regret bounds, it suffices to require Φ_0(0) < ε, which is equivalent to R > (1/η)(ln(1/ε) + T ln((e^η + e^{−η})/2)). By Theorem 2 and Hoeffding's lemma (see [9, Lemma A.1]), we thus know R^ε_T(H) ≤ (1/η) ln(1/ε) + ηT/2 = √(2T ln(1/ε)), where the last step is by optimally tuning η to be √(2 ln(1/ε)/T). Note that this algorithm is not adaptive in the sense that it requires knowledge of T and ε to set the parameter η. We have thus recovered the well-known EXP algorithm and given a new analysis using the drifting-games framework. More importantly, as in [26], this derivation may shed light on why this algorithm works and where it comes from, namely, a minimax analysis followed by a series of relaxations, starting from a reasonable surrogate of the 0-1 loss function.

2-norm Algorithm. We next move on to another simple convex surrogate: Φ_T(s) = a[s]₋² ≥ 1{s ≤ −1/√a}, where a is some positive constant and [s]₋ = min{0, s} represents a truncating operation. The following lemma shows that Φ_t(s) can also be simply described.

Lemma 1. If a > 0, then Φ_t(s) = a([s]₋² + T − t) satisfies Φ_t(s − 1) + Φ_t(s + 1) ≤ 2Φ_{t−1}(s).

Thus, Algorithm 3 can again be applied. The resulting algorithm is extremely concise:

p_{t,i} ∝ Φ_t(s_{t−1,i} − 1) − Φ_t(s_{t−1,i} + 1) ∝ [s_{t−1,i} − 1]₋² − [s_{t−1,i} + 1]₋².

We call this the "2-norm" algorithm since it resembles the p-norm algorithm in the literature when p = 2 (see [9]). The difference is that the p-norm algorithm sets the weights proportional to the derivative of potentials, instead of the difference of them as we are doing here. A somewhat surprising property of this algorithm is that it is totally adaptive and parameter-free (since a disappears under normalization), a property that we usually do not expect to obtain from a minimax analysis. Direct application of Theorem 2 (Φ_0(0) = aT < ε, i.e. 1/√a > √(T/ε)) shows that its regret achieves the optimal dependence on the horizon T.

Corollary 1. Algorithm 3 with potential Φ_t(s) defined in Lemma 1 produces a Hedge algorithm H such that R^ε_T(H) ≤ √(T/ε) simultaneously for all T and ε.

NormalHedge.DT. The regret for the 2-norm algorithm does not have the optimal dependence on ε. 
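As a concrete illustration of how concise the 2-norm algorithm is, here is a sketch of Algorithm 3 instantiated with the potential of Lemma 1 (our own code, not the paper's; the constant a is dropped since it cancels under normalization):

```python
import numpy as np

# A sketch of the 2-norm algorithm (our own illustration): weights are the
# differences of the Lemma 1 potentials, and positions follow the drifting-
# game update s_{t,i} = s_{t-1,i} + l_{t,i} - p_t . l_t. Parameter-free.

def neg(x):
    """[x]_- = min{0, x}, elementwise."""
    return np.minimum(x, 0.0)

def two_norm_hedge(losses):
    """Run the 2-norm algorithm on a (T x N) loss matrix; return total loss."""
    T, N = losses.shape
    s, total = np.zeros(N), 0.0
    for l in losses:
        w = neg(s - 1.0) ** 2 - neg(s + 1.0) ** 2    # potential differences
        p = w / w.sum() if w.sum() > 0 else np.ones(N) / N
        total += p @ l
        s = s + l - p @ l                            # drifting-game update
    return total

losses = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(two_norm_hedge(losses))   # weight shifts toward the lagging action
```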
An obvious follow-up question would be whether it is possible to derive an adaptive algorithm that achieves the optimal rate O(√(T ln(1/ε))) simultaneously for all T and ε using our framework. An even deeper question is: instead of choosing convex surrogates in a seemingly arbitrary way, is there a more natural way to find the right choice of Φ_T(s)?
To answer these questions, we recall that the reason why the 2-norm algorithm can get rid of the dependence on ε is that ε appears merely in the multiplicative constant a that does not play a role after normalization. This motivates us to let Φ_T(s) be of the form εF(s) for some F(s). On the other hand, from Theorem 2, we also want εF(s) to upper bound the 0-1 loss function 1{s ≤ −√(dT ln(1/ε))} for some constant d. Taken together, this is telling us that the right choice of F(s) should be of the form Θ(exp(s²/T))¹. Of course we still need to refine it to satisfy the monotonicity and other properties. We define Φ_T(s) formally and more generally as:

Φ_T(s) = a(exp([s]₋²/(dT)) − 1) ≥ 1{s ≤ −√(dT ln(1/a + 1))},

where a and d are some positive constants. This time it is more involved to figure out what the other Φ_t(s) should be. The following lemma addresses this issue (proof deferred to Appendix C).

Lemma 2. If b_t = 1 − (1/2)∑_{τ=t+1}^T (exp(4/(dτ)) − 1), a > 0, d ≥ 3 and Φ_t(s) = a(exp([s]₋²/(dt)) − b_t) (define Φ_0(s) ≡ a(1 − b_0)), then we have Φ_t(s − 1) + Φ_t(s + 1) ≤ 2Φ_{t−1}(s) for all s ∈ R and t = 2, ..., T. Moreover, Eq. (2) still holds.

Note that even if Φ_1(s − 1) + Φ_1(s + 1) ≤ 2Φ_0(s) is not valid in general, Lemma 2 states that Eq. (2) still holds. Thus Algorithm 3 can indeed still be applied, leading to our new algorithm:

p_{t,i} ∝ Φ_t(s_{t−1,i} − 1) − Φ_t(s_{t−1,i} + 1) ∝ exp([s_{t−1,i} − 1]₋²/(dt)) − exp([s_{t−1,i} + 1]₋²/(dt)).

Here, d seems to be an extra parameter, but in fact, simply setting d = 3 is good enough:

Corollary 2. Algorithm 3 with potential Φ_t(s) defined in Lemma 2 and d = 3 produces a Hedge algorithm H such that the following holds simultaneously for all T and ε:

R^ε_T(H) ≤ √(3T ln((1/(2ε))(e^{4/3} − 1)(ln T + 1) + 1)) = O(√(T ln(1/ε) + T ln ln T)).

We have thus proposed a parameter-free adaptive algorithm with optimal regret rate (ignoring the ln ln T term) using our drifting-games framework. In fact, our algorithm bears a striking similarity to NormalHedge [10], the first algorithm that has this kind of adaptivity. We thus name our algorithm NormalHedge.DT². We include NormalHedge in Table 1 for comparison. One can see that the main differences are: 1) On each round NormalHedge performs a numerical search to find out the right parameter used in the exponents; 2) NormalHedge uses the derivative of potentials as weights.

¹Similar potential was also proposed in recent work [22, 25] for a different setting.
²"DT" stands for discrete time.

Compared to NormalHedge, the regret bound for NormalHedge.DT has no explicit dependence on N, but has a slightly worse dependence on T (indeed ln ln T is almost negligible). 
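For concreteness, the NormalHedge.DT weight computation can be sketched in a few lines (our own illustration; the constant a again cancels under normalization):

```python
import numpy as np

# A sketch of the NormalHedge.DT weights with d = 3 (our own illustration):
# p_{t,i} is proportional to exp([s-1]_-^2/(dt)) - exp([s+1]_-^2/(dt)).
# Actions whose drift s is at least 1 receive exactly zero weight.

def normalhedge_dt_weights(s, t, d=3.0):
    neg = lambda x: np.minimum(x, 0.0)
    w = np.exp(neg(s - 1.0) ** 2 / (d * t)) - np.exp(neg(s + 1.0) ** 2 / (d * t))
    return w / w.sum() if w.sum() > 0 else np.ones(len(s)) / len(s)

s = np.array([2.0, 0.0, -3.0])        # ahead, even, far behind
p = normalhedge_dt_weights(s, t=5)
print(p)                              # the leading action gets zero weight
```

Note there is no numerical search: the exponent scale dt is fixed in advance, which is exactly what makes the update cheaper than NormalHedge's per-round root-finding.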
We emphasize other advantages of our algorithm over NormalHedge: 1) NormalHedge.DT is more computationally efficient especially when N is very large, since it does not need a numerical search for each round; 2) our analysis is arguably simpler and more intuitive than the one in [10]; 3) as we will discuss in Section 4, NormalHedge.DT can be easily extended to deal with the more general online convex optimization problem where the number of actions is infinitely large, while it is not clear how to do that for NormalHedge by generalizing the analysis in [10]. Indeed, the extra dependence on the number of actions N for the regret of NormalHedge makes this generalization even seem impossible. Finally, we will later see that NormalHedge.DT outperforms NormalHedge in experiments. Despite the differences, it is worth noting that both algorithms assign zero weight to some actions on each round, an appealing property when N is huge. We will discuss more on this in Section 4.

3.4 High Probability Bounds

We now consider a common variant of Hedge: on each round, instead of choosing a distribution p_t, the player has to randomly pick a single action i_t, while the adversary decides the losses ℓ_t at the same time (without seeing i_t). For now we only focus on the player's regret to the best action: R_T(i_{1:T}, ℓ_{1:T}) = ∑_{t=1}^T ℓ_{t,i_t} − min_i ∑_{t=1}^T ℓ_{t,i}. Notice that the regret is now a random variable, and we are interested in a bound that holds with high probability. Using Azuma's inequality, standard analysis (see for instance [9, Lemma 4.1]) shows that the player can simply draw i_t according to p_t = H(ℓ_{1:t−1}), the output of a standard Hedge algorithm, and suffers regret at most R_T(H) + √(T ln(1/δ)) with probability 1 − δ. 
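The randomized variant just described can be sketched as follows (a hypothetical illustration, not the paper's code): the player samples a single action i_t from p_t, and the realized regret is measured against the best action in hindsight.

```python
import numpy as np

# A sketch of the randomized Hedge variant (our own illustration): on each
# round a single action i_t ~ p_t is drawn, only its loss is suffered, and
# regret compares the realized loss to the best single action in hindsight.

def randomized_hedge(policy, losses, rng):
    """Return the realized regret of one run; policy maps l_{1:t-1} to p_t."""
    T, N = losses.shape
    total = 0.0
    for t in range(T):
        p = policy(losses[:t])
        i_t = rng.choice(N, p=p)      # single randomly drawn action
        total += losses[t, i_t]       # only this loss is suffered
    return total - losses.sum(axis=0).min()

uniform_policy = lambda hist: np.ones(2) / 2
losses = np.array([[0.0, 1.0]] * 6)   # arm 0 is always better
rng = np.random.default_rng(0)
regret = randomized_hedge(uniform_policy, losses, rng)
print(regret)                         # number of rounds arm 1 was drawn
```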
Below we recover similar results as a simple side product of our drifting-games analysis without resorting to concentration results, such as Azuma's inequality. For this, we only need to modify Algorithm 3 by setting z_{t,i} = ℓ_{t,i} − ℓ_{t,i_t}. The restriction p_t · z_t ≥ 0 is then relaxed to hold in expectation. Moreover, it is clear that Eq. (2) also still holds in expectation. On the other hand, by definition and the union bound, one can show that ∑_i E[L(s_{T,i})] = ∑_i Pr[s_{T,i} ≤ −R] ≥ Pr[R_T(i_{1:T}, ℓ_{1:T}) ≥ R]. So setting NΦ_0(0) = δ shows that the regret is smaller than R with probability 1 − δ. Therefore, for example, if EXP is used, then the regret would be at most √(2T ln(N/δ)) with probability 1 − δ, giving basically the same bound as the standard analysis. One drawback is that EXP would need δ as a parameter. However, this can again be addressed by NormalHedge.DT for the exact same reason that NormalHedge.DT is independent of ε. We have thus derived high probability bounds without using any concentration inequalities.

4 Generalizations and Applications

Multi-armed Bandit (MAB) Problem: The only difference between Hedge (randomized version) and the non-stochastic MAB problem [6] is that on each round, after picking i_t, the player only sees the loss for this single action ℓ_{t,i_t} instead of the whole vector ℓ_t. The goal is still to compete with the best action. A common technique used in the bandit setting is to build an unbiased estimator ℓ̂_t for the losses, which in this case could be ℓ̂_{t,i} = 1{i = i_t} · ℓ_{t,i_t}/p_{t,i_t}. Then algorithms such as EXP can be used by replacing ℓ_t with ℓ̂_t, leading to the EXP3 algorithm [6] with regret O(√(TN ln N)). One might expect that Algorithm 3 would also work well by replacing ℓ_t with ℓ̂_t. However, doing so breaks an important property of the movements z_{t,i}: boundedness. Indeed, Eq. (3) no longer makes sense if z could be infinitely large, even if in expectation it is still in [−1, 1] (note that z_{t,i} is now a random variable). It turns out that we can address this issue by imposing a variance constraint on z_{t,i}. Formally, we consider a variant of drifting games where on each round, the adversary picks a random movement z_{t,i} for each chip such that: z_{t,i} ≥ −1, E_t[z_{t,i}] ≤ 1, E_t[z_{t,i}²] ≤ 1/p_{t,i} and E_t[p_t · z_t] ≥ 0. We call this variant DGv2 and summarize it in Appendix A. The standard minimax analysis and the derivation of potential functions need to be modified in a certain way for DGv2, as stated in Theorem 4 (Appendix D). Using the analysis for DGv2, we propose a general recipe for designing MAB algorithms in a similar way as for Hedge and also recover EXP3 (see Algorithm 4 and Theorem 5 in Appendix D). Unfortunately so far we do not know other appropriate potentials due to some technical difficulties. We conjecture, however, that there is a potential function that could recover the poly-INF algorithm [4, 5] or give its variants that achieve the optimal regret O(√(TN)).

Online Convex Optimization: We next consider a general online convex optimization setting [31]. Let S ⊆ R^d be a compact convex set, and F be a set of convex functions with range [0, 1] on S. On each round t, the learner chooses a point x_t ∈ S, and the adversary chooses a loss function f_t ∈ F (knowing x_t). The learner then suffers loss f_t(x_t). The regret after T rounds is R_T(x_{1:T}, f_{1:T}) = ∑_{t=1}^T f_t(x_t) − min_{x∈S} ∑_{t=1}^T f_t(x). There are two general approaches to OCO: one builds on convex optimization theory [30], and the other generalizes EXP to a continuous space [12, 24]. We will see how the drifting-games framework can recover the latter method and also lead to new ones. To do so, we introduce a continuous variant of drifting games (DGv3, see Appendix A). There are now infinitely many chips, one for each point in S. 
On round t, the player needs to choose a distribution over the chips, that is, a probability density function p_t(x) on S. Then the adversary decides the movement of each chip, that is, a function z_t(x) with range [-1, 1] on S (not necessarily convex or continuous), subject to the constraint E_{x~p_t}[z_t(x)] >= 0. At the end, each point x is associated with a loss L(x) = 1{sum_t z_t(x) <= -R}, and the player aims to minimize the total loss, the integral of L(x) over x ∈ S. OCO can be converted into DGv3 by setting z_t(x) = f_t(x) - f_t(x_t) and predicting x_t = E_{x~p_t}[x] ∈ S. The constraint E_{x~p_t}[z_t(x)] >= 0 then holds by the convexity of f_t (Jensen's inequality). Moreover, it turns out that the minimax analysis and potentials for DGv1 can readily be used here, and the notion of ε-regret, now generalized to the OCO setting, measures the difference between the player's loss and the loss of the best fixed point in a subset of S that excludes the top ε fraction of points. With different potentials, we obtain versions of each of the three algorithms of Section 3 generalized to this setting, with the same ε-regret bounds as before. Again, two of these methods are adaptive and parameter-free. To derive bounds for the usual regret, at first glance it seems that we have to set ε close to zero, leading to a meaningless bound. Nevertheless, this is addressed by Theorem 6 using techniques similar to those in [17], giving the usual O(sqrt(dT ln T)) regret bound. All details can be found in Appendix E.

Applications to Boosting: There is a deep and well-known connection between Hedge and boosting [14, 29]. In principle, every Hedge algorithm can be converted into a boosting algorithm; for instance, this is how AdaBoost was derived from EXP. In the same way, NormalHedge.DT can be converted into a new boosting algorithm that we call NH-Boost.DT. See Appendix F for details and further background on boosting.
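The key step in the OCO-to-DGv3 conversion above is that predicting x_t = E_{x~p_t}[x] makes the drift constraint hold by Jensen's inequality: E_{x~p_t}[z_t(x)] = E[f_t(x)] - f_t(E[x]) >= 0 for convex f_t. A quick numeric sanity check (our illustration, using a discretized uniform density on S = [-1, 1]):

```python
import numpy as np

def drift_constraint_gap(f, points, probs):
    """E_{x~p}[z(x)] for z(x) = f(x) - f(E[x]); nonnegative when f is convex."""
    points = np.asarray(points, dtype=float)
    probs = np.asarray(probs, dtype=float)
    x_t = probs @ points      # the learner's prediction x_t = E_{x~p}[x]
    z = f(points) - f(x_t)    # chip movements z_t(x) = f_t(x) - f_t(x_t)
    return probs @ z          # the DGv3 constraint requires this >= 0

# a convex loss on S = [-1, 1] with a uniform (discretized) density
gap = drift_constraint_gap(lambda x: x ** 2,
                           points=np.linspace(-1.0, 1.0, 101),
                           probs=np.ones(101) / 101)
```

For strictly convex f_t the gap is strictly positive; for an affine f_t, Jensen holds with equality and the gap vanishes.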
The main idea is to treat each training example as an "action", and to rely on the Hedge algorithm to compute distributions over these examples, which are used to train the weak hypotheses. Typically, it is assumed that each weak hypothesis has "edge" γ, meaning its accuracy on the training distribution is at least 1/2 + γ. The final hypothesis is a simple majority vote of the weak hypotheses. To understand the prediction accuracy of a boosting algorithm, we often study the training error rate and also the distribution of margins, a well-established measure of confidence (see Appendix F for formal definitions). Thanks to the adaptivity of NormalHedge.DT, we can derive bounds on both the training error and the distribution of margins after any number of rounds:
Theorem 3. After T rounds, the training error of NH-Boost.DT is of order Õ(exp(-(1/3)Tγ^2)), and the fraction of training examples with margin at most θ (<= 2γ) is of order Õ(exp(-(1/3)T(θ - 2γ)^2)).
Thus, the training error decreases at roughly the same rate as for AdaBoost. In addition, this theorem implies that the fraction of examples with margin smaller than 2γ eventually goes to zero as T gets large, which means NH-Boost.DT converges to the optimal margin 2γ; this is known not to be true for AdaBoost (see [29]). Also, like AdaBoost, NH-Boost.DT is an adaptive boosting algorithm that does not require γ or T as a parameter. However, unlike AdaBoost, NH-Boost.DT has the striking property that it completely ignores many examples on each round (by assigning them zero weight), which is very helpful for the weak learning algorithm in terms of computational efficiency. To test this, we conducted experiments comparing the efficiency of AdaBoost, "NH-Boost" (an analogous boosting algorithm derived from NormalHedge) and NH-Boost.DT. All details are in Appendix G. Here we only briefly summarize the results.
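The Hedge-to-boosting conversion described above can be written as a short generic loop; this is our schematic illustration, not the paper's NH-Boost.DT. Here the example weights come from the plain exponential-weights (EXP) update, which corresponds to AdaBoost-style behavior; NH-Boost.DT would swap in the NormalHedge.DT update, which sets many weights exactly to zero.

```python
import numpy as np

def boost(weak_learner, X, y, T, eta=0.5):
    """Generic Hedge-driven boosting loop (schematic).

    weak_learner(X, y, p) must return a hypothesis h with h(X) in {-1, +1}
    achieving some edge gamma on the distribution p over examples."""
    n = len(y)
    margins = np.zeros(n)              # running sums y_i * sum_t h_t(x_i)
    hyps = []
    for t in range(T):
        w = np.exp(-eta * margins)     # EXP/Hedge weights over examples
        p = w / w.sum()                # distribution fed to the weak learner
        h = weak_learner(X, y, p)
        hyps.append(h)
        margins += y * h(X)            # well-classified examples gain margin
    # final hypothesis: simple (unweighted) majority vote of the weak hypotheses
    return lambda Xq: np.sign(sum(h(Xq) for h in hyps))
```

Examples with large margin get exponentially small weight under this update; NH-Boost.DT pushes this further by giving such examples exactly zero weight, which is what makes it faster in practice.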
While the three algorithms have similar performance in terms of training and test error, NH-Boost.DT is always the fastest in terms of running time for the same number of rounds. Moreover, the average fraction of examples with zero weight is significantly higher for NH-Boost.DT than for NH-Boost (see Table 3). On one hand, this explains why NH-Boost.DT is faster (besides the fact that it does not require a numerical step). On the other hand, it also implies that NH-Boost.DT tends to achieve larger margins, since zero weight is assigned to examples with large margin. This is also confirmed by our experiments.

Acknowledgements. Support for this research was provided by NSF Grant #1016029. The authors thank Yoav Freund for helpful discussions and the anonymous reviewers for their comments.

References
[1] Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the 21st Annual Conference on Learning Theory, 2008.
[2] Jacob Abernethy and Manfred K. Warmuth. Minimax games with bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[3] Jacob Abernethy and Manfred K. Warmuth. Repeated games against budgeted adversaries. In Advances in Neural Information Processing Systems 23, 2010.
[4] Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. The Journal of Machine Learning Research, 11:2785–2836, 2010.
[5] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
[6] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[7] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997.
[8] Nicolò Cesa-Bianchi and Gábor Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239–261, 2003.
[9] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[10] Kamalika Chaudhuri, Yoav Freund, and Daniel Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems 22, 2009.
[11] Alexey Chernov and Vladimir Vovk. Prediction with advice of unknown number of experts. arXiv preprint arXiv:1006.0475, 2010.
[12] Thomas M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, January 1991.
[13] Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.
[14] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997.
[15] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.
[16] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2):337–407, April 2000.
[17] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
[18] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
[19] Robert Kleinberg. Anytime algorithms for multi-armed bandit problems. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 928–936. ACM, 2006.
[20] Robert David Kleinberg. Online decision problems with large strategy sets. PhD thesis, MIT, 2005.
[21] Haipeng Luo and Robert E. Schapire. Towards minimax online learning with unknown time horizon. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[22] H. Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of the 27th Annual Conference on Learning Theory, 2014.
[23] Indraneel Mukherjee and Robert E. Schapire. Learning with continuous experts using drifting games. Theoretical Computer Science, 411(29):2670–2683, 2010.
[24] Hariharan Narayanan and Alexander Rakhlin. Random walk approach to regret minimization. In Advances in Neural Information Processing Systems 23, 2010.
[25] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems 28, 2014.
[26] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and localize: From value to algorithms. In Advances in Neural Information Processing Systems 25, 2012. Full version available as arXiv:1204.0870.
[27] Lev Reyzin and Robert E. Schapire. How boosting the margin can also boost classifier complexity. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[28] Robert E. Schapire. Drifting games. Machine Learning, 43(3):265–291, June 2001.
[29] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
[30] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
[31] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.