{"title": "A Parameter-free Hedging Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 297, "page_last": 305, "abstract": "We study the problem of decision-theoretic online learning (DTOL). Motivated by practical applications, we focus on DTOL when the number of actions is very large. Previous algorithms for learning in this framework have a tunable learning rate parameter, and a major barrier to using online-learning in practical applications is that it is not understood how to set this parameter optimally, particularly when the number of actions is large. In this paper, we offer a clean solution by proposing a novel and completely parameter-free algorithm for DTOL. In addition, we introduce a new notion of regret, which is more natural for applications with a large number of actions. We show that our algorithm achieves good performance with respect to this new notion of regret; in addition, it also achieves performance close to that of the best bounds achieved by previous algorithms with optimally-tuned parameters, according to previous notions of regret.", "full_text": "A Parameter-free Hedging Algorithm\n\nKamalika Chaudhuri\n\nITA, UC San Diego\n\nYoav Freund\n\nCSE, UC San Diego\n\nDaniel Hsu\n\nCSE, UC San Diego\n\nkamalika@soe.ucsd.edu\n\nyfreund@ucsd.edu\n\ndjhsu@cs.ucsd.edu\n\nAbstract\n\nWe study the problem of decision-theoretic online learning (DTOL). Motivated\nby practical applications, we focus on DTOL when the number of actions is very\nlarge. Previous algorithms for learning in this framework have a tunable learning\nrate parameter, and a barrier to using online-learning in practical applications is\nthat it is not understood how to set this parameter optimally, particularly when the\nnumber of actions is large.\nIn this paper, we offer a clean solution by proposing a novel and completely\nparameter-free algorithm for DTOL. 
We introduce a new notion of regret, which\nis more natural for applications with a large number of actions. We show that our\nalgorithm achieves good performance with respect to this new notion of regret; in\naddition, it also achieves performance close to that of the best bounds achieved\nby previous algorithms with optimally-tuned parameters, according to previous\nnotions of regret.\n\n1\n\nIntroduction\n\nIn this paper, we consider the problem of decision-theoretic online learning (DTOL), proposed by\nFreund and Schapire [1]. DTOL is a variant of the problem of prediction with expert advice [2, 3].\nIn this problem, a learner must assign probabilities to a \ufb01xed set of actions in a sequence of rounds.\nAfter each assignment, each action incurs a loss (a value in [0, 1]); the learner incurs a loss equal\nto the expected loss of actions for that round, where the expectation is computed according to the\nlearner\u2019s current probability assignment. The regret (of the learner) to an action is the difference\nbetween the learner\u2019s cumulative loss and the cumulative loss of that action. The goal of the learner\nis to achieve, on any sequence of losses, low regret to the action with the lowest cumulative loss (the\nbest action).\n\nDTOL is a general framework that captures many learning problems of interest. For example, con-\nsider tracking the hidden state of an object in a continuous state space from noisy observations [4].\nTo look at tracking in a DTOL framework, we set each action to be a path (sequence of states) over\nthe state space. The loss of an action at time t is the distance between the observation at time t and\nthe state of the action at time t, and the goal of the learner is to predict a path which has loss close\nto that of the action with the lowest cumulative loss.\n\nThe most popular solution to the DTOL problem is the Hedge algorithm [1, 5]. 
In Hedge, each action\nis assigned a probability, which depends on the cumulative loss of this action and a parameter \u03b7, also\ncalled the learning rate. By appropriately setting the learning rate as a function of the iteration [6, 7]\nand the number of actions, Hedge can achieve a regret upper-bounded by O(\u221a(T ln N)), for each\niteration T , where N is the number of actions. This bound on the regret is optimal as there is a\n\u2126(\u221a(T ln N)) lower-bound [5].\nIn this paper, motivated by practical applications such as tracking, we consider DTOL in the regime\nwhere the number of actions N is very large. A major barrier to using online-learning for practical\nproblems is that when N is large, it is not understood how to set the learning rate \u03b7. [7, 6] suggest\n\n1\n\n\f[Plot: total loss of each action (y-axis) against the actions (x-axis), with an interval of measure \u01eb marked.]\n\nFigure 1: A new notion of regret. Suppose each action is a point on a line, and the total losses are\nas given in the plot. The regret to the top \u01eb-quantile is the difference between the learner\u2019s total loss\nand the total loss of the worst action in the indicated interval of measure \u01eb.\n\nsetting \u03b7 as a fixed function of the number of actions N. However, this can lead to poor performance,\nas we illustrate by an example in Section 3, and the degradation in performance is particularly\nexacerbated as N grows larger. One way to address this is by simultaneously running multiple\ncopies of Hedge with multiple values of the learning rate, and choosing the output of the copy\nthat performs the best in an online way. However, this solution is impractical for real applications,\nparticularly as N is already very large. (For more details about these solutions, please see Section 4.)\nIn this paper, we take a step towards making online learning more practical by proposing a novel,\ncompletely adaptive algorithm for DTOL. Our algorithm is called NormalHedge. 
NormalHedge\nis very simple and easy to implement, and in each round, it simply involves a single line search,\nfollowed by an updating of weights for all actions.\n\nA second issue with using online-learning in problems such as tracking, where N is very large, is\nthat the regret to the best action is not an effective measure of performance. For problems such as\ntracking, one expects to have a lot of actions that are close to the action with the lowest loss. As\nthese actions also have low loss, measuring performance with respect to a small group of actions\nthat perform well is extremely reasonable \u2013 see, for example, Figure 1.\n\nIn this paper, we address this issue by introducing a new notion of regret, which is more natural\nfor practical applications. We order the cumulative losses of all actions from lowest to highest and\ndefine the regret of the learner to the top \u01eb-quantile to be the difference between the cumulative loss\nof the learner and the \u230a\u01ebN\u230b-th element in the sorted list.\nWe prove that for NormalHedge, the regret to the top \u01eb-quantile of actions is at most\n\nO(\u221a(T ln(1/\u01eb) + ln\u00b2 N)),\n\nwhich holds simultaneously for all T and \u01eb. If we set \u01eb = 1/N, we get that the regret to the best\naction is upper-bounded by O(\u221a(T ln N) + ln\u00b2 N), which is only slightly worse than the bound\nachieved by Hedge with optimally-tuned parameters. Notice that in our regret bound, the term\ninvolving T has no dependence on N. In contrast, Hedge cannot achieve a regret-bound of this\nnature uniformly for all \u01eb. 
(For details on how Hedge can be modified to perform with our new\nnotion of regret, see Section 4.)\n\nNormalHedge works by assigning each action i a potential; actions which have lower cumulative\nloss than the algorithm are assigned a potential exp((Ri,t)\u00b2/2ct), where Ri,t is the regret of action\ni and ct is an adaptive scale parameter, which is adjusted from one round to the next, depending\non the loss-sequences. Actions which have higher cumulative loss than the algorithm are assigned\npotential 1. The weight assigned to an action in each round is then proportional to the derivative of its\npotential. One can also interpret Hedge as a potential-based algorithm, and under this interpretation,\nthe potential assigned by Hedge to action i is proportional to exp(\u03b7Ri,t). This potential used by\nHedge differs significantly from the one we use. Although other potential-based methods have been\nconsidered in the context of online learning [8], our potential function is very novel, and to the best\nof our knowledge, has not been studied in prior work. Our proof techniques are also different from\nprevious potential-based methods.\n\n2\n\n\fInitially: Set Ri,0 = 0, pi,1 = 1/N for each i.\nFor t = 1, 2, . . .\n\n1. Each action i incurs loss \u2113i,t.\n2. Learner incurs loss \u2113A,t = \u03a3_{i=1}^{N} pi,t \u2113i,t.\n3. Update cumulative regrets: Ri,t = Ri,t\u22121 + (\u2113A,t \u2212 \u2113i,t) for each i.\n4. Find ct > 0 satisfying (1/N) \u03a3_{i=1}^{N} exp(([Ri,t]+)\u00b2/2ct) = e.\n5. Update distribution for round t + 1: pi,t+1 \u221d ([Ri,t]+/ct) exp(([Ri,t]+)\u00b2/2ct) for each i.\n\nFigure 2: The Normal-Hedge algorithm.\n\nAnother useful property of NormalHedge, which Hedge does not possess, is that it assigns zero\nweight to any action whose cumulative loss is larger than the cumulative loss of the algorithm\nitself. In other words, non-zero weights are assigned only to actions which perform better than the\nalgorithm. 
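One round of the Normal-Hedge update in Figure 2 can be sketched in code. This is a minimal illustration rather than the authors' implementation: the bracketing used for the line search in step 4, the fixed bisection depth, and the uniform fallback when no action has positive regret are our own choices.

```python
import math

def normal_hedge_round(R):
    """Given cumulative regrets R[i] after round t, return the Normal-Hedge
    distribution for round t+1 (Figure 2, steps 4-5):
      - find c_t > 0 with (1/N) * sum_i exp(([R_i]_+)^2 / (2 c_t)) = e,
      - set p_i proportional to ([R_i]_+ / c_t) * exp(([R_i]_+)^2 / (2 c_t))."""
    N = len(R)
    Rp = [max(r, 0.0) for r in R]
    rmax = max(Rp)
    if rmax == 0.0:
        # No action is ahead of the algorithm; fall back to uniform (our choice).
        return [1.0 / N] * N

    def avg_potential(c):
        return sum(math.exp(r * r / (2.0 * c)) for r in Rp) / N

    # At c = rmax^2 / (2 ln(N e)) the largest term alone equals N*e, so the
    # average is >= e there; the average decreases to 1 as c grows, and a
    # bisection (the "line search" of step 4) brackets the unique solution.
    lo = rmax * rmax / (2.0 * math.log(N * math.e))
    hi = 2.0 * lo
    while avg_potential(hi) > math.e:
        hi *= 2.0
    for _ in range(100):               # bisect to high precision
        mid = 0.5 * (lo + hi)
        if avg_potential(mid) > math.e:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)

    w = [(r / c) * math.exp(r * r / (2.0 * c)) for r in Rp]
    Z = sum(w)
    return [x / Z for x in w]
```

Starting the search at rmax\u00b2/(2 ln(Ne)) keeps every exponent evaluated during the search at most ln(Ne), so no overflow occurs even for large regrets.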
In most applications, we expect a small set of the actions to perform significantly better\nthan most of the actions. The regret of the algorithm is guaranteed to be small, which means that the\nalgorithm will perform better than most of the actions and thus assign them zero probability.\n\n[9, 10] have proposed more recent solutions to DTOL in which the regret of Hedge to the best action\nis upper bounded by a function of L, the loss of the best action, or by a function of the variations in\nthe losses. These bounds can be sharper than the bounds with respect to T . Our analysis (and in fact,\nto our knowledge, any analysis based on potential functions in the style of [11, 8]) does not directly\nyield these kinds of bounds. We therefore leave open the question of finding an adaptive algorithm\nfor DTOL which has regret upper-bounded by a function that depends on the loss of the best action.\n\nThe rest of the paper is organized as follows. In Section 2, we present NormalHedge. In Section\n3, we provide an example that illustrates the suboptimality of standard online learning algorithms\nwhen the parameter is not set properly. In Section 4, we discuss related work. In Section 5, we\npresent an outline of the proof. The proof details are in the Supplementary Materials.\n\n2 Algorithm\n\n2.1 Setting\n\nWe consider the decision-theoretic framework for online learning. In this setting, the learner is given\naccess to a set of N actions, where N \u2265 2. In round t, the learner chooses a weight distribution\npt = (p1,t, . . . , pN,t) over the actions 1, 2, . . . , N. 
Each action i incurs a loss \u2113i,t, and the learner\nincurs the expected loss under this distribution:\n\n\u2113A,t = \u03a3_{i=1}^{N} pi,t \u2113i,t.\n\nThe learner\u2019s instantaneous regret to an action i in round t is ri,t = \u2113A,t \u2212 \u2113i,t, and its (cumulative)\nregret to an action i in the first t rounds is\n\nRi,t = \u03a3_{\u03c4=1}^{t} ri,\u03c4 .\n\nWe assume that the losses \u2113i,t lie in an interval of length 1 (e.g. [0, 1] or [\u22121/2, 1/2]; the sign of the\nloss does not matter). The goal of the learner is to minimize this cumulative regret Ri,t to any action\ni (in particular, the best action), for any value of t.\n\n3\n\n\f2.2 Normal-Hedge\n\nOur algorithm, Normal-Hedge, is based on a potential function reminiscent of the half-normal\ndistribution, specifically\n\n\u03c6(x, c) = exp(([x]+)\u00b2/2c) for x \u2208 R, c > 0,   (1)\n\nwhere [x]+ denotes max{0, x}. It is easy to check that this function is separately convex in x and c,\ndifferentiable, and twice-differentiable except at x = 0.\nIn addition to tracking the cumulative regrets Ri,t to each action i after each round t, the algorithm\nalso maintains a scale parameter ct. 
This is chosen so that the average of the potential, over all\nactions i, evaluated at Ri,t and ct, remains constant at e:\n\n(1/N) \u03a3_{i=1}^{N} exp(([Ri,t]+)\u00b2/2ct) = e.   (2)\n\nWe observe that since \u03c6(x, c) is convex in c > 0, we can determine ct with a line search.\nThe weight assigned to i in round t is set proportional to the first-derivative of the potential, evaluated\nat Ri,t\u22121 and ct\u22121:\n\npi,t \u221d (\u2202\u03c6(x, c)/\u2202x)|_{x=Ri,t\u22121, c=ct\u22121} = ([Ri,t\u22121]+/ct\u22121) exp(([Ri,t\u22121]+)\u00b2/2ct\u22121).\n\nNotice that the actions for which Ri,t\u22121 \u2264 0 receive zero weight in round t.\nWe summarize the learning algorithm in Figure 2.\n\n3 An Illustrative Example\n\nIn this section, we present an example to illustrate that setting the parameters of DTOL algorithms\nas a function of N, the total number of actions, is suboptimal. To do this, we compare the performance\nof NormalHedge with two representative algorithms: a version of Hedge due to [7], and the\nPolynomial Weights algorithm, due to [12, 11]. Our experiments with this example indicate that the\nperformance of both these algorithms suffers because of the suboptimal setting of the parameters; on\nthe other hand, NormalHedge automatically adapts to the loss-sequences of the actions.\n\nThe main feature of our example is that the effective number of actions n (i.e. the number of distinct\nactions) is smaller than the total number of actions N. Notice that without prior knowledge of the\nactions and their loss-sequences, one cannot determine the effective number of actions in advance; as a\nresult, there is no direct method by which Hedge and Polynomial Weights could set their parameters\nas a function of n.\nOur example attempts to model a practical scenario where one often finds multiple actions with\nloss-sequences which are almost identical. 
For example, in the tracking problem, groups of paths\nwhich are very close together in the state space will have very close loss-sequences. Our example\nindicates that in this case, the performance of Hedge and Polynomial Weights will depend on\nthe discretization of the state space; NormalHedge, however, will be comparatively unaffected by such\ndiscretization.\n\nOur example has four parameters: N, the total number of actions; n, the effective number of actions\n(the number of distinct actions); k, the (effective) number of good actions; and \u01eb, which indicates\nhow much better the good actions are compared to the rest. Finally, T is the number of rounds.\nThe instantaneous losses of the N actions are represented by an N \u00d7 T matrix B^{\u03b5,k}_N; the loss of\naction i in round t is the (i, t)-th entry in the matrix. The construction of the matrix is as follows.\nFirst, we construct a (preliminary) n \u00d7 T matrix An based on the 2^d \u00d7 2^d Hadamard matrix, where\nn = 2^{d+1} \u2212 2. This matrix An is obtained from the 2^d \u00d7 2^d Hadamard matrix by (1) deleting\nthe constant row, (2) stacking the remaining rows on top of their negations, (3) repeating each row\n\n4\n\n\fhorizontally T/2^d times, and finally, (4) halving the first column. We show A6 for concreteness:\n\nA6 =\n[ \u22121/2  +1  \u22121  +1  \u22121  +1  \u22121  +1  \u22121  +1  \u22121  +1  . . .\n  \u22121/2  \u22121  +1  +1  \u22121  \u22121  +1  +1  \u22121  \u22121  +1  +1  . . .\n  \u22121/2  +1  +1  \u22121  \u22121  +1  +1  \u22121  \u22121  +1  +1  \u22121  . . .\n  +1/2  \u22121  +1  \u22121  +1  \u22121  +1  \u22121  +1  \u22121  +1  \u22121  . . .\n  +1/2  +1  \u22121  \u22121  +1  +1  \u22121  \u22121  +1  +1  \u22121  \u22121  . . .\n  +1/2  \u22121  \u22121  +1  +1  \u22121  \u22121  +1  +1  \u22121  \u22121  +1  . . . ]\n\nIf the rows of An give the losses for n actions over time, then it is clear that on average, no action\nis better than any other. Therefore for large enough T , for these losses, a typical algorithm will\neventually assign all actions the same weight. Now, let A^{\u03b5,k}_n be the same as An except that \u03b5 is\nsubtracted from each entry of the first k rows, e.g.\n\nA^{\u03b5,2}_6 =\n[ \u22121/2\u2212\u03b5  +1\u2212\u03b5  \u22121\u2212\u03b5  +1\u2212\u03b5  \u22121\u2212\u03b5  +1\u2212\u03b5  \u22121\u2212\u03b5  +1\u2212\u03b5  . . .\n  \u22121/2\u2212\u03b5  \u22121\u2212\u03b5  +1\u2212\u03b5  +1\u2212\u03b5  \u22121\u2212\u03b5  \u22121\u2212\u03b5  +1\u2212\u03b5  +1\u2212\u03b5  . . .\n  \u22121/2  +1  +1  \u22121  \u22121  +1  +1  \u22121  . . .\n  +1/2  \u22121  +1  \u22121  +1  \u22121  +1  \u22121  . . .\n  +1/2  +1  \u22121  \u22121  +1  +1  \u22121  \u22121  . . .\n  +1/2  \u22121  \u22121  +1  +1  \u22121  \u22121  +1  . . . ]\n\nNow, when losses are given by A^{\u03b5,k}_n, the first k actions (the good actions) perform better than the\nremaining n \u2212 k; so, for large enough T , a typical algorithm will eventually recognize this and\nassign the first k actions equal weights (giving little or no weight to the remaining n \u2212 k). Finally,\nwe artificially replicate each action (each row) N/n times to yield the final loss matrix B^{\u03b5,k}_N for N\nactions:\n\nB^{\u03b5,k}_N = [ A^{\u03b5,k}_n ; A^{\u03b5,k}_n ; . . . ; A^{\u03b5,k}_n ]   (N/n replicates of A^{\u03b5,k}_n, stacked vertically).\n\nThe replication of actions significantly affects the behavior of algorithms that set parameters with\nrespect to the number of actions N, which is inflated compared to the effective number of actions n.\nNormalHedge, having no such parameters, is completely unaffected by the replication of actions.\n\nWe compare the performance of NormalHedge to two other representative algorithms, which we\ncall \u201cExp\u201d and \u201cPoly\u201d. Exp is a time/variation-adaptive version of Hedge (exponential weights)\ndue to [7] (roughly, \u03b7t = O(\u221a((log N)/Vart)), where Vart is the cumulative loss variance). Poly\nis polynomial weights [12, 11], which has a parameter p that is typically set as a function of the\nnumber of actions; we set p = 2 ln N as is recommended to guarantee a regret bound comparable to\nthat of Hedge.\n\nFigure 3 shows the regrets to the best action versus the replication factor N/n, where the effective\nnumber of actions n is held fixed. Recall that Exp and Poly have parameters set with respect to the\nnumber of actions N.\nWe see from the figures that NormalHedge is completely unaffected by the replication of actions;\nno matter how many times the actions may be replicated, the performance of NormalHedge stays\nexactly the same. In contrast, increasing the replication factor affects the performance of Exp and\nPoly: Exp and Poly become more sensitive to the changes in the total losses of the actions (e.g. the\nbase of the exponent in the weights assigned by Exp increases with N); so when there are multiple\ngood actions (i.e. 
k > 1), Exp and Poly are slower to stabilize their weights over these good actions.\nWhen k = 1, Exp and Poly actually perform better using the inflated value N (as opposed to n), as\nthis causes the slight advantage of the single best action to be magnified. However, this particular\ncase is an anomaly; this does not happen even for k = 2. We note that if the parameters of Exp\nand Poly were set to be a function of n, instead of N, then their performance would also\nnot depend on the replication factor (the performance would be the same as the N/n = 1 case).\nTherefore, the degradation in performance of Exp and Poly is solely due to the suboptimality in\nsetting their parameters.\n\n5\n\n\f[Four panels, one per k \u2208 {1, 2, 8, 32}: regret to the best action after T = 32768 (y-axis) versus replication factor (x-axis, log scale from 10^0 to 10^3), with one curve each for Exp, Poly, and Normal.]\n\nFigure 3: Regrets to the best action after T = 32768 rounds, versus replication factor N/n. Recall,\nk is the (effective) number of good actions. 
Here, we fix n = 126 and \u01eb = 0.025.\n\n4 Related work\n\nThere has been a large amount of literature on various aspects of DTOL. The Hedge algorithm of\n[1] belongs to a more general family of algorithms, called the exponential weights algorithms; these\nare originally based on Littlestone and Warmuth\u2019s Weighted Majority algorithm [2], and they have\nbeen well-studied.\n\nThe standard measure of regret in most of these works is the regret to the best action. The original\nHedge algorithm has a regret bound of O(\u221a(T log N)). Hedge uses a fixed learning rate \u03b7 for all\niterations, and requires one to set \u03b7 as a function of the total number of iterations T . As a result,\nits regret bound also holds only for a fixed T . The algorithm of [13] guarantees a regret bound\nof O(\u221a(T log N)) to the best action uniformly for all T by using a doubling trick. Time-varying\nlearning rates for exponential weights algorithms were considered in [6]; there, they show that if\n\u03b7t = \u221a(8 ln(N)/t), then using exponential weights with \u03b7 = \u03b7t in round t guarantees regret bounds\nof \u221a(2T ln N) + O(ln N) for any T . This bound provides a better regret to the best action than we\ndo. However, this method is still susceptible to poor performance, as illustrated in the example in\nSection 3. Moreover, they do not consider our notion of regret.\n\nThough not explicitly considered in previous works, the exponential weights algorithms can be\npartly analyzed with respect to the regret to the top \u01eb-quantile. For any fixed \u01eb, Hedge can be\nmodified by setting \u03b7 as a function of this \u01eb such that the regret to the top \u01eb-quantile is at most\nO(\u221a(T log(1/\u01eb))). The problem with this solution is that it requires the learning rate to be\nset as a function of that particular \u01eb (roughly \u03b7 = \u221a((log 1/\u01eb)/T)). Therefore, unlike our bound,\nthis bound does not hold uniformly for all \u01eb. 
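The quantile regret under discussion is simple to compute from a vector of cumulative losses: order the losses from lowest to highest and compare the learner against the floor(\u01ebN)-th element (Section 1). In this small sketch, clipping floor(\u01ebN) to at least 1, so that \u01eb < 1/N still compares against the best action, is our own convention.

```python
import math

def quantile_regret(learner_loss, action_losses, eps):
    """Regret to the top eps-quantile: the learner's cumulative loss minus the
    floor(eps * N)-th smallest cumulative action loss (clipped to the best action)."""
    ranked = sorted(action_losses)                    # lowest cumulative loss first
    k = max(1, math.floor(eps * len(action_losses)))  # floor(eps * N), at least 1
    return learner_loss - ranked[k - 1]
```

With eps = 1/N this reduces to the usual regret to the best action.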
One way to ensure a bound for all \u01eb uniformly is to\nrun log N copies of Hedge, each with a learning rate set as a function of a different value of \u01eb. A\nfinal master copy of the Hedge algorithm then looks at the probabilities given by these subordinate\ncopies to give the final probabilities. However, this procedure adds an additive O(\u221a(T log log N))\nfactor to the regret to the \u01eb-quantile of actions, for any \u01eb. More importantly, this procedure is also\nimpractical for real applications, where one might already be working with a large set of actions.\nIn contrast, our solution NormalHedge is clean and simple, and we guarantee a regret bound for all\nvalues of \u01eb uniformly, without any extra overhead.\n\n6\n\n\fMore recent work in [14, 7, 10] provides algorithms with significantly improved bounds when the\ntotal loss of the best action is small, or when the total variation in the losses is small. These bounds\ndo not explicitly depend on T , and thus can often be sharper than ones that do (including ours). We\nstress, however, that these methods use a different notion of regret, and their learning rates depend\nexplicitly on N.\nBesides exponential weights, another important class of online learning algorithms are the polynomial\nweights algorithms studied in [12, 11, 8]. These algorithms too require a parameter; this\nparameter does not depend on the number of rounds T , but depends crucially on the number of\nactions N. The weight assigned to action i in round t is proportional to ([Ri,t\u22121]+)^{p\u22121} for some p > 1;\nsetting p = 2 ln N yields regret bounds of the form \u221a(2eT(ln N \u2212 0.5)) for any T . Our algorithm\nand polynomial weights share the feature that zero weight is given to actions that are performing\nworse than the algorithm, although the degree of this weight sparsity is tied to the performance of\nthe algorithm. 
Finally, [15] derive a time-adaptive variation of the follow-the-(perturbed) leader\nalgorithm [16, 17] by scaling the perturbations by a parameter that depends on both t and N.\n\n5 Analysis\n\n5.1 Main results\n\nOur main result is the following theorem.\nTheorem 1. If Normal-Hedge has access to N actions, then for all loss sequences, for all t, for all\n0 < \u01eb \u2264 1 and for all 0 < \u03b4 \u2264 1/2, the regret of the algorithm to the top \u01eb-quantile of the actions is\nat most\n\n\u221a( (1 + ln(1/\u01eb)) ( 3(1 + 50\u03b4)t + (16 ln\u00b2 N/\u03b4)(10.2/\u03b4\u00b2 + ln N) ) ).\n\nIn particular, with \u01eb = 1/N, the regret to the best action is at most\n\n\u221a( (1 + ln N) ( 3(1 + 50\u03b4)t + (16 ln\u00b2 N/\u03b4)(10.2/\u03b4\u00b2 + ln N) ) ).\n\nThe value \u03b4 in Theorem 1 appears to be an artifact of our analysis; we divide the sequence of rounds\ninto two phases \u2013 the length of the first is controlled by the value of \u03b4 \u2013 and bound the behavior of\nthe algorithm in each phase separately. The following corollary illustrates the performance of our\nalgorithm for large values of t, in which case the effect of this first phase (and the \u03b4 in the bound)\nessentially goes away.\nCorollary 2. If Normal-Hedge has access to N actions, then, as t \u2192 \u221e, the regret of Normal-Hedge\nto the top \u01eb-quantile of actions approaches an upper bound of\n\n\u221a(3t(1 + ln(1/\u01eb))) + o(t).\n\nIn particular, the regret of Normal-Hedge to the best action approaches an upper bound of\n\n\u221a(3t(1 + ln N)) + o(t).\n\nThe proof of Theorem 1 follows from a combination of Lemmas 3, 4, and 5, and is presented in\ndetail at the end of the current section.\n\n5.2 Regret bounds from the potential equation\n\nThe following lemma relates the performance of the algorithm at time t to the scale ct.\nLemma 3. 
At any time t, the regret to the best action can be bounded as\n\nmax_i Ri,t \u2264 \u221a(2ct(ln N + 1)).\n\nMoreover, for any 0 \u2264 \u01eb \u2264 1 and any t, the regret to the top \u01eb-quantile of actions is at most\n\n\u221a(2ct(ln(1/\u01eb) + 1)).\n\n7\n\n\fProof. We use Et to denote the actions that have non-zero weight on iteration t. The first part of the\nlemma follows from the fact that, for any action i \u2208 Et,\n\nexp((Ri,t)\u00b2/2ct) = exp(([Ri,t]+)\u00b2/2ct) \u2264 \u03a3_{i\u2032=1}^{N} exp(([Ri\u2032,t]+)\u00b2/2ct) \u2264 N e,\n\nwhich implies Ri,t \u2264 \u221a(2ct(ln N + 1)).\n\nFor the second part of the lemma, let Ri,t denote the regret of our algorithm to the action with the\n\u01ebN-th highest regret. Then, the total potential of the actions with regrets greater than or equal to\nRi,t is at least \u01ebN exp(([Ri,t]+)\u00b2/2ct). Since the total potential of all actions is N e by (2), we get\n\n\u01ebN exp(([Ri,t]+)\u00b2/2ct) \u2264 N e,\n\nfrom which the second part of the lemma follows.\n\n5.3 Bounds on the scale ct and the proof of Theorem 1\n\nIn Lemmas 4 and 5, we bound the growth of the scale ct as a function of the time t.\nThe main outline of the proof of Theorem 1 is as follows. As ct increases monotonically with t, we\ncan divide the rounds t into two phases, t < t0 and t \u2265 t0, where t0 is the first time such that\n\nct0 \u2265 4 ln\u00b2 N/\u03b4 + 16 ln N/\u03b4\u00b3,\n\nfor some fixed \u03b4 \u2208 (0, 1/2). We then show bounds on the growth of ct for each phase separately.\nLemma 4 shows that ct is not too large at the end of the first phase, while Lemma 5 bounds the\nper-round growth of ct in the second phase. The proofs of these two lemmas are quite involved, so\nwe defer them to the supplementary appendix.\nLemma 4. For any time t,\n\nct+1 \u2264 2ct(1 + ln N) + 3.\n\nLemma 5. 
Suppose that at some time t0, ct0 \u2265 4 ln\u00b2 N/\u03b4 + 16 ln N/\u03b4\u00b3, where 0 \u2264 \u03b4 \u2264 1/2 is a constant.\nThen, for any time t \u2265 t0,\n\nct+1 \u2212 ct \u2264 (3/2)(1 + 49.19\u03b4).\n\nWe now combine Lemmas 4 and 5 together with Lemma 3 to prove the main theorem.\n\nProof of Theorem 1. Let t0 be the first time at which ct0 \u2265 4 ln\u00b2 N/\u03b4 + 16 ln N/\u03b4\u00b3. Then, from Lemma 4,\n\nct0 \u2264 2ct0\u22121(1 + ln N) + 3,\n\nwhich is at most\n\n8 ln\u00b3 N/\u03b4 + 34 ln\u00b2 N/\u03b4\u00b3 + 32 ln N/\u03b4\u00b3 + 3 \u2264 8 ln\u00b3 N/\u03b4 + 81 ln\u00b2 N/\u03b4\u00b3.\n\nThe last inequality follows because N \u2265 2 and \u03b4 \u2264 1/2. By Lemma 5, we have that for any t \u2265 t0,\n\nct \u2264 (3/2)(1 + 49.19\u03b4)(t \u2212 t0) + ct0.\n\nCombining these last two inequalities yields\n\nct \u2264 (3/2)(1 + 49.19\u03b4)t + 8 ln\u00b3 N/\u03b4 + 81 ln\u00b2 N/\u03b4\u00b3.\n\nNow the theorem follows by applying Lemma 3.\n\n8\n\n\fReferences\n[1] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application\nto boosting. Journal of Computer and System Sciences, 55:119\u2013139, 1997.\n\n[2] N. Littlestone and M. Warmuth. The weighted majority algorithm. Information and Computation,\n108:212\u2013261, 1994.\n\n[3] V. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153\u2013173, 1998.\n\n[4] K. Chaudhuri, Y. Freund, and D. Hsu. Tracking using explanation-based modeling, 2009. arXiv:0903.2862.\n\n[5] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic\nBehavior, 29:79\u2013103, 1999.\n\n[6] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. Journal\nof Computer and System Sciences, 64(1), 2002.\n\n[7] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. 
Improved second-order bounds for prediction with expert\n\nadvice. Machine Learning, 66(2\u20133):321\u2013352, 2007.\n\n[8] N. Cesa-Bianchi and G. Lugosi. Potential-based algorithms in on-line prediction and game theory. Ma-\n\nchine Learning, 51:239\u2013261, 2003.\n\n[9] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University Press, 2006.\n[10] E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In\n\nCOLT, 2008.\n\n[11] C. Gentile. The robustness of p-norm algorithms. Machine Learning, 53(3):265\u2013299, 2003.\n[12] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant\n\nupdates. Machine Learning, 43(3):173\u2013210, 2001.\n\n[13] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Hembold, R. E. Schapire, and M. Warmuth. How to use\n\nexpert advice. Journal of the ACM, 44(3):427\u2013485, 1997.\n\n[14] R. Yaroshinsky, R. El-Yaniv, , and S. Seiden. How to better use expert advice. Machine Learning,\n\n55(3):271\u2013309, 2004.\n\n[15] M. Hutter and J. Poland. Adaptive online prediction by following the perturbed leader. Journal of Machine\n\nLearning Research, 6:639\u2013660, 2005.\n\n[16] J. Hannan. Approximation to bayes risk in repeated play. Contributions to the Theory of Games, 3:97\u2013\n\n139, 1957.\n\n[17] A. Kalai and S. Vempala. Ef\ufb01cient algorithms for the online optimization. Journal of Computer and\n\nSystem Sciences, 71(3):291\u2013307, 2005.\n\n9\n\n\f", "award": [], "sourceid": 1110, "authors": [{"given_name": "Kamalika", "family_name": "Chaudhuri", "institution": null}, {"given_name": "Yoav", "family_name": "Freund", "institution": null}, {"given_name": "Daniel", "family_name": "Hsu", "institution": null}]}