{"title": "No-Regret Algorithms for Unconstrained Online Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2402, "page_last": 2410, "abstract": "Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x* are known in advance. We present an algorithm that, without such prior knowledge, offers near-optimal regret bounds with respect to _any_ choice of x*. In particular, regret with respect to x* = 0 is _constant_. We then prove lower bounds showing that our algorithm's guarantees are optimal in this setting up to constant factors.", "full_text": "No-Regret Algorithms for Unconstrained\n\nOnline Convex Optimization\n\nMatthew Streeter\nDuolingo, Inc.\u2217\n\nPittsburgh, PA 15232\n\nmatt@duolingo.com\n\nH. Brendan McMahan\n\nGoogle, Inc.\n\nSeattle, WA 98103\n\nmcmahan@google.com\n\nAbstract\n\nSome of the most compelling applications of online convex optimization, includ-\ning online prediction and classi\ufb01cation, are unconstrained: the natural feasible set\nis Rn. Existing algorithms fail to achieve sub-linear regret in this setting unless\nconstraints on the comparator point \u02dax are known in advance. We present algo-\nrithms that, without such prior knowledge, offer near-optimal regret bounds with\nrespect to any choice of \u02dax. In particular, regret with respect to \u02dax = 0 is constant.\nWe then prove lower bounds showing that our guarantees are near-optimal in this\nsetting.\n\n1\n\nIntroduction\n\nOver the past several years, online convex optimization has emerged as a fundamental tool for solv-\ning problems in machine learning (see, e.g., [3, 12] for an introduction). 
The reduction from general online convex optimization to online linear optimization means that simple and efficient (in memory and time) algorithms can be used to tackle large-scale machine learning problems. The key theoretical technique behind essentially all the algorithms in this field is the use of a fixed or increasing strongly convex regularizer (for gradient descent algorithms, this is equivalent to a fixed or decreasing learning-rate sequence). In this paper, we show that a fundamentally different type of algorithm can offer significant advantages over these approaches. Our algorithms adjust their learning rates based not just on the number of rounds, but also on the sum of gradients seen so far. This allows us to start with small learning rates, but effectively increase the learning rate if the problem instance warrants it.

This approach produces regret bounds of the form O(R√(T log((1 + R)T))), where R = ‖x̃‖2 is the L2 norm of an arbitrary comparator. Critically, our algorithms provide this guarantee simultaneously for all x̃ ∈ Rn, without any need to know R in advance. A consequence of this is that we can guarantee at most constant regret with respect to the origin, x̃ = 0. This technique can be applied to any online convex optimization problem where a fixed feasible set is not an essential component of the problem. We discuss two applications of particular interest below.

Online Prediction  Perhaps the single most important application of online convex optimization is the following prediction setting: the world presents an attribute vector at ∈ Rn; the prediction algorithm produces a prediction σ(at · xt), where xt ∈ Rn represents the model parameters, and σ : R → Y maps the linear prediction into the appropriate label space.
Then, the adversary reveals the label yt ∈ Y, and the prediction is penalized according to a loss function ℓ : Y × Y → R. For appropriately chosen σ and ℓ, this becomes a problem of online convex optimization against functions ft(x) = ℓ(σ(at · x), yt). In this formulation, there are no inherent restrictions on the model coefficients x ∈ Rn. The practitioner may have prior knowledge that "small" model vectors are more likely than large ones, but this is rarely best encoded as a feasible set F, which says: "all xt ∈ F are equally likely, and all other xt are ruled out." A more general strategy is to introduce a fixed convex regularizer: L1 and L2² penalties are common, but domain-specific choices are also possible. While algorithms of this form have proved very effective at solving these problems, theoretical guarantees usually require fixing a feasible set of radius R, or at least an intelligent guess of the norm of an optimal comparator x̃.

∗This work was performed while the author was at Google.

The Unconstrained Experts Problem and Portfolio Management  In the classic problem of predicting with expert advice (e.g., [3]), there are n experts, and on each round t the player selects an expert (say i), and obtains reward gt,i from a bounded interval (say [−1, 1]). Typically, one uses an algorithm that proposes a probability distribution pt on experts, so the expected reward is pt · gt. Our algorithms apply to an unconstrained version of this problem: there are still n experts with payouts in [−1, 1], but rather than selecting an individual expert, the player can place a "bet" of xt,i on each expert i, and then receives reward Σi xt,i gt,i = xt · gt. The bets are unconstrained (betting a negative value corresponds to betting against the expert).
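To make the unconstrained betting formulation concrete, here is a tiny numeric illustration (a sketch with made-up numbers, not from the paper) of the per-round reward xt · gt, including a profitable negative bet:

```python
# Unconstrained experts: a signed "bet" on each expert.
# Payouts g_t are in [-1, 1]; reward on round t is the dot product x_t . g_t.

def round_reward(bets, payouts):
    """Reward for one round: sum_i x_{t,i} * g_{t,i}."""
    return sum(x * g for x, g in zip(bets, payouts))

bets = [2.0, -1.0, 0.5]     # bet 2 on expert 0, bet *against* expert 1, 0.5 on expert 2
payouts = [0.5, -1.0, 0.0]  # expert 1 loses everything this round

# Expert 0 pays 2 * 0.5 = 1.0; the bet against expert 1 pays (-1) * (-1) = 1.0.
print(round_reward(bets, payouts))  # -> 2.0
```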
In this setting, a natural goal is the following: place bets so as to achieve as much reward as possible, subject to the constraint that total losses are bounded by a constant (which can be set equal to some starting budget which is to be invested). Our algorithms can satisfy constraints of this form because regret with respect to x̃ = 0 (which equals total loss) is bounded by a constant.

It is useful to contrast our results in this setting to previous applications of online convex optimization to portfolio management, for example [6] and [2]. By applying algorithms for exp-concave loss functions, they obtain log-wealth within O(log(T)) of the best constant rebalanced portfolio. However, this approach requires a "no-junk-bond" assumption: on each round, for each investment, you always retain at least an α > 0 fraction of your initial investment. While this may be realistic (though not guaranteed!) for blue-chip stocks, it certainly is not for bets on derivatives that can lose all their value unless a particular event occurs (e.g., a stock price crosses some threshold). Our model allows us to handle such investments: if we play xi > 0, an outcome of gi = −1 corresponds exactly to losing 100% of that investment. Our results imply that if even one investment (out of exponentially many choices) has significant returns, we will increase our wealth exponentially.

Notation and Problem Statement  For the algorithms considered in this paper, it will be more natural to consider reward-maximization rather than loss-minimization. Therefore, we consider online linear optimization where the goal is to maximize cumulative reward given adversarially selected linear reward functions ft(x) = gt · x. On each round t = 1 . . . T, the algorithm selects a point xt ∈ Rn, receives reward ft(xt) = gt · xt, and observes gt. For simplicity, we assume gt,i ∈ [−1, 1], that is, ‖gt‖∞ ≤ 1.
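This protocol is easy to pin down in code. The harness below is an illustrative sketch (the class and function names are our own, not the paper's): the learner commits to xt before seeing gt, accrues reward gt · xt, and tracks g1:T so that regret against any fixed comparator can be computed afterwards:

```python
# Minimal harness for the reward-maximization protocol: on each round the
# learner plays x_t (before seeing g_t), receives reward g_t . x_t, and then
# observes g_t.  `learner` is any object with play() and observe(g) methods.

def run(learner, gradients):
    reward = 0.0
    gsum = None  # running g_{1:t}
    for g in gradients:
        x = learner.play()
        reward += sum(gi * xi for gi, xi in zip(g, x))
        learner.observe(g)
        gsum = list(g) if gsum is None else [a + b for a, b in zip(gsum, g)]
    return reward, gsum

def regret(reward, gsum, comparator):
    # Regret(x~) = g_{1:T} . x~  -  sum_t g_t . x_t
    return sum(a * b for a, b in zip(gsum, comparator)) - reward

class Zero:
    """Trivial baseline: always plays the origin, so its reward is 0."""
    def play(self):
        return [0.0, 0.0]
    def observe(self, g):
        pass

r, gs = run(Zero(), [[1.0, -1.0], [1.0, 0.5]])
print(regret(r, gs, [1.0, 0.0]))  # regret vs x~ = (1, 0) is g_{1:T,0} = 2.0
```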
If the real problem is against convex loss functions ℓt(x), they can be converted to our framework by taking gt = −∇ℓt(xt) (see pseudo-code for REWARD-DOUBLING), using the standard reduction from online convex optimization to online linear optimization [13].

We use the compressed summation notation g1:t = Σ_{s=1}^t gs for both vectors and scalars. We study the reward of our algorithms, and their regret against a fixed comparator x̃:

Reward ≡ Σ_{t=1}^T gt · xt    and    Regret(x̃) ≡ g1:T · x̃ − Σ_{t=1}^T gt · xt.

Comparison of Regret Bounds  The primary contribution of this paper is to establish matching upper and lower bounds for unconstrained online convex optimization problems, using algorithms that require no prior information about the comparator point x̃. Specifically, we present an algorithm that, for any x̃ ∈ Rn, guarantees Regret(x̃) ≤ O(‖x̃‖2 √(T log((1 + ‖x̃‖2)T))). To obtain this guarantee, we show that it is sufficient (and necessary) that reward is Ω(exp(|g1:T|/√T)) (see Theorem 1). This shift of emphasis from regret-minimization to reward-maximization eliminates the quantification on x̃, and may be useful in other contexts.

Table 1 compares the bounds for REWARD-DOUBLING (this paper) to those of two previous algorithms: online gradient descent [13] and projected exponentiated gradient descent [8, 12]. For each algorithm, we consider a fixed choice of parameter settings and then look at how regret changes as we vary the comparator point x̃.

Our bounds are not directly comparable to the portfolio bounds cited above: an O(log(T)) regret bound on log-wealth implies wealth at least O(OPT/T), whereas we guarantee wealth like O(OPT′ − √T). But more importantly, the comparison classes are different.

Assuming ‖gt‖2 ≤ 1:
                                x̃ = 0    ‖x̃‖2 ≤ R                  Arbitrary x̃
  Gradient Descent, η = R/√T    R√T      R√T                        ‖x̃‖2 T
  REWARD-DOUBLING               ε        R√(T log(n(1+R)T/ε))       ‖x̃‖2 √(T log(n(1+‖x̃‖2)T/ε))

Assuming ‖gt‖∞ ≤ 1:
                                x̃ = 0        ‖x̃‖1 ≤ R              Arbitrary x̃
  Exponentiated G.D.            R√(T log n)  R√(T log n)            ‖x̃‖1 T
  REWARD-DOUBLING               ε            R√(T log(n(1+R)T/ε))   ‖x̃‖1 √(T log(n(1+‖x̃‖1)T/ε))

Table 1: Worst-case regret bounds for various algorithms (up to constant factors). Exponentiated G.D. uses feasible set {x : ‖x‖1 ≤ R}, and REWARD-DOUBLING uses εi = ε/n in both cases.

Gradient descent is minimax-optimal [1] when the comparator point is contained in a hypersphere whose radius is known in advance (‖x̃‖2 ≤ R) and gradients are sparse (‖gt‖2 ≤ 1, top table). Exponentiated gradient descent excels when gradients are dense (‖gt‖∞ ≤ 1, bottom table) but the comparator point is sparse (‖x̃‖1 ≤ R for R known in advance).
In both these cases, the bounds for REWARD-DOUBLING match those of the previous algorithms up to logarithmic factors, even when they are tuned optimally with knowledge of R.

The advantage of REWARD-DOUBLING shows up when the guess of R used to tune the competing algorithms turns out to be wrong. When x̃ = 0, REWARD-DOUBLING offers constant regret compared to Ω(√T) for the other algorithms. When x̃ can be arbitrary, only REWARD-DOUBLING offers sub-linear regret (and in fact its regret bound is optimal, as shown in Theorem 8).

In order to guarantee constant origin-regret, REWARD-DOUBLING frequently "jumps" back to playing the origin, which may be undesirable in some applications. In Section 4 we introduce SMOOTH-REWARD-DOUBLING, which achieves similar guarantees without resetting to the origin.

Related Work  Our work is related, at least in spirit, to the use of a momentum term in stochastic gradient descent for back propagation in neural networks [7, 11, 9]. These results are similar in motivation in that they effectively yield a larger learning rate when many recent gradients point in the same direction.

In Follow-The-Regularized-Leader terms, the exponentiated gradient descent algorithm with unnormalized weights of Kivinen and Warmuth [8] plays xt+1 = argmin_x∈Rn (g1:t · x + (1/η)(x log x − x)), which has closed-form solution xt+1 = exp(−η g1:t). Like our algorithm, this algorithm moves away from the origin exponentially fast, but unlike our algorithm it can incur arbitrarily large regret with respect to x̃ = 0. Theorem 9 shows that no algorithm of this form can provide bounds like the ones proved in this paper.

Hazan and Kale [5] give regret bounds in terms of the variance of the gt. Letting G = |g1:T| and H = Σ_{t=1}^T gt², they prove regret bounds of the form O(√V), where V = H − G²/T. This result has some similarity to our work in that G/√T = √(H − V), and so if we hold H constant, then when V is low, the critical ratio G/√T that appears in our bounds is large. However, they consider the case of a known feasible set, and their algorithm (gradient descent with a constant learning rate) cannot obtain bounds of the form we prove.

2 Reward and Regret

In this section we present a general result that converts lower bounds on reward into upper bounds on regret, for one-dimensional online linear optimization. In the unconstrained setting, this result will be sufficient to provide guarantees for general n-dimensional online convex optimization.

Theorem 1. Consider an algorithm for one-dimensional online linear optimization that, when run on a sequence of gradients g1, g2, . . . , gT, with gt ∈ [−1, 1] for all t, guarantees

Reward ≥ κ exp(γ |g1:T|) − ε,    (1)

where γ, κ > 0 and ε ≥ 0 are constants. Then, against any comparator x̃ ∈ [−R, R], we have

Regret(x̃) ≤ (R/γ)(log(R/(κγ)) − 1) + ε,    (2)

letting 0 log 0 = 0 when R = 0. Further, any algorithm with the regret guarantee of Eq. (2) must guarantee the reward of Eq. (1).

We give a proof of this theorem in the appendix. The duality between reward and regret can also be seen as a consequence of the fact that exp(x) and y log y − y are convex conjugates. The γ term typically contains a dependence on T like 1/√T. This bound holds for all R, and so for some small R the log term becomes negative; however, for real algorithms the ε term will ensure the regret bound remains positive.
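The conversion in Theorem 1 can be checked numerically: against a comparator of magnitude R, Regret = R·G − Reward with G = |g1:T|, and maximizing R·G − κ exp(γG) + ε over G recovers exactly the bound of Eq. (2). A quick sketch (parameter values are our own, chosen arbitrarily):

```python
import math

def regret_bound(R, gamma, kappa, eps):
    # Eq. (2): (R/gamma) * (log(R/(kappa*gamma)) - 1) + eps
    return (R / gamma) * (math.log(R / (kappa * gamma)) - 1) + eps

def worst_case_regret(R, gamma, kappa, eps):
    # Maximize R*G - (kappa*exp(gamma*G) - eps) over G >= 0 by ternary
    # search; the objective is concave in G.
    f = lambda G: R * G - kappa * math.exp(gamma * G) + eps
    lo, hi = 0.0, 10.0 + 2.0 * math.log(R / (kappa * gamma)) / gamma
    for _ in range(300):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return f((lo + hi) / 2)

R, gamma, kappa, eps = 3.0, 0.1, 0.5, 1.0
print(abs(worst_case_regret(R, gamma, kappa, eps) - regret_bound(R, gamma, kappa, eps)))
# agreement to high precision
```

The maximizer is G* = (1/γ) log(R/(κγ)), which is where the log term in Eq. (2) comes from.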
The minus one can of course be dropped to simplify the bound further.

3 Gradient Descent with Increasing Learning Rates

In this section we show that allowing the learning rate of gradient descent to sometimes increase leads to novel theoretical guarantees.

To build intuition, consider online linear optimization in one dimension, with gradients g1, g2, . . . , gT, all in [−1, 1]. In this setting, the reward of unconstrained gradient descent has a simple closed form:

Lemma 2. Consider unconstrained gradient descent in one dimension, with learning rate η. On round t, this algorithm plays the point xt = η g1:t−1. Letting G = |g1:T| and H = Σ_{t=1}^T gt², the cumulative reward of the algorithm is exactly

Reward = (η/2)(G² − H).

We give a simple direct proof in Appendix A. Perhaps surprisingly, this result implies that the reward is totally independent of the order of the linear functions selected by the adversary. Examining the expression in Lemma 2, we see that the optimal choice of learning rate η depends fundamentally on two quantities: the absolute value of the sum of gradients (G), and the sum of the squared gradients (H). If G² > H, we would like to use as large a learning rate as possible in order to maximize reward. In contrast, if G² < H, the algorithm will obtain negative reward, and the best it can do is to cut its losses by setting η as small as possible.

One of the motivations for this work is the observation that the state-of-the-art online gradient descent algorithms adjust their learning rates based only on the observed value of H (or its upper bound T); for example [4, 10]. We would like to increase reward by also accounting for G. But unlike H, which is monotonically increasing with time, G can both increase and decrease.
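Lemma 2 is easy to verify empirically; the following sketch (our own code, not the authors') checks the closed form and its independence of the gradient order:

```python
import random

# Empirical check of Lemma 2: unconstrained 1-d gradient descent with
# learning rate eta plays x_t = eta * g_{1:t-1}, and its cumulative reward
# is exactly (eta/2) * (G^2 - H), where G = |g_{1:T}| and H = sum_t g_t^2,
# regardless of the order in which the gradients arrive.

def gd_reward(gradients, eta):
    reward, gsum = 0.0, 0.0
    for g in gradients:
        x = eta * gsum        # x_t = eta * g_{1:t-1}
        reward += g * x
        gsum += g
    return reward

random.seed(0)
gs = [random.uniform(-1, 1) for _ in range(1000)]
eta = 0.1
G = abs(sum(gs))
H = sum(g * g for g in gs)
closed_form = eta / 2 * (G ** 2 - H)
print(abs(gd_reward(gs, eta) - closed_form) < 1e-6)  # -> True

# Shuffling the gradients leaves the reward unchanged:
random.shuffle(gs)
print(abs(gd_reward(gs, eta) - closed_form) < 1e-6)  # -> True
```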
This makes simple guess-and-doubling tricks fail when applied to G, and necessitates a more careful approach.

3.1 Analysis in One Dimension

In this section we analyze algorithm REWARD-DOUBLING-1D (Algorithm 1), which consists of a series of epochs. We suppose for the moment that an upper bound H̄ on H = Σ_{t=1}^T gt² is known in advance. In the first epoch, we run gradient descent with a small initial learning rate η = η1. Whenever the total reward accumulated in the current epoch reaches η H̄, we double η and start a new epoch (returning to the origin and forgetting all previous gradients except the most recent one).

Lemma 3. Applied to a sequence of gradients g1, g2, . . . , gT, all in [−1, 1], where H = Σ_{t=1}^T gt² ≤ H̄, REWARD-DOUBLING-1D obtains reward satisfying

Reward = Σ_{t=1}^T xt gt ≥ (1/4) η1 H̄ exp(a |g1:T| / √H̄) − η1 H̄,    (3)

for a = log(2)/√3.

Algorithm 1 REWARD-DOUBLING-1D
  Parameters: initial learning rate η1, upper bound H̄ ≥ Σ_{t=1}^T gt².
  Initialize x1 ← 0, i ← 1, and Q1 ← 0.
  for t = 1, 2, . . . , T do
    Play xt, and receive reward xt gt.
    Qi ← Qi + xt gt.
    if Qi < ηi H̄ then
      xt+1 ← xt + ηi gt.
    else
      i ← i + 1.
      ηi ← 2 ηi−1; Qi ← 0.
      xt+1 ← 0 + ηi gt.

Algorithm 2 REWARD-DOUBLING
  Parameters: maximum origin-regret εi for 1 ≤ i ≤ n.
  for i = 1, 2, . . . , n do
    Let Ai be a copy of algorithm REWARD-DOUBLING-1D-GUESS (see Theorem 4), with parameter εi.
  for t = 1, 2, . . . , T do
    Play xt, with xt,i selected by Ai.
    Receive gradient vector gt = −∇ft(xt).
    for i = 1, 2, . . . , n do
      Feed back gt,i to Ai.

Proof. Suppose round T occurs during the k'th epoch. Because epoch i can only come to an end if Qi ≥ ηi H̄, where ηi = 2^{i−1} η1, we have

Reward = Σ_{i=1}^k Qi ≥ (Σ_{i=1}^{k−1} 2^{i−1} η1 H̄) + Qk = (2^{k−1} − 1) η1 H̄ + Qk.    (4)

We now lower bound Qk. For i = 1, . . . , k let ti denote the round on which Qi is initialized to 0, with t1 ≡ 1, and define tk+1 ≡ T. By construction, Qi is the total reward of a gradient descent algorithm that is active on rounds ti through ti+1 inclusive, and that uses learning rate ηi (note that on round ti, this algorithm gets 0 reward and we initialize Qi to 0 on that round). Thus, by Lemma 2, we have that for any i,

Qi = (ηi/2)((g_{ti:ti+1})² − Σ_{s=ti}^{ti+1} gs²) ≥ −(ηi/2) H̄.

Applying this bound to epoch k, we have Qk ≥ −(1/2) ηk H̄ = −2^{k−2} η1 H̄. Substituting into (4) gives

Reward ≥ η1 H̄ (2^{k−1} − 1 − 2^{k−2}) = η1 H̄ (2^{k−2} − 1).    (5)

We now show that k ≥ |g1:T| / √(3H̄). At the end of round ti+1 − 1, we must have had Qi < ηi H̄ (otherwise epoch i + 1 would have begun earlier). Thus, again using Lemma 2,

(ηi/2)((g_{ti:ti+1−1})² − H̄) ≤ Qi < ηi H̄,

so |g_{ti:ti+1−1}| ≤ √(3H̄). Thus,

|g1:T| ≤ Σ_{i=1}^k |g_{ti:ti+1−1}| ≤ k √(3H̄).

Rearranging gives k ≥ |g1:T| / √(3H̄), and combining with Eq. (5) proves the lemma.

We can now apply Theorem 1 to the reward (given by Eq. (3)) of REWARD-DOUBLING-1D to show

Regret(x̃) ≤ b R √H̄ (log(4Rb / (η1 √H̄)) − 1) + η1 H̄    (6)

for any x̃ ∈ [−R, R], where b = a⁻¹ = √3/log(2) < 2.5. When the feasible set is also fixed in advance, online gradient descent with a fixed learning rate obtains a regret bound of O(R√T). Suppose we use the estimate H̄ = T. By choosing η1 = 1/T, we guarantee constant regret against the origin, x̃ = 0 (equivalently, constant total loss). Further, for any feasible set of radius R, we still have worst-case regret of at most O(R√(T log((1 + R)T))), which is only modestly worse than that of gradient descent with the optimal R known in advance.

The need for an upper bound H̄ can be removed using a standard guess-and-doubling approach, at the cost of a constant factor increase in regret (see appendix for proof).

Theorem 4. Consider algorithm REWARD-DOUBLING-1D-GUESS, which behaves as follows. On each era i, the algorithm runs REWARD-DOUBLING-1D with an upper bound of H̄i = 2^{i−1}, and initial learning rate η1^i = ε 2^{−2i}. An era ends when H̄i is no longer an upper bound on the sum of squared gradients seen during that era. Letting c = √2/(√2 − 1), this algorithm has regret at most

Regret ≤ c R √(H + 1) (log((R/ε)(2H + 2)^{5/2}) − 1) + ε.

3.2 Extension to n dimensions

To extend our results to general online convex optimization, it is sufficient to run a separate copy of REWARD-DOUBLING-1D-GUESS for each coordinate, as is done in REWARD-DOUBLING (Algorithm 2). The key to the analysis of this algorithm is that overall regret is simply the sum of regret on n one-dimensional subproblems which can be analyzed independently.

Theorem 5. Given a sequence of convex loss functions f1, f2, . . . , fT from Rn to R, REWARD-DOUBLING with εi = ε/n has regret bounded by

Regret(x̃) ≤ ε + c Σ_{i=1}^n |x̃i| √(Hi + 1) (log((n/ε) |x̃i| (2Hi + 2)^{5/2}) − 1)
          ≤ ε + c ‖x̃‖2 √(H + n) (log((n/ε) ‖x̃‖2 (2H + 2)^{5/2}) − 1)

for c = √2/(√2 − 1), where Hi = Σ_{t=1}^T gt,i² and H = Σ_{t=1}^T ‖gt‖2².

Proof. Fix a comparator x̃. For any coordinate i, define

Regret_i ≡ Σ_{t=1}^T x̃i gt,i − Σ_{t=1}^T xt,i gt,i.

Observe that

Σ_{i=1}^n Regret_i = Σ_{t=1}^T x̃ · gt − Σ_{t=1}^T xt · gt = Regret(x̃).

Furthermore, Regret_i is simply the regret of REWARD-DOUBLING-1D-GUESS on the gradient sequence g1,i, g2,i, . . . , gT,i. Applying the bound of Theorem 4 to each Regret_i term completes the proof of the first inequality. For the second inequality, let H⃗ be a vector whose ith component is √(Hi + 1), and let x⃗ ∈ Rn where x⃗i = |x̃i|. Using the Cauchy-Schwarz inequality, we have

Σ_{i=1}^n |x̃i| √(Hi + 1) = x⃗ · H⃗ ≤ ‖x̃‖2 ‖H⃗‖2 = ‖x̃‖2 √(H + n).

This, together with the fact that log(|x̃i| (2Hi + 2)^{5/2}) ≤ log(‖x̃‖2 (2H + 2)^{5/2}), suffices to prove the second inequality.

In some applications, n is not known in advance. In this case, we can set εi = ε/i² for the ith coordinate we encounter, and get the same bound up to constant factors.

4 An Epoch-Free Algorithm

In this section we analyze SMOOTH-REWARD-DOUBLING, a simple algorithm that achieves bounds comparable to those of Theorem 4, without guessing-and-doubling. We consider only the 1-d problem, as the technique of Theorem 5 can be applied to extend to n dimensions. Given a parameter η > 0, we achieve

Regret ≤ R√T (log(R T^{3/2} / η) − 1) + 1.76 η,    (7)

for all T and R, which is better (by constant factors) than Theorem 4 when gt ∈ {−1, 1} (which implies T = H). The bound can be worse on problems where H < T.

The idea of the algorithm is to maintain the invariant that our cumulative reward, as a function of g1:t and t, satisfies Reward ≥ N(g1:t, t), for some fixed function N.
Because reward changes by gt xt on round t, it suffices to guarantee that for any g ∈ [−1, 1],

N(g1:t, t) + g xt+1 ≥ N(g1:t + g, t + 1),    (8)

where xt+1 is the point the algorithm plays on round t + 1, and we assume N(0, 1) = 0. This inequality is approximately satisfied (for small g) if we choose

xt+1 = ∂N(g1:t + g, t)/∂g ≈ N(g1:t + g, t) − N(g1:t, t) ≈ N(g1:t + g, t + 1) − N(g1:t, t).

This suggests that if we want to maintain reward at least N(g1:t, t) = (1/t)(exp(|g1:t|/√t) − 1), we should set xt+1 ≈ sign(g1:t) t^{−3/2} exp(|g1:t|/√t). The following theorem (proved in the appendix) provides an inductive analysis of an algorithm of this form.

Theorem 6. Fix a sequence of reward functions ft(x) = gt x with gt ∈ [−1, 1], and let Gt = |g1:t|. We consider SMOOTH-REWARD-DOUBLING, which plays 0 on round 1 and whenever Gt = 0; otherwise, it plays

xt+1 = η sign(g1:t) B(Gt, t + 5),    (9)

with η > 0 a learning-rate parameter and

B(G, t) = (1/t^{3/2}) exp(G/√t).

Then, at the end of each round t, this algorithm has

Reward(t) ≥ η (1/(t + 5)) exp(Gt/√(t + 5)) − 1.76 η.    (10)

Two main technical challenges arise in the proof: first, we prove a result like Eq. (8) for N(g1:t, t) = (1/t) exp(|g1:t|/√t). However, this lemma only holds for t ≥ 6 and when the sign of g1:t doesn't change.
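The playing rule of Theorem 6 is simple enough to state directly in code. The sketch below is our own rendering of Eq. (9), not the authors' implementation; it also checks the reward guarantee of Eq. (10) on a few hand-picked sequences:

```python
import math

def B(G, t):
    # B(G, t) = (1/t^{3/2}) * exp(G / sqrt(t))
    return math.exp(G / math.sqrt(t)) / t ** 1.5

def smooth_reward_doubling(gradients, eta):
    """Play per Eq. (9): 0 on round 1 and whenever g_{1:t-1} = 0,
    otherwise eta * sign(g_{1:t-1}) * B(|g_{1:t-1}|, (t-1) + 5)."""
    reward, gsum = 0.0, 0.0
    for t, g in enumerate(gradients, start=1):
        if t == 1 or gsum == 0.0:
            x = 0.0
        else:
            x = eta * math.copysign(1.0, gsum) * B(abs(gsum), (t - 1) + 5)
        reward += g * x
        gsum += g
    return reward, abs(gsum)

# Check Reward(T) >= eta * exp(G_T / sqrt(T+5)) / (T+5) - 1.76 * eta  (Eq. 10)
eta = 1.0
for gs in ([1.0] * 200, [-1.0] * 200, [1.0, -1.0] * 100):
    reward, G = smooth_reward_doubling(gs, eta)
    T = len(gs)
    lower = eta * math.exp(G / math.sqrt(T + 5)) / (T + 5) - 1.76 * eta
    print(reward >= lower - 1e-9)  # -> True for each sequence
```

On the all-ones sequence the reward grows exponentially in T/√T, while on the alternating sequence the total loss stays bounded by a constant, as the theorem promises.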
We account for this by showing that a small modification to N (costing only a constant over all rounds) suffices.

By running this algorithm independently for each coordinate using an appropriate choice of η, one can obtain a guarantee similar to that of Theorem 5.

5 Lower Bounds

As with our previous results, it is sufficient to show a lower bound in one dimension, as it can then be replicated independently in each coordinate to obtain an n-dimensional bound. Note that our lower bound contains the factor log(|x̃|√T / ε), which can be negative when x̃ is small relative to T, hence it is important to hold x̃ fixed and consider the behavior as T → ∞. Here we give only a proof sketch; see Appendix A for the full proof.

Theorem 7. Consider the problem of unconstrained online linear optimization in one dimension, and an online algorithm that guarantees origin-regret at most ε. Then, for any fixed comparator x̃, and any integer T0, there exists a gradient sequence {gt} ∈ [−1, 1]^T of length T ≥ T0 for which the algorithm's regret satisfies

Regret(x̃) ≥ 0.336 |x̃| √(T log(|x̃|√T / ε)).

Proof. (Sketch) Assume without loss of generality that x̃ > 0. Let Q be the algorithm's reward when each gt is drawn independently uniformly from {−1, 1}. We have E[Q] = 0, and because the algorithm guarantees origin-regret at most ε, we have Q ≥ −ε with probability 1.
Letting G = g1:T, it follows that for any threshold Z = Z(T),

0 = E[Q]
  = E[Q | G < Z] · Pr[G < Z] + E[Q | G ≥ Z] · Pr[G ≥ Z]
  ≥ −ε Pr[G < Z] + E[Q | G ≥ Z] · Pr[G ≥ Z]
  > −ε + E[Q | G ≥ Z] · Pr[G ≥ Z].

Equivalently,

E[Q | G ≥ Z] < ε / Pr[G ≥ Z].

We choose Z(T) = √(kT), where k = ⌊log(R√T / ε) / log(p⁻¹)⌋. Here R = |x̃| and p > 0 is a constant chosen using binomial distribution lower bounds so that Pr[G ≥ Z] ≥ p^k. This implies

E[Q | G ≥ Z] < ε p^{−k} = ε exp(k log p⁻¹) ≤ R√T.

This implies there exists a sequence with G ≥ Z and Q < R√T. On this sequence, regret is at least G x̃ − Q ≥ R√(kT) − R√T = Ω(R√(kT)).

Theorem 8. Consider the problem of unconstrained online linear optimization in Rn, and consider an online algorithm that guarantees origin-regret at most ε. For any radius R, and any T0, there exists a gradient sequence {gt} ∈ ([−1, 1]^n)^T of length T ≥ T0, and a comparator x̃ with ‖x̃‖1 = R, for which the algorithm's regret satisfies

Regret(x̃) ≥ 0.336 Σ_{i=1}^n |x̃i| √(T log(|x̃i|√T / ε)).

Proof. For each coordinate i, Theorem 7 implies that there exists a T ≥ T0 and a sequence of gradients gt,i such that

Σ_{t=1}^T x̃i gt,i − Σ_{t=1}^T xt,i gt,i ≥ 0.336 |x̃i| √(T log(|x̃i|√T / ε)).

(The proof of Theorem 7 makes it clear that we can use the same T for all i.)
Summing this inequality across all n coordinates then gives the regret bound stated in the theorem.

The following theorem presents a stronger negative result for Follow-The-Regularized-Leader algorithms with a fixed regularizer: for any such algorithm that guarantees origin-regret at most εT after T rounds, worst-case regret with respect to any point outside [−εT, εT] grows linearly with T.

Theorem 9. Consider a Follow-The-Regularized-Leader algorithm that sets

xt = argmin_x (g1:t−1 x + ψT(x)),

where ψT is a convex, non-negative function with ψT(0) = 0. Let εT be the maximum origin-regret incurred by the algorithm on a sequence of T gradients. Then, for any x̃ with |x̃| > εT, there exists a sequence of T gradients such that the algorithm's regret with respect to x̃ is at least ((T − 1)/2)(|x̃| − εT).

In fact, it is clear from the proof that the above result holds for any algorithm that selects xt+1 purely as a function of g1:t (in particular, with no dependence on t).

6 Future Work

This work leaves open many interesting questions. It should be possible to apply our techniques to problems that do have constrained feasible sets; for example, it is natural to consider the unconstrained experts problem on the positive orthant. While we believe this extension is straightforward, handling arbitrary non-axis-aligned constraints will be more difficult. Another possibility is to develop an algorithm with bounds in terms of H rather than T that doesn't use a guess-and-double approach.

References

[1] Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In COLT, 2008.

[2] Amit Agarwal, Elad Hazan, Satyen Kale, and Robert E. Schapire. Algorithms for portfolio management based on the Newton method.
In ICML, 2006.

[3] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006. ISBN 0521841089.

[4] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In COLT, 2010.

[5] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In COLT, 2008.

[6] Elad Hazan and Satyen Kale. On stochastic and worst-case models for investing. In Advances in Neural Information Processing Systems 22, 2009.

[7] Robert A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1987.

[8] Jyrki Kivinen and Manfred Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1997.

[9] Todd K. Leen and Genevieve B. Orr. Optimal stochastic search and adaptive momentum. In NIPS, 1993.

[10] H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010.

[11] Barak Pearlmutter. Gradient descent: Second order momentum and saturating error. In NIPS, 1991.

[12] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[13] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
", "award": [], "sourceid": 1161, "authors": [{"given_name": "Brendan", "family_name": "Mcmahan", "institution": null}, {"given_name": "Matthew", "family_name": "Streeter", "institution": null}]}