{"title": "Mind the Duality Gap: Logarithmic regret algorithms for online optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1457, "page_last": 1464, "abstract": "We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework to generalized strongly convex functions.", "full_text": "Mind the Duality Gap:
Logarithmic regret algorithms for online optimization

Shai Shalev-Shwartz
shai@tti-c.org
Toyota Technological Institute at Chicago

Sham M. Kakade
sham@tti-c.org
Toyota Technological Institute at Chicago

Abstract

We describe a primal-dual framework for the design and analysis of online strongly convex optimization algorithms. Our framework yields the tightest known logarithmic regret bounds for Follow-The-Leader and for the gradient descent algorithm proposed in Hazan et al. [2006]. We then show that one can interpolate between these two extreme cases. In particular, we derive a new algorithm that shares the computational simplicity of gradient descent but achieves lower regret in many practical situations. Finally, we further extend our framework to generalized strongly convex functions.

1 Introduction

In recent years, online regret minimizing algorithms have become widely used and empirically successful for many machine learning problems. 
Notable examples include efficient learning algorithms for structured prediction and ranking problems [Collins, 2002, Crammer et al., 2006]. Most of these empirically successful algorithms are based on algorithms which are tailored to general convex functions, whose regret is O(√T). Rather recently, there is a growing body of work providing online algorithms for strongly convex loss functions, with regret guarantees that are only O(log T). These algorithms have the potential to be highly applicable since many machine learning optimization problems are in fact strongly convex, either with strongly convex loss functions (e.g. log loss, square loss) or, indirectly, via strongly convex regularizers (e.g. L2 or KL based regularization). Note that in this latter case, the loss function itself may only be just convex, but a strongly convex regularizer effectively makes this a strongly convex optimization problem (e.g. the SVM optimization problem uses the hinge loss with L2 regularization). The aim of this paper is to provide a template for deriving a wider class of regret-minimizing algorithms for online strongly convex programming.

Online convex optimization takes place in a sequence of consecutive rounds. At each round, the learner predicts a vector w_t ∈ S ⊂ R^n, and the environment responds with a convex loss function, ℓ_t : S → R. The goal of the learner is to minimize the difference between his cumulative loss and the cumulative loss of the optimal fixed vector,

∑_{t=1}^T ℓ_t(w_t) − min_{w∈S} ∑_{t=1}^T ℓ_t(w) .

This is termed 'regret' since it measures how 'sorry' the learner is, in retrospect, not to have predicted the optimal vector.

Roughly speaking, the family of regret minimizing algorithms (for general convex functions) can be seen as varying on two axes, the 'style' and the 'aggressiveness' of the update. 
In addition to online algorithms' relative simplicity, the empirical successes are also due to having these two knobs to tune for the problem at hand (which determine the nature of the regret bound). By style, we mean updates which favor either rotational invariance (such as gradient-descent-like update rules) or sparsity (like the multiplicative updates). Of course there is a much richer family here, including the Lp updates. By the aggressiveness of the update, we mean how much the algorithm moves its decision to be consistent with the most recent loss functions. For example, the perceptron algorithm makes no update when there is no error. In contrast, there is a family of algorithms which update more aggressively when there is a margin mistake. These algorithms are shown to have improved performance (see for example the experimental study in Shalev-Shwartz and Singer [2007b]).

While historically much of the analysis of these algorithms has been done on a case-by-case basis, in retrospect the proof techniques have become somewhat boilerplate, which has led to a growing body of work to unify these analyses (see Cesa-Bianchi and Lugosi [2006] for a review). Perhaps the most unified view of these algorithms is the 'primal-dual' framework of Shalev-Shwartz and Singer [2006], Shalev-Shwartz [2007], for which the gamut of these algorithms can be largely viewed as special cases. Two aspects are central in providing this unification. First, the framework works with a complexity function, which determines the style of the algorithm and the nature of the regret guarantee (if this function is the squared L2 norm, then one obtains gradient-like updates, and if this function is the KL-distance, then one obtains multiplicative updates). Second, the algorithm maintains both "primal" and "dual" variables. 
Here, the primal objective function is ∑_{t=1}^T ℓ_t(w) (where ℓ_t is the loss function provided at round t), and one can construct a dual objective function D_t(·), which only depends on the loss functions ℓ_1, ℓ_2, . . . , ℓ_{t−1}. The algorithm works by incrementally increasing the dual objective value (in an online manner), which can be done since each D_t is only a function of the previous loss functions. By weak duality, this can be seen as decreasing the duality gap. The level of aggressiveness is seen to be how fast the algorithm is attempting to increase the dual objective value.

This paper focuses on extending the duality framework for online convex programming to the case of strongly convex functions. This analysis provides a more unified and intuitive view of the extant algorithms for online strongly convex programming. An important observation we make is that any σ-strongly convex loss function can be rewritten as ℓ_i(w) = f(w) + g_i(w), where f is a fixed σ-strongly convex function (i.e. f does not depend on i), and g_i is a convex function. Therefore, after t online rounds, the amount of intrinsic strong convexity we have in the primal objective ∑_{i=1}^t ℓ_i(w) is at least σt. In particular, this explains the learning rate of 1/(σt) proposed in the gradient descent algorithm of Hazan et al. [2006]. Indeed, we show that our framework includes the gradient descent algorithm of Hazan et al. [2006] as an important special case, in which the aggressiveness level is minimal. At the most aggressive end, our framework yields the Follow-The-Leader algorithm. Furthermore, the template algorithm serves as a vehicle for deriving new algorithms (which enjoy logarithmic regret guarantees).

The remainder of the paper is outlined as follows. We first provide background on convex duality. As a warmup, in Section 3, we present an intuitive primal-dual analysis of Follow-The-Leader (FTL), when f is the Euclidean norm. 
This naturally leads to a more general primal-dual algorithm (for which FTL is a special case), which we present in Section 4. Next, we further generalize our algorithmic framework to include strongly convex complexity functions f with respect to arbitrary norms ‖·‖. We note that the introduction of a complexity function was already provided in Shalev-Shwartz and Singer [2007a], but the analysis is rather specialized and does not have a knob which can tune the aggressiveness of the algorithm. Finally, in Sec. 6 we conclude with a side-by-side comparison of our algorithmic framework for strongly convex functions and the framework for (non-strongly) convex functions given in Shalev-Shwartz [2007].

2 Mathematical Background

We denote scalars with lower case letters (e.g. w and λ), and vectors with bold face letters (e.g. w and λ). The inner product between vectors x and w is denoted by ⟨x, w⟩. To simplify our notation, given a sequence of vectors λ_1, . . . , λ_t or a sequence of scalars σ_1, . . . , σ_t we use the shorthand

λ_{1:t} = ∑_{i=1}^t λ_i   and   σ_{1:t} = ∑_{i=1}^t σ_i .

Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R_+. For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by ‖x‖. The dual norm is defined as ‖λ‖_* = sup{⟨x, λ⟩ : ‖x‖ ≤ 1}. For example, the Euclidean norm, ‖x‖_2 = (⟨x, x⟩)^{1/2}, is dual to itself, and the L1 norm, ‖x‖_1 = ∑_i |x_i|, is dual to the L∞ norm, ‖x‖_∞ = max_i |x_i|.

FOR t = 1, 2, . . . , T:
  Define w_t = −(1/σ_{1:(t−1)}) λ^t_{1:(t−1)}
  Receive a function ℓ_t(w) = (σ_t/2)‖w‖² + g_t(w) and suffer loss ℓ_t(w_t)
  Update λ^{t+1}_1, . . . , λ^{t+1}_t s.t. the following holds:
    (λ^{t+1}_1, . . . , λ^{t+1}_t) ∈ argmax_{λ_1,...,λ_t} D_{t+1}(λ_1, . . . , λ_t)

Figure 1: A primal-dual view of Follow-the-Leader. Here the algorithm's decision w_t is the best decision with respect to the previous losses. This presentation exposes the implicit role of the dual variables. Slightly abusing notation, λ_{1:0} = 0, so that w_1 = 0. See text.

We next recall a few definitions from convex analysis. A function f is σ-strongly convex if

f(α u + (1 − α) v) ≤ α f(u) + (1 − α) f(v) − (σ/2) α (1 − α) ‖u − v‖²_2 .

In Sec. 5 we generalize the above definition to arbitrary norms. If a function f is σ-strongly convex then the function g(w) = f(w) − (σ/2)‖w‖² is convex.

The Fenchel conjugate of a function f : S → R is defined as

f*(θ) = sup_{w∈S} ⟨w, θ⟩ − f(w) .

If f is closed and convex, then the Fenchel conjugate of f* is f itself (a function is closed if for all α > 0 the level set {w : f(w) ≤ α} is a closed set). It is straightforward to verify that the function f(w) = (1/2)‖w‖²_2 is conjugate to itself. The definition of f* also implies that for c > 0 we have (c f)*(θ) = c f*(θ/c).

A vector λ is a sub-gradient of a function f at w if for all w′ ∈ S, we have that f(w′) − f(w) ≥ ⟨w′ − w, λ⟩. The differential set of f at w, denoted ∂f(w), is the set of all sub-gradients of f at w. If f is differentiable at w, then ∂f(w) consists of a single vector which amounts to the gradient of f at w and is denoted by ∇f(w).

The Fenchel-Young inequality states that for any w and θ we have that f(w) + f*(θ) ≥ ⟨w, θ⟩. Sub-gradients play an important role in the definition of the Fenchel conjugate. 
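The conjugacy facts above are easy to sanity-check numerically. The sketch below is our own illustration (not part of the paper): it takes f(w) = (1/2)‖w‖²_2, which is closed, convex, and self-conjugate, and verifies the Fenchel-Young inequality, its equality case when θ is the gradient of f at w, and the scaling rule (c f)*(θ) = c f*(θ/c).

```python
import numpy as np

rng = np.random.default_rng(0)

# f(w) = (1/2)||w||^2 is closed, convex, 1-strongly convex, and self-conjugate:
# f*(theta) = sup_w <w, theta> - (1/2)||w||^2 = (1/2)||theta||^2 (attained at w = theta).
f = lambda w: 0.5 * np.dot(w, w)
f_star = f  # self-conjugacy

for _ in range(1000):
    w, theta = rng.standard_normal(3), rng.standard_normal(3)
    # Fenchel-Young inequality: f(w) + f*(theta) >= <w, theta>
    assert f(w) + f_star(theta) >= np.dot(w, theta) - 1e-12
    # equality when theta is the (sub)gradient of f at w, i.e. theta = w
    assert abs(f(w) + f_star(w) - np.dot(w, w)) < 1e-12

# the scaling rule from the text: (c f)*(theta) = c f*(theta / c)
c, theta = 3.0, rng.standard_normal(3)
assert abs(c * f_star(theta / c) - 0.5 * np.dot(theta, theta) / c) < 1e-12
```

The inequality here is just (1/2)‖w − θ‖² ≥ 0 in disguise, which is why equality holds exactly when θ = w.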
In particular, the following lemma, whose proof can be found in Borwein and Lewis [2006], states that if λ ∈ ∂f(w) then the Fenchel-Young inequality holds with equality.

Lemma 1 Let f be a closed and convex function and let ∂f(w′) be its differential set at w′. Then, for all λ′ ∈ ∂f(w′), we have f(w′) + f*(λ′) = ⟨λ′, w′⟩ .

We make use of the following variant of Fenchel duality (see the appendix for more details):

max_{λ_1,...,λ_T} −f*(−∑_{t=1}^T λ_t) − ∑_{t=1}^T g*_t(λ_t) ≤ min_w f(w) + ∑_{t=1}^T g_t(w) .   (1)

3 Warmup: A Primal-Dual View of Follow-The-Leader

In this section, we provide a dual analysis for the FTL algorithm. The dual view of FTL will help us to derive a family of logarithmic regret algorithms for online convex optimization with strongly convex functions.

Recall that the FTL algorithm is defined as follows:

w_t = argmin_w ∑_{i=1}^{t−1} ℓ_i(w) .   (2)

For each i ∈ [t − 1] define g_i(w) = ℓ_i(w) − (σ_i/2)‖w‖², where σ_i is the largest scalar such that g_i is still a convex function. The assumption that ℓ_i is σ-strongly convex guarantees that σ_i ≥ σ. We can therefore rewrite the objective function on the right-hand side of Eq. (2) as

P_t(w) = (σ_{1:(t−1)}/2)‖w‖² + ∑_{i=1}^{t−1} g_i(w) ,   (3)

(recall that σ_{1:(t−1)} = ∑_{i=1}^{t−1} σ_i). The Fenchel dual optimization problem (see Sec. 2) is to maximize the following dual objective function

D_t(λ_1, . . . , λ_{t−1}) = −(1/(2σ_{1:(t−1)}))‖λ_{1:(t−1)}‖² − ∑_{i=1}^{t−1} g*_i(λ_i) .   (4)

Let (λ^t_1, . . . , λ^t_{t−1}) be the maximizer of D_t. 
The relation between the optimal dual variables and the optimal primal vector is given by (see again Sec. 2)

w_t = −(1/σ_{1:(t−1)}) λ^t_{1:(t−1)} .   (5)

Throughout this section we assume that strong duality holds (i.e. Eq. (1) holds with equality). See the appendix for sufficient conditions. In particular, under this assumption, we have that the above setting for w_t is in fact a minimizer of the primal objective, since (λ^t_1, . . . , λ^t_{t−1}) maximizes the dual objective (see the appendix). The primal-dual view of Follow-the-Leader is presented in Figure 1.

Denote

Δ_t = D_{t+1}(λ^{t+1}_1, . . . , λ^{t+1}_t) − D_t(λ^t_1, . . . , λ^t_{t−1}) .   (6)

To analyze the FTL algorithm, we first note that (by strong duality)

∑_{t=1}^T Δ_t = D_{T+1}(λ^{T+1}_1, . . . , λ^{T+1}_T) = min_w P_{T+1}(w) = min_w ∑_{t=1}^T ℓ_t(w) .   (7)

Second, the fact that (λ^{t+1}_1, . . . , λ^{t+1}_t) is the maximizer of D_{t+1} implies that for any λ we have

Δ_t ≥ D_{t+1}(λ^t_1, . . . , λ^t_{t−1}, λ) − D_t(λ^t_1, . . . , λ^t_{t−1}) .   (8)

The following central lemma shows that there exists λ such that the right-hand side of the above is sufficiently large.

Lemma 2 Let (λ_1, . . . , λ_{t−1}) be an arbitrary sequence of vectors. Denote w = −(1/σ_{1:(t−1)}) λ_{1:(t−1)}, let v ∈ ∂ℓ_t(w), and let λ = v − σ_t w. Then, λ ∈ ∂g_t(w) and

D_{t+1}(λ_1, . . . , λ_{t−1}, λ) − D_t(λ_1, . . . , λ_{t−1}) = ℓ_t(w) − ‖v‖²/(2σ_{1:t}) .

Proof We prove the lemma for the case t > 1. The case t = 1 can be proved similarly. Since ℓ_t(w) = (σ_t/2)‖w‖² + g_t(w) and v ∈ ∂ℓ_t(w), we have that λ ∈ ∂g_t(w). Denote Δ̄_t = D_{t+1}(λ_1, . . . , λ_{t−1}, λ) − D_t(λ_1, . . . , λ_{t−1}). Simple algebraic manipulations, using w = −λ_{1:(t−1)}/σ_{1:(t−1)} in the second equality and σ_{1:(t−1)} = σ_{1:t} − σ_t in the third, yield

Δ̄_t = −(1/(2σ_{1:t}))‖λ_{1:(t−1)} + λ‖² + (1/(2σ_{1:(t−1)}))‖λ_{1:(t−1)}‖² − g*_t(λ)
    = (σ_t σ_{1:(t−1)}/(2σ_{1:t}))‖w‖² + (σ_{1:(t−1)}/σ_{1:t})⟨w, λ⟩ − ‖λ‖²/(2σ_{1:t}) − g*_t(λ)
    = (σ_t/2)‖w‖² + A − B ,

where A = ⟨w, λ⟩ − g*_t(λ) and B = (σ_t²‖w‖² + 2σ_t⟨w, λ⟩ + ‖λ‖²)/(2σ_{1:t}). Since λ ∈ ∂g_t(w), Lemma 1 thus implies that A = ⟨w, λ⟩ − g*_t(λ) = g_t(w). Therefore, (σ_t/2)‖w‖² + A = ℓ_t(w). Next, we note that B = ‖σ_t w + λ‖²/(2σ_{1:t}). We have thus shown that Δ̄_t = ℓ_t(w) − ‖σ_t w + λ‖²/(2σ_{1:t}). Plugging the definition of λ into the above we conclude our proof.

Combining Lemma 2 with Eq. (7) and Eq. (8) we obtain the following:

FOR t = 1, 2, . . . , T:
  Define w_t = −(1/σ_{1:(t−1)}) λ^t_{1:(t−1)}
  Receive a function ℓ_t(w) = (σ_t/2)‖w‖² + g_t(w) and suffer loss ℓ_t(w_t)
  Update λ^{t+1}_1, . . . , λ^{t+1}_t s.t. the following holds:
    ∃λ_t ∈ ∂g_t(w_t), s.t. D_{t+1}(λ^{t+1}_1, . . . , λ^{t+1}_t) ≥ D_{t+1}(λ^t_1, . . . , λ^t_{t−1}, λ_t)

Figure 2: A primal-dual algorithmic framework for online convex optimization. Again, w_1 = 0.

Corollary 1 Let ℓ_1, . . . , ℓ_T be a sequence of functions such that for all t ∈ [T], ℓ_t is σ_t-strongly convex. 
Assume that the FTL algorithm runs on this sequence and for each t ∈ [T], let v_t be in ∂ℓ_t(w_t). Then,

∑_{t=1}^T ℓ_t(w_t) − min_w ∑_{t=1}^T ℓ_t(w) ≤ (1/2) ∑_{t=1}^T ‖v_t‖²/σ_{1:t} .   (9)

Furthermore, let L = max_t ‖v_t‖ and assume that for all t ∈ [T], σ_t ≥ σ. Then, the regret is bounded by (L²/(2σ))(log(T) + 1).

If we are dealing with the square loss ℓ_t(w) = ‖w − μ_t‖²_2 (where nature chooses μ_t), then it is straightforward to see that Eq. (8) holds with equality, and this leads to the previous regret bound holding with equality. This equality is the underlying reason that the FTL strategy is a minimax strategy (see Abernethy et al. [2008] for a proof of this claim).

4 A Primal-Dual Algorithm for Online Strongly Convex Optimization

In the previous section, we provided a dual analysis for FTL. Here, we extend the analysis and derive a more general algorithmic framework for online optimization.

We start by examining the analysis of the FTL algorithm. We first make the important observation that Lemma 2 is not specific to the FTL algorithm and in fact holds for any configuration of dual variables. Consider an arbitrary sequence of dual variables (λ²_1), (λ³_1, λ³_2), . . . , (λ^{T+1}_1, . . . , λ^{T+1}_T) and denote Δ_t as in Eq. (6). Using weak duality, we can replace the equality in Eq. (7) with the following inequality that holds for any sequence of dual variables:

∑_{t=1}^T Δ_t = D_{T+1}(λ^{T+1}_1, . . . , λ^{T+1}_T) ≤ min_w P_{T+1}(w) = min_w ∑_{t=1}^T ℓ_t(w) .   (10)

A summary of the algorithmic framework is given in Fig. 2.

The following theorem, a direct corollary of the previous equation and Lemma 2, shows that all instances of the framework achieve logarithmic regret.

Theorem 1 Let ℓ_1, . . . , ℓ
_T be a sequence of functions such that for all t ∈ [T], ℓ_t is σ_t-strongly convex. Then, any algorithm that can be derived from Fig. 2 satisfies

∑_{t=1}^T ℓ_t(w_t) − min_w ∑_{t=1}^T ℓ_t(w) ≤ (1/2) ∑_{t=1}^T ‖v_t‖²/σ_{1:t} ,   (11)

where v_t ∈ ∂ℓ_t(w_t).

Proof Let Δ_t be as defined in Eq. (6). The last condition in the algorithm implies that

Δ_t ≥ D_{t+1}(λ^t_1, . . . , λ^t_{t−1}, v_t − σ_t w_t) − D_t(λ^t_1, . . . , λ^t_{t−1}) .

The proof follows directly by combining the above with Eq. (10) and Lemma 2.

We conclude this section by deriving several algorithms from the framework.

Example 1 (Follow-The-Leader) As we have shown in Sec. 3, the FTL algorithm (Fig. 1) is equivalent to optimizing the dual variables at each online round. This update clearly satisfies the condition in Fig. 2 and is therefore a special case.

Example 2 (Gradient-Descent) Following Hazan et al. [2006], Bartlett et al. [2007] suggested the following update rule for differentiable strongly convex functions:

w_{t+1} = w_t − (1/σ_{1:t}) ∇ℓ_t(w_t) .   (12)

Using a simple inductive argument, it is possible to show that the above update rule is equivalent to the following update rule of the dual variables:

(λ^{t+1}_1, . . . , λ^{t+1}_t) = (λ^t_1, . . . , λ^t_{t−1}, ∇ℓ_t(w_t) − σ_t w_t) .   (13)

Clearly, this update rule satisfies the condition in Fig. 2. Therefore our framework encompasses this algorithm as a special case.

Example 3 (Online Coordinate-Dual-Ascent) The FTL and the Gradient-Descent updates are two extreme cases of our algorithmic framework. The former makes the largest possible increase of the dual while the latter makes the smallest possible increase that still satisfies the sufficient dual increase requirement. 
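To make the two extremes concrete, here is a small numerical sketch (our own illustration, not from the paper). It runs both the gradient-descent update (12) and FTL on regularized logistic losses ℓ_t(w) = (σ/2)‖w‖² + log(1 + exp(−y_t⟨x_t, w⟩)), so that σ_t = σ and g_t is the logistic term, and checks the regret of each against the logarithmic bound (9)/(11). The data, the inner batch solver, and all constants are arbitrary choices made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, sigma = 100, 3, 1.0
X = rng.standard_normal((T, d))
y = rng.choice([-1.0, 1.0], size=T)

# l_t(w) = (sigma/2)||w||^2 + log(1 + exp(-y_t <x_t, w>)):
# sigma-strongly convex, with f = (1/2)||w||^2 and g_t the (convex) logistic term.
def loss(w, t):
    return 0.5 * sigma * w @ w + np.logaddexp(0.0, -y[t] * (X[t] @ w))

def grad(w, t):
    p = 1.0 / (1.0 + np.exp(y[t] * (X[t] @ w)))  # = sigmoid(-y_t <x_t, w>)
    return sigma * w - y[t] * p * X[t]

def batch_min(idx, w0, iters=50):
    # inner solver: full-batch gradient descent with step 1/L,
    # where L upper-bounds the smoothness of the summed loss
    idx = list(idx)
    L = len(idx) * sigma + 0.25 * sum(X[t] @ X[t] for t in idx)
    w = w0.copy()
    for _ in range(iters):
        w = w - sum(grad(w, t) for t in idx) / L
    return w

def run(update):
    w, cum, bound = np.zeros(d), 0.0, 0.0
    for t in range(T):
        cum += loss(w, t)
        v = grad(w, t)
        bound += (v @ v) / (2 * sigma * (t + 1))  # sigma_{1:t} = sigma * t
        w = update(w, t)
    return cum, bound

gd = lambda w, t: w - grad(w, t) / (sigma * (t + 1))  # the update in Eq. (12)
ftl = lambda w, t: batch_min(range(t + 1), w)         # Follow-The-Leader, Eq. (2)

w_star = batch_min(range(T), np.zeros(d), iters=1000)
best = sum(loss(w_star, t) for t in range(T))

for update in (gd, ftl):
    cum, bound = run(update)
    assert cum - best <= bound + 1e-3  # the logarithmic regret bound (9)/(11)
```

Note the cost asymmetry the text describes: the gradient-descent update is one step per round, while FTL re-solves a batch problem at every round (here via the warm-started inner solver).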
Intuitively, the FTL method should have smaller regret as it consumes more of its potential earlier in the optimization process. However, its computational complexity is large as it requires a full blown optimization procedure at each online round. A possible compromise is to fully optimize the dual objective but only with respect to a small number of dual variables. In the extreme case, we optimize only with respect to the last dual variable. Formally, we let

λ^{t+1}_i = λ^t_i if i < t, and λ^{t+1}_t = argmax_{λ_t} D_{t+1}(λ^t_1, . . . , λ^t_{t−1}, λ_t) if i = t.

Clearly, the above update satisfies the requirement in Fig. 2 and therefore enjoys a logarithmic regret bound. The computational complexity of performing this update is often small as we optimize over a single dual vector. Occasionally, it is possible to obtain a closed-form solution of the optimization problem and in these cases the computational complexity of the coordinate-dual-ascent update is identical to that of the gradient-descent method.

5 Generalized strongly convex functions

In this section, we extend our algorithmic framework to deal with generalized strongly convex functions. We first need the following generalized definition of strong convexity.

Definition 1 A continuous function f is σ-strongly convex over a convex set S with respect to a norm ‖·‖ if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have

f(α v + (1 − α) u) ≤ α f(v) + (1 − α) f(u) − (σ/2) α (1 − α) ‖v − u‖² .   (14)

It is straightforward to show that the function f(w) = (1/2)‖w‖²_2 is strongly convex with respect to the Euclidean norm. Less trivial examples are given below.

Example 4 The function f(w) = ∑_{i=1}^n w_i log(w_i/(1/n)) is strongly convex over the probability simplex, S = {w ∈ R^n_+ : ‖w‖_1 = 1}, with respect to the L1 norm. Its conjugate function is f*(θ) = log((1/n) ∑_{i=1}^n exp(θ_i)).

Example 5 For q ∈ (1, 2), the function f(w) = (1/(2(q−1)))‖w‖²_q is strongly convex over S = R^n with respect to the Lq norm. Its conjugate function is f*(θ) = (1/(2(p−1)))‖θ‖²_p, where p = (1 − 1/q)^{−1}.

For proofs, see for example Shalev-Shwartz [2007]. In the appendix, we list several important properties of strongly convex functions. In particular, the Fenchel conjugate of a strongly convex function is differentiable.

INPUT: A strongly convex function f
FOR t = 1, 2, . . . , T:
  1) Define w_t = ∇f*(−λ^t_{1:(t−1)}/√t)
  2) Receive a function ℓ_t
  3) Suffer loss ℓ_t(w_t)
  4) Update λ^{t+1}_1, . . . , λ^{t+1}_t s.t. there exists λ_t ∈ ∂ℓ_t(w_t) with
     D_{t+1}(λ^{t+1}_1, . . . , λ^{t+1}_t) ≥ D_{t+1}(λ^t_1, . . . , λ^t_{t−1}, λ_t)

INPUT: A σ-strongly convex function f
FOR t = 1, 2, . . . , T:
  1) Define w_t = ∇f*(−λ^t_{1:(t−1)}/σ_{1:t})
  2) Receive a function ℓ_t = σf + g_t
  3) Suffer loss ℓ_t(w_t)
  4) Update λ^{t+1}_1, . . . , λ^{t+1}_t s.t. there exists λ_t ∈ ∂g_t(w_t) with
     D_{t+1}(λ^{t+1}_1, . . . , λ^{t+1}_t) ≥ D_{t+1}(λ^t_1, . . . , λ^t_{t−1}, λ_t)

Figure 3: Primal-dual template algorithms for general online convex optimization (left) and online strongly convex optimization (right). Here a_{1:t} = ∑_{i=1}^t a_i, and for notational convenience, we implicitly assume that a_{1:0} = 0. 
See text for description.

Consider the case where for all t, ℓ_t can be written as σ_t f + g_t, where f is 1-strongly convex with respect to some norm ‖·‖ and g_t is a convex function. We also make the simplifying assumption that σ_t is known to the forecaster before he defines w_t.

For each round t, we now define the primal objective to be

P_t(w) = σ_{1:(t−1)} f(w) + ∑_{i=1}^{t−1} g_i(w) .   (15)

The dual objective is (see again Sec. 2)

D_t(λ_1, . . . , λ_{t−1}) = −σ_{1:(t−1)} f*(−λ_{1:(t−1)}/σ_{1:(t−1)}) − ∑_{i=1}^{t−1} g*_i(λ_i) .   (16)

An algorithmic framework for online optimization in the presence of general strongly convex functions is given on the right-hand side of Fig. 3.

The following theorem provides a logarithmic regret bound for the algorithmic framework given on the right-hand side of Fig. 3.

Theorem 2 Let ℓ_1, . . . , ℓ_T be a sequence of functions such that for all t ∈ [T], ℓ_t = σ_t f + g_t for f being strongly convex w.r.t. a norm ‖·‖ and g_t a convex function. Then, any algorithm that can be derived from Fig. 3 (right) satisfies

∑_{t=1}^T ℓ_t(w_t) − min_w ∑_{t=1}^T ℓ_t(w) ≤ (1/2) ∑_{t=1}^T ‖v_t‖²_*/σ_{1:t} ,   (17)

where v_t ∈ ∂g_t(w_t) and ‖·‖_* is the norm dual to ‖·‖.

The proof of the theorem is given in Sec. B.

6 Summary

In this paper, we extended the primal-dual algorithmic framework for general convex functions from Shalev-Shwartz and Singer [2006], Shalev-Shwartz [2007] to strongly convex functions. The template algorithms are outlined in Fig. 3. The left algorithm is the primal-dual algorithm for general convex functions from Shalev-Shwartz and Singer [2006], Shalev-Shwartz [2007]. Here, f is the complexity function, (λ^t_1, . . . , λ^t_t) are the dual variables at time t, and D_t(·) is the dual objective 
function at time t (which is a lower bound on the primal value). The function ∇f* is the gradient of the conjugate function of f, which can be viewed as a projection of the dual variables back into the primal space. At the least aggressive extreme, in order to obtain √T regret, it is sufficient to set λ^i_t (for all i) to be a subgradient of the loss, ∂ℓ_t(w_t). We then recover an online 'mirror descent' algorithm [Beck and Teboulle, 2003, Grove et al., 2001, Kivinen and Warmuth, 1997], which specializes to gradient descent when f is the squared 2-norm, or to the exponentiated gradient descent algorithm when f is the relative entropy. At the most aggressive extreme, where D_t is maximized at each round, we have 'Follow the Regularized Leader', which is w_t = arg min_w ∑_{i=1}^{t−1} ℓ_i(w) + √t f(w). Intermediate algorithms can also be devised, such as the 'passive-aggressive' algorithms [Crammer et al., 2006, Shalev-Shwartz, 2007].

The right algorithm in Figure 3 is our new contribution for strongly convex functions. Any σ-strongly convex loss function can be decomposed into ℓ_t = σf + g_t, where g_t is convex. The algorithm for strongly convex functions is different in two ways. First, the effective learning rate is now 1/σ_{1:t} rather than 1/√t (see Step 1 in both algorithms). Second, more subtly, the condition on the dual variables (in Step 4) is only determined by the subgradient of g_t, rather than the subgradient of ℓ_t. At the most aggressive end of the spectrum, where D_t is maximized at each round, we have the 'Follow the Leader' (FTL) algorithm: w_t = arg min_w ∑_{i=1}^{t−1} ℓ_i(w). At the least aggressive end, we have the gradient descent algorithm of Hazan et al. [2006] (which uses learning rate 1/σ_{1:t}). Furthermore, we provide algorithms which lie in between these two extremes; it is these algorithms which have the potential for most practical impact.

Empirical observations suggest that algorithms which most aggressively close the duality gap tend to perform most favorably [Crammer et al., 2006, Shalev-Shwartz and Singer, 2007b]. However, at the FTL extreme, this is often computationally prohibitive to implement (as one must solve a full blown optimization problem at each round). Our template algorithm suggests a natural compromise, which is to optimize the dual objective but only with respect to a small number of dual variables (say the most current dual variable); we coin this algorithm online coordinate-dual-ascent. In fact, it is sometimes possible to obtain a closed-form solution of this optimization problem, so that the computational complexity of the coordinate-dual-ascent update is identical to that of a vanilla gradient-descent method. This variant update still enjoys a logarithmic regret bound.

References

J. Abernethy, P. Bartlett, A. Rakhlin, and A. Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008.
P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In Advances in Neural Information Processing Systems 21, 2007.
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167-175, 2003.
J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
M. Collins. 
Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Conference on Empirical Methods in Natural Language Processing, 2002.
K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. Journal of Machine Learning Research, 7:551-585, March 2006.
A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3):173-210, 2001.
E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic regret algorithms for online convex optimization. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2006.
J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-64, January 1997.
S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.
S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. In Advances in Neural Information Processing Systems 20, 2006.
S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strictly convex repeated games. Technical report, The Hebrew University, 2007a. Available at http://www.cs.huji.ac.il/~shais.
S. Shalev-Shwartz and Y. Singer. A unified algorithmic approach for efficient online label ranking. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2007b.
", "award": [], "sourceid": 193, "authors": [{"given_name": "Shai", "family_name": "Shalev-Shwartz", "institution": null}, {"given_name": "Sham", "family_name": "Kakade", "institution": null}]}