{"title": "On the Generalization Ability of Online Strongly Convex Programming Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 801, "page_last": 808, "abstract": "This paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. The bound also solves an open problem regarding the convergence rate of PEGASOS, a recently proposed method for solving the SVM optimization problem.", "full_text": "On the Generalization Ability of Online Strongly Convex Programming Algorithms\n\nSham M. Kakade\nTTI Chicago\nChicago, IL 60637\nsham@tti-c.org\n\nAmbuj Tewari\nTTI Chicago\nChicago, IL 60637\ntewari@tti-c.org\n\nAbstract\n\nThis paper examines the generalization properties of online convex programming algorithms when the loss function is Lipschitz and strongly convex. Our main result is a sharp bound, that holds with high probability, on the excess risk of the output of an online algorithm in terms of the average regret. This allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability. 
As a corollary, we characterize the convergence rate of PEGASOS (with high probability), a recently proposed method for solving the SVM optimization problem.\n\n1 Introduction\n\nOnline regret minimizing algorithms provide some of the most successful algorithms for many machine learning problems, both in terms of the speed of optimization and the quality of generalization. Notable examples include efficient learning algorithms for structured prediction [Collins, 2002] (an algorithm now widely used) and for ranking problems [Crammer et al., 2006] (providing competitive results with a fast implementation).\n\nOnline convex optimization is a sequential paradigm in which, at each round, the learner predicts a vector $w_t \\in S \\subset \\mathbb{R}^n$, nature responds with a convex loss function $\\ell_t$, and the learner suffers loss $\\ell_t(w_t)$. In this setting, the goal of the learner is to minimize the regret:\n\n$$\\sum_{t=1}^{T} \\ell_t(w_t) - \\min_{w \\in S} \\sum_{t=1}^{T} \\ell_t(w)$$\n\nwhich is the difference between the learner's cumulative loss and the cumulative loss of the optimal fixed vector.\n\nTypically, these algorithms are used to train a learning algorithm incrementally, by sequentially feeding the algorithm a data sequence $(X_1, Y_1), \\ldots, (X_T, Y_T)$ (generated in an i.i.d. manner). In essence, the loss function used in the above paradigm at time $t$ is $\\ell(w; (X_t, Y_t))$, and this leads to a guaranteed bound on the regret:\n\n$$\\mathrm{Reg}_T = \\sum_{t=1}^{T} \\ell(w_t; (X_t, Y_t)) - \\min_{w \\in S} \\sum_{t=1}^{T} \\ell(w; (X_t, Y_t))$$\n\nHowever, in the batch setting, we are typically interested in finding a parameter $\\hat{w}$ with good generalization ability, i.e. we would like:\n\n$$R(\\hat{w}) - \\min_{w \\in S} R(w)$$\n\nto be small, where $R(w) := \\mathbb{E}[\\ell(w; (X, Y))]$ is the risk.\n\nIntuitively, it seems plausible that low regret on an i.i.d. sequence should imply good generalization performance. 
In fact, for most of the empirically successful online algorithms, we have a set of techniques to understand the generalization performance of these algorithms on new data via 'online to batch' conversions: the conversions relate the regret of the algorithm (on past data) to the generalization performance (on future data). These include cases which are tailored to general convex functions [Cesa-Bianchi et al., 2004] (whose regret is $O(\\sqrt{T})$) and mistake bound settings [Cesa-Bianchi and Gentile, 2008] (where the regret could be $O(1)$ under separability assumptions). In these conversions, we typically choose $\\hat{w}$ to be the average of the $w_t$ produced by our online algorithm.\n\nRecently, there has been a growing body of work providing online algorithms for strongly convex loss functions (i.e. $\\ell_t$ is strongly convex), with regret guarantees that are merely $O(\\ln T)$. Such algorithms have the potential to be highly applicable since many machine learning optimization problems are in fact strongly convex, either with strongly convex loss functions (e.g. log loss, square loss) or, indirectly, via strongly convex regularizers (e.g. $L_2$ or KL based regularization). Note that in the latter case, the loss function itself may be merely convex, but a strongly convex regularizer effectively makes this a strongly convex optimization problem; e.g. the SVM optimization problem uses the hinge loss with $L_2$ regularization. In fact, for this case, the PEGASOS algorithm of Shalev-Shwartz et al. [2007], based on the online strongly convex programming algorithm of Hazan et al. [2006], is a state-of-the-art SVM solver. Also, Ratliff et al. 
[2007] provide a similar subgradient method for max-margin based structured prediction, which also has favorable empirical performance.\n\nThe aim of this paper is to examine the generalization properties of online convex programming algorithms when the loss function is strongly convex (where strong convexity can be defined in a general sense, with respect to some arbitrary norm $\\|\\cdot\\|$). Suppose we have an online algorithm which has some guaranteed cumulative regret bound $\\mathrm{Reg}_T$ (e.g. say $\\mathrm{Reg}_T \\le \\ln T$ with $T$ samples). Then a corollary of our main result shows that with probability greater than $1 - \\delta \\ln T$, we obtain a parameter $\\hat{w}$ from our online algorithm such that:\n\n$$R(\\hat{w}) - \\min_{w} R(w) \\le \\frac{\\mathrm{Reg}_T}{T} + O\\left( \\frac{\\sqrt{\\mathrm{Reg}_T \\ln\\frac{1}{\\delta}}}{T} + \\frac{\\ln\\frac{1}{\\delta}}{T} \\right) .$$\n\nHere, the constants hidden in the $O$-notation are determined by the Lipschitz constant and the strong convexity parameter of the loss $\\ell$. Importantly, note that the correction term is of lower order than the regret: if the regret is $\\ln T$ then the additional penalty is $O(\\sqrt{\\ln T}/T)$. If one naively uses the Hoeffding-Azuma methods in Cesa-Bianchi et al. [2004], one would obtain a significantly worse penalty of $O(1/\\sqrt{T})$.\n\nThis result solves an open problem in Shalev-Shwartz et al. [2007], which was on characterizing the convergence rate of the PEGASOS algorithm, with high probability. PEGASOS is an online strongly convex programming algorithm for the SVM objective function: it repeatedly (and randomly) subsamples the training set in order to minimize the empirical SVM objective function. A corollary to this work essentially shows the convergence rate of PEGASOS (as a randomized optimization algorithm) is concentrated rather sharply.\n\nRatliff et al. [2007] also provide an online algorithm (based on Hazan et al. [2006]) for max-margin based structured prediction. 
Our results are also directly applicable in providing a sharper concentration result in their setting (in particular, see the regret bound in their Equation 15, to which our results can be applied).\n\nThis paper continues the line of research initiated by several researchers [Littlestone, 1989, Cesa-Bianchi et al., 2004, Zhang, 2005, Cesa-Bianchi and Gentile, 2008] which looks at how to convert online algorithms into batch algorithms with provable guarantees. Cesa-Bianchi and Gentile [2008] prove faster rates in the case when the cumulative loss of the online algorithm is small. Here, we are interested in the case where the cumulative regret is small. The work of Zhang [2005] is closest to ours. Zhang [2005] explicitly goes via the exponential moment method to derive sharper concentration results. In particular, for the regression problem with squared loss, Zhang [2005] gives a result similar to ours (see Theorem 8 therein). The present work can also be seen as generalizing his result to the case where we have strong convexity with respect to a general norm. Coupled with recent advances in low regret algorithms in this setting, we are able to provide a result that holds more generally.\n\nOur key technical tool is a probabilistic inequality due to Freedman [Freedman, 1975]. This, combined with a variance bound (Lemma 1) that follows from our assumptions about the loss function, allows us to derive our main result (Theorem 2). We then apply it to statistical learning with bounded loss, and to PEGASOS in Section 4.\n\n2 Setting\n\nFix a compact convex subset $S$ of some space equipped with a norm $\\|\\cdot\\|$. Let $\\|\\cdot\\|_*$ be the dual norm defined by $\\|v\\|_* := \\sup_{w : \\|w\\| \\le 1} v \\cdot w$. Let $Z$ be a random variable taking values in some space $\\mathcal{Z}$. Our goal is to minimize $F(w) := \\mathbb{E}[f(w; Z)]$ over $w \\in S$. 
Here, $f : S \\times \\mathcal{Z} \\to [0, B]$ is some function satisfying the following assumption.\n\nAssumption LIST. (LIpschitz and STrongly convex assumption) For all $z \\in \\mathcal{Z}$, the function $f_z(w) = f(w; z)$ is convex in $w$ and satisfies:\n\n1. $f_z$ has Lipschitz constant $L$ w.r.t. the norm $\\|\\cdot\\|$, i.e. $\\forall w \\in S$, $\\forall \\lambda \\in \\partial f_z(w)$ ($\\partial f_z$ denotes the subdifferential of $f_z$), $\\|\\lambda\\|_* \\le L$. Note that this assumption implies $\\forall w, w' \\in S$, $|f_z(w) - f_z(w')| \\le L \\|w - w'\\|$.\n\n2. $f_z$ is $\\nu$-strongly convex w.r.t. $\\|\\cdot\\|$, i.e. $\\forall \\theta \\in [0, 1]$, $\\forall w, w' \\in S$,\n\n$$f_z(\\theta w + (1 - \\theta) w') \\le \\theta f_z(w) + (1 - \\theta) f_z(w') - \\frac{\\nu}{2} \\theta (1 - \\theta) \\|w - w'\\|^2 .$$\n\nDenote the minimizer of $F$ by $w^\\star$, $w^\\star := \\arg\\min_{w \\in S} F(w)$. We consider an online setting in which independent (but not necessarily identically distributed) random variables $Z_1, \\ldots, Z_T$ become available to us in that order. These have the property that\n\n$$\\forall t, \\forall w \\in S, \\quad \\mathbb{E}[f(w; Z_t)] = F(w) .$$\n\nNow consider an algorithm that starts out with some $w_1$ and at time $t$, having seen $Z_t$, updates the parameter $w_t$ to $w_{t+1}$. Let $\\mathbb{E}_{t-1}[\\cdot]$ denote conditional expectation w.r.t. $Z_1, \\ldots, Z_{t-1}$. Note that $w_t$ is measurable w.r.t. $Z_1, \\ldots, Z_{t-1}$ and hence $\\mathbb{E}_{t-1}[f(w_t; Z_t)] = F(w_t)$.\n\nDefine the statistics,\n\n$$\\mathrm{Reg}_T := \\sum_{t=1}^{T} f(w_t; Z_t) - \\min_{w \\in S} \\sum_{t=1}^{T} f(w; Z_t) ,$$\n\n$$\\mathrm{Diff}_T := \\sum_{t=1}^{T} (F(w_t) - F(w^\\star)) = \\sum_{t=1}^{T} F(w_t) - T F(w^\\star) .$$\n\nDefine the sequence of random variables\n\n$$\\xi_t := F(w_t) - F(w^\\star) - (f(w_t; Z_t) - f(w^\\star; Z_t)) . \\qquad (1)$$\n\nSince $\\mathbb{E}_{t-1}[f(w_t; Z_t)] = F(w_t)$ and $\\mathbb{E}_{t-1}[f(w^\\star; Z_t)] = F(w^\\star)$, $\\xi_t$ is a martingale difference sequence. 
This definition needs some explanation as it is important to look at the right martingale difference sequence to derive the results we want. Even under assumption LIST, $\\frac{1}{T} \\sum_t f(w_t; Z_t)$ and $\\frac{1}{T} \\sum_t f(w^\\star; Z_t)$ will not be concentrated around $\\frac{1}{T} \\sum_t F(w_t)$ and $F(w^\\star)$ respectively at a rate better than $O(1/\\sqrt{T})$ in general. But if we look at the difference, we are able to get sharper concentration.\n\n3 A General Online to Batch Conversion\n\nThe following simple lemma is crucial for us. It says that under assumption LIST, the variance of the increment in the regret $f(w_t; Z_t) - f(w^\\star; Z_t)$ is bounded by its (conditional) expectation $F(w_t) - F(w^\\star)$. Such a control on the variance is often the main ingredient in obtaining sharper concentration results.\n\nLemma 1. Suppose assumption LIST holds and let $\\xi_t$ be the martingale difference sequence defined in (1). Let\n\n$$\\mathrm{Var}_{t-1} \\xi_t := \\mathbb{E}_{t-1}[\\xi_t^2]$$\n\nbe the conditional variance of $\\xi_t$ given $Z_1, \\ldots, Z_{t-1}$. Then, under assumption LIST, we have,\n\n$$\\mathrm{Var}_{t-1} \\xi_t \\le \\frac{4L^2}{\\nu} (F(w_t) - F(w^\\star)) .$$\n\nThe variance bound given by the above lemma allows us to prove our main theorem.\n\nTheorem 2. Under assumption LIST, we have, with probability at least $1 - 4 \\ln(T) \\delta$,\n\n$$\\frac{1}{T} \\sum_{t=1}^{T} F(w_t) - F(w^\\star) \\le \\frac{\\mathrm{Reg}_T}{T} + 4 \\sqrt{\\frac{L^2 \\ln(1/\\delta)}{\\nu}} \\frac{\\sqrt{\\mathrm{Reg}_T}}{T} + \\max\\left\\{ \\frac{16 L^2}{\\nu}, 6B \\right\\} \\frac{\\ln(1/\\delta)}{T} .$$\n\nFurther, using Jensen's inequality, $\\frac{1}{T} \\sum_t F(w_t)$ can be replaced by $F(\\bar{w})$ where $\\bar{w} := \\frac{1}{T} \\sum_t w_t$.\n\n3.1 Proofs\n\nProof of Lemma 1. We have,\n\n$$\\begin{aligned} \\mathrm{Var}_{t-1} \\xi_t &\\le \\mathbb{E}_{t-1}\\left[ (f(w_t; Z_t) - f(w^\\star; Z_t))^2 \\right] \\\\ &\\le \\mathbb{E}_{t-1}\\left[ L^2 \\|w_t - w^\\star\\|^2 \\right] \\quad [\\text{Assumption LIST, part 1}] \\\\ &= L^2 \\|w_t - w^\\star\\|^2 . \\qquad (2) \\end{aligned}$$\n\nOn the other hand, using part 2 of assumption LIST, we also have for any $w, w' \\in S$,\n\n$$\\frac{f(w; Z) + f(w'; Z)}{2} \\ge f\\left( \\frac{w + w'}{2}; Z \\right) + \\frac{\\nu}{8} \\|w - w'\\|^2 .$$\n\nTaking expectation this gives, for any $w, w' \\in S$,\n\n$$\\frac{F(w) + F(w')}{2} \\ge F\\left( \\frac{w + w'}{2} \\right) + \\frac{\\nu}{8} \\|w - w'\\|^2 .$$\n\nNow using this with $w = w_t$, $w' = w^\\star$, we get\n\n$$\\begin{aligned} \\frac{F(w_t) + F(w^\\star)}{2} &\\ge F\\left( \\frac{w_t + w^\\star}{2} \\right) + \\frac{\\nu}{8} \\|w_t - w^\\star\\|^2 \\\\ &\\ge F(w^\\star) + \\frac{\\nu}{8} \\|w_t - w^\\star\\|^2 . \\quad [\\text{since } w^\\star \\text{ minimizes } F] \\end{aligned}$$\n\nThis implies that\n\n$$\\|w_t - w^\\star\\|^2 \\le \\frac{4 (F(w_t) - F(w^\\star))}{\\nu} . \\qquad (3)$$\n\nCombining (2) and (3) we get,\n\n$$\\mathrm{Var}_{t-1} \\xi_t \\le \\frac{4L^2}{\\nu} (F(w_t) - F(w^\\star)) .$$\n\nThe proof of Theorem 2 relies on the following inequality for martingales which is an easy consequence of Freedman's inequality [Freedman, 1975, Theorem 1.6]. The proof of this lemma can be found in the appendix.\n\nLemma 3. Suppose $X_1, \\ldots, X_T$ is a martingale difference sequence with $|X_t| \\le b$. Let\n\n$$\\mathrm{Var}_t X_t = \\mathrm{Var}(X_t \\mid X_1, \\ldots, X_{t-1}) .$$\n\nLet $V = \\sum_{t=1}^{T} \\mathrm{Var}_t X_t$ be the sum of conditional variances of the $X_t$'s. Further, let $\\sigma = \\sqrt{V}$. Then we have, for any $\\delta < 1/e$ and $T \\ge 3$,\n\n$$\\mathrm{Prob}\\left( \\sum_{t=1}^{T} X_t > \\max\\left\\{ 2\\sigma, 3b \\sqrt{\\ln(1/\\delta)} \\right\\} \\sqrt{\\ln(1/\\delta)} \\right) \\le 4 \\ln(T) \\delta .$$\n\nProof of Theorem 2. By Lemma 1, we have $\\sigma := \\sqrt{\\sum_{t=1}^{T} \\mathrm{Var}_t \\xi_t} \\le \\sqrt{\\frac{4L^2}{\\nu} \\mathrm{Diff}_T}$. Note that $|\\xi_t| \\le 2B$ because our $f$ has range $[0, B]$. Therefore, Lemma 3 gives us that with probability at least $1 - 4 \\ln(T) \\delta$, we have\n\n$$\\sum_{t=1}^{T} \\xi_t \\le \\max\\left\\{ 2\\sigma, 6B \\sqrt{\\ln(1/\\delta)} \\right\\} \\sqrt{\\ln(1/\\delta)} .$$\n\nBy definition of $\\mathrm{Reg}_T$,\n\n$$\\mathrm{Diff}_T - \\mathrm{Reg}_T \\le \\sum_{t=1}^{T} \\xi_t$$\n\nand therefore, with probability at least $1 - 4 \\ln(T) \\delta$, we have\n\n$$\\mathrm{Diff}_T - \\mathrm{Reg}_T \\le \\max\\left\\{ 4 \\sqrt{\\frac{L^2}{\\nu} \\mathrm{Diff}_T}, 6B \\sqrt{\\ln(1/\\delta)} \\right\\} \\sqrt{\\ln(1/\\delta)} .$$\n\nUsing Lemma 4 below to solve the above quadratic inequality for $\\mathrm{Diff}_T$ gives\n\n$$\\frac{\\sum_{t=1}^{T} F(w_t)}{T} - F(w^\\star) \\le \\frac{\\mathrm{Reg}_T}{T} + 4 \\sqrt{\\frac{L^2 \\ln(1/\\delta)}{\\nu}} \\frac{\\sqrt{\\mathrm{Reg}_T}}{T} + \\max\\left\\{ \\frac{16 L^2}{\\nu}, 6B \\right\\} \\frac{\\ln(1/\\delta)}{T} .$$\n\nThe following elementary lemma was required to solve a recursive inequality in the proof of the above theorem. Its proof can be found in the appendix.\n\nLemma 4. Suppose $s, r, d, b, \\Delta \\ge 0$ and we have\n\n$$s - r \\le \\max\\{ 4 \\sqrt{ds}, 6b\\Delta \\} \\Delta .$$\n\nThen, it follows that\n\n$$s \\le r + 4 \\sqrt{dr} \\Delta + \\max\\{ 16d, 6b \\} \\Delta^2 .$$\n\n4 Applications\n\n4.1 Online to Batch Conversion for Learning with Bounded Loss\n\nSuppose $(X_1, Y_1), \\ldots, (X_T, Y_T)$ are drawn i.i.d. from a distribution. The pairs $(X_i, Y_i)$ belong to $\\mathcal{X} \\times \\mathcal{Y}$ and our algorithm is allowed to make predictions in a space $\\mathcal{D} \\supseteq \\mathcal{Y}$. A loss function $\\ell : \\mathcal{D} \\times \\mathcal{Y} \\to [0, 1]$ measures quality of predictions. Fix a convex set $S$ of some normed space and a function $h : \\mathcal{X} \\times S \\to \\mathcal{D}$. Let our hypothesis class be $\\{ x \\mapsto h(x; w) \\mid w \\in S \\}$.\n\nOn input $x$, the hypothesis parameterized by $w$ predicts $h(x; w)$ and incurs loss $\\ell(h(x; w), y)$ if the correct prediction is $y$. The risk of $w$ is defined by\n\n$$R(w) := \\mathbb{E}[\\ell(h(X; w), Y)]$$\n\nand let $w^\\star := \\arg\\min_{w \\in S} R(w)$ denote the (parameter for) the hypothesis with minimum risk. 
It is easy to see that this setting falls under the general framework given above by thinking of the pair $(X, Y)$ as $Z$ and setting $f(w; Z) = f(w; (X, Y))$ to be $\\ell(h(X; w), Y)$. Note that $F(w)$ becomes the risk $R(w)$. The range of $f$ is $[0, 1]$ by our assumption about the loss functions so $B = 1$.\n\nSuppose we run an online algorithm on our data that generates a sequence of hypotheses $w_0, \\ldots, w_T$ such that $w_t$ is measurable w.r.t. X [...]\n\nAppendix\n\nProof of Lemma 3. Note that a crude upper bound on $\\mathrm{Var}_t X_t$ is $b^2$. Thus, $\\sigma \\le b \\sqrt{T}$. We choose a discretization $0 = \\alpha_{-1} < \\alpha_0 < \\ldots < \\alpha_l$ such that $\\alpha_{i+1} = r \\alpha_i$ for $i \\ge 0$ and $\\alpha_l \\ge b \\sqrt{T}$. We will specify the choice of $\\alpha_0$ and $r$ shortly. We then have, for any $c > 0$,\n\n$$\\begin{aligned} & \\mathrm{Prob}\\left( \\sum_t X_t > c \\max\\{ r\\sigma, \\alpha_0 \\} \\sqrt{\\ln(1/\\delta)} \\right) \\\\ &= \\sum_{j=0}^{l} \\mathrm{Prob}\\left( \\sum_t X_t > c \\max\\{ r\\sigma, \\alpha_0 \\} \\sqrt{\\ln(1/\\delta)} \\;\\&\\; \\alpha_{j-1} < \\sigma \\le \\alpha_j \\right) \\\\ &\\le \\sum_{j=0}^{l} \\mathrm{Prob}\\left( \\sum_t X_t > c \\alpha_j \\sqrt{\\ln(1/\\delta)} \\;\\&\\; \\alpha_{j-1}^2 < V \\le \\alpha_j^2 \\right) \\\\ &\\le \\sum_{j=0}^{l} \\mathrm{Prob}\\left( \\sum_t X_t > c \\alpha_j \\sqrt{\\ln(1/\\delta)} \\;\\&\\; V \\le \\alpha_j^2 \\right) \\\\ &\\overset{(\\star)}{\\le} \\sum_{j=0}^{l} \\exp\\left( \\frac{-c^2 \\alpha_j^2 \\ln(1/\\delta)}{2 \\alpha_j^2 + \\frac{2}{3} \\left( c \\alpha_j \\sqrt{\\ln(1/\\delta)} \\right) b} \\right) \\\\ &= \\sum_{j=0}^{l} \\exp\\left( \\frac{-c^2 \\alpha_j \\ln(1/\\delta)}{2 \\alpha_j + \\frac{2}{3} \\left( c \\sqrt{\\ln(1/\\delta)} \\right) b} \\right) \\end{aligned}$$\n\nwhere the inequality $(\\star)$ follows from Freedman's inequality. If we now choose $\\alpha_0 = bc \\sqrt{\\ln(1/\\delta)}$ then $\\alpha_j \\ge bc \\sqrt{\\ln(1/\\delta)}$ for all $j$ and hence every term in the above summation is bounded by $\\exp\\left( \\frac{-c^2 \\ln(1/\\delta)}{2 + 2/3} \\right)$ which is less than $\\delta$ if we choose $c = 5/3$. Set $r = 2/c = 6/5$. We want $\\alpha_0 r^l \\ge b \\sqrt{T}$. Since $c \\sqrt{\\ln(1/\\delta)} \\ge 1$, choosing $l = \\log_r(\\sqrt{T})$ ensures that. Thus we have\n\n$$\\begin{aligned} & \\mathrm{Prob}\\left( \\sum_{t=1}^{T} X_t > \\frac{5}{3} \\max\\left\\{ \\frac{6}{5} \\sigma, \\frac{5}{3} b \\sqrt{\\ln(1/\\delta)} \\right\\} \\sqrt{\\ln(1/\\delta)} \\right) \\\\ &= \\mathrm{Prob}\\left( \\sum_t X_t > c \\max\\{ r\\sigma, \\alpha_0 \\} \\sqrt{\\ln(1/\\delta)} \\right) \\\\ &\\le (l + 1) \\delta = (\\log_{6/5}(\\sqrt{T}) + 1) \\delta \\\\ &\\le (6 \\ln(\\sqrt{T}) + 1) \\delta \\le 4 \\ln(T) \\delta . \\quad [\\text{since } T \\ge 3] \\end{aligned}$$\n\nProof of Lemma 4. The assumption of the lemma implies that one of the following inequalities holds:\n\n$$s - r \\le 6b \\Delta^2 , \\qquad (7)$$\n\n$$s - r \\le 4 \\sqrt{ds} \\Delta .$$\n\nIn the second case, we have\n\n$$(\\sqrt{s})^2 - (4 \\sqrt{d} \\Delta) \\sqrt{s} - r \\le 0$$\n\nwhich means that $\\sqrt{s}$ should be smaller than the larger root of the above quadratic. This gives us,\n\n$$\\begin{aligned} s = (\\sqrt{s})^2 &\\le \\left( 2 \\sqrt{d} \\Delta + \\sqrt{4 d \\Delta^2 + r} \\right)^2 \\\\ &\\le 4 d \\Delta^2 + 4 d \\Delta^2 + r + 4 \\sqrt{4 d^2 \\Delta^4 + d \\Delta^2 r} \\\\ &\\le 8 d \\Delta^2 + r + 8 d \\Delta^2 + 4 \\sqrt{dr} \\Delta \\quad [\\text{since } \\sqrt{x + y} \\le \\sqrt{x} + \\sqrt{y}] \\\\ &\\le r + 4 \\sqrt{dr} \\Delta + 16 d \\Delta^2 . \\qquad (8) \\end{aligned}$$\n\nCombining (7) and (8) finishes the proof.", "award": [], "sourceid": 290, "authors": [{"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Ambuj", "family_name": "Tewari", "institution": null}]}