Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning

Francesco Orabona*
Yahoo! Labs, New York, USA
francesco@orabona.com

In Advances in Neural Information Processing Systems, pages 1116-1124.

Abstract

Stochastic gradient descent algorithms for training linear and kernel predictors are gaining more and more importance, thanks to their scalability. While various methods have been proposed to speed up their convergence, the model selection phase is often ignored. In fact, theoretical works most of the time make assumptions, for example, on the prior knowledge of the norm of the optimal solution, while in the practical world validation methods remain the only viable approach.
In this paper, we propose a new kernel-based stochastic gradient descent algorithm that performs model selection while training, with no parameters to tune and no form of cross-validation. The algorithm builds on recent advances in online learning theory for unconstrained settings to estimate over time the right regularization in a data-dependent way. Optimal rates of convergence are proved under standard smoothness assumptions on the target function, and preliminary empirical results are presented.

1 Introduction

Stochastic Gradient Descent (SGD) algorithms are gaining more and more importance in the Machine Learning community as efficient and scalable machine learning tools. There are two possible ways to use an SGD algorithm: to optimize a batch objective function, e.g. [23], or to directly optimize the generalization performance of a learning algorithm, in a stochastic approximation way [20]. The second use is the one we consider in this paper. It allows learning over streams of data coming Independent and Identically Distributed (IID) from a stochastic source. Moreover, it has been advocated that SGD theoretically yields the best generalization performance in a given amount of time compared to other more sophisticated optimization algorithms [6].

Yet, both in theory and in practice, the convergence rate of SGD for any finite training set critically depends on the step sizes used during training. In fact, theoretical analyses often assume the use of optimal step sizes, rarely known in reality, and in practical applications wrong step sizes can result in arbitrarily bad performance. While in finite-dimensional hypothesis spaces simple optimal strategies are known [2], in infinite-dimensional spaces the only attempts to solve this problem achieve convergence only in the realizable case, e.g. [25], or assume prior knowledge of intrinsic (and unknown) characteristics of the problem [24, 29, 31, 33, 34].
The only known practical and theoretical way to achieve optimal rates in an infinite-dimensional Reproducing Kernel Hilbert Space (RKHS) is to use some form of cross-validation to select the step size, which corresponds to a form of model selection [26, Chapter 7.4]. However, cross-validation techniques result in a slower training procedure, partially negating the advantage of stochastic training. A notable exception is the algorithm in [21], which keeps the step size constant while the number of epochs on the training set acts as a regularizer. Yet, the number of epochs is decided through the use of a validation set [21].

*Work done mainly while at Toyota Technological Institute at Chicago.

Note that the situation is exactly the same in the batch setting, where the regularization takes the role of the step size. Even in this case, optimal rates can be achieved only when the regularization is chosen in a problem-dependent way [12, 17, 27, 32].

On a parallel route, the Online Convex Optimization (OCO) literature studies the possibility of learning in a scenario where the data are not IID [9, 36]. It turns out that this setting is strictly more difficult than the IID one, and OCO algorithms can also be used to solve the corresponding stochastic problems [8]. The literature on OCO focuses on the adversarial nature of the problem and on various ways to achieve adaptivity to its unknown characteristics [1, 11, 14, 15].

This paper sits between these two worlds: we extend tools from OCO to design a novel stochastic parameter-free algorithm able to obtain optimal finite-sample convergence bounds in infinite-dimensional RKHSs. This new algorithm, called Parameter-free STOchastic Learning (PiSTOL), has the same complexity as the plain stochastic gradient descent procedure and implicitly performs model selection while training, with no parameters to tune and no need for cross-validation.
The core idea is to change the step sizes over time in a data-dependent way. As far as we know, this is the first algorithm of this kind to have provable optimal convergence rates.

The rest of the paper is organized as follows. After introducing some basic notation (Sec. 2), we explain the intuition behind the proposed method (Sec. 3). Next, in Sec. 4 we describe the PiSTOL algorithm and its regret bounds in the adversarial setting, and in Sec. 5 we show its convergence results in the stochastic setting. A detailed discussion of related work is deferred to Sec. 6. Finally, we show some empirical results and draw conclusions in Sec. 7.

2 Problem Setting and Definitions

Let $\mathcal{X} \subset \mathbb{R}^d$ be a compact set and $\mathcal{H}_K$ the RKHS associated to a Mercer kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, implementing the inner product $\langle \cdot, \cdot \rangle_K$ that satisfies the reproducing property, $\langle K(x, \cdot), f(\cdot) \rangle_K = f(x)$. Without loss of generality, in the following we will always assume $\|K(x_t, \cdot)\|_K \le 1$.

Performance is measured w.r.t. a loss function $\ell : \mathbb{R} \to \mathbb{R}^+$. We will consider $L$-Lipschitz losses, that is, $|\ell(x) - \ell(x')| \le L |x - x'|, \forall x, x' \in \mathbb{R}$, and $H$-smooth losses, that is, differentiable losses whose first derivative is $H$-Lipschitz. Note that a loss can be both Lipschitz and smooth. A vector $x$ is a subgradient of a convex function $\ell$ at $v$ if $\ell(u) - \ell(v) \ge \langle u - v, x \rangle$ for any $u$ in the domain of $\ell$. The differential set of $\ell$ at $v$, denoted by $\partial \ell(v)$, is the set of all the subgradients of $\ell$ at $v$.
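As a small numerical illustration of the subgradient definition (this example is ours, not the paper's; the hinge loss and the evaluation grid are illustrative choices), take the hinge loss $\ell(v) = \max(1 - v, 0)$: it is convex and 1-Lipschitz, and $g = -1$ for $v < 1$, $g = 0$ for $v \ge 1$ is a valid subgradient, which we can check directly against the subgradient inequality:

```python
# Illustrative check of the subgradient inequality for the hinge loss
# l(v) = max(1 - v, 0); toy example, not from the paper.

def hinge(v):
    return max(1.0 - v, 0.0)

def hinge_subgrad(v):
    # g = -1 on the linear part (v < 1), g = 0 on the flat part (v >= 1);
    # at v = 1 any g in [-1, 0] is a valid subgradient, we pick 0
    return -1.0 if v < 1.0 else 0.0

def is_subgradient(g, v, loss, grid):
    # the inequality l(u) - l(v) >= g * (u - v) must hold for every u
    return all(loss(u) - loss(v) >= g * (u - v) - 1e-12 for u in grid)

grid = [x / 10.0 for x in range(-50, 51)]
assert all(is_subgradient(hinge_subgrad(v), v, hinge, grid)
           for v in [-2.0, 0.0, 0.5, 1.0, 3.0])
```

Note that at the kink $v = 1$ the differential set $\partial \ell(1) = [-1, 0]$ contains more than one element, so any value in that interval would pass the same check.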
$\mathbf{1}(\Phi)$ will denote the indicator function of a Boolean predicate $\Phi$.

In the OCO framework, at each round $t$ the algorithm receives a vector $x_t \in \mathcal{X}$, picks an $f_t \in \mathcal{H}_K$, and pays $\ell_t(f_t(x_t))$, where $\ell_t$ is a loss function. The aim of the algorithm is to minimize the regret, that is, the difference between the cumulative loss of the algorithm, $\sum_{t=1}^T \ell_t(f_t(x_t))$, and the cumulative loss of an arbitrary and fixed competitor $h \in \mathcal{H}_K$, $\sum_{t=1}^T \ell_t(h(x_t))$.

For the statistical setting, let $\rho$ be a fixed but unknown distribution on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y} = [-1, 1]$. A training set $\{x_t, y_t\}_{t=1}^T$ will consist of samples drawn IID from $\rho$. Denote by $f_\rho(x) := \int_{\mathcal{Y}} y \, d\rho(y|x)$ the regression function, where $\rho(\cdot|x)$ is the conditional probability measure at $x$ induced by $\rho$. Denote by $\rho_{\mathcal{X}}$ the marginal probability measure on $\mathcal{X}$ and let $L^2_{\rho_{\mathcal{X}}}$ be the space of square-integrable functions with respect to $\rho_{\mathcal{X}}$, whose norm is $\|f\|_{L^2_{\rho_{\mathcal{X}}}} := \sqrt{\int_{\mathcal{X}} f^2(x) \, d\rho_{\mathcal{X}}}$. Note that $f_\rho \in L^2_{\rho_{\mathcal{X}}}$. Define the $\ell$-risk of $f$ as $\mathcal{E}^\ell(f) := \int_{\mathcal{X} \times \mathcal{Y}} \ell(y f(x)) \, d\rho$. Also, define $f^\ell_\rho(x) := \arg\min_{t \in \mathbb{R}} \int_{\mathcal{Y}} \ell(y t) \, d\rho(y|x)$, which gives the optimal $\ell$-risk, $\mathcal{E}^\ell(f^\ell_\rho) = \inf_{f \in L^2_{\rho_{\mathcal{X}}}} \mathcal{E}^\ell(f)$.
In the binary classification case, define the misclassification risk of $f$ as $R(f) := P(y \neq \operatorname{sign}(f(x)))$. The infimum of the misclassification risk over all measurable $f$ will be called the Bayes risk, and $f_c := \operatorname{sign}(f_\rho)$, called the Bayes classifier, is such that $R(f_c) = \inf_{f \in L^2_{\rho_{\mathcal{X}}}} R(f)$.

Let $L_K : L^2_{\rho_{\mathcal{X}}} \to \mathcal{H}_K$ be the integral operator defined by $(L_K f)(x) = \int_{\mathcal{X}} K(x, x') f(x') \, d\rho_{\mathcal{X}}(x')$. There exists an orthonormal basis $\{\Phi_1, \Phi_2, \dots\}$ of $L^2_{\rho_{\mathcal{X}}}$ consisting of eigenfunctions of $L_K$ with corresponding non-negative eigenvalues $\{\lambda_1, \lambda_2, \dots\}$, and the set $\{\lambda_i\}$ is finite or $\lambda_k \to 0$ when $k \to \infty$ [13, Theorem 4.7]. Since $K$ is a Mercer kernel, $L_K$ is compact and positive. Therefore, the fractional power operator $L^\beta_K$ is well defined for any $\beta \ge 0$. We indicate its range space by

Algorithm 1 Averaged SGD.
  Parameters: $\eta > 0$
  Initialize: $f_1 = 0 \in \mathcal{H}_K$
  for $t = 1, 2, \dots$ do
    Receive input vector $x_t \in \mathcal{X}$
    Predict with $\hat{y}_t = f_t(x_t)$
    Update $f_{t+1} = f_t + \eta \, y_t \, \ell'(y_t \hat{y}_t) \, k(x_t, \cdot)$
  end for
  Return $\bar{f}_T = \frac{1}{T} \sum_{t=1}^T f_t$

Algorithm 2 The Kernel Perceptron.
  Parameters: None
  Initialize: $f_1 = 0 \in \mathcal{H}_K$
  for $t = 1, 2, \dots$
do
    Receive input vector $x_t \in \mathcal{X}$
    Predict with $\hat{y}_t = \operatorname{sign}(f_t(x_t))$
    Suffer loss $\mathbf{1}(\hat{y}_t \neq y_t)$
    Update $f_{t+1} = f_t + y_t \mathbf{1}(\hat{y}_t \neq y_t) k(x_t, \cdot)$
  end for

$$L^\beta_K(L^2_{\rho_{\mathcal{X}}}) := \Big\{ f = \sum_{i=1}^{\infty} a_i \Phi_i \ : \ \sum_{i : a_i \neq 0} a_i^2 \lambda_i^{-2\beta} < \infty \Big\}. \quad (1)$$

By Mercer's theorem, we have that $L^{1/2}_K(L^2_{\rho_{\mathcal{X}}}) = \mathcal{H}_K$, that is, every function $f \in \mathcal{H}_K$ can be written as $L^{1/2}_K g$ for some $g \in L^2_{\rho_{\mathcal{X}}}$, with $\|f\|_K = \|g\|_{L^2_{\rho_{\mathcal{X}}}}$. On the other hand, by definition of the orthonormal basis, $L^0_K(L^2_{\rho_{\mathcal{X}}}) = L^2_{\rho_{\mathcal{X}}}$. Thus, the smaller $\beta$ is, the bigger this space of functions will be;¹ see Fig. 1. This space has a key role in our analysis. In particular, we will assume that $f^\ell_\rho \in L^\beta_K(L^2_{\rho_{\mathcal{X}}})$ for some $\beta > 0$, that is,

$$\exists g \in L^2_{\rho_{\mathcal{X}}} : f^\ell_\rho = L^\beta_K g. \quad (2)$$

Figure 1: $L^2_{\rho_{\mathcal{X}}}$, $\mathcal{H}_K$, and $L^\beta_K(L^2_{\rho_{\mathcal{X}}})$ spaces, with $0 < \beta_1 < \frac{1}{2} < \beta_2$.

3 A Gentle Start: ASGD, Optimal Step Sizes, and the Perceptron

Consider the square loss, $\ell(x) = (1 - x)^2$. We want to investigate the problem of training a predictor, $\bar{f}_T$, on the training set $\{x_t, y_t\}_{t=1}^T$ in a stochastic way, using each sample only once, so that $\mathcal{E}^\ell(\bar{f}_T)$ converges to $\mathcal{E}^\ell(f^\ell_\rho)$. The Averaged Stochastic Gradient Descent (ASGD) procedure in Algorithm 1 has been proposed as a fast stochastic algorithm to train predictors [35]. ASGD simply goes over all the samples once, updates the predictor with the gradients of the losses, and returns the averaged solution.
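To make Algorithm 1 concrete, here is a minimal sketch of kernelized ASGD for the square loss in dual form, $f(\cdot) = \sum_s c_s K(x_s, \cdot)$. The Gaussian kernel, the 1-D toy data, and the step size are our illustrative choices, not the paper's experimental setup; we write the update as an explicit descent step $f_{t+1} = f_t - \eta \, \ell'(y_t f_t(x_t)) \, y_t k(x_t, \cdot)$, which for the square loss expands to $f_t + 2\eta (1 - y_t f_t(x_t)) \, y_t k(x_t, \cdot)$:

```python
import math

def gauss_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma (x - z)^2), so K(x, x) = 1 as assumed in Sec. 2
    return math.exp(-gamma * (x - z) ** 2)

def asgd_square_loss(data, eta=0.25, kernel=gauss_kernel):
    # iterate f_t kept in dual form: f_t(.) = sum_s coefs[s] * K(centers[s], .)
    centers, coefs = [], []
    T = len(data)
    for x, y in data:
        fx = sum(c * kernel(z, x) for c, z in zip(coefs, centers))
        grad = -2.0 * (1.0 - y * fx)      # l'(v) = -2 (1 - v) at v = y * f(x)
        centers.append(x)
        coefs.append(-eta * grad * y)     # descent step on l(y f(x))
    # averaged solution bar{f}_T = (1/T) sum_t f_t: the coefficient added at
    # (0-indexed) round s appears in T - s - 1 of the iterates f_1, ..., f_T
    avg_coefs = [c * (T - s - 1) / T for s, c in enumerate(coefs)]
    return centers, avg_coefs

def predict(x, centers, coefs, kernel=gauss_kernel):
    return sum(c * kernel(z, x) for c, z in zip(coefs, centers))

# toy 1-D stream where the label is the sign of the input
data = [(-2.0, -1), (1.0, 1), (-1.5, -1), (2.0, 1), (-1.0, -1), (1.5, 1)] * 5
centers, coefs = asgd_square_loss(data)
```

On this toy stream the averaged predictor separates the two clusters; the point of the sketch is only the mechanics (dual-form updates, single pass, averaging), since, as discussed next, the quality of the solution hinges on the choice of $\eta$.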
For ASGD with constant step size $0 < \eta \le \frac{1}{4}$, it is immediate to show² that

$$\mathbb{E}[\mathcal{E}^\ell(\bar{f}_T)] \le \inf_{h \in \mathcal{H}_K} \mathcal{E}^\ell(h) + \|h\|^2_K (\eta T)^{-1} + 4\eta. \quad (3)$$

This result shows the link between step size and regularization: in expectation, the $\ell$-risk of the averaged predictor will be close to the $\ell$-risk of the best regularized function in $\mathcal{H}_K$, where the amount of regularization depends on the step size used. From (3), one might be tempted to choose $\eta = O(T^{-\frac{1}{2}})$. With this choice, when the number of samples goes to infinity, ASGD converges to the performance of the best predictor in $\mathcal{H}_K$ at a rate of $O(T^{-\frac{1}{2}})$, but only if the infimum $\inf_{h \in \mathcal{H}_K} \mathcal{E}^\ell(h)$ is attained by a function in $\mathcal{H}_K$. Note that even with a universal kernel we only have $\mathcal{E}^\ell(f^\ell_\rho) = \inf_{h \in \mathcal{H}_K} \mathcal{E}^\ell(h)$; there is no guarantee that the infimum is attained [26].

On the other hand, there is a vast, and often ignored, literature examining the general case when (2) holds [4, 7, 12, 17, 24, 27, 29, 31-34]. Under this assumption, the infimum is attained only when $\beta \ge \frac{1}{2}$, yet it is possible to prove convergence for any $\beta > 0$. In fact, when (2) holds it is known that $\min_{h \in \mathcal{H}_K} \big[ \mathcal{E}^\ell(h) + \|h\|^2_K (\eta T)^{-1} \big] - \mathcal{E}^\ell(f^\ell_\rho) = O((\eta T)^{-2\beta})$ [13, Proposition 8.5]. Hence, it was observed in [33] that setting $\eta = O(T^{-\frac{2\beta}{2\beta+1}})$ in (3), we obtain $\mathbb{E}[\mathcal{E}^\ell(\bar{f}_T)] - \mathcal{E}^\ell(f^\ell_\rho) = O(T^{-\frac{2\beta}{2\beta+1}})$,

¹The case $\beta < 1$ implicitly assumes that $\mathcal{H}_K$ is infinite-dimensional. If $\mathcal{H}_K$ has finite dimension, $\beta$ is 0 or 1.
See also the discussion in [27].

²The proofs of this statement and of all the other presented results are in [19].

that is the optimal rate [27, 33]. Hence, the setting $\eta = O(T^{-\frac{1}{2}})$ is optimal only when $\beta = \frac{1}{2}$, that is, $f^\ell_\rho \in \mathcal{H}_K$. In all the other cases, the convergence rate of ASGD to the optimal $\ell$-risk is suboptimal. Unfortunately, $\beta$ is typically unknown to the learner.

On the other hand, using the tools to design self-tuning algorithms, e.g. [1, 14], it may be possible to design an ASGD-like algorithm able to self-tune its step size in a data-dependent way. Indeed, we would like an algorithm able to select the optimal step size in (3), that is,

$$\mathbb{E}[\mathcal{E}^\ell(\bar{f}_T)] \le \inf_{h \in \mathcal{H}_K} \mathcal{E}^\ell(h) + \min_{\eta > 0} \big[ \|h\|^2_K (\eta T)^{-1} + 4\eta \big] = \inf_{h \in \mathcal{H}_K} \mathcal{E}^\ell(h) + 4 \|h\|_K T^{-\frac{1}{2}}. \quad (4)$$

In the OCO setting, this would correspond to a regret bound of the form $O(\|h\|_K T^{\frac{1}{2}})$. An algorithm with this kind of guarantee is the Perceptron algorithm [22], see Algorithm 2. In fact, for the Perceptron it is possible to prove the following mistake bound [9]:

$$\text{Number of Mistakes} \le \inf_{h \in \mathcal{H}_K} \sum_{t=1}^T \ell^h(y_t h(x_t)) + \|h\|^2_K + \|h\|_K \sqrt{\sum_{t=1}^T \ell^h(y_t h(x_t))}, \quad (5)$$

where $\ell^h$ is the hinge loss, $\ell^h(x) = \max(1 - x, 0)$. The Perceptron algorithm is similar to SGD, but its behavior is independent of the step size; hence, it can be thought of as always using the optimal one. Unfortunately, we are not done yet: while (5) has the right form, it is not a regret bound, but only a mistake bound, specific to binary classification.
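For concreteness, the Kernel Perceptron of Algorithm 2 can be sketched in dual form as follows; the Gaussian kernel and the well-separated 1-D toy stream are our illustrative choices:

```python
import math

def gauss_kernel(x, z, gamma=1.0):
    return math.exp(-gamma * (x - z) ** 2)

def kernel_perceptron(data, kernel=gauss_kernel):
    # f_t(.) = sum_s coefs[s] * K(centers[s], .); the state changes only on mistakes
    centers, coefs, mistakes = [], [], 0
    for x, y in data:
        fx = sum(c * kernel(z, x) for c, z in zip(coefs, centers))
        y_hat = 1 if fx >= 0 else -1          # sign(f_t(x_t)), with sign(0) = +1
        if y_hat != y:                        # suffer loss 1(y_hat != y)
            mistakes += 1
            centers.append(x)                 # f_{t+1} = f_t + y_t k(x_t, .)
            coefs.append(float(y))
    return centers, coefs, mistakes

# two epochs over a well-separated toy stream
data = [(1.0, 1), (-1.0, -1), (2.0, 1), (-2.0, -1)] * 2
centers, coefs, mistakes = kernel_perceptron(data)
```

On this stream the algorithm makes two mistakes, both in the first epoch, and then classifies every point correctly; per the bound (5), the total number of mistakes is controlled by the hinge loss and the norm of any competitor $h$, with no step size anywhere in the update.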
In fact, the performance of the competitor $h$ is measured with a different loss (the hinge loss) than the performance of the algorithm (the misclassification loss). Because of this asymmetry, convergence when $\beta < \frac{1}{2}$ cannot be proved. Instead, we need an online algorithm whose regret bound scales as $O(\|h\|_K T^{\frac{1}{2}})$, returns the averaged solution, and, thanks to the equality in (4), obtains a convergence rate that depends on

$$\min_{\eta > 0} \ \|h\|^2_K (\eta T)^{-1} + \eta. \quad (6)$$

Note that (6) has the same form as the expression in (3), but with a minimum over $\eta$. Hence, we can expect such an algorithm to always have the optimal rate of convergence. In the next section, we present an algorithm with this guarantee.

4 PiSTOL: Parameter-free STOchastic Learning

In this section we describe the PiSTOL algorithm; its pseudo-code is in Algorithm 3. The algorithm builds on recent advances in unconstrained online learning [16, 18, 28]. It is very similar to an SGD algorithm [35], the main difference being the computation of the solution based on the past gradients, in line 4. Note that the calculation of $\|g_t\|^2_K$ can be done incrementally; hence, the computational complexity is the same as that of ASGD in an RKHS (Algorithm 1), that is, $O(d)$ in $\mathbb{R}^d$ and $O(t)$ in an RKHS. For the PiSTOL algorithm we have the following regret bound.

Theorem 1. Assume that the losses $\ell_t$ are convex and $L$-Lipschitz.
Let $a > 0$ be such that $a \ge 2.25 L$. Then, for any $h \in \mathcal{H}_K$, the following bound on the regret holds for the PiSTOL algorithm:

$$\sum_{t=1}^T \big[ \ell_t(f_t(x_t)) - \ell_t(h(x_t)) \big] \le \|h\|_K \sqrt{2a \Big( L + \sum_{t=1}^{T-1} |s_t| \Big) \log\Big( \frac{\|h\|_K \sqrt{a L T}}{b} + 1 \Big)} + b \, \phi(a^{-1} L) \log(1 + T),$$

where $\phi(x) := \frac{x}{2} \exp\big(\frac{x}{2}\big) \frac{\exp(\frac{x}{2})(x+1) + 2}{1 - x \exp(\frac{x}{2})}$.

This theorem shows that PiSTOL has the right dependency on $\|h\|_K$ and $T$ that was outlined in Sec. 3, and its regret bound is also optimal up to $\sqrt{\log \log T}$ terms [18]. Moreover, Theorem 1 improves on the results in [16, 18], obtaining an almost optimal regret that depends on the sum of the absolute values of the gradients, rather than on the time $T$. This is critical to obtain a tighter bound when the losses are $H$-smooth, as shown in the next corollary.

Algorithm 3 PiSTOL: Parameter-free STOchastic Learning.
  1: Parameters: $a, b, L > 0$
  2: Initialize: $g_0 = 0 \in \mathcal{H}_K$, $\alpha_0 = a L$
  3: for $t = 1, 2, \dots$ do
  4:   Set $f_t = g_{t-1} \frac{b}{\alpha_{t-1}} \exp\Big( \frac{\|g_{t-1}\|^2_K}{2 \alpha_{t-1}} \Big)$
  5:   Receive input vector $x_t \in \mathcal{X}$
  6:   Adversarial setting: Suffer loss $\ell_t(f_t(x_t))$
  7:   Receive subgradient $s_t \in \partial \ell_t(f_t(x_t))$
  8:   Update $g_t = g_{t-1} - s_t k(x_t, \cdot)$ and $\alpha_t = \alpha_{t-1} + a |s_t| \|k(x_t, \cdot)\|_K$
  9: end for
  10: Statistical setting: Return $\bar{f}_T = \frac{1}{T} \sum_{t=1}^T f_t$

Corollary 1.
Under the same assumptions of Theorem 1, if the losses $\ell_t$ are also $H$-smooth, then³

$$\sum_{t=1}^T \big[ \ell_t(f_t(x_t)) - \ell_t(h(x_t)) \big] = \tilde{O}\bigg( \max\bigg\{ \|h\|^{\frac{4}{3}}_K T^{\frac{1}{3}}, \ \|h\|_K T^{\frac{1}{4}} \Big( \sum_{t=1}^T \ell_t(h(x_t)) + 1 \Big)^{\frac{1}{4}} \bigg\} \bigg).$$

This bound shows that, if the cumulative loss of the competitor is small, the regret can grow slower than $\sqrt{T}$. It is worse than the regret bounds for smooth losses in [9, 25], because when the cumulative loss of the competitor is equal to 0 the regret still grows as $\tilde{O}(\|h\|^{4/3}_K T^{1/3})$ instead of being constant. However, the PiSTOL algorithm does not require prior knowledge of the norm of the competitor function $h$, as all the algorithms in [9, 25] do.

In [19], we also show a variant of PiSTOL for linear kernels with an almost optimal learning rate for each coordinate. Contrary to other similar algorithms, e.g. [14], it is a truly parameter-free one.

5 Convergence Results for PiSTOL

In this section we use the online-to-batch conversion to study the $\ell$-risk and the misclassification risk of the averaged solution of PiSTOL. We will also use the following definition: $\rho$ has Tsybakov noise exponent $q \ge 0$ [30] iff there exists $c_q > 0$ such that

$$P_{\mathcal{X}}(\{x \in \mathcal{X} : -s \le f_\rho(x) \le s\}) \le c_q s^q, \quad \forall s \in [0, 1]. \quad (7)$$

Setting $\alpha = \frac{q}{q+1} \in [0, 1]$ and $c_\alpha = c_q + 1$, condition (7) is equivalent [32, Lemma 6.1] to:

$$P_{\mathcal{X}}(\operatorname{sign}(f(x)) \neq f_c(x)) \le c_\alpha (R(f) - R(f_\rho))^\alpha, \quad \forall f \in L^2_{\rho_{\mathcal{X}}}. \quad (8)$$

These conditions allow for faster rates in relating the expected excess misclassification risk to the expected $\ell$-risk, as detailed in the following lemma, which is a special case of [3, Theorem 10].

Lemma 1.
Let $\ell : \mathbb{R} \to \mathbb{R}^+$ be a convex loss function, twice differentiable at $0$, with $\ell'(0) < 0$, $\ell''(0) > 0$, and with its smallest zero at $1$. Assume condition (8) is verified. Then for the averaged solution $\bar{f}_T$ returned by PiSTOL it holds that

$$\mathbb{E}[R(\bar{f}_T)] - R(f_c) \le \frac{32 \, c_\alpha}{C} \Big( \mathbb{E}[\mathcal{E}^\ell(\bar{f}_T)] - \mathcal{E}^\ell(f^\ell_\rho) \Big)^{\frac{1}{2-\alpha}}, \quad C = \min\Big\{ -\ell'(0), \frac{(\ell'(0))^2}{\ell''(0)} \Big\}.$$

The results in Sec. 4 give regret bounds over arbitrary sequences. We now assume to have a sequence of training samples $(x_t, y_t)_{t=1}^T$ drawn IID from $\rho$, and we want to train a predictor from this data that minimizes the $\ell$-risk. To obtain such a predictor we employ a so-called online-to-batch conversion [8]. For a convex loss $\ell$, we just need to run an online algorithm over the sequence of data $(x_t, y_t)_{t=1}^T$, using the losses $\ell_t(x) = \ell(y_t x)$, $\forall t = 1, \dots, T$. The online algorithm generates a sequence of solutions $f_t$, and the online-to-batch conversion is obtained by simply averaging all the solutions, $\bar{f}_T = \frac{1}{T} \sum_{t=1}^T f_t$, as for ASGD. The average regret bound of the online algorithm then becomes a convergence guarantee for the averaged solution [8]. Hence, for the averaged solution of PiSTOL, we have the following corollary, which is immediate from Corollary 1 and the results in [8].

³For brevity, the $\tilde{O}$ notation hides polylogarithmic terms.

Corollary 2. Assume that the samples $(x_t, y_t)_{t=1}^T$ are IID from $\rho$, and $\ell_t(x) = \ell(y_t x)$.
Then, under the assumptions of Corollary 1, the averaged solution of PiSTOL satisfies

$$\mathbb{E}[\mathcal{E}^\ell(\bar{f}_T)] \le \inf_{h \in \mathcal{H}_K} \mathcal{E}^\ell(h) + \tilde{O}\Big( \max\Big\{ \|h\|^{\frac{4}{3}}_K T^{-\frac{2}{3}}, \ \|h\|_K T^{-\frac{3}{4}} \big( T \mathcal{E}^\ell(h) + 1 \big)^{\frac{1}{4}} \Big\} \Big).$$

Hence, we have a $\tilde{O}(T^{-\frac{2}{3}})$ convergence rate to the $\ell$-risk of the best predictor in $\mathcal{H}_K$ if the best predictor has $\ell$-risk equal to zero, and $\tilde{O}(T^{-\frac{1}{2}})$ otherwise. Contrary to similar results in the literature, e.g. [25], we do not have to restrict the infimum to a ball of fixed radius in $\mathcal{H}_K$, and our bounds depend on $\tilde{O}(\|h\|_K)$ rather than $O(\|h\|^2_K)$, e.g. [35]. The advantage of not restricting the competitor to a ball is clear: the performance is always close to that of the best function in $\mathcal{H}_K$, regardless of its norm. The logarithmic terms are exactly the price we pay for not knowing the norm of the optimal solution in advance. For binary classification, using Lemma 1 we can also prove a $\tilde{O}(T^{-\frac{2}{3(2-\alpha)}})$ bound on the excess misclassification risk in the realizable setting, that is, if $f^\ell_\rho \in \mathcal{H}_K$.

It would be possible to obtain similar results with other algorithms, such as the one in [25], using a doubling-trick approach [9]. However, this would most likely result in an algorithm not useful in any practical application. Moreover, the doubling trick itself would not be trivial: for example, the one used in [28] achieves a suboptimal regret and requires the learning to restart from scratch over two different variables, further reducing its applicability in any real-world application.

As anticipated in Sec. 3, we now show that the dependency on $\tilde{O}(\|h\|_K)$ rather than on $O(\|h\|^2_K)$ gives us the optimal rates of convergence in the general case that $f^\ell_\rho \in L^\beta_K(L^2_{\rho_{\mathcal{X}}})$, without the need to tune any parameter. This is our main result.

Theorem 2. Assume that the samples $(x_t, y_t)_{t=1}^T$ are IID from $\rho$, (2) holds for $\beta \le \frac{1}{2}$, and $\ell_t(x) = \ell(y_t x)$.
Then, under the assumptions of Corollary 1, the averaged solution of PiSTOL satisfies:

- If $\beta \le \frac{1}{3}$, then $\mathbb{E}[\mathcal{E}^\ell(\bar{f}_T)] - \mathcal{E}^\ell(f^\ell_\rho) \le \tilde{O}\Big( \max\Big\{ \big( \mathcal{E}^\ell(f^\ell_\rho) + 1/T \big)^{\frac{\beta}{2\beta+1}} T^{-\frac{2\beta}{2\beta+1}}, \ T^{-\frac{2\beta}{\beta+1}} \Big\} \Big)$.

- If $\frac{1}{3} < \beta \le \frac{1}{2}$, then $\mathbb{E}[\mathcal{E}^\ell(\bar{f}_T)] - \mathcal{E}^\ell(f^\ell_\rho) \le \tilde{O}\Big( \max\Big\{ \big( \mathcal{E}^\ell(f^\ell_\rho) + 1/T \big)^{\frac{\beta}{2\beta+1}} T^{-\frac{2\beta}{2\beta+1}}, \ \big( \mathcal{E}^\ell(f^\ell_\rho) + 1/T \big)^{\frac{3\beta-1}{4\beta}} T^{-\frac{1}{2}}, \ T^{-\frac{2\beta}{\beta+1}} \Big\} \Big)$.

This theorem guarantees consistency w.r.t. the $\ell$-risk. The rate of convergence to the optimal $\ell$-risk is $\tilde{O}(T^{-\frac{3\beta}{2\beta+1}})$ if $\mathcal{E}^\ell(f^\ell_\rho) = 0$, and $\tilde{O}(T^{-\frac{2\beta}{2\beta+1}})$ otherwise. However, for any finite $T$ the rate of convergence is $\tilde{O}(T^{-\frac{2\beta}{\beta+1}})$ for any $T = O\big( \mathcal{E}^\ell(f^\ell_\rho)^{-\frac{\beta+1}{2\beta}} \big)$. In other words, we can expect a first regime of faster convergence, which saturates when the number of samples becomes big enough; see Fig. 2. This is particularly important because in practical applications the features and the kernel are often chosen to have good performance, meaning a low optimal $\ell$-risk. Using Lemma 1, we have that the excess misclassification risk is $\tilde{O}(T^{-\frac{2\beta}{(2\beta+1)(2-\alpha)}})$ if $\mathcal{E}^\ell(f^\ell_\rho) \neq 0$, and $\tilde{O}(T^{-\frac{2\beta}{(\beta+1)(2-\alpha)}})$ if $\mathcal{E}^\ell(f^\ell_\rho) = 0$.
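To make the procedure whose guarantees we have just analyzed concrete, here is a minimal dual-form sketch of Algorithm 3. We use the hinge loss, which is 1-Lipschitz (so $L = 1$) and hence satisfies the assumptions of Theorem 1; the Gaussian kernel (for which $\|k(x, \cdot)\|_K = 1$), the toy data, and the choices $a = 2.25L$, $b = 1$ are our illustrative assumptions, not the paper's experiments. The squared norm $\|g_t\|^2_K$ is maintained incrementally, as noted in Sec. 4:

```python
import math

def gauss_kernel(x, z, gamma=1.0):
    return math.exp(-gamma * (x - z) ** 2)   # K(x, x) = 1, so ||k(x, .)||_K = 1

def pistol(data, a=2.25, b=1.0, kernel=gauss_kernel):
    # g_t in dual form: g_t(.) = sum_s g_coefs[s] * K(centers[s], .)
    centers, g_coefs = [], []
    sq_norm, alpha = 0.0, a * 1.0            # ||g_0||_K^2 = 0, alpha_0 = a L, L = 1
    avg = []                                  # running dual coefs of f_1 + ... + f_T
    T = len(data)
    for x, y in data:
        # line 4: f_t = g_{t-1} (b / alpha_{t-1}) exp(||g_{t-1}||^2 / (2 alpha_{t-1}))
        scale = (b / alpha) * math.exp(sq_norm / (2.0 * alpha))
        gx = sum(c * kernel(z, x) for c, z in zip(g_coefs, centers))
        ft_x = scale * gx
        # subgradient of the hinge loss l_t(v) = max(1 - y v, 0) at v = f_t(x_t)
        s = -float(y) if y * ft_x < 1.0 else 0.0
        # accumulate f_t (dual coefs scale * g_coefs) into the average
        avg = [u + scale * c for u, c in zip(avg, g_coefs)]
        # line 8: g_t = g_{t-1} - s_t k(x_t, .), alpha_t = alpha_{t-1} + a |s_t|
        sq_norm += -2.0 * s * gx + s * s      # incremental ||g_t||^2, K(x, x) = 1
        alpha += a * abs(s)
        centers.append(x)
        g_coefs.append(-s)
        avg.append(0.0)
    return centers, [u / T for u in avg]      # averaged solution bar{f}_T

def predict(x, centers, coefs, kernel=gauss_kernel):
    return sum(c * kernel(z, x) for c, z in zip(coefs, centers))

data = [(1.0, 1), (-1.0, -1), (2.0, 1), (-2.0, -1)] * 5
centers, avg_coefs = pistol(data)
```

The key design point visible here is that no step size appears anywhere: the effective magnitude of $f_t$ is set by the data-dependent factor $\frac{b}{\alpha_{t-1}} \exp(\|g_{t-1}\|^2_K / (2\alpha_{t-1}))$, which grows as evidence accumulates in $g_t$ and shrinks as $\alpha_t$ records the observed gradients.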
It is also worth noting that, being designed to work in the adversarial setting, the algorithm can be expected to be robust to small deviations from the IID scenario.

Figure 2: Upper bound on the excess $\ell$-risk of PiSTOL for $\beta = \frac{1}{2}$, for $\mathcal{E}^\ell(f^\ell_\rho) \in \{0, 0.1, 1\}$.

Also, note that the guarantees of Corollary 2 and Theorem 2 hold simultaneously. Hence, the theoretical performance of PiSTOL is always better than that of SGD with the step sizes tuned with knowledge of $\beta$ or with the agnostic choice $\eta = O(T^{-\frac{1}{2}})$. In [19], we also show another convergence result assuming a different smoothness condition.

Regarding the optimality of our results, lower bounds for the square loss are known [27] under assumption (2), further assuming that the eigenvalues of $L_K$ have a polynomial decay, that is,

$$(\lambda_i)_{i \in \mathbb{N}} \sim i^{-b}, \quad b \ge 1. \quad (9)$$

Condition (9) can be interpreted as an effective dimension of the space. It always holds for $b = 1$ [27], and this is the condition we consider, usually denoted as capacity independent; see the discussion in [21, 33]. In the capacity-independent setting, the lower bound is $O(T^{-\frac{2\beta}{2\beta+1}})$, which matches the asymptotic rates in Theorem 2 up to logarithmic terms. Even if we require the loss function to be Lipschitz and smooth, it is unlikely that different lower bounds can be proved in our setting. Note that the lower bounds are worst case w.r.t. $\mathcal{E}^\ell(f^\ell_\rho)$; hence they do not cover the case $\mathcal{E}^\ell(f^\ell_\rho) = 0$, where we get even better rates.
Hence, the optimal regret bound of PiSTOL in Theorem 1 translates to an optimal convergence rate for its averaged solution, up to logarithmic terms, establishing a novel link between these two areas.

6 Related Work

The approach of stochastically minimizing the $\ell$-risk of the square loss in an RKHS was pioneered by [24]. The rates were improved, but still suboptimal, in [34], with a general approach for loss functions locally Lipschitz in the origin. The optimal bounds, matching the ones we obtain for $\mathcal{E}^\ell(f^\ell_\rho) \neq 0$, were obtained for $\beta > 0$ in expectation by [33]. Their rates also hold for $\beta > \frac{1}{2}$, while our rates, as the ones in [27], saturate at $\beta = \frac{1}{2}$. In [29], high-probability bounds were proved in the case $\frac{1}{2} \le \beta \le 1$. Note that, while in the range $\beta \ge \frac{1}{2}$, which implies $f_\rho \in \mathcal{H}_K$, it is possible to prove high-probability bounds [4, 7, 27, 29], the range $0 < \beta < \frac{1}{2}$ considered in this paper is very tricky, see the discussion in [27]. In this range no high-probability bounds are known without additional assumptions. All the previous approaches require the knowledge of $\beta$, while our algorithm is parameter-free. Also, we obtain faster rates for the excess $\ell$-risk when $\mathcal{E}^\ell(f^\ell_\rho) = 0$. Another important difference is that we can use any smooth and Lipschitz loss, useful for example to generate sparse solutions, while the optimal results in [29, 33] are specific to the square loss.

For finite-dimensional spaces and self-concordant losses, an optimal parameter-free stochastic algorithm has been proposed in [2]. However, the convergence result seems specific to finite dimension. The guarantees obtained from worst-case online algorithms, for example [25], have typically optimal convergence only w.r.t.
the performance of the best predictor in $\mathcal{H}_K$, see the discussion in [33]. Instead, all the guarantees on the misclassification loss w.r.t. a convex $\ell$-risk of a competitor, e.g. the Perceptron's guarantee, are inherently weaker than the presented ones. To see why, assume that the classifier returned by the algorithm after seeing $T$ samples is $f_T$; these bounds are of the form $R(f_T) \le \mathcal{E}^\ell(h) + O(T^{-\frac{1}{2}} (\|h\|^2_K + 1))$. For simplicity, assume the use of the hinge loss, so that easy calculations show that $f^\ell_\rho = f_c$ and $\mathcal{E}^\ell(f^\ell_\rho) = 2 R(f_c)$. Hence, even in the easy case that $f_c \in \mathcal{H}_K$, we have $R(f_T) \le 2 R(f_c) + O(T^{-\frac{1}{2}} (\|f_c\|^2_K + 1))$, i.e., no convergence to the Bayes risk.

In the batch setting, the same optimal rates were obtained by [4, 7] for the square loss, in high probability, for $\beta > \frac{1}{2}$. In [27], using an additional assumption on the infinity norm of the functions in $\mathcal{H}_K$, high-probability bounds are given also in the range $0 < \beta \le \frac{1}{2}$. The optimal tuning of the regularization parameter is achieved by cross-validation. Hence, we match the optimal rates of a batch algorithm, without the need to use validation methods.

In Sec. 3 we saw that the core idea to obtain the optimal rate was to have a classifier whose performance is close to the best regularized solution, where the regularizer is $\|h\|_K$. Changing the regularization term from the standard $\|h\|^2_K$ to $\|h\|^q_K$ with $q \ge 1$ is not new in the batch learning literature. It was first proposed for classification by [5], and for regression by [17]. Note that in both cases no computational methods to solve the optimization problem were proposed. Moreover, in [27] it was proved that all the regularizers of the form $\|h\|^q_K$ with $q \ge 1$ give optimal convergence rate bounds for the square loss, given an appropriate setting of the regularization weight.
In\nparticular, [27, Corollary 6] proves that, using the square loss and under assumptions (2) and (9),\nthe optimal weight for the regularizer (cid:107)h(cid:107)q\n. This implies a very important conse-\nquence, not mentioned in that paper: In the the capacity independent setting, that is b = 1, if we\nuse the regularizer (cid:107)h(cid:107)K, the optimal regularization weight is T \u2212 1\n2 , independent of the exponent of\nthe range space (1) where f\u03c1 belongs. Moreover, in the same paper it was argued that \u201cFrom an\nalgorithmic point of view however, q = 2 is currently the only feasible case, which in turn makes\nSVMs the method of choice\u201d. Indeed, in this paper we give a parameter-free ef\ufb01cient procedure to\n\n2 ((cid:107)h(cid:107)2\n\u03c1 = fc and E (cid:96)(f (cid:96)\n2 ((cid:107)fc(cid:107)2\n\nK + 1)), i.e. no convergence to the Bayes risk.\n\nK to (cid:107)h(cid:107)q\n\n\u2212 2\u03b2+q(1\u2212\u03b2)\n\n2\u03b2+2/b\n\nK is T\n\n7\n\n\fFigure 3: Average test errors and standard deviations of PiSTOL and SVM w.r.t.\nsamples over 5 random permutations, on a9a, SensIT Vehicle, and news20.binary.\n\nthe number of training\n\ntrain predictors with smooth losses, that implicitly uses the (cid:107)h(cid:107)K regularizer. Thanks to this, the\nregularization parameter does not need to be set using prior knowledge of the problem.\n\n7 Discussion\n\nBorrowing from OCO and statistical learning theory tools, we have presented the \ufb01rst parameter-\nfree stochastic learning algorithm that achieves optimal rates of convergence w.r.t. the smoothness\nof the optimal predictor. In particular, the algorithm does not require any validation method for the\nmodel selection, rather it automatically self-tunes in an online and data-dependent way.\nEven if this is mainly a theoretical work, we believe that it might also have a big potential in the\napplied world. 
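The two "easy calculations" invoked in Sec. 6 can be spelled out. The following short derivation is a sketch added for concreteness, assuming the standard notation η(x) = P(Y = 1 | x) for the conditional probability and f_c = sign(2η − 1) for the Bayes classifier, as defined in the earlier sections:

```latex
% Hinge loss \ell(z) = \max(0, 1-z): conditional \ell-risk at a point x,
% with \eta(x) = P(Y = 1 \mid x),
\mathbb{E}\bigl[\ell(Y f(x)) \mid x\bigr]
  = \eta(x)\,\bigl(1 - f(x)\bigr)_{+} + \bigl(1 - \eta(x)\bigr)\,\bigl(1 + f(x)\bigr)_{+},
% minimized at f(x) = \operatorname{sign}(2\eta(x) - 1) = f_c(x),
% with minimum value 2\min(\eta(x), 1 - \eta(x)). Taking expectations,
\mathbb{E}\,\ell(f_\rho^\ell)
  = \mathbb{E}\bigl[2\min(\eta, 1 - \eta)\bigr] = 2\,R(f_c).

% Exponent of the optimal regularization weight T^{-\frac{2\beta + q(1-\beta)}{2\beta + 2/b}}
% of [27, Corollary 6], evaluated at q = 1 (regularizer \|h\|_K)
% and b = 1 (capacity independent setting):
\frac{2\beta + q(1-\beta)}{2\beta + 2/b}\,\bigg|_{q=1,\ b=1}
  = \frac{2\beta + 1 - \beta}{2\beta + 2}
  = \frac{\beta + 1}{2(\beta + 1)} = \frac{1}{2}.
```

The first identity is what makes the Perceptron-style guarantee stall at 2R(f_c); the second confirms that, with the ‖h‖_K regularizer, the weight T^{−1/2} does not depend on β.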
Hence, as a proof of concept of the potential of this method, we have also run a few preliminary experiments, comparing the performance of PiSTOL to that of an SVM using 5-fold cross-validation to select the regularization parameter. The experiments were repeated with 5 random shuffles, reporting averages and standard deviations, over three datasets.⁴ The latest version of LIBSVM was used to train the SVMs [10]. PiSTOL closely tracks the performance of the tuned SVM when a Gaussian kernel is used. Also, contrary to the common intuition, the stochastic approach of PiSTOL seems to have an advantage over the tuned SVM when the number of samples is small. Probably, cross-validation is a poor approximation of the generalization performance in that regime, while the small sample regime does not affect the analysis of PiSTOL at all. Note that in the case of news20.binary, a linear kernel is used over vectors of size 1355192. The finite dimensional case is not covered by our theorems; still, we see that PiSTOL seems to converge at the same rate as the SVM, just with a worse constant. It is important to note that the 5-fold cross-validation plus the training with the selected parameter for the SVM on 58000 samples of SensIT Vehicle takes ∼6.5 hours in total, while our unoptimized Matlab implementation of PiSTOL takes less than 1 hour, ∼7 times faster. The gains in speed are similar on the other two datasets.

This is the first work we know of in this line of research on stochastic adaptive algorithms for statistical learning, hence many questions are still open. In particular, it is not clear if high probability bounds can be obtained, as the empirical results hint, without additional hypotheses. Also, we only proved convergence w.r.t. the ℓ-risk; however, for β ≥ 1/2 we know that f_ρ^ℓ ∈ H_K, hence it would be possible to prove the stronger convergence results on ‖f_T − f_ρ^ℓ‖_K, e.g. [29]. Probably this would require a major change in the proof techniques used. Finally, it is not clear if the regret bound in Theorem 1 can be improved to depend on the squared gradients. This would result in a Õ(T^{−1}) bound for the excess ℓ-risk for smooth losses when Eℓ(f_ρ^ℓ) = 0 and β = 1/2.

Acknowledgments

I am thankful to Lorenzo Rosasco for introducing me to the beauty of the operator L_K^β and to Brendan McMahan for fruitful discussions.

⁴ Datasets available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. The precise details to replicate the experiments are in [19].

References

[1] P. Auer, N. Cesa-Bianchi, and C. Gentile. Adaptive and self-confident on-line learning algorithms. J. Comput. Syst. Sci., 64(1):48–75, 2002.
[2] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In NIPS, pages 773–781, 2013.
[3] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, March 2006.
[4] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. J. Complexity, 23(1):52–72, February 2007.
[5] G. Blanchard, O. Bousquet, and P. Massart. Statistical performance of support vector machines. Ann. Statist., 36(2):489–531, 2008.
[6] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, volume 20, pages 161–168. NIPS Foundation, 2008.
[7] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[8] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. Inf. Theory, 50(9):2050–2057, 2004.
[9] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[10] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[11] K. Chaudhuri, Y. Freund, and D. J. Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems, pages 297–305, 2009.
[12] D.-R. Chen, Q. Wu, Y. Ying, and D.-X. Zhou. Support vector machine soft margin classifiers: Error analysis. Journal of Machine Learning Research, 5:1143–1175, 2004.
[13] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, New York, NY, USA, 2007.
[14] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[15] H. Luo and R. E. Schapire. A drifting-games analysis for online learning and applications to boosting. In Advances in Neural Information Processing Systems, 2014.
[16] H. B. McMahan and F. Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In COLT, 2014.
[17] S. Mendelson and J. Neeman. Regularization in kernel learning. Ann. Statist., 38(1):526–565, 2010.
[18] F. Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems 26, pages 1806–1814. Curran Associates, Inc., 2013.
[19] F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning, 2014. arXiv:1406.3816.
[20] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Stat., 22:400–407, 1951.
[21] L. Rosasco, A. Tacchetti, and S. Villa. Regularization by early stopping for online learning algorithms, 2014. arXiv:1405.0042.
[22] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
[23] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proc. of ICML, pages 807–814, 2007.
[24] S. Smale and Y. Yao. Online learning algorithms. Found. Comp. Math., 6:145–170, 2005.
[25] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23, pages 2199–2207. Curran Associates, Inc., 2010.
[26] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[27] I. Steinwart, D. R. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In COLT, 2009.
[28] M. Streeter and B. McMahan. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems 25, pages 2402–2410. Curran Associates, Inc., 2012.
[29] P. Tarrès and Y. Yao. Online learning as stochastic approximation of regularization paths, 2013. arXiv:1103.5538.
[30] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32:135–166, 2004.
[31] Y. Yao. On complexity issues of online learning algorithms. IEEE Trans. Inf. Theory, 56(12):6470–6481, 2010.
[32] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constr. Approx., 26:289–315, 2007.
[33] Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.
[34] Y. Ying and D.-X. Zhou. Online regularized classification algorithms. IEEE Trans. Inf. Theory, 52(11):4775–4788, 2006.
[35] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proc. of ICML, pages 919–926, New York, NY, USA, 2004. ACM.
[36] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proc. of ICML, pages 928–936, 2003.