{"title": "Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)", "book": "Advances in Neural Information Processing Systems", "page_first": 773, "page_last": 781, "abstract": "We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which includes machine learning methods based on the minimization of the empirical risk. We focus on problems without strong convexity, for which all previously known algorithms achieve a convergence rate for function values of $O(1/\\sqrt{n})$. We consider and analyze two algorithms that achieve a rate of $O(1/n)$ for classical supervised learning problems. For least-squares regression, we show that averaged stochastic gradient descent with constant step-size achieves the desired rate. For logistic regression, this is achieved by a simple novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running time complexity as stochastic gradient descent. For these algorithms, we provide a non-asymptotic analysis of the generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments showing that they often outperform existing approaches.", "full_text": "Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

Francis Bach
INRIA - Sierra Project-team
Ecole Normale Supérieure, Paris, France
francis.bach@ens.fr

Eric Moulines
LTCI
Telecom ParisTech, Paris, France
eric.moulines@enst.fr

Abstract

We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which includes machine learning methods based on the minimization of the empirical risk.
We focus on problems without strong convexity, for which all previously known algorithms achieve a convergence rate for function values of O(1/√n) after n iterations. We consider and analyze two algorithms that achieve a rate of O(1/n) for classical supervised learning problems. For least-squares regression, we show that averaged stochastic gradient descent with constant step-size achieves the desired rate. For logistic regression, this is achieved by a simple novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running-time complexity as stochastic gradient descent. For these algorithms, we provide a non-asymptotic analysis of the generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments showing that they often outperform existing approaches.

1 Introduction

Large-scale machine learning problems are becoming ubiquitous in many areas of science and engineering. Faced with large amounts of data, practitioners typically prefer algorithms that process each observation only once, or a few times. Stochastic approximation algorithms such as stochastic gradient descent (SGD) and its variants, although introduced more than sixty years ago [1], still remain the most widely used and studied methods in this context (see, e.g., [2, 3, 4, 5, 6, 7]).

We consider minimizing convex functions f, defined on a Euclidean space F, given by f(θ) = E[ℓ(y, ⟨θ, x⟩)], where (x, y) ∈ F × R denotes the data and ℓ denotes a loss function that is convex with respect to the second variable.
This includes logistic and least-squares regression. In the stochastic approximation framework, independent and identically distributed pairs (xn, yn) are observed sequentially, and the predictor defined by θ is updated after each pair is seen.

We partially understand the properties of f that affect the difficulty of the problem. Strong convexity (i.e., when f is twice differentiable, a uniform strictly positive lower bound µ on the Hessians of f) is a key property. Indeed, after n observations and with properly chosen step-sizes, averaged SGD achieves the rate O(1/(µn)) in the strongly-convex case [5, 4], while it achieves only O(1/√n) in the non-strongly-convex case [5], with matching lower bounds [8].

The main issue with strong convexity is that typical machine learning problems are high-dimensional and have correlated variables, so that the strong-convexity constant µ is zero or very close to zero, and in any case smaller than O(1/√n); the non-strongly-convex rates are then the better ones. In this paper, we aim at obtaining algorithms that may deal with arbitrarily small strong-convexity constants, but still achieve a rate of O(1/n).

Smoothness plays a central role in the context of deterministic optimization: the known convergence rates for smooth optimization are better than for non-smooth optimization (see, e.g., [9]). However, for stochastic optimization, the use of smoothness only leads to improvements in the constants (see, e.g., [10]), but not in the rate itself, which remains O(1/√n) for non-strongly-convex problems.

We show that for the square loss and for the logistic loss, we may use the smoothness of the loss and obtain algorithms that have a convergence rate of O(1/n) without any strong-convexity assumption. More precisely, for least-squares regression, we show in Section 2 that averaged stochastic gradient descent with constant step-size achieves the desired rate.
For logistic regression, this is achieved by a novel stochastic gradient algorithm that (a) constructs successive local quadratic approximations of the loss functions, while (b) preserving the same running-time complexity as stochastic gradient descent (see Section 3). For these algorithms, we provide a non-asymptotic analysis of their generalization error (in expectation, and also in high probability for least-squares), and run extensive experiments on standard machine learning benchmarks, showing in Section 4 that they often outperform existing approaches.

2 Constant-step-size least-mean-square algorithm

In this section, we consider stochastic approximation for least-squares regression, where SGD is often referred to as the least-mean-square (LMS) algorithm. The novelty of our convergence result is the use of a constant step-size with averaging, which was already considered by [11], but now with an explicit non-asymptotic rate O(1/n) without any dependence on the lowest eigenvalue of the covariance matrix.

2.1 Convergence in expectation

We make the following assumptions:
(A1) F is a d-dimensional Euclidean space, with d ≥ 1.
(A2) The observations (xn, zn) ∈ F × F are independent and identically distributed.
(A3) E‖xn‖² and E‖zn‖² are finite. Denote by H = E(xn ⊗ xn) the covariance operator from F to F. Without loss of generality, H is assumed invertible (by projecting onto the minimal subspace where xn lies almost surely); however, its eigenvalues may be arbitrarily small.
(A4) The global minimum of f(θ) = (1/2)E[⟨θ, xn⟩² − 2⟨θ, zn⟩] is attained at a certain θ∗ ∈ F. We denote by ξn = zn − ⟨θ∗, xn⟩xn the residual.
We have E[ξn] = 0, but in general it is not true that E[ξn | xn] = 0 (unless the model is well-specified).
(A5) We study the stochastic gradient (a.k.a. least-mean-square) recursion, started from θ0 ∈ F:
θn = θn−1 − γ(⟨θn−1, xn⟩xn − zn) = (I − γ xn ⊗ xn)θn−1 + γ zn.    (1)
We also consider the averaged iterates θ̄n = (n + 1)⁻¹ ∑_{k=0}^{n} θk.
(A6) There exist R > 0 and σ > 0 such that E[ξn ⊗ ξn] ≼ σ²H and E(‖xn‖² xn ⊗ xn) ≼ R²H, where ≼ denotes the order between self-adjoint operators, i.e., A ≼ B if and only if B − A is positive semi-definite.

Discussion of assumptions. Assumptions (A1-5) are standard in stochastic approximation (see, e.g., [12, 6]). Note that for least-squares problems, zn is of the form ynxn, where yn ∈ R is the response to be predicted as a linear function of xn. We consider a slightly more general case than least-squares because we will need it for the quadratic approximation of the logistic loss in Section 3.1. Note that in assumption (A4), we do not assume that the model is well-specified.

Assumption (A6) is true for least-squares regression with almost surely bounded data, since, if ‖xn‖² ≤ R² almost surely, then E(‖xn‖² xn ⊗ xn) ≼ E(R² xn ⊗ xn) = R²H; a similar inequality holds for the output variables yn. Moreover, it also holds for data with infinite support, such as Gaussians or mixtures of Gaussians (where all covariance matrices of the mixture components are lower- and upper-bounded by a constant times the same matrix).
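The averaged constant-step-size recursion of Eq. (1) is simple to implement. The following is a minimal NumPy sketch (our own illustration, not the authors' code; the toy data, seed and problem sizes are hypothetical, with zn = yn·xn as in the least-squares case):

```python
import numpy as np

def averaged_lms(xs, zs, gamma):
    """Constant-step-size LMS recursion of Eq. (1), with iterate averaging:
    theta_n = (I - gamma * x_n (x) x_n) theta_{n-1} + gamma * z_n,
    bar_theta_n = average of theta_0, ..., theta_n."""
    theta = np.zeros(xs.shape[1])  # theta_0 = 0
    theta_bar = theta.copy()       # running average of the iterates
    for n, (x, z) in enumerate(zip(xs, zs)):
        theta = theta - gamma * (np.dot(theta, x) * x - z)
        theta_bar += (theta - theta_bar) / (n + 2)  # n + 2 iterates seen so far
    return theta_bar

# Toy well-specified least-squares problem: z_n = y_n * x_n (hypothetical sizes).
rng = np.random.default_rng(0)
n, d = 5000, 5
theta_star = rng.standard_normal(d)
xs = rng.standard_normal((n, d))
ys = xs @ theta_star + 0.1 * rng.standard_normal(n)
zs = ys[:, None] * xs
R2 = np.mean(np.sum(xs ** 2, axis=1))                 # empirical proxy for R^2 = tr H
theta_bar = averaged_lms(xs, zs, gamma=1 / (4 * R2))  # step-size 1/(4 R^2), as in Theorem 1
```

With the step-size 1/(4R²), the excess risk of the averaged iterate decays as O(1/n) with no dependence on the spectrum of H, which is the content of Theorem 1 below.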
Note that the finite-dimensionality assumption could be relaxed, but this would require notions similar to degrees of freedom [13], which is outside the scope of this paper.

The goal of this section is to provide a non-asymptotic bound on the expectation E[f(θ̄n) − f(θ∗)] that (a) does not depend on the smallest non-zero eigenvalue of H (which could be arbitrarily small) and (b) still scales as O(1/n).

Theorem 1 Assume (A1-6). For any constant step-size γ < 1/R², we have
E[f(θ̄n−1) − f(θ∗)] ≤ (1/(2n)) [ σ√d / (1 − √(γR²)) + R‖θ0 − θ∗‖ / √(γR²) ]².    (2)
When γ = 1/(4R²), we obtain E[f(θ̄n−1) − f(θ∗)] ≤ (2/n) [σ√d + R‖θ0 − θ∗‖]².

Proof technique. We adapt and extend a proof technique from [14], which is based on non-asymptotic expansions in powers of γ. We also use a result from [2], which studied the recursion in Eq. (1) with xn ⊗ xn replaced by its expectation H. See [15] for details.

Optimality of bounds. Our bound in Eq. (2) leads to a rate of O(1/n), which is known to be optimal for least-squares regression (i.e., under reasonable assumptions, no algorithm, even more complex than averaged SGD, can have a better dependence in n) [16]. The term σ²d/n is also unimprovable.

Initial conditions. If γ is small, then the initial condition is forgotten more slowly. Note that with additional strong-convexity assumptions, the initial condition would be forgotten faster (exponentially fast without averaging), which is one of the traditional uses of constant-step-size LMS [17].

Specificity of constant step-sizes.
The non-averaged iterate sequence (θn) is a homogeneous Markov chain; under appropriate technical conditions, this Markov chain has a unique stationary (invariant) distribution, and the sequence of iterates (θn) converges in distribution to this invariant distribution; see [18, Chapter 17]. Denote by πγ the invariant distribution. Assuming that the Markov chain is Harris-recurrent, the ergodic theorem for Harris Markov chains shows that θ̄n−1 = n⁻¹ ∑_{k=0}^{n−1} θk converges almost surely to θ̄γ := ∫ θ πγ(dθ), which is the mean of the stationary distribution. Taking expectations on both sides of Eq. (1), we get E[θn] − θ∗ = (I − γH)(E[θn−1] − θ∗), which shows, using lim_{n→∞} E[θn] = θ̄γ, that Hθ̄γ = Hθ∗, and therefore θ̄γ = θ∗ since H is invertible. Under slightly stronger assumptions, it can be shown that
lim_{n→∞} n E[(θ̄n − θ∗)²] = Var_{πγ}(θ0) + 2 ∑_{k=1}^{∞} Cov_{πγ}(θ0, θk),
where Cov_{πγ}(θ0, θk) denotes the covariance of θ0 and θk when the Markov chain is started from stationarity. This implies that n E[f(θ̄n) − f(θ∗)] converges to a finite limit. This interpretation therefore explains why averaging produces a sequence of estimators converging pointwise to the solution θ∗, with E[f(θ̄n) − f(θ∗)] of order O(1/n).
Note that (a) our result is stronger, since it is independent of the lowest eigenvalue of H, and (b) for losses other than the quadratic loss, the same properties hold, except that the mean under the stationary distribution does not coincide with θ∗, and its distance to θ∗ is typically of order γ² (see Section 3).

2.2 Convergence in higher orders

We now consider an extra assumption in order to bound the p-th moment of the excess risk, and then obtain a high-probability bound. Let p be a real number greater than 1.
(A7) There exist R > 0, κ > 0 and τ ≥ σ > 0 such that, for all n ≥ 1, ‖xn‖² ≤ R² a.s., and
E‖ξn‖^p ≤ τ^p R^p and E[ξn ⊗ ξn] ≼ σ²H,    (3)
∀z ∈ F, E⟨z, xn⟩⁴ ≤ κ (E⟨z, xn⟩²)² = κ ⟨z, Hz⟩².    (4)

The last condition, in Eq. (4), says that the kurtosis of the projection of the covariates xn on any direction z ∈ F is bounded. Note that computing the constant κ happens to be equivalent to the optimization problem solved by the FastICA algorithm [19], which thus provides an estimate of κ. In Table 1, we provide such an estimate for the non-sparse datasets which we have used in experiments, while we consider only directions z along the axes for the high-dimensional sparse datasets. For these datasets, where a given variable is equal to zero except for a few observations, κ is typically quite large. Adapting and analyzing normalized LMS techniques [20] in this set-up is likely to improve the theoretical robustness of the algorithm (but note that the results in expectation from Theorem 1 do not use κ). The next theorem provides a bound for the p-th moment of the excess risk.

Theorem 2 Assume (A1-7).
For any real p > 1, and for a step-size γ ≤ 1/(12pκR²), we have
(E|f(θ̄n−1) − f(θ∗)|^p)^{1/p} ≤ (p/(2n)) [ 7τ√d + R‖θ0 − θ∗‖ √(3 + 2/(γpR²)) ]².    (5)
For γ = 1/(12pκR²), we get: (E|f(θ̄n−1) − f(θ∗)|^p)^{1/p} ≤ (p/(2n)) [ 7τ√d + 6√κ R‖θ0 − θ∗‖ ]².

Note that to control the p-th order moment, a smaller step-size is needed, which scales as 1/p. We can now provide a high-probability bound; the tails decay polynomially with exponent 1/(12γκR²), so the smaller the step-size γ, the lighter the tails.

Corollary 1 For any step-size such that γ ≤ 1/(12κR²), and any δ ∈ (0, 1),
P( f(θ̄n−1) − f(θ∗) ≥ [7τ√d + R‖θ0 − θ∗‖(√3 + √(24κ))]² / (24γκR² n δ^{12γκR²}) ) ≤ δ.    (6)

3 Beyond least-squares: M-estimation

In Section 2, we have shown that for least-squares regression, averaged SGD achieves a convergence rate of O(1/n) with no assumption regarding strong convexity. For all losses, with a constant step-size γ, the stationary distribution πγ corresponding to the homogeneous Markov chain (θn) does always satisfy ∫ f′(θ) πγ(dθ) = 0, where f is the generalization error.
When the gradient f′ is linear (i.e., f is quadratic), this implies that f′(∫ θ πγ(dθ)) = 0, i.e., the averaged recursion converges pathwise to θ̄γ = ∫ θ πγ(dθ), which coincides with the optimal value θ∗ (defined through f′(θ∗) = 0). When the gradient f′ is no longer linear, then ∫ f′(θ) πγ(dθ) ≠ f′(∫ θ πγ(dθ)). Therefore, for general M-estimation problems, we should expect the averaged sequence to still converge at rate O(1/n), but to the mean θ̄γ of the stationary distribution rather than to the optimal predictor θ∗. Typically, the average distance between θn and θ∗ is of order γ (see Section 4 and [21]), while for the averaged iterates, which converge pointwise to θ̄γ, it is of order γ² for strongly convex problems under some additional smoothness conditions on the loss functions (these are satisfied, for example, by the logistic loss [22]).

Since quadratic functions may be optimized with rate O(1/n) under weak conditions, we are going to use a quadratic approximation around a well-chosen support point, which shares some similarity with the Newton procedure (however, with a non-trivial adaptation to the stochastic approximation framework).
The Newton step for f around a certain point θ̃ is equivalent to minimizing a quadratic surrogate g of f around θ̃, i.e., g(θ) = f(θ̃) + ⟨f′(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, f′′(θ̃)(θ − θ̃)⟩. If fn(θ) = ℓ(yn, ⟨θ, xn⟩), then g(θ) = E gn(θ), with gn(θ) = f(θ̃) + ⟨f′n(θ̃), θ − θ̃⟩ + (1/2)⟨θ − θ̃, f′′n(θ̃)(θ − θ̃)⟩; the Newton step may thus be solved approximately with stochastic approximation (here, constant-step-size LMS), with the following recursion:
θn = θn−1 − γ g′n(θn−1) = θn−1 − γ [f′n(θ̃) + f′′n(θ̃)(θn−1 − θ̃)].    (7)
This is equivalent to replacing the gradient f′n(θn−1) by its first-order approximation around θ̃. A crucial point is that for machine learning scenarios where fn is a loss associated to a single data point, its complexity is only twice that of a regular stochastic approximation step, since, with fn(θ) = ℓ(yn, ⟨xn, θ⟩), f′′n(θ) is a rank-one matrix.

Choice of support points for quadratic approximation. An important aspect is the choice of the support point θ̃. In this paper, we consider two strategies:
– Two-step procedure: for convex losses, averaged SGD with a step-size decaying as O(1/√n) achieves a rate (up to logarithmic terms) of O(1/√n) [5, 6]. We may thus use it to obtain a first decent estimate.
The two-stage procedure is as follows (and uses 2n observations): n steps of averaged SGD with constant step-size γ ∝ 1/√n to obtain θ̃, and then n steps of averaged LMS for the Newton step around θ̃. As shown below, this algorithm achieves the rate O(1/n) for logistic regression; however, it is not the most efficient in practice.
– Support point = current averaged iterate: we simply consider the current averaged iterate θ̄n−1 as the support point θ̃, leading to the recursion:
θn = θn−1 − γ [f′n(θ̄n−1) + f′′n(θ̄n−1)(θn−1 − θ̄n−1)].    (8)
Although this algorithm has proved to be the most efficient in practice (see Section 4), we currently have no proof of convergence. Given that the behavior of the algorithms does not change much when the support point is updated less frequently than at every iteration, there may be some connections to two-time-scale algorithms (see, e.g., [23]). In Section 4, we also consider several other strategies based on doubling tricks.

Interestingly, for non-quadratic functions, our algorithm imposes a new bias (by replacing the true gradient by an approximation which is only valid close to θ̄n−1) in order to reach faster convergence (due to the linearity of the underlying gradients).

Relationship with one-step estimators. One-step estimators (see, e.g., [24]) typically take any estimator with O(1/n)-convergence rate, and make a full Newton step to obtain an efficient estimator (i.e., one that achieves the Cramér-Rao lower bound).
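The recursion of Eq. (8), with the current averaged iterate as support point, can be sketched for the logistic loss as follows (a minimal NumPy illustration under our own conventions, not the authors' implementation; here ℓ(y, ŷ) = log(1 + e^{−yŷ}), and the toy data, seed and step-size choice are hypothetical):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def online_newton_logistic(xs, ys, gamma):
    """Recursion of Eq. (8) for the logistic loss: SGD on the local quadratic
    surrogate of f_n built around the current averaged iterate bar_theta_{n-1}.
    Since f''_n is the rank-one matrix l''(.) x_n (x) x_n, each step costs only
    about twice a plain SGD step."""
    theta = np.zeros(xs.shape[1])
    theta_bar = theta.copy()
    for n, (x, y) in enumerate(zip(xs, ys)):
        s = np.dot(x, theta_bar)                # prediction at the support point
        lp = -y * sigmoid(-y * s)               # l'(y, s)
        lpp = sigmoid(y * s) * sigmoid(-y * s)  # l''(y, s) <= 1/4
        # surrogate gradient at theta_{n-1}: [l' + l'' <x, theta - bar_theta>] x
        theta = theta - gamma * (lp + lpp * np.dot(x, theta - theta_bar)) * x
        theta_bar += (theta - theta_bar) / (n + 2)
    return theta_bar

# Toy well-specified logistic model (hypothetical sizes and seed).
rng = np.random.default_rng(1)
n, d = 4000, 5
theta_star = rng.standard_normal(d)
xs = rng.standard_normal((n, d))
ys = np.where(rng.random(n) < sigmoid(xs @ theta_star), 1.0, -1.0)
R2 = np.mean(np.sum(xs ** 2, axis=1))
theta_bar = online_newton_logistic(xs, ys, gamma=1.0 / R2)
```

Each iteration needs only two inner products with xn (at θ̄n−1 and at θn−1), which is the "twice the cost of SGD" property noted above.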
Although our novel algorithm is largely inspired by one-step estimators, our situation is slightly different, since our first estimator has only convergence rate O(1/√n) and is estimated on different observations.

3.1 Self-concordance and logistic regression

We make the following assumptions:
(B1) F is a d-dimensional Euclidean space, with d ≥ 1.
(B2) The observations (xn, yn) ∈ F × {−1, 1} are independent and identically distributed.
(B3) We consider f(θ) = E[ℓ(yn, ⟨xn, θ⟩)], with the following assumptions on the loss function ℓ (whenever we take derivatives of ℓ, this is with respect to the second variable):
∀(y, ŷ) ∈ {−1, 1} × R,  |ℓ′(y, ŷ)| ≤ 1,  ℓ′′(y, ŷ) ≤ 1/4,  |ℓ′′′(y, ŷ)| ≤ ℓ′′(y, ŷ).
We denote by θ∗ a global minimizer of f, which we thus assume to exist, and we denote by H = f′′(θ∗) the Hessian operator at a global optimum θ∗.
(B4) We assume that there exist R > 0, κ > 0 and ρ > 0 such that ‖xn‖² ≤ R² almost surely, and
E[xn ⊗ xn] ≼ ρ E[ℓ′′(yn, ⟨θ∗, xn⟩) xn ⊗ xn] = ρH,    (9)
∀z ∈ F, ∀θ ∈ F,  E[ℓ′′(yn, ⟨θ, xn⟩)² ⟨z, xn⟩⁴] ≤ κ (E[ℓ′′(yn, ⟨θ, xn⟩) ⟨z, xn⟩²])².    (10)

Assumption (B3) is satisfied for the logistic loss and extends to all generalized linear models (see more details in [22]), and the relationship between the third derivative and second derivative of the loss ℓ is often referred to as self-concordance (see [9, 25] and references therein).
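The bounds in (B3) can be checked in closed form for the logistic loss ℓ(y, ŷ) = log(1 + e^{−yŷ}): writing p = σ(−yŷ) with σ the sigmoid, one has ℓ′ = −yp, ℓ′′ = p(1 − p) and ℓ′′′ = −y(1 − 2p)p(1 − p), so that |ℓ′| ≤ 1, ℓ′′ ≤ 1/4 and |ℓ′′′| ≤ ℓ′′. A short numerical sanity check (our own addition, not part of the paper):

```python
import numpy as np

def logistic_derivatives(y, yhat):
    """First three derivatives in yhat of l(y, yhat) = log(1 + exp(-y * yhat))."""
    p = 1.0 / (1.0 + np.exp(y * yhat))  # p = sigmoid(-y * yhat)
    l1 = -y * p
    l2 = p * (1.0 - p)
    l3 = -y * (1.0 - 2.0 * p) * p * (1.0 - p)
    return l1, l2, l3

# Check the three inequalities of (B3) on a grid of margins, for both labels.
yhats = np.linspace(-20.0, 20.0, 2001)
for y in (-1.0, 1.0):
    l1, l2, l3 = logistic_derivatives(y, yhats)
    assert np.all(np.abs(l1) <= 1.0)         # |l'|   <= 1
    assert np.all(l2 <= 0.25 + 1e-12)        # l''    <= 1/4
    assert np.all(np.abs(l3) <= l2 + 1e-12)  # |l'''| <= l''  (self-concordance-type bound)
```

The last inequality is the self-concordance-type property that drives the analysis of the Newton step below.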
Note moreover that we must have ρ ≥ 4 and κ ≥ 1. A loose upper bound for ρ is 1/ inf_n ℓ′′(yn, ⟨θ∗, xn⟩), but in practice it is typically much smaller (see Table 1). The condition in Eq. (10) is hard to check because it is uniform in θ. With a slightly more complex proof, we could restrict θ to be close to θ∗; with such constraints, the value of κ we have found is close to the one from Section 2.2 (i.e., without the terms in ℓ′′(yn, ⟨θ, xn⟩)).

Theorem 3 Assume (B1-4), and consider the vector ζn obtained as follows: (a) perform n steps of averaged stochastic gradient descent with constant step-size 1/(2R²√n) to get θ̃n, and (b) perform n steps of averaged LMS with constant step-size 1/R² for the quadratic approximation of f around θ̃n. If n ≥ (19 + 9R‖θ0 − θ∗‖)⁴, then
E f(ζn) − f(θ∗) ≤ (κ^{3/2} ρ³ d / n) (16R‖θ0 − θ∗‖ + 19)⁴.    (11)

We get an O(1/n) convergence rate without assuming strong convexity, even locally, thus improving on results from [22], where the rate is proportional to 1/(n λmin(H)). The proof relies on self-concordance properties and the sharp analysis of the Newton step (see [15] for details).

4 Experiments
4.1 Synthetic data
Least-mean-square algorithm. We consider normally distributed inputs, with a covariance matrix H that has random eigenvectors and eigenvalues 1/k, k = 1, . . . , d. The outputs are generated from a linear function with homoscedastic noise, with unit signal-to-noise ratio. We consider d = 20 and the least-mean-square algorithm with several settings of the step-size γn, constant or proportional to 1/√n. Here R² denotes the average radius of the data, i.e., R² = tr H.
In the left plot of Figure 1, we show the results, averaged over 10 replications.

Without averaging, the algorithm with constant step-size does not converge pointwise (it oscillates), and its average excess risk decays as a linear function of γ (indeed, the gap between the curves for successive values of the constant step-size is close to log10(4), which corresponds to a linear function in γ).

Figure 1: Synthetic data. Left: least-squares regression. Middle: logistic regression with averaged SGD with various step-sizes, averaged (plain) and non-averaged (dashed). Right: various Newton-based schemes for the same logistic regression problem. Best seen in color; see text for details.

With averaging, the algorithm with constant step-size does converge at rate O(1/n), and for all values of the constant γ, the rate is actually the same. Moreover (although it is not shown in the plots), the standard deviation is much lower.

With decaying step-size γn = 1/(2R²√n) and without averaging, the convergence rate is O(1/√n), and it improves to O(1/n) with averaging.

Logistic regression. We consider the same input data as for least-squares, but now generate the outputs from the logistic probabilistic model.
We compare several algorithms and display the results in Figure 1 (middle and right plots).

On the middle plot, we consider SGD: without averaging, the algorithm with constant step-size does not converge, and its average excess risk reaches a constant value which is a linear function of γ (indeed, the gap between the curves for successive values of the constant step-size is close to log10(4)). With averaging, the algorithm does converge, but, as opposed to least-squares, to a point which is not the optimal solution, with an error proportional to γ² (the gap between curves is twice as large).

On the right plot, we consider various variations of our online Newton-approximation scheme. The "2-step" algorithm is the one for which our convergence rate holds (n being the total number of examples, we perform n/2 steps of averaged SGD, then n/2 steps of LMS). Not surprisingly, it is not the best in practice (in particular at n/2, when starting the constant-step-size LMS, the performance worsens temporarily). It is classical to use doubling tricks to remedy this problem while preserving convergence rates [26]; this is done in "2-step-dbl.", which avoids the previous erratic behavior.

We have also considered getting rid of the first stage, where plain averaged stochastic gradient descent is used to obtain a support point for the quadratic approximation. We now consider only Newton steps, but change the support points: we consider updating the support point at every iteration, i.e., the recursion from Eq. (8), and we also consider updating it at every dyadic point ("dbl.-approx"). The last two algorithms perform very similarly and achieve the O(1/n) rate early. In all experiments on real data, we have considered the simplest variant (which corresponds to Eq. (8)).

4.2 Standard benchmarks

We have considered 6 benchmark datasets which are often used for comparing large-scale optimization methods.
The datasets are described in Table 1 and vary in values of d, n and sparsity level. These are all finite binary classification datasets with outputs in {−1, 1}. For least-squares and logistic regression, we have followed the following experimental protocol: (1) remove all outliers (i.e., sample points xn whose norm is greater than 5 times the average norm); (2) divide the dataset in two equal parts, one for training, one for testing; (3) sample within the training dataset with replacement, for 100 times the number of observations in the training set (this corresponds to 100 effective passes; in all plots, a black dashed line marks the first effective pass); (4) compute averaged costs on training and testing data (based on 10 replications). All costs are shown in log-scale, normalized so that the first iteration leads to f(θ0) − f(θ∗) = 1.

All algorithms that we consider (ours and others) have a step-size, and typically a theoretical value that ensures convergence. We consider two settings: (1) one where this theoretical value is used, and (2) one with the step-size giving the best testing error after one effective pass through the data (testing powers of 4 times the theoretical step-size).

Here, we only consider covertype, alpha, sido and news, as well as test errors. For all training errors and the two other datasets (quantum, rcv1), see [15].

Least-squares regression. We compare three algorithms: averaged SGD with constant step-size, averaged SGD with step-size decaying as C/(R²√n), and the stochastic averaged gradient (SAG) method, which is dedicated to finite training datasets [27] and has shown state-of-the-art performance in this set-up. We show the results in the two left plots of Figure 2 and Figure 3.

Averaged SGD with decaying step-size equal to C/(R²√n) is slowest (except for sido).
In particular, when the best constant C is used (right columns), the performance typically starts to increase significantly. With that step-size, even after 100 passes, there is no sign of overfitting, even for the high-dimensional sparse datasets.

SAG and constant-step-size averaged SGD exhibit the best behavior, for both the theoretical step-sizes and the best constants, with a significant advantage for constant-step-size SGD. The non-sparse datasets do not lead to overfitting, even close to the global optimum of the (unregularized) training objectives, while the sparse datasets do exhibit some overfitting after more than 10 passes.

Logistic regression. We also compare two additional algorithms: our Newton-based technique and "Adagrad" [7], which is a stochastic gradient method with a form of diagonal scaling¹ that allows one to reduce the constants in the convergence rate (which remains, in theory, proportional to O(1/√n)). We show results in the two right plots of Figure 2 and Figure 3.

Averaged SGD with decaying step-size proportional to 1/(R²√n) shows the same behavior as for least-squares (step-size harder to tune, always inferior performance, except for sido).

SAG, constant-step-size SGD and the novel Newton technique tend to behave similarly (good with theoretical step-size, always among the best methods).
They differ notably in some aspects: (1) SAG converges quicker in training error (shown in [15]), while it is a bit slower in testing error; (2) in some instances, constant-step-size averaged SGD does underfit (covertype, alpha, news), which is consistent with the lack of convergence to the global optimum mentioned earlier; (3) the novel online Newton algorithm is consistently better.

On the non-sparse datasets, Adagrad performs similarly to the Newton-type method (often better in early iterations and worse later), except for the alpha dataset, where the step-size is harder to tune (the best step-size tends to have early iterations that make the cost go up significantly). On sparse datasets like rcv1, the performance is essentially the same as Newton. On the sido dataset, Adagrad (with fixed step-size, left column) achieves a good testing loss quickly, then levels off, for reasons we cannot explain. On the news dataset, it is inferior without parameter tuning and a bit better with it. Adagrad uses a diagonal rescaling; it could be combined with our technique, and early experiments show that this improves results, but that it is more sensitive to the choice of step-size.

Overall, even with d and κ very large (where our bounds are vacuous), the performance of our algorithm still achieves the state of the art, while being more robust to the selection of the step-size; finer quantities like degrees of freedom [13] should be able to quantify more accurately the quality of the new algorithms.

5 Conclusion

In this paper, we have presented two stochastic approximation algorithms that can achieve rates of O(1/n) for logistic and least-squares regression, without strong-convexity assumptions. Our analysis reinforces the key role of averaging in obtaining fast rates, in particular with large step-sizes.
Our work can naturally be extended in several ways: (a) an analysis of the algorithm that updates the support point of the quadratic approximation at every iteration; (b) proximal extensions (easy to implement, but potentially harder to analyze); (c) adaptive ways to find the constant step-size; (d) step-sizes that depend on the iterates to increase robustness, like in normalized LMS [20]; and (e) non-parametric analysis to improve our theoretical results for large values of d.

Acknowledgements. Francis Bach was partially supported by the European Research Council (SIERRA Project). We thank Aymeric Dieuleveut and Nicolas Flammarion for helpful discussions.

¹Since a bound on ‖θ∗‖ is not available, we have used step-sizes proportional to 1/sup_n ‖x_n‖_∞.

Table 1: Datasets used in our experiments. We report the proportion of non-zero entries, as well as estimates for the constants κ and ρ used in our theoretical results, together with the non-sharp constant typically used in analyses of logistic regression, which our analysis avoids (these are computed for non-sparse datasets only; × marks omitted entries).

Name      |       n |         d | sparsity |       κ |   ρ | 1/inf_n ℓ''(y_n, ⟨θ∗, x_n⟩)
----------|---------|-----------|----------|---------|-----|----------------------------
quantum   |  50 000 |        79 |    100 % | 5.8×10² |  16 | 8.5×10²
covertype | 581 012 |        55 |    100 % | 9.6×10² | 160 | 3×10¹²
alpha     | 500 000 |       501 |    100 % |       6 |  18 | 8×10⁴
sido      |  12 678 |     4 933 |     10 % | 1.3×10⁴ |   × | ×
rcv1      |  20 242 |    47 237 |    0.2 % |   2×10⁴ |   × | ×
news      |  19 996 | 1 355 192 |   0.03 % |   2×10⁴ |   × | ×

[Figure 2 plots omitted in this extraction: panels "{covertype, alpha} {square, logistic} {C=1, C=opt} test", each showing log₁₀[f(θ) − f(θ∗)] against log₁₀(n) for the step-sizes 1/R² and 1/(R²√n) (or their tuned constants C/R² and C/(R²√n)), SAG, and, for logistic regression, Adagrad and Newton.]

Figure 2: Test performance for least-squares regression (two left plots) and logistic regression (two right plots). From top to bottom: covertype, alpha. Left: theoretical step-sizes; right: step-sizes optimized for performance after one effective pass through the data. Best seen in color.

[Figure 3 plots omitted in this extraction: the corresponding panels for sido and news.]

Figure 3: Test performance for least-squares regression (two left plots) and logistic regression (two right plots). From top to bottom: sido, news. Left: theoretical step-sizes; right: step-sizes optimized for performance after one effective pass through the data. Best seen in color.

References

[1] H. Robbins and S. Monro. A stochastic approximation method.
The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[2] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
[3] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
[4] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. ICML, 2007.
[5] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[6] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Adv. NIPS, 2011.
[7] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[8] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983.
[9] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, 2004.
[10] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
[11] L. Györfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31–61, 1996.
[12] H. J. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, second edition, 2003.
[13] C. Gu. Smoothing Spline ANOVA Models. Springer, 2002.
[14] R. Aguech, E. Moulines, and P. Priouret. On a perturbation approach for the analysis of stochastic tracking algorithms. SIAM Journal on Control and Optimization, 39(3):872–899, 2000.
[15] F. Bach and E. Moulines.
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.
[16] A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.
[17] O. Macchi. Adaptive Processing: The Least Mean Squares Approach with Applications in Transmission. Wiley, West Sussex, 1995.
[18] S. Meyn and R. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, 2009.
[19] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7):1483–1492, 1997.
[20] N. J. Bershad. Analysis of the normalized LMS algorithm with Gaussian inputs. IEEE Transactions on Acoustics, Speech and Signal Processing, 34(4):793–806, 1986.
[21] A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 263–304, 2000.
[22] F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Technical Report 00804431-v2, HAL, 2013.
[23] V. S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.
[24] A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
[25] F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
[26] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proc. COLT, 2011.
[27] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Technical Report 00860051, HAL, 2013.