{"title": "Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes", "book": "Advances in Neural Information Processing Systems", "page_first": 8114, "page_last": 8124, "abstract": "We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix.\nWe illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.", "full_text": "Statistical Optimality of Stochastic Gradient Descent\non Hard Learning Problems through Multiple Passes\n\nLoucas Pillaud-Vivien\n\nINRIA - Ecole Normale Sup\u00e9rieure\n\nPSL Research University\n\nloucas.pillaud-vivien@inria.fr\n\nAlessandro Rudi\n\nINRIA - Ecole Normale Sup\u00e9rieure\n\nPSL Research University\n\nalessandro.rudi@inria.fr\n\nFrancis Bach\n\nINRIA - Ecole Normale Sup\u00e9rieure\n\nPSL Research University\nfrancis.bach@inria.fr\n\nAbstract\n\nWe consider stochastic gradient descent (SGD) for least-squares regression with\npotentially several passes over the data. 
While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while a single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.\n\n1 Introduction\n\nStochastic gradient descent (SGD) and its multiple variants—averaged [1], accelerated [2], variance-reduced [3, 4, 5]—are the workhorses of large-scale machine learning, because (a) these methods look at only a few observations before updating the corresponding model, and (b) they are known in theory and in practice to generalize well to unseen data [6].\nBeyond the choice of step-size (often referred to as the learning rate), the number of passes to make on the data remains an important practical and theoretical issue. In the context of finite-dimensional models (least-squares regression or logistic regression), the theoretical answer has been known for many years: a single pass suffices for the optimal statistical performance [1, 7]. 
Worse, most of the theoretical work applies only to single-pass algorithms, with some exceptions leading to analyses of multiple passes when the step-size is taken smaller than the best known setting [8, 9].\nHowever, in practice, multiple passes are always performed as they empirically lead to better generalization (e.g., loss on unseen test data) [6]. But no analysis so far has been able to show that, given the appropriate step-size, multiple-pass SGD was theoretically better than single-pass SGD.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nThe main contribution of this paper is to show that for least-squares regression, while single-pass averaged SGD is optimal for a certain class of “easy” problems, multiple passes are needed to reach optimal prediction performance on another class of “hard” problems.\nIn order to define and characterize these classes of problems, we need to use tools from infinite-dimensional models which are common in the analysis of kernel methods. De facto, our analysis will be done in infinite-dimensional feature spaces, and for finite-dimensional problems where the dimension far exceeds the number of samples, using these tools is the only way to obtain non-vacuous dimension-independent bounds. Thus, overall, our analysis applies both to finite-dimensional models with explicit features (parametric estimation), and to kernel methods (non-parametric estimation).\nThe two important quantities in the analysis are:\n\n(a) The decay of eigenvalues of the covariance matrix Σ of the input features, so that the ordered eigenvalues λ_m decay as O(m^{-α}); the parameter α ≥ 1 characterizes the size of the feature space, α = 1 corresponding to the largest feature spaces and α = +∞ to finite-dimensional spaces. 
The decay will be measured through tr Σ^{1/α} = ∑_m λ_m^{1/α}, which is small when the decay of eigenvalues is faster than O(m^{-α}).\n\n(b) The complexity of the optimal predictor θ* as measured through the covariance matrix Σ, that is, with coefficients ⟨e_m, θ*⟩ in the eigenbasis (e_m)_m of the covariance matrix that decay so that ⟨θ*, Σ^{1-2r} θ*⟩ is small. The parameter r > 0 characterizes the difficulty of the learning problem: r = 1/2 corresponds to characterizing the complexity of the predictor through the squared norm ‖θ*‖², and thus r close to zero corresponds to the hardest problems while r larger, and in particular r > 1/2, corresponds to simpler problems.\n\nDealing with non-parametric estimation provides a simple way to evaluate the optimality of learning procedures. Indeed, given problems with parameters r and α, the best prediction performance (averaged square loss on unseen data) is well known [10] and decays as O(n^{-2rα/(2rα+1)}), with α = +∞ leading to the usual parametric rate O(n^{-1}). For easy problems, that is, for which r ≥ (α-1)/(2α), it is known that most iterative algorithms achieve this optimal rate of convergence (but with various running-time complexities), such as exact regularized risk minimization [11], gradient descent on the empirical risk [12], or averaged stochastic gradient descent [13].\nWe show that for hard problems, that is, for which r ≤ (α-1)/(2α) (see Example 1 for a typical hard problem), multiple passes are superior to a single pass. More precisely, under additional assumptions detailed in Section 2 that will lead to a subset of the hard problems, with Θ(n^{(α-1-2rα)/(1+2rα)}) passes, we achieve the optimal statistical performance O(n^{-2rα/(2rα+1)}), while for all other hard problems, a single pass only achieves O(n^{-2r}). 
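As a quick sanity check on these exponents, the following minimal Python sketch (our own illustration, not from the paper) evaluates the excess-risk rate exponent and the predicted number-of-passes exponent t*(n)/n = Θ(n^{(α-1-2rα)/(1+2rα)}) for a hard problem:

```python
# Illustration (not from the paper): predicted scalings as a function of
# the capacity parameter alpha and the source parameter r.

def optimal_rate_exponent(r, alpha):
    # Excess-risk rate O(n^{-2 r alpha / (2 r alpha + 1)})
    return -2 * r * alpha / (2 * r * alpha + 1)

def passes_exponent(r, alpha):
    # Number of passes Theta(n^{(alpha - 1 - 2 r alpha) / (1 + 2 r alpha)})
    return (alpha - 1 - 2 * r * alpha) / (1 + 2 * r * alpha)

alpha = 3.0
r = 1 / (2 * alpha)  # hard problem: r = 1/(2 alpha) <= (alpha - 1)/(2 alpha)
assert r <= (alpha - 1) / (2 * alpha)
print(optimal_rate_exponent(r, alpha))  # -0.5
print(passes_exponent(r, alpha))        # 0.5: passes grow like sqrt(n)
```

For an easy problem (e.g. α = 2, r = 1/2) the passes exponent is negative, consistent with a single pass being sufficient.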
This is illustrated in Figure 1.\nWe thus get a number of passes that grows with the number of observations n and depends precisely on the quantities r and α. In synthetic experiments with kernel methods where α and r are known, these scalings are precisely observed. In experiments on parametric models with large dimensions, we also exhibit an increasing number of required passes when the number of observations increases.\n\nFigure 1 – (Left) easy and hard problems in the (α, r)-plane. (Right) different regions for which multiple passes improve known previous bounds (green region) or reach optimality (red region).\n\n2 Least-squares regression in finite dimension\n\nWe consider a joint distribution ρ on pairs of input/output (x, y) ∈ X × R, where X is any input space, and we consider a feature map Φ from the input space X to a feature space H, which we assume Euclidean in this section, so that all quantities are well-defined. In Section 4, we will extend all the notions to Hilbert spaces.\n\n2.1 Main assumptions\n\nWe are considering predicting y as a linear function f_θ(x) = ⟨θ, Φ(x)⟩_H of Φ(x), that is, estimating θ ∈ H such that F(θ) = (1/2) E (y − ⟨θ, Φ(x)⟩_H)² is as small as possible. Estimators will depend on n observations, with standard sampling assumptions:\n\n(A1) The n observations (x_i, y_i) ∈ X × R, i = 1, . . . , n, are independent and identically distributed from the distribution ρ.\n\nSince H is finite-dimensional, F always has a (potentially non-unique) minimizer in H which we denote θ*. 
We make the following standard boundedness assumptions:\n\n(A2) ‖Φ(x)‖ ≤ R almost surely, |y − ⟨θ*, Φ(x)⟩_H| is almost surely bounded by σ and |y| is almost surely bounded by M.\n\nIn order to obtain improved rates with multiple passes, and motivated by the equivalent previously used condition in reproducing kernel Hilbert spaces presented in Section 4, we make the following extra assumption (we denote by Σ = E[Φ(x) ⊗_H Φ(x)] the (non-centered) covariance matrix).\n\n(A3) For µ ∈ [0, 1], there exists κ_µ > 0 such that, almost surely, Φ(x) ⊗_H Φ(x) ≼_H κ_µ² R^{2µ} Σ^{1−µ}. Note that it can also be written as ‖Σ^{µ/2−1/2} Φ(x)‖_H ≤ κ_µ R^µ.\n\nAssumption (A3) is always satisfied with some µ ∈ [0, 1], and has particular values for µ = 1, with κ_1 = 1, and µ = 0, where κ_0 has to be larger than the dimension of the space H.\nWe will also introduce a parameter α that characterizes the decay of eigenvalues of Σ through the quantity tr Σ^{1/α}, as well as the difficulty of the learning problem through ‖Σ^{1/2−r} θ*‖_H, for r ∈ [0, 1]. In the finite-dimensional case, these quantities can always be defined and are most often finite, but may be very large compared to the sample size. In the following assumptions the quantities are assumed to be finite and small compared to n.\n\n(A4) There exists α > 1 such that tr Σ^{1/α} < ∞.\n\nAssumption (A4) is often called the “capacity condition”. First note that this assumption implies that the decreasing sequence of the eigenvalues of Σ, (λ_m)_{m≥1}, satisfies λ_m = o(1/m^α). Note that tr Σ^µ ≤ κ_µ² R^{2µ}, and thus often we have µ ≥ 1/α, and in the most favorable cases in Section 4, this bound will be achieved. 
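To make the capacity condition (A4) concrete, here is a small self-contained Python sketch (our illustration, with a hypothetical eigenvalue profile λ_m = m^{−β}) checking that tr Σ^{1/α} = ∑_m λ_m^{1/α} stays bounded exactly when the eigenvalue decay is strictly faster than O(m^{−α}):

```python
# Illustration (not from the paper): capacity condition (A4) for a
# hypothetical spectrum lambda_m = m^{-beta}. Then lambda_m^{1/alpha}
# = m^{-beta/alpha}, so tr Sigma^{1/alpha} converges iff beta/alpha > 1,
# i.e. the decay is strictly faster than O(m^{-alpha}).

def partial_trace(beta, alpha, n_terms):
    # Partial sum of sum_m lambda_m^{1/alpha} with lambda_m = m^{-beta}.
    return sum(m ** (-beta / alpha) for m in range(1, n_terms + 1))

alpha, beta = 2.0, 3.0  # decay m^{-3} is faster than m^{-2}: (A4) holds
s1 = partial_trace(beta, alpha, 10_000)
s2 = partial_trace(beta, alpha, 20_000)
print(s1, s2)  # partial sums stabilize: the series converges

# At the boundary beta = alpha the series is harmonic and diverges (~ log n):
h1 = partial_trace(alpha, alpha, 10_000)
h2 = partial_trace(alpha, alpha, 20_000)
print(h1, h2)  # still growing
```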
We also assume:\n\n(A5) There exists r > 0 such that ‖Σ^{1/2−r} θ*‖_H < ∞.\n\nAssumption (A5) is often called the “source condition”. Note also that for r = 1/2, this simply says that the optimal predictor has a small norm.\nIn the subsequent sections, we essentially assume that α, µ and r are chosen (by the theoretical analysis, not by the algorithm) so that all quantities κ_µ R^µ, ‖Σ^{1/2−r} θ*‖_H and tr Σ^{1/α} are finite and small. As recalled in the introduction, these parameters are often used in the non-parametric literature to quantify the hardness of the learning problem (Figure 1).\nWe will state results with O(·) and Θ(·) notations, which will all be independent of n and t (number of observations and number of iterations) but can depend on other finite constants. Explicit dependence on all parameters of the problem is given in proofs. More precisely, we will use the usual O(·) and Θ(·) notations for sequences b_{nt} and a_{nt} that can depend on n and t, as a_{nt} = O(b_{nt}) if and only if there exists M > 0 such that for all n, t, a_{nt} ≤ M b_{nt}, and a_{nt} = Θ(b_{nt}) if and only if there exist M, M′ > 0 such that for all n, t, M′ b_{nt} ≤ a_{nt} ≤ M b_{nt}.\n\n2.2 Related work\n\nGiven our assumptions above, several algorithms have been developed for obtaining low values of the expected excess risk E[F(θ)] − F(θ*).\n\nRegularized empirical risk minimization. Forming the empirical risk F̂(θ), it minimizes F̂(θ) + λ‖θ‖²_H, for appropriate values of λ. It is known that for easy problems where r ≥ (α−1)/(2α), it achieves the optimal rate of convergence O(n^{−2rα/(2rα+1)}) [11]. However, algorithmically, this requires solving a linear system of size n times the dimension of H. 
One could also use fast variance-reduced stochastic gradient algorithms such as SAG [3], SVRG [4] or SAGA [5], with a complexity proportional to the dimension of H times n + R²/λ.\n\nEarly-stopped gradient descent on the empirical risk. Instead of solving the linear system directly, one can use gradient descent with early stopping [12, 14]. Similarly to the regularized empirical risk minimization case, a rate of O(n^{−2rα/(2rα+1)}) is achieved for the easy problems, where r ≥ (α−1)/(2α). Different iterative regularization techniques beyond batch gradient descent with early stopping have been considered, with computational complexities ranging from O(n^{1+α/(2rα+1)}) to O(n^{1+α/(4rα+2)}) times the dimension of H (or n in the kernel case in Section 4) for optimal predictions [12, 15, 16, 17, 14].\n\nStochastic gradient. The usual stochastic gradient recursion is iterating from i = 1 to n,\n\nθ_i = θ_{i−1} + γ (y_i − ⟨θ_{i−1}, Φ(x_i)⟩_H) Φ(x_i),\n\nwith the averaged iterate θ̄_n = (1/n) ∑_{i=1}^n θ_i. Starting from θ_0 = 0, [18] shows that the expected excess performance E[F(θ̄_n)] − F(θ*) decomposes into a variance term that depends on the noise σ² in the prediction problem, and a bias term that depends on the deviation θ* − θ_0 = θ* between the initialization and the optimal predictor. Their bound is, up to universal constants, σ² dim(H)/n + ‖θ*‖²_H/(γn). Further, [13] considered the quantities α and r above to get the bound, up to constant factors:\n\nσ² tr Σ^{1/α} (γn)^{1/α}/n + ‖Σ^{1/2−r} θ*‖² / (γ^{2r} n^{2r}).\n\nWe recover the finite-dimensional bound for α = +∞ and r = 1/2. 
The bounds above are valid for all α ≥ 1 and all r ∈ [0, 1], and the step-size γ is such that γR² ≤ 1/4; we thus see a natural trade-off appearing for the step-size γ, between bias and variance.\nWhen r ≥ (α−1)/(2α), the optimal step-size minimizing the bound above is γ ∝ n^{(α−2α min{r,1}−1)/(2α min{r,1}+1)}, and the obtained rate is optimal. Thus a single pass is optimal. However, when r ≤ (α−1)/(2α), the best step-size does not depend on n, and one can only achieve O(n^{−2r}).\nFinally, in the same multiple-pass set-up as ours, [9] has shown for easy problems where r ≥ (α−1)/(2α) (and single-pass averaged SGD is already optimal) that multiple-pass non-averaged SGD becomes optimal after a correct number of passes (while a single pass is not). Our proof principle of comparing to batch gradient is taken from [9], but we apply it to harder problems where r ≤ (α−1)/(2α). Moreover we consider the multi-pass averaged-SGD algorithm, instead of non-averaged SGD, and take explicitly into account the effect of Assumption (A3).\n\n3 Averaged SGD with multiple passes\n\nWe consider the following algorithm, which is stochastic gradient descent with sampling with replacement with multiple passes over the data (we experiment in Section E of the Appendix with cycling over the data, with or without reshuffling between each pass).\n\n• Initialization: θ_0 = θ̄_0 = 0, t = maximal number of iterations, γ = 1/(4R²) = step-size\n• Iteration: for u = 1 to t, sample i(u) uniformly from {1, . . . , n} and make the step\nθ_u = θ_{u−1} + γ (y_{i(u)} − ⟨θ_{u−1}, Φ(x_{i(u)})⟩_H) Φ(x_{i(u)}) and θ̄_u = (1 − 1/u) θ̄_{u−1} + (1/u) θ_u.\n\nIn this paper, following [18, 13], but as opposed to [19], we consider unregularized recursions. 
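The iteration above can be sketched in a few lines of Python. This is our own minimal illustration (not the authors' code), with a synthetic well-specified linear model; it uses the paper's step-size γ = 1/(4R²) and the running average θ̄_u = (1 − 1/u) θ̄_{u−1} + (1/u) θ_u:

```python
import numpy as np

def multipass_averaged_sgd(X, y, t, rng=None):
    # Multi-pass averaged SGD for least squares, sampling with replacement,
    # following the recursion described above, with gamma = 1/(4 R^2).
    rng = np.random.default_rng(rng)
    n, d = X.shape
    R = np.linalg.norm(X, axis=1).max()  # bound on ||Phi(x)||
    gamma = 1.0 / (4 * R ** 2)
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for u in range(1, t + 1):
        i = rng.integers(n)                       # sample i(u) uniformly
        residual = y[i] - X[i] @ theta
        theta = theta + gamma * residual * X[i]   # SGD step
        theta_bar += (theta - theta_bar) / u      # running average
    return theta_bar

# Usage: a synthetic problem, with more iterations than observations (t > n).
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))
theta_star = np.ones(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)
theta_bar = multipass_averaged_sgd(X, y, t=20 * n, rng=0)
```

With 20 passes, θ̄_t should be substantially closer to θ* than the zero initialization.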
This removes an unnecessary regularization parameter (at the expense of harder proofs).\n\n3.1 Convergence rate and optimal number of passes\n\nOur main result is the following (see full proof in Appendix):\nTheorem 1. Let n ∈ N* and t ≥ n, under Assumptions (A1), (A2), (A3), (A4), (A5), (A6), with γ = 1/(4R²).\n\n• For µα < 2rα + 1 < α, if we take t = Θ(n^{α/(2rα+1)}), we obtain the following rate:\n\nE F(θ̄_t) − F(θ*) = O(n^{−2rα/(2rα+1)}).\n\n• For µα ≥ 2rα + 1, if we take t = Θ(n^{1/µ} (log n)^{1/µ}), we obtain the following rate:\n\nE F(θ̄_t) − F(θ*) ≤ O(n^{−2r/µ}).\n\nSketch of proof. The main difficulty in extending proofs from the single-pass case [18, 13] is that as soon as an observation is processed twice, statistical dependences are introduced and the proof does not go through. In a similar context, some authors have considered stability results [8], but the large step-sizes that we consider do not allow this technique. Rather, we follow [16, 9] and compare our multi-pass stochastic recursion θ_t to the batch gradient descent iterate η_t defined as η_t = η_{t−1} + (γ/n) ∑_{i=1}^n (y_i − ⟨η_{t−1}, Φ(x_i)⟩_H) Φ(x_i), with its averaged iterate η̄_t. We thus need to study the predictive performance of η̄_t and the deviation θ̄_t − η̄_t. It turns out that, given the data, the deviation θ_t − η_t satisfies an SGD recursion (with respect to the randomness of the sampling with replacement). For a more detailed summary of the proof technique see Section B.\nThe novelty compared to [16, 9] is (a) to use refined results on averaged SGD for least-squares, in particular convergence in various norms for the deviation θ̄_t − η̄_t (see Section A), that can use our new Assumption (A3). 
Moreover, (b) we need to extend the convergence results for the batch gradient descent recursion from [14], also to take into account the new assumption (see Section D). These two results are interesting on their own.\n\nImproved rates with multiple passes. We can draw the following conclusions:\n\n• If 2αr + 1 ≥ α, that is, easy problems, it has been shown by [13] that a single pass with a smaller step-size than the one we propose here is optimal, and our result does not apply.\n• If µα < 2rα + 1 < α, then our proposed number of iterations is t = Θ(n^{α/(2αr+1)}), which is now greater than n; the convergence rate is then O(n^{−2rα/(2rα+1)}), and, as we will see in Section 4.2, the predictive performance is then optimal when µ ≤ 2r.\n• If µα ≥ 2rα + 1, then the number of iterations is t = Θ(n^{1/µ}), which is greater than n (thus several passes), with a convergence rate equal to O(n^{−2r/µ}), which improves upon the best known rate of O(n^{−2r}). As we will see in Section 4.2, this is not optimal.\n\nNote that these rates are theoretically only bounds on the optimal number of passes over the data, and one should be cautious when drawing conclusions; however our simulations on synthetic data (see Figure 2 in Section 5) confirm that our proposed scalings for the number of passes are observed in practice.\n\n4 Application to kernel methods\n\nIn the section above, we have assumed that H was finite-dimensional, so that the optimal predictor θ* ∈ H was always defined. Note however that our bounds, which depend on α, r and µ, are independent of the dimension, and hence, intuitively, following [19], should apply immediately to infinite-dimensional spaces.\nWe now first show in Section 4.1 how this intuition can be formalized and how using kernel methods provides a particularly interesting example. 
Moreover, this interpretation allows us to characterize the statistical optimality of our results in Section 4.2.\n\n4.1 Extension to Hilbert spaces, kernel methods and non-parametric estimation\n\nOur main result in Theorem 1 extends directly to the case where H is an infinite-dimensional Hilbert space. In particular, given a feature map Φ : X → H, any vector θ ∈ H is naturally associated to a function defined as f_θ(x) = ⟨θ, Φ(x)⟩_H. Algorithms can then be run with infinite-dimensional objects if the kernel K(x′, x) = ⟨Φ(x′), Φ(x)⟩_H can be computed efficiently. This identification of elements θ of H with functions f_θ endows the various quantities we have introduced in the previous sections with natural interpretations in terms of functions. The stochastic gradient descent described in Section 3 adapts instantly to this new framework as the iterates (θ_u)_{u≤t} are linear combinations of feature vectors Φ(x_i), i = 1, . . . , n, and the algorithms can classically be “kernelized” [20, 13], with an overall running-time complexity of O(nt).\nFirst note that Assumption (A3) is equivalent to: for all x ∈ X and θ ∈ H, |f_θ(x)|² ≤ κ_µ² R^{2µ} ⟨f_θ, Σ^{1−µ} f_θ⟩_H, that is, ‖g‖²_{L∞} ≤ κ_µ² R^{2µ} ‖Σ^{1/2−µ/2} g‖²_H for any g ∈ H, and it also implies¹ ‖g‖_{L∞} ≤ κ_µ R^µ ‖g‖^µ_H ‖g‖^{1−µ}_{L2}. These are common assumptions in the context of kernel methods [22], essentially controlling in a more refined way the regularity of the whole space of functions associated to H with respect to the L∞-norm, compared to the too crude inequality ‖g‖_{L∞} = sup_x |⟨Φ(x), g⟩_H| ≤ sup_x ‖Φ(x)‖_H ‖g‖_H ≤ R ‖g‖_H.\nThe natural relation with functions allows us to analyze effects that are crucial in the context of learning, but difficult to grasp in the finite-dimensional setting. 
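The "kernelization" mentioned above can be sketched as follows (our own illustration, not the authors' code): the iterate is represented by coefficients a with f(x) = ∑_j a_j K(x_j, x), each step touches one row of the Gram matrix, so t steps cost O(nt). The Gaussian kernel below is an illustrative choice:

```python
import numpy as np

def kernelized_multipass_sgd(K, y, t, gamma, rng=None):
    # Kernelized multi-pass averaged SGD: the iterate theta_u is a linear
    # combination of feature vectors, stored as coefficients a with
    # f(x) = sum_j a_j K(x_j, x). Each step costs O(n); t steps cost O(n t).
    rng = np.random.default_rng(rng)
    n = K.shape[0]
    a = np.zeros(n)       # coefficients of theta_u
    a_bar = np.zeros(n)   # coefficients of the averaged iterate
    for u in range(1, t + 1):
        i = rng.integers(n)
        residual = y[i] - K[i] @ a    # y_i - f_{theta_{u-1}}(x_i)
        a[i] += gamma * residual      # only one coefficient changes
        a_bar += (a - a_bar) / u      # running average of coefficients
    return a_bar

# Usage with a Gaussian kernel on 1D inputs; K(x, x) = 1 so gamma = 1/4.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)
a_bar = kernelized_multipass_sgd(K, y, t=10 * n, gamma=0.25, rng=0)
pred = K @ a_bar
```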
Consider the following prototypical example of a hard learning problem.\nExample 1 (Prototypical hard problem on a simple Sobolev space). Let X = [0, 1], with x sampled uniformly on X and\n\ny = sign(x − 1/2) + ε, Φ(x) = {|k|^{−1} e^{2ikπx}}_{k∈Z*}.\n\nThis corresponds to the kernel K(x, y) = ∑_{k∈Z*} |k|^{−2} e^{2ikπ(x−y)}, which is well defined (and leads to the simplest Sobolev space). Note that for any θ ∈ H, which is here identified as the space of square-summable sequences ℓ²(Z), we have f_θ(x) = ⟨θ, Φ(x)⟩_{ℓ²(Z)} = ∑_{k∈Z*} (θ_k/|k|) e^{2ikπx}. This means that for any estimator θ̂ given by the algorithm, f_θ̂ is at least once continuously differentiable, while the target function sign(· − 1/2) is not even continuous. Hence, we are in a situation where θ*, the minimizer of the excess risk, does not belong to H. Indeed, let us represent sign(· − 1/2) in H, for almost all x ∈ [0, 1], by its Fourier series sign(x − 1/2) = ∑_{k∈Z*} α_k e^{2ikπx}, with |α_k| ~ 1/k; an informal reasoning would lead to (θ*)_k = α_k |k| ~ 1, which is not square-summable and thus θ* ∉ H. For more details, see [23, 24].\nThis setting generalizes important properties that are valid for Sobolev spaces, as shown in the following example, where α, r, µ are characterized in terms of the smoothness of the functions in H, the smoothness of f*, and the dimensionality of the input space X.\nExample 2 (Sobolev spaces [25, 22, 26, 10]). Let X ⊂ R^d, d ∈ N, with ρ_X supported on X, absolutely continuous with respect to the uniform distribution and such that ρ_X(x) ≥ a > 0 almost everywhere, for a given a. Assume that f*(x) = E[y|x] is s-times differentiable, with s > 0. 
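Example 1 can be probed numerically with a truncated feature map. This is our own illustration (not from the paper), truncating Φ(x) to 0 < |k| ≤ k_max; the diagonal kernel value ∑_{k∈Z*} |k|^{−2} = 2ζ(2) = π²/3 gives a correctness check, and the squared norm of the candidate θ* grows with the truncation level instead of converging:

```python
import numpy as np

# Illustration (not from the paper): truncated feature map of Example 1,
# Phi(x)_k = |k|^{-1} exp(2 i k pi x) for 0 < |k| <= k_max.
k_max = 2000
ks = np.concatenate([np.arange(-k_max, 0), np.arange(1, k_max + 1)])

def phi(x):
    return np.exp(2j * np.pi * ks * x) / np.abs(ks)

# K(x, x) = <Phi(x), Phi(x)> = sum_{k in Z*} |k|^{-2} -> pi^2 / 3 ~ 3.2899
Kxx = np.vdot(phi(0.3), phi(0.3)).real
print(Kxx)

# Fourier coefficients of sign(. - 1/2) on [0, 1]: alpha_k = 0 for even k
# and |alpha_k| = 2/(pi |k|) ~ 1/k for odd k, so the would-be optimal
# (theta*)_k = alpha_k |k| has |(theta*)_k| ~ 1: not square-summable.
alpha = (1 - (-1.0) ** ks) / (-1j * np.pi * ks)
theta_star = alpha * np.abs(ks)
print(np.sum(np.abs(theta_star) ** 2))  # grows linearly with k_max
```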
Choose a kernel inducing Sobolev spaces of smoothness m with m > d/2, such as the Matérn kernel\n\nK(x′, x) = ‖x′ − x‖^{m−d/2} K_{d/2−m}(‖x′ − x‖),\n\nwhere K_{d/2−m} is the modified Bessel function of the second kind. Then the assumptions are satisfied for any ε > 0, with α = 2m/d, r = s/(2m), µ = d/(2m) + ε.\n\nIn the following subsection we compare the rates obtained in Thm. 1 with known lower bounds under the same assumptions.\n\n4.2 Minimax lower bounds\n\nIn this section we recall known lower bounds on the rates for classes of learning problems satisfying the conditions in Sect. 2.1. Interestingly, the comparison below shows that our results in Theorem 1 are optimal in the setting 2r ≥ µ.\n\n¹Indeed, for any g ∈ H, ‖Σ^{1/2−µ/2} g‖_H = ‖Σ^{−µ/2} g‖_{L2} ≤ ‖Σ^{−1/2} g‖^µ_{L2} ‖g‖^{1−µ}_{L2} = ‖g‖^µ_H ‖g‖^{1−µ}_{L2}, where we used that for any g ∈ H and any bounded operator A, s ∈ [0, 1]: ‖A^s g‖_{L2} ≤ ‖A g‖^s_{L2} ‖g‖^{1−s}_{L2} (see [21]).\n\n
While the optimality of SGD was known for the regime {2rα + 1 ≥ α} ∩ {2r ≥ µ}, here we extend the optimality to the new regime α ≥ 2rα + 1 ≥ µα, covering essentially all the region 2r ≥ µ, as can be observed in Figure 1, where for clarity we plotted the best possible value for µ, that is, µ = 1/α [10] (which is true for Sobolev spaces).\nWhen r ∈ (0, 1] is fixed, but there are no assumptions on α or µ, then the optimal minimax rate of convergence is O(n^{−2r/(2r+1)}), attained by regularized empirical risk minimization [11] and other spectral filters on the empirical covariance operator [27].\nWhen r ∈ (0, 1] and α ≥ 1 are fixed (but there are no constraints on µ), the optimal minimax rate of convergence O(n^{−2rα/(2rα+1)}) is attained when r ≥ (α−1)/(2α), with empirical risk minimization [14] or stochastic gradient descent [13].\nWhen r ≤ (α−1)/(2α), the rate of convergence O(n^{−2rα/(2rα+1)}) is known to be a lower bound on the optimal minimax rate, but the best upper bound so far is O(n^{−2r}), achieved by empirical risk minimization [14] or stochastic gradient descent [13], and the optimal rate is not known.\nWhen r ∈ (0, 1], α ≥ 1 and µ ∈ [1/α, 1] are fixed, then the rate of convergence O(n^{−max{µ,2r}α/(max{µ,2r}α+1)}) is known to be a lower bound on the optimal minimax rate [10]. This is attained by regularized empirical risk minimization when 2r ≥ µ [10], and now by SGD with multiple passes, and it is thus the optimal rate in this situation. When 2r < µ, the only known upper bound is O(n^{−2αr/(µα+1)}), and the optimal rate is not known.\n\n5 Experiments\n\nIn our experiments, the main goal is to show that with more than one pass over the data, we can improve the accuracy of SGD when the problem is hard. 
We also want to highlight the dependence of the optimal number of passes (that is, t/n) with respect to the number of observations n.\n\nSynthetic experiments. Our main experiments are performed on artificial data following the setting in [21]. For this purpose, we take kernels K corresponding to splines of order q (see [24]) that fulfill Assumptions (A1) (A2) (A3) (A4) (A5) (A6). Indeed, let us consider the following function\n\nΛ_q(x, z) = ∑_{k∈Z*} e^{2iπk(x−z)} / |k|^q,\n\ndefined almost everywhere on [0, 1], with q ∈ R, and for which we have the interesting relationship ⟨Λ_q(x, ·), Λ_{q′}(z, ·)⟩_{L2(dρ_X)} = Λ_{q+q′}(x, z) for any q, q′ ∈ R. Our setting is the following:\n\n• Input distribution: X = [0, 1] and ρ_X is the uniform distribution.\n• Kernel: ∀(x, z) ∈ [0, 1]², K(x, z) = Λ_α(x, z).\n• Target function: ∀x ∈ [0, 1], θ* = Λ_{rα+1/2}(x, 0).\n• Output distribution: ρ(y|x) is a Gaussian with variance σ² and mean f_{θ*}(x).\n\nFor this setting we can show that the learning problem satisfies Assumptions (A1) (A2) (A3) (A4) (A5) (A6) with r, α, and µ = 1/α. We take different values of these parameters to encounter all the different regimes of the problems shown in Figure 1.\nFor each n from 100 to 1000, we found the optimal number of steps t*(n) that minimizes the test error F(θ̄_t) − F(θ*). Note that because of overfitting the test error increases for t > t*(n). In Figure 2, we show t*(n) with respect to n in log scale. As expected, for the easy problems (where r ≥ (α−1)/(2α), see top left and right plots), the slope of the plot is 1, as one pass over the data is enough: t*(n) = Θ(n). 
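As an illustration (our own sketch, not the authors' code), the spline function Λ_q above can be approximated by truncating the series. For q = 2 it has the closed form Λ_2(x, z) = 2π² B_2({x − z}), with B_2(u) = u² − u + 1/6 the second Bernoulli polynomial, which gives a convenient correctness check:

```python
import numpy as np

def spline_kernel(x, z, q, k_max=20_000):
    # Truncated Lambda_q(x, z) = sum_{k in Z*} exp(2 i pi k (x - z)) / |k|^q
    #                          = 2 * sum_{k >= 1} cos(2 pi k (x - z)) / k^q.
    k = np.arange(1, k_max + 1)
    return 2 * np.sum(np.cos(2 * np.pi * k * (x - z)) / k ** q)

def spline2_closed_form(x, z):
    # For q = 2: Lambda_2(x, z) = 2 pi^2 B_2({x - z}), B_2(u) = u^2 - u + 1/6.
    u = (x - z) % 1.0
    return 2 * np.pi ** 2 * (u ** 2 - u + 1.0 / 6.0)

print(spline_kernel(0.7, 0.4, 2.0))
print(spline2_closed_form(0.7, 0.4))  # the two agree up to truncation error
```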
But we see that for hard problems (where r ≤ (α−1)/(2α), see bottom left and right plots), we need more than one pass to achieve optimality, as the optimal number of iterations is very close to t*(n) = Θ(n^{α/(2rα+1)}). This matches the theoretical predictions of Theorem 1. We also notice in the plots that the bigger α/(2rα+1) is, the harder the problem is and the bigger the number of epochs we have to take. Note that, to reduce the noise on the estimation of t*(n), the plots show an average over 100 replications.\n\nTo conclude, the experiments presented in this section correspond exactly to the theoretical setting of the article (sampling with replacement); however, we present in Figures 4 and 5 of Section E of the Appendix results on the same datasets for two different ways of sampling the data: (a) without replacement, for which we select the data points randomly but never use the same point twice in one epoch; (b) cycles, for which we pick the data points successively in the same order. The obtained scalings relating number of iterations or passes to number of observations are the same.\n\nFigure 2 – The four plots each represent a different configuration in the (α, r) plane represented in Figure 1, for r = 1/(2α). Top left (α = 1.5) and right (α = 2) are two easy problems (top right is the limiting case where r = (α−1)/(2α)) for which one pass over the data is optimal. Bottom left (α = 2.5) and right (α = 3) are two hard problems for which an increasing number of passes is required. The blue dotted lines are the slopes predicted by the theoretical result in Theorem 1.\n\nLinear model. To illustrate our result with some real data, we show how the optimal number of passes over the data increases with the number of samples. 
In Figure 3, we simply performed linear least-squares regression on the MNIST dataset and plotted the optimal number of passes over the data that leads to the smallest error on the test set. Evaluating α and r from Assumptions (A4) and (A5), we found α = 1.7 and r = 0.18. As r = 0.18 ≤ (α−1)/(2α) ≈ 0.2, Theorem 1 indicates that this corresponds to a situation where only one pass over the data is not enough, confirming the behavior of Figure 3. This suggests that learning MNIST with linear regression is a hard problem.\n\n6 Conclusion\n\nIn this paper, we have shown that for least-squares regression, in hard problems where single-pass SGD is not statistically optimal (r < (α−1)/(2α)), multiple passes lead to statistical optimality with a number of passes that somewhat surprisingly needs to grow with sample size, and with a convergence rate which is superior to previous analyses of stochastic gradient. Using non-parametric estimation, we show that under certain conditions (2r ≥ µ), we attain statistical optimality.\nOur work could be extended in several ways: (a) our experiments suggest that cycling over the data and cycling with random reshuffling perform similarly to sampling with replacement; it would be interesting to combine our theoretical analysis with work aiming at analyzing other sampling schemes [28, 29]. (b) Mini-batches could also be considered, with potentially interesting effects compared to the streaming setting. Also, (c) our analysis focuses on least-squares regression; an extension to all smooth loss functions would widen its applicability. 
Figure 3 – For the MNIST dataset, we show the optimal number of passes over the data with respect to the number of samples in the case of linear regression.

Moreover, (d) providing optimal efficient algorithms for the situation 2r < µ is a clear open problem (for which the optimal rate is not known, even for non-efficient algorithms). Additionally, (e) in the context of classification, we could combine our analysis with [30] to study the potential discrepancies between training and testing losses and errors when considering high-dimensional models [31]. More generally, (f) we could explore the effect of our analysis for methods based on the least-squares estimator in the context of structured prediction [32, 33, 34] and (non-linear) multitask learning [35]. Finally, (g) to reduce the computational complexity of the algorithm while retaining the (optimal) statistical guarantees, we could combine multi-pass stochastic gradient descent with approximation techniques like random features [36], extending the analysis of [37] to the more general setting considered in this paper.

Acknowledgements

We acknowledge support from the European Research Council (grant SEQUOIA 724063). We also thank Raphaël Berthier and Yann Labbé for their enlightening advice on this project.

References

[1] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[2] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.

[3] Nicolas L. Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.

[4] Rie Johnson and Tong Zhang.
Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

[5] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

[6] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

[7] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley, 1983.

[8] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.

[9] Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.

[10] Simon Fischer and Ingo Steinwart. Sobolev norm learning rates for regularized least-squares algorithm. Fakultät für Mathematik und Physik, Universität Stuttgart, 2017.

[11] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[12] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

[13] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.

[14] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 2018.

[15] L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri.
Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.

[16] Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.

[17] Gilles Blanchard and Nicole Krämer. Convergence rates of kernel conjugate gradient for random design regression. Analysis and Applications, 14(06):763–794, 2016.

[18] Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), pages 773–781, 2013.

[19] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Journal of Machine Learning Research, 18(1):3520–3570, 2017.

[20] Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.

[21] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3215–3225, 2017.

[22] Ingo Steinwart, Don R. Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In Proc. COLT, 2009.

[23] R. A. Adams. Sobolev Spaces. Academic Press, New York, 1975.

[24] G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.

[25] Holger Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.

[26] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.

[27] Gilles Blanchard and Nicole Mücke. Optimal rates for regularization of statistical inverse learning problems.
Foundations of Computational Mathematics, pages 1–43, 2017.

[28] Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems 29, pages 46–54, 2016.

[29] Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo Parrilo. Why random reshuffling beats stochastic gradient descent. Technical Report 1510.08560, arXiv, 2015.

[30] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Exponential convergence of testing error for stochastic gradient methods. In Proceedings of the 31st Conference On Learning Theory, volume 75, pages 250–296, 2018.

[31] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

[32] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems, pages 4412–4420, 2016.

[33] Anton Osokin, Francis Bach, and Simon Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems, pages 302–313, 2017.

[34] Carlo Ciliberto, Francis Bach, and Alessandro Rudi. Localized structured prediction. arXiv preprint arXiv:1806.02402, 2018.

[35] Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco, and Massimiliano Pontil. Consistent multi-task learning with nonlinear output relations. In Advances in Neural Information Processing Systems, pages 1986–1996, 2017.

[36] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[37] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco.
Learning with SGD and random features. In Advances in Neural Information Processing Systems 31, pages 10213–10224, 2018.

[38] R. Aguech, E. Moulines, and P. Priouret. On a perturbation approach for the analysis of stochastic tracking algorithms. SIAM Journal on Control and Optimization, 39(3):872–899, 2000.

[39] Alessandro Rudi, Guillermo D. Canas, and Lorenzo Rosasco. On the sample complexity of subspace learning. In Advances in Neural Information Processing Systems, pages 2067–2075, 2013.

[40] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

[41] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. Falkon: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3888–3898, 2017.