{"title": "Beating SGD Saturation with Tail-Averaging and Minibatching", "book": "Advances in Neural Information Processing Systems", "page_first": 12568, "page_last": 12577, "abstract": "While stochastic gradient descent (SGD) is one of the major workhorses in machine learning, the \nlearning properties of many practically used variants are still poorly understood.\nIn this paper, we consider least squares learning in a nonparametric setting and contribute \nto filling this gap by focusing on the effect and interplay of multiple passes, mini-batching and \naveraging, in particular tail averaging. Our results show how these different variants of SGD \ncan be combined to achieve optimal learning rates, also providing practical insights. A novel key result is \nthat tail averaging allows faster convergence rates than uniform averaging in the nonparametric setting. \nFurther, we show that a combination of tail-averaging and minibatching allows more aggressive \nstep-size choices than using any one of said components.", "full_text": "Beating SGD Saturation with Tail-Averaging and\n\nMinibatching\n\nInstitute for Stochastics and Applications\n\nUniversity of Stuttgart\n\nnicole.muecke@mathematik.uni-stuttgart.de\n\nNicole M\u00fccke\n\nGergely Neu\n\nUniversitat Pompeu Fabra\ngergely.neu@gmail.com\n\nLorenzo Rosasco\n\nUniversita\u2019 degli Studi di Genova\n\nIstituto Italiano di Tecnologia\n\nMassachusetts Institute of Technology\n\nlrosasco@mit.edu\n\nAbstract\n\nWhile stochastic gradient descent (SGD) is one of the major workhorses in machine\nlearning, the learning properties of many practically used variants are still poorly\nunderstood. In this paper, we consider least squares learning in a nonparametric\nsetting and contribute to \ufb01lling this gap by focusing on the effect and interplay of\nmultiple passes, mini-batching and averaging, in particular tail averaging. 
Our results show how these different variants of SGD can be combined to achieve optimal learning rates, providing practical insights. A novel key result is that tail averaging allows faster convergence rates than uniform averaging in the nonparametric setting. Further, we show that a combination of tail-averaging and minibatching allows more aggressive step-size choices than using any one of said components.

1 Introduction

Stochastic gradient descent (SGD) provides a simple and yet stunningly efficient way to solve a broad range of machine learning problems. Our starting observation is that, while a number of variants including multiple passes over the data, mini-batching and averaging are commonly used, their combinations and learning properties have only been partially studied. The literature on convergence properties of SGD is vast, but usually only one pass over the data is considered, see, e.g., [23]. In the context of nonparametric statistical learning, which we consider here, the study of one-pass SGD was probably first considered in [35] and then further developed in a number of papers (e.g., [37, 36, 25]). Another line of work derives statistical learning results for one-pass SGD with averaging from a worst-case sequential prediction analysis [29, 18, 28]. The idea of using averaging also has a long history, going back to at least the works of [32] and [27]; see also [34] and references therein. More recently, averaging was shown to lead to larger, possibly constant, step-sizes, see [2, 10, 11]. A different take on the role of (weighted) averaging was given in [24], highlighting a connection with ridge regression, a.k.a. Tikhonov regularization. A different flavor of averaging, called tail averaging, was considered for one-pass SGD in [19] in a parametric setting. The role of minibatching has also been considered and shown to potentially lead to linear parallelization speedups, see e.g. [7] and references therein.
Very few results consider the role of multiple passes for learning. Indeed, this variant of SGD is typically analyzed for the minimization of the empirical risk, rather than the actual population risk, see for example [4]. To the best of our knowledge, the first paper to analyze the learning properties of multipass SGD was [31], where a cyclic selection strategy was

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

considered. Other results for multipass SGD were then given in [16] and [20]. Our starting point is the results in [21], where optimal rates for multipass SGD were derived, also considering the effect of mini-batching. Following the approach of this latter paper, multipass SGD with averaging was analyzed in [26], without minibatching.

In this paper, we develop and improve the above results on two fronts. On the one hand, we consider for the first time the role of multiple passes, mini-batching and averaging at once. On the other hand, we further study the beneficial effect of tail averaging. Both mini-batching and averaging are known to allow larger step-sizes. Our results show that their combination allows even more aggressive parameter choices. At the same time, averaging has been shown to lead to slower convergence rates in some cases. In a parametric setting, averaging prevents linear convergence rates [2, 11]. In a nonparametric setting, it prevents exploiting the possible regularity of the solution [10], a phenomenon called saturation [12]. In other words, uniform averaging can prevent optimal rates in a nonparametric setting. Our results provide a simple explanation of this effect, showing that it has a purely deterministic nature. Further, we show that tail averaging allows us to bypass this problem. These results parallel the findings of [19], showing similar beneficial effects of tail-averaging and minibatching in the finite-dimensional setting.
Following [21], our analysis relies on the study of batch gradient descent and then of the discrepancy between batch gradient descent and SGD, with the additional twist that it also considers the role of tail-averaging. The rest of the paper is organized as follows. In Section 2, we describe the least-squares learning problem that we consider, as well as the different SGD variants we analyze. In Section 3, we collect a number of observations shedding light on the role of uniform and tail averaging. In Section 4, we present and discuss our main results. In Section 5, we illustrate our results via some numerical simulations. Proofs and technical results are deferred to the appendices.

2 Least Squares Learning with SGD

In this section, we introduce the problem of supervised learning with the least squares loss and then present SGD and its variants.

2.1 Least squares learning

We let (X, Y) be a pair of random variables with values in H × R, with H a real separable Hilbert space. This latter setting is known to be equivalent to nonparametric learning with kernels [31]. We focus on this setting since considering infinite dimensions allows us to highlight more clearly the regularization role played by different parameters. Indeed, unlike in finite dimensions, regularization is needed to derive learning rates in this case. Throughout the paper we will suppose that the following assumption holds:

Assumption 1. Assume ‖X‖ ≤ κ and |Y| ≤ M almost surely, for some κ, M > 0.

The problem of interest is to solve

min_{w ∈ H} L(w),   L(w) = (1/2) E[(Y − ⟨w, X⟩)²],   (1)

given a realization x_1, . . . , x_n of n independent copies X_1, . . . , X_n of X.
Defining

Σ = E[X ⊗ X]   and   h = E[XY],   (2)

the optimality condition of problem (1) shows that a solution w* satisfies the normal equation

Σw* = h.   (3)

Finally, recall that the excess risk associated with any w ∈ H can be written as¹

L(w) − L(w*) = (1/2) ‖Σ^{1/2}(w − w*)‖².

¹ It is a standard fact that the operator Σ is symmetric, positive definite and trace class (hence compact), since X is bounded. Fractional powers of Σ are then naturally defined using spectral calculus.

2.2 Learning with stochastic gradients

We now introduce the various gradient iterations relevant in the following. The basic stochastic gradient iteration is given by the recursion

w_{t+1} = w_t − γ_t x_t(⟨x_t, w_t⟩ − y_t)   (4)

for all t = 0, 1, . . . , with w_0 = 0. The name stems from the fact that, for all w ∈ H and t = 1, . . . , n,

E[X_t(⟨X_t, w⟩ − Y_t)] = ∇L(w).

While the above iteration is not guaranteed to decrease the objective at each step, this procedure and its variants are commonly called Stochastic Gradient Descent (SGD), and we will also use this terminology. The sequence (γ_t)_t of positive numbers is called the step-size or learning rate. In its basic form, the above iteration prescribes using each data point only once. This is the classical stochastic approximation perspective pioneered by [30].

In practice, however, a number of different variants are considered.
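As a concrete illustration (ours, not from the original text), the basic recursion (4) can be sketched in a few lines of NumPy; the dimension, sample size, step-size and noise level below are arbitrary choices made for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, gamma = 20, 2000, 0.01   # illustrative dimension, sample size, constant step-size

# synthetic data: y = <w_star, x> + noise (all constants here are our own choices)
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

w = np.zeros(d)                            # w_0 = 0, as in the text
for t in range(n):                         # basic form: each data point used exactly once
    x_t, y_t = X[t], y[t]
    w = w - gamma * x_t * (x_t @ w - y_t)  # recursion (4)
```

After this single pass, w approximates w_star; the variants discussed next only change how data points are assigned to iterations.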
In particular, oftentimes data points are visited multiple times, in which case we can write the recursion as

w_{t+1} = w_t − γ_t x_{i_t}(⟨x_{i_t}, w_t⟩ − y_{i_t}).

Here i_t = i(t) denotes a map specifying the strategy with which data are selected at each iteration. Popular choices include: cyclic, where an order over [n] is fixed a priori and data points are visited multiple times according to it; reshuffling, where the order of the data points is permuted after all of them have been sampled once, amounting to sampling without replacement; and finally the most common approach, sampling each point with replacement uniformly at random. This latter choice is the one we consider in this paper. We broadly refer to this variant of SGD as multipass-SGD, referring to the "multiple passes" over the data set as t grows larger than n.

Another variant of SGD is based on considering more than one data point at each iteration, a procedure called mini-batching. Given b ∈ [n], the mini-batch SGD recursion is given by

w_{t+1} = w_t − γ_t (1/b) Σ_{i=b(t−1)+1}^{bt} x_{j_i}(⟨x_{j_i}, w_t⟩ − y_{j_i}),

where j_1, . . . , j_{bT} are i.i.d. random variables, distributed uniformly on [n]. Here the number of passes over the data after t iterations is ⌈bt/n⌉. Mini-batching can be useful for at least two different reasons. The most important is that mini-batches make the best use of memory resources, in particular when distributed computations are available.
Another advantage is that more accurate gradient estimates are available at each step.

Finally, one last idea is considering averaging of the iterates, rather than working with the final iterate:

w̄_T = (1/T) Σ_{t=1}^{T} w_t.

This is a classical idea in optimization, where it is known to provide improved convergence results [32, 27, 15, 2], but it is also used when recovering stochastic results from a worst-case sequential prediction analysis [33, 17]. More recently, averaging was shown to lead to larger step-sizes, see [2, 10, 11]. In the following, we consider a variant of the above idea, namely tail-averaging, where for 0 ≤ S ≤ T − 1 we let

w̄_{S,T} = (1/(T − S)) Σ_{t=S+1}^{T} w_t.

We will occasionally write w̄_L = w̄_{S,T}, with L = T − S. In the following, we study how the above ideas can be combined to solve problem (1) and how such combinations affect the learning properties of the obtained solutions.

3 An appetizer: Averaging and Gradient Descent Convergence

Averaging is known to allow larger step-sizes for SGD, but also to lead to slower convergence rates in certain settings [10]. In this section, we present calculations shedding light on these effects. In particular, we show how the slower convergence is a completely deterministic effect and how tail averaging can provide a remedy. In the rest of the paper, we build on these reasonings to derive novel quantitative results in terms of learning bounds. The starting observation is that since SGD is based on stochastic estimates of the expected risk gradient (cf.
equations (1), (4)) it is natural to start from the exact gradient descent to understand the role played by averaging.

For γ > 0 and u_0 = 0, consider the population gradient descent iteration

u_t = u_{t−1} − γ E[X(⟨X, u_{t−1}⟩ − Y)] = (I − γΣ)u_{t−1} + γh,

where the last equality follows from (2). Then, using the normal equation (3) and a simple induction argument [12], it is easy to see that

u_T = g_T(Σ)Σw*,   g_T(Σ) = γ Σ_{j=0}^{T−1} (I − γΣ)^j.   (5)

Here, g_T is a spectral filtering function corresponding to a truncated matrix geometric series (the von Neumann series). For the latter to converge, we need γ such that ‖I − γΣ‖ < 1, e.g. γ < 1/κ² ≤ 1/σ_M, with σ_M = σ_max(Σ) ≤ κ², hence recovering a classical step-size choice. The above computation provides a way to analyze gradient descent convergence. Indeed, one can easily show that

w* − u_T = r_T(Σ)w*,   r_T(Σ) = (I − γΣ)^T,

since g_T(Σ)Σ = I − (I − γΣ)^T, by basic properties of the Neumann series defining g_T. The properties of the so-called residual operator r_T(Σ) control the convergence of GD. Indeed, if σ_m = σ_min(Σ) > 0, then

‖Σ^{1/2}(u_T − w*)‖² = ‖Σ^{1/2} r_T(Σ) w*‖² ≤ σ_M (1 − γσ_m)^{2T} ‖w*‖² ≤ σ_M e^{−2σ_m γT} ‖w*‖²,

by the basic inequality 1 + z ≤ e^z, highlighting that the population GD iteration converges exponentially fast to the risk minimizer.
However, a major caveat is that assuming σ_min(Σ) > 0 is clearly restrictive in an infinite-dimensional (nonparametric) setting, since it effectively implies that Σ has finite rank. In general, Σ will not have finite rank, but will rather be compact with 0 as the only accumulation point of its spectrum. In this case, it is easy to see that the slower rate

‖Σ^{1/2}(u_T − w*)‖² ≤ (1/(γT)) ‖w*‖²

holds without any further assumption on the spectrum, since one can show, using spectral calculus and a direct computation², that s·r_T(s)² ≤ 1/(γT). It is reasonable to ask whether it is possible to interpolate between the above-described slow and fast rates by making some intermediate assumption. Rather than making assumptions on the spectrum of Σ, one can assume the optimal solution w* to belong to a subspace of the range of Σ; more precisely, that

w* = Σ^r v*   (6)

holds for some r ≥ 0 and v* ∈ H, where larger values of r correspond to making more stringent assumptions. In particular, as r goes to infinity, we are essentially assuming w* to belong to a finite-dimensional space. Assumption (6) is common in the literature on inverse problems [12] and statistical learning [8, 9]. Interestingly, it is also related to so-called conditioning and Łojasiewicz conditions, known to lead to improved rates in continuous optimization, see [13] and references therein.
Under assumption (6), and using again spectral calculus, it is possible to show that, for all r ≥ 0,

‖Σ^{1/2}(u_T − w*)‖² = ‖Σ^{1/2} r_T(Σ) Σ^r v*‖² ≲ (1/(γT))^{2r+1} ‖v*‖².

Thus, higher values of r result in faster convergence rates, at the price of more stringent assumptions.

² Setting (d/ds) s(1 − γs)^T = 0 gives 1 − γs − γsT = 0, hence s = 1/(γ(T+1)), and s(1 − γs)^T ≤ (1/(γ(T+1)))(1 − 1/(T+1))^T ≤ 1/(γT).

3.1 Tail averaged gradient descent

Given the above discussion, we can derive analogous computations for (tail-)averaged GD and draw some insights. Using (5), for S < T, we can write the tail-averaged iterate

u_{S,T} = (1/(T − S)) Σ_{t=S+1}^{T} u_t

as

u_{S,T} = G_{S,T}(Σ)Σw*,   G_{S,T}(Σ) = (1/(T − S)) Σ_{t=S+1}^{T} g_t(Σ).   (7)

As before, we can analyze convergence by considering a suitable residual operator

w* − u_{S,T} = R_{S,T}(Σ)w*,   R_{S,T}(Σ) = I − G_{S,T}(Σ)Σ,   (8)

which, in this case, can be shown to take the form

R_{S,T}(Σ) = ((I − γΣ)^{S+1} / (γ(T − S))) (I − (I − γΣ)^{T−S}) Σ^{−1},   (9)

where with an abuse of notation we denote by Σ^{−1} the pseudoinverse of Σ.
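As a quick sanity check (ours, not from the original text), the closed form (9) can be verified numerically against the definition (7)-(8) for a diagonal Σ, where all operator functions act entrywise on the eigenvalues; the eigenvalues, step-size and window below are arbitrary.

```python
import numpy as np

sigma = np.array([1.0, 0.5, 0.1, 0.01])  # illustrative eigenvalues of Sigma
gamma, S, T = 0.5, 10, 40                # arbitrary step-size and tail window

def g(t, s):
    """g_t from (5): g_t(s) = gamma * sum_{j=0}^{t-1} (1 - gamma*s)^j."""
    return gamma * sum((1.0 - gamma * s) ** j for j in range(t))

# residual via the definition (8): R = I - G_{S,T}(Sigma) Sigma
G = np.mean([g(t, sigma) for t in range(S + 1, T + 1)], axis=0)
R_def = 1.0 - G * sigma

# closed form (9), applied to each eigenvalue
q = 1.0 - gamma * sigma
R_closed = q ** (S + 1) * (1.0 - q ** (T - S)) / (gamma * (T - S) * sigma)

assert np.allclose(R_def, R_closed)
```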
The case of uniform averaging corresponds to S = 0, in which case the residual operator simplifies to

R_{0,T}(Σ) = ((I − γΣ) / (γT)) (I − (I − γΣ)^T) Σ^{−1}.

When σ_m > 0, the residual operators behave roughly as

‖R_{S,T}(Σ)‖² ≈ e^{−σ_m γ(S+1)} / (γ(T − S)),   ‖R_{0,T}(Σ)‖² ≈ 1/(γT),

respectively. This leads to a slower convergence rate for uniform averaging and shows instead how tail averaging with S ∝ T can preserve the fast convergence of GD.

When σ_m = 0, taking again S ∝ T, it is easy to see by spectral calculus that the residual operators behave similarly,

‖Σ^{1/2} R_{S,T}(Σ)‖² ≈ 1/(γT),   ‖Σ^{1/2} R_{0,T}(Σ)‖² ≈ 1/(γT),

leading to comparable rates. The advantage of tail averaging is again apparent if we consider Assumption (6). In this case, for all r > 0, if we take S ∝ T,

‖Σ^{1/2} R_{S,T}(Σ) Σ^r‖² ≈ (1/(γT))^{2r+1},   (10)

whereas with uniform averaging one can only prove

‖Σ^{1/2} R_{0,T}(Σ) Σ^r‖² ≈ (1/(γT))^{2 min(r, 1/2)+1}.   (11)

One immediate observation following from the above discussion is that uniform averaging induces a so-called saturation effect [12], meaning that the rates do not improve after r reaches a critical value. As shown above, this effect vanishes when considering tail-averaging, and the convergence rate of GD is recovered. These results are critically important for our analysis and constitute the main conceptual contribution of our paper. They are proved in Appendix B, while Section A.1 highlights their critical role for SGD.
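The saturation phenomenon in (10)-(11) can also be checked numerically (our illustration, with arbitrary constants): for r = 1 the worst case over the spectrum of s ↦ s^{2r+1} R(s)², i.e. the squared quantity appearing in (10)-(11), decays like (1/γT)² under uniform averaging but like (1/γT)³ under tail averaging with S = T/2.

```python
import numpy as np

gamma, r = 1.0, 1.0               # step-size and smoothness; r > 1/2 exposes saturation
s = np.logspace(-7, 0, 20000)     # dense grid over the spectrum (0, 1/gamma]

def worst_case(T, S):
    """sup_s of s^{2r+1} * R_{S,T}(s)^2, the squared quantity in (10)-(11)."""
    q = 1.0 - gamma * s
    R = q ** (S + 1) * (1.0 - q ** (T - S)) / (gamma * (T - S) * s)  # cf. (9)
    return np.max(s ** (2 * r + 1) * R ** 2)

def decay_exponent(S_of_T):
    """estimate p in worst_case ~ (1/(gamma*T))^p by doubling T."""
    return np.log2(worst_case(2000, S_of_T(2000)) / worst_case(4000, S_of_T(4000)))

exp_unif = decay_exponent(lambda T: 0)       # uniform averaging: S = 0
exp_tail = decay_exponent(lambda T: T // 2)  # tail averaging:    S = T/2
# exp_unif is close to 2*min(r,1/2)+1 = 2 (saturation), exp_tail to 2r+1 = 3
```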
To the best of our knowledge, we are the first to highlight this acceleration property of tail averaging beyond the finite-dimensional setting.

4 Main Results and Discussion

In this section we present and discuss our main results. We start by presenting a general bound and then use it to derive the optimal parameter settings and corresponding performance guarantees. A key quantity in our results is the effective dimension

N(1/γL) = Tr[(Σ + (1/γL) I)^{−1} Σ],

introduced in [38] to generalize results from parametric estimation problems to nonparametric kernel methods. Similarly, this will be one of the main quantities in our learning bounds.

Further, in all our results we will require that the step-size is bounded as γκ² < 1/4, and that the tail length L = T − S is scaled appropriately with the total number of iterations T. More precisely, our analysis considers two different scenarios: one where S = 0 (plain averaging) is explicitly allowed, and one where S > 0, i.e., where we investigate the merits of tail-averaging. To do so, we will assume 0 ≤ S ≤ ((K−1)/(K+1)) T for some K ≥ 1, and also T ≤ (K+1)S in the latter case.

The following theorem presents a simplified version of our main technical result, which we state in its general form in the Appendix. Here, we omit constants and lower-order terms for clarity and give first insights into the interplay between the tuning parameters, namely the step-size γ, tail length L and mini-batch size b, and the number of points n. Note that in a nonparametric setting these are the quantities controlling learning rates. The following result provides a bound for any choice of the tuning parameters, and will allow us to derive optimal choices balancing the various error contributions.

Theorem 1. Let α ∈ (0, 1], 1 ≤ L ≤ T and let Assumption 1 hold.
Assume γκ² < 1/4 as well as n ≳ γL N(1/γL). Then, the excess risk of the tail-averaged SGD iterate satisfies

E[‖Σ^{1/2}(w̄_L − w*)‖²] ≲ ‖Σ^{1/2} R_L(Σ) w*‖² + N(1/γL)/n + γ Tr[Σ^α] / (b (γL)^{1−α}).

The proof of this result is given in Appendix E. We make a few comments. The first term in the bound is the approximation error, already discussed in Section 3. It is controlled by the bound in (10) and is decreasing in γL. The second term corresponds to a variance error due to sampling and noise in the data. It depends on the effective dimension, which is increasing in γL. The third term is a computational error due to the randomization in SGD. Note how it depends on both γL and the minibatch size b: the larger b is, the smaller this error becomes. The dependence of all three terms on γL suggests already at this stage that (γL)^{−1} plays the role of a regularization parameter. We derive our final bound by balancing all terms, i.e. choosing them to be of the same order. To do so we make additional assumptions. The first one is Eq. (6), requiring the optimal solution w* to belong to a subspace of the range of Σ.

Assumption 2. For some r ≥ 0, we assume w* = Σ^r v* for some v* ∈ H satisfying ‖v*‖ ≤ R.

The larger r is, the more stringent the assumption, or, equivalently, the easier the problem, see Section 3. A second assumption is related to the effective dimension.

Assumption 3. For some ν ∈ (0, 1] and C_ν < ∞, we assume N(1/γL) ≤ C_ν (γL)^ν.

This assumption is common in the nonparametric regression setting, see e.g. [6].
Roughly speaking, it quantifies how far Σ is from having finite rank. Indeed, it is satisfied if the eigenvalues (σ_i)_i of Σ have a polynomial decay σ_i ∼ i^{−1/ν}. Since Σ is trace class, the assumption is always satisfied for ν = 1 with C_ν = κ². Smaller values of ν lead to faster convergence rates.

The following corollary of Theorem 1, together with Assumptions 2 and 3, derives optimal parameter settings and corresponding learning rates.

Corollary 1. Let all assumptions of Theorem 1 be satisfied, and suppose that Assumptions 2 and 3 also hold. Further, assume either

1. 0 ≤ r ≤ 1/2 and 1 ≤ L ≤ T (here S = 0, i.e., full averaging is allowed), or

2. 1/2 < r and 1 ≤ L < T, with the additional constraint that for some K ≥ 2

((K+1)/(K−1)) S ≤ T ≤ (K+1)S

(only tail-averaging is considered).

Then, for any n sufficiently large, the excess risk of the (tail-)averaged SGD iterate satisfies

E[‖Σ^{1/2}(w̄_{L_n} − w*)‖²] ≲ n^{−(2r+1)/(2r+1+ν)}

for each of the following choices:

(a) b_n ≃ 1, L_n ≃ n, γ_n ≃ n^{−(2r+ν)/(2r+1+ν)} (one pass over the data);

(b) b_n ≃ n^{(2r+ν)/(2r+1+ν)}, L_n ≃ n^{1/(2r+1+ν)}, γ_n ≃ 1 (one pass over the data);

(c) b_n ≃ n, L_n ≃ n^{1/(2r+1+ν)}, γ_n ≃ 1 (n^{1/(2r+1+ν)} passes over the data).

The proof of Corollary 1 is given in Appendix E. It gives optimal rates [6, 5] under different assumptions and choices of the step-size γ, the minibatch size b and the tail length L, considered as functions of n and the parameters r and ν from Assumptions 2 and 3.
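To make Assumption 3 concrete (our illustration, with arbitrary constants), one can compute the effective dimension directly for a diagonal Σ with polynomially decaying eigenvalues and check the predicted scaling N(λ) ≍ λ^{−ν}:

```python
import numpy as np

nu = 0.5                                 # decay parameter from Assumption 3
i = np.arange(1, 1_000_001)
sigma = i ** (-1.0 / nu)                 # eigenvalues sigma_i = i^{-1/nu}

def effective_dimension(lam):
    """N(lam) = Tr[(Sigma + lam I)^{-1} Sigma] for diagonal Sigma."""
    return np.sum(sigma / (sigma + lam))

lams = np.array([1e-2, 1e-3, 1e-4])      # lam plays the role of 1/(gamma*L)
vals = np.array([effective_dimension(l) for l in lams])
slope = np.polyfit(np.log(1.0 / lams), np.log(vals), 1)[0]
# slope is close to nu, i.e. N(1/(gamma*L)) grows like (gamma*L)^nu
```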
We now discuss our findings in more detail and compare them to previous related work.

Optimality of the bound: The above results show that different parameter choices achieve the same error bound, which is known to be optimal in the minimax sense, see e.g. [6]. As noted before, here we provide simplified statements highlighting the dependence of the bound on the number of points n and the parameters r and ν that control the regularity of the problem. These are the quantities controlling the learning rates and for which lower bounds are available. Note, however, that all the constants in the theorem are worked out and reported in detail in the Appendices.

Regularization properties of the tail length: We recall that for GD it is well known that (γT)^{−1} serves as a regularization parameter, having a quantitatively similar effect to Tikhonov regularization with parameter λ > 0, see e.g. [12]. More generally, our result shows that in the case of tail averaging the quantity (γL)^{−1} becomes the regularization parameter for both GD and SGD.

The benefit of tail-averaging: For SGD with b = 1 and full averaging, it has been shown by [10] that a single pass over the data (i.e., T_n = n) gives optimal rates of convergence, provided that γ_n is chosen as in case (a) of the corollary. However, the results in [10] hold only in the case r ≤ 1/2. Indeed, beyond this regime, there is a saturation effect which precludes optimality for higher smoothness, see the discussion in Section 3, eq. (11). Our analysis for case (a) shows that optimal rates for all r ≥ 0 can still be achieved with the same number of passes and step-size by using non-trivial tail averaging. Additionally, we compare our results with those from [26].
In that paper it is shown that multiple passes are beneficial for obtaining improved rates for averaged SGD in a regime where the optimal solution w* does not belong to H (Assumption 2 does not hold in that case). In that regime, tail-averaging does not improve convergence. Our analysis focuses on the "opposite" regime, where w* ∈ H and saturation slows down the convergence of uniformly-averaged SGD, preventing optimal rates. Here, tail-averaging is indeed beneficial and leads to improved rates.

The benefit of multiple passes and mini-batching: We compare our results with those in [21], where mini-batching but no averaging is considered. In particular, there it is shown that a relatively large step-size of order log(n)^{−1} can be chosen, provided the minibatch size is set to n^{(2r+ν)/(2r+1+ν)} and a number of n^{(2r+1)/(2r+1+ν)} passes is considered. Comparing to these results, we can see the benefits of combining minibatching with tail averaging. Indeed, from (c) we see that with a comparable number of passes we can use a larger, constant step-size already with a much smaller minibatch size. Further, comparing (b) and (c), we see that the setting of γ and L is the same and there is a full range of possible values for b_n between [n^{(2r+ν)/(2r+1+ν)}, n] for which a constant step-size is allowed, still ensuring optimality. As noted in [21], increasing the minibatch size beyond a critical value does not yield any benefit. Compared to [21], we show that tail-averaging can lead to a much smaller critical minibatch size, and hence more efficient computations.

Comparison to the finite-dimensional setting: The relationship between the step-size and batch size in finite dimensions, dim H = d < ∞, is derived in [19], where tail-averaging is also considered, but only with one pass over the data.
One of the main contributions of that work is characterizing the largest step-size that allows achieving statistically optimal rates, showing that the largest permissible step-size grows linearly in b before hitting a certain threshold b_thresh. Setting b > b_thresh results in a loss of computational and statistical efficiency: in this regime, each step of minibatch SGD is exactly as effective in decreasing the bias as a step of batch gradient descent. The critical value b_thresh and the corresponding largest admissible step-size are problem-dependent and do not depend on the sample size n. Notably, the statistically optimal rate of order σ²d/n is achieved for all constant minibatch sizes, and the particular choice of b only impacts the constants in the decay rate of the bias (which is of the lower order 1/n² anyway). That is, choosing the right minibatch size does not involve a tradeoff between statistical and optimization error. In contrast, our work shows that setting a large batch size b_n ≃ n^α, α ∈ [0, 1], yields optimality guarantees in the infinite-dimensional setting. This is due to the fact that choosing the optimal values for parameters like γ and b involves a tradeoff between the bias and the variance in this setting. [19] also show that tail-averaging improves the rate at which the initial bias decays if the smallest eigenvalue of the covariance matrix, σ_min(Σ), is lower-bounded by a constant. Their analysis of this algorithmic component is based on observations similar to the ones we made in Section 3. Our analysis significantly extends these arguments by showing the usefulness of tail-averaging in cases where σ_min is not necessarily lower-bounded.

5 Numerical Illustration

This section provides an empirical illustration of the effects characterized in the previous sections.
We focus on two aspects of our results: the benefits of tail-averaging over uniform averaging as a function of the smoothness parameter r, and the impact of tail-averaging on the best choice of minibatch size. All experiments are conducted on synthetic data with d = 1,000 dimensions, generated as follows. We set Σ as a diagonal matrix with entries Σ_ii = i^{−1/ν} and choose w* = Σ^r e, where e is the vector of all 1's. The covariates X_t are generated from a Gaussian distribution with covariance Σ, and labels are generated as Y_t = ⟨w*, X_t⟩ + ε_t, where ε_t is standard Gaussian noise. For all experiments, we choose ν = 1/2 and n = 10,000. With this choice of parameters, we have seen that increasing d beyond 100 does not yield any noticeable change in the results, indicating that setting d = 1,000 is an appropriate approximation to the infinite-dimensional setting.

Our first experiment illustrates the saturation effect described in Section 3 (cf. Eqs. (10), (11)) by plotting the respective excess risks of uniformly-averaged and tail-averaged SGD as a function of r (Figure 1(a)). We fix b = 1 and set γ = n^{−(2r+ν)/(2r+1+ν)}, as recommended in Corollary 1. As predicted by our theoretical results, the two algorithms behave similarly for smaller values of r, but uniformly-averaged SGD noticeably starts to lag behind its tail-averaged counterpart for values of r exceeding 1/2, eventually flattening out and showing no improvement as r increases. On the other hand, the performance of the tail-averaged version continues to improve for large values of r, confirming that this algorithm can indeed massively benefit from favorable structural properties of the data.

In our second experiment, we study the performance of both tail- and uniformly-averaged SGD as a function of the step-size γ and the minibatch size b (Figure 1(b), (c)).
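The data-generating process just described is easy to reproduce. The following sketch (our own, with smaller n and d and an arbitrary step-size and minibatch setting, not the paper's exact configuration) generates data this way and compares uniformly- and tail-averaged minibatch SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, nu, r = 100, 5000, 0.5, 1.0        # scaled down from d = 1,000 and n = 10,000
b, gamma = 10, 0.1                       # minibatch size and constant step-size
T = n // b                               # a single pass over the data

sigma = np.arange(1, d + 1) ** (-1.0 / nu)    # diagonal of Sigma
w_star = sigma ** r                           # w* = Sigma^r e with e = (1,...,1)
X = rng.normal(size=(n, d)) * np.sqrt(sigma)  # Gaussian covariates, covariance Sigma
y = X @ w_star + rng.normal(size=n)           # standard Gaussian label noise

w, iterates = np.zeros(d), []
for t in range(T):
    idx = rng.integers(0, n, size=b)              # with-replacement sampling
    Xb, yb = X[idx], y[idx]
    w = w - gamma * Xb.T @ (Xb @ w - yb) / b      # minibatch SGD step
    iterates.append(w.copy())

def excess_risk(v):                               # ||Sigma^{1/2}(v - w*)||^2
    return float(np.sum(sigma * (v - w_star) ** 2))

er_unif = excess_risk(np.mean(iterates, axis=0))           # uniform average
er_tail = excess_risk(np.mean(iterates[T // 2:], axis=0))  # tail average, S = T/2
```

The two excess risks can then be compared across r, b and γ to reproduce the qualitative behavior shown in Figure 1.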
We fix r = 1/2 and set T = n/b for all tested values of b, amounting to a single pass over the data. Again, as theory predicts, performance remains largely constant for both algorithms as long as γ · b remains constant, until a critical threshold stepsize is reached. However, it is readily apparent from the figures that tail averaging permits the use of larger minibatch sizes, therefore allowing for more efficient parallelization.

Figure 1: Illustration of the effects of tail-averaging and minibatching. (a) Excess risk as a function of r with uniform and tail averaging. (b) Excess risk as a function of stepsize γ and minibatch size b for SGD with uniform averaging. (c) Excess risk as a function of stepsize γ and minibatch size b for SGD with tail-averaging.

Acknowledgments

Nicole Mücke is supported by the German Research Foundation under DFG Grant STE 1074/4-1. Gergely Neu was supported by La Caixa Banking Foundation through the Junior Leader Postdoctoral Fellowship Program and a Google Faculty Research Award. Lorenzo Rosasco acknowledges the financial support of the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826.

References

[1] R. Aguech, E. Moulines, and P. Priouret. On a perturbation approach for the analysis of stochastic tracking algorithms. SIAM J. Control and Optimization, 39(3):872–899, 2000.

[2] F. R. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In NIPS, pages 773–781, 2013.

[3] F. Bauer, S. Pereverzev, and L. Rosasco.
On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, 2007.

[4] D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.

[5] G. Blanchard and N. Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971–1013, 2017.

[6] A. Caponnetto and E. De Vito. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2006.

[7] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1647–1655. Curran Associates, Inc., 2011.

[8] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.

[9] E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5(1):59–85, 2005.

[10] A. Dieuleveut and F. Bach. Nonparametric stochastic approximation with large step-sizes. Annals of Statistics, 44(4):1363–1399, 2016.

[11] A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. Journal of Machine Learning Research, 18(101):1–51, 2017.

[12] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems, volume 375. Springer Science & Business Media, 1996.

[13] G. Garrigos, L. Rosasco, and S. Villa. Convergence of the forward-backward algorithm: Beyond the worst case with the help of geometry. arXiv:1703.09477, 2017.

[14] Z.-C. Guo, S.-B. Lin, and D.-X. Zhou.
Learning theory of distributed spectral algorithms. Inverse Problems, 33(7):074009, 2017.

[15] L. Györfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31–61, 1996.

[16] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 1225–1234. JMLR.org, 2016.

[17] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[18] E. Hazan and S. Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.

[19] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Parallelizing stochastic gradient descent for least squares regression: Mini-batching, averaging, and model misspecification. Journal of Machine Learning Research, 18(223):1–42, 2018.

[20] J. Lin, R. Camoriano, and L. Rosasco. Generalization properties and implicit regularization for multiple passes SGM. CoRR, abs/1605.08375, 2016.

[21] J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.

[22] J. Lin, A. Rudi, L. Rosasco, and V. Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 2018.

[23] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[24] G. Neu and L. Rosasco. Iterate averaging as regularization for stochastic gradient descent. In COLT, volume 75 of Proceedings of Machine Learning Research, pages 3222–3242.
PMLR, 2018.

[25] F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.

[26] L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. CoRR, abs/1805.10074, 2018.

[27] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992.

[28] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv:1109.5647, 2011.

[29] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1571–1578, 2012.

[30] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[31] L. Rosasco and S. Villa. Learning with incremental iterative regularization. In NIPS, pages 1630–1638, 2015.

[32] D. Ruppert. Efficient estimations from a slowly convergent Robbins–Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

[33] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[34] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[35] S. Smale and Y. Yao. Online learning algorithms. Foundations of Computational Mathematics, 6(2):145–170, 2006.

[36] P. Tarres and Y. Yao.
Online learning as stochastic approximation of regularization paths: Optimality and almost-sure convergence. IEEE Transactions on Information Theory, 60(9):5716–5735, 2014.

[37] Y. Ying and M. Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561–596, 2008.

[38] T. Zhang. Effective dimension and generalization of kernel learning. In Advances in Neural Information Processing Systems, 2003.