{"title": "Implicit Regularization of Accelerated Methods in Hilbert Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 14481, "page_last": 14491, "abstract": "We study learning properties of accelerated gradient descent methods for linear least-squares in Hilbert spaces. We analyze the implicit regularization properties of Nesterov acceleration and a variant of heavy-ball in terms of corresponding learning error bounds. Our results show that acceleration can provides faster bias decay than gradient descent, but also suffers of a more unstable behavior. As a result acceleration cannot be in general expected to improve learning accuracy with respect to gradient descent, but rather to achieve the same accuracy with reduced computations. Our theoretical results are validated by numerical simulations. Our analysis is based on studying suitable polynomials induced by the accelerated dynamics and combining spectral techniques with concentration inequalities.", "full_text": "Implicit Regularization of Accelerated Methods in\n\nHilbert Spaces\n\nNicol\u00f2 Pagliana\n\nUniversity of Genoa\nDIMA & MaLGa\n\npagliana@dima.unige.it\n\nLorenzo Rosasco\nUniversity of Genoa\n\nDIBRIS, MaLGa, IIT & MIT\n\nlrosasco@mit.edu\n\nAbstract\n\nWe study learning properties of accelerated gradient descent methods for linear\nleast-squares in Hilbert spaces. We analyze the implicit regularization properties\nof Nesterov acceleration and a variant of heavy-ball in terms of corresponding\nlearning error bounds. Our results show that acceleration can provides faster bias\ndecay than gradient descent, but also suffers of a more unstable behavior. As a\nresult acceleration cannot be in general expected to improve learning accuracy with\nrespect to gradient descent, but rather to achieve the same accuracy with reduced\ncomputations. 
Our theoretical results are validated by numerical simulations. Our analysis is based on studying suitable polynomials induced by the accelerated dynamics and combining spectral techniques with concentration inequalities.\n\n1 Introduction\n\nThe focus on optimization is a major trend in modern machine learning, where efficiency is mandatory in large scale problems [4]. Among other solutions, first order methods have emerged as methods of choice. While these techniques are known to have potentially slow convergence guarantees, they also have low iteration costs, ideal in large scale problems. Consequently, the question of accelerating first order methods while keeping their small iteration costs has received much attention, see e.g. [33]. Since machine learning solutions are typically derived by minimizing an empirical objective (the training error), most theoretical studies have focused on the error estimates for this latter quantity. However, it has recently become clear that optimization can play a key role from a statistical point of view when the goal is to minimize the expected (test) error. On the one hand, iterative optimization implicitly biases the search for a solution, e.g. converging to suitable minimal norm solutions [27]. On the other hand, the number of iterations parameterizes paths of solutions of different complexity [31]. The idea that optimization can implicitly perform regularization has a long history. In the context of linear inverse problems, it is known as iterative regularization [11]. It is also an old trick for training neural networks, where it is called early stopping [15]. The question of understanding the generalization properties of deep learning applications has recently sparked a lot of attention on this approach, which has been referred to as implicit regularization, see e.g. [13]. 
Establishing the regularization properties of iterative optimization requires the study of the corresponding expected error by combining optimization and statistical tools. First results in this sense focused on linear least squares with gradient descent and go back to [6, 31]; see also [25] and references therein for improvements. Subsequent works have started considering other loss functions [16], multi-linear models [13] and other optimization methods, e.g. stochastic approaches [26, 18, 14].\n\nIn this paper, we consider the implicit regularization properties of acceleration. We focus on linear least squares in Hilbert spaces, because this setting allows us to derive sharp results, and working in infinite dimension magnifies the role of regularization. Unlike in finite dimension, learning bounds are possible only if some form of regularization is considered.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn particular, we consider two of the most popular accelerated gradient approaches, based on Nesterov acceleration [22] and (a variant of) the heavy-ball method [24]. Both methods achieve acceleration by exploiting a so-called momentum term, which uses not only the previous iterate, but the previous two iterates at each step. Considering a suitable bias-variance decomposition, our results show that accelerated methods have a behavior qualitatively different from basic gradient descent. While the bias decays faster with the number of iterations, the variance increases faster too. The two effects balance out, showing that accelerated methods achieve the same optimal statistical accuracy of gradient descent, but they can indeed do this with fewer computations. Our analysis takes advantage of the linear structure induced by least squares to exploit tools from spectral theory. 
Indeed, the characterization of both convergence and stability relies on the study of suitable spectral polynomials defined by the iterates. The idea that accelerated methods can be more unstable has been pointed out in [10] in a pure optimization context; our results quantify this effect from a statistical point of view. Close to our results is the study in [9], where a stability approach [5] is considered to analyze gradient methods for different loss functions.\n\n2 Learning with (accelerated) gradient methods\n\nLet the input space X be a separable Hilbert space (with scalar product \u27e8\u00b7,\u00b7\u27e9 and induced norm \u2016\u00b7\u2016) and let the output space be R.1 Let \u03c1 be an unknown probability measure on the input-output space X \u00d7 R, \u03c1X the induced marginal probability on X, and \u03c1(\u00b7|x) the conditional probability measure on R given x \u2208 X. We make the following standard assumption: there exists \u03ba > 0 such that\n\n\u27e8x, x\u2032\u27e9 \u2264 \u03ba2  \u2200x, x\u2032 \u2208 X, \u03c1X-almost surely.  (1)\n\nThe goal of least-squares linear regression is to solve the expected risk minimization problem\n\ninf_{w \u2208 X} E(w),  E(w) = \u222b_{X \u00d7 R} (\u27e8w, x\u27e9 \u2212 y)2 d\u03c1(x, y),  (2)\n\nwhere \u03c1 is known only through the n i.i.d. samples (x1, y1), . . . , (xn, yn). In the following, we measure the quality of an approximate solution \u02c6w with the excess risk\n\nE(\u02c6w) \u2212 inf_{w \u2208 X} E .\n\nThe search for a solution is often based on replacing (2) with empirical risk minimization (ERM)\n\nmin_{w \u2208 X} \u02c6E(w),  \u02c6E(w) = (1/n) \u2211_{i=1}^n (\u27e8w, xi\u27e9 \u2212 yi)2 .  (3)\n\nFor least squares an ERM solution can be computed in closed form using a direct solver. 
However, for large problems, iterative solvers are preferable, and we next describe the approaches we consider. First, it is useful to rewrite the ERM in vector notation. Let y \u2208 Rn with (y)i = yi and X : X \u2192 Rn such that (X w)i = \u27e8w, xi\u27e9 for i = 1, . . . , n. Here the norm \u2016\u00b7\u2016n is the norm in Rn multiplied by 1/\u221an. Let X\u2217 : Rn \u2192 X be the adjoint of X, defined by X\u2217 y = (1/n) \u2211_{i=1}^n xi yi. Then, ERM becomes\n\nmin_{w \u2208 X} \u02c6E(w) = \u2016X w \u2212 y\u20162_n .  (4)\n\n2.1 Gradient descent and accelerated methods\n\nGradient descent serves as a reference approach throughout the paper. For problem (4) it becomes\n\n\u02c6w_{t+1} = \u02c6w_t \u2212 \u03b1 X\u2217 (X \u02c6w_t \u2212 y)  (5)\n\nwith initial point \u02c6w_0 = 0 and a step-size \u03b1 satisfying \u03b1 < 1/\u03ba2.2 The progress made by gradient descent at each iteration can be slow, and the idea behind acceleration is to use the information of the previous directions in order to improve the convergence rate of the algorithm.\n\n1 As shown in the Appendix, this choice allows us to recover nonparametric kernel learning as a special case.\n2 The step-size \u03b1 satisfies the condition 0 < \u03b1\u2016X\u20162_op < 1, where \u2016\u00b7\u2016op denotes the operator norm. Since the operator X is bounded by \u03ba (which means \u2016X\u2016op \u2264 \u03ba) it is sufficient to assume \u03b1 < 1/\u03ba2.\n\nHeavy-ball\n\nHeavy-ball is a popular accelerated method that adds the momentum term \u02c6w_t \u2212 \u02c6w_{t\u22121} at each iteration:\n\n\u02c6w_{t+1} = \u02c6w_t \u2212 \u03b1 X\u2217 (X \u02c6w_t \u2212 y) + \u03b2(\u02c6w_t \u2212 \u02c6w_{t\u22121})  (6)\n\nwith \u03b1, \u03b2 \u2265 0; the case \u03b2 = 0 reduces to gradient descent. In the quadratic case we consider, it is also called the Chebyshev iterative method. 
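As a quick illustration (not from the paper), iterations (5) and (6) can be sketched in a few lines of pure Python on a toy least-squares problem; the data, the step-size and the momentum value below are assumptions chosen only for the example.

```python
# Minimal sketch (not the authors' code) of gradient descent (5) and
# constant-parameter heavy-ball (6) on a toy problem whose exact ERM
# solution is w* = (1, 2). All numerical choices here are illustrative.

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
n, d = len(X), 2

def grad(w):
    # w -> X*(X w - y); the adjoint X* carries the 1/n factor
    r = [sum(w[k] * X[i][k] for k in range(d)) - y[i] for i in range(n)]
    return [sum(r[i] * X[i][k] for i in range(n)) / n for k in range(d)]

def run(alpha, beta, steps):
    w_prev, w = [0.0] * d, [0.0] * d          # w_{-1} = w_0 = 0
    for _ in range(steps):
        g = grad(w)
        w, w_prev = [w[k] - alpha * g[k] + beta * (w[k] - w_prev[k])
                     for k in range(d)], w
    return w

# alpha = 0.45 < 1/kappa^2 since kappa^2 = sup_i ||x_i||^2 = 2 here
w_gd = run(alpha=0.45, beta=0.0, steps=2000)   # gradient descent (5)
w_hb = run(alpha=0.45, beta=0.3, steps=2000)   # heavy-ball (6)
```

With `beta=0.0` the update reduces to gradient descent, matching the remark after (6); both runs converge to the exact solution on this well-posed toy problem.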
The optimization properties of heavy-ball have been studied extensively [24, 32]. Here, we consider the following variant. Let \u03bd > 1, and consider the varying-parameter heavy-ball obtained by replacing \u03b1, \u03b2 in (6) with \u03b1_{t+1}, \u03b2_{t+1} defined as\n\n\u03b1_t = (4/\u03ba2) (2t + 2\u03bd \u2212 1)(t + \u03bd \u2212 1) / [(t + 2\u03bd \u2212 1)(2t + 4\u03bd \u2212 1)] ,\n\n\u03b2_t = (t \u2212 1)(2t \u2212 3)(2t + 2\u03bd \u2212 1) / [(t + 2\u03bd \u2212 1)(2t + 4\u03bd \u2212 1)(2t + 2\u03bd \u2212 3)] ,\n\nfor t > 0 and with initialization \u02c6w_{\u22121} = \u02c6w_0 = 0, \u03b1_1 = (1/\u03ba2)(4\u03bd + 1)/(4\u03bd + 2), \u03b2_1 = 0. With this choice, and considering the least-squares problem, this algorithm is known as the \u03bd-method in the inverse problem literature (see e.g. [11]). This seemingly complex choice of parameters allows relating the approach to a suitable orthogonal polynomial recursion, as we discuss later.\n\nNesterov acceleration\n\nThe second form of gradient acceleration we consider is the popular Nesterov acceleration [22]. In our setting, it corresponds to the iteration\n\n\u02c6w_{t+1} = \u02c6v_t \u2212 \u03b1 X\u2217 (X \u02c6v_t \u2212 y) ,  \u02c6v_t = \u02c6w_t + \u03b2_t (\u02c6w_t \u2212 \u02c6w_{t\u22121})  (7)\n\nwith the two initial points \u02c6w_{\u22121} = \u02c6w_0 = 0, and the sequence \u03b2_t chosen as\n\n\u03b2_t = (t \u2212 1)/(t + \u03b2) ,  \u03b2 \u2265 1 .  (8)\n\nDifferently from heavy-ball, Nesterov acceleration uses the momentum term also in the evaluation of the gradient. Also in this case optimization results are well known [1, 29]. Here, as above, optimization results refer to solving the ERM problems (3), (4), whereas in the following we study to what extent the above iterations can be used to minimize the expected error (2). 
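Both accelerated iterations above are short to state in code. The following pure-Python sketch (illustrative, not the authors' implementation) runs the \u03bd-method variant of heavy-ball and Nesterov acceleration (7)-(8) on the same kind of toy least-squares problem; the data, \u03bd = 1, \u03b2 = 2 and the iteration counts are assumptions for the example.

```python
# Illustrative sketch (not from the paper) of the nu-method heavy-ball
# variant and of Nesterov acceleration (7)-(8) on a toy least-squares
# problem with exact solution w* = (1, 2). All constants are assumptions.

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
n, d, kappa2 = len(X), 2, 2.0           # kappa^2 = sup_i ||x_i||^2

def grad(w):                             # w -> X*(X w - y), adjoint carries 1/n
    r = [sum(w[k] * X[i][k] for k in range(d)) - y[i] for i in range(n)]
    return [sum(r[i] * X[i][k] for i in range(n)) / n for k in range(d)]

# --- nu-method (varying-parameter heavy-ball), nu = 1 ---
nu = 1.0
w_prev, w = [0.0] * d, [0.0] * d
for t in range(1, 501):
    if t == 1:                           # special initialization from the text
        a, b = (1 / kappa2) * (4 * nu + 1) / (4 * nu + 2), 0.0
    else:
        a = (4 / kappa2) * ((2 * t + 2 * nu - 1) * (t + nu - 1)
                            / ((t + 2 * nu - 1) * (2 * t + 4 * nu - 1)))
        b = ((t - 1) * (2 * t - 3) * (2 * t + 2 * nu - 1)
             / ((t + 2 * nu - 1) * (2 * t + 4 * nu - 1) * (2 * t + 2 * nu - 3)))
    g = grad(w)
    w, w_prev = [w[k] - a * g[k] + b * (w[k] - w_prev[k]) for k in range(d)], w
w_nu = w

# --- Nesterov acceleration (7)-(8) with beta = 2, alpha < 1/kappa^2 ---
alpha = 0.45
w_prev, w = [0.0] * d, [0.0] * d
for t in range(1, 5001):
    bt = (t - 1) / (t + 2)               # beta_t = (t - 1)/(t + beta)
    v = [w[k] + bt * (w[k] - w_prev[k]) for k in range(d)]
    g = grad(v)
    w, w_prev = [v[k] - alpha * g[k] for k in range(d)], w
w_nest = w
```

On this well-posed toy problem both accelerated iterations approach the exact solution; the interesting regime studied in the paper is of course the noisy, infinite-dimensional one.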
In the next section, we discuss a spectral approach which will be instrumental towards this goal.\n\n3 Spectral filtering for accelerated methods\n\nLeast squares allows us to consider spectral approaches to study the properties of gradient methods for learning. We illustrate these ideas for gradient descent before considering accelerated methods.\n\nGradient descent as spectral filtering\n\nNote that by a simple (and classical) induction argument, gradient descent can be written as\n\n\u02c6w_t = \u03b1 \u2211_{j=0}^{t\u22121} (I \u2212 \u03b1 \u02c6\u03a3)^j X\u2217 y ,  with  \u02c6\u03a3 = X\u2217 X .\n\nEquivalently, using spectral calculus,\n\n\u02c6w_t = g_t(\u02c6\u03a3) X\u2217 y ,\n\nwhere g_t are the polynomials g_t(\u03c3) = \u03b1 \u2211_{j=0}^{t\u22121} (1 \u2212 \u03b1\u03c3)^j for all \u03c3 \u2208 (0, \u03ba2] and t \u2208 N. Note that the polynomials g_t are bounded by \u03b1t. A first observation is that g_t(\u03c3)\u03c3 converges to 1 as t \u2192 \u221e, since g_t(\u03c3) converges to 1/\u03c3. A second observation is that the residual polynomials r_t(\u03c3) = 1 \u2212 \u03c3 g_t(\u03c3), which are all bounded by 1, control ERM convergence, since\n\n\u2016X \u02c6w_t \u2212 y\u2016n = \u2016X g_t(\u02c6\u03a3) X\u2217 y \u2212 y\u2016n = \u2016g_t(\u02c6\u03a3) \u02c6\u03a3 y \u2212 y\u2016n = \u2016r_t(\u02c6\u03a3) y\u2016n \u2264 \u2016r_t(\u02c6\u03a3)\u2016op \u2016y\u2016n .\n\nIn particular, if y is in the range of \u02c6\u03a3^r for some r > 0 (source condition on y), improved convergence rates can be derived noting that, by an easy calculation,\n\n|r_t(\u03c3)\u03c3^q| \u2264 (q/\u03b1)^q (1/t)^q .\n\nAs we show in Section 4, considering the polynomials g_t and r_t allows us to study not only ERM but also 
expected risk minimization (2), by relating gradient methods to their infinite sample limit. Further, we show how similar reasoning holds for accelerated methods. In order to do so, it is useful to first define the characterizing properties of g_t and r_t.\n\n3.1 Spectral filtering\n\nThe following definition abstracts the key properties of the functions g_t and r_t, often called spectral filtering functions [2]. Following the classical definition, we replace t with a generic parameter \u03bb.\n\nDefinition 1. The family {g_\u03bb}_{\u03bb\u2208(0,1]} is called a spectral filtering function if the following conditions hold:\n(i) There exists a constant E < +\u221e such that, for any \u03bb \u2208 (0, 1],\n\nsup_{\u03c3\u2208(0,\u03ba2]} |g_\u03bb(\u03c3)| \u2264 E/\u03bb .  (9)\n\n(ii) Let r_\u03bb(\u03c3) = 1 \u2212 \u03c3 g_\u03bb(\u03c3); there exists a constant F0 such that, for any \u03bb \u2208 (0, 1],\n\nsup_{\u03c3\u2208(0,\u03ba2]} |r_\u03bb(\u03c3)| \u2264 F0 .  (10)\n\nDefinition 2 (Qualification). The qualification of the spectral filtering function {g_\u03bb}_\u03bb is the maximum parameter q such that for any \u03bb \u2208 (0, 1] there exists a constant Fq such that\n\nsup_{\u03c3\u2208(0,\u03ba2]} |r_\u03bb(\u03c3)\u03c3^q| \u2264 Fq \u03bb^q .  (11)\n\nMoreover, we say that a filtering function has qualification \u221e if (11) holds for every q > 0. Methods with finite qualification might have slow convergence rates in certain regimes: the smaller the qualification, the worse the rates can be.\n\nThe discussion in the previous section shows that gradient descent defines a spectral filtering function with \u03bb = 1/t. More precisely, the following holds.\n\nProposition 1. Assume \u03bb = 1/t for t \u2208 N; then the polynomials g_t related to the gradient descent iterates, defined in (5), are a filtering function with parameters E = \u03b1 and F0 = 1. 
Moreover, it has qualification \u221e with parameters Fq = (q/\u03b1)^q.\n\nThe above result is classical and we report a proof in the appendix for completeness. Next, we discuss analogous results for accelerated methods and then compare the different spectral filtering functions.\n\n3.2 Spectral filtering for accelerated methods\n\nFor heavy-ball (6) the following result holds.\n\nProposition 2. Assume \u03ba \u2264 1, let \u03bd > 0 and \u03bb = 1/t2 for t \u2208 N; then the polynomials g_t related to the heavy-ball method (6) are a filtering function with parameters E = 2 and F0 = 1. Moreover, there exists a positive constant c_\u03bd < +\u221e such that the \u03bd-method has qualification \u03bd.\n\nThe proof of the above proposition follows by combining several intermediate results from [11]. The key idea is to show that the residual polynomials defined by the heavy-ball iteration form a sequence of orthogonal polynomials with respect to the weight function\n\n\u03c9_\u03bd(\u03c3) = \u03c3^{2\u03bd} / (\u03c3^{1/2} (1 \u2212 \u03c3)^{1/2}) ,\n\nwhich is a so-called shifted Jacobi weight. Results on orthogonal polynomials can then be used to characterize the corresponding spectral filtering function. The following proposition considers Nesterov acceleration.\n\nProposition 3. Assume \u03bb = 1/t2; then the polynomials g_t related to the Nesterov iterates (7) are a filtering function with constants E = 2\u03b1 and F0 = 1. Moreover, the qualification of this method is at least 1/2, with constants Fq = (\u03b22/\u03b1)^q.\n\nFiltering properties of the Nesterov iteration (7) have been studied recently in the context of inverse problems [23]. 
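The gradient-descent filter has a closed form, so the conditions of Definitions 1 and 2 can be checked numerically; the following sketch (illustrative, with assumed values \u03b1 = \u03ba = 1 and t = 50) evaluates g_t and r_t on a grid of \u03c3 values.

```python
# Numerical sanity check (illustrative, not from the paper) of Definition 1
# for the gradient descent filter with lambda = 1/t:
#   g_t(s) = a * sum_{j<t} (1 - a s)^j,   r_t(s) = 1 - s g_t(s) = (1 - a s)^t
#   (9)  sup |g_t(s)| <= E/lambda = a * t         (E = a)
#   (10) sup |r_t(s)| <= 1                        (F0 = 1)
#   (11) sup |r_t(s) s| <= (1/a) * (1/t)          (qualification bound, q = 1)

a, kappa2, t = 1.0, 1.0, 50
grid = [i / 1000 * kappa2 for i in range(1, 1001)]        # s in (0, kappa^2]

g = lambda s: a * sum((1 - a * s) ** j for j in range(t))
r = lambda s: (1 - a * s) ** t                            # equals 1 - s g(s)

sup_g  = max(abs(g(s)) for s in grid)
sup_r  = max(abs(r(s)) for s in grid)
sup_rq = max(abs(r(s)) * s for s in grid)
```

All three suprema stay below the bounds stated in Proposition 1, and the same kind of check can be run for the accelerated filters by iterating (6) or (7) on scalar values of \u03c3.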
In Appendix 7.3 we provide a simplified proof based on studying the properties of suitable discrete dynamical systems defined by the Nesterov iteration (7).\n\n3.3 Comparing the different filter functions\n\nWe summarize the properties of the spectral filtering functions of the various methods for \u03ba = 1.\n\nMethod | E | F0 | Fq | Qualification\nGradient descent | 1 | 1 | q^q | \u221e\nHeavy-ball | 2 | 1 | c_\u03bd (q = \u03bd) | \u03bd\nNesterov | 2 | 1 | \u03b2^{2q} | \u2265 1/2\n\nThe main observation is that the properties of the spectral filtering functions corresponding to the different iterations depend on \u03bb = 1/t for gradient descent, but on \u03bb = 1/t2 for the accelerated methods. As we see in the next section, this leads to substantially different learning properties. Further, we can see that gradient descent is the only algorithm with qualification \u221e, even if the parameter Fq = q^q can be very large. The accelerated methods seem to have smaller qualification. In particular, the heavy-ball method can attain a high qualification, depending on \u03bd, but the constant c_\u03bd is unknown and could be large. For the Nesterov accelerated method, the qualification is at least 1/2, and it is an open question whether this bound is tight or higher qualification can be attained.\n\nIn the next section, we show how the properties of the spectral filtering functions can be exploited to study the excess risk of the corresponding iterations.\n\n4 Learning properties for accelerated methods\n\nWe first consider a basic scenario and then a more refined analysis leading to a more general setting and potentially faster learning rates.\n\n4.1 Attainable case\n\nConsider the following basic assumption.\n\nAssumption 1. Assume there exists M > 0 such that |y| < M \u03c1-almost surely, and w\u2217 \u2208 X such that E(w\u2217) = inf_X E.\n\nThen the following result can be derived.\n\nTheorem 1. 
Under Assumption 1, let \u02c6w^{GD}_t and \u02c6w^{acc}_t be the t-th iterates of gradient descent (5) and of an accelerated version given by (6) or (7), respectively. Assume the sample size n to be large enough and let \u03b4 \u2208 (0, 1/2); then there exist two positive constants C1 and C2 such that, with probability at least 1 \u2212 \u03b4,\n\nE(\u02c6w^{GD}_t) \u2212 inf_H E \u2264 C1 (1/t + t/n) log^2(2/\u03b4) ,\n\nE(\u02c6w^{acc}_t) \u2212 inf_H E \u2264 C2 (1/t2 + t2/n) log^2(2/\u03b4) ,\n\nwhere the constants C1 and C2 do not depend on n, t, \u03b4, but depend on the chosen optimization method. Moreover, choosing the stopping rules t_{GD} = O(n^{1/2}) and t_{acc} = O(n^{1/4}), both algorithms achieve a learning rate of order 1/\u221an.\n\nThe proof of the above results is given in the appendix; the novel part is the one concerning accelerated methods, particularly Nesterov acceleration. The result shows how the number of iterations controls the learning properties both for gradient descent and accelerated gradient descent. In this sense, implicit regularization occurs in all these approaches. For any t, the error is split into two contributions. Inspecting the proof, it is easy to see that the first term in the bound comes from the convergence properties of the algorithm with infinite data. Hence the optimization error translates into a bias term. The decay for accelerated methods is much faster than for gradient descent. The second term arises from comparing the empirical iterates with their infinite sample (population) limit. It is a variance term depending on the sampling of the data, and hence decreases with the sample size. For all methods, this term increases with the number of iterations, indicating that the empirical and population iterations are increasingly different. However, the behavior is markedly worse for accelerated methods. 
The benefit of acceleration seems to be balanced out by this more unstable behavior. In fact, the benefit of acceleration becomes apparent when balancing the error terms to obtain a final bound. The obtained bound is the same for gradient descent and accelerated methods, and is indeed optimal since it matches corresponding lower bounds [3, 7]. However, the number of iterations needed by accelerated methods is the square root of those needed by gradient descent, indicating that a substantial computational gain can be attained. Next we show how these results can be generalized to a more general setting, considering both weaker and stronger assumptions, corresponding to harder or easier learning problems.\n\n4.2 More refined result\n\nTheorem 1 is a simplified version of the more general result that we discuss in this section. We are interested in covering also the non-attainable case, that is, when there is no w\u2217 \u2208 X such that E(w\u2217) = inf_X E. In order to cover this case we have to introduce several more definitions and notations. In Appendix 8.2 we give a more detailed description of the general setting. Consider the space L2_{\u03c1X} of square-integrable functions with the norm \u2016f\u20162_{\u03c1X} = \u222b_X f(x)2 d\u03c1X(x), and extend the expected risk to L2_{\u03c1X} by defining E(f) = \u222b_{X\u00d7R} (f(x) \u2212 y)2 d\u03c1(x, y). Let H \u2286 L2_{\u03c1X} be the hypothesis space of functions such that f(x) = \u27e8w, x\u27e9 \u03c1X-almost surely. Recall that the minimizer of the expected risk over L2_{\u03c1X} is the regression function f_\u03c1(x) = \u222b_R y d\u03c1(y|x). The projection fH onto the closure of the hypothesis space H is defined as\n\nfH = arg min_g \u2016g \u2212 f_\u03c1\u2016_{\u03c1X} ,\n\nwhere the minimum is taken over g in the closure of H. Let L : L2_{\u03c1X} \u2192 L2_{\u03c1X} be the integral operator\n\nLf(x) = \u222b_X f(x\u2032) \u27e8x, x\u2032\u27e9 d\u03c1X(x\u2032) .\n\nThe first assumption we consider concerns the moments of the output variable and is more general than assuming the output variable y to be bounded, as before.\n\nAssumption 2. There exist positive constants Q and M such that, for all integers l \u2265 2,\n\n\u222b_R |y|^l d\u03c1(y|x) \u2264 (1/2) l! M^{l\u22122} Q2 ,  \u03c1X-almost surely.\n\nThis assumption is standard and satisfied in classification or regression with well-behaved noise. Under this assumption the regression function f_\u03c1 is bounded almost surely:\n\n|f_\u03c1(x)| \u2264 \u222b_R |y| d\u03c1(y|x) \u2264 (\u222b_R |y|2 d\u03c1(y|x))^{1/2} \u2264 Q .  (12)\n\nThe next assumptions are related to the regularity of the target function fH.\n\nAssumption 3. There exists a positive constant B such that the target function fH satisfies\n\n\u222b_X (fH(x) \u2212 f_\u03c1(x))2 x \u2297 x d\u03c1X(x) \u227c B2\u03a3 .\n\nThis assumption is needed to deal with the misspecification of the model. The last assumptions quantify the regularity of fH and the size (capacity) of the space H.\n\nAssumption 4. There exist g0 \u2208 L2_{\u03c1X} and r > 0 such that\n\nfH = L^r g0 ,  with \u2016g0\u2016_{\u03c1X} \u2264 R.\n\nMoreover, we assume that there exist \u03b3 \u2265 1 and a positive constant c_\u03b3 such that the effective dimension satisfies\n\nN(\u03bb) = Tr(L (L + \u03bbI)^{\u22121}) \u2264 c_\u03b3 \u03bb^{\u22121/\u03b3} .\n\nThe assumption on N(\u03bb) is always true for \u03b3 = 1 and c1 = \u03ba2, and it is satisfied when the eigenvalues \u03c3i of L decay as i^{\u2212\u03b3}. 
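The effective-dimension condition just stated can be illustrated numerically for polynomially decaying eigenvalues; the sketch below (not from the paper; \u03b3 = 2 and the truncation at 10^5 eigenvalues are assumptions) computes N(\u03bb) and compares it with \u03bb^{\u22121/\u03b3}.

```python
# Illustrative check (not from the paper) of the effective-dimension bound
# in Assumption 4 for polynomially decaying eigenvalues s_i = i^(-gamma):
#   N(lambda) = sum_i s_i / (s_i + lambda) <= c_gamma * lambda^(-1/gamma).
# gamma = 2 and the truncation at 10^5 eigenvalues are assumptions.

gamma = 2.0
sig = [i ** -gamma for i in range(1, 100001)]

def eff_dim(lam):
    return sum(s / (s + lam) for s in sig)

for lam in [1e-1, 1e-2, 1e-3]:
    ratio = eff_dim(lam) / lam ** (-1 / gamma)
    # the ratio stays bounded as lambda -> 0, as Assumption 4 requires
```

For \u03b3 = 2 the ratio is in fact bounded by \u03c0/2, since the sum is dominated by the integral of 1/(1 + \u03bbx\u00b2).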
We recall that the space H can be characterized in terms of the operator L; indeed\n\nH = L^{1/2}(L2_{\u03c1X}) .\n\nHence, the non-attainable case corresponds to considering r < 1/2.\n\nTheorem 2. Under Assumptions 2, 3 and 4, let \u02c6w^{GD}_t and \u02c6w^{acc}_t be the t-th iterates of gradient descent (5) and of an accelerated version given by (6) or (7), respectively. Assume the sample size n to be large enough, let \u03b4 \u2208 (0, 1/2), and assume r to be smaller than the qualification of the considered algorithm (and equal to 1/2 in the case of Nesterov acceleration); then there exist two positive constants C1 and C2 such that, with probability at least 1 \u2212 \u03b4,\n\nE(\u02c6w^{GD}_t) \u2212 inf_H E \u2264 C1 (1/t^{2r} + t^{1/\u03b3}/n) log^2(2/\u03b4) ,\n\nE(\u02c6w^{acc}_t) \u2212 inf_H E \u2264 C2 (1/t^{4r} + t^{2/\u03b3}/n) log^2(2/\u03b4) ,\n\nwhere the constants C1 and C2 do not depend on n, t, \u03b4, but depend on the chosen optimization method. Choosing the stopping rules t_{GD} = O(n^{\u03b3/(2\u03b3r+1)}) and t_{acc} = O(n^{\u03b3/(4\u03b3r+2)}), both gradient descent and accelerated methods achieve a learning rate of order O(n^{\u22122\u03b3r/(2\u03b3r+1)}).\n\nThe only reason why we do not consider r < 1/2 in the analysis of Nesterov acceleration is that our proof requires the qualification of the method to be larger than 1 for technical reasons. However, we think that our result can be extended to that case; furthermore, we believe the Nesterov qualification to be larger than 1, although it is an open question whether higher qualification can be attained. The proof of the above result is given in the appendix. The general structure of the bound is the same as in the basic setting, which is now recovered as a special case. 
However, in this more general form, the various terms in the bound now depend on the regularity assumptions on the problem. In particular, the variance depends on the behavior of the effective dimension, e.g. on the eigenvalue decay, while the bias depends on the regularity assumption on fH. The general comparison between gradient descent and accelerated methods follows the same lines as in the previous section. The faster bias decay of accelerated methods is contrasted by a more unstable behavior. As before, the benefit of accelerated methods becomes clear when deriving the optimal stopping time and the corresponding learning bound: they achieve the accuracy of gradient methods in considerably less time. While heavy-ball and Nesterov again behave similarly, here a subtle difference resides in their different qualifications, which in principle lead to different behavior for easy problems, that is, for large r and \u03b3. In this regime, gradient descent could work better, since it has infinite qualification. For problems in which r < 1/2 and \u03b3 = 1 the rates are worse than in the basic setting; hence these problems are hard.\n\n4.3 Related work\n\nIn the convex optimization framework a similar phenomenon was pointed out in [10], where the notion of inexact first-order oracle is introduced and the behaviour of several first-order methods of smooth convex optimization with such an oracle is studied. In particular, it is shown that the superiority of accelerated methods over standard gradient descent is no longer absolute when an inexact oracle is used. This is because acceleration suffers from the accumulation of the errors committed by the inexact oracle. A relevant result on the generalization properties of learning algorithms is [5], which introduces the notion of uniform stability and uses it to obtain generalization error bounds for regularization-based learning algorithms. 
Recently, to show the effectiveness of commonly used optimization algorithms in many large-scale learning problems, algorithmic stability has been established for stochastic gradient methods [14], as well as for any algorithm in situations where global minima are approximately achieved [8]. For Nesterov's accelerated gradient descent and the heavy-ball method, [9] provides stability upper bounds for the quadratic loss function in a finite-dimensional setting. All these approaches focus on the definition of uniform stability given in [5]. Our approach to the stability of a learning algorithm is instead based on the study of the filtering properties of accelerated methods together with concentration inequalities; we obtain upper bounds on the generalization error for the quadratic loss in an infinite-dimensional Hilbert space and generalize the bounds obtained in [9] by considering different regularity assumptions and by relaxing the hypothesis of the existence of a minimizer of the expected risk on the hypothesis space.\n\n5 Numerical simulation\n\nIn this section we show some numerical simulations to validate our results. We want to simulate the case in which the eigenvalues \u03c3i of the operator L are \u03c3i = i^{\u2212\u03b3} for some \u03b3 \u2265 1, as well as the non-attainable case r < 1/2. In order to do this, we observe that if we consider the kernel setting over a finite space Z = {z1, . . . , zN} of size N with the uniform probability distribution \u03c1Z, then the space L2(Z, \u03c1Z) becomes RN with the usual scalar product multiplied by 1/N. The operator L becomes an N \u00d7 N matrix whose entries are Li,j = K(zi, zj) for every i, j \u2208 {1, . . . , N}, where K is the kernel, which is fixed by the choice of the matrix L. We build the matrix L = U D U^T with U \u2208 R^{N\u00d7N} an orthogonal matrix and D a diagonal matrix with entries Di,i = i^{\u2212\u03b3}. The source condition becomes fH = L^r g0 for some g0 \u2208 RN, r > 0. 
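The construction just described can be sketched in a few lines; the following is an illustrative pure-Python version (not the authors' code), where the Householder reflection used as the orthogonal factor U, the sizes and the values of \u03b3, r and g0 are all assumptions for the example.

```python
# Sketch (not the authors' code) of the simulation construction in Section 5:
# an orthogonal U (here a simple Householder reflection, an assumption for
# the example), L = U D U^T with D_ii = i^(-gamma), and target f_H = L^r g0.
import math
import random

N, gamma, r = 50, 1.0, 0.25
random.seed(0)
v = [random.gauss(0, 1) for _ in range(N)]
nv2 = sum(x * x for x in v)
# Householder matrix U = I - 2 v v^T / ||v||^2 is orthogonal and symmetric
U = [[(1.0 if i == j else 0.0) - 2 * v[i] * v[j] / nv2
      for j in range(N)] for i in range(N)]
D = [(i + 1) ** -gamma for i in range(N)]

def from_spectrum(power):
    # assemble U diag(D^power) U^T; L = from_spectrum(1), L^r = from_spectrum(r)
    return [[sum(U[i][k] * D[k] ** power * U[j][k] for k in range(N))
             for j in range(N)] for i in range(N)]

L, Lr = from_spectrum(1.0), from_spectrum(r)
g0 = [1.0 / math.sqrt(N)] * N                  # assumed source element
fH = [sum(Lr[i][j] * g0[j] for j in range(N)) for i in range(N)]
```

Sharing the eigenvectors between L and L^r is what makes the source condition fH = L^r g0 hold exactly by construction.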
We simulate the observed output as y = fH + \u03b5, where \u03b5 is drawn from the normal distribution N(0, \u03c32) with variance \u03c32. The sampling operation can be seen as extracting n indices i1, . . . , in and building the kernel matrix \u02c6K_{j,k} = K(z_{ij}, z_{ik}) and the noisy labels \u02c6y_j = y_{ij} for every j, k \u2208 {1, . . . , n}. The representer theorem ensures that we can build our estimator \u02c6f \u2208 RN as \u02c6f(z) = \u2211_{j=1}^n K(z, z_{ij}) c_j, where the vector c depends on the chosen optimization algorithm and takes the form c = g_t(\u02c6K)\u02c6y. The excess risk of the estimator \u02c6f is given by \u2016\u02c6f \u2212 fH\u20162_{L2_Z}. For every algorithm considered, we run 50 repetitions, in which we sample the data space and compute the error \u2016\u02c6f_t \u2212 fH\u20162_{L2_Z}, where \u02c6f_t represents the estimator related to the t-th iteration of one of the considered algorithms; in the end we compute the mean and the variance of those errors.\n\nIn Figure 1 we simulate the error of all the algorithms considered for both the attainable and the non-attainable case. We observe that both heavy-ball and Nesterov acceleration provide faster convergence rates with respect to the gradient descent method, but the learning accuracy is not improved. We observe that the accelerated methods considered show similar behavior and that for \u201ceasy\u201d problems (large r) gradient descent can exploit its higher qualification and perform similarly to the accelerated methods.\n\nIn Figure 2 we show the test error on the real dataset pumadyn8nh (available at https://www.dcc.fc.up.pt/~ltorgo/Regression/puma.html). Also in this case we can observe the behaviors predicted by our theoretical results.\n\nFig. 1: Mean and variance of the error \u2016\u02c6f_t \u2212 fH\u20162_{L2_Z} for the t-th iteration of gradient descent (GD), Nesterov accelerated algorithm and heavy-ball (\u03bd = 1). 
Black dots show the absolute minimum of each curve. The parameters are N = 10^4, n = 10^2, γ = 1, σ = 0.5. We show the attainable case (r = 1/2) on the left, the "hard" case (r = 0.1 < 1/2) in the center and the "easy" case (r = 2 > 1/2) on the right.

Fig. 2: Test error on the real dataset pumadyn8nh using gradient descent (GD), Nesterov accelerated algorithm and heavy-ball. On the left we use a Gaussian kernel with σ = 1.2 and on the right a polynomial kernel of degree 9.

6 Conclusion

In this paper we have considered the implicit regularization properties of accelerated gradient methods for least squares in Hilbert spaces. Using spectral calculus we have characterized the properties of the different iterations in terms of suitable polynomials. Using the latter, we have derived error bounds in terms of suitable bias and variance terms. The main conclusion is that, under the considered assumptions, accelerated methods have smaller bias but also larger variance. As a byproduct, they achieve the same accuracy as vanilla gradient descent but with far fewer iterations. Our study opens a number of potential theoretical and empirical research directions. From a theoretical point of view, it would be interesting to consider other learning regimes, for example classification problems, different loss functions, or regularity assumptions beyond the classical nonparametric ones, e.g. misspecified models and fast eigenvalue decays (Gaussian kernel). From an empirical point of view, it would be interesting to carry out a more thorough investigation on a larger number of simulated and real datasets of varying dimension.

Acknowledgments

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and the Italian Institute of Technology.
We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla K40 GPU used for this research. L. R. acknowledges the financial support of the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826. N. P. would like to thank Murata Tomoya for the useful observations.

References

[1] Hedy Attouch, Zaki Chbani, and Hassan Riahi. Rate of convergence of the Nesterov accelerated gradient method in the subcritical case α ≤ 3. ESAIM: Control, Optimisation and Calculus of Variations, 25:2, 2019.

[2] Luca Baldassarre, Lorenzo Rosasco, Annalisa Barla, and Alessandro Verri. Multi-output learning via spectral filtering. Machine Learning, 87(3):259–301, 2012.

[3] Gilles Blanchard and Nicole Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971–1013, 2018.

[4] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

[5] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[6] Peter Bühlmann and Bin Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.

[7] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[8] Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. arXiv preprint arXiv:1710.08402, 2017.

[9] Yuansi Chen, Chi Jin, and Bin Yu.
Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619, 2018.

[10] Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.

[11] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of Inverse Problems, volume 375. Springer Science & Business Media, 1996.

[12] Junichi Fujii, Masatoshi Fujii, Takayuki Furuta, and Ritsuo Nakamoto. Norm inequalities equivalent to Heinz inequality. Proceedings of the American Mathematical Society, 118(3):827–830, 1993.

[13] Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018.

[14] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

[15] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

[16] Junhong Lin, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties and implicit regularization for multiple passes SGM. In International Conference on Machine Learning, pages 2340–2348, 2016.

[17] Junhong Lin and Volkan Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectral-regularization algorithms. stat, 1050:22, 2018.

[18] Junhong Lin and Lorenzo Rosasco. Optimal learning for multi-pass stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 4556–4564, 2016.

[19] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher.
Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 2018.

[20] Peter Mathé and Sergei Pereverzev. Regularization of some linear ill-posed problems with discretized random noisy data. Mathematics of Computation, 75(256):1913–1929, 2006.

[21] Peter Mathé and Sergei V. Pereverzev. Moduli of continuity for operator valued functions. Numerical Functional Analysis and Optimization, 23(5-6):623–631, 2002.

[22] Yurii E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.

[23] Andreas Neubauer. On Nesterov acceleration for Landweber iteration of linear ill-posed problems. Journal of Inverse and Ill-posed Problems, 25(3):381–390, 2017.

[24] Boris T. Polyak. Introduction to Optimization. Technical report, 1987.

[25] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1):335–366, 2014.

[26] Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.

[27] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

[28] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

[29] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.

[30] Gábor Szegő. Orthogonal Polynomials, volume 23.
American Mathematical Soc., 1939.

[31] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

[32] S. K. Zavriev and F. V. Kostyuk. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336–341, 1993.

[33] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.