{"title": "A Differential Equation for Modeling Nesterov\u2019s Accelerated Gradient Method: Theory and Insights", "book": "Advances in Neural Information Processing Systems", "page_first": 2510, "page_last": 2518, "abstract": "We derive a second-order ordinary differential equation (ODE), which is the limit of Nesterov\u2019s accelerated gradient method. This ODE exhibits approximate equivalence to Nesterov\u2019s scheme and thus can serve as a tool for analysis. We show that the continuous-time ODE allows for a better understanding of Nesterov\u2019s scheme. As a byproduct, we obtain a family of schemes with similar convergence rates. The ODE interpretation also suggests restarting Nesterov\u2019s scheme, leading to an algorithm which can be rigorously proven to converge at a linear rate whenever the objective is strongly convex.", "full_text": "A Differential Equation for Modeling Nesterov\u2019s Accelerated Gradient Method: Theory and Insights\n\nWeijie Su (1), Stephen Boyd (2), Emmanuel J. Candès (1,3)\n\n(1) Department of Statistics, Stanford University, Stanford, CA 94305\n(2) Department of Electrical Engineering, Stanford University, Stanford, CA 94305\n(3) Department of Mathematics, Stanford University, Stanford, CA 94305\n{wjsu, boyd, candes}@stanford.edu\n\nAbstract\n\nWe derive a second-order ordinary differential equation (ODE), which is the limit of Nesterov\u2019s accelerated gradient method. This ODE exhibits approximate equivalence to Nesterov\u2019s scheme and thus can serve as a tool for analysis. 
We show that the continuous-time ODE allows for a better understanding of Nesterov's scheme. As a byproduct, we obtain a family of schemes with similar convergence rates. The ODE interpretation also suggests restarting Nesterov's scheme, leading to an algorithm which can be rigorously proven to converge at a linear rate whenever the objective is strongly convex.\n\n1 Introduction\n\nAs data sets and problems are ever increasing in size, accelerating first-order methods is of both practical and theoretical interest. Perhaps the earliest first-order method for minimizing a convex function f is the gradient method, which dates back to Euler and Lagrange. Thirty years ago, in a seminal paper [11], Nesterov proposed an accelerated gradient method, which may take the following form: starting with x_0 and y_0 = x_0, inductively define\n\n    x_k = y_{k-1} - s ∇f(y_{k-1})\n    y_k = x_k + ((k - 1)/(k + 2)) (x_k - x_{k-1}).    (1.1)\n\nFor a fixed step size s = 1/L, where L is the Lipschitz constant of ∇f, this scheme exhibits the convergence rate\n\n    f(x_k) - f* ≤ O(L ||x_0 - x*||^2 / k^2).\n\nAbove, x* is any minimizer of f and f* = f(x*). It is well known that this rate is optimal among all methods having only information about the gradient of f at consecutive iterates [12]. This is in contrast to vanilla gradient descent methods, which can only achieve a rate of O(1/k) [17]. This improvement relies on the introduction of the momentum term x_k - x_{k-1} as well as the particularly tuned coefficient (k - 1)/(k + 2) ≈ 1 - 3/k. 
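As an illustration not in the original text, scheme (1.1) is short enough to run directly. The following minimal NumPy sketch (the quadratic objective, dimension, and iteration count are our own illustrative choices) applies (1.1) with s = 1/L and prints the resulting optimality gap:

```python
import numpy as np

# Hypothetical toy problem: f(x) = x^T A x / 2 - b^T x with A diagonal,
# so L = max eigenvalue = 1 and the minimizer is known in closed form.
A = np.diag(np.linspace(0.1, 1.0, 10))
b = np.ones(10)
L = 1.0
s = 1.0 / L                       # fixed step size s = 1/L

def grad(y):
    return A @ y - b

f = lambda z: 0.5 * z @ A @ z - b @ z
x_star = np.linalg.solve(A, b)    # minimizer of f

# Nesterov's scheme (1.1): x_k from a gradient step at y_{k-1},
# then y_k adds the momentum term with coefficient (k-1)/(k+2).
x = np.zeros(10)
y = x.copy()
for k in range(1, 201):
    x_new = y - s * grad(y)
    y = x_new + (k - 1) / (k + 2) * (x_new - x)
    x = x_new

# The gap is bounded by 2 ||x0 - x*||^2 / (s (k+1)^2); here x0 = 0.
print(f(x) - f(x_star))
```

The final gap can be compared against the theoretical bound 2||x_0 - x*||^2/(s(k+1)^2), which holds since s = 1/L.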
Since the introduction of Nesterov's scheme, there has been much work on the development of first-order accelerated methods; see [12, 13, 14, 1, 2] for example, and [19] for a unified analysis of these ideas.\n\nIn a different direction, there is a long history relating ordinary differential equations (ODEs) to optimization; see [6, 4, 8, 18] for references. The connection between ODEs and numerical optimization is often established by taking step sizes to be very small so that the trajectory or solution path converges to a curve modeled by an ODE. The conciseness and well-established theory of ODEs provide deeper insights into optimization, which has led to many interesting findings [5, 7, 16].\n\nIn this work, we derive a second-order ordinary differential equation, which is the exact limit of Nesterov's scheme obtained by taking small step sizes in (1.1). This ODE reads\n\n    X'' + (3/t) X' + ∇f(X) = 0    (1.2)\n\nfor t > 0, with initial conditions X(0) = x_0, X'(0) = 0; here, x_0 is the starting point in Nesterov's scheme, X' denotes the time derivative or velocity dX/dt, and similarly X'' = d^2X/dt^2 denotes the acceleration. The time parameter in this ODE is related to the step size in (1.1) via t ≈ k√s. Case studies are provided to demonstrate that the homogeneous and conceptually simpler ODE can serve as a tool for analyzing and generalizing Nesterov's scheme. To the best of our knowledge, this work is the first to model Nesterov's scheme or its variants by ODEs.\n\nWe denote by F_L the class of convex functions f with L-Lipschitz continuous gradients defined on R^n, i.e., f is convex, continuously differentiable, and obeys\n\n    ||∇f(x) - ∇f(y)|| ≤ L ||x - y||\n\nfor any x, y ∈ R^n, where ||·|| is the standard Euclidean norm and L > 0 is the Lipschitz constant throughout this paper. 
Next, S_μ denotes the class of μ-strongly convex functions f on R^n with continuous gradients, i.e., f is continuously differentiable and f(x) - μ||x||^2/2 is convex. Last, we set S_{μ,L} = F_L ∩ S_μ.\n\n2 Derivation of the ODE\n\nAssume f ∈ F_L for L > 0. Combining the two equations of (1.1) and applying a rescaling give\n\n    (x_{k+1} - x_k)/√s = ((k - 1)/(k + 2)) (x_k - x_{k-1})/√s - √s ∇f(y_k).    (2.1)\n\nIntroduce the ansatz x_k ≈ X(k√s) for some smooth curve X(t) defined for t ≥ 0. For fixed t, as the step size s goes to zero, X(t) ≈ x_{t/√s} = x_k and X(t + √s) ≈ x_{(t+√s)/√s} = x_{k+1} with k = t/√s. With these approximations, we get the Taylor expansions\n\n    (x_{k+1} - x_k)/√s = X'(t) + (1/2) X''(t)√s + o(√s)\n    (x_k - x_{k-1})/√s = X'(t) - (1/2) X''(t)√s + o(√s)\n    √s ∇f(y_k) = √s ∇f(X(t)) + o(√s),\n\nwhere in the last equality we use y_k - X(t) = o(1). Thus (2.1) can be written as\n\n    X'(t) + (1/2) X''(t)√s + o(√s) = (1 - 3√s/t)(X'(t) - (1/2) X''(t)√s + o(√s)) - √s ∇f(X(t)) + o(√s).    (2.2)\n\nBy comparing the coefficients of √s in (2.2), we obtain\n\n    X'' + (3/t) X' + ∇f(X) = 0\n\nfor t > 0. The first initial condition is X(0) = x_0. Taking k = 1 in (2.1) yields (x_2 - x_1)/√s = -√s ∇f(y_1) = o(1). Hence, the second initial condition is simply X'(0) = 0 (vanishing initial velocity). In the formulation of [1] (see also [20]), the momentum coefficient (k - 1)/(k + 2) is replaced by θ_k(θ_{k-1}^{-1} - 1), where the θ_k are iteratively defined as\n\n    θ_{k+1} = (√(θ_k^4 + 4θ_k^2) - θ_k^2)/2    (2.3)\n\nstarting from θ_0 = 1. 
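The recursion above is easy to iterate numerically; this small sketch (ours, in plain floating point) prints the induced momentum coefficient θ_k(θ_{k-1}^{-1} - 1) next to 1 - 3/k:

```python
import math

# Iterate theta_{k+1} = (sqrt(theta_k^4 + 4 theta_k^2) - theta_k^2) / 2
# from theta_0 = 1, and compare the resulting momentum coefficient
# theta_k (1/theta_{k-1} - 1) with 1 - 3/k.
theta = [1.0]
for _ in range(1000):
    t = theta[-1]
    theta.append((math.sqrt(t ** 4 + 4 * t ** 2) - t ** 2) / 2)

for k in (10, 100, 1000):
    coef = theta[k] * (1 / theta[k - 1] - 1)
    print(k, coef, 1 - 3 / k)
```

The two columns agree to higher and higher order as k grows, consistent with the asymptotics discussed in the text.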
A bit of analysis reveals that θ_k(θ_{k-1}^{-1} - 1) asymptotically equals 1 - 3/k + O(1/k^2), thus leading to the same ODE as (1.1).\n\nClassical results in ODE theory do not directly imply the existence or uniqueness of the solution to this ODE because the coefficient 3/t is singular at t = 0. In addition, ∇f is typically not analytic at x_0, which leads to the inapplicability of the power series method for studying singular ODEs. Nevertheless, the ODE is well posed: the strategy we employ for showing this constructs a series of ODEs approximating (1.2) and then chooses a convergent subsequence by some compactness arguments such as the Arzelà-Ascoli theorem. A proof of this theorem can be found in the supplementary material for this paper.\n\nTheorem 2.1. For any f ∈ F_∞ := ∪_{L>0} F_L and any x_0 ∈ R^n, the ODE (1.2) with initial conditions X(0) = x_0, X'(0) = 0 has a unique global solution X ∈ C^2((0, ∞); R^n) ∩ C^1([0, ∞); R^n).\n\n3 Equivalence between the ODE and Nesterov's scheme\n\nWe study the stable step size allowed for numerically solving the ODE in the presence of accumulated errors. The finite difference approximation of (1.2) by the forward Euler method is\n\n    (X(t + Δt) - 2X(t) + X(t - Δt))/Δt^2 + (3/t)(X(t) - X(t - Δt))/Δt + ∇f(X(t)) = 0,    (3.1)\n\nwhich is equivalent to\n\n    X(t + Δt) = (2 - 3Δt/t) X(t) - Δt^2 ∇f(X(t)) - (1 - 3Δt/t) X(t - Δt).\n\nAssuming that f is sufficiently smooth, for small perturbations δx, ∇f(x + δx) ≈ ∇f(x) + ∇^2 f(x) δx, where ∇^2 f(x) is the Hessian of f evaluated at x. 
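As a sanity check not in the paper, the forward Euler update above can be iterated directly; the sketch below (one-dimensional f(x) = x^2/2 and the value of Δt are our own choices) drives X toward the minimizer 0:

```python
# Forward Euler update equivalent to (3.1):
# X(t+dt) = (2 - 3 dt/t) X(t) - dt^2 grad_f(X(t)) - (1 - 3 dt/t) X(t-dt),
# applied to f(x) = x^2/2, whose gradient is x and whose minimizer is 0.
def grad(x):
    return x

dt = 0.05                 # time step; stable since dt < 2/sqrt(L) with L = 1
X_prev, X = 1.0, 1.0      # X(0) = x0 = 1 with (approximately) zero velocity
for k in range(1, 2001):
    t = k * dt
    X_next = (2 - 3 * dt / t) * X - dt ** 2 * grad(X) - (1 - 3 * dt / t) * X_prev
    X_prev, X = X, X_next

print(X)                  # a decaying oscillation around the minimizer 0
```

Choosing dt above 2/√L instead makes the same loop blow up, in line with the stability analysis that follows.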
Identifying k = t/Δt, the characteristic equation of this finite difference scheme is approximately\n\n    det(λ^2 - (2 - Δt^2 ∇^2 f - 3Δt/t) λ + 1 - 3Δt/t) = 0.    (3.2)\n\nThe numerical stability of (3.1) with respect to accumulated errors is equivalent to this: all the roots of (3.2) lie in the unit circle [9]. When ∇^2 f ⪯ L I_n (i.e., L I_n - ∇^2 f is positive semidefinite), if Δt/t is small and Δt < 2/√L, we see that all the roots of (3.2) lie in the unit circle. On the other hand, if Δt > 2/√L, (3.2) can possibly have a root λ outside the unit circle, causing numerical instability. Under our identification s = Δt^2, a step size of s = 1/L in Nesterov's scheme (1.1) is approximately equivalent to a step size of Δt = 1/√L in the forward Euler method, which is stable for numerically integrating (3.1).\n\nAs a comparison, note that the corresponding ODE for gradient descent, with updates x_{k+1} = x_k - s ∇f(x_k), is\n\n    X'(t) + ∇f(X(t)) = 0,\n\nwhose finite difference scheme has the characteristic equation det(λ - (1 - Δt ∇^2 f)) = 0. Thus, to guarantee -I_n ⪯ I_n - Δt ∇^2 f ⪯ I_n in a worst-case analysis, one can only choose Δt ≤ 2/L for a fixed step size, which is much smaller than the step size 2/√L for (3.1) when ∇f is very variable, i.e., L is large.\n\nNext, we exhibit approximate equivalence between the ODE and Nesterov's scheme in terms of convergence rates. We first recall the original result from [11].\n\nTheorem 3.1 (Nesterov). 
For any f ∈ F_L, the sequence {x_k} in (1.1) with step size s ≤ 1/L obeys\n\n    f(x_k) - f* ≤ 2||x_0 - x*||^2/(s(k + 1)^2).\n\nOur first result indicates that the trajectory of the ODE (1.2) closely resembles the sequence {x_k} in terms of the convergence rate to a minimizer x*.\n\nTheorem 3.2. For any f ∈ F_∞, let X(t) be the unique global solution to (1.2) with initial conditions X(0) = x_0, X'(0) = 0. For any t > 0,\n\n    f(X(t)) - f* ≤ 2||x_0 - x*||^2/t^2.\n\nProof of Theorem 3.2. Consider the energy functional defined as\n\n    E(t) := t^2 (f(X(t)) - f*) + 2||X + (t/2) X' - x*||^2,\n\nwhose time derivative is\n\n    E'(t) = 2t (f(X) - f*) + t^2 ⟨∇f, X'⟩ + 4⟨X + (t/2) X' - x*, (3/2) X' + (t/2) X''⟩.    (3.3)\n\nSubstituting (3/2) X' + (t/2) X'' with -(t/2) ∇f(X), (3.3) gives\n\n    E'(t) = 2t (f(X) - f*) + 4⟨X - x*, -(t/2) ∇f(X)⟩ = 2t (f(X) - f*) - 2t ⟨X - x*, ∇f(X)⟩ ≤ 0,\n\nwhere the inequality follows from the convexity of f. Hence, by the monotonicity of E and the non-negativity of 2||X + (t/2) X' - x*||^2, the gap obeys f(X(t)) - f* ≤ E(t)/t^2 ≤ E(0)/t^2 = 2||x_0 - x*||^2/t^2.\n\n4 A family of generalized Nesterov's schemes\n\nIn this section we show how to exploit the power of the ODE for deriving variants of Nesterov's scheme. 
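The Lyapunov-style argument above can also be observed numerically. This sketch (ours; a crude explicit Euler integration of (1.2) for f(x) = ||x||^2/2, started just after the singular time t = 0) tracks the energy E(t) along the trajectory:

```python
import numpy as np

# Integrate X'' + (3/t) X' + grad_f(X) = 0 for f(x) = ||x||^2/2 (so x* = 0,
# f* = 0) and record E(t) = t^2 (f(X) - f*) + 2 ||X + (t/2) X' - x*||^2.
def grad(x):
    return x

X = np.array([1.0, -2.0])     # x0
V = np.zeros(2)               # X'(0) = 0
t, dt = 1e-3, 1e-3            # start slightly after the singularity at t = 0
energies = []
while t < 10.0:
    acc = -3.0 / t * V - grad(X)          # acceleration from the ODE
    X = X + dt * V
    V = V + dt * acc
    t += dt
    E = t ** 2 * 0.5 * (X @ X) + 2 * np.linalg.norm(X + (t / 2) * V) ** 2
    energies.append(E)

print(energies[0], energies[-1])          # E decreases along the trajectory
```

Up to discretization error, the recorded energy is non-increasing, and the final gap respects the 2||x_0 - x*||^2/t^2 bound.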
One would be interested in studying the ODE (1.2) with the number 3 appearing in the coefficient of X'/t replaced by a general constant r, as in\n\n    X'' + (r/t) X' + ∇f(X) = 0,    X(0) = x_0, X'(0) = 0.    (4.1)\n\nUsing arguments similar to those in the proof of Theorem 2.1, this new ODE is guaranteed to admit a unique global solution for any f ∈ F_∞.\n\n4.1 Continuous optimization\n\nTo begin with, we consider a modified energy functional defined as\n\n    E(t) = (2t^2/(r - 1)) (f(X(t)) - f*) + (r - 1) ||X(t) + (t/(r - 1)) X'(t) - x*||^2.\n\nSince r X' + t X'' = -t ∇f(X), the time derivative E'(t) is equal to\n\n    E'(t) = (4t/(r - 1)) (f(X) - f*) + (2t^2/(r - 1)) ⟨∇f, X'⟩ + 2⟨X + (t/(r - 1)) X' - x*, r X' + t X''⟩\n          = (4t/(r - 1)) (f(X) - f*) - 2t ⟨X - x*, ∇f(X)⟩.    (4.2)\n\nA consequence of (4.2) is this:\n\nTheorem 4.1. Suppose r > 3 and let X be the unique solution to (4.1) for some f ∈ F_∞. Then X obeys\n\n    f(X(t)) - f* ≤ (r - 1)^2 ||x_0 - x*||^2/(2t^2)\n\nand\n\n    ∫_0^∞ t (f(X(t)) - f*) dt ≤ (r - 1)^2 ||x_0 - x*||^2/(2(r - 3)).\n\nProof of Theorem 4.1. By (4.2), the derivative dE/dt equals\n\n    2t (f(X) - f*) - 2t ⟨X - x*, ∇f(X)⟩ - (2(r - 3)t/(r - 1)) (f(X) - f*) ≤ -(2(r - 3)t/(r - 1)) (f(X) - f*),    (4.3)\n\nwhere the inequality follows from the convexity of f. Since f(X) ≥ f*, (4.3) implies that E is non-increasing. Hence\n\n    (2t^2/(r - 1)) (f(X(t)) - f*) ≤ E(t) ≤ E(0) = (r - 1) ||x_0 - x*||^2,\n\nyielding the first inequality of the theorem as desired. 
To complete the proof, by (4.2) it follows that\n\n    ∫_0^∞ (2(r - 3)t/(r - 1)) (f(X) - f*) dt ≤ -∫_0^∞ (dE/dt) dt = E(0) - E(∞) ≤ (r - 1) ||x_0 - x*||^2,\n\nas desired for establishing the second inequality.\n\nWe now demonstrate faster convergence rates under the assumption of strong convexity. Given a strongly convex function f, consider a new energy functional defined as\n\n    Ẽ(t) = t^3 (f(X(t)) - f*) + ((2r - 3)^2 t/8) ||X(t) + (2t/(2r - 3)) X'(t) - x*||^2.\n\nAs in Theorem 4.1, a more refined study of the derivative of Ẽ(t) gives\n\nTheorem 4.2. For any f ∈ S_{μ,L}(R^n), the unique solution X to (4.1) with r ≥ 9/2 obeys\n\n    f(X(t)) - f* ≤ C r^{5/2} ||x_0 - x*||^2/(t^3 √μ)\n\nfor any t > 0 and a universal constant C > 1/2.\n\nThe restriction r ≥ 9/2 is an artifact required in the proof. We believe that this theorem should be valid as long as r ≥ 3. For example, the solution to (4.1) with f(x) = ||x||^2/2 is\n\n    X(t) = (2^{(r-1)/2} Γ((r + 1)/2) J_{(r-1)/2}(t)/t^{(r-1)/2}) x_0,    (4.4)\n\nwhere J_{(r-1)/2}(·) is the first-kind Bessel function of order (r - 1)/2. For large t, this Bessel function obeys J_{(r-1)/2}(t) = √(2/(πt)) (cos(t - (r - 1)π/4 - π/4) + O(1/t)). Hence,\n\n    f(X(t)) - f* ≲ ||x_0 - x*||^2/t^r,\n\nin which the inequality fails if 1/t^r is replaced by any higher-order rate. For general strongly convex functions, such refinement, if possible, might require the construction of a more sophisticated energy functional and careful analysis. We leave this problem for future research.\n\n4.2 Composite optimization\n\nInspired by Theorem 4.2, it is tempting to obtain such analogies for the discrete Nesterov's scheme as well. Following the formulation of [1], we consider the composite minimization\n\n    minimize_{x ∈ R^n} f(x) = g(x) + h(x),\n\nwhere g ∈ F_L for some L > 0 and h is convex on R^n with possible extended value ∞. Define the proximal subgradient\n\n    G_s(x) := (x - argmin_z (||z - (x - s ∇g(x))||^2/(2s) + h(z)))/s.\n\nParametrizing by a constant r, we propose a generalized Nesterov's scheme,\n\n    x_k = y_{k-1} - s G_s(y_{k-1})\n    y_k = x_k + ((k - 1)/(k + r - 1)) (x_k - x_{k-1}),    (4.5)\n\nstarting from y_0 = x_0. The discrete analog of Theorem 4.1 is given below; its proof is deferred to the supplementary materials.\n\nTheorem 4.3. The sequence {x_k} given by (4.5) with r > 3 and 0 < s ≤ 1/L obeys\n\n    f(x_k) - f* ≤ (r - 1)^2 ||x_0 - x*||^2/(2s(k + r - 2)^2)\n\nand\n\n    ∑_{k=1}^∞ (k + r - 1)(f(x_k) - f*) ≤ (r - 1)^2 ||x_0 - x*||^2/(2s(r - 3)).\n\nThe idea behind the proof is the same as that employed for Theorem 4.1; here, however, the energy functional is defined as\n\n    E(k) = 2s(k + r - 2)^2 (f(x_k) - f*)/(r - 1) + ||(k + r - 1) y_k - k x_k - (r - 1) x*||^2/(r - 1).\n\nThe first inequality in Theorem 4.3 suggests that the generalized Nesterov's scheme still achieves the O(1/k^2) convergence rate. However, if the error bound satisfies\n\n    f(x_{k'}) - f* ≥ c/k'^2\n\nfor some c > 0 and a dense subsequence {k'}, i.e., |{k'} ∩ {1, ..., m}| ≥ αm for any positive integer m and some α > 0, then the second inequality of the theorem is violated. 
Hence, the second inequality is not trivial: it implies that the error f(x_k) - f* is in some sense o(1/k^2), since no lower bound of the form c/k'^2 can hold along a dense subsequence of iterations.\n\nIn closing, we would like to point out that this new scheme is equivalent to setting θ_k = (r - 1)/(k + r - 1) and letting θ_k(θ_{k-1}^{-1} - 1) replace the momentum coefficient (k - 1)/(k + r - 1). Then, the equal sign “=” in (2.3) has to be replaced by “≥”. In examining the proof of Theorem 1(b) in [20], we can get an alternative proof of Theorem 4.3 by allowing (2.3), which appears in Eq. (36) in [20], to be an inequality.\n\n5 Accelerating to linear convergence by restarting\n\nAlthough an O(1/k^3) convergence rate is guaranteed for generalized Nesterov's schemes (4.5), the example (4.4) provides evidence that O(1/poly(k)) is the best rate achievable under strong convexity. In contrast, the vanilla gradient method achieves linear convergence O((1 - μ/L)^k), and [12] proposed a first-order method with a convergence rate of O((1 - √(μ/L))^k), which, however, requires knowledge of the condition number μ/L. While it is relatively easy to bound the Lipschitz constant L by the use of backtracking [3, 19], estimating the strong convexity parameter μ, if not impossible, is very challenging. Among the many approaches to gaining acceleration via adaptively estimating μ/L, [15] proposes a restarting procedure for Nesterov's scheme in which (1.1) is restarted with x_0 = y_0 := x_k whenever ∇f(y_k)^T (x_{k+1} - x_k) > 0. In the language of ODEs, this gradient-based restarting essentially keeps ⟨∇f, X'⟩ negative along the trajectory. Although it has been empirically observed that this method significantly boosts convergence, there is no general theory characterizing the convergence rate.\n\nIn this section, we propose a new restarting scheme we call the speed restarting scheme. 
The underlying motivation is to maintain a relatively high velocity X' along the trajectory. Throughout this section we assume f ∈ S_{μ,L} for some 0 < μ ≤ L.\n\nDefinition 5.1. For the ODE (1.2) with X(0) = x_0, X'(0) = 0, let\n\n    T = T(f, x_0) = sup{t > 0 : ∀ u ∈ (0, t), d||X'(u)||^2/du > 0}\n\nbe the speed restarting time.\n\nIn words, T is the first time the velocity ||X'|| decreases. The definition itself does not imply that 0 < T < ∞, which is proven in the supplementary materials. Indeed, f(X(t)) is a decreasing function before time T; for t ≤ T,\n\n    df(X(t))/dt = ⟨∇f(X), X'⟩ = -(3/t) ||X'||^2 - (1/2) d||X'||^2/dt ≤ 0.\n\nThe speed restarted ODE is thus\n\n    X''(t) + (3/t_sr) X'(t) + ∇f(X(t)) = 0,    (5.1)\n\nwhere t_sr is set to zero whenever ⟨X', X''⟩ = 0 and, between two consecutive restarts, t_sr grows just as t. That is, t_sr = t - τ, where τ is the latest restart time. In particular, t_sr = 0 at t = 0. The theorem below guarantees linear convergence of the solution to (5.1). This is a new result in the literature [15, 10].\n\nTheorem 5.2. There exist positive constants c_1 and c_2, which only depend on the condition number L/μ, such that for any f ∈ S_{μ,L}, we have\n\n    f(X^sr(t)) - f(x*) ≤ (c_1 L ||x_0 - x*||^2/2) e^{-c_2 t √L}.\n\n5.1 Numerical examples\n\nBelow we present a discrete analog to the restarted scheme. There, k_min is introduced to avoid having consecutive restarts that are too close. To compare the performance of the restarted scheme with the original (1.1), we conduct four simulation studies, including both smooth and non-smooth objective functions. 
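As a runnable companion to the discrete restarted scheme (the problem data, iteration budget, and k_min below are our own illustrative choices), this sketch contrasts plain Nesterov iterations with a speed restart that resets the momentum counter whenever ||x_k - x_{k-1}|| < ||x_{k-1} - x_{k-2}||:

```python
import numpy as np

# Smooth strongly convex toy problem: f(x) = x^T D x / 2 - b^T x with D
# diagonal (eigenvalues 0.01 to 1, condition number 100), so L = 1.
d = 50
diag = np.linspace(0.01, 1.0, d)
b = np.ones(d)
s = 1.0                                    # step size 1/L

def grad(x):
    return diag * x - b

f = lambda x: 0.5 * x @ (diag * x) - b @ x
x_star = b / diag

def run(restart, iters=2000, k_min=10):
    x = np.zeros(d); y = x.copy(); x_prev = x.copy()
    j = 1                                  # momentum counter
    for _ in range(iters):
        x_new = y - s * grad(y)
        y = x_new + (j - 1) / (j + 2) * (x_new - x)
        if (restart and j >= k_min and
                np.linalg.norm(x_new - x) < np.linalg.norm(x - x_prev)):
            j = 1                          # speed restart: kill the momentum
        else:
            j += 1
        x_prev, x = x, x_new
    return f(x) - f(x_star)

print(run(restart=False), run(restart=True))
```

On this toy problem the restarted run reaches a far smaller optimality gap, consistent with the linear-convergence behavior discussed in this section; the gradient restart of [15] would replace the norm test with the sign of ∇f(y)^T (x_new - x).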
Note that the computational costs of the restarted and non-restarted schemes are the same.\n\nAlgorithm 1 Speed Restarting Nesterov's Scheme\n\n    input: x_0 ∈ R^n, y_0 = x_0, x_{-1} = x_0, 0 < s ≤ 1/L, k_max ∈ N+ and k_min ∈ N+\n    j ← 1\n    for k = 1 to k_max do\n        x_k ← argmin_x ((1/(2s)) ||x - y_{k-1} + s ∇g(y_{k-1})||^2 + h(x))\n        y_k ← x_k + ((j - 1)/(j + 2)) (x_k - x_{k-1})\n        if ||x_k - x_{k-1}|| < ||x_{k-1} - x_{k-2}|| and j ≥ k_min then\n            j ← 1\n        else\n            j ← j + 1\n        end if\n    end for\n\nQuadratic. f(x) = (1/2) x^T Ax + b^T x is a strongly convex function, in which A is a 500 × 500 random positive definite matrix and b a random vector. The eigenvalues of A are between 0.001 and 1. The vector b is generated as i.i.d. Gaussian random variables with mean 0 and variance 25.\n\nLog-sum-exp.\n\n    f(x) = ρ log[∑_{i=1}^m exp((a_i^T x - b_i)/ρ)],\n\nwhere n = 50, m = 200, ρ = 20. The matrix A = {a_ij} is a random matrix with i.i.d. standard Gaussian entries, and b = {b_i} has i.i.d. Gaussian entries with mean 0 and variance 2. This function is not strongly convex.\n\nMatrix completion. f(X) = (1/2) ||X_obs - M_obs||_F^2 + λ||X||_*, in which the ground truth M is a rank-5 random matrix of size 300 × 300. The regularization parameter is set to λ = 0.05. The 5 singular values of M are 1, ..., 5. The observed set is independently sampled among the 300 × 300 entries so that 10% of the entries are actually observed.\n\nLasso in ℓ1-constrained form with large sparse design. f = (1/2) ||Ax - b||^2 s.t. ||x||_1 ≤ δ, where A is a 5000 × 50000 random sparse matrix with nonzero probability 0.5% for each entry and b is generated as b = Ax_0 + z. The nonzero entries of A independently follow the Gaussian distribution with mean 0 and variance 1/25. The signal x_0 is a vector with 250 nonzeros and z is i.i.d. standard Gaussian noise. 
The parameter δ is set to ||x_0||_1.\n\nIn these examples, k_min is set to be 10 and the step sizes are fixed to be 1/L. If the objective is in composite form, the Lipschitz bound applies to the smooth part. Figures 1(a), 1(b), 1(c) and 1(d) present the performance of the speed restarting scheme, the gradient restarting scheme proposed in [15], the original Nesterov's scheme and the proximal gradient method. The objective functions include strongly convex, non-strongly convex and non-smooth functions, violating the assumptions in Theorem 5.2. Among all the examples, it is interesting to note that both restarting schemes empirically exhibit linear convergence by significantly reducing bumps in the objective values. This leaves us with the open problem of whether there exists a provable linear convergence rate for the gradient restarting scheme as in Theorem 5.2. It is also worth pointing out that, compared with gradient restarting, the speed restarting scheme empirically exhibits a more stable linear convergence rate.\n\n6 Discussion\n\nThis paper introduces a second-order ODE and accompanying tools for characterizing Nesterov's accelerated gradient method. This ODE is applied to study variants of Nesterov's scheme. 
Our approach suggests (1) a large family of generalized Nesterov's schemes that are all guaranteed to converge at the rate 1/k^2, and (2) a restarted scheme provably achieving a linear convergence rate whenever f is strongly convex.\n\n[Figure 1 here: log-scale plots of f - f* versus iterations for the four test problems, (a) min (1/2) x^T Ax + b^T x; (b) min ρ log(∑_{i=1}^m e^{(a_i^T x - b_i)/ρ}); (c) min (1/2) ||X_obs - M_obs||_F^2 + λ||X||_*; (d) min (1/2) ||Ax - b||^2 s.t. ||x||_1 ≤ C.]\n\nFigure 1: Numerical performance of speed restarting (srN), gradient restarting (grN) proposed in [15], the original Nesterov's scheme (oN) and the proximal gradient (PG)\n\nIn this paper, we often utilize ideas from continuous-time ODEs, and then apply these ideas to discrete schemes. The translation, however, involves parameter tuning and tedious calculations. This is the reason why a general theory mapping properties of ODEs into corresponding properties for discrete updates would be a welcome advance. 
Indeed, this would allow researchers to only study the simpler and more user-friendly ODEs.\n\n7 Acknowledgements\n\nWe would like to thank Carlos Sing-Long and Zhou Fan for helpful discussions about parts of this paper, and anonymous reviewers for their insightful comments and suggestions.\n\nReferences\n\n[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.\n\n[2] S. Becker, J. Bobin, and E. J. Candès. NESTA: a fast and accurate first-order method for sparse recovery. SIAM Journal on Imaging Sciences, 4(1):1-39, 2011.\n\n[3] S. Becker, E. J. Candès, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165-218, 2011.\n\n[4] A. Bloch (Editor). Hamiltonian and gradient flows, algorithms, and control, volume 3. American Mathematical Society, 1994.\n\n[5] F. H. Branin. Widely convergent method for finding multiple solutions of simultaneous nonlinear equations. IBM Journal of Research and Development, 16(5):504-522, 1972.\n\n[6] A. A. Brown and M. C. Bartholomew-Biggs. Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations. Journal of Optimization Theory and Applications, 62(2):211-224, 1989.\n\n[7] R. Hauser and J. Nedic. The continuous Newton-Raphson method can look ahead. SIAM Journal on Optimization, 15(3):915-925, 2005.\n\n[8] U. Helmke and J. Moore. Optimization and dynamical systems. Proceedings of the IEEE, 84(6):907, 1996.\n\n[9] J. J. Leader. Numerical Analysis and Scientific Computation. Pearson Addison Wesley, 2004.\n\n[10] R. Monteiro, C. Ortiz, and B. Svaiter. An adaptive accelerated first-order method for convex optimization, 2012.\n\n[11] Y. Nesterov. 
A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372-376, 1983.\n\n[12] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.\n\n[13] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.\n\n[14] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Papers, 2007.\n\n[15] B. O'Donoghue and E. J. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 2013.\n\n[16] Y.-G. Ou. A nonmonotone ODE-based method for unconstrained optimization. International Journal of Computer Mathematics, (ahead-of-print):1-21, 2014.\n\n[17] R. T. Rockafellar. Convex analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton, NJ, 1997. Reprint of the 1970 original, Princeton Paperbacks.\n\n[18] J. Schropp and I. Singer. A dynamical systems approach to constrained minimization. Numerical Functional Analysis and Optimization, 21(3-4):537-551, 2000.\n\n[19] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.\n\n[20] P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263-295, 2010.\n", "award": [], "sourceid": 1313, "authors": [{"given_name": "Weijie", "family_name": "Su", "institution": "Stanford University"}, {"given_name": "Stephen", "family_name": "Boyd", "institution": "Stanford University"}, {"given_name": "Emmanuel", "family_name": "Candes", "institution": "Stanford University"}]}