{"title": "Integration Methods and Optimization Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1109, "page_last": 1118, "abstract": "We show that accelerated optimization methods can be seen as particular instances of multi-step integration schemes from numerical analysis, applied to the gradient flow equation. Compared with recent advances in this vein, the differential equation considered here is the basic gradient flow, and we derive a class of multi-step schemes which includes accelerated algorithms, using classical conditions from numerical analysis. Multi-step schemes integrate the differential equation using larger step sizes, which intuitively explains the acceleration phenomenon.", "full_text": "Integration Methods and Optimization Algorithms\n\nDamien Scieur\nINRIA, ENS,\n\nPSL Research University,\n\nParis France\n\ndamien.scieur@inria.fr\n\nVincent Roulet\nINRIA, ENS,\n\nPSL Research University,\n\nParis France\n\nvincent.roulet@inria.fr\n\nFrancis Bach\nINRIA, ENS,\n\nPSL Research University,\n\nParis France\n\nfrancis.bach@inria.fr\n\nAlexandre d\u2019Aspremont\n\nCNRS, ENS\n\nPSL Research University,\n\nParis France\n\naspremon@ens.fr\n\nAbstract\n\nWe show that accelerated optimization methods can be seen as particular instances\nof multi-step integration schemes from numerical analysis, applied to the gradient\n\ufb02ow equation. Compared with recent advances in this vein, the differential equation\nconsidered here is the basic gradient \ufb02ow, and we derive a class of multi-step\nschemes which includes accelerated algorithms, using classical conditions from\nnumerical analysis. 
Multi-step schemes integrate the differential equation using\nlarger step sizes, which intuitively explains the acceleration phenomenon.\n\nIntroduction\n\nApplying the gradient descent algorithm to minimize a function f has a simple numerical interpretation\nas the integration of the gradient flow equation, written\n\n$\\dot{x}(t) = -\\nabla f(x(t)), \\quad x(0) = x_0$ (Gradient Flow)\n\nusing Euler\u2019s method. This appears to be a somewhat unique connection between optimization and\nnumerical methods, since these two fields have inherently different goals. On the one hand, numerical\nmethods aim to get a precise discrete approximation of the solution x(t) on a finite time interval. On\nthe other hand, optimization algorithms seek to find the minimizer of a function, which corresponds\nto the infinite time horizon of the gradient flow equation. More sophisticated methods than Euler\u2019s\nwere developed to get better consistency with the continuous time solution but still focus on a\nfinite time horizon [see e.g. S\u00fcli and Mayers, 2003]. Similarly, structural assumptions on f lead to\nmore sophisticated optimization algorithms than the gradient method, such as the mirror gradient\nmethod [see e.g. Ben-Tal and Nemirovski, 2001; Beck and Teboulle, 2003], proximal gradient\nmethod [Nesterov, 2007] or a combination thereof [Duchi et al., 2010; Nesterov, 2015]. Among\nthem, Nesterov\u2019s accelerated gradient algorithm [Nesterov, 1983] is proven to be optimal on the\nclass of smooth convex or strongly convex functions. This latter method was designed with optimal\ncomplexity in mind, but the proof relies on purely algebraic arguments and the key mechanism behind\nacceleration remains elusive, with various interpretations discussed in e.g. 
[Bubeck et al., 2015;\nAllen Zhu and Orecchia, 2017; Lessard et al., 2016].\nAnother recent stream of papers used differential equations to model the acceleration behavior and\noffer another interpretation of Nesterov\u2019s algorithm [Su et al., 2014; Krichene et al., 2015; Wibisono\net al., 2016; Wilson et al., 2016]. However, the differential equation is often quite complex, being\nreverse-engineered from Nesterov\u2019s method itself, thus losing the intuition. Moreover, integration\nmethods for these differential equations are often ignored or are not derived from standard numerical\nintegration schemes.\nHere, we take another approach. Rather than using a complicated differential equation, we use\nadvanced multistep discretization methods on the basic gradient flow equation in (Gradient Flow).\nEnsuring that the methods effectively integrate this equation for infinitesimal step sizes is essential\nfor the continuous time interpretation and leads to a family of integration methods which contains\nvarious well-known optimization algorithms. A full analysis is carried out for linear gradient flows\n(quadratic optimization) and provides compelling explanations for the acceleration phenomenon.\nIn particular, Nesterov\u2019s method can be seen as a stable and consistent gradient flow discretization\nscheme that allows bigger step sizes in integration, leading to faster convergence.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n1 Gradient flow\nWe seek to minimize an L-smooth, \u00b5-strongly convex function defined on $\\mathbb{R}^d$. We discretize the\ngradient flow equation (Gradient Flow), given by the following ordinary differential equation\n\n$\\dot{x}(t) = g(x(t)), \\quad x(0) = x_0$, (ODE)\n\nwhere g comes from a potential \u2212f, meaning g = \u2212\u2207f. Smoothness of f means Lipschitz continuity\nof g, i.e.\n\n$\\|g(x) - g(y)\\| \\le L \\|x - y\\|$, for every $x, y \\in \\mathbb{R}^d$,\n\nwhere $\\|\\cdot\\|$ is the Euclidean norm. This property ensures existence and uniqueness of the solution\nof (ODE) (see [S\u00fcli and Mayers, 2003, Theorem 12.1]). Strong convexity of f also means strong\nmonotonicity of \u2212g, i.e.,\n\n$\\mu \\|x - y\\|^2 \\le -\\langle x - y, g(x) - g(y) \\rangle$, for every $x, y \\in \\mathbb{R}^d$,\n\nand ensures that (ODE) has a unique point $x^*$ such that $g(x^*) = 0$, called the equilibrium. This is the\nminimizer of f and the limit point of the solution, i.e. $x(\\infty) = x^*$. Finally, this assumption allows us\nto control the convergence rate of the potential f and the solution x(t) as follows.\nProposition 1.1 Let f be an L-smooth and \u00b5-strongly convex function and x0 \u2208 dom(f). Writing\n$x^*$ for the minimizer of f, the solution x(t) of (Gradient Flow) satisfies\n\n$f(x(t)) - f(x^*) \\le (f(x_0) - f(x^*)) e^{-2\\mu t}, \\qquad \\|x(t) - x^*\\| \\le \\|x_0 - x^*\\| e^{-\\mu t}$. (1)\n\nA proof of this last result is recalled in the Supplementary Material. We now focus on numerical\nmethods to integrate (ODE).\n\n2 Numerical integration of differential equations\n\n2.1 Discretization schemes\n\nIn general, we do not have access to an explicit solution x(t) of (ODE). We thus use integration\nalgorithms to approximate the curve (t, x(t)) by a grid $(t_k, x_k) \\approx (t_k, x(t_k))$ on a finite interval\n$[0, t_{\\max}]$. For simplicity here, we assume the step size $h_k = t_k - t_{k-1}$ is constant, i.e., $h_k = h$ and\n$t_k = kh$. The goal is then to minimize the approximation error $\\|x_k - x(t_k)\\|$ for $k \\in [0, t_{\\max}/h]$.\nWe first introduce Euler\u2019s method to illustrate this on a basic example.\n\nEuler\u2019s explicit method. Euler\u2019s (explicit) method is one of the oldest and simplest schemes for\nintegrating the curve x(t). 
The idea stems from a Taylor expansion of x(t) which reads\n\n$x(t + h) = x(t) + h \\dot{x}(t) + O(h^2)$.\n\nWhen t = kh, Euler\u2019s method approximates x(t + h) by $x_{k+1}$, neglecting the second order term,\n\n$x_{k+1} = x_k + h g(x_k)$.\n\nIn optimization terms, we recognize the gradient descent algorithm used to minimize f. Approximation\nerrors in an integration method accumulate with iterations, and as Euler\u2019s method uses only the\nlast point to compute the next one, it has only limited control over the accumulated error.\n\nLinear multistep methods. Multi-step methods use a combination of past iterates to improve\nconvergence. Throughout the paper, we focus on linear s-step methods whose recurrence can be\nwritten\n\n$x_{k+s} = -\\sum_{i=0}^{s-1} \\rho_i x_{k+i} + h \\sum_{i=0}^{s} \\sigma_i g(x_{k+i})$, for $k \\ge 0$,\n\nwhere $\\rho_i, \\sigma_i \\in \\mathbb{R}$ are the parameters of the multistep method and h is again the step size. Each new\npoint $x_{k+s}$ is a function of the information given by the s previous points. If $\\sigma_s = 0$, each new point\nis given explicitly by the s previous points and the method is called explicit. Otherwise each new\npoint requires solving an implicit equation and the method is called implicit.\nTo simplify notations we use the shift operator E, which maps $x_k$ to $E x_k = x_{k+1}$. Moreover, if we write\n$g_k = g(x_k)$, then the shift operator also maps $g_k$ to $E g_k = g_{k+1}$. Recall that a univariate polynomial is\ncalled monic if its leading coefficient is equal to 1. We now give the following concise definition of\ns-step linear methods.\nDefinition 2.1 Given an (ODE) defined by g, x0, a step size h and x1, . . . 
, xs\u22121 initial points, a\nlinear s-step method generates a sequence (tk, xk) which satisfies\n\n$\\rho(E) x_k = h \\sigma(E) g_k$, for every $k \\ge 0$, (2)\n\nwhere \u03c1 is a monic polynomial of degree s with coefficients $\\rho_i$, and \u03c3 a polynomial of degree at most s with\ncoefficients $\\sigma_i$.\nA linear s-step method is uniquely defined by the polynomials (\u03c1, \u03c3). The sequence generated by\nthe method then depends on the initial points and the step size. We now recall a few results describing\nthe performance of multistep methods.\n\n2.2 Stability\n\nStability is a key concept for integration methods. First of all, consider two curves x(t) and y(t), both\nsolutions of (ODE), but starting from different points x(0) and y(0). If the function g is Lipschitz-continuous,\nit is possible to show that the distance between x(t) and y(t) is bounded on a finite\ninterval, i.e.\n\n$\\|x(t) - y(t)\\| \\le C \\|x(0) - y(0)\\|$ for all $t \\in [0, t_{\\max}]$,\n\nwhere C may depend exponentially on tmax. We would like to have a similar behavior for our\nsequences xk and yk, approximating x(tk) and y(tk), i.e.\n\n$\\|x_k - y_k\\| \\approx \\|x(t_k) - y(t_k)\\| \\le C \\|x(0) - y(0)\\|$ (3)\n\nwhen h \u2192 0, so k \u2192 \u221e. Two issues quickly arise. First, for a linear s-step method, we need s\nstarting values $x_0, \\dots, x_{s-1}$. Condition (3) will therefore depend on all these starting values and\nnot only x0. Secondly, any discretization scheme introduces at each step an approximation error,\ncalled local error, which accumulates over time. 
We write this error $\\epsilon_{loc}(x_{k+s})$ and define it as\n$\\epsilon_{loc}(x_{k+s}) \\triangleq x_{k+s} - x(t_{k+s})$, where $x_{k+s}$ is computed using the real solution $x(t_k), \\dots, x(t_{k+s-1})$.\nIn other words, the difference between xk and yk can be described as follows\n\n$\\|x_k - y_k\\| \\le$ Error in the initial condition + Accumulation of local errors, for all $k \\in [0, t_{\\max}/h]$.\n\nWe now write a complete definition of stability, inspired by Definition 6.3.1 from Gautschi [2011].\nDefinition 2.2 (Stability) A linear multistep method is stable iff, for two sequences xk, yk generated\nby (\u03c1, \u03c3) using a sufficiently small step size h > 0, from the starting values $x_0, \\dots, x_{s-1}$, and\n$y_0, \\dots, y_{s-1}$, we have\n\n$\\|x_k - y_k\\| \\le C \\Big( \\max_{i \\in \\{0, \\dots, s-1\\}} \\|x_i - y_i\\| + \\sum_{i=1}^{t_{\\max}/h} \\|\\epsilon_{loc}(x_{i+s})\\| + \\|\\epsilon_{loc}(y_{i+s})\\| \\Big)$, (4)\n\nfor any $k \\in [0, t_{\\max}/h]$. Here, the constant C may depend on tmax but is independent of h.\nWhen h tends to zero, we may recover equation (3) only if the accumulated local error also tends to\nzero. We thus need\n\n$\\lim_{h \\to 0} \\frac{1}{h} \\|\\epsilon_{loc}(x_{i+s})\\| = 0$ for all $i \\in [0, t_{\\max}/h]$.\n\nThis condition is called consistency. The following proposition shows there exist simple conditions to\ncheck consistency, which rely on comparing a Taylor expansion of the solution with the coefficients\nof the method. 
Its proof and further details are given in the Supplementary Material.\n\nProposition 2.3 (Consistency) A linear multistep method defined by polynomials (\u03c1, \u03c3) is consistent\nif and only if\n\n$\\rho(1) = 0$ and $\\rho'(1) = \\sigma(1)$. (5)\n\nAssuming consistency, we still need to control sensitivity to initial conditions, written\n\n$\\|x_k - y_k\\| \\le C \\max_{i \\in \\{0, \\dots, s-1\\}} \\|x_i - y_i\\|$. (6)\n\nInterestingly, analyzing the special case where g = 0 is completely equivalent to the general case\nand this condition is therefore called zero-stability. This reduces to standard linear algebra results as\nwe only need to look at the solution of the homogeneous difference equation $\\rho(E) x_k = 0$. This is\ncaptured in the following theorem whose technical proof can be found in [Gautschi, 2011, Theorem\n6.3.4].\n\nTheorem 2.4 (Root condition) Consider a linear multistep method (\u03c1, \u03c3). The method is zero-stable\nif and only if all roots of \u03c1(z) are in the unit disk, and the roots on the unit circle are simple.\n\n2.3 Convergence of the global error\n\nNumerical analysis focuses on integrating an ODE on a finite interval of time $[0, t_{\\max}]$. It studies the\nbehavior of the global error defined by $x(t_k) - x_k$, as a function of the step size h. If the global error\nconverges to 0 with the step size, the method is guaranteed to approximate correctly the ODE on the\ntime interval, for h small enough.\nWe now state Dahlquist\u2019s equivalence theorem, which shows that the global error converges to zero\nwhen h does if the method is stable, i.e. when the method is consistent and zero-stable. This naturally\nneeds the additional assumption that the starting values $x_0, \\dots, x_{s-1}$ are computed such that they\nconverge to the solution $(x(0), \\dots, x(t_{s-1}))$. 
The proof of the theorem can be found in Gautschi\n[2011].\n\nTheorem 2.5 (Dahlquist\u2019s equivalence theorem) Given an (ODE) defined by g and x0 and a consistent\nlinear multistep method (\u03c1, \u03c3), whose starting values are computed such that $\\lim_{h \\to 0} x_i = x(t_i)$\nfor any $i \\in \\{0, \\dots, s-1\\}$, zero-stability is necessary and sufficient for convergence, i.e. to\nensure $x(t_k) - x_k \\to 0$ for any k when the step size h goes to zero.\n\n2.4 Region of absolute stability\n\nThe results above ensure stability and global error bounds on finite time intervals. Solving optimization\nproblems however requires looking at infinite time horizons. We start by finding conditions\nensuring that the numerical solution does not diverge when the time interval increases, i.e. that the\nnumerical solution is stable with a constant C which does not depend on tmax. Formally, for a fixed\nstep size h, we want to ensure\n\n$\\|x_k\\| \\le C \\max_{i \\in \\{0, \\dots, s-1\\}} \\|x_i\\|$ for all $k \\in [0, t_{\\max}/h]$ and $t_{\\max} > 0$. (7)\n\nThis is not possible without further assumptions on the function g as in the general case the solution\nx(t) itself may diverge. We begin with the simple scalar linear case which, given \u03bb > 0, reads\n\n$\\dot{x}(t) = -\\lambda x(t), \\quad x(0) = x_0$. (Scalar Linear ODE)\n\nThe recurrence of a linear multistep method with parameters (\u03c1, \u03c3) applied to (Scalar Linear ODE)\nthen reads\n\n$\\rho(E) x_k = -\\lambda h \\sigma(E) x_k \\iff [\\rho + \\lambda h \\sigma](E) x_k = 0$,\n\nwhere we recognize a homogeneous recurrence equation. Condition (7) is then controlled by the\nstep size h and the constant \u03bb, ensuring that this homogeneous recurrence equation produces bounded\nsolutions. 
This leads us to the definition of the region of absolute stability, also called A-stability.\n\nDefinition 2.6 (Absolute stability) The region of absolute stability of a linear multistep method\ndefined by (\u03c1, \u03c3) is the set of values \u03bbh such that the homogeneous recurrence equation $\\pi_{\\lambda h}(E) x_k = 0$, with characteristic polynomial\n\n$\\pi_{\\lambda h}(z) \\triangleq \\rho(z) + \\lambda h \\sigma(z)$, (8)\n\nproduces bounded solutions.\n\nStandard linear algebra links this condition to the roots of the characteristic polynomial as recalled in\nthe next proposition (see e.g. Lemma 12.1 of S\u00fcli and Mayers [2003]).\n\nProposition 2.7 Let \u03c0 be a polynomial and write xk a solution of the homogeneous recurrence\nequation $\\pi(E) x_k = 0$ with arbitrary initial values. If all roots of \u03c0 are inside the unit disk and the\nones on the unit circle have a multiplicity exactly equal to one, then the sequence xk is bounded, i.e. $\\sup_k \\|x_k\\| < \\infty$.\nAbsolute stability of a linear multistep method determines its ability to integrate a linear ODE defined\nby\n\n$\\dot{x}(t) = -A x(t), \\quad x(0) = x_0$, (Linear ODE)\n\nwhere A is a symmetric positive definite matrix whose eigenvalues belong to [\u00b5, L] for 0 < \u00b5 \u2264 L. In this\ncase the step size h must indeed be chosen such that for any \u03bb \u2208 [\u00b5, L], \u03bbh belongs to the region of\nabsolute stability of the method. This (Linear ODE) is a special instance of (Gradient Flow) where f\nis a quadratic function. Therefore absolute stability gives a necessary (but not sufficient) condition to\nintegrate (Gradient Flow) on L-smooth, \u00b5-strongly convex functions.\n\n2.5 Convergence analysis in the linear case\n\nBy construction, absolute stability also gives hints on the convergence of xk to the equilibrium in the\nlinear case. 
More precisely, it allows us to control the rate of convergence of xk, approximating the\nsolution x(t) of (Linear ODE), as shown in the following proposition whose proof can be found in\nthe Supplementary Material.\n\nProposition 2.8 Given a (Linear ODE) defined by x0 and a symmetric positive definite matrix A whose\neigenvalues belong to [\u00b5, L] with 0 < \u00b5 \u2264 L, using a linear multistep method defined by (\u03c1, \u03c3) and\napplying a fixed step size h, we define $r_{\\max}$ as\n\n$r_{\\max} = \\max_{\\lambda \\in [\\mu, L]} \\max_{r \\in \\mathrm{roots}(\\pi_{\\lambda h}(z))} |r|$,\n\nwhere $\\pi_{\\lambda h}$ is defined in (8). If $r_{\\max} < 1$ and its multiplicity is equal to m, then the speed of\nconvergence of the sequence xk produced by the linear multistep method to the equilibrium $x^*$ of the\ndifferential equation is given by\n\n$\\|x_k - x^*\\| = O(k^{m-1} r_{\\max}^k)$. (9)\n\nWe can now use these properties to analyze and design multistep methods.\n\n3 Analysis and design of multi-step methods\n\nAs shown previously, we want to integrate (Gradient Flow) and Proposition 1.1 gives a rate of\nconvergence in the continuous case. If the method tracks x(t) with sufficient accuracy, then the rate\nof the method will be close to the rate of convergence of x(kh). So, larger values of h yield faster\nconvergence of x(kh), and hence of the iterates, to the equilibrium $x^*$. However h cannot be too large, as the method may be\ntoo inaccurate and/or unstable as h increases. Convergence rates of optimization algorithms are thus\ncontrolled by our ability to discretize the gradient flow equation using large step sizes. We recall the\ndifferent conditions that proper linear multistep methods should satisfy.\n\n\u2022 Monic polynomial (Section 2.1). Without loss of generality (dividing both sides of the\ndifference equation of the multistep method (2) by $\\rho_s$ does not change the method).\n\u2022 Explicit method (Section 2.1). We assume that the scheme is explicit in order to avoid\nsolving a non-linear system at each step.\n\u2022 Consistency (Section 2.2). If the method is not consistent, then the local error does not\nconverge when the step size goes to zero.\n\u2022 Zero-stability (Section 2.2). Zero-stability ensures convergence of the global error (Section\n2.3) when the method is also consistent.\n\u2022 Region of absolute stability (Section 2.4). If \u03bbh lies outside the region of absolute stability\nfor some \u03bb \u2208 [\u00b5, L], then the method diverges as tmax increases.\n\nUsing the remaining degrees of freedom, we can tune the algorithm to improve the convergence\nrate on (Linear ODE), which corresponds to the optimization of a quadratic function. Indeed, as\nshown in Proposition 2.8, the largest root of $\\pi_{\\lambda h}(z)$ gives us the rate of convergence on quadratic\nfunctions (when \u03bb \u2208 [\u00b5, L]). Since smooth and strongly convex functions are close to quadratic (being\nsandwiched between two quadratics), this will also give us a good idea of the rate of convergence\non these functions. We do not derive a proof of convergence of the sequence for general smooth\nand (strongly) convex functions (but convergence is proved by Nesterov [2013] or using Lyapunov\ntechniques by Wilson et al. [2016]). Still, our results provide intuition on why accelerated methods\nconverge faster.\n\n3.1 Analysis of two-step methods\n\nWe now analyze convergence of two-step methods (an analysis of Euler\u2019s method is provided in the\nSupplementary Material). 
We first translate the conditions on multistep methods, listed at the beginning\nof this section, into constraints on the coefficients:\n\n$\\rho_2 = 1$ (Monic polynomial)\n$\\sigma_2 = 0$ (Explicit method)\n$\\rho_0 + \\rho_1 + \\rho_2 = 0$ (Consistency)\n$\\sigma_0 + \\sigma_1 + \\sigma_2 = \\rho_1 + 2\\rho_2$ (Consistency)\n$|\\mathrm{Roots}(\\rho)| \\le 1$ (Zero-stability).\n\nCombining the equality constraints with zero-stability defines the set $\\mathcal{L}$ such that\n\n$\\mathcal{L} = \\{\\rho_0, \\rho_1, \\sigma_0, \\sigma_1 : \\; \\rho_1 = -(1 + \\rho_0), \\; \\sigma_1 = 1 - \\rho_0 - \\sigma_0, \\; |\\rho_0| < 1\\}$. (10)\n\nWe now seek conditions on the remaining parameters to produce a stable method. Absolute stability\nrequires that all roots of the polynomial $\\pi_{\\lambda h}(z)$ in (8) are inside the unit circle, which translates into\na condition on the roots of second order equations here. The following proposition gives the values of\nthe roots of $\\pi_{\\lambda h}(z)$ as a function of the parameters $\\rho_i$ and $\\sigma_i$.\nProposition 3.1 Given constants 0 < \u00b5 \u2264 L, a step size h > 0 and a linear two-step method defined\nby (\u03c1, \u03c3), under the conditions\n\n$(\\rho_1 + \\mu h \\sigma_1)^2 \\le 4(\\rho_0 + \\mu h \\sigma_0), \\qquad (\\rho_1 + L h \\sigma_1)^2 \\le 4(\\rho_0 + L h \\sigma_0)$, (11)\n\nthe roots $r_\\pm(\\lambda)$ of $\\pi_{\\lambda h}$, defined in (8), are complex conjugate for any \u03bb \u2208 [\u00b5, L]. Moreover, the\nlargest squared root modulus is equal to\n\n$\\max_{\\lambda \\in [\\mu, L]} |r_\\pm(\\lambda)|^2 = \\max\\{\\rho_0 + \\mu h \\sigma_0, \\; \\rho_0 + L h \\sigma_0\\}$. (12)\n\nThe proof can be found in the Supplementary Material. 
The next step is to minimize the largest\nmodulus (12) in the coefficients $\\rho_i$ and $\\sigma_i$ to get the best rate of convergence, assuming the roots are\ncomplex (the case where the roots are real leads to weaker results).\n\n3.2 Design of a family of two-step methods for quadratics\n\nWe now have all ingredients to build a two-step method for which the sequence xk converges quickly\nto $x^*$ for quadratic functions. Optimizing the convergence rate means solving the following problem,\n\n$\\min \\; \\max\\{\\rho_0 + \\mu h \\sigma_0, \\; \\rho_0 + L h \\sigma_0\\}$\ns.t. $(\\rho_0, \\rho_1, \\sigma_0, \\sigma_1) \\in \\mathcal{L}$,\n$(\\rho_1 + \\mu h \\sigma_1)^2 \\le 4(\\rho_0 + \\mu h \\sigma_0)$,\n$(\\rho_1 + L h \\sigma_1)^2 \\le 4(\\rho_0 + L h \\sigma_0)$,\n\nin the variables $\\rho_0, \\rho_1, \\sigma_0, \\sigma_1, h > 0$, where $\\mathcal{L}$ is defined in (10). If we use the equality constraints\nin (10) and make the following change of variables,\n\n$\\hat{h} = h(1 - \\rho_0), \\qquad c_\\mu = \\rho_0 + \\mu h \\sigma_0, \\qquad c_L = \\rho_0 + L h \\sigma_0$, (13)\n\nthe problem can be solved, for fixed $\\hat{h}$, in the variables $c_\\mu, c_L$. In that case, the optimal solution is\ngiven by\n\n$c_\\mu^* = (1 - \\sqrt{\\mu \\hat{h}})^2, \\qquad c_L^* = (1 - \\sqrt{L \\hat{h}})^2$, (14)\n\nobtained by tightening the two first inequalities, for $\\hat{h} \\in \\, ]0, (1 + \\mu/L)^2 / L[$. Now if we fix $\\hat{h}$ we can recover\na two-step linear method defined by (\u03c1, \u03c3) and a step size h by using the equations in (13). We define\nthe following quantity $\\beta = (1 - \\sqrt{\\mu/L}) / (1 + \\sqrt{\\mu/L})$.\n\nA suboptimal two-step method. 
Setting $\\hat{h} = 1/L$ for example, the parameters of the corresponding\ntwo-step method, called method $M_1$, are\n\n$M_1 = \\{ \\rho(z) = \\beta - (1 + \\beta) z + z^2, \\quad \\sigma(z) = -\\beta(1 - \\beta) + (1 - \\beta^2) z, \\quad h = \\frac{1}{L(1 - \\beta)} \\}$\n\nand its largest modulus root (12) is given by\n\n$\\mathrm{rate}(M_1) = \\sqrt{\\max\\{c_\\mu, c_L\\}} = \\sqrt{c_\\mu} = 1 - \\sqrt{\\mu/L}$.\n\nOptimal two-step method for quadratics. We can compute the optimal $\\hat{h}$ which minimizes the\nmaximum of the two quantities $c_\\mu^*$ and $c_L^*$ defined in (14). The solution simply balances the two terms in\nthe maximum, with $\\hat{h}^* = (1 + \\beta)^2 / L$. This choice of $\\hat{h}$ leads to the method $M_2$, described by\n\n$M_2 = \\{ \\rho(z) = \\beta^2 - (1 + \\beta^2) z + z^2, \\quad \\sigma(z) = (1 - \\beta^2) z, \\quad h = \\frac{1}{\\sqrt{\\mu L}} \\}$ (15)\n\nwith convergence rate\n\n$\\mathrm{rate}(M_2) = \\sqrt{c_\\mu} = \\sqrt{c_L} = \\beta = (1 - \\sqrt{\\mu/L}) / (1 + \\sqrt{\\mu/L}) < \\mathrm{rate}(M_1)$. (16)\n\nWe will now see that methods $M_1$ and $M_2$ are actually related to Nesterov\u2019s accelerated method and\nPolyak\u2019s heavy ball algorithms.\n\n4 On the link between integration and optimization\n\nIn the previous section, we derived a family of linear multistep methods, parametrized by $\\hat{h}$. We\nwill now compare these methods to common optimization algorithms used to minimize L-smooth,\n\u00b5-strongly convex functions.\n\n4.1 Polyak\u2019s heavy ball method\n\nThe heavy ball method was proposed by Polyak [1964]. It adds a momentum term to the gradient step\n\n$x_{k+2} = x_{k+1} - c_1 \\nabla f(x_{k+1}) + c_2 (x_{k+1} - x_k)$,\n\nwhere $c_1 = 4 / (\\sqrt{L} + \\sqrt{\\mu})^2$ and $c_2 = \\beta^2$. 
We can organize the terms in the sequence to match the\ngeneral structure of linear multistep methods, to get\n\n$\\beta^2 x_k - (1 + \\beta^2) x_{k+1} + x_{k+2} = c_1 (-\\nabla f(x_{k+1}))$.\n\nWe easily identify $\\rho(z) = \\beta^2 - (1 + \\beta^2) z + z^2$ and $h \\sigma(z) = c_1 z$. To extract h, we will assume that the\nmethod is consistent (see conditions (5)). All computations done, we can identify the corresponding\nlinear multistep method as\n\n$M_{Polyak} = \\{ \\rho(z) = \\beta^2 - (1 + \\beta^2) z + z^2, \\quad \\sigma(z) = (1 - \\beta^2) z, \\quad h = \\frac{1}{\\sqrt{\\mu L}} \\}$. (17)\n\nThis shows that $M_{Polyak} = M_2$. In fact, this result was expected since Polyak\u2019s method is known\nto be optimal for quadratic functions. However, it is also known that Polyak\u2019s algorithm may fail to\nconverge for some smooth and strongly convex functions [Lessard et al., 2016].\n\n4.2 Nesterov\u2019s accelerated gradient\n\nNesterov\u2019s accelerated method in its simplest form is described by two sequences xk and yk, with\n\n$y_{k+1} = x_k - \\frac{1}{L} \\nabla f(x_k)$,\n$x_{k+1} = y_{k+1} + \\beta (y_{k+1} - y_k)$.\n\nAs above, we will write Nesterov\u2019s accelerated gradient as a linear multistep method by expanding\nyk in the definition of xk, to get\n\n$\\beta x_k - (1 + \\beta) x_{k+1} + x_{k+2} = \\frac{1}{L} \\big( -\\beta (-\\nabla f(x_k)) + (1 + \\beta)(-\\nabla f(x_{k+1})) \\big)$.\n\nAgain, assuming as above that the method is consistent to extract h, we identify the linear multistep\nmethod associated to Nesterov\u2019s algorithm. After identification,\n\n$M_{Nest} = \\{ \\rho(z) = \\beta - (1 + \\beta) z + z^2, \\quad \\sigma(z) = -\\beta(1 - \\beta) + (1 - \\beta^2) z, \\quad h = \\frac{1}{L(1 - \\beta)} \\}$,\n\nwhich means that $M_1 = M_{Nest}$.\n\n4.3 The convergence rate of Nesterov\u2019s method\n\nPushing the analysis a little bit further, we have a simple intuitive argument that explains why\nNesterov\u2019s algorithm is faster than the gradient method. 
There is of course a complete proof of its\nrate of convergence [Nesterov, 2013], even using differential equations arguments [Wibisono et al.,\n2016; Wilson et al., 2016], but we take a simpler approach here. The key parameter is the step size h.\nIf we compare it with the step size in the classical gradient method, Nesterov\u2019s method uses a step\nsize which is $(1 - \\beta)^{-1} \\approx \\sqrt{L/\\mu}$ times larger (up to a constant factor).\n\nRecall that, in continuous time, the rate of convergence of x(t) to $x^*$ is given by\n\n$f(x(t)) - f(x^*) \\le e^{-2\\mu t} (f(x_0) - f(x^*))$.\n\nThe gradient method tries to approximate x(t) using an Euler scheme with step size h = 1/L, which\nmeans $x_k^{(grad)} \\approx x(k/L)$, so\n\n$f(x_k^{(grad)}) - f(x^*) \\approx f(x(k/L)) - f(x^*) \\le (f(x_0) - f(x^*)) e^{-2k\\mu/L}$.\n\nHowever, Nesterov\u2019s method has a step size equal to\n\n$h_{Nest} = \\frac{1}{L(1 - \\beta)} = \\frac{1 + \\sqrt{\\mu/L}}{2\\sqrt{\\mu L}} \\approx \\frac{1}{\\sqrt{4\\mu L}}$, which means $x_k^{nest} \\approx x\\big(k/\\sqrt{4\\mu L}\\big)$,\n\nwhile maintaining stability. In that case, the estimated rate of convergence becomes\n\n$f(x_k^{nest}) - f(x^*) \\approx f\\big(x\\big(k/\\sqrt{4\\mu L}\\big)\\big) - f(x^*) \\le (f(x_0) - f(x^*)) e^{-k\\sqrt{\\mu/L}}$,\n\nwhich is approximately the rate of convergence of Nesterov\u2019s algorithm in discrete time and we\nrecover the accelerated rate in $\\sqrt{\\mu/L}$ versus $\\mu/L$ for gradient descent.\n\nOverall, the accelerated method is more efficient because it integrates the gradient flow faster than\nsimple gradient descent, making longer steps. 
A numerical simulation in Figure 1 makes this argument\nmore visual.\n\nFigure 1: Integration of a linear ODE with optimal (left) and small (right) step sizes.\n\n5 Generalization and Future Work\n\nWe showed that accelerated optimization methods can be seen as multistep integration schemes\napplied to the basic gradient flow equation. Our results give a natural interpretation of acceleration:\nmultistep schemes allow for larger steps, which speeds up convergence. In the Supplementary\nMaterial, we detail further links between integration methods and other well-known optimization\nalgorithms such as proximal point methods, mirror gradient descent, proximal gradient descent, and\ndiscuss the weakly convex case. The extra-gradient algorithm and its recent accelerated version [Diakonikolas\nand Orecchia, 2017] can also be linked to another family of integration methods called\nRunge-Kutta, which notably includes predictor-corrector methods.\nOur stability analysis is limited to the quadratic case, the definition of A-stability being too restrictive\nfor the class of smooth and strongly convex functions. A more appropriate condition would be G-stability,\nwhich extends A-stability to non-linear ODEs, but this condition requires strict monotonicity\nof the error (which is not the case with accelerated algorithms). Stability may also be tackled by\nrecent advances in lower bound theory provided by Taylor [2017] but these yield numerical rather\nthan analytical convergence bounds. 
Our next objective is thus to derive a new stability condition in\nbetween A-stability and G-stability.\n\nAcknowledgments\n\nThe authors would like to acknowledge support from a starting grant from the European Research\nCouncil (ERC project SIPA), from the European Union\u2019s Seventh Framework Programme (FP7-PEOPLE-2013-ITN)\nunder grant agreement number 607290 SpaRTaN, an AMX fellowship, as well as\nsupport from the chaire \u00c9conomie des nouvelles donn\u00e9es with the data science joint research initiative\nwith the fonds AXA pour la recherche and a gift from Soci\u00e9t\u00e9 G\u00e9n\u00e9rale Cross Asset Quantitative\nResearch.\n\n[Figure 1 panels: legend entries x0, xstar, Exact, Euler, Nesterov, Polyak.]\n\nReferences\n\nAllen Zhu, Z. and Orecchia, L. [2017], Linear coupling: An ultimate unification of gradient and\nmirror descent, in \u2018Proceedings of the 8th Innovations in Theoretical Computer Science\u2019, ITCS 17.\n\nBeck, A. and Teboulle, M. [2003], \u2018Mirror descent and nonlinear projected subgradient methods for\nconvex optimization\u2019, Operations Research Letters 31(3), 167\u2013175.\n\nBen-Tal, A. and Nemirovski, A. [2001], Lectures on modern convex optimization: analysis, algorithms,\nand engineering applications, SIAM.\n\nBubeck, S., Tat Lee, Y. and Singh, M. [2015], \u2018A geometric alternative to Nesterov\u2019s accelerated\ngradient descent\u2019, ArXiv e-prints.\n\nDiakonikolas, J. and Orecchia, L. [2017], \u2018Accelerated extra-gradient descent: A novel accelerated\nfirst-order method\u2019, arXiv preprint arXiv:1706.04680.\n\nDuchi, J. C., Shalev-Shwartz, S., Singer, Y. and Tewari, A. [2010], Composite objective mirror\ndescent, in \u2018COLT\u2019, pp. 14\u201326.\n\nGautschi, W. [2011], Numerical analysis, Springer Science & Business Media.\n\nKrichene, W., Bayen, A. and Bartlett, P. L. 
[2015], Accelerated mirror descent in continuous and discrete time, in \u2018Advances in Neural Information Processing Systems\u2019, pp. 2845\u20132853.\n\nLessard, L., Recht, B. and Packard, A. [2016], \u2018Analysis and design of optimization algorithms via integral quadratic constraints\u2019, SIAM Journal on Optimization 26(1), 57\u201395.\n\nNesterov, Y. [1983], A method of solving a convex programming problem with convergence rate O(1/k^2), in \u2018Soviet Mathematics Doklady\u2019, Vol. 27, pp. 372\u2013376.\n\nNesterov, Y. [2007], \u2018Gradient methods for minimizing composite objective function\u2019.\n\nNesterov, Y. [2013], Introductory lectures on convex optimization: A basic course, Vol. 87, Springer Science & Business Media.\n\nNesterov, Y. [2015], \u2018Universal gradient methods for convex optimization problems\u2019, Mathematical Programming 152(1-2), 381\u2013404.\n\nPolyak, B. T. [1964], \u2018Some methods of speeding up the convergence of iteration methods\u2019, USSR Computational Mathematics and Mathematical Physics 4(5), 1\u201317.\n\nSu, W., Boyd, S. and Candes, E. [2014], A differential equation for modeling Nesterov\u2019s accelerated gradient method: Theory and insights, in \u2018Advances in Neural Information Processing Systems\u2019, pp. 2510\u20132518.\n\nS\u00fcli, E. and Mayers, D. F. [2003], An introduction to numerical analysis, Cambridge University Press.\n\nTaylor, A. [2017], Convex Interpolation and Performance Estimation of First-order Methods for Convex Optimization, PhD thesis, Universit\u00e9 catholique de Louvain.\n\nWibisono, A., Wilson, A. C. and Jordan, M. I. [2016], \u2018A variational perspective on accelerated methods in optimization\u2019, Proceedings of the National Academy of Sciences p. 201614734.\n\nWilson, A. C., Recht, B. and Jordan, M. I. 
[2016], \u2018A Lyapunov analysis of momentum methods in optimization\u2019, arXiv preprint arXiv:1611.02635.", "award": [], "sourceid": 758, "authors": [{"given_name": "Damien", "family_name": "Scieur", "institution": "INRIA - ENS"}, {"given_name": "Vincent", "family_name": "Roulet", "institution": "INRIA / ENS Ulm"}, {"given_name": "Francis", "family_name": "Bach", "institution": "Inria"}, {"given_name": "Alexandre", "family_name": "d'Aspremont", "institution": "CNRS - Ecole Normale Sup\u00e9rieure"}]}