{"title": "Acceleration via Symplectic Discretization of High-Resolution Differential Equations", "book": "Advances in Neural Information Processing Systems", "page_first": 5744, "page_last": 5752, "abstract": "We study first-order optimization algorithms obtained by discretizing ordinary differential equations (ODEs) corresponding to Nesterov\u2019s accelerated gradient methods (NAGs) and Polyak\u2019s heavy-ball method. We consider three discretization schemes: symplectic Euler (S), explicit Euler (E) and implicit Euler (I) schemes. We show that the optimization algorithm generated by applying the symplectic scheme to a high-resolution ODE proposed by Shi et al. [2018] achieves the accelerated rate for minimizing both strongly convex function and convex function. On the other hand, the resulting algorithm either fails to achieve acceleration or is impractical when the scheme is implicit, the ODE is low-resolution, or the scheme is explicit.", "full_text": "Acceleration via Symplectic Discretization of\n\nHigh-Resolution Differential Equations\n\nBin Shi\n\nUniversity of California, Berkeley\n\nbinshi@berkeley.edu\n\nSimon S. Du\n\nInstitute for Advanced Study\n\nssdu@ias.edu\n\nWeijie J. Su\n\nUniversity of Pennsylvania\n\nsuw@wharton.upenn.edu\n\nMichael I. Jordan\n\nUniversity of California, Berkeley\njordan@cs.berkeley.edu\n\nAbstract\n\nWe study \ufb01rst-order optimization algorithms obtained by discretizing ordinary\ndifferential equations (ODEs) corresponding to Nesterov\u2019s accelerated gradient\nmethods (NAGs) and Polyak\u2019s heavy-ball method. We consider three discretization\nschemes: symplectic Euler (S), explicit Euler (E) and implicit Euler (I) schemes.\nWe show that the optimization algorithm generated by applying the symplectic\nscheme to a high-resolution ODE proposed by Shi et al. 
[2018] achieves the accelerated rate for minimizing both strongly convex functions and convex functions. On the other hand, the resulting algorithm either fails to achieve acceleration or is impractical when the scheme is implicit, the ODE is low-resolution, or the scheme is explicit.

1 Introduction

In this paper, we consider unconstrained minimization problems:

min_{x \in R^n} f(x),   (1.1)

where f is a smooth convex function. The touchstone method in this setting is gradient descent (GD):

x_{k+1} = x_k - s \nabla f(x_k),   (1.2)

where x_0 is a given initial point and s > 0 is the step size. Whether there exist methods that improve on GD while remaining within the framework of first-order optimization is a subtle and important question.

Modern attempts to address this question date to Polyak [1964, 1987], who incorporated a momentum term into the gradient step, yielding a method that is referred to as the heavy-ball method:

y_{k+1} = x_k - s \nabla f(x_k),   x_{k+1} = y_{k+1} + \alpha (x_k - x_{k-1}),   (1.3)

where \alpha > 0 is a momentum coefficient. While the heavy-ball method provably attains a faster rate of local convergence than GD near a minimum of f, it generally does not provide a guarantee of acceleration globally [Polyak, 1964].

The next major development in first-order methods is due to Nesterov, who introduced first-order gradient methods that have a faster global convergence rate than GD [Nesterov, 1983, 2013]. For a \mu-strongly convex objective f with L-Lipschitz gradients, Nesterov's accelerated gradient method (NAG-SC) involves the following pair of update equations:

y_{k+1} = x_k - s \nabla f(x_k),   x_{k+1} = y_{k+1} + \frac{1 - \sqrt{\mu s}}{1 + \sqrt{\mu s}} (y_{k+1} - y_k).   (1.4)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

If one sets s = 1/L, then NAG-SC enjoys a O((1 - \sqrt{\mu/L})^k) convergence rate, improving on the O((1 - \mu/L)^k) convergence rate of GD. 
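As a minimal numerical illustration of the gap between (1.2) and (1.4), the two update rules can be coded as follows. This is our own sketch, not part of the paper: the quadratic test function, dimensions and iteration counts are illustrative assumptions.

```python
import numpy as np

# Illustrative strongly convex quadratic f(x) = 0.5 * x^T A x with mu = 1, L = 100.
A = np.diag([1.0, 100.0])
mu, L = 1.0, 100.0
s = 1.0 / L                      # step size s = 1/L, as in the text
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def gd(x0, iters):
    """Gradient descent, Equation (1.2)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - s * grad(x)
    return x

def nag_sc(x0, iters):
    """Nesterov's accelerated gradient method for strongly convex f, Equation (1.4)."""
    coef = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))
    x, y = x0.copy(), x0.copy()
    for _ in range(iters):
        y_next = x - s * grad(x)       # gradient step
        x = y_next + coef * (y_next - y)  # momentum step
        y = y_next
    return x

x0 = np.array([1.0, 1.0])
print(f(gd(x0, 200)), f(nag_sc(x0, 200)))  # NAG-SC reaches a far smaller objective value
```

With \mu/L = 10^{-2}, GD shrinks the error in the low-curvature coordinate by only a factor of 1 - \mu/L per step, while NAG-SC shrinks it at roughly 1 - \sqrt{\mu/L} per step, which is exactly the acceleration phenomenon discussed above.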
Nesterov also developed an accelerated algorithm (NAG-C) targeting smooth convex functions that are not strongly convex:

y_{k+1} = x_k - s \nabla f(x_k),   x_{k+1} = y_{k+1} + \frac{k}{k+3} (y_{k+1} - y_k).   (1.5)

This algorithm has a O(L/k^2) convergence rate, which is faster than GD's O(L/k) rate.

While yielding optimal and effective algorithms, the design principle of Nesterov's accelerated gradient algorithms (NAG) is not transparent. Convergence proofs for NAG often use the estimate sequence technique, which is inductive in nature and relies on a series of algebraic tricks [Bubeck, 2015]. In recent years progress has been made in the understanding of acceleration by moving to a continuous-time formulation. In particular, Su et al. [2016] showed that as s \to 0, NAG-C converges to an ordinary differential equation (ODE) (Equation (2.2)); moreover, for this ODE, Su et al. [2016] derived a (continuous-time) convergence rate using a Lyapunov function, and further transformed this Lyapunov function to a discrete version and thereby provided a new proof of the fact that NAG-C enjoys a O(L/k^2) rate.

Further progress in this vein has involved taking a variational point of view that derives ODEs from an underlying Lagrangian rather than from a limiting argument [Wibisono et al., 2016]. While this approach captures many of the variations of Nesterov acceleration presented in the literature, it does not distinguish between the heavy-ball dynamics and the NAG dynamics, and thus fails to distinguish between local and global acceleration. More recently, Shi et al. [2018] have returned to limiting arguments with a more sophisticated methodology. They have derived high-resolution ODEs for the heavy-ball method (Equation (2.4)), NAG-SC (Equation (2.5)) and NAG-C (Equation (2.6)). Notably, the high-resolution ODEs for the heavy-ball dynamics and the accelerated dynamics are different. Shi et al. 
[2018] also presented Lyapunov functions for these ODEs as well as the corresponding algorithms, and showed that these Lyapunov functions can be used to derive the accelerated rates of NAG-SC and NAG-C. A number of other papers have also contributed to the understanding of acceleration by working in a continuous-time formulation [Krichene and Bartlett, 2017, Krichene et al., 2015, Diakonikolas and Orecchia, 2017, Ghadimi and Lan, 2016].

This emerging literature has thus provided a new level of understanding of design principles for accelerated optimization. The design involves an interplay between continuous-time and discrete-time dynamics. ODEs are obtained either variationally or via a limiting scheme, and various properties of the ODEs are studied, including their convergence rate, topological aspects of their flow and their behavior under perturbation. Lyapunov functions play a key role in such analyses, and also allow aspects of the continuous-time analysis to be transferred to discrete time [see, e.g., Wilson et al., 2016].

And yet the literature has not yet provided a full exploration of the transition from continuous-time ODEs to discrete-time algorithms. Indeed, this transition is a non-trivial one, as evidenced by the decades of research on numerical methods for the discretization of ODEs, including most notably the sophisticated arsenal of techniques referred to as "geometric numerical integration" that are used for ODEs obtained from underlying variational principles [Hairer et al., 2006]. Recent work has begun to explore these issues; examples include the use of symplectic integrators by Betancourt et al. [2018] and the use of Runge-Kutta integration by Zhang et al. [2018]. 
However, these methods do not always yield proofs that accelerated rates are retained in discrete time, and when they do they involve implicit discretization, which is generally not practical except in the setting of quadratic objectives. Thus we wish to address the following fundamental question:

Can we systematically and provably obtain new accelerated methods via the numerical discretization of ordinary differential equations?

Our approach to this question is a dynamical systems framework based on Lyapunov theory. Our main results are as follows:

1. In Section 3.1, we consider three simple numerical discretization schemes, the symplectic Euler (S), explicit Euler (E) and implicit Euler (I) schemes, to discretize the high-resolution ODE of Nesterov's accelerated method for strongly convex functions. We show that the optimization method generated by symplectic discretization achieves a O((1 - O(1)\sqrt{\mu/L})^k) rate, thereby attaining acceleration. In sharp contrast, the implicit scheme is not practical for implementation, and the explicit scheme, while being simple, fails to achieve acceleration.

2. In Section 3.2, we apply these discretization schemes to the ODE for modeling the heavy-ball method, which can be viewed as a low-resolution ODE that lacks a gradient-correction term [Shi et al., 2018]. In contrast to the previous two cases of high-resolution ODEs, the symplectic scheme does not achieve acceleration for this low-resolution ODE. More broadly, in Appendix D we present more examples of low-resolution ODEs where symplectic discretization does not lead to acceleration.

3. Next, we apply the three simple Euler schemes to the high-resolution ODE of Nesterov's accelerated method for convex functions. Again, our Lyapunov analysis sheds light on the superiority of the symplectic scheme over the other two schemes. 
This is the subject of Section 4.

Taken together, the three findings have the implication that high-resolution ODEs and symplectic schemes are critical to achieving acceleration using numerical discretization. More precisely, in addition to allowing relatively simple implementations, symplectic schemes allow for a large step size without a loss of stability, in a manner akin to (but better than) implicit schemes. In stark contrast, in the setting of low-resolution ODEs, only the implicit schemes remain stable with a large step size, due to the lack of gradient correction. Moreover, the choice of Lyapunov function is equally essential to obtaining sharp convergence rates. This important fact is highlighted in Theorem A.6 in the Appendix, where we analyze GD by considering it as a discretization method for gradient flow (the ODE counterpart of GD). Using the discrete version of the Lyapunov function proposed in Su et al. [2016] instead of the classical one, we show that GD in fact minimizes the squared gradient norm (choosing the best iterate so far) at a rate of O(L^2/k^2). Although this rate of convergence in the problem of squared gradient norm minimization is known in the literature [Nesterov, 2012], the Lyapunov function argument provides a systematic approach to obtaining this rate in this problem and others. In particular, this example demonstrates the usefulness and flexibility of Lyapunov functions as a mathematical tool for optimization problems.

2 Preliminaries

In this section, we introduce necessary notation, and review ODEs derived in previous work and three classical numerical discretization schemes. We mostly follow the notation of Nesterov [2013], with slight modifications tailored to the present paper. 
Let \mathcal{F}^1_L(R^n) be the class of L-smooth convex functions defined on R^n; that is, f \in \mathcal{F}^1_L(R^n) if f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle for all x, y \in R^n and its gradient is L-Lipschitz continuous in the sense that

\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|,

where \|\cdot\| denotes the standard Euclidean norm and L > 0 is the Lipschitz constant. The function class \mathcal{F}^2_L(R^n) is the subclass of \mathcal{F}^1_L(R^n) such that each f has a Lipschitz-continuous Hessian. For p = 1, 2, let \mathcal{S}^p_{\mu,L}(R^n) denote the subclass of \mathcal{F}^p_L(R^n) such that each member f is \mu-strongly convex for some 0 < \mu \le L. That is, f \in \mathcal{S}^p_{\mu,L}(R^n) if f \in \mathcal{F}^p_L(R^n) and

f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2

for all x, y \in R^n. Let x^\star denote a minimizer of f(x).

2.1 Approximating ODEs

In this section we list all of the ODEs that we will discretize in this paper. We refer readers to recent papers by Su et al. [2016], Wibisono et al. [2016] and Shi et al. [2018] for the rigorous derivations of these ODEs. We begin with the simplest. Taking the step size s \to 0 in Equation (1.2), we obtain the following ODE (gradient flow):

\dot{X} = -\nabla f(X),   (2.1)

with any initial X(0) = x_0 \in R^n.

Next, by taking s \to 0 in Equation (1.5), Su et al. [2016] derived the low-resolution ODE of NAG-C:

\ddot{X} + \frac{3}{t} \dot{X} + \nabla f(X) = 0,   (2.2)

with X(0) = x_0 and \dot{X}(0) = 0. For strongly convex functions, by taking s \to 0, one can derive the following low-resolution ODE (see, for example, Wibisono et al. [2016])

\ddot{X} + 2\sqrt{\mu} \dot{X} + \nabla f(X) = 0   (2.3)

that models both the heavy-ball method and NAG-SC. This ODE has the same initial conditions as (2.2).

Recently, Shi et al. [2018] proposed high-resolution ODEs for modeling acceleration methods. The key ingredient in these ODEs is that the O(\sqrt{s}) terms are preserved in the ODEs. 
As a result, the heavy-ball method and NAG-SC have different models as ODEs.

(a) If f \in \mathcal{S}^1_{\mu,L}(R^n), the high-resolution ODE of the heavy-ball method (1.3) is

\ddot{X} + 2\sqrt{\mu} \dot{X} + (1 + \sqrt{\mu s}) \nabla f(X) = 0,   (2.4)

with X(0) = x_0 and \dot{X}(0) = -\frac{2\sqrt{s} \nabla f(x_0)}{1 + \sqrt{\mu s}}. This ODE has essentially the same properties as its low-resolution counterpart (2.3) due to the absence of \nabla^2 f(X) \dot{X}.

(b) If f \in \mathcal{S}^2_{\mu,L}(R^n), the high-resolution ODE of NAG-SC (1.4) is

\ddot{X} + 2\sqrt{\mu} \dot{X} + \sqrt{s} \nabla^2 f(X) \dot{X} + (1 + \sqrt{\mu s}) \nabla f(X) = 0,   (2.5)

with X(0) = x_0 and \dot{X}(0) = -\frac{2\sqrt{s} \nabla f(x_0)}{1 + \sqrt{\mu s}}.

(c) If f \in \mathcal{F}^2_L(R^n), the high-resolution ODE of NAG-C (1.5) is

\ddot{X} + \frac{3}{t} \dot{X} + \sqrt{s} \nabla^2 f(X) \dot{X} + \left(1 + \frac{3\sqrt{s}}{2t}\right) \nabla f(X) = 0   (2.6)

for t \ge 3\sqrt{s}/2, with X(3\sqrt{s}/2) = x_0 and \dot{X}(3\sqrt{s}/2) = -\sqrt{s} \nabla f(x_0).

2.2 Discretization schemes

To discretize the ODEs (2.1)-(2.6), we replace \dot{X} by x_{k+1} - x_k, \dot{V} by v_{k+1} - v_k and replace other terms with approximations. 
Different discretization schemes correspond to different approximations.

• The most straightforward scheme is the explicit scheme, which uses the following approximation rule:

x_{k+1} - x_k = \sqrt{s} v_k,   \sqrt{s} \nabla^2 f(x_k) v_k \approx \nabla f(x_{k+1}) - \nabla f(x_k).

• Another discretization scheme is the implicit scheme, which uses the following approximation rule:

x_{k+1} - x_k = \sqrt{s} v_{k+1},   \sqrt{s} \nabla^2 f(x_{k+1}) v_{k+1} \approx \nabla f(x_{k+1}) - \nabla f(x_k).

Note that compared with the explicit scheme, the implicit scheme is not practical because the update of x_{k+1} requires knowing v_{k+1} while the update of v_{k+1} requires knowing x_{k+1}.

• The last discretization scheme considered in this paper is the symplectic scheme, which uses the following approximation rule:

x_{k+1} - x_k = \sqrt{s} v_k,   \sqrt{s} \nabla^2 f(x_{k+1}) v_k \approx \nabla f(x_{k+1}) - \nabla f(x_k).

Note this scheme is practical because the update of x_{k+1} only requires knowing v_k.

We remark that for low-resolution ODEs, there is no \nabla^2 f(X) term, whereas for high-resolution ODEs, we have this term and we use the difference of gradients to approximate it. This additional approximation term is critical to acceleration.

3 High-Resolution ODEs for Strongly Convex Functions

This section considers numerical discretization of the high-resolution ODEs of NAG-SC and the heavy-ball method using the symplectic Euler, explicit Euler and implicit Euler schemes. In particular, we compare the rates of convergence towards the objective minimum of the three simple Euler schemes and the two methods (NAG-SC and the heavy-ball method) in Section 3.1 and Section 3.2, respectively. For both cases, the associated symplectic scheme is shown to exhibit a surprising similarity to the corresponding classical method.

3.1 NAG-SC

The high-resolution ODE (2.5) of NAG-SC can be equivalently written in the phase space as

\dot{X} = V,   \dot{V} = -2\sqrt{\mu} V - \sqrt{s} \nabla^2 f(X) V - (1 + \sqrt{\mu s}) \nabla f(X),   (3.1)

with the initial conditions X(0) = x_0 and V(0) = -\frac{2\sqrt{s} \nabla f(x_0)}{1 + \sqrt{\mu s}}. For any f \in \mathcal{S}^2_{\mu,L}(R^n), Theorem 1 of Shi et al. 
[2018] shows that the solution X = X(t) of the ODE (2.5) satisfies

f(X) - f(x^\star) \le \frac{2 \|x_0 - x^\star\|^2}{s} e^{-\frac{\sqrt{\mu} t}{4}}

for any step size 0 < s \le 1/L. In particular, setting the step size to s = 1/L, we get

f(X) - f(x^\star) \le 2L \|x_0 - x^\star\|^2 e^{-\frac{\sqrt{\mu} t}{4}}.

In the phase space representation, NAG-SC is formulated as

x_{k+1} - x_k = \sqrt{s} v_k,
v_{k+1} - v_k = -\frac{2\sqrt{\mu s}}{1 - \sqrt{\mu s}} v_{k+1} - \sqrt{s} (\nabla f(x_{k+1}) - \nabla f(x_k)) - \frac{1 + \sqrt{\mu s}}{1 - \sqrt{\mu s}} \cdot \sqrt{s} \nabla f(x_{k+1}),   (3.2)

with the initial condition v_0 = -\frac{2\sqrt{s} \nabla f(x_0)}{1 + \sqrt{\mu s}} for any x_0. This method maintains the accelerated rate

f(x_k) - f(x^\star) \le \frac{5L \|x_0 - x^\star\|^2}{(1 + \sqrt{\mu/L}/12)^k}

(see Theorem 3 in Shi et al. [2018]) and the identification t \approx k\sqrt{s}.

Viewing NAG-SC as a numerical discretization of (2.5), one might wonder if any of the three simple Euler schemes (symplectic, explicit, and implicit) maintain the accelerated rate in discretizing the high-resolution ODE. For clarity, the update rules of the three schemes are given as follows, each with the initial points x_0 and v_0 = -\frac{2\sqrt{s} \nabla f(x_0)}{1 + \sqrt{\mu s}}.

Euler schemes of (3.1): (S), (E) and (I), respectively:

(S): x_{k+1} - x_k = \sqrt{s} v_k,
     v_{k+1} - v_k = -2\sqrt{\mu s} v_{k+1} - \sqrt{s} (\nabla f(x_{k+1}) - \nabla f(x_k)) - \sqrt{s} (1 + \sqrt{\mu s}) \nabla f(x_{k+1}).

(E): x_{k+1} - x_k = \sqrt{s} v_k,
     v_{k+1} - v_k = -2\sqrt{\mu s} v_k - \sqrt{s} (\nabla f(x_{k+1}) - \nabla f(x_k)) - \sqrt{s} (1 + \sqrt{\mu s}) \nabla f(x_k).

(I): x_{k+1} - x_k = \sqrt{s} v_{k+1},
     v_{k+1} - v_k = -2\sqrt{\mu s} v_{k+1} - \sqrt{s} (\nabla f(x_{k+1}) - \nabla f(x_k)) - \sqrt{s} (1 + \sqrt{\mu s}) \nabla f(x_{k+1}).

Among the three Euler schemes, the symplectic scheme is the closest to NAG-SC (3.2). More precisely, NAG-SC differs from the symplectic scheme only in an additional factor of \frac{1}{1 - \sqrt{\mu s}} in the second line of (3.2). 
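To make the update rules concrete, here is a small sketch of the symplectic and implicit schemes (our own illustrative code, not from the paper; the quadratic objective and iteration counts are assumptions). For a quadratic, \nabla f(x) = Ax, so the implicit step reduces to a linear solve; for general objectives it would require a nonlinear fixed-point solve.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 * x^T A x (mu = 1, L = 100); grad f(x) = A x.
A = np.diag([1.0, 100.0])
mu, L = 1.0, 100.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x0 = np.array([1.0, 1.0])

def init_v(x, s):
    # common initial velocity v_0 = -2*sqrt(s)*grad f(x_0) / (1 + sqrt(mu*s))
    return -2 * np.sqrt(s) * grad(x) / (1 + np.sqrt(mu * s))

def symplectic(s, iters):
    x, v = x0.copy(), init_v(x0, s)
    for _ in range(iters):
        x_new = x + np.sqrt(s) * v            # x_{k+1} = x_k + sqrt(s) v_k
        # (S): v_{k+1} appears in the damping term, so solve for it
        v = (v - np.sqrt(s) * (grad(x_new) - grad(x))
               - np.sqrt(s) * (1 + np.sqrt(mu * s)) * grad(x_new)) / (1 + 2 * np.sqrt(mu * s))
        x = x_new
    return x

def implicit(s, iters):
    # (I) for quadratic f: substituting x_{k+1} = x_k + sqrt(s) v_{k+1} into the
    # v-update gives the linear system M v_{k+1} = v_k - sqrt(s)(1 + sqrt(mu*s)) A x_k.
    I = np.eye(len(x0))
    M = (1 + 2 * np.sqrt(mu * s)) * I + s * A + s * (1 + np.sqrt(mu * s)) * A
    x, v = x0.copy(), init_v(x0, s)
    for _ in range(iters):
        v = np.linalg.solve(M, v - np.sqrt(s) * (1 + np.sqrt(mu * s)) * grad(x))
        x = x + np.sqrt(s) * v
    return x

print(f(symplectic(4 / (9 * L), 2000)))  # large step size, accelerated decay
print(f(implicit(1 / L, 1000)))          # accelerated as well, but needs a solve per step
```

The explicit scheme can be coded analogously by evaluating the damping and gradient terms at (x_k, v_k); with its much smaller admissible step size of order \mu/L^2 it remains stable but, as discussed below, without acceleration.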
When the step size s is small, NAG-SC is, roughly speaking, a symplectic method if we make use of \frac{1}{1 - \sqrt{\mu s}} \approx 1. In relation to the literature, the connection between accelerated methods and symplectic schemes has been explored in Betancourt et al. [2018], which mainly considers the leapfrog integrator, a second-order symplectic integrator. In contrast, the symplectic Euler scheme studied in this paper is a first-order symplectic integrator.

Interestingly, the close resemblance between the two algorithms is found not only in their formulations, but also in their convergence rates, which are both accelerated as shown by Theorem B.1 and Theorem 3.1. Note that the discrete Lyapunov function used in the proof for the symplectic Euler scheme of (3.1) is

E(k) = \frac{1}{4} \|v_k\|^2 + \frac{1}{4} \left\|2\sqrt{\mu}(x_{k+1} - x^\star) + v_k + \sqrt{s} \nabla f(x_k)\right\|^2 + (1 + \sqrt{\mu s}) (f(x_k) - f(x^\star)) - \frac{(1 + \sqrt{\mu s})^2}{1 + 2\sqrt{\mu s}} \cdot \frac{s}{2} \|\nabla f(x_k)\|^2.   (3.3)

The proof of Theorem B.1 is deferred to Appendix B.1. The following result is a useful consequence of this theorem.

Theorem 3.1 (Discretization of NAG-SC ODE). For any f \in \mathcal{S}^1_{\mu,L}(R^n), the following conclusions hold:

(a) Taking step size s = 4/(9L), the symplectic Euler scheme of (3.1) satisfies

f(x_k) - f(x^\star) \le \frac{5L \|x_0 - x^\star\|^2}{\left(1 + \frac{1}{9}\sqrt{\mu/L}\right)^k}.   (3.4)

(b) Taking step size s = \mu/(100L^2), the explicit Euler scheme of (3.1) satisfies

f(x_k) - f(x^\star) \le 3L \|x_0 - x^\star\|^2 \left(1 - \frac{\mu}{80L}\right)^k.   (3.5)

(c) Taking step size s = 1/L, the implicit Euler scheme of (3.1) satisfies

f(x_k) - f(x^\star) \le \frac{13\mu \|x_0 - x^\star\|^2}{4\left(1 + \frac{1}{4}\sqrt{\mu/L}\right)^k}.   (3.6)

In addition, Theorem 3.1 shows that the implicit scheme also achieves acceleration. 
However, unlike NAG-SC, the symplectic scheme, and the explicit scheme, the implicit scheme is generally not easy to use in practice because it requires solving a nonlinear fixed-point equation when the objective is not quadratic. On the other hand, the explicit scheme can only take a smaller step size O(\mu/L^2), which prevents this scheme from achieving acceleration.

3.2 The heavy-ball method

We turn to the heavy-ball method ODE (2.4), whose phase space representation reads

\dot{X} = V,   \dot{V} = -2\sqrt{\mu} V - (1 + \sqrt{\mu s}) \nabla f(X),   (3.7)

with the initial conditions X(0) = x_0 and V(0) = -\frac{2\sqrt{s} \nabla f(x_0)}{1 + \sqrt{\mu s}}. Theorem 2 in Shi et al. [2018] shows that the solution X = X(t) to this ODE satisfies

f(X(t)) - f(x^\star) \le \frac{7 \|x_0 - x^\star\|^2}{2s} e^{-\frac{\sqrt{\mu} t}{4}}

for f \in \mathcal{S}^1_{\mu,L}(R^n) and any step size 0 < s \le 1/L. In particular, taking s = 1/L gives

f(X(t)) - f(x^\star) \le \frac{7L \|x_0 - x^\star\|^2}{2} e^{-\frac{\sqrt{\mu} t}{4}}.

Returning to the discrete regime, Polyak's heavy-ball method uses the following update rule:

x_{k+1} - x_k = \sqrt{s} v_k,
v_{k+1} - v_k = -\frac{2\sqrt{\mu s}}{1 - \sqrt{\mu s}} v_{k+1} - \frac{1 + \sqrt{\mu s}}{1 - \sqrt{\mu s}} \cdot \sqrt{s} \nabla f(x_{k+1}),

which attains a non-accelerated rate (see Theorem 4 of Shi et al. [2018]):

f(x_k) - f(x^\star) \le \frac{5L \|x_0 - x^\star\|^2}{\left(1 + \frac{\mu}{16L}\right)^k}.   (3.8)

The three simple Euler schemes for numerically solving the ODE (2.4) are given as follows. Every scheme starts with any arbitrary x_0 and v_0 = -\frac{2\sqrt{s} \nabla f(x_0)}{1 + \sqrt{\mu s}}. As in the case of NAG-SC, the symplectic scheme is the closest to the heavy-ball method.

Euler schemes of (3.7): (S), (E) and (I), respectively:

(S): x_{k+1} - x_k = \sqrt{s} v_k,
     v_{k+1} - v_k = -2\sqrt{\mu s} v_{k+1} - \sqrt{s} (1 + \sqrt{\mu s}) \nabla f(x_{k+1}).

(E): x_{k+1} - x_k = \sqrt{s} v_k,
     v_{k+1} - v_k = -2\sqrt{\mu s} v_k - \sqrt{s} (1 + \sqrt{\mu s}) \nabla f(x_k).

(I): x_{k+1} - x_k = \sqrt{s} v_{k+1},
     v_{k+1} - v_k = -2\sqrt{\mu s} v_{k+1} - \sqrt{s} (1 + \sqrt{\mu s}) \nabla f(x_{k+1}).

The theorem below characterizes the convergence rates of the three schemes. 
This theorem is extended to general step sizes by Theorem B.2 in Appendix B.2.

Theorem 3.2 (Discretization of heavy-ball ODE). For any f \in \mathcal{S}^1_{\mu,L}(R^n), the following conclusions hold:

(a) Taking step size s = \mu/(16L^2), the symplectic Euler scheme of (3.7) satisfies

f(x_k) - f(x^\star) \le \frac{3L \|x_0 - x^\star\|^2}{\left(1 + \frac{\mu}{16L}\right)^k}.   (3.9)

(b) Taking step size s = \mu/(36L^2), the explicit Euler scheme of (3.7) satisfies

f(x_k) - f(x^\star) \le 3L \|x_0 - x^\star\|^2 \left(1 - \frac{\mu}{48L}\right)^k.   (3.10)

(c) Taking step size s = 1/L, the implicit Euler scheme of (3.7) satisfies

f(x_k) - f(x^\star) \le \frac{15L \|x_0 - x^\star\|^2}{4\left(1 + \frac{1}{4}\sqrt{\mu/L}\right)^k}.   (3.11)

Taken together, (3.8) and Theorem 3.2 imply that neither the heavy-ball method nor the symplectic scheme attains an accelerated rate. In contrast, the implicit scheme achieves acceleration as in the NAG-SC case, but it is impractical except for quadratic objectives.

4 High-Resolution ODEs for Convex Functions

In this section, we turn to numerical discretization of the high-resolution ODE (2.6) related to NAG-C. All proofs are deferred to Appendix C. This ODE in the phase space representation reads [Shi et al., 2018] as follows:

\dot{X} = V,   \dot{V} = -\frac{3}{t} V - \sqrt{s} \nabla^2 f(X) V - \left(1 + \frac{3\sqrt{s}}{2t}\right) \nabla f(X),   (4.1)

with X(3\sqrt{s}/2) = x_0 and V(3\sqrt{s}/2) = -\sqrt{s} \nabla f(x_0). Theorem 5 of Shi et al. [2018] shows the following. Let f \in \mathcal{F}^1_L(R^n). For any step size 0 < s \le 1/L, the solution X = X(t) of the high-resolution ODE (2.6) satisfies

f(X) - f(x^\star) \le \frac{(4 + 3sL) \|x_0 - x^\star\|^2}{t (2t + \sqrt{s})},
\inf_{t_0 \le u \le t} \|\nabla f(X(u))\|^2 \le \frac{(12 + 9sL) \|x_0 - x^\star\|^2}{2\sqrt{s} (t^3 - t_0^3)},   (4.2)

for any t > t_0 = 1.5\sqrt{s}. A caveat here is that it is unclear how to use a Lyapunov function to prove convergence of the (simple) explicit, symplectic or implicit Euler scheme obtained by direct numerical discretization of the ODE (2.2). 
See Appendix C.2 for more discussion on this point. Therefore, we slightly modify the ODE to the following one:

\dot{X} = V,   \dot{V} = -\frac{3}{t} V - \sqrt{s} \nabla^2 f(X) V - \left(1 + \frac{3\sqrt{s}}{t}\right) \nabla f(X).   (4.3)

The only difference is in the third term on the right-hand side of the second equation, where we replace \left(1 + \frac{3\sqrt{s}}{2t}\right) \nabla f(X) by \left(1 + \frac{3\sqrt{s}}{t}\right) \nabla f(X). Now, we apply the three schemes to this (modified) ODE in the phase space, including the original NAG-C, which all start with x_0 and v_0 = -\sqrt{s} \nabla f(x_0).

Euler schemes of (4.3): (S), (E) and (I), respectively:

(S): x_{k+1} - x_k = \sqrt{s} v_k,
     v_{k+1} - v_k = -\frac{3}{k+1} v_{k+1} - \sqrt{s} (\nabla f(x_{k+1}) - \nabla f(x_k)) - \sqrt{s} \left(\frac{k+4}{k+1}\right) \nabla f(x_{k+1}).

(E): x_{k+1} - x_k = \sqrt{s} v_k,
     v_{k+1} - v_k = -\frac{3}{k} v_k - \sqrt{s} (\nabla f(x_{k+1}) - \nabla f(x_k)) - \sqrt{s} \left(\frac{k+3}{k}\right) \nabla f(x_k).

(I): x_{k+1} - x_k = \sqrt{s} v_{k+1},
     v_{k+1} - v_k = -\frac{3}{k+1} v_{k+1} - \sqrt{s} (\nabla f(x_{k+1}) - \nabla f(x_k)) - \sqrt{s} \left(\frac{k+4}{k+1}\right) \nabla f(x_{k+1}).

Theorem 4.1. Let f \in \mathcal{F}^1_L(R^n). The following statements are true:

(a) For any step size 0 < s \le 1/(3L), the symplectic Euler scheme of (4.3) (the original NAG-C) satisfies

f(x_k) - f(x^\star) \le \frac{119 \|x_0 - x^\star\|^2}{s (k+1)^2},   \min_{0 \le i \le k} \|\nabla f(x_i)\|^2 \le \frac{8568 \|x_0 - x^\star\|^2}{s^2 (k+1)^3};   (4.4)

(b) Taking any step size 0 < s \le 1/L, the implicit Euler scheme of (4.3) satisfies

f(x_k) - f(x^\star) \le \frac{(3sL + 2) \|x_0 - x^\star\|^2}{s (k+2)(k+3)},   \min_{0 \le i \le k} \|\nabla f(x_i)\|^2 \le \frac{(3sL + 2) \|x_0 - x^\star\|^2}{s^2 (k+1)^3}.   (4.5)

Note that Theorem 4.1 (a) is the same as Theorem 6 of Shi et al. [2018]. 
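The symplectic scheme of (4.3) can be sketched in code as follows (our own illustration, not from the paper; the quadratic objective, initial point and iteration count are assumptions, with step size s = 1/(3L) as in Theorem 4.1(a)).

```python
import numpy as np

# Illustrative smooth convex objective f(x) = 0.5 * ||x||^2, so L = 1.
L = 1.0
f = lambda x: 0.5 * x @ x
grad = lambda x: x

def nag_c_symplectic(x0, s, iters):
    # Symplectic Euler scheme (S) of the modified ODE (4.3). Since v_{k+1}
    # appears in the damping term, we solve the (linear in v_{k+1}) update.
    x, v = x0.copy(), -np.sqrt(s) * grad(x0)   # v_0 = -sqrt(s) grad f(x_0)
    for k in range(iters):
        x_new = x + np.sqrt(s) * v             # x_{k+1} = x_k + sqrt(s) v_k
        rhs = (v - np.sqrt(s) * (grad(x_new) - grad(x))
                 - np.sqrt(s) * ((k + 4) / (k + 1)) * grad(x_new))
        v = rhs / (1 + 3 / (k + 1))
        x = x_new
    return x

x0 = np.array([1.0, -1.0])
xs = nag_c_symplectic(x0, s=1 / (3 * L), iters=200)
print(f(xs))  # small: consistent with the O(1/(s k^2)) bound of Theorem 4.1(a)
```

By Theorem 4.1(a), after k iterations the objective gap is at most 119 ||x_0 - x*||^2 / (s (k+1)^2), so longer runs with this step size shrink the gap quadratically in k.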
The explicit Euler scheme does not guarantee convergence; see the analysis in Appendix C.1.

5 Discussion

In this paper, we have analyzed the convergence rates of three numerical discretization schemes, the symplectic Euler scheme, explicit Euler scheme, and implicit Euler scheme, applied to ODEs that are used for modeling Nesterov's accelerated methods and Polyak's heavy-ball method. The symplectic scheme is shown to achieve accelerated rates for the high-resolution ODEs of NAG-SC and (slightly modified) NAG-C [Shi et al., 2018], whereas no accelerated rates are observed when the same scheme is used to discretize the low-resolution counterparts [Su et al., 2016]. For comparison, the explicit scheme only allows for a small step size in discretizing these ODEs in order to ensure stability, thereby failing to achieve acceleration. Although the implicit scheme is proved to yield accelerated methods no matter whether high-resolution or low-resolution ODEs are discretized, this scheme is generally not practical except in a limited number of cases (for example, quadratic objectives).

We conclude this paper by presenting several directions for future work. This work suggests that both symplectic schemes and high-resolution ODEs are crucial for numerical discretization to achieve acceleration. It would be of interest to formalize and prove this assertion. For example, does any higher-order symplectic scheme maintain acceleration for the high-resolution ODEs of NAGs? What is the fundamental mechanism by which the gradient correction in the high-resolution ODEs stabilizes symplectic discretization? Moreover, since the discretizations are applied to the modified high-resolution ODE of NAG-C, it is tempting to perform a comparison study between the two high-resolution ODEs in terms of discretization properties. 
Finally, recognizing that Nesterov's method (NAG-SC) is very similar to, but still different from, the corresponding symplectic scheme, one can design new algorithms as interpolations of the two methods; it would be interesting to investigate the convergence properties of these new algorithms.

References

Michael Betancourt, Michael I Jordan, and Ashia C Wilson. On symplectic optimization. arXiv preprint arXiv:1802.03653, 2018.

Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.

Jelena Diakonikolas and Lorenzo Orecchia. The approximate duality gap technique: A unified theory of first-order methods. arXiv preprint arXiv:1712.02485, 2017.

Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59-99, 2016.

Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, volume 31. Springer Science & Business Media, 2006.

Walid Krichene and Peter L Bartlett. Acceleration and averaging in stochastic descent dynamics. In Advances in Neural Information Processing Systems, pages 6796-6806, 2017.

Walid Krichene, Alexandre Bayen, and Peter L Bartlett. Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems, pages 2845-2853, 2015.

Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372-376, 1983.

Yurii Nesterov. How to make the gradients small. Optima, 88:10-11, 2012.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Boris T Polyak. Some methods of speeding up the convergence of iteration methods. 
USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.

Boris T Polyak. Introduction to Optimization. Optimization Software, Inc, New York, 1987.

Bin Shi, Simon S Du, Michael I Jordan, and Weijie J Su. Understanding the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:1810.08907, 2018.

Weijie Su, Stephen Boyd, and Emmanuel J Candès. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1-43, 2016.

Andre Wibisono, Ashia C Wilson, and Michael I Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351-E7358, 2016.

Ashia C Wilson, Benjamin Recht, and Michael I Jordan. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.

Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct Runge-Kutta discretization achieves acceleration. arXiv preprint arXiv:1805.00521, 2018.
", "award": [], "sourceid": 3080, "authors": [{"given_name": "Bin", "family_name": "Shi", "institution": "UC Berkeley"}, {"given_name": "Simon", "family_name": "Du", "institution": "Institute for Advanced Study"}, {"given_name": "Weijie", "family_name": "Su", "institution": "The Wharton School, University of Pennsylvania"}, {"given_name": "Michael", "family_name": "Jordan", "institution": "UC Berkeley"}]}