{"title": "Direct Runge-Kutta Discretization Achieves Acceleration", "book": "Advances in Neural Information Processing Systems", "page_first": 3900, "page_last": 3909, "abstract": "We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\\mathcal{O}({N^{-2\\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator. Furthermore, we introduce a new local flatness condition on the objective, under which rates even faster than $\\mathcal{O}(N^{-2})$ can be achieved with low-order integrators and only gradient information. Notably, this flatness condition is satisfied by several standard loss functions used in machine learning. 
We provide numerical experiments that verify the theoretical rates predicted by our results.", "full_text": "Direct Runge-Kutta Discretization\n\nAchieves Acceleration\n\nMassachusetts Institute of Technology\n\nMassachusetts Institute of Technology\n\nMassachusetts Institute of Technology\n\nMassachusetts Institute of Technology\n\nJingzhao Zhang\n\nLIDS\n\nCambridge, MA, 02139\njzhzhang@mit.edu\n\nSuvrit Sra\nLIDS, IDSS\n\nCambridge, MA, 02139\n\nsuvrit@mit.edu\n\nAbstract\n\nAryan Mokhtari\n\nLIDS\n\nCambridge, MA, 02139\n\naryanm@mit.edu\n\nAli Jadbabaie\nLIDS, IDSS\n\nCambridge, MA, 02139\njadbabai@mit.edu\n\nWe study gradient-based optimization methods obtained by directly discretizing\na second-order ordinary differential equation (ODE) related to the continuous\nlimit of Nesterov\u2019s accelerated gradient method. When the function is smooth\nenough, we show that acceleration can be achieved by a stable discretization of\nthis ODE using standard Runge-Kutta integrators. Speci\ufb01cally, we prove that un-\nder Lipschitz-gradient, convexity and order-(s + 2) differentiability assumptions,\nthe sequence of iterates generated by discretizing the proposed second-order ODE\nconverges to the optimal solution at a rate of O(N\u22122 s\ns+1 ), where s is the order\nof the Runge-Kutta numerical integrator. Furthermore, we introduce a new local\n\ufb02atness condition on the objective, under which rates even faster than O(N\u22122)\ncan be achieved with low-order integrators and only gradient information. No-\ntably, this \ufb02atness condition is satis\ufb01ed by several standard loss functions used in\nmachine learning. We provide numerical experiments that verify the theoretical\nrates predicted by our results.\n\nIntroduction\n\n1\nWe study accelerated \ufb01rst-order optimization algorithms for the problem\n\nmin\nx\u2208Rd\n\nf (x),\n\n(1)\n\nwhere f is convex and suf\ufb01ciently smooth. 
A classical method for solving (1) is gradient descent\n(GD), which displays a sub-optimal convergence rate of O(N\u22121)\u2014i.e., the gap f (xN ) \u2212 f (x\u2217)\nbetween GD and the optimal value f (x\u2217) decreases to zero at the rate of O(N\u22121). Nesterov\u2019s\nseminal accelerated gradient method [19] matches the oracle lower bound of O(N\u22122) [18], and is\nthus a central result in the theory of convex optimization.\nHowever, ever since its introduction, acceleration has remained somewhat mysterious, especially\nbecause Nesterov\u2019s original derivation relies on elegant but unintuitive algebraic arguments. This\nlack of understanding has spurred a variety of recent attempts to uncover the rationale behind the\nphenomenon of acceleration [1, 9, 11, 13, 16, 21].\nWe pursue instead an approach to NAG (and accelerated methods in general) via a continuous-time\nperspective. This view was recently studied by Su et al. [23], who showed that the continuous limit\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00b4eal, Canada.\n\n\fof NAG is a second order ODE describing a physical system with vanishing friction; Wibisono et al.\n[26] generalized this idea and proposed a class of ODEs by minimizing Bregman Lagrangians.\nAlthough these works succeed in providing a richer understanding of Nesterov\u2019s scheme via its\ncontinuous time ODE, they fail to provide a general discretization procedure that generates provably\nconvergent accelerated methods. 
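For reference, the two baseline rates discussed above can be observed directly. The sketch below is our own illustration (the ill-conditioned quadratic, the step size h = 1/L, and the iteration count are arbitrary demo choices, not taken from the paper) comparing plain gradient descent with Nesterov's accelerated update:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 x^T A x, minimum value 0 at the origin.
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
L = 100.0            # Lipschitz constant of the gradient (largest eigenvalue of A)
h = 1.0 / L          # same step size for both methods
x0 = np.array([1.0, 1.0])

# Gradient descent: O(1/N) worst-case rate.
x = x0.copy()
for _ in range(200):
    x = x - h * grad(x)
f_gd = f(x)

# Nesterov's accelerated gradient with the standard (k-1)/(k+2) momentum: O(1/N^2).
x_prev, y = x0.copy(), x0.copy()
for k in range(1, 201):
    x_new = y - h * grad(y)                 # gradient step at the extrapolated point
    y = x_new + (k - 1) / (k + 2) * (x_new - x_prev)
    x_prev = x_new
f_nag = f(x_prev)

print(f_gd, f_nag)   # the accelerated method reaches a much smaller suboptimality
```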
In contrast, we introduce a second-order ODE that generates an\naccelerated \ufb01rst-order method for smooth functions if we simply discretize it using any Runge-Kutta\nnumerical integrator and choose a suitable step size.\n\n1.1 Summary of results\n\nAssuming that the objective function is convex and suf\ufb01ciently smooth, we establish the following:\n(cid:4) We propose a second-order ODE, and show that the sequence of iterates generated by discretizing\nusing a Runge-Kutta integrator converges to the optimal solution at the rate O(N\n\u22122s\ns+1 ), where s\nis the order of the integrator. By using a more precise numerical integrator, (i.e., a larger s), this\nrate approaches the optimal rate O(N\u22122).\n(cid:4) We introduce a new local \ufb02atness condition for the objective function (Assumption 1), under\nwhich Runge-Kutta discretization obtains convergence rates even faster than O(N\u22122), without\nrequiring high-order integrators. In particular, we show that if the objective is locally \ufb02at around\na minimum, by using only gradient information we can obtain a convergence rate of O(N\u2212p),\nwhere p quanti\ufb01es the degree of local \ufb02atness. Acceleration due to local \ufb02atness may seem\ncounterintuitive at \ufb01rst, but our analysis reveals why it helps.\n\nTo the best of our knowledge, this work presents the \ufb01rst direct1 discretization of an ODE that yields\naccelerated gradient methods. Unlike Betancourt et al. [7] who study symplecticity and consider\nvariational integrators, and Scieur et al. [22] who study consistency of integrators, we focus on the\norder of integrators (see \u00a72.1). We argue that the stability inherent to the ODE and order conditions\non the integrators suf\ufb01ce to achieve acceleration.\n\n1.2 Additional related work\nSeveral works [2, 3, 5, 8] have studied the asymptotic behavior of solutions to dissipative dynamical\nsystems. 
However, these works retain a theoretical focus: they remain in the continuous-time domain and do not discuss the key issue, namely, stability of discretization. Other works, such as [15], study the counterpart of Su et al. [23]'s work for mirror descent algorithms and achieve acceleration via Nesterov's technique. Diakonikolas and Orecchia [10] propose a framework to analyze first-order mirror descent algorithms by studying ODEs derived from duality gaps. Also, Raginsky and Bouvrie [20] obtain nonasymptotic rates for continuous-time mirror descent in a stochastic setting. A textbook treatment of numerical integration is given in [12]; some of our proofs build on material from its Chapters 3 and 9. [14] and [25] also provide nice introductions to numerical analysis.

2 Problem setup and background
Throughout the paper we assume that the objective f is convex and sufficiently smooth. Our main result rests on two key assumptions introduced below. The first is a local flatness condition on f around a minimum; the second requires f to have bounded higher-order derivatives. These assumptions suffice to achieve acceleration simply by discretizing suitable ODEs, without resorting to reverse engineering to obtain discretizations or to other more involved integration mechanisms.
We will require our assumptions to hold on a suitable subset of R^d. Let x0 be the initial point of our proposed iterative algorithm. First consider the sublevel set

S := {x ∈ R^d | f(x) ≤ exp(1)(f(x0) − f(x*) + ‖x0 − x*‖²) + 1},    (2)

where x* is a minimum of (1). Later we will show that the sequence of iterates obtained from discretizing a suitable ODE never escapes this sublevel set. 
Thus, the assumptions that we introduce need to hold only within a subset of R^d.

¹That is, discretize the ODE with known numerical integration schemes without resorting to reverse engineering NAG's updates.

Let this subset be defined as

A := {x ∈ R^d | ∃x′ ∈ S, ‖x − x′‖ ≤ 1},    (3)

that is, the set of points within unit distance of the initial sublevel set (2). The choice of unit distance is arbitrary, and one can scale it to any desired constant.
Assumption 1. There exists an integer p ≥ 2 and a positive constant L such that for any point x ∈ A, and for all indices i ∈ {1, ..., p − 1}, we have the lower bound

f(x) − f(x*) ≥ (1/L) ‖∇^(i) f(x)‖^{p/(p−i)},    (4)

where x* minimizes f and ‖∇^(i) f(x)‖ denotes the operator norm of the tensor ∇^(i) f(x).
Assumption 1 bounds higher-order derivatives by the function suboptimality, so that these derivatives vanish as the suboptimality converges to 0. It thus quantifies the flatness of the objective around a minimum.² When p = 2, Assumption 1 is slightly weaker than the usual Lipschitz continuity of gradients (see Example 1) typically assumed in the analysis of first-order methods, including NAG. If we further know that the objective's Taylor expansion around an optimum has no low-order terms, then p is the degree of the first nonzero term.
Example 1. Let f be convex with (L/2)-Lipschitz continuous gradients, i.e., ‖∇f(x) − ∇f(y)‖ ≤ (L/2)‖x − y‖. Then, for any x, y ∈ R^d we have

f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (1/L)‖∇f(x) − ∇f(y)‖².

In particular, for y = x*, an optimum point, we have ∇f(y) = 0, and thus f(x) − f(x*) ≥ (1/L)‖∇f(x)‖², which is nothing but inequality (4) for p = 2 and i = 1.
Example 2. Consider the ℓp-norm regression problem minx f(x) = ‖Ax − b‖_p^p, for even integer p ≥ 2. If ∃x*, Ax* = b, then f satisfies inequality (4) for this p, with L depending on p and the operator norm of A.

Logistic loss satisfies a slightly different version of Assumption 1 because its minimum can be at infinity. We explain this point in more detail in Section 3.1.
Next, we introduce our second assumption, which adds differentiability restrictions and bounds the growth of derivatives.
Assumption 2. There exists an integer s ≥ p and a constant M ≥ 0 such that f is order-(s + 2) differentiable. Furthermore, for any x ∈ A, the following operator-norm bounds hold:

‖∇^(i) f(x)‖ ≤ M,   for i = p, p + 1, . . . , s + 2.    (5)

When the sublevel sets of f are compact, the set A is also compact; as a result, the bound (5) on higher-order derivatives is implied by continuity. In addition, an Lp loss of the form ‖Ax − b‖_p^p also satisfies (5) with M = p!‖A‖_2^p.

2.1 Runge-Kutta integrators
Before moving on to our new results (§3), let us briefly recall the explicit Runge-Kutta (RK) integrators used in our work. For a more in-depth discussion, please see the textbook [12].
Definition 1. Given a dynamical system ẏ = F(y), let the current point be y0 and the step size be h. 
An explicit S-stage Runge-Kutta method generates the next step via the following update:

g_i = y0 + h Σ_{j=1}^{i−1} a_ij F(g_j),    Φ_h(y0) = y0 + h Σ_{i=1}^{S} b_i F(g_i),    (6)

where a_ij and b_i are suitable coefficients defined by the integrator; Φ_h(y0) is the estimate of the state after time step h, while g_i (for i = 1, . . . , S) are a few neighboring points where the gradient information F(g_i) is evaluated.

²One could view this as an error bound condition that reverses the gradient-based upper bounds on suboptimality stipulated by the Polyak-Łojasiewicz condition [6, 17].

Algorithm 1: Input(f, x0, p, L, M, s, N)    ▷ Constants p, L, M are the same as in the Assumptions
1: Set the initial state y0 = [0⃗; x0; 1] ∈ R^{2d+1}
2: Set step size h = C/N^{1/(s+1)}    ▷ C is determined by p, L, M, s, x0
3: xN ← Order-s-Runge-Kutta-Integrator(F, y0, N, h)    ▷ F is defined in equation (12)
4: return xN

By combining the gradients at several evaluation points, the integrator can achieve higher precision by matching Taylor expansion coefficients. Let φ_h(y0) be the true solution to the ODE with initial condition y0; we say that an integrator Φ_h(y0) has order s if its discretization error shrinks as

‖Φ_h(y0) − φ_h(y0)‖ = O(h^{s+1}),   as h → 0.    (7)

In general, RK methods offer a powerful class of numerical integrators, encompassing several basic schemes. The explicit Euler method, defined by Φ_h(y0) = y0 + hF(y0), is an explicit RK method of order 1, while the midpoint method Φ_h(y0) = y0 + hF(y0 + (h/2)F(y0)) is of order 2. 
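Update (6) translates directly into code. The sketch below is our own illustration of Definition 1; the generic `rk_step` helper and the Euler, midpoint, and classical RK4 tableaus are standard textbook choices, not code from the paper:

```python
import numpy as np

def rk_step(F, y0, h, a, b):
    """One explicit Runge-Kutta step for y' = F(y), following update (6):
    g_i = y0 + h * sum_{j<i} a[i][j] F(g_j);  Phi_h(y0) = y0 + h * sum_i b[i] F(g_i)."""
    Fg = []                                   # Fg[j] stores F(g_j)
    for i in range(len(b)):
        g_i = y0 + h * sum((a[i][j] * Fg[j] for j in range(i)), np.zeros_like(y0))
        Fg.append(F(g_i))
    return y0 + h * sum(bi * Fgi for bi, Fgi in zip(b, Fg))

# Butcher tableaus: explicit Euler (order 1), midpoint (order 2), classical RK4 (order 4).
EULER    = ([[0.0]], [1.0])
MIDPOINT = ([[0.0, 0.0], [0.5, 0.0]], [0.0, 1.0])
RK4      = ([[0.0] * 4, [0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
            [1 / 6, 1 / 3, 1 / 3, 1 / 6])

# Sanity check on y' = -y, y(0) = 1, integrated to T = 1 (true value exp(-1)).
F = lambda y: -y
errs = {}
for name, (a, b) in {"euler": EULER, "midpoint": MIDPOINT, "rk4": RK4}.items():
    y = np.array([1.0])
    for _ in range(10):                       # 10 steps of size h = 0.1
        y = rk_step(F, y, 0.1, a, b)
    errs[name] = abs(y[0] - np.exp(-1.0))
print(errs)   # higher order gives a smaller error at the same step size
```

On this linear test problem the measured errors shrink with the order s, consistent with the O(h^{s+1}) local error in (7).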
Some high-order RK methods are summarized in [24].

3 Main results
In this section, we introduce a second-order ODE and use explicit RK integrators to generate iterates that converge to the optimal solution at a rate faster than O(1/t) (where t denotes the time variable in the ODE). A central outcome of our result is that, at least for objective functions that are smooth enough, the key ingredient of acceleration is not the integrator type but a careful analysis of the dynamics with a more powerful Lyapunov function. More specifically, we will show that by carefully exploiting boundedness of higher-order derivatives, we can achieve both stability and acceleration at the same time.
We start with Nesterov's accelerated gradient (NAG) method, defined by the updates

x_k = y_{k−1} − h∇f(y_{k−1}),    y_k = x_k + ((k−1)/(k+2))(x_k − x_{k−1}).    (8)

Su et al. [23] showed that, as one drives the step size h to zero, the iteration (8) is equivalent in the limit to the ODE

ẍ(t) + (3/t)ẋ(t) + ∇f(x(t)) = 0,   where ẋ = dx/dt.    (9)

It can be further shown that in the continuous domain the function value f(x(t)) decreases at the rate of O(1/t²) along the trajectories of the ODE. This convergence rate can be accelerated to an arbitrary rate in continuous time via time dilation, as in [26]. In particular, the solution to

ẍ(t) + ((p+1)/t)ẋ(t) + p² t^{p−2} ∇f(x(t)) = 0    (10)

has a convergence rate of O(1/t^p). When p > 2, Wibisono et al. [26] proposed rate-matching algorithms that utilize higher-order derivatives (e.g., Hessians). 
In this work, we focus purely on first-order methods and study the stability of discretizing the ODE directly when p ≥ 2.
Though deriving the ODE from the algorithm is a solved problem, deriving the update of NAG or any other accelerated method by directly discretizing an ODE is not. As stated in [26], explicit Euler discretization of the ODE in (9) may not lead to a stable algorithm. Recently, Betancourt et al. [7] observed empirically that Verlet integration is stable and suggested that this stability relates to the symplectic property of Verlet integration. However, in our proof, we found that the order condition of Verlet integration suffices to achieve acceleration. Though symplectic integrators are known to be stable, we were not able to leverage the symplecticity for the dissipative system (11). This principal point of departure from previous works underlies Algorithm 1, which solves (1) by discretizing the following ODE with an order-s integrator:

ẍ(t) + ((2p + 1)/t) ẋ(t) + p² t^{p−2} ∇f(x(t)) = 0.    (11)

The solution to (11) exists and is unique when t > 0. This claim follows by local Lipschitzness of f and is discussed in more detail in Appendix A.2 of [26].
We further highlight that the ODE in (11) can also be written as the dynamical system

ẏ = F(y) = [−((2p + 1)/t) v − p² t^{p−2} ∇f(x);  v;  1],   where y = [v; x; t].    (12)

We have augmented the state with time to obtain an autonomous system, which can be readily solved numerically with a Runge-Kutta integrator as in Algorithm 1. 
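The vector field (12) and Algorithm 1 admit a short implementation. The following is a sketch under our own choices (a midpoint integrator for s = 2, constant C = 0.3, a two-dimensional quadratic test objective); it illustrates the structure rather than reproducing the paper's experiments:

```python
import numpy as np

def make_F(grad_f, p, d):
    """Vector field (12) for the augmented state y = [v; x; t] in R^(2d+1)."""
    def F(y):
        v, x, t = y[:d], y[d:2 * d], y[2 * d]
        dv = -(2 * p + 1) / t * v - p**2 * t**(p - 2) * grad_f(x)
        return np.concatenate([dv, v, [1.0]])
    return F

def algorithm1(grad_f, x0, p=2, s=2, N=1000, C=0.3):
    """Sketch of Algorithm 1: start at t = 1 with y0 = [0; x0; 1],
    step size h = C / N^(1/(s+1)), order-2 midpoint rule for s = 2."""
    d = len(x0)
    F = make_F(grad_f, p, d)
    y = np.concatenate([np.zeros(d), x0, [1.0]])
    h = C / N**(1.0 / (s + 1))
    for _ in range(N):
        y = y + h * F(y + 0.5 * h * F(y))   # one midpoint step
    return y[d:2 * d]                        # position component x_N

# Demo on f(x) = 0.5 ||x||^2, so grad_f(x) = x and the minimum value is 0.
x0 = np.array([1.0, 1.0])
xN = algorithm1(lambda x: x, x0)
f0, fN = 0.5 * x0 @ x0, 0.5 * xN @ xN
print(f0, fN)   # the suboptimality drops by orders of magnitude
```

Here h = C/N^{1/(s+1)} matches line 2 of Algorithm 1, and the returned slice y[d:2d] is the position block of the augmented state.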
To avoid the singularity at t = 0, Algorithm 1 discretizes the ODE starting from t = 1 with initial condition y(1) = y0 = [0; x0; 1]. The choice of 1 can be replaced by any arbitrary positive constant.
Notice that the ODE in (11) is slightly different from the one in (10): it has the coefficient (2p+1)/t for ẋ(t) instead of (p+1)/t. This modification is crucial for our analysis via Lyapunov functions (more details in Section 4 and Appendix A).
The parameter p in the ODE (11) is set to be the same as the constant in Assumption 1 to achieve the best theoretical upper bound by balancing stability and acceleration. In particular, the larger p is, the faster the system evolves. Hence, the numerical integrator requires smaller step sizes to stabilize the process, but a smaller step size increases the number of iterations needed to achieve a target accuracy. This tension is alleviated by Assumption 1: the larger p is, the flatter the function f is around its stationary points. In other words, Assumption 1 implies that as the iterates approach a minimum, the higher-order derivatives of f, in addition to the gradient, also converge to zero. Consequently, the trajectory slows down around the optimum and we can stably discretize the process with a large enough step size. This intuition ultimately translates into our main result.
Theorem 1 (Main Result). Consider the second-order ODE in (11). Suppose that the function f is convex and Assumptions 1 and 2 are satisfied. Further, let s be the order of the Runge-Kutta integrator used in Algorithm 1, N be the total number of iterations, and x0 be the initial point. Also, let E0 := f(x0) − f(x*) + ‖x0 − x*‖² + 1. 
Then, there exists a constant C1 such that if we set the step size as h = C1 N^{−1/(s+1)} (L + M + 1)^{−1} E0^{−1}, the iterate xN generated after running Algorithm 1 for N iterations satisfies the inequality

f(xN) − f(x*) ≤ C2 E0 [ (L + M + 1) E0 / N^{s/(s+1)} ]^p,    (13)

where the constants C1 and C2 depend only on s, p, and the Runge-Kutta integrator. Here S is the number of stages, as defined in Definition 1. Since each iteration consumes S gradient evaluations, f(xN) − f(x*) converges as O(S^{ps/(s+1)} N^{−ps/(s+1)}) with respect to the number of gradient evaluations. Note that for commonly used Runge-Kutta integrators, S ≤ 8.
The proof of this theorem is quite involved; we provide a sketch in Section 4, deferring the detailed technical steps to the appendix. We do not need to know the constant C1 exactly in order to set the step size h: replacing C1 by any smaller positive constant leads to the same polynomial rate.
Theorem 1 indicates that if the objective has bounded higher-order derivatives and satisfies the flatness condition in Assumption 1 with p ≥ 2, then discretizing the ODE in (11) with a high-order integrator results in an algorithm that converges to the optimal solution at a rate close to O(N^{−p}). In the following corollaries, we highlight two special instances of Theorem 1.
Corollary 2. If the function f is convex with L-Lipschitz gradients and is 4th-order differentiable, then simulating the ODE (11) for p = 2 with a numerical integrator of order s = 2 for N iterations results in the suboptimality bound

f(xN) − f(x*) ≤ C2 (f(x0) − f(x*) + ‖x0 − x*‖² + 1)³ (L + M + 1)² / N^{4/3}.

Note that higher-order differentiability allows one to use a higher-order integrator, which leads to the optimal O(N^{−2}) rate in the limit. 
The next example is based on high order polynomial or (cid:96)p norm.\nCorollary 3. Consider the objective function f (x) = (cid:107)Ax + b(cid:107)4\n4. Assume that \u2203x, s.t.Ax = \u2212b.\nSimulating the ODE (11) for p = 4 with a numerical integrator of order s = 4 for N iterations\nresults in the suboptimality bound\n\nf (xN ) \u2212 f (x\u2217) \u2264 C2(f (x0) \u2212 f (x\u2217) + (cid:107)x0 \u2212 x\u2217(cid:107)2 + 1)5(L + M + 1)4\n\nN 16/5\n\n.\n\n5\n\n\f3.1 Logistic loss\nDiscretizing logistic loss f (x) = log(1 + e\u2212wT x) does not \ufb01t exactly into the setting of Theorem\n1 due to nonexistence of x\u2217. This potentially causes two problems. First, Assumption 1 is not well\nde\ufb01ned. Second, the constant E0 in Theorem 1 is not well de\ufb01ned. We explain in this section how\nwe can modify our analysis to admit logistic loss by utilizing its structure of high order derivatives.\nThe \ufb01rst problem can be resolved by replacing f (x\u2217) by inf x\u2208Rd f (x) in Assumption 1; then, the\nlogistic loss satis\ufb01es Assumption 1 with arbitrary integer p > 0. To approach the second problem,\nwe replace x\u2217 by \u02dcx that satis\ufb01es the following relaxed inequalities. For some \u00011, \u00012, \u00013 < 1 we have\n(14)\n\n(cid:104)x \u2212 \u02dcx,\u2207f (x)(cid:105) \u2265 f (x) \u2212 f (\u02dcx) \u2212 \u00011,\nL(cid:107)\u2207(i)f (x)(cid:107) p\n\np\u2212i \u2212 \u00012,\n\nf (\u02dcx) \u2212 inf\nx\u2208Rd\n\nf (x) \u2212 f (\u02dcx) \u2265 1\n\n(15)\nAs the inequalities are relaxed, there exists a vector \u02dcx \u2208 Rd that satis\ufb01es the above conditions. If we\nfollow the original proof and balance the additional error terms by picking \u02dcx carefully, we obtain\nCorollary 4. 
(Informal) If the objective is f(x) = log(1 + e^{−wᵀx}), then discretizing the ODE (11) with an order-s numerical integrator for N iterations with step size h = O(N^{−1/(s+1)}) results in a convergence rate of O(S^{ps/(s+1)} N^{−ps/(s+1)}).

f(x̃) − inf_{x∈R^d} f(x) ≤ ε3.

4 Proof of Theorem 1
We prove Theorem 1 as follows. First (Proposition 5), we show that the suboptimality f(x(t)) − f(x*) along the continuous trajectory of the ODE (11) converges to zero sufficiently fast. Second (Proposition 6), we bound the discretization error ‖Φ_h(y_k) − φ_h(y_k)‖, which measures the distance between the point generated by discretizing the ODE and the true continuous solution. Finally (Proposition 7), a bound on this error, together with continuity of the Lyapunov function (16), implies that the suboptimality of the discretized sequence of points also converges to zero quickly.
Central to our proof is the choice of a Lyapunov function used to quantify progress. We propose in particular the Lyapunov function E : R^{2d+1} → R+ defined as

E([v; x; t]) := (t²/(4p²))‖v‖² + ‖x + (t/(2p))v − x*‖² + t^p (f(x) − f(x*)).    (16)

The Lyapunov function (16) is similar to the ones used by Su et al. [23] and Wibisono et al. [26], except for the extra term (t²/(4p²))‖v‖². This term allows us to bound ‖v‖ by O(E/t); this dependency is crucial for achieving the O(N^{−2}) bound (see Lemma 11 for more details).
We begin our analysis with Proposition 5, which shows that the function E is non-increasing with time, i.e., Ė(y) ≤ 0. This monotonicity then implies that both t^p(f(x) − f(x*)) and (t²/(4p²))‖v‖² are bounded above by some constants. 
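The monotonicity just described is easy to probe numerically. The sketch below is our own check (quadratic objective, p = 2, a midpoint integrator with a deliberately small step size): it tracks E along a discretized trajectory of (12) and observes that E decreases, consistent with Proposition 5 for the continuous flow:

```python
import numpy as np

p = 2
f      = lambda x: 0.5 * x @ x           # test objective, f* = 0 at x* = 0
grad_f = lambda x: x
xstar  = np.zeros(2)

def E(y, d=2):
    """Lyapunov function (16): (t^2/4p^2)||v||^2 + ||x + (t/2p)v - x*||^2 + t^p (f - f*)."""
    v, x, t = y[:d], y[d:2 * d], y[2 * d]
    return (t**2 / (4 * p**2)) * (v @ v) \
        + np.sum((x + t / (2 * p) * v - xstar)**2) + t**p * f(x)

def F(y, d=2):
    """Dynamical system (12) for y = [v; x; t]."""
    v, x, t = y[:d], y[d:2 * d], y[2 * d]
    return np.concatenate([-(2 * p + 1) / t * v - p**2 * t**(p - 2) * grad_f(x), v, [1.0]])

# Integrate from t = 1 to t = 5 with a small midpoint step and record E.
h = 1e-3
y = np.concatenate([np.zeros(2), np.array([1.0, 1.0]), [1.0]])
Es = [E(y)]
for _ in range(4000):
    y = y + h * F(y + 0.5 * h * F(y))
    Es.append(E(y))
Es = np.array(Es)
print(Es[0], Es[-1])   # E decreases overall along the discretized trajectory
```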
The bound on tp(f (x) \u2212 f (x\u2217)) provides a convergence rate of\nO(1/tp) on the sub-optimality f (x(t))\u2212f (x\u2217). It further leads to an upper-bound on the derivatives\nof the function f (x) in conjunction with Assumption 1.\nProposition 5 (Monotonicity of E). Consider the vector y = [v; x; t] \u2208 R2d+1 as a trajectory of\nthe dynamical system (12). Let the Lyapunov function E be de\ufb01ned by (16). Then, for any trajectory\ny = [v; x; t], the time derivative \u02d9E(y) is non-positive and bounded above; more precisely,\n\n\u02d9E(y) \u2264 \u2212 t\np\n\n(cid:107)v(cid:107)2.\n\n(17)\n\nThe proof of this proposition follows from convexity and (11); we defer the details to Appendix A.\nNext, to bound the Lyapunov function for numerical solutions, we need to bound the distance be-\ntween points in the discretized and continuous trajectories. As in Section 2.1, for the dynamical\nsystem \u02d9y = F (y), let \u03a6h(y0) denote the solution generated by a numerical integrator starting at\npoint y0 with step size h. Similarly, let \u03d5h(y0) be the corresponding true solution to the ODE.\nAn ideal numerical integrator would satisfy \u03a6h(y0) = \u03d5h(y0); however, due to discretization error\n\n6\n\n\f(cid:21)\n\n,\n\n(cid:20) [(1 + E(yk))]s+1\n\nthere is always a difference between \u03a6h(y0) and \u03d5h(y0) determined by the order of the integra-\ntor as in (7). Let {yk}N\ni=0 be the sequence of points generated by the numerical integrator, that is,\nyk+1 = \u03a6h(yk). In the following proposition, we derive an upper bound on the resulting discretiza-\ntion error (cid:107)\u03a6h(yk) \u2212 \u03d5h(yk)(cid:107).\nProposition 6 (Discretization error). Let yk = [vk; xk; tk] be the current state of the dynamical\nsystem \u02d9y = F (y) de\ufb01ned in (12). 
Suppose xk \u2208 S de\ufb01ned in (2).\nIf we use a Runge-Kutta\nintegrator of order s to discretize the ODE for a single step with a step size h such that h \u2264\nmin{0.2,\n\n(1+\u03ba)C(1+E(yk))(M +L+1)}, then\n\n1\n\n[(1 + E(yk))]s+2\n\ntk\n\ntk\n\n(18)\n\n(cid:107)\u03a6h(yk) \u2212 \u03d5h(yk)(cid:107) \u2264 C(cid:48)hs+1(M +L+1)\n\n+ h\nwhere the constants C, \u03ba, and C(cid:48) only depend on p, s, and the integrator.\nThe proof of Proposition 6 is the most challenging part in proving Theorem 1. Details may be found\nin Appendix B. The key step is to bound (cid:107) \u2202s+1\n\u2202hs+1 [\u03a6h(yk) \u2212 \u03d5h(yk)](cid:107). To do so, we \ufb01rst bound\nthe high order derivative tensor (cid:107)\u2207(i)f(cid:107) using Assumption 1 and Proposition 5 within a region of\nradius R. By carefully selecting R, we can show that for a reasonably small h, \u03a6h(yk) and \u03d5h(yk)\nis constrained in the region. Second, we need to compute the high order derivatives of \u02d9y = F (y)\nas a function of \u2207(i)f which is bounded in the region of radius R. As shown in Appendix E, the\nexpressions for higher derivatives become quite complicated as the order increases. We approach\nthis complexity by using the notation for elementary differentials (see Appendix E) adopted from\n[12]; we then induct on the order of the derivatives to bound the higher order derivatives. The \ufb02atness\nassumption (Assumption 1) provides bounds on the operator norm of high order derivatives relative\nto the objective function suboptimality, and hence proves crucial in completing the inductive step.\nBy the conclusion in Proposition 6 and continuity of the Lyapunov function E, we conclude that the\nvalue of E at a discretized point is close to its continuous counterpart. Using this observation, we\nexpect that the Lyapunov function values for the points generated by the discretized ODE do not\nincrease signi\ufb01cantly. 
We formally prove this key claim in the following proposition.\nProposition 7. Consider the dynamical system \u02d9y = F (y) de\ufb01ned in (12) and the Lyapunov func-\ntion E de\ufb01ned in (16). Let y0 be the initial state of the dynamical system and yN be the \ufb01nal point\ngenerated by a Runge-Kutta integrator of order s after N iterations. Further, suppose that Assump-\ntions 1 and 2 are satis\ufb01ed. Then, there exists a constant \u02dcC determined by p, s and the numerical\nintegrator, such that if the step size h sats\ufb01es h = \u02dcC\n\n(L+M +1)(eE(y0)+1) , then we have\n\nN\u22121/(s+1)\n\nE(yN ) \u2264 exp(1) E(y0) + 1.\n\n(19)\n\nPlease see Appendix C for a proof of this claim.\nProposition 7 shows that the value of the Lyapunov function E at the point yN is bounded above\nby a constant that depends on the initial value E(y0). Hence, if the step size h satis\ufb01es the required\ncondition in Proposition 7, we can see that\n\n(20)\nThe \ufb01rst inequality in (20) follows from the de\ufb01nition of the E (16). Replacing the step size h in\n(20) by the choice used in Proposition 7 yields\n\ntp\nN\n\nf (xN ) \u2212 f (x\u2217) \u2264 E(yN )\n\n\u2264 eE(y0)+1\n(1+N h)p .\n\nf (xN ) \u2212 f (x\u2217) \u2264 (L + M + 1)p(eE(y0) + 1)p+1\n\n\u02dcCN p s\n\ns+1\n\n,\n\n(21)\n\nand the claim of Theorem 1 follows.\nNote: The dependency of the step size h on the degree of the integrator s suggests that an integrator\nof higher order allows for larger step size and therefore faster convergence rate.\n\n5 Numerical experiments\n\nWe perform numerical experiments to verify Theorem 1 and compare ODE direct discretizating\n(DD) methods described in Algorithm 1 against gradient descent (GD) and Nesterov\u2019s accelerated\n\n7\n\n\f(a)\n\n(c)\n\n(b)\n\n(d)\n\nFigure 1: (a) Convergence paths of GD, NAG, and the proposed algorithm with integrators of degree\ns = 1, s = 2, and s = 4. The objectives is quadratic. 
(b) Minimizing a quadratic objective by discretizing different ODEs (different choices of q in (22)) with the RK44 integrator (4th order). (c/d) Minimizing the L4/logistic loss by discretizing different ODEs with a second-order integrator.

gradient (NAG) method. All figures in this section are on a log-log scale. For each optimization method, we empirically choose the largest step size among {10^{−k} | k ∈ Z} such that the algorithm remains stable in the first 1000 iterations.
In Figure 1a, we generate a synthetic linearly separable dataset and fit a linear model Ax = b. A is entry-wise Gaussian, and feasibility is achieved by increasing the data dimension. We then minimize the L2 loss f(x) = ‖Ax − b‖²₂. In particular, we discretize the ODE (11) for p = 2 with integrators of different orders, i.e., s ∈ {1, 2, 4}, and compare them against GD and NAG. Observe that GD eventually attains a linear rate and NAG achieves local acceleration close to the optimal point, as mentioned in [4]. For DD, if we simulate the ODE with an integrator of order s = 1, the algorithm is eventually unstable; using a higher-order integrator leads to stable accelerated algorithms.
Throughout this paper, we have assumed that the constant p in (11) is the same as the one in Assumption 1 to attain the best theoretical upper bounds. In Figure 1b, we empirically explore the convergence rate of discretizing the ODE

ẍ(t) + ((2q + 1)/t) ẋ(t) + q² t^{q−2} ∇f(x(t)) = 0    (22)

when q ≠ p. We minimize the same L2 loss with different values of q using a fourth-order integrator with the same step size. We observe that when q > 2, the algorithm eventually diverges. We then discretize ODEs with different parameters q for the L4 loss and the logistic loss on the same set of data points using a second-order RK integrator. As shown in Figure 1c, the objective decreases faster for larger q up to q = 6 and diverges when q = 8. 
Given that the L4 loss has p = 4, this result suggests that our analysis might be conservative. Finally, Figure 1d summarizes the experimental results for minimizing the logistic loss. We notice that the algorithm is stable even when q = 8. This result verifies Corollary 4.

6 Discussion

Our paper obtains accelerated gradient methods by directly discretizing second-order ODEs (instead of reverse engineering Nesterov-like constructions), yet it does not fully explain acceleration. First, unlike Nesterov's accelerated gradient method, which only requires first-order differentiability, our results require the objective function to be (s + 2)-times differentiable (where s is the order of the integrator). The precision of numerical integrators increases with their order only when the function is sufficiently differentiable; this property inherently limits our analysis. Second, while we achieve the O(N^(−2)) convergence rate, some of the constants in our bound are loose (e.g., for the squared loss and logistic regression they are quadratic in L versus linear in L for NAG). Achieving the optimal dependence on the initial error f(x0) − f(x*), the diameter ||x0 − x*||, and the constants L and M requires further investigation.

In addition, we identified a new condition in Assumption 1 that quantifies the local flatness of convex functions. At first, this condition may appear counterintuitive, because gradient descent actually converges fast when the objective is not flat, and progress slows down if the gradient vanishes close to the minimum. However, when we discretize the ODE, trajectories with vanishing gradients oscillate slowly, and hence allow stable discretization with large step sizes, which ultimately allows us to achieve acceleration.
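This slow-oscillation effect can be checked numerically. The sketch below, under illustrative assumptions (q = 2, the scalar objectives x^2/2 and x^4/4, a small safe step size, and hypothetical helper names), integrates an ODE of the form (22) for a sharp and a flat objective and counts sign changes of x(t); the flat objective, whose gradient vanishes near the minimum, oscillates more slowly.

```python
import numpy as np

def simulate(grad, x0=1.0, q=2, h=0.01, t0=1.0, t_end=50.0):
    # Integrate x'' + ((2q+1)/t) x' + q^2 t^(q-2) grad(x) = 0 with classic
    # RK4 and record the trajectory of the scalar position x(t).
    def F(t, y):
        x, v = y
        return np.array([v, -(2 * q + 1) / t * v - q**2 * t**(q - 2) * grad(x)])
    y, t, xs = np.array([x0, 0.0]), t0, [x0]
    while t < t_end:
        k1 = F(t, y)
        k2 = F(t + h / 2, y + (h / 2) * k1)
        k3 = F(t + h / 2, y + (h / 2) * k2)
        k4 = F(t + h, y + h * k3)
        y = y + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
        xs.append(y[0])
    return np.array(xs)

def sign_changes(xs):
    # Zero crossings of the trajectory: a proxy for how fast it oscillates.
    return int(np.sum(np.signbit(xs[:-1]) != np.signbit(xs[1:])))

xs_sharp = simulate(lambda x: x)     # f(x) = x^2 / 2: gradient linear at 0
xs_flat = simulate(lambda x: x**3)   # f(x) = x^4 / 4: gradient vanishes at 0
```

In runs of this kind, the x^4 trajectory crosses zero far fewer times than the x^2 one over the same time window, matching the intuition that flat objectives produce slowly oscillating trajectories that tolerate larger stable step sizes.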
We think this high-level idea, possibly as embodied by Assumption 1, could be used more broadly in analyzing and designing other optimization methods.

Based on the above two points, this paper carries both positive and negative messages for the recent trend of interpreting optimization methods through ODEs. On the one hand, it shows that with careful analysis, discretizing an ODE can preserve some of the properties of its trajectories. On the other hand, our proof suggests that nontrivial additional conditions might be required to ensure stable discretization. Hence, designing an ODE with nice properties in the continuous domain does not guarantee the existence of a practical optimization algorithm.

Acknowledgement

AJ and SS acknowledge support in part from DARPA FunLoL and DARPA Lagrange; AJ also acknowledges support from an ONR Basic Research Challenge Program, and SS acknowledges support from NSF-IIS-1409802.

References

[1] Z. Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.

[2] F. Alvarez. On the minimizing property of a second order dissipative system in Hilbert spaces. SIAM Journal on Control and Optimization, 38(4):1102–1119, 2000.

[3] H. Attouch and R. Cominetti. A dynamical approach to convex minimization coupling approximation with the steepest descent method. Journal of Differential Equations, 128(2):519–540, 1996.

[4] H. Attouch and J. Peypouquet. The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than 1/k^2. SIAM Journal on Optimization, 26(3):1824–1834, 2016.

[5] H. Attouch, X. Goudou, and P. Redont. The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system.
Communications in Contemporary Mathematics, 2(01):1–34, 2000.

[6] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.

[7] M. Betancourt, M. I. Jordan, and A. C. Wilson. On symplectic optimization. arXiv preprint arXiv:1802.03653, 2018.

[8] R. E. Bruck Jr. Asymptotic convergence of nonlinear contraction semigroups in Hilbert space. Journal of Functional Analysis, 18(1):15–26, 1975.

[9] S. Bubeck, Y. T. Lee, and M. Singh. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.

[10] J. Diakonikolas and L. Orecchia. The approximate duality gap technique: A unified theory of first-order methods. arXiv preprint arXiv:1712.02485, 2017.

[11] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. arXiv preprint arXiv:1705.03615, 2017.

[12] E. Hairer, C. Lubich, and G. Wanner. Geometric numerical integration: structure-preserving algorithms for ordinary differential equations, volume 31. Springer Science & Business Media, 2006.

[13] B. Hu and L. Lessard. Dissipativity theory for Nesterov's accelerated method. arXiv preprint arXiv:1706.04381, 2017.

[14] E. Isaacson and H. B. Keller. Analysis of numerical methods. Courier Corporation, 1994.

[15] W. Krichene, A. Bayen, and P. L. Bartlett. Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems, pages 2845–2853, 2015.

[16] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.

[17] S. Lojasiewicz.
Ensembles semi-analytiques. Lecture Notes, IHES (Bures-sur-Yvette), 1965.

[18] A. Nemirovskii, D. B. Yudin, and E. R. Dawson. Problem complexity and method efficiency in optimization. 1983.

[19] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.

[20] M. Raginsky and J. Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In 2012 IEEE 51st Annual Conference on Decision and Control (CDC), pages 6793–6800. IEEE, 2012.

[21] D. Scieur, A. d'Aspremont, and F. Bach. Regularized nonlinear acceleration. In Advances in Neural Information Processing Systems, pages 712–720, 2016.

[22] D. Scieur, V. Roulet, F. Bach, and A. d'Aspremont. Integration methods and accelerated optimization algorithms. arXiv preprint arXiv:1702.06751, 2017.

[23] W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.

[24] J. Verner. High-order explicit Runge-Kutta pairs with low stage order. Applied Numerical Mathematics, 22(1-3):345–357, 1996.

[25] M. West. Variational integrators. PhD thesis, California Institute of Technology, 2004.

[26] A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.