{"title": "Accelerating Rescaled Gradient Descent: Fast Optimization of Smooth Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 13555, "page_last": 13565, "abstract": "We present a family of algorithms, called descent algorithms, for optimizing convex and non-convex functions. We also introduce a new first-order algorithm, called rescaled gradient descent (RGD), and show that RGD achieves a faster convergence rate than gradient descent provided the function is strongly smooth - a natural generalization of the standard smoothness assumption on the objective function. When the objective function is convex, we present two frameworks for \u201caccelerating\u201d descent methods, one in the style of Nesterov and the other in the style of Monteiro and Svaiter. Rescaled gradient descent can be accelerated under the same strong smoothness assumption using both frameworks. We provide several examples of strongly smooth loss functions in machine learning and numerical experiments that verify our theoretical findings.", "full_text": "Accelerating Rescaled Gradient Descent:\nFast Optimization of Smooth Functions\n\nAshia C. Wilson\nMicrosoft Research\n\nashia.wilson@microsoft.com\n\nLester Mackey\n\nMicrosoft Research\n\nlmackey@microsoft.com\n\nAndre Wibisono\n\nGeorgia Tech\n\nwibisono@gatech.edu\n\nAbstract\n\nWe present a family of algorithms, called descent algorithms, for optimizing\nconvex and non-convex functions. We also introduce a new \ufb01rst-order algorithm,\ncalled rescaled gradient descent (RGD), and show that RGD achieves a faster\nconvergence rate than gradient descent over the class of strongly smooth functions \u2013\na natural generalization of the standard smoothness assumption on the objective\nfunction. 
When the objective function is convex, we present two frameworks for accelerating descent algorithms, one in the style of Nesterov and the other in the style of Monteiro and Svaiter, using a single Lyapunov function. Rescaled gradient descent can be accelerated under the same strong smoothness assumption using both frameworks. We provide several examples of strongly smooth loss functions in machine learning and numerical experiments that verify our theoretical findings. We also present several extensions of our novel Lyapunov framework including deriving optimal universal higher-order tensor methods and extending our framework to the coordinate descent setting.

1 Introduction

We consider the optimization problem

min_{x ∈ X} f(x)   (1)

where f : X → R is a continuously differentiable function on a finite-dimensional real vector space X with inner product norm ‖v‖ := √⟨v, Bv⟩ and a dual norm ‖s‖_* := √⟨s, B^{−1}s⟩ for s in the dual space X^*. Here, B : X → X^* is a positive definite self-adjoint operator. We assume the minimum of f is attainable and let x^* represent a point in arg min_{x ∈ X} f(x).

We study the performance of a family of discrete-time algorithms parameterized by δ > 0 and an integer scalar 1 < p ≤ ∞, called δ-descent algorithms of order p. These algorithms meet a progress condition that allows us to derive fast non-asymptotic convergence rate upper bounds, parameterized by p, for both nonconvex and convex instances of (1).
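The primal/dual norm pair above can be made concrete in code. A minimal sketch for a 2×2 symmetric positive definite B (the helper names `bnorm` and `dual_bnorm` are illustrative, not from the paper; B = I recovers the Euclidean norm):

```python
def bnorm(v, B):
    """Primal norm ||v|| = sqrt(<v, Bv>) for a 2x2 symmetric positive definite B."""
    Bv = [B[0][0] * v[0] + B[0][1] * v[1], B[1][0] * v[0] + B[1][1] * v[1]]
    return (v[0] * Bv[0] + v[1] * Bv[1]) ** 0.5

def dual_bnorm(s, B):
    """Dual norm ||s||_* = sqrt(<s, B^{-1} s>) for a 2x2 symmetric positive definite B."""
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    Binv = [[B[1][1] / det, -B[0][1] / det], [-B[1][0] / det, B[0][0] / det]]
    Bs = [Binv[0][0] * s[0] + Binv[0][1] * s[1], Binv[1][0] * s[0] + Binv[1][1] * s[1]]
    return (s[0] * Bs[0] + s[1] * Bs[1]) ** 0.5
```

For any s and v these satisfy the generalized Cauchy-Schwarz inequality ⟨s, v⟩ ≤ ‖s‖_* ‖v‖, which is the pairing the descent conditions below rely on.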
For example, descent algorithms of order 1 < p < ∞ satisfy the upper bound f(x_k) − f(x^*) = O(1/(δk)^{p−1}) for convex functions.

Using this framework we introduce a new method for smooth optimization called rescaled gradient descent (RGD),

x_{k+1} = x_k − η^{1/(p−1)} B^{−1}∇f(x_k) / ‖∇f(x_k)‖_*^{(p−2)/(p−1)},   η > 0, p > 1.

We show that if (1) is sufficiently smooth, rescaled gradient descent is a δ-descent algorithm of order p, and subsequently converges quickly to solutions of (1). RGD can be viewed as a natural generalization of gradient descent (p = 2) and normalized gradient descent (p = ∞), whose non-asymptotic behavior for quasi-convex functions has been well-studied ([11]).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

When f is convex, we present two frameworks for obtaining algorithms with faster convergence rate upper bounds. The first, pioneered in Nesterov [22, 23, 24, 25], shows how to wrap a δ-descent method of order 1 < p < ∞ in two sequences to obtain a method that satisfies f(x_k) − f(x^*) = O(1/(δk)^p). The second, introduced by [18], shows how to wrap a δ-descent method of order 1 < p < ∞ in the same set of sequences and add a line search step to obtain a method that satisfies f(x_k) − f(x^*) = O(1/(δk)^{(3p−2)/2}). We provide a general description of both frameworks and show how they can be applied to RGD and other descent methods of order p.

Our motivation also comes from a burgeoning literature (e.g., [27, 28, 30, 33, 13, 35, 4, 8, 32, 29, 31, 17]) that harnesses the connection between dynamical systems and optimization algorithms to develop new analyses and optimization methods. Rescaled gradient descent is obtained by discretizing an ODE called rescaled gradient flow introduced by [34].
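To make the RGD update concrete, here is a minimal sketch of one step with B = I (Euclidean geometry); `rgd_step` is an illustrative name, not code from the paper:

```python
import math

def rgd_step(x, g, eta, p):
    """One rescaled-gradient-descent step with B = I:
       x_{k+1} = x_k - eta^{1/(p-1)} * g / ||g||^{(p-2)/(p-1)}.
    For p = 2 this is plain gradient descent; as p -> infinity the step
    approaches normalized gradient descent, g / ||g||."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(x)
    step = eta ** (1.0 / (p - 1)) / norm ** ((p - 2.0) / (p - 1.0))
    return [xi - step * gi for xi, gi in zip(x, g)]
```

On f(x) = ‖x‖⁴/4, whose gradient is ‖x‖²x, the p = 4 step contracts the iterate by a constant factor per iteration, previewing the exponential rate discussed later for strongly smooth, gradient dominated objectives.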
We compare RGD and accelerated RGD to the work of Zhang et al. [36], who introduce accelerated dynamics and apply Runge-Kutta integrators to discretize them. They show that Runge-Kutta integrators converge quickly when the function is sufficiently smooth and when the order of the integrator is sufficiently large. We provide a better convergence rate upper bound for accelerated RGD under a very similar smoothness assumption. We also compare our work to Maddison et al. [17], who introduce conformal Hamiltonian dynamics and show that if the objective function is sufficiently smooth, algorithms obtained by discretizing these dynamics converge at a linear rate. We show (accelerated) RGD also achieves a fast linear rate under similar smoothness conditions.

The remainder of this paper is organized as follows. Section 2 introduces δ-descent algorithms and Section 2.1 describes several examples of descent algorithms that are popular in optimization. Section 2.2 introduces RGD and Section 3 presents two frameworks for accelerating δ-descent methods and applies both to RGD. Section 5 describes several examples of strongly smooth objective functions as well as experiments to verify our findings.
Finally, Section 6 discusses simple extensions of our framework, including deriving and analyzing optimal universal tensor methods for objective functions that have Hölder-continuous higher-order gradients and extending our entire framework and results to the coordinate setting.

2 Descent Algorithms

The focus of this section is a family of algorithms called δ-descent algorithms of order p.

Definition 1 An algorithm x_{k+1} = A(x_k) is a δ-descent algorithm of order p for 1 < p ≤ ∞ if for some constant 0 < δ < ∞ it satisfies

(f(x_{k+1}) − f(x_k))/δ ≤ −‖∇f(x_k)‖_*^{p/(p−1)} for all k ≥ 0,   (2a)

or

(f(x_{k+1}) − f(x_k))/δ ≤ −‖∇f(x_{k+1})‖_*^{p/(p−1)} for all k ≥ 0.   (2b)

For δ-descent algorithms of order p, it is possible to obtain non-asymptotic convergence guarantees for non-convex, convex and gradient dominated functions. Recall, a function is µ-gradient dominated of order p ∈ (1, ∞] if

((p−1)/p) ‖∇f(x)‖_*^{p/(p−1)} ≥ µ^{1/(p−1)} (f(x) − f(x^*)),   ∀x ∈ X.   (3)

When p = 2, (3) is the Polyak-Łojasiewicz condition introduced concurrently by Polyak [27] and Łojasiewicz [16]. For the following three theorems, we use the shorthand E_0 := f(x_0) − f(x^*) and assume f is differentiable.

Theorem 1 Any δ-descent algorithm of order p satisfies

min_{0≤s≤k} ‖∇f(x_s)‖_* ≤ (E_0/(δk))^{(p−1)/p}.   (4)

Theorem 2 If f is convex with R = sup_{x : f(x) ≤ f(x_0)} ‖x − x^*‖ < ∞, then any δ-descent algorithm of order p satisfies

f(x_k) − f(x^*) = O(1/(δk)^{p−1}) for p < ∞, and f(x_k) − f(x^*) ≤ 2E_0 exp(−δk/(Rγ)) for p = ∞,   (5)

where the constant in the O(·) depends only on R, E_0 and p, γ = 1 when (2a) is satisfied, and γ = (1 + 1/p)^{p−1} when (2b) is satisfied.

Theorem 3 If f is µ-gradient dominated of order p, then any δ-descent algorithm of order p satisfies

f(x_k) − f(x^*) ≤ E_0 exp(−(p/(p−1)) µ^{1/(p−1)} δk).   (6)

The proofs of Theorems 1 to 3 are all based on simple energy arguments and can be found in Appendix B. Bounds of the form (4) are common in the non-convex optimization literature and have previously been established for gradient descent (p = 2; see e.g. [26, Thm 1]) and higher-order tensor methods (see e.g. [6]). Theorem 1 provides a more general description of algorithms that satisfy this kind of bound.

Typically, algorithms satisfy the progress condition (2) for specific smoothness classes of functions. For example, gradient descent with step-size 0 < η ≤ 1/L is a δ-descent method of order p = 2 with δ = η/2 when ‖∇²f‖ ≤ L.
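Condition (2a) can be checked numerically along a trajectory. A one-dimensional sketch with B = I (the quadratic objective and the helper name below are illustrative, not from the paper):

```python
def descent_2a_holds(f, df, x0, step_fn, delta, p=2, iters=20):
    """Empirically check the order-p descent condition (2a) along a trajectory:
       (f(x_{k+1}) - f(x_k)) / delta <= -|f'(x_k)|^{p/(p-1)}   (1-D, B = I)."""
    x = x0
    for _ in range(iters):
        x_next = step_fn(x)
        lhs = (f(x_next) - f(x)) / delta
        rhs = -abs(df(x)) ** (p / (p - 1.0))
        if lhs > rhs + 1e-12:
            return False
        x = x_next
    return True
```

For f(x) = x² (so L = 2) and the gradient step x − η f'(x) with η = 1/4 ≤ 1/L, the condition holds with δ = η/2 = 1/8, consistent with the example above, and fails for an overly optimistic δ.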
Throughout, we denote ‖B‖ = max_{‖h‖≤1} ‖Bh‖_* for any B : X → X^*. We list several other examples.

2.1 Examples of descent algorithms

Theorems 1, 2 and 3 provide a seamless way to derive standard upper bounds for many algorithms in optimization.

Example 1 The universal higher-order tensor method,

x_{k+1} = arg min_{x ∈ X} { f_{p−1}(x; x_k) + (1/(p̃η)) ‖x − x_k‖^{p̃} },   (7)

where f_{p−1}(y; x) = Σ_{i=0}^{p−1} (1/i!) ∇^i f(x)(y − x)^i is the (p−1)-st order Taylor approximation of f centered at x and p̃ = p − 1 + ν for ν ∈ (0, 1], has been studied by several works [3, 34, 21]. When f is convex and has Hölder-smooth (p−1)-st order gradients, namely ‖∇^{p−1}f(x) − ∇^{p−1}f(y)‖ ≤ L‖x − y‖^ν, (7) with step size 0 < η ≤ √(3(p−2)!/(2L)) is a δ-descent algorithm of order p̃ with δ = η^{1/(p̃−1)}/2^{(2p̃−3)/(p̃−1)}.

Example 2 The natural proximal method,

x_{k+1} = arg min_{x ∈ X} { f(x) + (1/(pη)) ‖x − x_k‖_{x_k}^p },   (8)

where ‖v‖_x = √⟨v, ∇²h(x)v⟩, was introduced in the setting h(x) = ½‖x‖₂² by [19]. For any η, m > 0 and mB ⪯ ∇²h, the proximal method is a δ-descent algorithm of order p with δ = m^{p/(p−1)} η^{1/(p−1)}/p.

Example 3 Natural gradient descent,

x_{k+1} = x_k − η ∇²h(x_k)^{−1} ∇f(x_k) = arg min_{x ∈ X} { ⟨∇f(x_k), x⟩ + (1/(2η)) ‖x − x_k‖²_{x_k} },   (9)

where ‖v‖_x = √⟨v, ∇²h(x)v⟩, was introduced by [2]. Suppose ‖∇²f‖ ≤ L and mB ⪯ ∇²h ⪯ MB for some m, L, M > 0. Then natural gradient descent with step size 0 < η ≤ m²/(ML) is a δ-descent algorithm of order p = 2 with δ = η/(2M).

Example 4 Mirror descent,

x_{k+1} = arg min_{x ∈ X} { ⟨∇f(x_k), x⟩ + (1/η) D_h(x, x_k) },   (10)

where D_h(x, y) = h(x) − h(y) − ⟨∇h(y), x − y⟩ is the Bregman divergence, was introduced by [20]. Suppose ‖∇²f‖ ≤ L and mB ⪯ ∇²h ⪯ MB for some m, L, M > 0. Then mirror descent with step size 0 < η ≤ m²/(ML) is a δ-descent algorithm of order p = 2 with δ = η/(2M).

Example 5 The proximal Bregman method,

x_{k+1} = arg min_{x ∈ X} { f(x) + (1/η) D_h(x, x_k) },   (11)

was introduced by [7]. When mB ⪯ ∇²h ⪯ MB, the proximal Bregman method with step-size η > 0 is a δ-descent algorithm of order p = 2 with δ = mη/(2M²).

Details for these examples are contained in Appendix B.2.

2.2 Rescaled gradient descent

We end this section by discussing the function class for which rescaled gradient descent (RGD),

x_{k+1} = x_k − η^{1/(p−1)} B^{−1}∇f(x_k)/‖∇f(x_k)‖_*^{(p−2)/(p−1)} = arg min_{x ∈ X} { ⟨∇f(x_k), x⟩ + (1/(pη)) ‖x − x_k‖^p },   (12)

is a δ-descent method of order p.

Definition 2 A function f is strongly smooth of order p for some integer p > 1, if there exist constants 0 < L_1, . . . , L_p < ∞ such that for m = 1, . . .
, p − 1 and for all x ∈ R^d:

|∇^m f(x)(B^{−1}∇f(x))^m| ≤ L_m ‖∇f(x)‖_*^{m + (p−m)/(p−1)},   (13)

and moreover for m = p, f satisfies the condition |∇^p f(x)(v)^p| ≤ L_p‖v‖^p, ∀v ∈ X.

Here, ∇^m f(x)(h)^m = Σ_{i_1,...,i_m=1}^d ∂_{x_{i_1}···x_{i_m}} f(x) Π_{j=1}^m h_{i_j}, where ∂_{x_i} f is the partial derivative of f with respect to x_i. We can always take L_1 = 1. When p = 2, (13) is the usual Lipschitz condition on the gradient of f, but otherwise (13) is stronger. In particular, if f is strongly smooth of order p, then the minimizer x^* has order at least p − 1, i.e., the higher gradients vanish: ∇^m f(x^*) = 0 for m = 1, . . . , p − 1, whereas this is not implied under mere smoothness. An example of a strongly smooth function of order p is the p-th power of the ℓ2-norm f(x) = ‖x‖_2^p with B = I, or the ℓp-norm f(x) = ‖x‖_p^p. We discuss other families of strongly smooth functions in Section 5. Finally, it is worth mentioning that for most of our results, the absolute value on the left hand side of (13) is unnecessary.

We now present the main result regarding the performance of RGD on functions that satisfy (13):

Theorem 4 Suppose f is strongly smooth of order p > 1 with constants 0 < L_1, . . . , L_p < ∞. Then rescaled gradient descent with step-size

0 < η^{1/(p−1)} ≤ min{ 1, 1/(2 Σ_{m=2}^p L_m/m!) }   (14)

satisfies the descent condition (2a) with δ = η^{1/(p−1)}/2.

The proof of Theorem 4 is in Appendix B.3.
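For intuition, the constants in (13) can be estimated numerically in one dimension, where B = I and the m-th condition reduces to |f^{(m)}(x) f'(x)^m| ≤ L_m |f'(x)|^{m+(p−m)/(p−1)}. A sketch for f(x) = x⁴/4, for which L_1, L_2, L_3 come out to exactly 1, 3, 6 (helper name and sample points are illustrative):

```python
def strong_smooth_constants(derivs, p, xs):
    """Empirical L_m for condition (13) in one dimension:
       |f^(m)(x) * f'(x)^m| <= L_m * |f'(x)|^(m + (p-m)/(p-1)),  m = 1..p-1.
    derivs[m] is a callable returning the m-th derivative of f."""
    Ls = {}
    for m in range(1, p):
        worst = 0.0
        for x in xs:
            g = derivs[1](x)
            if g == 0.0:
                continue
            lhs = abs(derivs[m](x) * g ** m)
            rhs = abs(g) ** (m + (p - m) / (p - 1.0))
            worst = max(worst, lhs / rhs)
        Ls[m] = worst
    return Ls
```

Because f(x) = x⁴/4 is a pure power, the ratio is the same at every x ≠ 0, which is why the empirical constants are exact here.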
A corollary to Theorems 1-4 is the following theorem.

Theorem 5 RGD with a step size that satisfies (14) achieves convergence rate guarantee (4) when f is differentiable and strongly smooth of order p, (5) when f is convex and strongly smooth of order p, and (6) when f is µ-uniformly convex and strongly smooth of order p, where δ^{p−1} = η/2^{p−1}.

Our results show rescaled gradient descent can minimize the canonical p-strongly smooth and uniformly convex function f(x) = (1/p)‖x‖^p at an exponential rate; in contrast, gradient descent can only minimize it at a polynomial rate, even in one dimension. We provide the proof of Proposition 6 in Appendix B.4.

Proposition 6 Let f : R → R be f(x) = (1/p)|x|^p for p > 2, with minimizer x^* = 0 and f(x^*) = 0. For any step size 0 < η^{1/(p−1)} < 1 and initial position x_0 ∈ R, rescaled gradient descent of order p minimizes f at an exponential rate: f(x_k) = (1 − η^{1/(p−1)})^{pk} f(x_0). On the other hand, for any η^{1/(p−1)} > 0 and |x_0| < (2η^{1/(p−1)})^{−1/(p−2)}, gradient descent minimizes f at a polynomial rate: f(x_k) = Ω((η^{1/(p−1)} k)^{−p/(p−2)}).

We now demonstrate how all the aforementioned examples of δ-descent methods can be accelerated.

3 Accelerating Descent Algorithms

We present two frameworks for accelerating descent algorithms based on the dynamical systems perspective introduced by Wibisono et al. [34] and Wilson et al. [35] and apply them to RGD. The backbone of both frameworks is the Lyapunov function

E_k = A_k(f(x_k) − f(x^*)) + D_h(x^*, z_k),

and two sequences (15) and (16). The connection between continuous time dynamical systems and these two sequences and Lyapunov function is described in [35].
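The decrease of E_k can be simulated directly in the simplest setting, p = 2 with h(x) = ½x², so that D_h is the squared Euclidean distance. A minimal one-dimensional sketch on f(x) = ½x² with illustrative parameter choices (A_k quadratic in k, δ² = η/2):

```python
def lyapunov_trace(x0, eta, iters):
    """Track E_k = A_k * (f(y_k) - f(x*)) + (z_k - x*)^2 / 2 for a p = 2
    Nesterov-style scheme on f(x) = x^2/2 (so x* = 0 and grad f(x) = x).
    Here delta^2 = eta/2 and A_k = (delta/2)^2 * k * (k + 1)."""
    f = lambda x: 0.5 * x * x
    delta = (eta / 2.0) ** 0.5
    y = z = x0
    A = 0.0
    trace = [A * f(y) + 0.5 * z * z]
    for k in range(iters):
        A_next = (delta / 2.0) ** 2 * (k + 1) * (k + 2)
        alpha = (A_next - A) / delta
        tau = alpha / A_next
        x = delta * tau * z + (1.0 - delta * tau) * y
        g = x                       # grad f(x)
        z = z - delta * alpha * g   # z-step for D_h(z, z_k) = (z - z_k)^2 / 2
        y = x - eta * g             # gradient (descent) step
        A = A_next
        trace.append(A * f(y) + 0.5 * z * z)
    return trace
```

With η = 1/2 ≤ 1/L the computed energies decrease monotonically, which is exactly the mechanism the acceleration proofs below exploit.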
We present a high-level description of both techniques in the main text and leave details of our analysis to Appendix C.

3.1 Nesterov acceleration of descent algorithms

In the context of convex optimization, the technique of "acceleration" has its origins in Nesterov [22] and was refined in Nesterov [23]. In these works, Nesterov showed how to combine gradient descent with two sequences to obtain an algorithm with an optimal convergence rate. There have been many works since (as well as some frameworks, including [15, 1, 14, 35]) describing how to accelerate various other algorithms to obtain methods with superior convergence rates.

Wilson et al. [35], for example, show the following two discretizing schemes,

x_k = δτ_k z_k + (1 − δτ_k) y_k   (15a)
z_{k+1} = arg min_{z ∈ X} { α_k⟨∇f(x_k), z⟩ + (1/δ) D_h(z, z_k) },   (15b)

where y_{k+1} satisfies the δ^{p/(p−1)}-descent condition f(y_{k+1}) − f(x_k) ≤ −δ^{p/(p−1)}‖∇f(x_k)‖_*^{p/(p−1)}; and

x_k = δτ_k z_k + (1 − δτ_k) y_k   (16a)
z_{k+1} = arg min_z { α_k⟨∇f(y_{k+1}), z⟩ + (1/δ) D_h(z, z_k) },   (16b)

where the update for y_{k+1} satisfies the (δ^{p/(p−1)}-descent) condition f(y_{k+1}) − f(x_k) ≤ ⟨∇f(y_{k+1}), y_{k+1} − x_k⟩ ≤ −δ^{p/(p−1)}‖∇f(y_{k+1})‖_*^{p/(p−1)}, constitute an "accelerated method". Their results can be summarized in the following theorem.

Theorem 7 Assume for all x, y ∈ X, the function h satisfies the local uniform convexity condition D_h(x, y) ≥ (1/p)‖x − y‖^p. Then sequences (15) and (16) with parameter choices α_k = (δ/p)^{p−1} k^{(p−1)} (where k^{(p)} := k(k + 1) · · · (k + p − 1) is the rising factorial) and τ_k = p/(δ(p + k)) = Θ(p/(δk)) satisfy

f(y_k) − f(x^*) ≤ p^p D_h(x^*, z_0)/(δk)^p = O(1/(δk)^p).   (17)

Proof details are contained in Appendix C.1. Wilson et al. [35] call these new methods accelerated descent methods due to the fact that Theorem 2 guarantees implementing just the y_{k+1} sequence (where we set x_k = y_k) satisfies f(y_k) − f(x^*) ≤ O(1/(δ̃k)^{p−1}), where δ̃^{p−1} = δ^p. The computational cost of adding sequences (15a) and (15b) (or (16a) and (16b)) to the descent method is at most an additional gradient evaluation.

Remark 1 (Restarting for accelerated linear convergence) If, in addition, f is µ-gradient dominated of order p, then algorithms (15) and (16) combined with a scheme for restarting the algorithm have a convergence rate upper bound f(y_k) − f(x^*) = O(exp(−µ^{1/p} δk)). We can consider this algorithm an accelerated method given the original descent method satisfies f(y_k) − f(x^*) = O(exp(−µ^{1/(p−1)} δ̃k)) under the same condition, where δ̃^{p−1} = δ^p. See Appendix C.2 for details.

To summarize, it is sufficient to establish conditions under which an algorithm is a δ-descent algorithm of order p in order to (1) obtain a convergence rate and (2) accelerate the algorithm (in most cases).

Accelerated rescaled gradient descent (Nesterov-style) Using (15) we accelerate RGD.

Algorithm 1 Nesterov-style accelerated rescaled gradient descent.
Require: f satisfies (13) and h satisfies D_h(x, y) ≥ (1/p)‖x − y‖^p
1: Set x_0 = z_0, A_k = (δ/p)^p k^{(p)}, α_k = (A_{k+1} − A_k)/δ, τ_k = α_k/A_{k+1}, and δ^{p/(p−1)} = η^{1/(p−1)}/2
2: for k = 1, . . . , K do
3:   x_k = δτ_k z_k + (1 − δτ_k) y_k
4:   z_{k+1} = arg min_{z ∈ X} { α_k⟨∇f(x_k), z⟩ + (1/δ) D_h(z, z_k) }
5:   y_{k+1} = x_k − η^{1/(p−1)} B^{−1}∇f(x_k)/‖∇f(x_k)‖_*^{(p−2)/(p−1)}
6: return y_K.

We summarize the performance of Algorithm 1 in the following corollary to Theorems 4 and 7:

Theorem 8 Suppose f is convex and strongly smooth of order 1 < p < ∞ with constants 0 < L_1, . . . , L_p < ∞. Also suppose η satisfies (14). Then Algorithm 1 satisfies the convergence rate upper bound (17).

3.2 Monteiro-Svaiter acceleration of descent algorithms

Recently, Monteiro and Svaiter [18] have introduced an alternative framework for accelerating descent methods, which is similar to Nesterov's scheme but includes a line search step. This framework was further generalized by several more recent concurrent works [9, 12, 5] who demonstrate that the higher-order tensor method (7) with the addition of a line search step obtains a convergence rate upper bound f(y_k) − f(x^*) = O(1/k^{(3p−2)/2}). When p = 2, this rate matches that of the Nesterov-style acceleration framework, but for p > 2 it is better. In this section, we present a novel, generalized version of the Monteiro-Svaiter acceleration framework.
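A one-dimensional sketch of Algorithm 1 for p = 4 on f(x) = x⁴/4 is given below. We take h(x) = x⁴ (which satisfies D_h(x, y) ≥ ¼|x − y|⁴, as the required uniform convexity condition demands), so the z-step has a closed form; the step size η = 0.005 is an illustrative choice satisfying (14) for this f (L_2 = 3, L_3 = 6, L_4 = 6):

```python
def accelerated_rgd_1d(x0, eta, iters):
    """Sketch of Algorithm 1 with p = 4 on f(x) = x^4/4 in one dimension.
    h(x) = x^4 gives the closed-form z-step z_{k+1}^3 = z_k^3 - delta*alpha*g/4,
    and delta is set through delta^{p/(p-1)} = eta^{1/(p-1)}/2."""
    p = 4
    cbrt = lambda t: t ** (1.0 / 3.0) if t >= 0 else -((-t) ** (1.0 / 3.0))
    delta = (eta ** (1.0 / (p - 1)) / 2.0) ** ((p - 1.0) / p)
    y = z = x0
    A = 0.0
    for k in range(iters):
        # A_k = (delta/p)^p * k(k+1)(k+2)(k+3)  (rising factorial, p = 4)
        A_next = (delta / p) ** p * (k + 1) * (k + 2) * (k + 3) * (k + 4)
        alpha = (A_next - A) / delta
        tau = alpha / A_next
        x = delta * tau * z + (1.0 - delta * tau) * y
        g = x ** 3                                # grad of f(x) = x^4/4
        z = cbrt(z ** 3 - delta * alpha * g / 4.0)
        # RGD step: g/|g|^{(p-2)/(p-1)} = x for this f, so y = (1 - eta^{1/3}) x
        y = x - eta ** (1.0 / (p - 1)) * x
        A = A_next
    return y
```

The Lyapunov bound f(y_k) ≤ D_h(0, z_0)/A_k then controls the iterates even though they are not monotone; this is a sketch under the stated assumptions, not the experimental code from Section 5.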
In particular, we use a simple Lyapunov analysis to generalize the framework and show that many other descent methods of order p can be accelerated in it, including the proximal method (8), RGD (12) and universal tensor methods.

Theorem 9 Suppose h satisfies the condition B ⪯ ∇²h. Consider sequence (15) where, in addition, we add a line search step which ensures the inequalities

a ≤ λ_{k+1}‖y_{k+1} − x_k‖^{p−2} ≤ b,   0 < a < b,   (18a)

and

‖y_{k+1} − x_k + λ_{k+1}∇f(y_{k+1})‖ ≤ ½‖y_{k+1} − x_k‖   (18b)

hold for the pair (λ_{k+1}, y_{k+1}), where λ_{k+1} = δ²α_k²/A_{k+1}. Then the composite sequence satisfies:

f(y_k) − f(x^*) ≤ p^{(3p−2)/2} 2^{p/2} D_h(x^*, x_0)/(δk)^{(3p−2)/2} = O(1/(δk)^{(3p−2)/2}).   (19)

The proof of Theorem 9 is in Appendix C.3. All the aforementioned concurrent works have demonstrated that the higher-order gradient method (ν = 1) with the addition of a line search step satisfies (18). We show the same is true of the proximal method (8), rescaled gradient descent (12) and universal higher-order tensor methods. See Appendix C.5 for details. We conjecture that all methods that satisfy conditions (18a) and (18b) are descent methods of order p with an additional line search step.

Remark 2 (Restarting for improved accelerated linear rate) If, in addition, f is µ-gradient dominated of order p, then (18) combined with a scheme for restarting the algorithm satisfies the convergence rate upper bound f(y_k) − f(x^*) = O(exp(−µ^{2/(3p−2)} δk)).
See Appendix C.2 for details.

3.3 Accelerating rescaled gradient descent (Monteiro-Svaiter-style)

Monteiro-Svaiter accelerated rescaled gradient descent is the following algorithm.

Algorithm 2 Monteiro-Svaiter-style accelerated rescaled gradient descent.
Require: f is strongly smooth of order 1 < p < ∞ and h satisfies B ⪯ ∇²h.
1: Set x_0 = z_0 = 0, A_0 = 0, δ^{(3p−2)/2} = η, and η^{1/(p−1)} ≤ min{ 2/(5p), 1/(2 Σ_{m=2}^p L_m/m!) }
2: for k = 1, . . . , K do
3:   Choose λ_{k+1} (e.g. by line search) such that 3/4 ≤ λ_{k+1}‖y_{k+1} − x_k‖^{p−2}/η ≤ 5/4, where
     y_{k+1} = x_k − η^{1/(p−1)} B^{−1}∇f(x_k)/‖∇f(x_k)‖_*^{(p−2)/(p−1)}
     and α_k = (λ_{k+1} + √(λ_{k+1}² + 4A_kλ_{k+1}))/(2δ) (so that λ_{k+1} = δ²α_k²/A_{k+1})
4:   Update z_{k+1} = arg min_{z ∈ X} { α_k⟨∇f(y_{k+1}), z⟩ + (1/δ) D_h(z, z_k) }, A_{k+1} = δα_k + A_k, τ_k = α_k/A_{k+1}, and x_k = δτ_k z_k + (1 − δτ_k) y_k.
5: return y_K.

We summarize results on performance of Algorithm 2 in the following corollary to Theorem 9:

Theorem 10 Assume f is convex and strongly smooth of order 1 < p < ∞ with constants 0 < L_1, . . . , L_p < ∞. Then Algorithm 2 satisfies the convergence rate upper bound (19).

4 Related Work

Our acceleration framework is similar in spirit to a number of acceleration frameworks in the literature (e.g., Allen Zhu and Orecchia [1], Lessard et al. [14], Lin et al. [15], Diakonikolas and Orecchia [8]) but applies more generally to descent methods of order p > 2. In particular, the present framework builds off of the framework proposed by Wilson et al.
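The coupling between λ_{k+1} and α_k in step 3 is just the positive root of a quadratic. A small sketch verifying the algebra (helper name is illustrative):

```python
import math

def alpha_from_lambda(lam, A, delta):
    """Solve delta^2 * alpha^2 = lam * (A + delta * alpha) for alpha >= 0,
    i.e. enforce the coupling lambda_{k+1} = delta^2 * alpha_k^2 / A_{k+1}
    with A_{k+1} = A_k + delta * alpha_k."""
    return (lam + math.sqrt(lam * lam + 4.0 * A * lam)) / (2.0 * delta)
```

Substituting the returned α back into δ²α²/A_{k+1} recovers λ exactly, which is how the line search can pick λ first and derive α second.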
[35], but it (1) makes the connection to descent methods more explicit and (2) incorporates a generalization and Lyapunov analysis of the Monteiro-Svaiter acceleration framework. These manifold generalizations crucially allow us to propose RGD and accelerated RGD, which have superior theoretical and empirical performance to several existing methods on strongly smooth functions.

5 Examples and Numerical Experiments

We compare our result to several recent works that have shown that for some function classes, more intuitive first-order algorithms outperform gradient descent. In particular, both Zhang et al. [36] and Maddison et al. [17] obtain first-order algorithms by applying integration techniques to second-order ODEs. When the objective function is sufficiently smooth, both show their algorithm outperforms (accelerated) gradient descent. We show that Algorithms 1 and 2 achieve fast performance in theory and in practice on similar objectives.

Runge-Kutta Zhang et al. [36] show that if one applies an s-th order Runge-Kutta integrator to a family of second-order dynamics, then the resulting algorithm¹ achieves a convergence rate f(x_k) − f(x^*) = O(1/k^{ps/(s−1)})² provided the function meets the following two conditions: (1) f satisfies the gradient lower bound of order p ≥ 2, which means for all m = 1, . . . , p − 1,

f(x) − f(x^*) ≥ (1/C_m) ‖∇^m f(x)‖^{p/(p−m)}   ∀x ∈ R^n   (21)

for some constants 0 < C_1, . . . , C_{p−1} < ∞; and (2) for s ≥ p and M > 0, f is (s + 2)-times differentiable and ‖∇^{(i)}f(x)‖ ≤ M for i = p, p + 1, . . . , s + 2. One can show that if f is strongly smooth of order p, then f satisfies the gradient lower bound of order p. The details of this result are in Appendix D.
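In one dimension condition (21) can be checked directly. For f(x) = x⁴/4 (p = 4, f* = 0) the best constants come out to exactly C₁ = 4, C₂ = 36, C₃ = 5184; a sketch (the helper is illustrative, not from the paper):

```python
def gradient_lower_bound_constants(f, derivs, p, xs):
    """Smallest C_m with f(x) - f* >= (1/C_m) * |f^(m)(x)|^{p/(p-m)} over samples xs
    (one-dimensional version of condition (21), assuming f* = 0).
    derivs[m] is a callable returning the m-th derivative of f."""
    Cs = {}
    for m in range(1, p):
        worst = 0.0
        for x in xs:
            fx = f(x)
            if fx <= 0.0:
                continue
            worst = max(worst, abs(derivs[m](x)) ** (p / (p - m)) / fx)
        Cs[m] = worst
    return Cs
```

As with the strong-smoothness constants, the ratios are independent of x for a pure power, so finitely many samples recover the exact constants here.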
While we are unable to prove that condition (21) is equivalent to strong smoothness, we have yet to find an example of a function that satisfies (21) and is not strongly smooth.

Hamiltonian Descent Maddison et al. [17] show explicit integration techniques applied to conformal Hamiltonian dynamics converge at a fast linear rate for a function class larger than that handled by gradient descent. The method entails finding a kinetic energy map that upper bounds the dual of the function. All examples for which we can compute such a map given by [17] are uniformly convex and gradient dominated functions; therefore, simply rescaling the gradient for these examples ensures a linear rate.

5.1 Examples

We provide several examples of strongly smooth functions in machine learning (see Appendix D.2 for details).

Example 6 The ℓp loss function

f(x) = (1/p)‖Ax − b‖_p^p,   (22)

shown by Zhang et al. [36] to satisfy (21) of order p, is strongly smooth of order p.

¹ Requires at least s gradient evaluations per iteration.
² This matches the rate of Algorithm 1 in the limit s → ∞, where s is the order of the integrator.

Figure 1: Experimental results comparing RGD and accelerated RGD (ARGD) to gradient descent (GD), Nesterov accelerated GD (NAG) and Runge-Kutta (DD). Panels: (a) Example 7, logistic loss (iterations); (b) Example 6, ℓ4 loss (iterations); (c) Example 7, logistic loss (gradient evaluations); (d) Example 6, ℓ4 loss (gradient evaluations); (e) Example 10, Hamiltonian function. The plots for Runge-Kutta use an s = 2 integrator which requires two gradient evaluations per iteration. Where relevant, we plot both iterations (Figs. 1a and 1b) and gradient evaluations (Figs. 1c and 1d).

Example 7 The logistic loss

f(x) = log(1 + e^{−y w^⊤ x}),   (23)

shown by Zhang et al.
[36] to satisfy (21) of order p = ∞, is strongly smooth of order p = ∞.

Example 8 The GLM loss,

f(x) = ½(y − φ(x^⊤w))²   for φ(r) = 1/(1 + e^{−r}), y ∈ {0, 1}, and w ∈ R^d,   (24)

studied by Hazan et al. [11], is strongly smooth of order p = 3.

Example 9 The ℓ2 loss to the p-th power

f(x) = (1/p)‖Ax − b‖_2^p,   (25)

for which Hamiltonian descent [17] obtains a linear rate, is strongly smooth and gradient dominated of order p.

Example 10 The loss function

f(x) = (x^{(1)} + x^{(2)})⁴ + (1/16)(x^{(1)} − x^{(2)})⁴,   (26)

for which Hamiltonian descent [17] obtains a linear rate, is strongly smooth and gradient dominated of order p = 4.

5.2 Experiments

In this section, we perform a series of numerical experiments to compare the performance of ARGD (Algorithm 1) with gradient descent (GD), Nesterov accelerated GD (NAG), and the state-of-the-art Runge-Kutta algorithms of Zhang et al. [36] (DD) on the logistic loss f(x) = Σ_{i=1}^{10} log(1 + e^{−w_i^⊤ x y_i}), the ℓ4 loss f(x) = ¼‖Ax − b‖_4^4, and the Hamiltonian descent loss (Example 10). For the logistic and ℓ4 losses, we use the same code, plots, and experimental methodology of Zhang et al. [36] (including data and step-size choice), adding to it (A)RGD. Specifically, for Fig. 1a-Fig. 1d, the entries of W ∈ R^{10×10} and A ∈ R^{10×10} are i.i.d. standard Gaussian, and the first five entries of y (and b) are valued 0 while the rest are 1. Fig. 1e shows the performance of (A)RGD, GD, and NAG on the Hamiltonian objective studied by [17]; for Fig. 1e, the largest step-size was chosen subject to the algorithm not diverging. For each experiment, a simple implementation of (A)RGD significantly outperforms the Runge-Kutta algorithm (DD), GD and NAG.
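Gradient domination for Example 10 can be probed numerically: with f* = 0, the largest µ consistent with ((p−1)/p)‖∇f(x)‖^{p/(p−1)} ≥ µ^{1/(p−1)}(f(x) − f*) over sample points upper-bounds the domination constant on that sample. A sketch (the normalization follows inequality (3); helper name and sample points are illustrative):

```python
def grad_dominated_mu(f, grad, points, p):
    """Largest mu consistent with order-p gradient domination at the samples:
       ((p-1)/p) * ||grad f(x)||^{p/(p-1)} >= mu^{1/(p-1)} * (f(x) - f*),
    assuming f* = 0 and the Euclidean norm."""
    best = float("inf")
    for x in points:
        g = grad(x)
        gn = sum(gi * gi for gi in g) ** 0.5
        fx = f(x)
        if fx <= 0.0:
            continue
        ratio = ((p - 1.0) / p) * gn ** (p / (p - 1.0)) / fx
        best = min(best, ratio ** (p - 1.0))
    return best
```

Applying it to f(x) = (x₁ + x₂)⁴ + (1/16)(x₁ − x₂)⁴ with p = 4 returns a strictly positive µ at every sampled point, consistent with the claim that this loss is gradient dominated of order 4.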
The code for these experiments can be found here: https://github.com/aswilson07/ARGD.git.

6 Additional Results and Discussion

This paper establishes broad conditions under which an algorithm will converge and its performance can be accelerated by adding momentum. We use these conditions to introduce (accelerated) rescaled gradient descent for strongly smooth functions, and showed it outperforms several recent first-order methods that have been introduced for optimizing smooth functions in machine learning.

There are (at least) two simple extensions of our framework. First, an analogous framework can be established for (accelerated) δ-coordinate descent methods of order p. As an application, we introduce (accelerated) rescaled coordinate descent for functions that are strongly smooth along each coordinate direction of the gradient. We provide details in Appendix E.1. Second, with our generalization of the Monteiro-Svaiter framework, we derive optimal universal tensor methods for functions whose (p − 1)-st gradients are ν-Hölder-smooth which achieve the upper bound f(y_k) − f(x^*) = O(1/k^{(3p̃−2)/2}) where p̃ = p − 1 + ν. The matching lower bound for this class of functions was recently established by [10]. We present this result in Appendix E.3.

There are several possible directions for future work. We know that certain simple operations preserve convexity (e.g., addition), but what operations preserve strong smoothness? Understanding this could allow us to construct more complex examples of strongly smooth functions. Our results reveal an interesting hierarchy of smoothness assumptions which lead to methods that converge quickly; exploring this more is of significant interest. Finally, extending our analysis to the stochastic or manifold setting, studying the use of variance reduction techniques, and introducing other δ-descent algorithms of order p are all interesting directions for future work.

Acknowledgments

We would like to thank Jingzhao Zhang for providing us access to his code.

References

[1] Zeyuan Allen Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. In 8th Innovations in Theoretical Computer Science Conference, ITCS 2017, January 9-11, 2017, Berkeley, CA, USA, pages 3:1-3:22, 2017.

[2] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, pages 251-276, 1998.

[3] Michel Baes. Estimate sequence methods: Extensions and approximations, August 2009.

[4] Michael Betancourt, Michael Jordan, and Ashia Wilson. On symplectic optimization. arXiv preprint arXiv:1802.03653, 2018.

[5] Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Near-optimal method for highly smooth convex optimization. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 492-507, Phoenix, USA, 25-28 Jun 2019. PMLR.

[6] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points II: First-order methods. arXiv preprint arXiv:1711.0084, 2017.

[7] Gong Chen and Marc Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal of Optimization, 3(3):538-543, 1993.

[8] Jelena Diakonikolas and Lorenzo Orecchia. Accelerated extra-gradient descent: A novel accelerated first-order method.
In 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, pages 23:1–23:19, 2018.

[9] Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, and César A. Uribe. Optimal tensor methods in smooth convex and uniformly convex optimization. In Proceedings of the Thirty-Second Conference on Learning Theory, pages 1374–1391, Phoenix, USA, 25–28 Jun 2019. PMLR.

[10] G. N. Grapiglia and Yu. Nesterov. Tensor methods for minimizing functions with Hölder continuous higher-order derivatives. arXiv preprint arXiv:1904.12559, April 2019.

[11] Elad Hazan, Kfir Y. Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1594–1602, 2015.

[12] B. Jiang, H. Wang, and S. Zhang. An optimal high-order tensor method for convex optimization. arXiv preprint arXiv:1812.06557, 2018.

[13] Walid Krichene, Alexandre Bayen, and Peter L. Bartlett. Accelerated mirror descent in continuous and discrete time. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2845–2853. Curran Associates, Inc., 2015.

[14] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.

[15] Hongzhou Lin, Julien Mairal, and Zaïd Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research, 18:212:1–212:54, 2017.

[16] S. Łojasiewicz. A topological property of real analytic subsets (in French). In Coll.
du CNRS, Les équations aux dérivées partielles, pages 87–89, 1963.

[17] Chris J. Maddison, Daniel Paulin, Yee Whye Teh, Brendan O'Donoghue, and Arnaud Doucet. Hamiltonian descent methods. arXiv preprint arXiv:1809.05042, 2018.

[18] Renato D. C. Monteiro and Benar Fux Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013.

[19] Jean Jacques Moreau. Proximité et dualité dans un espace Hilbertien. Bulletin de la Société Mathématique de France, 93:273–299, 1965.

[20] Arkadi Nemirovskii and David Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983.

[21] Y. Nesterov. Implementable tensor methods in unconstrained convex optimization. CORE discussion papers, 2018. URL https://ideas.repec.org/p/cor/louvco/2018005.html.

[22] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

[23] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Kluwer, Boston, 2004.

[24] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[25] Yurii Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008. ISSN 0025-5610.

[26] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton's method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[27] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[28] J. Schropp and I. Singer.
A dynamical systems approach to constrained minimization. Numerical Functional Analysis and Optimization, 21(3-4):537–551, 2000.

[29] Bin Shi, Simon Du, Michael Jordan, and Weijie Su. Understanding the acceleration phenomenon via high-resolution differential equations. arXiv preprint arXiv:1810.08907, November 2018.

[30] Weijie Su, Stephen Boyd, and Emmanuel J. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems (NIPS) 27, 2014.

[31] Ganesh Sundaramoorthi and Anthony J. Yezzi. Variational PDEs for acceleration on manifolds and application to diffeomorphisms. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 3797–3807, 2018.

[32] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018, pages 2093–3027, 2018.

[33] Andre Wibisono and Ashia Wilson. On accelerated methods in optimization. arXiv preprint arXiv:1509.03616, 2015.

[34] Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

[35] Ashia Wilson, Benjamin Recht, and Michael Jordan. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, November 2016.

[36] Jingzhao Zhang, Aryan Mokhtari, Suvrit Sra, and Ali Jadbabaie. Direct Runge-Kutta discretization achieves acceleration. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3904–3913.
Curran Associates, Inc., 2018.