{"title": "Understanding the Role of Momentum in Stochastic Gradient Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 9633, "page_last": 9643, "abstract": "The use of momentum in stochastic gradient methods has become a widespread practice in machine learning. Different variants of momentum, including heavy-ball momentum, Nesterov's accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM), have demonstrated success on various tasks. Despite these empirical successes, there is a lack of clear understanding of how the momentum parameters affect convergence and various performance measures of different algorithms. In this paper, we use the general formulation of QHM to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions. In addition, by combining the results on convergence rates and stationary distributions, we obtain sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters.", "full_text": "Understanding the Role of Momentum in\n\nStochastic Gradient Methods\n\nIgor Gitman\n\nHunter Lang\n\nPengchuan Zhang\n\nLin Xiao\n\nMicrosoft Research AI\n\nRedmond, WA 98052, USA\n\n{igor.gitman, hunter.lang, penzhan, lin.xiao}@microsoft.com\n\nAbstract\n\nThe use of momentum in stochastic gradient methods has become a widespread\npractice in machine learning. Di\ufb00erent variants of momentum, including heavy-\nball momentum, Nesterov\u2019s accelerated gradient (NAG), and quasi-hyperbolic\nmomentum (QHM), have demonstrated success on various tasks. Despite these\nempirical successes, there is a lack of clear understanding of how the momentum\nparameters a\ufb00ect convergence and various performance measures of di\ufb00erent\nalgorithms. 
In this paper, we use the general formulation of QHM to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions. In addition, by combining the results on convergence rates and stationary distributions, we obtain sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters.

1 Introduction

Stochastic gradient methods have become extremely popular in machine learning for solving stochastic optimization problems of the form

    minimize_{x ∈ R^n}  F(x) ≜ E_ζ[f(x, ζ)],    (1)

where ζ is a random variable representing data sampled from some (unknown) probability distribution, x ∈ R^n represents the parameters of a machine learning model (e.g., the weight matrices in a neural network), and f is a loss function associated with the model parameters and any sample ζ. Many variants of the stochastic gradient methods can be written in the form of

    x^{k+1} = x^k − α_k d^k,    (2)

where d^k is a (stochastic) search direction and α_k > 0 is the step size or learning rate. The classical stochastic gradient descent (SGD) [31] method uses d^k = ∇_x f(x^k, ζ^k), where ζ^k is a random sample collected at step k. For the ease of notation, we use g^k to denote ∇_x f(x^k, ζ^k) throughout this paper.

There is a vast literature on modifications of SGD that aim to improve its theoretical and empirical performance. The most common such modification is the addition of a momentum term, which sets the search direction d^k as the combination of the current stochastic gradient g^k and past search directions. For example, the stochastic variant of Polyak's heavy ball method [26] uses

    d^k = g^k + β_k d^{k−1},    (3)

where β_k ∈ [0, 1). We call the combination of (2) and (3) the Stochastic Heavy Ball (SHB) method. Gupal and Bazhenov [9] studied a "normalized" version of SHB, where

    d^k = (1 − β_k) g^k + β_k d^{k−1}.    (4)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In the context of modern deep learning, Sutskever et al. [34] proposed to use a stochastic variant of Nesterov's accelerated gradient (NAG) method, where

    d^k = ∇_x f(x^k − α_k β_k d^{k−1}, ζ^k) + β_k d^{k−1}.    (5)

The number of variations on momentum has kept growing in recent years; see, e.g., Synthesized Nesterov Variants (SNV) [17], Triple Momentum [36], Robust Momentum [3], PID Control-based methods [1], Accelerated SGD (AccSGD) [12], and Quasi-Hyperbolic Momentum (QHM) [18]. Despite various empirical successes reported for these different methods, there is a lack of clear understanding of how the different forms of momentum and their associated parameters affect convergence properties of the algorithms and other performance measures, such as final loss value. For example, Sutskever et al. [34] show that momentum is critical to obtaining good performance in deep learning. But using different parametrizations, Ma and Yarats [18] claim that momentum may have little practical effect. 
In order to clear up this confusion, several recent works [see, e.g., 40, 1, 18] have aimed to develop and analyze general frameworks that capture many different momentum methods as special cases.

In this paper, we focus on a class of algorithms captured by the general form of QHM [18]:

    d^k = (1 − β_k) g^k + β_k d^{k−1},
    x^{k+1} = x^k − α_k [(1 − ν_k) g^k + ν_k d^k],    (6)

where the parameter ν_k ∈ [0, 1] interpolates between SGD (ν_k = 0) and (normalized) SHB (ν_k = 1). When the parameters α_k, β_k and ν_k are held constant (thus the subscript k can be omitted) and ν = β, it recovers a normalized variant of NAG with an additional coefficient 1 − β_k on the stochastic gradient term in (5) (see Appendix A). In addition, Ma and Yarats [18] show that different settings of α_k, β_k and ν_k recover other variants such as AccSGD [12], Robust Momentum [3], and Triple Momentum [36]. They also show that it is equivalent to SNV [17] and special cases of PID Control (either PI or PD) [1]. However, there is little theoretical analysis of QHM in general. In this paper, we take advantage of its general formulation to derive a unified set of analytic results that help us better understand the role of momentum in stochastic gradient methods.

1.1 Contributions and outline

Our theoretical results on the QHM model (6) cover three different aspects: asymptotic convergence with probability one, stability region and local convergence rates, and characterizations of the stationary distribution of {x^k} under constant parameters α, β, and ν. Specifically:

• In Section 3, we show that for minimizing smooth nonconvex functions, QHM converges almost surely as β_k → 0 for arbitrary values of ν_k. 
And more surprisingly, we show that QHM converges as ν_k β_k → 1 (which requires both ν_k → 1 and β_k → 1) as long as ν_k β_k → 1 slowly enough, compared with the speed of α_k → 0.

• In Section 4, we consider local convergence behaviors of QHM for fixed parameters α, β, and ν. In particular, we derive joint conditions on (α, β, ν) that ensure local stability (or convergence when there is no stochastic noise in the gradient approximations) of the algorithm near a strict local minimum. We also characterize the local convergence rate within the stability region.

• In Section 5, we investigate the stationary distribution of {x^k} generated by the QHM dynamics around a local minimum (using a simple quadratic model with noise). We derive the dependence of the stationary variance on (α, β, ν) up to the second-order Taylor expansion in α. These results reveal interesting effects of β and ν that cannot be seen from first-order expansions.

Our asymptotic convergence results in Section 3 give strong guarantees for the convergence of QHM with diminishing learning rates under different regimes (β_k → 0 and β_k → 1). However, as with most asymptotic results, they provide limited guidance on how to set the parameters in practice for fast convergence. Our results in Sections 4 and 5 complement the asymptotic results by providing principled guidelines for tuning these parameters. For example, one of the most effective schemes used in deep learning practice is called "constant and drop", where constant parameters (α, β, ν) are used to train the model for a long period until it reaches a stationary state and then the learning rate α is dropped by a constant factor for refined training. 
Each stage of the constant-and-drop scheme runs variants of QHM with constant parameters, and their choices dictate the overall performance of the algorithm. In Section 6, by combining our results in Sections 4 and 5, we obtain new and, in some cases, counter-intuitive insight into how to set these parameters in practice.

2 Related work

Asymptotic convergence. There exist many classical results concerning the asymptotic convergence of the stochastic gradient methods [see, e.g., 37, 28, 14, and references therein]. For the classical SGD method without momentum, i.e., (2) with d^k = g^k, a well-known general condition for asymptotic convergence is Σ_{k=0}^∞ α_k = ∞ and Σ_{k=0}^∞ α_k² < ∞. In general, we will always need α_k → 0 to counteract the effect of noise. But interestingly, the conditions on β_k are much less restricted. For normalized SHB, Polyak [27] and Kaniovski [11] studied its asymptotic convergence properties in the regime of α_k → 0 and β_k → 0, while Gupal and Bazhenov [9] investigated asymptotic convergence in the regime of α_k → 0 and β_k → 1, both for convex optimization problems. More recently, Gadat et al. [7] extended asymptotic convergence analysis for the normalized SHB update to smooth nonconvex functions for β_k → 1. In this work we generalize the classical SGD and SHB results to the case of QHM for smooth nonconvex functions.

Local convergence rate. The stability region and local convergence rate of the deterministic gradient descent and heavy ball algorithms were established by Boris Polyak for the case of convex functions near a strict twice-differentiable local minimum [29, 26]. For this class of functions the heavy ball method is optimal in terms of the local convergence rate [21]. 
However, it might fail to converge globally for general strongly convex twice-differentiable functions [17] and is no longer optimal for the class of smooth convex functions. For the latter case, Nesterov's accelerated gradient was shown to attain the optimal global convergence rate [22, 23]. In this paper we extend the results of Polyak [26] on local convergence to the more general QHM algorithm.

Stationary analysis. The limit behavior analysis of SGD algorithms with momentum and constant step size has been used in various applications. [25, 39, 15] establish sufficient conditions for detecting whether the iterates have reached stationarity and use them in combination with statistical tests to automatically change the learning rate during training. [6, 4] prove many properties of the limiting behavior of SGD with constant step size by using tools from Markov chain theory. Our results are most closely related to the work of Mandt et al. [19], who use stationary analysis of SGD with momentum to perform approximate Bayesian inference. In fact, our Theorem 4 extends their results to the case of QHM and our Theorem 5 establishes more precise relations (to the second order in α), revealing interesting dependence on the parameters β and ν which cannot be seen from the first-order equations.

3 Asymptotic convergence

In this section, we generalize the classical asymptotic results to provide conditions under which QHM converges almost surely to a stationary point for smooth nonconvex functions. Throughout this section, "a.s." refers to "almost surely". We need to make the following assumptions.

Assumption A. The following conditions hold for F defined in (1) and the stochastic gradient oracle:

1. F is differentiable and ∇F is Lipschitz continuous, i.e., there is a constant L such that

    ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖,    x, y ∈ R^n.

2. 
F is bounded below and ‖∇F(x)‖ is bounded above, i.e., there exist F* and G such that

    F(x) ≥ F*,    ‖∇F(x)‖ ≤ G,    x ∈ R^n.

3. For k = 0, 1, 2, . . ., the stochastic gradient g^k = ∇F(x^k) + ξ^k, where the random noise ξ^k satisfies

    E_k[ξ^k] = 0,    E_k[‖ξ^k‖²] ≤ C a.s.,

where E_k[·] denotes expectation conditioned on {x^0, g^0, . . . , x^{k−1}, g^{k−1}, x^k}, and C is a constant.

Note that Assumption A.3 allows the distribution of ξ^k to depend on x^k, and we simply require the second moment to be conditionally bounded uniformly in k. The assumption ‖∇F(x)‖ ≤ G can be removed if we assume a bounded domain for x. However, this will complicate the proof by requiring special treatment (e.g., using the machinery of gradient mapping [24]) when {x^k} converges to the boundary of the domain. Here we assume this condition to simplify the analysis.

By convergence to a stationary point, we mean that the sequence {x^k} satisfies the condition

    lim inf_{k→∞} ‖∇F(x^k)‖ = 0    a.s.    (7)

Intuitively, as β_k → 0, regardless of ν_k, the QHM dynamics become more like SGD, so there should be no issue with convergence. The following theorem, which generalizes the analysis technique of Ruszczyński and Syski [33] to QHM, shows formally that this is indeed the case:

Theorem 1. Let F satisfy Assumption A. Additionally, assume 0 ≤ ν_k ≤ 1 and the sequences {α_k} and {β_k} satisfy the following conditions:

    Σ_{k=0}^∞ α_k = ∞,    Σ_{k=0}^∞ α_k² < ∞,    lim_{k→∞} β_k = 0,    β̄ ≜ sup_k β_k < 1.

Then the sequence {x^k} generated by the QHM algorithm (6) satisfies (7). Moreover, we have

    lim sup_{k→∞} F(x^k) = lim sup_{k→∞, ‖∇F(x^k)‖→0} F(x^k)    a.s.    (8)

More surprisingly, however, one can actually send ν_k β_k → 1 as long as ν_k β_k → 1 slowly enough, although we require a stronger condition on the noise ξ. We extend the technique of Gupal and Bazhenov [9] to show asymptotic convergence of QHM for minimizing smooth nonconvex functions.

Theorem 2. Let F satisfy Assumption A, and additionally assume that ‖ξ^k‖² < C almost surely, i.e., the noise ξ is a.s. bounded. Let the sequences {α_k}, {β_k}, and {ν_k} satisfy the following conditions:

    Σ_{k=0}^∞ α_k = ∞,    Σ_{k=0}^∞ α_k² / (1 − ν_k β_k) < ∞,    Σ_{k=0}^∞ (1 − ν_k β_k)² < ∞,    lim_{k→∞} β_k = 1.

Then the sequence {x^k} generated by Algorithm (6) satisfies (7).

The conditions in Theorem 2 can be satisfied by, for example, taking α_k = k^{−ω} and (1 − ν_k β_k) = k^{−c} for (1 + c)/2 < ω ≤ 1 and 1/2 < c < 1. We should note that, even though setting ν_k β_k → 1 is somewhat unusual in practice, we think the result of Theorem 2 is interesting from both theoretical and practical points of view. From the theoretical side, this result shows that it is possible to always be increasing the amount of momentum (in the limit when ν_k β_k = 1, we are not using the fresh gradient information at all) and still obtain convergence for smooth functions. 
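The admissible schedule suggested after Theorem 2 is easy to instantiate numerically. Below is a minimal sketch (not from the paper; the even split β_k = ν_k = √(1 − k^{−c}) and all constants are our own illustrative choices) running the QHM recursion (6) on a noiseless one-dimensional quadratic:

```python
import math

# Sketch: QHM (6) on F(x) = 0.5 * lam * x^2 with no gradient noise, using
# alpha_k = k^(-omega) and 1 - nu_k * beta_k = k^(-c). The values
# omega = 0.9, c = 0.7 satisfy (1 + c)/2 < omega <= 1 and 1/2 < c < 1.
lam, omega, c = 1.0, 0.9, 0.7
x, d = 5.0, 0.0
for k in range(1, 200_001):
    alpha = k ** -omega
    nb = 1.0 - k ** -c               # nu_k * beta_k -> 1
    beta = nu = math.sqrt(nb)        # split the product evenly
    g = lam * x                      # exact gradient (no noise)
    d = (1 - beta) * g + beta * d    # direction update in (6)
    x = x - alpha * ((1 - nu) * g + nu * d)
print(abs(lam * x))                  # gradient magnitude; decays toward 0
```

Even though ν_k β_k → 1 (the fresh gradient is weighted less and less), the iterate still approaches the minimizer because ν_k β_k → 1 slowly relative to α_k → 0, in line with Theorem 2.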
From the practical point of view, our Theorem 5 in Section 5 shows that for a fixed α, increasing ν_k β_k might lead to a smaller stationary distribution size, which may give better empirical results.

Also, note that when ν_k = β_k, Theorems 1 and 2 give asymptotic convergence guarantees for the common practical variant of NAG, which have not appeared in the literature before. However, we should mention that the bounded noise assumption of Theorem 2 (i.e. ‖ξ^k‖² < C a.s.) is quite restrictive. In fact, Ruszczyński and Syski [32] prove a similar result for SGM with a more general noise condition, and their technique may extend to QHM, but bounded noise greatly simplifies the derivations. We provide the proofs of Theorems 1 and 2 in Appendix B.

The results in this section indicate that both β_k → 0 and ν_k β_k → 1 are admissible from the perspective of asymptotic convergence. However, they give limited guidance on how to choose momentum parameters in practice, where non-asymptotic behaviors are of main concern. In the next two sections, we study local convergence and stationary behaviors of QHM with constant learning rate and momentum parameters; our analysis provides new insights that could be very useful in practice.

4 Stability region and local convergence rate

Let the sequence {x^k} be generated by the QHM algorithm (6) with constant parameters α_k = α, β_k = β and ν_k = ν. In this case, x^k does not converge to any local minimum in the asymptotic sense, but its distribution may converge to a stationary distribution around a local minimum. Since the objective function F is smooth, we can approximate F around a strict local minimum x* by a convex quadratic function. 
Since ∇F(x*) = 0, we have

    F(x) ≈ F(x*) + (1/2)(x − x*)ᵀ ∇²F(x*) (x − x*),

where the Hessian ∇²F(x*) is positive definite. Therefore, for the ease of analysis, we focus on convex quadratic functions of the form F(x) = (1/2)(x − x*)ᵀ A (x − x*), where A is positive definite (and we can set x* = 0 without loss of generality). In addition, we assume

    g^k = ∇F(x^k) + ξ^k = A(x^k − x*) + ξ^k,    (9)

where the noise ξ^k satisfies Assumption A.3 and, in addition, ξ^k is independent of x^k for all k ≥ 0. Mandt et al. [19] observe that this independence assumption often holds approximately when the dynamics of SHB are approaching stationarity around a local minimum.

Under the above assumptions, the behaviors of QHM can be described by a linear dynamical system driven by i.i.d. noise. More specifically, let z^k = [d^{k−1}; x^k − x*] ∈ R^{2n} be an augmented state vector; then the dynamics of (6) can be written as (see Appendix E for details)

    z^{k+1} = T z^k + S ξ^k,    (10)

where T and S are functions of (α, β, ν) and A:

    T = [[βI, (1 − β)A], [−ανβ I, I − α(1 − νβ)A]],    S = [(1 − β)I; −α(1 − νβ)I].    (11)

It is well-known that the linear system (10) is stable if and only if the spectral radius of T, denoted by ρ(T), is less than 1. When ρ(T) < 1, the dynamics of (10) is the superposition of two components:

• A deterministic part described by the dynamics z^{k+1} = T z^k with initial condition z^0 = [0; x^0] (we always take d^{−1} = 0). 
This part asymptotically decays to zero.

• An auto-regressive stochastic process (10) driven by {ξ^k} with zero initial condition z^0 = [0; 0].

Roughly speaking, ρ(T) determines how fast the dynamics converge from an arbitrary initial point x^0 to the stationary distribution, while properties of the stationary distribution (such as its variance and auto-correlations) depend on the full spectrum of the matrix T as well as S. Both aspects have important implications for the practical performance of QHM on stochastic optimization problems. Often there are trade-offs that we have to make in choosing the parameters α, β and ν to balance the transient convergence behavior and stationary distribution properties.

In the rest of this section, we focus on the deterministic dynamics z^{k+1} = T z^k to derive the conditions on (α, β, ν) that ensure ρ(T) < 1 and characterize the convergence rate. Let λ_i(A) for i = 1, . . . , n denote the eigenvalues of A (they are all real and positive). In addition, we define

    µ = min_{i=1,...,n} λ_i(A),    L = max_{i=1,...,n} λ_i(A),    κ = L/µ,

where κ is the condition number. The local convergence rate for strictly convex quadratic functions is well studied for the case of gradient descent (ν = 0) and heavy ball (ν = 1) [26]. In fact, heavy ball achieves the best possible convergence rate of (√κ − 1)/(√κ + 1) [23]. Thus, it is immediately clear that the optimal convergence rate of QHM will be the same and will be achieved with ν = 1. However, there are no results in the literature characterizing how the optimal rate or optimal parameters change as a function of ν. Our next result establishes the convergence region and dependence of the convergence rate on the parameters α, β, and ν. 
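The augmented system (10)–(11) can be sanity-checked directly against the QHM recursion (6); a minimal sketch (our own illustrative parameter values) runs both on a quadratic with x* = 0 and confirms that they produce identical iterates and that ρ(T) < 1 inside the region given later in (13):

```python
import numpy as np

# Sketch: verify z^{k+1} = T z^k + S xi^k from (10)-(11) against the direct
# QHM recursion (6) on F(x) = 0.5 x^T A x (so x* = 0).
rng = np.random.default_rng(0)
n, alpha, beta, nu = 3, 0.05, 0.9, 0.7
A = np.diag([1.0, 2.0, 4.0])                 # mu = 1, L = 4, kappa = 4

T = np.block([
    [beta * np.eye(n),               (1 - beta) * A],
    [-alpha * nu * beta * np.eye(n), np.eye(n) - alpha * (1 - nu * beta) * A],
])
S = np.vstack([(1 - beta) * np.eye(n), -alpha * (1 - nu * beta) * np.eye(n)])

x, d = rng.standard_normal(n), np.zeros(n)
z = np.concatenate([d, x])                   # z^k = [d^{k-1}; x^k]
for _ in range(50):
    xi = rng.standard_normal(n)
    g = A @ x + xi                           # stochastic gradient (9)
    d = (1 - beta) * g + beta * d            # QHM update (6)
    x = x - alpha * ((1 - nu) * g + nu * d)
    z = T @ z + S @ xi                       # equivalent linear dynamics (10)

assert np.allclose(z, np.concatenate([d, x]))
rho = max(abs(np.linalg.eigvals(T)))
print(rho < 1)                               # True for these parameters
```

The same construction makes it easy to explore ρ(T) numerically over a grid of (α, β, ν), which is how the plots in Figure 1 can be reproduced in spirit.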
We present the result for quadratic functions, but it can be generalized to any L-smooth and µ-strongly convex function, assuming the initial point x^0 is close enough to the optimal point x* (see Theorem 6 in Appendix C).

Theorem 3. Let us denote θ = {α, β, ν}.¹ For any function F(x) = xᵀAx + bᵀx + c that satisfies 0 < µ ≤ λ_i(A) ≤ L for all i = 1, . . . , n and any x^0, there exist {ε_k}, with ε_k ≥ 0, such that the deterministic QHM algorithm z^{k+1} = T z^k satisfies

    ‖x^k − x*‖ ≤ (R(θ, µ, L) + ε_k)^k ‖x^0 − x*‖,

where x* = arg min_x F(x), lim_{k→∞} ε_k = 0 and R(θ, µ, L) = ρ(T), which can be characterized as

    R(θ, µ, L) = max{r(θ, µ), r(θ, L)}, where

    r(θ, λ) = 0.5(√(C₁(λ)² − 4C₂(λ)) + C₁(λ))   if C₁(λ) ≥ 0, C₁(λ)² − 4C₂(λ) ≥ 0,
    r(θ, λ) = 0.5(√(C₁(λ)² − 4C₂(λ)) − C₁(λ))   if C₁(λ) < 0, C₁(λ)² − 4C₂(λ) ≥ 0,
    r(θ, λ) = √(C₂(λ))                           if C₁(λ)² − 4C₂(λ) < 0,    (12)

with

    C₁(λ, θ) = 1 − αλ + αλνβ + β,    C₂(λ, θ) = β(1 − αλ + αλν).

To ensure R(θ, µ, L) < 1, the parameters α, β, ν must satisfy the following constraints:

    0 < α < 2(1 + β) / (L(1 + β(1 − 2ν))),    0 ≤ β < 1,    0 ≤ ν ≤ 1.    (13)

In addition, the optimal rate depends only on κ: min_θ R(θ, µ, L) is a function of only κ.

Figure 1: Plots (a), (b) show the dependence of the optimal α, β and 
convergence rate as a function of ν. We can see that both the rate and the optimal β are decreasing functions of ν. Plots (c), (d) show the dependence of the optimal α, ν and the rate on β. We can see that there are three phases in which the dependence is quite different. Also note that in all presented cases, changing ν required changing α in the same way (they increase and decrease together).

The conditions in (13) characterize the stability region of QHM. Note that when ν = 0 we have the classical result for gradient descent: α < 2/L; when ν = 1, the condition matches that of the normalized heavy ball: α < 2(1 + β)/(L(1 − β)).

The equations (12) define the convergence rate for any fixed values of the parameters α, β, ν. While they do not give a simple analytic form, they allow us to conduct easy numerical investigations. To gain more intuition into the effect that the momentum parameters ν and β have on the convergence rate, we study how the optimal ν changes as a function of β and vice versa. To find the optimal parameters and rate, we solve the corresponding optimization problem numerically (using the procedure described in Appendix D). For each pair {β, ν} we set α to the optimal value in order to remove its effect. These plots are presented in Figure 1.

A natural way to think about the interplay between the parameters α, β and ν is in terms of the total "amount of momentum". Intuitively, it should be controlled by the product ν × β. This intuition helps explain Figure 1 (a), (b), which show the dependence of the optimal β as a function of ν for different values of κ. We can see that for bigger values of ν we need to use smaller values of β, since increasing each one of them increases the "amount of momentum" in QHM. 
However, the same intuition fails when considering ν as a function of β (when β is big enough), as shown in Figure 1 (c), (d). In this case there are three regimes of different behavior. In the first regime, since β is small, the amount of momentum is not enough for the problem and thus the optimal ν is always 1. In this phase we also need to increase α when increasing β (it is typical to use a larger learning rate when the momentum coefficient is bigger). The second phase begins when we reach the optimal value of β (the rate is minimal) and, after that, the amount of momentum becomes too big and we need to decrease ν and α. However, somewhat surprisingly, there is a third phase: when β becomes big enough we need to start increasing ν and α again. Thus we can see that it is not just the product νβ that governs the behavior of QHM, but a more complicated function.

Finally, based on our analytic and numerical investigations, we conjecture that the optimal convergence rate is a monotonically decreasing function of ν (if α and β are chosen optimally for each ν). While we cannot prove this statement,² we verify this conjecture numerically in Appendix D. The code of all of our experiments is available at https://github.com/Kipok/understanding-momentum.

5 Stationary analysis

In this section, we study the stationary behavior of QHM with constant parameters α, β and ν. Again we only consider quadratic functions, for the same reasons as outlined in the beginning of Section 4. In other words, we focus on the linear dynamics of (10) driven by the noise ξ^k as k → ∞ (where the deterministic part depending on x^0 dies out). 
Under the assumptions of Section 4 we have the following result on the covariance matrix defined as Σ_x ≜ lim_{k→∞} E[x^k (x^k)ᵀ].

¹We drop the dependence of some functions on θ for brevity.
²In fact, we hypothesise that R*(ν, κ) might not have an analytical formula, since it is possible to show that the optimization problem over α and β is equivalent to a system of highly non-linear equations.

Figure 2: Changes in the shape and size of the stationary distribution with respect to α, β, and ν on a 2-dimensional quadratic problem. Panel settings and results: (a) α = 1.0, β = 0.9, ν = 0.7, mean loss = 0.11; (b) α = 0.1, β = 0.9, ν = 0.7, mean loss = 0.01; (c) α = 1.0, β = 0.99, ν = 0.7, mean loss = 0.06; (d) α = 1.0, β = 0.9, ν = 1.0, mean loss = 0.15; (e) α = 4.0, β = 0.8, ν = 0.7, mean loss = 0.86; (f) α = 0.4, β = 0.8, ν = 0.7, mean loss = 0.06; (g) α = 4.0, β = 0.95, ν = 0.7, mean loss = 0.44; (h) α = 4.0, β = 0.8, ν = 1.0, mean loss = 0.68. Each picture shows the last 5000 iterates of QHM on a contour plot. The first picture of each row is a reference and the other pictures should be compared to it. The second pictures show how the stationary distribution changes when we decrease α. The third and fourth show the dependence on β and ν, respectively. We can see that, as expected, moving α → 0 and β → 1 always decreases the achievable loss. However, the dependence on ν is more complicated, and for some values of α and β increasing ν increases the loss (top row), while for other values the dependence is reversed (bottom row). 
Note the scale change between top and bottom plots.

Theorem 4. Suppose F(x) = (1/2) xᵀAx, where A is a symmetric positive definite matrix. The stochastic gradients satisfy g^k = ∇F(x^k) + ξ, where ξ is a random vector independent of x^k with zero mean E[ξ] = 0 and covariance matrix E[ξξᵀ] = Σ_ξ. Also, suppose the parameters α, β, ν satisfy (13). Then the QHM algorithm (6), equivalently (10) in this case, converges to a stationary distribution satisfying

    A Σ_x + Σ_x A = α A Σ_ξ + O(α²).    (14)

When ν = 1, this result matches the known formula for the stationary distribution of unnormalized SHB [19] with the reparametrization α → α/(1 − β). Note that Theorem 4 shows that for the normalized version of the algorithm, the stationary distribution's covariance does not depend on β (or ν) to the first order in α. In order to explore such dependence, we need to expand the dependence on α to the second order. In that case, we are not able to obtain a matrix equation, but can get the following relation for tr(A Σ_x).

Theorem 5. Under the conditions of Theorem 4, we have

    tr(A Σ_x) = (α/2) tr(Σ_ξ) + (α²/4) (1 + (2νβ/(1 − β)) [2νβ/(1 + β) − 1]) tr(A Σ_ξ) + O(α³).    (15)

We note that tr(A Σ_x) is twice the mean value of F(x) when the dynamics have reached stationarity, so the right-hand side of (15) is approximately the "achievable loss" given the values of α, β and ν. 
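The expansion (15) can be checked against the exact stationary covariance, which solves the discrete Lyapunov equation Σ_z = T Σ_z Tᵀ + S Σ_ξ Sᵀ for the system (10). A minimal sketch (our own illustrative choices of A, Σ_ξ and the parameters; the coefficient f below is our reading of (15)), using fixed-point iteration:

```python
import numpy as np

# Sketch: compare the second-order expansion (15) with the exact stationary
# covariance of (10), obtained by iterating Sigma <- T Sigma T^T + S Sxi S^T.
n, alpha, beta, nu = 2, 0.01, 0.9, 0.7
A = np.diag([1.0, 2.0])
Sigma_xi = np.eye(n)

T = np.block([
    [beta * np.eye(n),               (1 - beta) * A],
    [-alpha * nu * beta * np.eye(n), np.eye(n) - alpha * (1 - nu * beta) * A],
])
S = np.vstack([(1 - beta) * np.eye(n), -alpha * (1 - nu * beta) * np.eye(n)])

Sigma = np.zeros((2 * n, 2 * n))
for _ in range(20_000):                      # converges since rho(T) < 1
    Sigma = T @ Sigma @ T.T + S @ Sigma_xi @ S.T
Sigma_x = Sigma[n:, n:]

# Second-order coefficient in (15)
f = 1 + (2 * nu * beta / (1 - beta)) * (2 * nu * beta / (1 + beta) - 1)
pred = alpha / 2 * np.trace(Sigma_xi) + alpha**2 / 4 * f * np.trace(A @ Sigma_xi)
exact = np.trace(A @ Sigma_x)
print(abs(exact - pred) / exact)             # small: agreement up to O(alpha^3)
```

For these values the relative gap between the exact trace and the expansion is well below one percent; it shrinks further as α decreases, as the O(α³) remainder suggests.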
It is interesting to consider several special cases:
• ν = 0 (SGD): tr(AΣx) = (α/2) tr(Σξ) + (α²/4) tr(AΣξ) + O(α³).
• ν = 1 (SHB): tr(AΣx) = (α/2) tr(Σξ) + (α²/4) · ((1−β)/(1+β)) · tr(AΣξ) + O(α³).
• ν = β (NAG): tr(AΣx) = (α/2) tr(Σξ) + (α²/4) · (1 − 2β²(1+2β)/(1+β)) · tr(AΣξ) + O(α³).

From the expressions for SHB and NAG, it might be beneficial to move β to 1 during training in order to make the achievable loss smaller. While moving β to 1 is somewhat counter-intuitive, we proved in Section 3 that QHM still converges asymptotically in this regime, assuming ν also goes to 1 and νβ converges to 1 "slower" than α converges to 0. However, since we only consider the Taylor expansion in α, there is no guarantee that the approximation remains accurate when ν and β converge to 1 (see Appendix G for an evaluation of this approximation error). In order to precisely investigate the dependence on β and ν, it is necessary to further extend our results by considering Taylor expansions with respect to them as well, especially in terms of 1 − β. We leave this for future work.

[Figure 2 panels: top row α = 1.0/0.1, β = 0.9/0.99, ν = 0.7/1.0; bottom row α = 4.0/0.4, β = 0.8/0.95, ν = 0.7/1.0]

Figure 3: These pictures show the dependence of the average final loss (depicted with color: whiter is smaller) on the parameters of the QHM algorithm for different problems. The top row shows results for a synthetic 2-dimensional quadratic problem, where all the assumptions of Theorem 5 are satisfied. The red curve indicates the boundary of the convergence region (the algorithm diverges below it). In this case, we start the algorithm directly at the optimal value to measure the size of the stationary distribution and ignore the convergence rate. We can see that, as predicted by theory, smaller α and bigger β make the final loss smaller. The bottom row shows results of the same experiments repeated for logistic regression on MNIST and ResNet-18 on the CIFAR-10 dataset. We can see that while the assumptions of Theorem 5 are no longer valid, QHM still shows similar qualitative behavior.

Figure 2 shows a visualization of the QHM stationary distribution on a 2-dimensional quadratic problem. We can see that our prediction about the dependence on α and β holds in this case. However, the dependence on ν is more complicated: the top and bottom rows of Figure 2 show opposite behavior. Comparing this experiment with our analysis of the convergence rate (Figure 1), we can see another confirmation that for big values of β, increasing ν can, in a sense, decrease the "amount of momentum" in the system. Next, we evaluate the average final loss for a large grid of parameters α, β and ν on three problems: a 2-dimensional quadratic function (where all of our assumptions are satisfied), logistic regression on the MNIST [16] dataset (where the quadratic assumption is approximately satisfied, but gradient noise comes from mini-batches), and ResNet-18 [10] on CIFAR-10 [13] (where all of our assumptions are likely violated). Figure 3 shows the results of this experiment. We can indeed see that β → 1 and α → 0 make the final loss smaller in all cases.
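The special cases above can be checked mechanically. The short script below (notation as in (15)) evaluates the general second-order factor 1 + (2νβ/(1−β))(2νβ/(1+β) − 1) and confirms that it reduces to the SGD, SHB, and NAG expressions, and that the SHB factor (1−β)/(1+β) shrinks toward 0 as β → 1, consistent with the observation that larger β lowers the achievable loss.

```python
# Second-order factor multiplying (alpha^2 / 4) * tr(A Sigma_xi) in (15).
def qhm_factor(beta, nu):
    return 1 + (2 * nu * beta / (1 - beta)) * (2 * nu * beta / (1 + beta) - 1)

def shb_factor(beta):   # nu = 1 (stochastic heavy ball)
    return (1 - beta) / (1 + beta)

def nag_factor(beta):   # nu = beta (Nesterov's accelerated gradient)
    return 1 - 2 * beta**2 * (1 + 2 * beta) / (1 + beta)

# The general factor reduces to each special case.
for beta in (0.5, 0.9, 0.99):
    assert abs(qhm_factor(beta, 0.0) - 1.0) < 1e-12            # SGD
    assert abs(qhm_factor(beta, 1.0) - shb_factor(beta)) < 1e-12
    assert abs(qhm_factor(beta, beta) - nag_factor(beta)) < 1e-12

# For SHB the factor decreases toward 0 as beta -> 1,
# shrinking the alpha^2 term of the achievable loss.
print([round(shb_factor(b), 4) for b in (0.5, 0.9, 0.99)])
# → [0.3333, 0.0526, 0.005]
```

Note that for NAG the factor turns negative as β → 1, which is one place where the accuracy caveats of the α-expansion discussed above become relevant.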
The dependence on ν is less clear, but we can see that for large values of β it is approximately quadratic, with a minimum at some ν < 1. Thus, from this point of view, ν ≠ 1 helps when β is big enough, which might be one of the reasons for the empirical success of the QHM algorithm. Notice that the empirical dependence on ν is qualitatively the same as predicted by formula (15), but with the optimal value shifted closer to 1. See Appendix F for details.

6 Some practical implications and guidelines

In this section, we present some practical implications and guidelines for setting the learning rate and momentum parameters in practical machine learning applications. In particular, we consider the question of how to set the optimal parameters in each stage of the popular constant-and-drop scheme for deep learning. We argue that in order to answer this question, it is necessary to consider both convergence rate and stationary distribution perspectives. There is typically a trade-off between obtaining a fast rate and a small stationary distribution. You can see an illustration of this trade-off in Figure 4 (a). Interestingly, by combining the stationary analysis of Section 5 and the results for the convergence rate (Section 3), we can find certain regimes of the parameters α, β, and ν where the final loss and the convergence speed do not compete with each other.

[Figure 3 panels: Quadratic function (ν = 1.0), (ν = 0.7), (α = 0.3); LR on MNIST (ν = 1.0), (β = 0.95); ResNet-18 on CIFAR-10 (ν = 1.0)]

Figure 4: (a) This plot shows a trade-off between stationary distribution size (final loss) and convergence rate on a simple 2-dimensional quadratic problem. Algorithms that converge faster typically converge to a higher final loss. (b) This plot illustrates the regime where there is no trade-off between stationary distribution size and convergence rate. Larger values of α don't change the convergence rate, while making the final loss significantly higher. To make plots (a) and (b) smoother, we plot the average value of the loss over each 100 iterations on the y-axis. (c) This plot shows that the same behavior can also be observed in training deep neural networks. For all plots β = 0.9, ν = 1.0. All of the presented results depend continuously on the algorithm's parameters (e.g., the transition between the behaviours shown in (a) and (b) is smooth).

One of the most important of these regimes happens in the case of the SHB algorithm (ν = 1). In that case, we can see that when C₁²(l) − C₂²(l) ≤ 0 for l ∈ {µ, L}, the convergence rate equals √β and does not depend on α. Thus, as long as this inequality is satisfied, we can set α as small as possible: it will not harm the convergence rate, but it will decrease the size of the stationary distribution. To get the best possible convergence rate, we in fact have to set α and β in such a way that this inequality turns into an equality, and thus there will be only a single value of α that can be used. However, as long as β is not exactly at the optimal value, there is going to be some freedom in choosing α, and it should be used to decrease the size of the stationary distribution. From this point of view, the optimal value is α = (1 − √β)/(µ(1 + √β)), which will be smaller than the largest possible α for convergence as long as κ > 2 and β is set close to 1 (see the proof of Theorem 3 for more details). This guideline contradicts some typical advice to set α as big as possible while the algorithm still converges³. The refined guideline for the constant-and-drop scheme would be to set α as small as possible until the convergence noticeably slows down. You can see an illustration of this behavior on a simple quadratic problem (Figure 4 (b)), as well as for ResNet-18 on CIFAR-10 (Figure 4 (c)). Such regimes of no trade-off can be identified for β and ν as well.

7 Conclusion

Using the general formulation of QHM, we have derived a unified set of new analytic results that give us a better understanding of the role of momentum in stochastic gradient methods. Our results cover several different aspects: asymptotic convergence, stability region and local convergence rate, and characterizations of the stationary distribution. We show that it is important to consider these different aspects together to understand the key trade-offs in tuning the learning rate and momentum parameters for better performance in practice. On the other hand, we note that the obtained guidelines are mainly for stochastic optimization, meaning the minimization of the training loss.
There is evidence that different heuristics and guidelines may be necessary for achieving better generalization performance in machine learning, but this topic is beyond the scope of our current paper.

³For example, [8] says "So set β as close to 1 as you can, and then find the highest α which still converges. Being at the knife's edge of divergence, like in gradient descent, is a good place to be."

[Figure 4 panels: Quadratic function with α ∈ {0.005, 0.02, 0.1} and α ∈ {1.0, 2.0, 3.0}; ResNet-18 on CIFAR-10 with α ∈ {2.0, 5.0, 8.5}]

References

[1] Wangpeng An, Haoqian Wang, Qingyun Sun, Jun Xu, Qionghai Dai, and Lei Zhang. A PID controller approach for stochastic optimization of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8522–8531, 2018.

[2] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Advances in optimizing recurrent networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8624–8628. IEEE, 2013.

[3] Saman Cyrus, Bin Hu, Bryan Van Scoy, and Laurent Lessard. A robust accelerated optimization algorithm for strongly convex functions. In 2018 Annual American Control Conference (ACC), pages 1376–1381. IEEE, 2018.

[4] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. arXiv preprint arXiv:1707.06386, 2017.

[5] Yu. M. Ermoliev. On the stochastic quasi-gradient method and stochastic quasi-Feyer sequences. Kibernetika, 2:72–83, 1969.

[6] Mark Iosifovich Freidlin and Alexander D. Wentzell. Random perturbations. In Random Perturbations of Dynamical Systems, pages 15–43. Springer, 1998.

[7] Sébastien Gadat, Fabien Panloup, Sofiane Saadane, et al. Stochastic heavy ball. Electronic Journal of Statistics, 12(1):461–529, 2018.

[8] Gabriel Goh. Why momentum really works. Distill, 2017. doi: 10.23915/distill.00006. URL http://distill.pub/2017/momentum.

[9] A. M. Gupal and L. T. Bazhenov. A stochastic analog of the conjugate gradient method. Cybernetics, 8(1):138–140, 1972.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[11] Yu. M. Kaniovski. Behaviour in the limit of iterations of the stochastic two-step method. USSR Computational Mathematics and Mathematical Physics, 23(1):8–13, 1983.

[12] Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.

[13] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[14] Harold J. Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2003.

[15] Hunter Lang, Pengchuan Zhang, and Lin Xiao. Statistical adaptive stochastic approximation. 2019.

[16] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

[17] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.

[18] Jerry Ma and Denis Yarats. Quasi-hyperbolic momentum and Adam for deep learning. In International Conference on Learning Representations, 2019.

[19] Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017.

[20] Paul-André Meyer. Martingales and Stochastic Integrals I, volume 284 of Lecture Notes in Mathematics. Springer-Verlag, 1972.

[21] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.

[22] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Soviet Math. Dokl., volume 27, 1983.

[23] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.

[24] Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

[25] Georg Ch. Pflug. On the determination of the step size in stochastic quasigradient methods. Collaborative Paper CP-83-025, International Institute for Applied Systems Analysis (IIASA), Laxenburg, Austria, 1983.

[26] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[27] Boris T. Polyak. Comparison of the rates of convergence of one-step and multi-step optimization algorithms in the presence of noise. Engineering Cybernetics, 15:6–10, 1977.

[28] Boris T. Polyak. Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, 1987.

[29] Boris T. Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.

[30] Benjamin Recht. CS726: Lyapunov analysis and the heavy ball method, 2010.

[31] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[32] Andrzej Ruszczyński and Wojciech Syski. Stochastic approximation method with gradient averaging for unconstrained problems. IEEE Transactions on Automatic Control, 28(12):1097–1105, 1983.

[33] Andrzej Ruszczyński and Wojciech Syski. Stochastic approximation algorithm with gradient averaging and on-line stepsize rules. In J. Gertler and L. Keviczky, editors, Proceedings of the 9th IFAC World Congress, pages 1023–1027, Budapest, Hungary, 1984.

[34] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[35] Ole Tange et al. GNU Parallel: the command-line power tool. The USENIX Magazine, 36(1):42–47, 2011.

[36] Bryan Van Scoy, Randy A. Freeman, and Kevin M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2017.

[37] M. T. Wasan. Stochastic Approximation. Cambridge University Press, 1969.

[38] David Williams. Probability with Martingales. Cambridge University Press, 1991.

[39] Sho Yaida. Fluctuation-dissipation relations for stochastic gradient descent. arXiv preprint arXiv:1810.00004, 2018.

[40] Tianbao Yang, Qihang Lin, and Zhe Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257, 2016.