{"title": "Reducing the variance in online optimization by transporting past gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 5391, "page_last": 5402, "abstract": "Most stochastic optimization methods use gradients once before discarding them. While variance reduction methods have shown that reusing past gradients can be beneficial when there is a finite number of datapoints, they do not easily extend to the online setting. One issue is the staleness due to using past gradients. We propose to correct this staleness using the idea of {\\em implicit gradient transport} (IGT) which transforms gradients computed at previous iterates into gradients evaluated at the current iterate without using the Hessian explicitly. In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online stochastic optimization in the restricted setting where the Hessians of all component functions are equal.", "full_text": "Reducing the variance in online optimization by\n\ntransporting past gradients\n\nS\u00e9bastien M. R. 
Arnold*
University of Southern California, Los Angeles, CA
seb.arnold@usc.edu

Pierre-Antoine Manzagol
Google Brain, Montréal, QC
manzagop@google.com

Reza Babanezhad
University of British Columbia, Vancouver, BC
rezababa@cs.ubc.ca

Ioannis Mitliagkas
Mila, Université de Montréal, Montréal, QC
ioannis@iro.umontreal.ca

Nicolas Le Roux
Mila, Google Brain, Montréal, QC
nlr@google.com

Abstract

Most stochastic optimization methods use gradients once before discarding them. While variance reduction methods have shown that reusing past gradients can be beneficial when there is a finite number of datapoints, they do not easily extend to the online setting. One issue is the staleness due to using past gradients. We propose to correct this staleness using the idea of implicit gradient transport (IGT) which transforms gradients computed at previous iterates into gradients evaluated at the current iterate without using the Hessian explicitly. In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online stochastic optimization in the restricted setting where the Hessians of all component functions are equal.2

1 Introduction

We wish to solve the following minimization problem:

θ* = arg min_θ E_{x∼p}[f(θ, x)] ,    (1)

where we only have access to samples x and to a first-order oracle that gives us, for a given θ and a given x, the derivative of f(θ, x) with respect to θ, i.e. ∂f(θ, x)/∂θ = g(θ, x). 
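To make this setup concrete, a minimal stochastic-gradient loop on a toy instance of Problem (1) can be sketched as follows (the quadratic f, its optimum and all parameter values are illustrative assumptions, not part of the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of Problem (1): f(theta, x) = 0.5 * ||theta - x||^2 with
# x ~ N(mu, I), so that theta* = mu and the oracle is g(theta, x) = theta - x.
mu = np.array([1.0, -2.0])

def g(theta, x):
    # First-order oracle: derivative of f(theta, x) with respect to theta.
    return theta - x

theta = np.zeros(2)
for t in range(1, 5001):
    x = mu + rng.standard_normal(2)        # draw a sample x ~ p
    alpha_t = 1.0 / t                      # decreasing stepsize schedule
    theta = theta - alpha_t * g(theta, x)  # stochastic gradient (SG) step

print(np.linalg.norm(theta - mu))          # approaches 0 as t grows
```

With a constant stepsize, the same loop only reaches a noise ball around θ*, which is the tuning difficulty discussed next.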
It is known [35] that, when f is smooth and strongly convex, there is a converging algorithm for Problem 1 that takes the form θ_{t+1} = θ_t − α_t g(θ_t, x_t), where x_t is a sample from p. This algorithm, dubbed stochastic gradient (SG), has a convergence rate of O(1/t) (see for instance [4]), within a constant factor of the minimax rate for this problem. When one has access to the true gradient g(θ) = E_{x∼p}[g(θ, x)] rather than just a sample, this rate dramatically improves to O(e^{−νt}) for some ν > 0.

In addition to hurting the convergence speed, noise in the gradient makes optimization algorithms harder to tune. Indeed, while full gradient descent is convergent for a constant stepsize α, and also amenable to line searches to find a good value for that stepsize, the stochastic gradient method from [35] with a constant stepsize only converges to a ball around the optimum [38].3 Thus, to achieve convergence, one needs to use a decreasing stepsize. While this seems like a simple modification, the precise decrease schedule can have a dramatic impact on the convergence speed. While theory prescribes α_t = O(t^{−α}) with α ∈ (1/2, 1] in the smooth case, practitioners often use larger stepsizes like α_t = O(t^{−1/2}) or even constant stepsizes.

*Work done while at Mila.
2Open-source implementation available at: https://github.com/seba-1511/igt.pth

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

When the distribution p has finite support, Eq. 1 becomes a finite sum and, in that setting, it is possible to achieve efficient variance reduction and drive the noise to zero, allowing stochastic methods to achieve linear convergence rates [24, 17, 50, 28, 42, 5]. 
Unfortunately, the finite support assumption is critical to these algorithms: while valid in many contexts, it does not have the broad applicability of the standard SG algorithm. Several works have extended these approaches to the online setting by applying these algorithms while increasing the mini-batch size N [2, 14], but they need to revisit past examples multiple times and are not truly online.

Another line of work reduces variance by averaging iterates [33, 22, 3, 10, 7, 6, 16]. While these methods converge for a constant stepsize in the stochastic case4, their practical speed is heavily dependent on the fraction of iterates kept in the averaging, a hyperparameter that is thus hard to tune, and they are rarely used in deep learning.

Our work combines two existing ideas and adds a third: a) at every step, it updates the parameters using a weighted average of past gradients, as in SAG [24, 40], albeit with a different weighting scheme; b) it reduces the bias and variance induced by the use of these old gradients by transporting them to “equivalent” gradients computed at the current point, similar to [11]; c) it does so implicitly, by computing the gradient at a parameter value different from the current one. The resulting gradient estimator can then be used as a plug-in replacement for the stochastic gradient within any optimization scheme. Experimentally, both SG using our estimator and its momentum variant outperform the most commonly used optimizers in deep learning.

2 Momentum and other approaches to dealing with variance

Stochastic variance reduction methods use an average of past gradients to reduce the variance of the gradient estimate. 
At first glance, it seems like their updates are similar to that of momentum [32], also known as the heavy ball method, which performs the following updates5:

v_0 = g(θ_0, x_0)
v_t = γ_t v_{t−1} + (1 − γ_t) g(θ_t, x_t) ,
θ_{t+1} = θ_t − α_t v_t .

When γ_t = γ, this leads to

θ_{t+1} = θ_t − α_t ( γ^t g(θ_0, x_0) + (1 − γ) Σ_{i=1}^t γ^{t−i} g(θ_i, x_i) ) .

Hence, the heavy ball method updates the parameters of the model using an average of past gradients, bearing similarity with SAG [24], albeit with exponential instead of uniform weights.

Interestingly, while momentum is a popular method for training deep networks, its theoretical analysis in the stochastic setting is limited [44], except in the particular setting when the noise converges to 0 at the optimum [26]. Also surprising is that, despite the apparent similarity with stochastic variance reduction methods, current convergence rates are slower when using γ > 0 in the presence of noise [39], although this might be a limitation of the analysis.

2.1 Momentum and variance

We propose here an analysis of how, on quadratics, using past gradients as done in momentum does not lead to a decrease in variance. If gradients are stochastic, then Δ_t = θ_t − θ* is a random variable. Denoting ε_i the noise at timestep i, i.e. g(θ_i, x_i) = g(θ_i) + ε_i, and writing Δ_t − E[Δ_t] = Σ_{i=0}^t N_{i,t} ε_i, with N_{i,t} the impact of the noise of the i-th datapoint on the t-th iterate, we may now analyze the total impact of each ε_i on the iterates. 
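On a quadratic, the iterates are linear in the noise, so each coefficient N_{i,t} can be computed exactly by propagating a unit impulse ε_i = 1 through the deterministic part of the updates. A small sketch of this computation (toy 1-d quadratic; the stepsize, curvature and horizon are illustrative choices):

```python
import numpy as np

def noise_impact(i, T, gamma=0.9, alpha=0.1, h=1.0):
    """Coefficient N_{i,t} of the noise eps_i in Delta_t = theta_t - theta*,
    for heavy ball on a toy 1-d quadratic with curvature h."""
    a, b = 0.0, 0.0              # a = dDelta_t / deps_i, b = dv_t / deps_i
    out = np.zeros(T)
    for t in range(T):
        inject = 1.0 if t == i else 0.0
        if t == 0:
            b = h * a + inject   # v_0 = g(theta_0, x_0), no (1 - gamma) weight
        else:
            b = gamma * b + (1 - gamma) * (h * a + inject)
        a = a - alpha * b        # theta_{t+1} = theta_t - alpha v_t
        out[t] = a               # impact on the next iterate
    return out

n_sgd = noise_impact(25, 200, gamma=0.0)   # plain SG
n_mom = noise_impact(25, 200, gamma=0.9)   # heavy ball
print(np.abs(n_sgd).max(), np.abs(n_mom).max())
```

Running it shows the pattern of Fig. 1: momentum caps the peak impact of a datapoint at a lower value than SG but keeps its noise around for many more iterations.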
Figure 1 shows the impact of ε_i on Δ_t − E[Δ_t], as measured by N²_{i,t}, for three datapoints (i = 1, i = 25 and i = 50) as a function of t, for stochastic gradient (γ = 0, left) and momentum (γ = 0.9, right).

3Under some conditions, it does converge linearly to the optimum [e.g., 45].
4Under some conditions on f.
5This is slightly different from the standard formulation but equivalent for constant γ_t.

Figure 1: Variance induced over time by the noise from three different datapoints (i = 1, i = 25 and i = 50) as well as the total variance, for SG (γ = 0, top left), momentum with fixed γ = 0.9 (top right), and momentum with increasing γ_t = 1 − 1/t without (bottom left) and with (bottom right) transport. Panels: (a) stochastic gradient; (b) momentum, γ = 0.9; (c) momentum, γ_t = 1 − 1/t; (d) momentum, γ_t = 1 − 1/t, with IGT. The impact of the noise of each gradient ε_i increases for a few iterations then decreases. Although a larger γ reduces the maximum impact of a given datapoint, the total variance does not decrease. With transport, noises are now equal and total variance decreases. The y-axis is on a log scale.

As we can see, when using momentum, the variance due to a given datapoint first increases as the noise influences both the next iterate (through the parameter update) and the subsequent updates (through the velocity). Due to the weight 1 − γ when a point is first sampled, a larger value of γ leads to a lower immediate impact of the noise of a given point on the iterates. However, a larger γ also means that the noise of a given gradient is kept longer, leading to little or no decrease of the total variance (dashed blue curve). 
Even in the case of stochastic gradient, the noise at a given timestep carries over to subsequent timesteps, even if the old gradients are not used for the update, as the iterate itself depends on the noise.

At every timestep, the contribution to the noise of the 1st, the 25th and the 50th points in Fig. 1 is unequal. If we assume that the ε_i are i.i.d., then the total variance would be minimal if the contribution from each point was equal. Further, one can notice that the impact of datapoint i is only a function of t − i and not of t. This guarantees that the total noise will not decrease over time.

To address these two points, one can increase the momentum parameter over time. In doing so, the noise of new datapoints will have a decreasing impact on the total variance as their gradient is multiplied by 1 − γ_t. Figure 1c shows the impact N²_{i,t} of each noise ε_i for an increasing momentum γ_t = 1 − 1/t. The peak of noise for i = 25 is indeed lower than that of i = 1. However, the variance still does not go to 0. This is because, as the momentum parameter increases, the update is an average of many gradients, including stale ones. Since these gradients were computed at iterates already influenced by the noise over previous datapoints, that past noise is amplified, as testified by the higher peak at i = 1 for the increasing momentum. Ultimately, increasing momentum does not lead to a convergent algorithm in the presence of noise when using a constant stepsize.

2.2 SAG and Hessian modelling

The impact of the staleness of the gradients on the convergence is not limited to momentum. In SAG, for instance, the excess error after k updates is proportional to (1 − min{1/(16κ̂), 1/(8N)})^k, compared to the excess error of the full gradient method, which is (1 − 1/κ)^k where κ is the condition number of the problem.6 The difference between the two rates is larger when the minimum in the SAG rate is the second term. 
This happens either when κ̂ is small, i.e. the problem is well conditioned and a lot of progress is made at each step, or when N is large, i.e. there are many points in the training set. Both cases imply that a large distance has been travelled between two draws of the same datapoint.

Recent works showed that correcting for that staleness by modelling the Hessian [46, 11] leads to improved convergence. As momentum uses stale gradients, the velocity is an average of current and past gradients and thus can be seen as an estimate of the true gradient at a point which is not the current one but rather a convex combination of past iterates. As past iterates depend on the noise of previous gradients, this bias in the gradients amplifies the noise and leads to a non-converging algorithm. We shall thus “transport” the old stochastic gradients g(θ_i, x_i) to make them closer to their corresponding value at the current iterate, g(θ_t, x_i). Past works did so using the Hessian or an explicit approximation thereof, which can be expensive and difficult to compute and maintain. We will resort to using implicit transport, a new method that aims at compensating the staleness of past gradients without making explicit use of the Hessian.

3 Converging optimization through implicit gradient transport

Before showing how to combine the advantages of both increasing momentum and gradient transport, we demonstrate how to transport gradients implicitly. This transport is only exact under a strong assumption that will not hold in practice. However, this result will serve to convey the intuition behind implicit gradient transport. We will show in Section 4 how to mitigate the effect of the unsatisfied assumption.

3.1 Implicit gradient transport

Let us assume that we received samples x_0, . . .
, x_t in an online fashion. We wish to approach the full gradient g_t(θ_t) = (1/(t+1)) Σ_{i=0}^t g(θ_t, x_i) as accurately as possible. We also assume here that a) we have a noisy estimate ĝ_{t−1}(θ_{t−1}) of g_{t−1}(θ_{t−1}); b) we can compute the gradient g(θ, x_t) at any location θ. We shall seek a θ such that

(t/(t+1)) ĝ_{t−1}(θ_{t−1}) + (1/(t+1)) g(θ, x_t) ≈ g_t(θ_t) .

To this end, we shall make the following assumption:

Assumption 3.1. All individual functions f(·, x) are quadratics with the same Hessian H.

This is the same assumption as [10, Section 4.1]. Although it is unlikely to hold in practice, we shall see that our method still performs well when that assumption is violated.

Under Assumption 3.1, we then have (see details in Appendix)

g_t(θ_t) = (t/(t+1)) g_{t−1}(θ_t) + (1/(t+1)) g(θ_t, x_t)
         ≈ (t/(t+1)) ĝ_{t−1}(θ_{t−1}) + (1/(t+1)) g(θ_t + t(θ_t − θ_{t−1}), x_t) .

Thus, we can transport our current estimate of the gradient by computing the gradient on the new point at a shifted location θ = θ_t + t(θ_t − θ_{t−1}). This extrapolation step is reminiscent of Nesterov's acceleration, with the difference that the factor in front of θ_t − θ_{t−1}, t, is not bounded.

6The κ̂ in the convergence rate of SAG is generally larger than the κ in the full gradient algorithm.

3.2 Combining increasing momentum and implicit gradient transport

We now describe our main algorithm, Implicit Gradient Transport (IGT). IGT uses an increasing momentum γ_t = t/(t+1). 
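The transport identity of Section 3.1 can be checked numerically on a toy model where every component is a quadratic with a shared Hessian (all quantities below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, t = 3, 10

# Toy model satisfying Assumption 3.1:
# f(theta, x_i) = 0.5 (theta - b_i)^T H (theta - b_i),
# so every component shares the Hessian H and g(theta, x_i) = H (theta - b_i).
A = rng.standard_normal((d, d))
H = A @ A.T + np.eye(d)                      # shared positive definite Hessian
b = rng.standard_normal((t + 1, d))          # one b_i per sample x_i

def g(theta, i):
    return H @ (theta - b[i])

theta_prev = rng.standard_normal(d)
theta_now = rng.standard_normal(d)

# Exact running average of past gradients, evaluated at theta_prev.
ghat = np.mean([g(theta_prev, i) for i in range(t)], axis=0)

# Transported estimate: the new gradient is taken at the extrapolated point
# theta_now + t (theta_now - theta_prev).
transported = (t / (t + 1)) * ghat + \
    (1 / (t + 1)) * g(theta_now + t * (theta_now - theta_prev), t)

full = np.mean([g(theta_now, i) for i in range(t + 1)], axis=0)
print(np.allclose(transported, full))        # exact match under Assumption 3.1
```

With unequal Hessians the identity only holds approximately, which is what Section 4 addresses.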
At each step, when updating the velocity, it computes the gradient of the new point at an extrapolated location so that the velocity v_t is a good estimate of the true gradient g(θ_t). We can rewrite the updates to eliminate the velocity v_t, leading to the update:

θ_{t+1} = ((2t+1)/(t+1)) θ_t − (t/(t+1)) θ_{t−1} − (α/(t+1)) g(θ_t + t(θ_t − θ_{t−1}), x_t) .    (IGT)

We see in Fig. 1d that IGT allows a reduction in the total variance, thus leading to convergence with a constant stepsize. This is captured by the following proposition:

Proposition 3.1. If f is a quadratic function with positive definite Hessian H with largest eigenvalue L and condition number κ, and if the stochastic gradients satisfy g(θ, x) = g(θ) + ε with ε a random i.i.d. noise with covariance bounded by BI, then Eq. IGT with stepsize α = 1/L leads to iterates θ_t satisfying

E[‖θ_t − θ*‖²] ≤ (1 − 1/κ)^{2t} ‖θ_0 − θ*‖² + d α² B ν² / t ,

with ν = (2 + 2 log κ)κ, for every t > 2κ.

The proof of Prop. 3.1 is provided in the appendix.

Despite this theoretical result, two limitations remain: first, Prop. 3.1 shows that IGT does not improve the dependency on the conditioning of the problem; second, the assumption of equal Hessians is unlikely to be true in practice, leading to an underestimation of the bias. We address the conditioning issue in the next section and the assumption on the Hessians in Section 4.

3.3 IGT as a plug-in gradient estimator

We demonstrated that the IGT estimator has lower variance than the stochastic gradient estimator for quadratic objectives. IGT can also be used as a drop-in replacement for the stochastic gradient in an existing, popular first-order method: the heavy ball (HB). 
This is captured by the following two propositions:

Proposition 3.2 (Non-stochastic). In the non-stochastic case, where B = 0, the variance is equal to 0 and Heavyball-IGT achieves the accelerated linear rate O(((√κ − 1)/(√κ + 1))^t) using the known, optimal heavy ball tuning, µ = ((√κ − 1)/(√κ + 1))², α = (1 + √µ)²/L.

Proposition 3.3 (Online, stochastic). When B > 0, there exist constant hyperparameters α > 0, µ > 0 such that ‖E[θ_t − θ*]‖² converges to zero linearly, and the variance is Õ(1/t).

The pseudo-code can be found in Algorithm 1.

Algorithm 1 Heavyball-IGT
1: procedure HEAVYBALL-IGT(stepsize α, momentum µ, initial parameters θ_0)
2:   v_0 ← g(θ_0, x_0), w_0 ← −α v_0, θ_1 ← θ_0 + w_0
3:   for t = 1, . . . , T − 1 do
4:     γ_t ← t/(t+1)
5:     v_t ← γ_t v_{t−1} + (1 − γ_t) g(θ_t + (γ_t/(1 − γ_t))(θ_t − θ_{t−1}), x_t)
6:     w_t ← µ w_{t−1} − α v_t
7:     θ_{t+1} ← θ_t + w_t
8:   end for
9:   return θ_T
10: end procedure

4 IGT and Anytime Tail Averaging

So far, IGT weighs all gradients equally. This is because, with equal Hessians, one can perfectly transport these gradients irrespective of the distance travelled since they were computed. In practice, the individual Hessians are not equal and might change over time. In that setting, the transport induces an error which grows with the distance travelled. We wish to average a linearly increasing number of gradients, to maintain the O(1/t) rate on the variance, while forgetting the oldest gradients to decrease the bias. 
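Algorithm 1 admits a compact implementation; a minimal NumPy sketch (the stochastic oracle `grad(theta, x)` and the sample stream are assumed interfaces of this illustration, not part of the released code):

```python
import numpy as np

def heavyball_igt(grad, theta0, x_stream, alpha, mu, T):
    """Minimal sketch of Algorithm 1 (Heavyball-IGT).

    `grad(theta, x)` is a stochastic first-order oracle and `x_stream`
    yields one sample per step; both interfaces are assumptions here."""
    xs = iter(x_stream)
    theta_prev = theta0
    v = grad(theta0, next(xs))               # v_0
    w = -alpha * v                           # w_0
    theta = theta0 + w                       # theta_1
    for t in range(1, T):
        gamma = t / (t + 1)                  # increasing momentum gamma_t
        # gamma / (1 - gamma) = t, so this evaluates the gradient at the
        # extrapolated point theta_t + t (theta_t - theta_{t-1}).
        shift = (gamma / (1 - gamma)) * (theta - theta_prev)
        v = gamma * v + (1 - gamma) * grad(theta + shift, next(xs))
        w = mu * w - alpha * v
        theta_prev, theta = theta, theta + w
    return theta
```

Setting µ = 0 recovers plain IGT, since the heavy ball term then vanishes.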
To this end, we shall use anytime tail averaging [23], named in reference to the tail averaging technique used in optimization [16]. Tail averaging is an online averaging technique where only the last points, usually a constant fraction c of the total number of points seen, are kept. Maintaining the exact average at every timestep is memory inefficient, and anytime tail averaging performs an approximate averaging using

γ_t = (c(t−1)/(1 + c(t−1))) (1 − (1/c) √((1−c)/(t(t−1)))) .

We refer the reader to [23] for additional details.

5 Impact of IGT on bias and variance in the ideal case

To understand the behaviour of IGT when Assumption 3.1 is verified, we minimize a strongly convex quadratic function with Hessian Q ∈ R^{100×100} with condition number 1000, and we have access to the gradient corrupted by noise ε_t, where ε_t ∼ N(0, 0.3·I_{100}). In that scenario where all Hessians are equal and implicit gradient transport is exact, Fig. 2a confirms the O(1/t) rate of IGT with constant stepsize while SGD and HB only converge to a ball around the optimum.

To further understand the impact of IGT, we study the quality of the gradient estimate. Standard stochastic methods control the variance of the parameter update by scaling it with a decreasing stepsize, which slows the optimization down. With IGT, we hope to have a low variance while maintaining a norm of the update comparable to that obtained with gradient descent. To validate the quality of our estimator, we optimized a quadratic function using IGT, collecting iterates θ_t. For each iterate, we computed the squared error between the true gradient and either the stochastic or the IGT gradient. In this case where both estimators are unbiased, this is the trace of the noise covariance of our estimators. 
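A toy version of this measurement, with an identity-Hessian quadratic standing in for the paper's 100-dimensional setup, shows the same effect (dimensions, noise level and stepsize are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
d, sigma, alpha = 20, 0.3, 0.5
target = rng.standard_normal(d)          # theta* of the toy quadratic (H = I)

def true_grad(theta):
    return theta - target

def stoch_grad(theta):
    return true_grad(theta) + sigma * rng.standard_normal(d)

# Plain IGT with gamma_t = t / (t + 1); record gradient-estimation errors.
theta_prev = np.zeros(d)
v = stoch_grad(theta_prev)               # v_0
theta = theta_prev - alpha * v           # theta_1
sgd_err, igt_err = [], []
for t in range(1, 2001):
    gamma = t / (t + 1)
    v = gamma * v + (1 - gamma) * stoch_grad(theta + t * (theta - theta_prev))
    sgd_err.append(np.linalg.norm(stoch_grad(theta) - true_grad(theta)))
    igt_err.append(np.linalg.norm(v - true_grad(theta)))
    theta_prev, theta = theta, theta - alpha * v

print(np.mean(sgd_err[-100:]), np.mean(igt_err[-100:]))
```

The stochastic-gradient error stays at a constant noise floor while the IGT error keeps shrinking as more gradients are averaged.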
The results in Figure 2b show that, as expected, this noise decreases linearly for IGT and is constant for SGD.

We also analyse the direction and magnitude of the gradient of IGT on the same quadratic setup. Figure 2c displays the cosine similarity between the true gradient and either the stochastic or the IGT gradient, as a function of the distance to the optimum. We see that, for the same distance, the IGT gradient is much more aligned with the true gradient than the stochastic gradient is, confirming that variance reduction happens without the need for scaling the estimate.

6 Experiments

While Section 5 confirms the performance of IGT in the ideal case, the assumption of identical Hessians almost never holds in practice. In this section, we present results on more realistic and larger scale machine learning settings. All experiments are extensively described in Appendix A and additional baselines are compared in Appendix B.

6.1 Supervised learning

CIFAR10 image classification. We first consider the task of training a ResNet-56 model [12] on the CIFAR-10 image classification dataset [19]. We use the TF official models code and setup [1], varying only the optimizer: SGD, HB, Adam, and our algorithm with anytime tail averaging, both on its own (ITA) and combined with Heavy Ball (HB-ITA). We tuned the step size for each algorithm by running experiments using a logarithmic grid. To factor in ease of tuning [48], we used Adam's default parameter values and a value of 0.9 for HB's parameter. We used a linearly decreasing stepsize as it was shown to be simple and perform well [43]. For each optimizer we selected the hyperparameter combination that is fastest to reach a consistently attainable target train loss [43]. Selecting the hyperparameter combination reaching the lowest training loss yields qualitatively identical curves. 
Figure 3 presents the results, showing that IGT with the exponential anytime tail average performs favourably, both on its own and combined with Heavy Ball: the learning curves show faster improvement and are much less noisy.

Figure 2: Analysis of IGT on quadratic loss functions. (a) Comparison of convergence curves for multiple algorithms. As expected, the IGT family of algorithms converges to the solution while stochastic gradient algorithms can not. (b) The blue and orange curves show the norm of the noise component in the SGD and IGT gradient estimates, respectively. The noise component of SGD remains constant, while it decreases at a rate 1/√t for IGT. The green curve shows the norm of the IGT gradient estimate. (c) Cosine similarity between the full gradient and the SGD/IGT estimates.

Figure 3: Resnet-56 on CIFAR10. Left: Train loss. Center: Train accuracy. Right: Test accuracy.

ImageNet image classification. We also consider the task of training a ResNet-50 model [12] on the larger ImageNet dataset [36]. The setup is similar to the one used for CIFAR10, with the difference that we trained using larger minibatches (1024 instead of 128). In Figure 4, one can see that IGT is as fast as Adam for the train loss, faster for the train accuracy, and reaches the same final performance, which Adam does not. We do not see the noise reduction we observed with CIFAR10, which could be explained by the larger batch size (see Appendix A.1).

IMDb sentiment analysis. We train a bi-directional LSTM on the IMDb Large Movie Review Dataset [27] for 200 epochs. We observe that while the training convergence is comparable to HB, HB-ITA performs better in terms of validation and test accuracy. 
In addition to the baseline and IGT methods, we also train a variant of Adam using the ITA gradients, dubbed Adam-ITA, which performs similarly to Adam.

6.2 Reinforcement learning

Linear-quadratic regulator. We cast the classical linear-quadratic regulator (LQR) [21] as a policy learning problem to be optimized via gradient descent. This setting is extensively described in Appendix A. Note that despite their simple linear dynamics and a quadratic cost functional, LQR systems are notoriously difficult to optimize due to the non-convexity of the loss landscape [8].

Figure 4: ResNet-50 on ImageNet. Left: Train loss. Center: Train accuracy. Right: Test accuracy.

Figure 5: Validation curves for different large-scale machine learning settings. Shading indicates one standard deviation computed over three random seeds. Left: Reinforcement learning via policy gradient on a LQR system. Right: Meta-learning using MAML on Mini-Imagenet.

The left chart in Figure 5 displays the evaluation cost computed along training and averaged over three random seeds. The first method (Optimal) indicates the cost attained when solving the algebraic Riccati equation of the LQR; this is the optimal solution of the problem. SGD minimizes the costs using the REINFORCE [47] gradient estimator, averaged over 600 trajectories. ITA is similar to SGD but uses the ITA gradient computed from the REINFORCE estimates. Finally, GD uses the analytical gradient by taking the expectation over the policy.

We make two observations from the above chart. 
First, ITA initially suffers from the stochastic gradient estimate but rapidly matches the performance of GD. Notably, both of them converge to a solution significantly better than SGD, demonstrating the effectiveness of the variance reduction mechanism. Second, the convergence curve is smoother for ITA than for SGD, indicating that the ITA iterates are more likely to induce similar policies from one iteration to the next. This property is particularly desirable in reinforcement learning, as demonstrated by the popularity of trust-region methods in large-scale applications [41, 29].

6.3 Meta-learning

Model-agnostic meta-learning. We now investigate the use of IGT in the model-agnostic meta-learning (MAML) setting [9]. We replicate the 5-way classification setup with 5 adaptation steps on tasks from the Mini-Imagenet dataset [34]. This setting is interesting because of the many sources contributing to noise in the gradient estimates: the stochastic meta-gradient depends on the product of 5 stochastic Hessians computed over only 10 data samples, and is averaged over only 4 tasks. We substitute the meta-optimizer with each method, select the stepsize that maximizes the validation accuracy after 10K iterations, and use it to train the model for 100K iterations.

The right graph of Figure 5 compares validation accuracies for three random seeds. We observe that methods from the IGT family significantly outperform their stochastic meta-gradient counterpart, both in terms of convergence rate and final accuracy. Those results are also reflected in the final test 
Those results are also re\ufb02ected in the \ufb01nal test\n\n8\n\n102103104Iterations2\u00d71023\u00d7102CostevaluationOptimalSGDITAGDLQR0.000.250.500.751.00Iterations1e50.450.500.550.60AccuracyvalidHBAdamHB-ITAAdam-ITAMAML\faccuracies where Adam-ITA (65.16%) performs best, followed by HB-ITA (64.57%), then Adam\n(63.70%), and \ufb01nally HB (63.08%).\n\n7 Conclusion and open questions\n\nWe proposed a simple optimizer which, by reusing past gradients and transporting them, offers\nexcellent performance on a variety of problems. While it adds an additional parameter, the ratio of\nexamples to be kept in the tail averaging, it remains competitive across a wide range of such values.\nFurther, by providing a higher quality gradient estimate that can be plugged in any existing optimizer,\nwe expect it to be applicable to a wide range of problems. As the IGT is similar to momentum, this\nfurther raises the question on the links between variance reduction and curvature adaptation. Whether\nthere is a way to combine the two without using momentum on top of IGT remains to be seen.\n\nAcknowledgments\n\nThe authors would like to thank Liyu Chen for his help with the LQR experiments and Fabian\nPedregosa for insightful discussions.\n\nReferences\n[1] The TensorFlow Authors. Tensor\ufb02ow of\ufb01cial resnet model. 2018.\n\n[2] Reza Babanezhad, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Kone\u02d8cn\u00fd, and\nScott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information\nProcessing Systems, 2015.\n\n[3] Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with\nconvergence rate o (1/n). In Advances in Neural Information Processing Systems, pages 773\u2013781,\n2013.\n\n[4] S\u00e9bastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends R(cid:13)\n\nin Machine Learning, 8(3-4):231\u2013357, 2015.\n\n[5] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. 
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[6] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. arXiv preprint arXiv:1707.06386, 2017.

[7] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520–3570, 2017.

[8] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1467–1476, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.

[10] Nicolas Flammarion and Francis Bach. From averaging to acceleration, there is only a step-size. In Conference on Learning Theory, pages 658–695, 2015.

[11] Robert Gower, Nicolas Le Roux, and Francis Bach. Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 707–715, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[14] Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, pages 2305–2313, 2015.

[15] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Accelerating stochastic gradient descent for least squares regression. In Conference On Learning Theory, pages 545–604, 2018.

[16] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: Mini-batching, averaging, and model misspecification. Journal of Machine Learning Research, 18(223):1–42, 2018.

[17] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

[19] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[20] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[21] Huibert Kwakernaak. Linear optimal control systems, volume 1.

[22] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.

[23] Nicolas Le Roux.
Anytime tail averaging. arXiv preprint arXiv:1902.05083, 2019.

[24] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.

[25] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.

[26] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.

[27] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[28] Julien Mairal. Optimization with first-order surrogate functions. In Proceedings of The 30th International Conference on Machine Learning, pages 783–791, 2013.

[29] OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub W. Pachocki, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation. CoRR, abs/1808.00177, 2018.

[30] Brendan O'Donoghue and Emmanuel Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

[31] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[32] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[33] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[34] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

[35] Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.

[36] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[37] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.

[38] Mark Schmidt. Convergence rate of stochastic gradient with constant step size. 2014.

[39] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24, 2011.

[40] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[41] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz.
Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[42] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, February 2013.

[43] Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112):1–49, 2019.

[44] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.

[45] Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.

[46] Hoi-To Wai, Wei Shi, Angelia Nedic, and Anna Scaglione. Curvature-aided incremental aggregated gradient method. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 526–532. IEEE, 2017.

[47] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[48] Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.

[49] Jian Zhang and Ioannis Mitliagkas. YellowFin and the art of momentum tuning. In SysML, 2019.

[50] Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients.
In Advances in Neural Information Processing Systems, pages 980–988, 2013.