{"title": "Gradient Descent: Second Order Momentum and Saturating Error", "book": "Advances in Neural Information Processing Systems", "page_first": 887, "page_last": 894, "abstract": null, "full_text": "Gradient Descent: Second-Order Momentum and Saturating Error

Barak Pearlmutter
Department of Psychology
P.O. Box 11A Yale Station
New Haven, CT 06520-7447
pearlmutter-barak@yale.edu

Abstract

Batch gradient descent, Δw(t) = -η dE/dw(t), converges to a minimum of quadratic form with a time constant no better than ¼ λmax/λmin, where λmin and λmax are the minimum and maximum eigenvalues of the Hessian matrix of E with respect to w. It was recently shown that adding a momentum term, Δw(t) = -η dE/dw(t) + α Δw(t-1), improves this to ¼ √(λmax/λmin), although only in the batch case. Here we show that second-order momentum, Δw(t) = -η dE/dw(t) + α Δw(t-1) + β Δw(t-2), can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a nonquadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics.

1 INTRODUCTION

Gradient descent is the bread-and-butter optimization technique in neural networks. Some people build special-purpose hardware to accelerate gradient descent optimization of backpropagation networks. Understanding the dynamics of gradient descent on such surfaces is therefore of great practical value.

Here we briefly review the known results on the convergence of batch gradient descent; show that second-order momentum does not give any speedup; simulate a real network and observe some effects not predicted by theory; and account for these effects by analyzing gradient descent with momentum on a saturating error surface.
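The convergence rates reviewed below can be checked numerically. The following is a minimal sketch, not code from the paper: the 2-D quadratic with Hessian eigenvalues 1 and 100, the step count, and the 0.99/0.98 safety factors that keep the learning rate strictly inside the stability bounds are all illustrative assumptions.

```python
import numpy as np

# Quadratic error E(w) = 0.5 w^T H w with Hessian eigenvalues 1 and 100,
# so s = lambda_min/lambda_max = 0.01 (all values here are illustrative).
H = np.diag([1.0, 100.0])
lmax, lmin = 100.0, 1.0
s = lmin / lmax

def descend(eta, alpha, steps=200):
    """Batch gradient descent with first-order momentum, eq. (4)."""
    w = np.array([1.0, 1.0])
    dw = np.zeros(2)
    for _ in range(steps):
        dw = -eta * (H @ w) + alpha * dw  # dw(t) = -eta dE/dw + alpha dw(t-1)
        w = w + dw
    return 0.5 * w @ H @ w                # E - E* at the final iterate

# Plain gradient descent just inside the bound (2), versus momentum with the
# optimal alpha* of (7) and eta just inside the stability bound (5).
alpha_star = (2 - 4 * np.sqrt(s * (1 - s))) / (1 - 2 * s) ** 2 - 1
e_plain = descend(0.99 * 2 / lmax, 0.0)
e_mom = descend(0.98 * 2 * (alpha_star + 1) / lmax, alpha_star)
print(e_plain, e_mom)  # momentum reaches a far smaller error in 200 steps
```

With s = 0.01 the plain time constant is of order λmax/λmin = 100 steps, while the momentum time constant is of order √(λmax/λmin) = 10 steps, so the gap after 200 iterations is many orders of magnitude.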
1.1 SIMPLE GRADIENT DESCENT

First, let us review the bounds on the convergence rate of simple gradient descent without momentum to a minimum of quadratic form [11, 1]. Let w* be the minimum of E, the error, H = d²E/dw²(w*), and λi, vi be the eigenvalues and eigenvectors of H. The weight change equation

    Δw = -η dE/dw    (1)

(where Δf(t) = f(t+1) - f(t)) is limited by

    0 < η < 2/λmax.    (2)

We can substitute η = 2/λmax into the weight change equation to obtain convergence that tightly bounds any achievable in practice, getting a time constant of convergence of -1/log(1 - 2s) = (2s)⁻¹ + O(1), or

    E - E* ≳ exp(-4st)    (3)

where we use s = λmin/λmax for the inverse eigenvalue spread of H and ≳ is read "asymptotically converges to zero more slowly than."

1.2 FIRST-ORDER MOMENTUM

Sometimes a momentum term is used, the weight update (1) being modified to incorporate a momentum term α < 1 [5, equation 16],

    Δw(t) = -η dE/dw(t) + α Δw(t-1).    (4)

The Momentum LMS algorithm, MLMS, has been analyzed by Shynk and Roy [6], who have shown that the momentum term cannot speed convergence in the online, or stochastic gradient, case. In the batch case, which we consider here, Tugay and Tanik [9] have shown that momentum is stable when

    α < 1 and 0 < η < 2(α+1)/λmax    (5)

which speeds convergence to

    E - E* ≳ exp(-(4√s + O(s)) t)    (6)

when the parameters are set to

    α* = (2 - 4√(s(1-s)))/(1-2s)² - 1 = 1 - 4√s + O(s),    η* = 2(α*+1)/λmax.    (7)

2 SECOND-ORDER MOMENTUM

The time constant of asymptotic convergence can be changed from O(λmax/λmin) to O(√(λmax/λmin)) by going from a first-order system, (1), to a second-order system, (4). Making a physical analogy, the first-order system corresponds to a circuit with
Figure 1: Second-order momentum converges if ηλmax is less than the value plotted as "eta," as a function of α and β. The region of convergence is bounded by four smooth surfaces: three planes and one hyperbola. One of the planes is parallel to the η axis, even though the sampling of the plotting program makes it appear slightly sloped. Another is at η = 0 and thus hidden. The peak is at 4.

a resistor, and the second-order system adds a capacitor to make an RC oscillator. One might ask whether further gains can be had by going to a third-order system,

    Δw(t) = -η dE/dw + α Δw(t-1) + β Δw(t-2).    (8)

For convergence, all the eigenvalues of the matrix

    Mi = (  0     1        0
            0     0        1
           -β    β-α    1 - ηλi + α )

in (ci(t-1), ci(t), ci(t+1))ᵀ = Mi (ci(t-2), ci(t-1), ci(t))ᵀ must have absolute value less than or equal to 1, which occurs precisely when

    -1 ≤ β ≤ 1
    0 ≤ η ≤ 4(β+1)/λi
    ηλi/2 - (1-β) < α ≤ βηλi/2 + (1-β).

For β ≤ 0 this is most restrictive for λmax, but for β > 0 λmin also comes into play. Taking the limit as λmin → 0, this gives convergence conditions for gradient descent with second-order momentum of

    -1 < β
    β - 1 ≤ α ≤ 1 - β
    when α ≤ 3β + 1:  0 < η ≤ 2(1 + α - β)/λmax
    when α ≥ 3β + 1:  0 < η ≤ 4(β + 1)/λmax    (9)

a region shown in figure 1.

Fastest convergence for λmin within this region lies along the ridge α = 3β + 1, η = 2(1 + α - β)/λmax. Unfortunately, although convergence is slightly faster than with first-order momentum, the relative advantage tends to zero as s → 0, giving negligible speedup when λmax ≫ λmin. For small s, the optimal settings of the parameters are

    β* = -√s + O(s)
    α* = 1 - 3√s + O(s)    (10)
    η* = (4/λmax)(1 - √s) + O(s/λmax).
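The convergence region and the ridge described above can be probed numerically by computing the spectral radius of the update matrix Mi. A minimal sketch, assuming λmax = 1 (so η is measured in units of 1/λmax) and an arbitrary illustrative ridge point β = -0.1:

```python
import numpy as np

def spectral_radius(lam, eta, alpha, beta):
    """Largest eigenvalue magnitude of the update matrix M_i of section 2
    for a single Hessian eigenvalue lam."""
    M = np.array([[0.0,   1.0,          0.0],
                  [0.0,   0.0,          1.0],
                  [-beta, beta - alpha, 1.0 - eta * lam + alpha]])
    return max(abs(np.linalg.eigvals(M)))

# A point on the ridge alpha = 3*beta + 1 (beta = -0.1 is illustrative).
beta = -0.1
alpha = 3 * beta + 1
eta = 2 * (1 + alpha - beta)          # the ridge value of eta * lambda_max
print(spectral_radius(1.0, eta, alpha, beta))        # on the boundary: radius ~ 1
print(spectral_radius(1.0, 0.9 * eta, alpha, beta))  # strictly inside: radius < 1
```

On the ridge itself the λmax mode sits exactly on the unit circle (the boundary of (9)), so as with the η = 2/λmax substitution of section 1.1, the ridge tightly bounds rates achievable in practice; shrinking η pulls the radius strictly inside.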
3 SIMULATIONS

We constructed a standard three-layer backpropagation network with 10 input units, 3 sigmoidal hidden units, and 10 sigmoidal output units. 15 associations between random 10-bit binary input and output vectors were constructed, and the weights were initialized to uniformly chosen random values between -1 and +1. Training was performed with a square error measure, batch weight updates, targets of 0 and 1, and a weight decay coefficient of 0.01.

To get past the initial transients, the network was run at η = 0.45, α = 0 for 150 epochs, and at η = 0.3, α = 0.9 for another 200 epochs. The weights were then saved, and the network run for 200 epochs from that starting point for η ranging from 0 to 0.5 and α ranging from 0 to 1.

Figure 3 shows that the region of convergence has the shape predicted by theory. Calculation of the eigenvalues of d²E/dw² confirms that the location of the boundary is correctly predicted. Figure 2 shows that momentum sped up convergence by the amount predicted by theory. Figure 3 shows that the parameter settings that give the most rapid convergence in practice are the settings predicted by theory.

However, within the region that does not converge to the minimum, there appear to be two regimes: one characterized by apparently chaotic fluctuations of the error, and one which slopes up gradually from the global minimum. Since this phenomenon is so atypical of a quadratic minimum in a linear system, which either converges or diverges, and since it seems important in practice, we decided to investigate a simple system to see if this behavior could be replicated and understood, which is the subject of the next section.
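The training setup above can be sketched as follows. This is not the original code: the architecture, learning schedule, and decay coefficient follow the text, while the initialization details, the placement of weight decay (on weight matrices only, not biases), and the random seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, (15, 10)).astype(float)   # 15 random 10-bit inputs
T = rng.integers(0, 2, (15, 10)).astype(float)   # 15 random 10-bit targets
W1 = rng.uniform(-1, 1, (10, 3)); b1 = rng.uniform(-1, 1, 3)
W2 = rng.uniform(-1, 1, (3, 10)); b2 = rng.uniform(-1, 1, 10)

def sigmoid(x):
    # clipping avoids overflow warnings for large pre-activations
    return 1.0 / (1.0 + np.exp(-np.clip(x, -50, 50)))

def batch_epoch(params, velocity, eta, alpha, decay=0.01):
    """One batch weight update with first-order momentum and weight decay."""
    W1, b1, W2, b2 = params
    h = sigmoid(X @ W1 + b1)                 # hidden activations
    y = sigmoid(h @ W2 + b2)                 # outputs
    e = y - T                                # square-error gradient at outputs
    d2 = e * y * (1 - y)
    d1 = (d2 @ W2.T) * h * (1 - h)
    grads = [X.T @ d1 + decay * W1, d1.sum(0),
             h.T @ d2 + decay * W2, d2.sum(0)]
    velocity = [-eta * g + alpha * v for g, v in zip(grads, velocity)]
    params = [p + v for p, v in zip(params, velocity)]
    return params, velocity, (e ** 2).sum()

params = [W1, b1, W2, b2]
vel = [np.zeros_like(p) for p in params]
for _ in range(150):
    params, vel, err = batch_epoch(params, vel, eta=0.45, alpha=0.0)
for _ in range(200):
    params, vel, err = batch_epoch(params, vel, eta=0.3, alpha=0.9)
print(err)  # total squared error after the warm-up schedule
```

From the saved state, the scan over (η, α) described above is a pair of nested loops calling `batch_epoch` for 200 epochs per grid point.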
4 GRADIENT DESCENT WITH SATURATING ERROR

The analysis of the sections above may be objected to on the grounds that it assumes the minimum to have quadratic form and then performs an analysis in the neighborhood of that minimum, which is equivalent to analyzing a linear unit. Surely our nonlinear backpropagation networks are richer than that.

Figure 2: Error plotted as a function of time (epochs 350-500) for two settings of the learning parameters, both determined empirically: the setting that minimized the error the most overall, and the setting with α = 0 that minimized the error the most. There exists a less aggressive setting of the parameters that converges nearly as fast as the quickly converging curve but does not oscillate.

A clue that this might be the case was shown in figure 3. The region where the system converges to the minimum is of the expected shape, but rather than simply diverging outside of this region, as would a linear system, more complex phenomena are observed, in particular a sloping region.

Acting on the hypothesis that this region is caused by λmax being maximal at the minimum, and gradually decreasing away from it (it must decrease to zero in the limit, since the hidden units saturate and the squared error is thus bounded), we decided to perform a dynamic systems analysis of the convergence of gradient descent on a one-dimensional nonquadratic error surface. We chose

    E = 1 - 1/(1 + w²),    (11)

which is shown in figure 4, as this results in a bounded E. Letting

    f(w) = w - η E'(w) = w(1 - 2η + 2w² + w⁴)/(1 + w²)²    (12)

be our transfer function, a local analysis at the minimum gives λmax = E''(0) = 2, which limits convergence to η < 1. Since the gradient towards the minimum is always less than predicted by a second-order series at the minimum, such η are in fact globally convergent. As η passes 1 the fixed point bifurcates into the limit cycle

    w = ±√(√η - 1),    (13)

which remains stable until η → 16/9 = 1.77777..., at which point the single symmetric binary limit cycle splits into two asymmetric limit cycles, each still of period two. These in turn remain stable until η → 2.0732261475⁻, at which point repeated period doubling to chaos occurs. This progression is shown in figure 7.

Figure 3: (Left) the error at epoch 550 as a function of the learning regime. Shading is based on the height, but most of the vertical scale is devoted to nonconvergent networks in order to show the mysterious nonconvergent sloping region. The minimum, corresponding to the most darkly shaded point, is on the plateau of convergence at the location predicted by the theory. (Center) the region in which the network is convergent, as measured by a strictly monotonically decreasing error. Learning parameter settings for which the error was strictly decreasing have a low value while those for which it was not have a high one. The lip at η = 0 has a value of 0, since there the error did not change. The rim at α = 1 corresponds to damped oscillation, which sets in when η > (1 - √α)²/λ.
(Right) contour plot of the convergent plateau shows that the regions of equal error have linear boundaries in the nonoscillatory region in the center, as predicted by theory.

As usual in a bifurcation, w rises sharply as η passes 1. But recall that figure 3, with its smooth sloping region, plotted the error E rather than the weights. The analogous graph here is shown in figure 6, where we see the same qualitative feature of a smooth gradual rise, which first begins to jitter as the limit cycle becomes asymmetric, and then becomes more and more jagged as the period doubles its way to chaos. From figure 7 it is clear that for higher η the peak error of the attractor will continue to rise gently until it saturates.

Next, we add momentum to the system. This simple one-dimensional system duplicates the phenomena we found earlier, as can be seen by comparing figure 3 with figure 5. We see that momentum delays the bifurcation of the fixed point attractor at the minimum by the amount predicted by (5), namely until η approaches 1 + α. At this point the fixed point bifurcates into a symmetric limit cycle of period 2 at

    w = ±√(√(η/(1+α)) - 1),    (14)

a formula of which (13) is a special case. This limit cycle is stable for

    η < (16/9)(1 + α),    (15)

but as η reaches this limit, which happens at the same time that w reaches ±1/√3 (the inflection point of E, where E = 1/4), the limit cycle becomes unstable. However, for α near 1 the cycle breaks down more quickly in practice, as it becomes haloed by more complex attractors which make it progressively less likely that a sequence of iterations will actually converge to the limit cycle in question. Both boundaries of this strip, η = 1 + α and η = (16/9)(1 + α), are visible in figure 5.

Figure 6: E as a function of η with α = 0.
When convergent, the final value is shown; otherwise E after 100 iterations from a starting point of w = 1.0 is shown. This is a more detailed graph of a slice of figure 5 at α = 0.

Figure 5: E after 50 iterations from a starting point of 0.05, as a function of η and α.

Figure 4: A one-dimensional tulip-shaped nonlinear error surface, E = 1 - (1 + w²)⁻¹.

Figure 7: The attractor in w as a function of η is shown, with the progression from a single attractor at the minimum of E to a limit cycle of period two, which bifurcates and then doubles to chaos. α = 0 (left) and α = 0.8 (right). For the numerically simulated portions of the graphs, iterations 100 through 150 from a starting point of w = 1 or w = 0.05 are shown.

In the region between these two boundaries, E obeys

    E = 1 - √((1 + α)/η).    (16)

The bifurcation and subsequent transition to chaos with momentum is shown for α = 0.8 in figure 7. This α is high enough that the limit cycle fails to be reached by the iteration procedure long before it actually becomes unstable. Note that this diagram was made with w started near the minimum. If it had been started far from it, the system would usually not reach the attractor at w = 0 but would instead enter a halo attractor. This accounts for the policy of backpropagation experts, who gradually raise momentum as the optimization proceeds.

5 CONCLUSIONS

The convergence bounds derived assume that the learning parameters are set optimally. Finding these optimal values in practice is beyond the scope of this paper, but some techniques for achieving nearly optimal learning rates are available [4, 10, 8, 7, 3].
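The plateau of stable but non-converged error produced by the limit cycle of section 4 is easy to reproduce numerically. A minimal sketch, where the particular η = 1.2, α = 0.1, iteration count, and starting point are illustrative choices:

```python
import numpy as np

# Iterate gradient descent with momentum on E(w) = 1 - 1/(1 + w^2) past the
# bifurcation at eta = 1 + alpha; the error settles onto the stable plateau
# E = 1 - sqrt((1 + alpha)/eta) of (16) rather than converging to zero.
def E(w):
    return 1.0 - 1.0 / (1.0 + w * w)

def dE(w):
    return 2.0 * w / (1.0 + w * w) ** 2

def final_error(eta, alpha, steps=2000, w0=0.05):
    w, dw = w0, 0.0
    for _ in range(steps):
        dw = -eta * dE(w) + alpha * dw
        w += dw
    return E(w)

eta, alpha = 1.2, 0.1  # between 1 + alpha and (16/9)(1 + alpha)
plateau = 1.0 - np.sqrt((1.0 + alpha) / eta)
print(final_error(eta, alpha), plateau)  # the two values agree closely
```

The iterate lands on the period-two cycle (14), so the printed error matches the plateau value (16) even though the optimization has stopped making progress, which is exactly the regime an automatic parameter-adjustment scheme could mistake for convergence.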
Adjusting the momentum feels easier to practitioners than adjusting the learning rate, as too high a value leads to small oscillations rather than divergence, and techniques from control theory can be applied to the problem [2].

However, because error surfaces in practice saturate, techniques for adjusting the learning parameters automatically as learning proceeds cannot be derived under the quadratic minimum assumption, but must take into account the bifurcation, the limit cycle, and the sloping region of the error, or they may mistake this regime of stable error for convergence, leading to premature termination.

References

[1] S. Thomas Alexander. Adaptive Signal Processing. Springer-Verlag, 1986.

[2] H. S. Dabis and T. J. Moir. Least mean squares as a control system. International Journal of Control, 54(2):321-335, 1991.

[3] Yan Fang and Terrence J. Sejnowski. Faster learning for dynamic recurrent backpropagation. Neural Computation, 2(3):270-273, 1990.

[4] Robert A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295-307, 1988.

[5] David E. Rumelhart, Geoffrey E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP research group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press, 1986.

[6] J. J. Shynk and S. Roy. The LMS algorithm with momentum updating. In Proceedings of the IEEE International Symposium on Circuits and Systems, pages 2651-2654, June 6-9, 1988.

[7] F. M. Silva and L. B. Almeida. Acceleration techniques for the backpropagation algorithm. In L. B. Almeida and C. J. Wellekens, editors, Proceedings of the 1990 EURASIP Workshop on Neural Networks. Springer-Verlag, February 1990. (Lecture Notes in Computer Science series.)
[8] Tom Tollenaere. SuperSAB: Fast adaptive back propagation with good scaling properties. Neural Networks, 3(5):561-573, 1990.

[9] Mehmet Ali Tugay and Yalçin Tanik. Properties of the momentum LMS algorithm. Signal Processing, 18(2):117-127, October 1989.

[10] T. P. Vogl, J. K. Mangis, A. K. Zigler, W. T. Zink, and D. L. Alkon. Accelerating the convergence of the back-propagation method. Biological Cybernetics, 59:257-263, September 1988.

[11] B. Widrow, J. M. McCool, M. G. Larimore, and C. R. Johnson, Jr. Stationary and nonstationary learning characteristics of the LMS adaptive filter. Proceedings of the IEEE, 64:1151-1162, 1976.
", "award": [], "sourceid": 454, "authors": [{"given_name": "Barak", "family_name": "Pearlmutter", "institution": null}]}