{"title": "Using Curvature Information for Fast Stochastic Search", "book": "Advances in Neural Information Processing Systems", "page_first": 606, "page_last": 612, "abstract": null, "full_text": "Using Curvature Information for Fast Stochastic Search \n\nGenevieve B. Orr \nDept of Computer Science \nWillamette University \n900 State Street \nSalem, OR 97301 \ngorr@willamette.edu \n\nTodd K. Leen \nDept of Computer Science and Engineering \nOregon Graduate Institute of Science and Technology \nP.O. Box 91000, Portland, Oregon 97291-1000 \ntleen@cse.ogi.edu \n\nAbstract \n\nWe present an algorithm for fast stochastic gradient descent that uses a nonlinear adaptive momentum scheme to optimize the late time convergence rate. The algorithm makes effective use of curvature information, requires only O(n) storage and computation, and delivers convergence rates close to the theoretical optimum. We demonstrate the technique on linear and large nonlinear backprop networks. \n\nImproving Stochastic Search \n\nLearning algorithms that perform gradient descent on a cost function can be formulated in either stochastic (on-line) or batch form. The stochastic version takes the form \n\nw_{t+1} = w_t + μ_t G(w_t, x_t)    (1) \n\nwhere w_t is the current weight estimate, μ_t is the learning rate, G is minus the instantaneous gradient estimate, and x_t is the input at time t.¹ One obtains the corresponding batch mode learning rule by taking μ constant and averaging G over all x. \n\nStochastic learning provides several advantages over batch learning. For large datasets the batch average is expensive to compute. Stochastic learning eliminates the averaging. The stochastic update can be regarded as a noisy estimate of the batch update, and this intrinsic noise can reduce the likelihood of becoming trapped in poor local optima [1, 2]. \n\n¹ We assume that the inputs are i.i.d.
This is achieved by random sampling with replacement from the training data. \n\nThe noise must be reduced late in training to allow the weights to converge. After settling within the basin of a local optimum w*, learning rate annealing allows convergence of the weight error v ≡ w - w*. It is well known that the expected squared weight error E[|v|^2] decays at its maximal rate ∝ 1/t with the annealing schedule μ_t = μ_0/t. Furthermore, to achieve this rate one must have μ_0 > μ_crit = 1/(2 λ_min), where λ_min is the smallest eigenvalue of the Hessian at w* [3, 4, 5, and references therein]. Finally, the optimal μ_0, which gives the lowest possible value of E[|v|^2], is μ_0 = 1/λ. In multiple dimensions the optimal learning rate matrix is μ(t) = (1/t) H^{-1}, where H is the Hessian at the local optimum. \n\nIncorporating this curvature information into stochastic learning is difficult for two reasons. First, the Hessian is not available, since the point of stochastic learning is not to perform averages over the training data. Second, even if the Hessian were available, optimal learning requires its inverse - which is prohibitively expensive to compute.² \n\nThe primary result of this paper is that one can achieve an algorithm that behaves optimally, i.e. as if one had incorporated the inverse of the full Hessian, without the storage or computational burden. The algorithm, which requires only O(n) storage and computation (n = number of weights in the network), uses an adaptive momentum parameter, extending our earlier work [7] to fully non-linear problems. We demonstrate the performance on several large back-prop networks trained with large datasets.
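The annealing claims above are easy to check numerically. The following sketch (our own illustration, not code from the paper; all names and parameter values are ours) runs 1-D annealed LMS with μ_0 at the optimal value 1/λ and at a sub-critical value below μ_crit = 1/(2λ), and compares the final squared weight errors:

```python
import numpy as np

def annealed_lms(mu0, n_steps, lam=1.0, sigma=0.1, seed=0):
    """1-D stochastic LMS with annealed learning rate mu_t = mu0 / t.
    Inputs have variance lam, so the Hessian is lam and mu_crit = 1/(2*lam)."""
    rng = np.random.default_rng(seed)
    w_star, w = 1.0, 0.0
    for t in range(1, n_steps + 1):
        x = np.sqrt(lam) * rng.standard_normal()
        d = w_star * x + sigma * rng.standard_normal()  # noisy linear target
        w += (mu0 / t) * x * (d - w * x)                # minus-gradient step
    return (w - w_star) ** 2

# mu0 = 1/lam (the optimal rate) vs mu0 = 0.25 < mu_crit = 0.5:
# the sub-critical schedule leaves a much larger residual error.
err_opt = np.mean([annealed_lms(1.0, 3000, seed=s) for s in range(30)])
err_slow = np.mean([annealed_lms(0.25, 3000, seed=s) for s in range(30)])
```

With μ_0 below μ_crit the error decays slower than 1/t, which is what the averaged comparison makes visible.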
\n\nImplementations of stochastic learning typically use a constant learning rate during the early part of training (what Darken and Moody [4] call the search phase) to obtain exponential convergence towards a local optimum, and then switch to annealed learning (called the converge phase). We use Darken and Moody's adaptive search then converge (ASTC) algorithm to determine the point at which to switch to 1/t annealing. ASTC was originally conceived as a means to ensure μ_0 > μ_crit during the annealed phase, and we compare its performance with adaptive momentum as well. We also provide a comparison with conjugate gradient optimization. \n\n1 Momentum in Stochastic Gradient Descent \n\nThe adaptive momentum algorithm we propose was suggested by earlier work on convergence rates for annealed learning with constant momentum. In this section we summarize the relevant results of that work. \n\nExtending (1) to include momentum gives the learning rule \n\nw_{t+1} = w_t + μ_t G(w_t, x_t) + β (w_t - w_{t-1})    (2) \n\nwhere β is the momentum parameter, constrained so that 0 < β < 1. Analysis of the dynamics of the expected squared weight error E[|v|^2] with μ_t = μ_0/t learning rate annealing [7, 8] shows that at late times learning proceeds as for the algorithm without momentum, but with a scaled or effective learning rate \n\nμ_eff = μ_0 / (1 - β).    (3) \n\nThis result is consistent with earlier work on momentum learning with small, constant μ, where the same result holds [9, 10, 11]. \n\n² Venter [6] proposed a 1-D algorithm for optimizing the convergence rate that estimates the Hessian by time averaging finite differences of the gradient and scaling the learning rate by the inverse. Its extension to multiple dimensions would require O(n^2) storage and O(n^3) time for inversion. Both are prohibitive for large models.
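In code, the momentum rule (2) with μ_0/t annealing looks as follows (again a minimal 1-D LMS sketch of ours, not the authors' code). With β = 0.9 and μ_0 = 0.1, the effective rate (3) is μ_eff = 0.1/(1 - 0.9) = 1, i.e. 1/λ for unit-variance inputs, so the run converges at essentially the optimal rate:

```python
import numpy as np

# 1-D LMS with annealed learning mu_t = mu0/t and constant momentum beta.
# With beta = 0.9 and mu0 = 0.1, the effective rate (3) is
# mu_eff = mu0 / (1 - beta) = 1, i.e. 1/lambda for unit-variance inputs.
rng = np.random.default_rng(0)
w_star = 2.0
w = w_prev = 0.0
for t in range(1, 20001):
    x = rng.standard_normal()
    d = w_star * x + 0.1 * rng.standard_normal()   # noisy linear target
    g = x * (d - w * x)                            # minus instantaneous gradient
    w, w_prev = w + (0.1 / t) * g + 0.9 * (w - w_prev), w
```

The same μ_0 without momentum would be sub-critical here (μ_0 = 0.1 < 1/(2λ) = 0.5); the momentum term rescues the rate exactly as (3) predicts.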
If we allow the effective learning rate to be a matrix then, following our comments in the introduction, the lowest value of the misadjustment is achieved when μ_eff = H^{-1} [7, 8]. Combining this result with (3) suggests that we adopt the heuristic³ \n\nβ_opt = I - μ_0 H    (4) \n\nwhere β_opt is a matrix of momentum parameters, I is the identity matrix, and μ_0 is a scalar. \n\nWe started with a scalar momentum parameter constrained by 0 < β < 1. The equivalent constraint for our matrix β_opt is that its eigenvalues lie between 0 and 1. Thus we require μ_0 < 1/λ_max, where λ_max is the largest eigenvalue of H. \n\nA scalar annealed learning rate μ_0/t combined with the momentum parameter β_opt ought to provide an effective learning rate asymptotically equal to the optimal learning rate H^{-1}. This rate 1) is achieved without ever performing a matrix inversion on H, and 2) is independent of the choice of μ_0, subject to the restriction in the previous paragraph. \n\nWe have dispensed with the need to invert the Hessian, and we next dispense with the need to store it. First, notice that, unlike its inverse, stochastic estimates of H are readily available, so we use a stochastic estimate in (4). Second, according to (2) we do not require the matrix β_opt itself, but rather β_opt times the last weight update. For both linear and non-linear networks this dispenses with the O(n^2) storage requirement. This algorithm, which we refer to as adaptive momentum, does not require explicit knowledge or inversion of the Hessian, and can be implemented very efficiently, as shown in the next section. \n\n2 Implementation \n\nThe algorithm we propose is \n\nw_{t+1} = w_t + μ_t G(w_t, x_t) + (I - μ_0 Ĥ_t) Δw_t    (5) \n\nwhere Δw_t = w_t - w_{t-1} and Ĥ_t is a stochastic estimate of the Hessian at time t. \n\nWe first consider a single layer feedforward linear network.
Since the weights connecting the inputs to different outputs are independent of each other, we need only discuss the case of one output node. Each output node is then treated identically. For one output node and N inputs, the Hessian is H = ⟨x x^T⟩_x ∈ R^{N×N}, where ⟨·⟩_x indicates expectation over the inputs x and x^T is the transpose of x. The single-step estimate of the Hessian is then just Ĥ_t = x_t x_t^T. The momentum term becomes \n\n(I - μ_0 Ĥ_t) Δw_t = (I - μ_0 x_t x_t^T) Δw_t = Δw_t - μ_0 x_t (x_t^T Δw_t).    (6) \n\nWritten in this way, we note that there is no matrix multiplication, just the vector dot product x_t^T Δw_t and a vector addition, both of which are O(n). For M output nodes, the algorithm is then O(N_w), where N_w = N M is the total number of weights in the network. \n\nFor nonlinear networks the problem is somewhat more complicated. To compute Ĥ_t Δw_t we use the algorithm developed by Pearlmutter [12] for computing the product of the Hessian times an arbitrary vector.⁴ The equivalent of one forward-backward \n\n³ We refer to (4) as a heuristic since we have no theoretical results on the dynamics of the squared weight error for learning with this matrix of momentum parameters. \n\n⁴ We actually use a slight modification that calculates the linearized Hessian times a vector: Df ⊗ Df Δw_t, where Df is the Jacobian of the network output (vector) with respect to the weights, and ⊗ indicates a tensor product. \n\nFigure 1: 2-D LMS Simulations: Behavior of log(E[|v|^2]) over an ensemble of 1000 networks with λ_1 = 0.4 and λ_2 = 4, σ_ε^2 = 1.
a) μ_0 = 0.1 with various β. Dashed curve corresponds to adaptive momentum. b) β adaptive for various μ_0. \n\npropagation is required for this calculation. Thus, computing the entire weight update requires two forward-backward propagations: one for the gradient calculation and one for computing Ĥ_t Δw_t. \n\nThe only constraint on μ_0 is that μ_0 < 1/λ_max. We use the on-line algorithm developed by LeCun, Simard, and Pearlmutter [13] to find the largest eigenvalue prior to the start of training. \n\n3 Examples \n\nIn the following two subsections we examine the behavior of annealed learning with adaptive momentum on networks previously trained to a point close to an optimum, where the noise dominates. We look at very simple linear nets, large linear nets, and a large nonlinear net. In section 3.3 we couple adaptive momentum with automatic switching from constant to annealed learning. \n\n3.1 Linear Networks \n\nWe begin with a simple 2-D LMS network. Inputs x_t are gaussian distributed with zero mean, and the targets d at each timestep t are d_t = w*^T x_t + ε_t, where ε_t is zero mean gaussian noise and w* is the optimal weight vector. The weight error at time t is just v ≡ w_t - w*. \n\nFigure 1 displays results for both constant and adaptive momentum, with averages computed over an ensemble of 1000 networks. Figure 1a shows the decay of E[|v|^2] for μ_0 = 0.1 and various values of β. As momentum is increased, the convergence rate increases. The optimal scalar momentum parameter is β = 1 - μ_0 λ_min = 0.96. Adaptive momentum achieves essentially the same rate of convergence without prior knowledge of the Hessian. \n\nFigure 1b shows the behavior of E[|v|^2] for various μ_0 when adaptive momentum is used. One can see that after a few hundred iterations the value of E[|v|^2] is independent of μ_0 (in all cases μ_0 < 1/λ_max < μ_crit).
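The 2-D experiment just described is easy to reproduce in miniature. The sketch below (ours, not the authors' code; the eigenvalues and μ_0 follow the Figure 1 setup, but the seed, horizon, and w* are arbitrary) runs the update (5) with the momentum term in the O(n) form (6):

```python
import numpy as np

# 2-D LMS with Hessian eigenvalues 0.4 and 4, unit noise variance,
# annealed learning mu_t = mu0/t plus adaptive momentum, eqs (5)-(6).
rng = np.random.default_rng(0)
lam = np.array([0.4, 4.0])
w_star = np.array([1.0, -1.0])
mu0 = 0.1                       # satisfies mu0 < 1/lambda_max = 0.25

w = np.zeros(2)
w_prev = w.copy()
for t in range(1, 30001):
    x = np.sqrt(lam) * rng.standard_normal(2)   # so <x x^T> = diag(lam)
    d = w_star @ x + rng.standard_normal()      # noisy linear target
    g = x * (d - w @ x)                         # minus instantaneous gradient
    dw = w - w_prev
    momentum = dw - mu0 * x * (x @ dw)          # (I - mu0 x x^T) dw, O(n) work
    w, w_prev = w + (mu0 / t) * g + momentum, w
```

Note that the single-sample Hessian estimate x_t x_tᵀ never needs to be formed: the momentum term costs only a dot product and a vector addition per step.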
\n\nFigure 2 shows the behavior of the misadjustment (mean squared error in excess of the optimum) for a 4-D LMS problem with a large condition number ρ ≡ λ_max/λ_min = 10^5. We compare 3 cases: 1) the optimal learning rate matrix μ_0 = H^{-1} without momentum, 2) μ_0 = 0.5 with the optimal constant momentum matrix β = I - μ_0 H, and 3) μ_0 = 0.5 with adaptive momentum. All three cases show similar behavior, demonstrating the efficacy with which the matrix momentum \n\nFigure 2: 4-D LMS with ρ = 10^5: Plot displays misadjustment. Annealing starts at t = 10. For β_adapt and β = I - μ_0 H, we use μ_0 = 0.5. Each curve is an average of 10 runs. \n\nFigure 3: Linear Prediction: μ_0 = 0.26. Curves show constant learning rate, annealing started at t = 50 without momentum, and with adaptive momentum. \n\nmocks up the optimal learning rate matrix μ_0 = H^{-1}, lending credence to the stochastic estimate of the Hessian used in adaptive momentum. \n\nWe next consider a large linear prediction problem (128 inputs, 16 outputs, and eigenvalues ranging from 1.06 × 10^{-5} to 19.98 - condition number ρ = 1.9 × 10^6).⁵ Figure 3 displays the misadjustment for 1) annealed learning with β = β_adapt, 2) annealed learning with β = 0, and 3) a constant learning rate (for comparison purposes). As before, we first trained (not shown completely) at constant learning rate μ_0 = .026 until the MSE and the weight error leveled out. As can be seen, β_adapt does much better than annealing without momentum. \n\n3.2 Phoneme Classification \n\nWe next use phoneme classification as an example of a large nonlinear problem. The database consists of 9000 phoneme vectors taken from 48 50-second speech monologues. Each input vector consists of 70 PLP coefficients. There are 39 target classes.
The architecture was a standard fully connected feedforward network with 71 input nodes (including bias), 70 hidden nodes, and 39 output nodes, for a total of 7700 weights. \n\nWe first trained the network with constant learning rate until the MSE flattened out. At that point we either annealed without momentum, annealed with adaptive momentum, or used ASTC (which attempts to adjust μ_0 to be above μ_crit - see next section). When annealing was used without momentum, we found that the noise went away, but the percent of correctly classified phonemes did not improve. Both adaptive momentum and ASTC resulted in significant increases in the percent correct; however, adaptive momentum was significantly better than ASTC. In the next section, we examine this problem in more detail. \n\n3.3 Switching on Annealing \n\nA complete algorithm must choose an appropriate point to change from constant-μ search to annealed learning. We use Moody and Darken's ASTC algorithm [4, 14] to accomplish this. ASTC measures the roughness of trajectories, switching to 1/t annealing when the trajectories become very rough - an indication that the noise in the updates is dominating the algorithm's behavior. In an attempt to satisfy \n\n⁵ Prediction of a 4 × 4 block of image pixels from the surrounding 8 blocks. \n\nFigure 4: Phoneme Classification: Percent Correct. a) ASTC without momentum (bottom curve) and adaptive momentum (top) as a function of the number of input presentations. b) Conjugate Gradient Descent - one epoch equals one pass through the data, i.e. 9000 input presentations. \n\nμ_0 > μ_crit, ASTC can also switch back to constant learning when trajectories become too smooth.
\n\nWe return to the phoneme problem using three different training methods: 1) ASTC without momentum (with switching back and forth between annealed and constant learning), 2) adaptive momentum, with annealing turned on when ASTC first suggests the transition (but no subsequent return to constant learning rate), and 3) standard conjugate gradient descent. \n\nFigure 4a compares ASTC (no momentum) with adaptive momentum (using ASTC to turn on annealing). After annealing is turned on, the classification accuracy improves far more quickly with adaptive momentum. \n\nFigure 4b displays the classification performance as a function of epoch using conjugate gradient descent (CGD). After 100 passes through the 9000 example dataset (900,000 presentations), the classification accuracy is 39.6%, or 7% below adaptive momentum's performance at 100,000 presentations. Note also that adaptive momentum is continuing to improve the optimization, while the ASTC and conjugate gradient descent curves have flattened out. \n\nThe cpu time used for the optimization was about the same for the CGD and adaptive momentum algorithms. It thus appears that our implementation of adaptive momentum costs about 9 times as much per pattern as CGD. We believe that the performance can be improved: our complexity analysis [8] predicts a 3:1 cost ratio rather than 9:1, and optimization comparable to that applied to the CGD code⁶ should enhance the run-time performance of adaptive momentum. \n\nFor this problem, the performance of the two algorithms on the test set (not shown on the graph) is not much different (31.7% for CGD versus 33.4% for adaptive momentum). However, we are concerned here with the efficiency of the optimization, not generalization performance. The latter depends on dataset size and regularization techniques, which can easily be combined with any optimizer.
\n\n4 Summary \n\nWe have presented an efficient O(n) stochastic algorithm with few adjustable parameters that achieves fast convergence during the converge phase for both linear and nonlinear problems. It does this by incorporating curvature information without explicit computation of the Hessian. We also combined it with a method (ASTC) for detecting when to make the transition between search and converge regimes. \n\n⁶ CGD was performed using nopt, written by Etienne Barnard and made available through the Center for Spoken Language Understanding at the Oregon Graduate Institute. \n\nAcknowledgments \n\nThe authors thank Yann LeCun for his helpful critique. This work was supported by EPRI under grant RPB015-2 and AFOSR under grant FF4962-93-1-0253. \n\nReferences \n\n[1] Genevieve B. Orr and Todd K. Leen. Weight space probability densities in stochastic learning: II. Transients and basin hopping times. In Giles, Hanson, and Cowan, editors, Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA, 1993. Morgan Kaufmann. \n\n[2] William Finnoff. Diffusion approximations for the constant learning rate backpropagation algorithm and resistance to local minima. In Giles, Hanson, and Cowan, editors, Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA, 1993. Morgan Kaufmann. \n\n[3] Larry Goldstein. Mean square optimality in the continuous time Robbins Monro procedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern California, LA, 1987. \n\n[4] Christian Darken and John Moody. Towards faster stochastic gradient search. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992. \n\n[5] Halbert White. Learning in artificial neural networks: A statistical perspective. Neural Computation, 1:425-464, 1989. \n\n[6] J.
H. Venter. An extension of the Robbins-Monro procedure. Annals of Mathematical Statistics, 38:117-127, 1967. \n\n[7] Todd K. Leen and Genevieve B. Orr. Optimal stochastic search and adaptive momentum. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, San Francisco, CA, 1994. Morgan Kaufmann Publishers. \n\n[8] Genevieve B. Orr. Dynamics and Algorithms for Stochastic Search. PhD thesis, Oregon Graduate Institute, 1996. \n\n[9] Mehmet Ali Tugay and Yalcin Tanik. Properties of the momentum LMS algorithm. Signal Processing, 18:117-127, 1989. \n\n[10] John J. Shynk and Sumit Roy. Analysis of the momentum LMS algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(12):2088-2098, 1990. \n\n[11] W. Wiegerinck, A. Komoda, and T. Heskes. Stochastic dynamics of learning with momentum in neural networks. Journal of Physics A, 27:4425-4437, 1994. \n\n[12] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6:147-160, 1994. \n\n[13] Yann LeCun, Patrice Y. Simard, and Barak Pearlmutter. Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In Giles, Hanson, and Cowan, editors, Advances in Neural Information Processing Systems, vol. 5, San Mateo, CA, 1993. Morgan Kaufmann. \n\n[14] Christian Darken. Learning Rate Schedules for Stochastic Gradient Algorithms. PhD thesis, Yale University, 1993.", "award": [], "sourceid": 1227, "authors": [{"given_name": "Genevieve", "family_name": "Orr", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}]}