{"title": "Automatic Learning Rate Maximization by On-Line Estimation of the Hessian's Eigenvectors", "book": "Advances in Neural Information Processing Systems", "page_first": 156, "page_last": 163, "abstract": "", "full_text": "Automatic Learning Rate Maximization \nby On-Line Estimation of the Hessian's \n\nEigenvectors \n\nYann LeCun,l Patrice Y. Simard,l and Barak Pearlmutter2 \n\n1 AT&T Bell Laboratories 101 Crawfords Corner Rd, Holmdel, NJ 07733 \n\n2CS&E Dept. Oregon Grad. Inst., 19600 NW vonNeumann Dr, Beaverton, OR 97006 \n\nAbstract \n\nWe propose a very simple, and well principled way of computing \nthe optimal step size in gradient descent algorithms. The on-line \nversion is very efficient computationally, and is applicable to large \nbackpropagation networks trained on large data sets. The main \ningredient is a technique for estimating the principal eigenvalue(s) \nand eigenvector(s) of the objective function's second derivative ma(cid:173)\ntrix (Hessian), which does not require to even calculate the Hes(cid:173)\nsian. Several other applications of this technique are proposed for \nspeeding up learning, or for eliminating useless parameters. \n\n1 \n\nINTRODUCTION \n\nChoosing the appropriate learning rate, or step size, in a gradient descent procedure \nsuch as backpropagation, is simultaneously one of the most crucial and expert(cid:173)\nintensive part of neural-network learning. We propose a method for computing the \nbest step size which is both well-principled, simple, very cheap computationally, \nand, most of all, applicable to on-line training with large networks and data sets. \nLearning algorithms that use Gradient Descent minimize an objective function E \nof the form \n\np \n\nE(W) = ~EEP(W) \n\nEP = E(W,XP) \n\n(1) \n\np=O \n\nwhere W is the vector of parameters (weights), P is the number of training patterns, \nand XP is the p-th training example (including the desired output if necessary). 
Two basic versions of gradient descent can be used to minimize E. In the first version, called the batch version, the exact gradient of E with respect to W is calculated, and the weights are updated by iterating the procedure

W ← W − η∇E(W)    (2)

where η is the learning rate or step size, and ∇E(W) is the gradient of E with respect to W. In the second version, called on-line, or Stochastic Gradient Descent, the weights are updated after each pattern presentation

W ← W − η∇E^p(W)    (3)

Before going any further, we should emphasize that our main interest is in training large networks on large data sets. As many authors have shown, Stochastic Gradient Descent (SGD) is much faster on large problems than the "batch" version. In fact, on large problems, a carefully tuned SGD algorithm outperforms most accelerated or second-order batch techniques, including Conjugate Gradient. Although there have been attempts to "stochasticize" second-order algorithms (Becker and Le Cun, 1988) (Moller, 1992), most of the resulting procedures also rely on a global scaling parameter similar to η. Therefore, there is considerable interest in finding ways of optimizing η.

2 COMPUTING THE OPTIMAL LEARNING RATE: THE RECIPE

In a somewhat unconventional way, we first give our simple "recipe" for computing the optimal learning rate η. In the subsequent sections, we sketch the theory behind the recipe.
Here is the proposed procedure for estimating the optimal learning rate in a backpropagation network trained with Stochastic Gradient Descent. Equivalent procedures for other adaptive machines are straightforward. In the following, the notation N(V) designates the normalized vector V/||V||. Let W be the N-dimensional weight vector.

1. Pick a normalized, N-dimensional vector Ψ at random.
Pick two small positive constants α and γ, say α = 0.01 and γ = 0.01.

2. Pick a training example (input and desired output) X^p. Perform a regular forward prop and a backward prop. Store the resulting gradient vector G1 = ∇E^p(W).

3. Add αN(Ψ) to the current weight vector W.

4. Perform a forward prop and a backward prop on the same pattern using the perturbed weight vector. Store the resulting gradient vector G2 = ∇E^p(W + αN(Ψ)).

5. Update the running average vector Ψ with the formula Ψ ← (1 − γ)Ψ + (γ/α)(G2 − G1).

6. Restore the weight vector to its original value W.

7. Loop to step 2 until ||Ψ|| stabilizes.

8. Set the learning rate η to ||Ψ||⁻¹, and go on to a regular training session.

The constant α controls the size of the perturbation. A small α gives a better estimate, but is more likely to cause numerical errors. γ controls the tradeoff between the convergence speed of Ψ and the accuracy of the result. It is better to start with a relatively large γ (say 0.1) and progressively decrease it until the fluctuations on ||Ψ|| are less than, say, 10%. In our experience, accurate estimates can be obtained with between one hundred and a few hundred pattern presentations: for a large problem, the cost is very small compared to a single learning epoch.

Figure 1: Gradient descent with optimal learning rate in (a) one dimension, and (b) two dimensions (contour plot).

3 STEP SIZE, CURVATURE AND EIGENVALUES

The procedure described in the previous section makes ||Ψ|| converge to the largest positive eigenvalue of the second derivative matrix of the average objective function. In this section we informally explain why the best learning rate is the inverse of this eigenvalue.
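Before turning to the theory, the eight-step recipe can be sketched in a few lines of numpy. Everything here is a hypothetical stand-in for a real network: each "pattern" is assigned a quadratic loss with a known per-pattern Hessian, so one gradient evaluation plays the role of a forward/backward pass, and the true answer is available for comparison.

```python
import numpy as np

# Hypothetical toy problem: per-pattern loss E^p(W) = 0.5 * W' H_p W, so that
# grad E^p(W) = H_p @ W and the average Hessian is mean(H_p). The H_p share a
# dominant direction u, so the average has a well-separated top eigenvalue.
rng = np.random.default_rng(1)
N, P = 8, 200
u = rng.normal(size=N); u /= np.linalg.norm(u)
H_p = np.array([5.0 * (1 + 0.2 * rng.normal()) * np.outer(u, u) + np.eye(N)
                for _ in range(P)])

def grad_p(W, p):                 # stands in for one forward + backward prop
    return H_p[p] @ W

alpha, gamma = 0.01, 0.1          # step 1: perturbation size and averaging rate
W = rng.normal(size=N)            # current weights (never actually modified)
psi = rng.normal(size=N)
psi /= np.linalg.norm(psi)        # step 1: random normalized vector

for _ in range(2000):             # step 7: loop until ||psi|| stabilizes
    p = rng.integers(P)           # step 2: pick a pattern, gradient G1
    g1 = grad_p(W, p)
    # steps 3-4: gradient G2 at the perturbed weights W + alpha * N(psi)
    g2 = grad_p(W + alpha * psi / np.linalg.norm(psi), p)
    # step 5: running-average update (step 6, restoring W, is implicit here)
    psi = (1 - gamma) * psi + (gamma / alpha) * (g2 - g1)

eta = 1.0 / np.linalg.norm(psi)   # step 8: proposed learning rate
lam_true = np.linalg.eigvalsh(H_p.mean(axis=0))[-1]
```

On this construction ||psi|| should hover near the largest eigenvalue of the average Hessian (about 6), so eta ends up near 1/6; the fluctuations shrink as γ is decreased, exactly as prescribed above.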
More detailed analyses of gradient descent procedures can be found in optimization, statistical estimation, or adaptive filtering textbooks (see for example (Widrow and Stearns, 1985)). For didactical purposes, consider an objective function of the form E(w) = (h/2)(w − z)² + C, where w is a scalar parameter (see fig. 1(a)). Assuming w is the current value of the parameter, what is the optimal η that takes us to the minimum in one step? It is easy to visualize that, as has been known since Newton, the optimal η is the inverse of the second derivative of E, i.e. 1/h. Any smaller or slightly larger value will yield slower convergence. A value more than twice the optimal will cause divergence.
In multiple dimensions, things are more complicated. If the objective function is quadratic, the surfaces of equal cost are ellipsoids (or ellipses in 2D, as shown in figure 1(b)). Intuitively, if the learning rate is set for optimal convergence along the direction of largest second derivative, then it will be small enough to ensure (slow) convergence along all the other directions. This corresponds to setting the learning rate to the inverse of the second derivative in the direction in which it is the largest. The largest learning rate that ensures convergence is twice that value. The actual optimal η is somewhere in between. Setting it to the inverse of the largest second derivative is both safe and close enough to the optimal. The second derivative information is contained in the Hessian matrix of E(W): the symmetric matrix H whose (i, j) component is ∂²E(W)/∂w_i ∂w_j. If the learning machine has N free parameters (weights), H is an N by N matrix.
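The one-dimensional claim is easy to check numerically. The sketch below applies a single gradient step to E(w) = (h/2)(w − z)² + C at the three interesting step sizes; the particular constants are arbitrary.

```python
import numpy as np

# E(w) = (h/2) * (w - z)**2 + C has second derivative h and minimum at z.
h, z, C = 4.0, 1.5, 0.3
grad = lambda w: h * (w - z)

w0 = 5.0
w_opt  = w0 - (1.0 / h) * grad(w0)   # eta = 1/h: lands exactly on z in one step
w_edge = w0 - (2.0 / h) * grad(w0)   # eta = 2/h: oscillates, |error| unchanged
w_div  = w0 - (2.5 / h) * grad(w0)   # eta > 2/h: the error grows -> divergence
```

Repeating the last update shows the three regimes directly: the first sequence stays at z, the second bounces between two points, and the third runs away geometrically.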
The Hessian can be decomposed (diagonalized) into a product of the form H = RΛRᵀ, where Λ is a diagonal matrix whose diagonal terms (the eigenvalues of H) are the second derivatives of E(W) along the principal axes of the ellipsoids of equal cost, and R is a rotation matrix which defines the directions of these principal axes. The direction of largest second derivative is the principal eigenvector of H, and the largest second derivative is the corresponding eigenvalue (the largest one). In short, it can be shown that the optimal learning rate is the inverse of the largest eigenvalue of H:

η_opt = 1/λ_max    (4)

4 COMPUTING THE HESSIAN'S LARGEST EIGENVALUE WITHOUT COMPUTING THE HESSIAN

This section derives the recipe given in section 2. Large learning machines, such as backpropagation networks, can have several thousand free parameters. Computing, or even storing, the full Hessian matrix is often prohibitively expensive, so at first glance finding its largest eigenvalue in a reasonable time seems rather hopeless. We are about to propose a shortcut based on three simple ideas: (1) the Taylor expansion, (2) the power method, (3) the running average. The method described here is general, and can be applied to any differentiable objective function that can be written as an average over "examples" (e.g. RBFs, or other statistical estimation techniques).
Taylor expansion: Although it is often unrealistic to compute the Hessian H, there is a simple way to approximate the product of H with a vector of our choosing. Let Ψ be an N-dimensional vector, and α a small real constant; the Taylor expansion of the gradient of E(W) around W along the direction Ψ gives us

HΨ = (1/α)(∇E(W + αΨ) − ∇E(W) + O(α²))    (5)

Assuming E is locally quadratic (i.e.
ignoring the O(α²) term), the product of H with any vector Ψ can be estimated by subtracting the gradient of E at point (W + αΨ) from the gradient at W. This is an O(N) process, compared to the O(N²) direct product. In the usual neural network context, this can be done with two forward propagations and two backward propagations. More accurate methods for computing HΨ which do not use perturbations exist, but they are more complicated to implement than this one (Pearlmutter, 1993).
The power method: Let λ_max be the largest eigenvalue(1) of H, and V_max the corresponding normalized eigenvector (or a vector in the eigenspace if λ_max is degenerate). If we pick a vector Ψ (say, at random) which is non-orthogonal to V_max, then iterating the procedure

Ψ ← H N(Ψ)    (6)

will make N(Ψ) converge to V_max, and ||Ψ|| converge to |λ_max|. The procedure is slow if good accuracy is required, but a good estimate of the eigenvalue can be obtained with a very small number of iterations (typically about 10). The reason for introducing equation (5) is now clear: we can use it to compute the right-hand side of (6), yielding

Ψ ← (1/α)(∇E(W + αN(Ψ)) − ∇E(W))    (7)

(1) Largest in absolute value, not largest algebraically.

where Ψ is the current estimate of the principal eigenvector of H, and α is a small constant.
The "on-line" version: One iteration of procedure (7) requires the computation of the gradient of E at two different points of the parameter space. This means that one iteration of (7) is roughly equivalent to two epochs of gradient descent learning (two passes through the entire training set). Since (7) needs to be iterated, say, 10 times, the total cost of estimating λ_max would be approximately equivalent to 20 epochs.
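Ideas 1 and 2 combine into a short batch sketch: the Hessian-vector product of (5) is just a finite difference of two gradients, and iterating (7) on it recovers the leading eigenvalue without ever forming H. The quadratic objective and its gradient oracle below are illustrative stand-ins for a network and its backprop pass.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10
B = rng.normal(size=(N, N))
H = B @ B.T                       # hidden "true" Hessian, never given to the method
grad_E = lambda W: H @ W          # gradient oracle, standing in for backprop

def hvp(W, v, alpha=1e-4):
    # eq. (5): H @ v from two gradient evaluations, an O(N) operation
    return (grad_E(W + alpha * v) - grad_E(W)) / alpha

W = rng.normal(size=N)
psi = rng.normal(size=N)
for _ in range(100):              # eq. (7): power iteration through the Hvp
    psi = hvp(W, psi / np.linalg.norm(psi))

lam_est = np.linalg.norm(psi)                 # estimate of lambda_max
lam_true = np.linalg.eigvalsh(H)[-1]          # reference from a dense solver
```

On a real network each hvp call means two gradient passes over the whole training set, which is precisely why this batch version costs on the order of two epochs per iteration.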
This excessive cost can be drastically reduced with an "on-line" version of (7) which exploits the stationarity of the second-order information over large (and redundant) training sets. Essentially, the hidden "average over patterns" in ∇E can be replaced by a running average. The procedure becomes

Ψ ← (1 − γ)Ψ + (γ/α)(∇E^p(W + αN(Ψ)) − ∇E^p(W))    (8)

where γ is a small constant which controls the tradeoff between the convergence speed and the accuracy(2). The "recipe" given in section 2 is a direct implementation of (8). Empirically, this procedure yields sufficiently accurate values in a very short time. In fact, in all the cases we have tried, it converged with only a few dozen pattern presentations: a fraction of the time of an entire learning pass through the training set (see the results section). It looks like the essential features of the Hessian can be extracted from only a few examples of the training set. In other words, the largest eigenvalue of the Hessian seems to be mainly determined by the network architecture and initial weights, and by short-term, low-order statistics of the input data. It should be noted that the on-line procedure can only find positive eigenvalues.

5 A FEW RESULTS

Experiments will be described for two different network architectures trained on segmented handwritten digits taken from the NIST database. Inputs to the networks were 28x28 pixel images containing a centered and size-normalized image of the character. Network 1 was a 4-hidden-layer, locally-connected network with shared weights similar to (Le Cun et al., 1990a) but with fewer feature maps. Each layer is only connected to the layer above. The input is 32x32 (there is a border around the 28x28 image); layer 1 is 2x28x28, with 5x5 convolutional (shared) connections. Layer 2 is 2x14x14 with 2x2 subsampled, averaging connections.
Layer 3 is 4x10x10, with 2x5x5 convolutional connections. Layer 4 is 4x5x5 with 2x2 averaging connections, and the output layer is 10x1x1 with 4x5x5 convolutional connections. The network has a total of 64,638 connections but only 1278 free parameters because of the weight sharing. Network 2 was a regular 784x30x10 fully-connected network (23,860 weights). The sigmoid function used for all units in both nets was 1.7159 tanh(2x/3). Target outputs were set to +1 for the correct unit, and -1 for the others.
To check the validity of our assumptions, we computed the full Hessian of Network 1 on 300 patterns (using finite differences on the gradient) and obtained the eigenvalues and eigenvectors using one of the EISPACK routines. We then computed

(2) The procedure (8) is not an unbiased estimator of (7). Large values of γ are likely to produce slightly underestimated eigenvalues, but this inaccuracy has no practical consequences.

Figure 2: Convergence of the on-line eigenvalue estimation (Network 1), for γ = 0.1, 0.03, 0.01, and 0.003, as a function of the number of pattern presentations.

the principal eigenvector and eigenvalue using procedures (7) and (8). All three methods agreed within less than a percent on the eigenvalue. An example run of (8) on a 1000-pattern set is shown in figure 2. A 10%-accurate estimate of the largest eigenvalue is obtained in less than 200 pattern presentations (one fifth of the database). As can be seen, the value is fairly stable over small portions of the set, which means that increasing the set size would not require more iterations of the estimation procedure.
A second series of experiments was run to verify the accuracy of the learning rate prediction. Network 1 was trained on 1000 patterns, and Network 2 on 300 patterns, both with SGD. Figure 3 shows the Mean Squared Error of the two networks after 1, 2, 3, 4, and 5 passes through the training set, as a function of the learning rate, for one particular initial weight vector. The constant γ was set to 0.1 for the first 20 patterns, 0.03 for the next 60, 0.01 for the next 120, and 0.003 for the next 200 (400 total pattern presentations), but it was found that adequate values were obtained after only 100 to 200 pattern presentations. The vertical bar represents the value predicted by the method for that particular run. It is clear that the predicted optimal value is very close to the correct optimal learning rate. Other experiments with different training sets and initial weights gave similar results. Depending on the initial weights, the largest eigenvalue for Network 1 varied between 80 and 250, and for Network 2 between 250 and 400. Experiments tend to suggest that the optimal learning rate varies only slightly during the early phase of training. The learning rate may need to be decreased for long learning sessions, as SGD moves from the "getting near the minimum" mode to the "wobbling around" mode.
There are many other methods for adjusting the learning rate. Unfortunately, most of them are based on some measurement of the oscillations of the gradient (Jacobs, 1987). Therefore, they are difficult to apply to stochastic gradient descent.

6 MORE ON EIGENVALUES AND EIGENVECTORS

We believe that computing the optimal learning rate is only one of many applications of our eigenvector estimation technique. The procedure can be adapted to serve many applications.
Figure 3: Mean Squared Error after 1, 2, 3, 4, and 5 epochs (from top to bottom) as a function of the ratio between the learning rate η and the learning rate predicted by the proposed method, ||Ψ||⁻¹. (a) Network 1 trained on 1000 patterns; (b) Network 2 trained on 300 patterns.

An important variation of the learning rate estimation arises when, instead of update rule (3), we use a "scaled SGD" rule of the form W ← W − ηΦ∇E^p(W), where Φ is a diagonal matrix (each weight has its own learning rate ηφ_i). For example, each φ_i can be the inverse of the corresponding diagonal term of the average Hessian, which can be computed efficiently as suggested in (Le Cun, 1987; Becker and Le Cun, 1988). Then procedure (8) must be changed to

Ψ ← (1 − γ)Ψ + (γ/α) Φ^(1/2) (∇E(W + αΦ^(1/2) N(Ψ)) − ∇E(W))    (9)

where the terms of Φ^(1/2) are the square roots of the corresponding terms in Φ. More generally, the above formula applies to any transformation of the parameter space whose Jacobian is Φ^(1/2). The added cost is small since Φ^(1/2) is diagonal.
Another extension of the procedure can compute the first K principal eigenvectors and eigenvalues. The idea is to store K eigenvector estimates Ψ_k, k = 1...K, updated simultaneously with equation (8) (this costs a factor K over estimating only one). We must also ensure that the Ψ_k's remain orthogonal to each other. This can be performed by projecting each Ψ_k onto the space orthogonal to the space subtended by the Ψ_l, l < k.
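The K-eigenvector extension just described can be sketched as follows, again on an illustrative quadratic with a gradient oracle; for clarity the running average is dropped (batch gradients are used), keeping only the finite-difference Hessian-vector product and the projection of each Ψ_k against the span of its predecessors.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 12, 3
B = rng.normal(size=(N, N))
H = B @ B.T                       # hidden "true" Hessian
grad_E = lambda W: H @ W          # gradient oracle (backprop stand-in)
W = rng.normal(size=N)
alpha = 1e-4

psis = [rng.normal(size=N) for _ in range(K)]
for _ in range(300):
    for k in range(K):
        v = psis[k] / np.linalg.norm(psis[k])
        new = (grad_E(W + alpha * v) - grad_E(W)) / alpha   # ~ H @ v, via eq. (5)
        for l in range(k):        # keep psi_k orthogonal to psi_1 .. psi_{k-1}
            u = psis[l] / np.linalg.norm(psis[l])
            new = new - (new @ u) * u
        psis[k] = new

lams_est = [np.linalg.norm(p) for p in psis]   # ~ top-K eigenvalues, in order
lams_true = np.linalg.eigvalsh(H)[::-1][:K]
```

At convergence HΨ_k is already orthogonal to the earlier eigenvectors, so the projection removes (almost) nothing and each ||Ψ_k|| settles at the k-th eigenvalue.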
This is an O(NK) process, which is relatively cheap if the network uses shared weights. A generalization of the acceleration method introduced in (Le Cun, Kanter and Solla, 1991) can be implemented with this technique. The idea is to use a "Newton-like" weight update formula of the type

W ← W − Σ_{k=1}^{K} ||Ψ_k||⁻¹ P_k

where P_k, k = 1...K−1, is the projection of ∇E(W) onto Ψ_k, and P_K is the projection of ∇E(W) onto the space orthogonal to the Ψ_k (k = 1...K−1). In theory, this procedure can accelerate the training by a factor ||Ψ_1||/||Ψ_K||, which is between 3 and 10 for K = 5 in a typical backprop network. Results will be reported in a later publication.
Interestingly, the method can be slightly modified to yield the smallest eigenvalues/eigenvectors. First, the largest eigenvalue λ_max must be computed (or bounded above). Then, by iterating

Ψ ← (1 − γ)Ψ + γ λ_max N(Ψ) − (γ/α)(∇E(W + αN(Ψ)) − ∇E(W))    (10)

one can compute the eigenvector corresponding to the smallest (probably negative) eigenvalue of (H − λ_max I), which is the same as H's. This can be used to determine the direction(s) of displacement in parameter space that will cause the least increase of the objective function. There are obvious applications of this to weight elimination methods: a better version of OBD (Le Cun et al., 1990b) or a more efficient version of OBS (Hassibi and Stork, 1993).
We have proposed efficient methods for (a) computing the product of the Hessian with any vector, and (b) estimating the few eigenvectors with the largest or smallest eigenvalues.
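The shift trick behind (10) can be sketched in the same batch style, with the running average omitted since the toy gradient is exact. The toy Hessian here is positive definite, so its "smallest" eigenvalue is merely small rather than negative; the power iteration is run on the shifted operator λ_max·I − H, whose dominant eigenvector is H's bottom eigenvector.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10
B = rng.normal(size=(N, N))
H = B @ B.T                       # hidden positive-definite Hessian
grad_E = lambda W: H @ W          # gradient oracle
W = rng.normal(size=N)
alpha = 1e-4

def hvp(v):                       # finite-difference H @ v, as in eq. (5)
    return (grad_E(W + alpha * v) - grad_E(W)) / alpha

# Pass 1: the largest eigenvalue (any upper bound on it would also do).
psi = rng.normal(size=N)
for _ in range(200):
    psi = hvp(psi / np.linalg.norm(psi))
lam_max = np.linalg.norm(psi)

# Pass 2: power iteration on the shifted operator (lam_max * I - H); its
# dominant eigenvalue is lam_max - lam_min, reached along H's bottom eigenvector.
phi = rng.normal(size=N)
for _ in range(2000):
    v = phi / np.linalg.norm(phi)
    phi = lam_max * v - hvp(v)

lam_min_est = lam_max - np.linalg.norm(phi)    # smallest eigenvalue of H
lam_min_true = np.linalg.eigvalsh(H)[0]
```

Note that any error in the pass-1 estimate of λ_max cancels in lam_min_est, since the same value is both added to and subtracted from the shifted spectrum.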
The methods were successfully applied to the estimation of the optimal learning rate in Stochastic Gradient Descent learning. We feel that we have only scratched the surface of the many applications of the proposed techniques.

Acknowledgements

Yann LeCun and Patrice Simard would like to thank the members of the Adaptive Systems Research dept for their support and comments. Barak Pearlmutter was partially supported by grants NSF ECS-9114333 and ONR N00014-92-J-4062 to John Moody.

References

Becker, S. and Le Cun, Y. (1988). Improving the Convergence of Back-Propagation Learning with Second-Order Methods. Technical Report CRG-TR-88-5, University of Toronto Connectionist Research Group.

Hassibi, B. and Stork, D. (1993). Optimal Brain Surgeon. In Giles, L., Hanson, S., and Cowan, J., editors, Advances in Neural Information Processing Systems, volume 5, (Denver, 1992). Morgan Kaufmann.

Jacobs, R. A. (1987). Increased Rates of Convergence Through Learning Rate Adaptation. Technical Report COINS-TR-87-117, Department of Computer and Information Sciences, University of Massachusetts, Amherst, MA.

Le Cun, Y. (1987). Modeles connexionnistes de l'apprentissage (connectionist learning models). PhD thesis, Universite P. et M. Curie (Paris 6).

Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1990a). Handwritten digit recognition with a back-propagation network. In Touretzky, D., editor, Advances in Neural Information Processing Systems 2 (NIPS*89), Denver, CO. Morgan Kaufmann.

Le Cun, Y., Denker, J. S., Solla, S., Howard, R. E., and Jackel, L. D. (1990b). Optimal Brain Damage. In Touretzky, D., editor, Advances in Neural Information Processing Systems 2 (NIPS*89), Denver, CO. Morgan Kaufmann.

Le Cun, Y., Kanter, I., and Solla, S. (1991). Eigenvalues of covariance matrices: application to neural-network learning.
Physical Review Letters, 66(18):2396-2399.

Moller, M. (1992). Supervised learning on large redundant training sets. In Neural Networks for Signal Processing 2. IEEE Press.

Pearlmutter, B. (1993). PhD thesis, Carnegie Mellon University, Pittsburgh, PA.

Widrow, B. and Stearns, S. D. (1985). Adaptive Signal Processing. Prentice-Hall.
", "award": [], "sourceid": 589, "authors": [{"given_name": "Yann", "family_name": "LeCun", "institution": null}, {"given_name": "Patrice", "family_name": "Simard", "institution": null}, {"given_name": "Barak", "family_name": "Pearlmutter", "institution": null}]}