{"title": "Stable Dynamic Parameter Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 225, "page_last": 231, "abstract": "", "full_text": "Stable Dynamic Parameter Adaptation \n\nStefan M. R\u00fcger \n\nFachbereich Informatik, Technische Universit\u00e4t Berlin \nSekr. FR 5-9, Franklinstr. 28/29 \n10587 Berlin, Germany \nasync@cs.tu-berlin.de \n\nAbstract \n\nA stability criterion for dynamic parameter adaptation is given. In the case of the learning rate of backpropagation, a class of stable algorithms is presented and studied, including a convergence proof. \n\n1 INTRODUCTION \n\nAll but a few learning algorithms employ one or more parameters that control the quality of learning. Backpropagation has its learning rate and momentum parameter; Boltzmann learning uses a simulated annealing schedule; Kohonen learning a learning rate and a decay parameter; genetic algorithms probabilities; etc. The investigator always has to set the parameters to specific values when trying to solve a certain problem. Traditionally, the metaproblem of adjusting the parameters is solved by relying on a set of values well tested on other problems, or by an intensive search for good parameter regions, restarting the experiment with different values. In this situation, a great deal of expertise and/or time for experiment design is required (as well as a huge amount of computing time). \n\n1.1 DYNAMIC PARAMETER ADAPTATION \n\nIn order to achieve dynamic parameter adaptation, it is necessary to modify the learning algorithm under consideration: evaluate the performance of the parameters in use from time to time, compare them with the performance of nearby values, and (if necessary) change the parameter setting on the fly. 
This requires that there exist a measure of the quality of a parameter setting, called performance, with the following properties: the performance depends continuously on the parameter set under consideration, and it is possible to evaluate the performance locally, i. e., at a certain point within an inner loop of the algorithm (as opposed to once only at the end of the algorithm). This is what dynamic parameter adaptation is all about. \n\nDynamic parameter adaptation has several virtues. It is automatic: there is no need for an extra schedule to find which parameters suit the problem best. And when the notion of what the good values of a parameter set are changes during learning, dynamic parameter adaptation keeps track of these changes. \n\n1.2 EXAMPLE: LEARNING RATE OF BACKPROPAGATION \n\nBackpropagation is an algorithm that implements gradient descent in an error function E: R^n → R. Given w^0 ∈ R^n and a fixed η > 0, the iteration rule is w^{t+1} = w^t - η∇E(w^t). The learning rate η is a local parameter in the sense that at different stages of the algorithm different learning rates would be optimal. This property and the following theorem make η especially interesting. \n\nTrade-off theorem for backpropagation. Let E: R^n → R be the error function of a neural net with a regular minimum at w* ∈ R^n, i. e., E is expansible into a Taylor series about w* with vanishing gradient ∇E(w*) and positive definite Hessian matrix H(w*). Let λ_max denote the largest eigenvalue of H(w*). Then, in general, backpropagation with a fixed learning rate η > 2/λ_max cannot converge to w*. \n\nProof. Let U be an orthogonal matrix that diagonalizes H(w*), i. e., D := U^T H(w*) U is diagonal. Using the coordinate transformation x = U^T (w - w*) and Taylor expansion, E(w) - E(w*) can be approximated by F(x) := x^T D x / 2. 
\nSince gradient descent does not refer to the coordinate system, the asymptotic behavior of backpropagation for E near w* is the same as for F near 0. In the latter case, backpropagation calculates the weight components x_i^t = x_i^0 (1 - D_ii η)^t at time step t. The diagonal elements D_ii are the eigenvalues of H(w*); convergence of all the geometric sequences t ↦ x_i^t thus requires η < 2/λ_max. ∎ \n\nThe trade-off theorem states that, given η, a large class of minima cannot be found, namely, those whose corresponding Hessian matrix has a largest eigenvalue greater than 2/η. Fewer minima might be overlooked by using a smaller η, but then the algorithm becomes intolerably slow. Dynamic learning-rate adaptation is urgently needed for backpropagation! \n\n2 STABLE DYNAMIC PARAMETER ADAPTATION \n\nTransforming the equation for gradient descent, w^{t+1} = w^t - η∇E(w^t), into a differential equation, one arrives at ∂w^t/∂t = -η∇E(w^t). Gradient descent with constant step size η can then be viewed as Euler's method for solving this differential equation. One serious drawback of Euler's method is that it is unstable: each finite step leaves the trajectory of a solution without trying to get back to it. Virtually any other differential-equation solver surpasses Euler's method, and there are even some featuring dynamic parameter adaptation [5]. \n\nHowever, in the context of function minimization, this notion of stability (\"do not drift away too far from a trajectory\") would appear to be too strong. Indeed, differential-equation solvers put much effort into a good estimation of points that are as close as possible to the trajectory under consideration. What is really needed for minimization is asymptotic stability: ensuring that the performance of the parameter set does not decrease at the end of learning. This weaker stability criterion allows for greedy steps in the initial phase of learning. 
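The trade-off can be checked numerically on a one-dimensional quadratic error function. The following minimal sketch (illustrative, not from the paper) runs gradient descent on E(w) = λw²/2 with a step size on either side of the bound 2/λ:

```python
# Gradient descent on E(w) = lam * w**2 / 2, whose only Hessian
# eigenvalue is lam.  The iterate obeys w_{t+1} = (1 - lam*eta) * w_t,
# a geometric sequence that converges iff |1 - lam*eta| < 1, i.e. eta < 2/lam.
lam = 4.0                      # curvature = largest Hessian eigenvalue

def final_distance(eta, w=1.0, steps=50):
    """Distance from the minimum 0 after `steps` Euler steps."""
    for _ in range(steps):
        w = w - eta * lam * w  # Euler step for dw/dt = -eta * grad E(w)
    return abs(w)

print(final_distance(eta=0.4))  # eta < 2/lam = 0.5: converges toward 0
print(final_distance(eta=0.6))  # eta > 2/lam: iterates grow without bound
```

With λ = 4 the critical step size is 0.5: below it the distance to the minimum shrinks geometrically; above it the iterates oscillate with growing amplitude.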
\n\nThere are several successful examples of dynamic learning-rate adaptation for backpropagation: Newton and quasi-Newton methods [2] as an adaptive η-tensor; individual learning rates for the weights [3, 8]; conjugate gradient as a one-dimensional η-estimation [4]; or straightforward η-adaptation [1, 7]. \n\nA particularly good example of dynamic parameter adaptation was proposed by Salomon [6, 7]: let ζ > 1; at every step t of the backpropagation algorithm test two values for η, a somewhat smaller one, η_t/ζ, and a somewhat larger one, η_t ζ; use as η_{t+1} the value with the better performance, i. e., the smaller error: \n\n  η_{t+1} = arg min { E(w^t - η∇E(w^t)) : η ∈ {η_t/ζ, η_t ζ} }. \n\nThe setting of the new parameter ζ proves to be uncritical (all values work, especially sensible ones being those between 1.2 and 2.1). This method outperforms many other gradient-based algorithms, but it is nonetheless unstable. \n\nFigure 1: Unstable Parameter Adaptation \n\nThe problem arises from a rapidly changing length and direction of the gradient, which can result in a huge leap away from a minimum, although the latter may have been almost reached. Figure 1a shows the level lines of a simple quadratic error function E: R^2 → R along with the weight vectors w^0, w^1, ... (bold dots) resulting from the above algorithm. This effect was probably the reason why Salomon suggested using the normalized gradient instead of the gradient, thus getting rid of the changes in the length of the gradient. Although this works much better, Figure 1b shows the instability of this algorithm due to the change in the gradient's direction. There is enough evidence that these algorithms converge for a purely quadratic error function [6, 7]. Why bother with stability? 
One would like to prove that an algorithm asymptotically finds the minimum, rather than occasionally leaping far away from it and thus leaving the region where the quadratic Hessian term of a globally nonquadratic error function dominates. \n\n3 A CLASS OF STABLE ALGORITHMS \n\nIn this section, a class of algorithms is derived from the above ones by adding stability. This class provides not only a proof of asymptotic convergence, but also a significant improvement in speed. \n\nLet E: R^n → R be an error function of a neural net with random weight vector w^0 ∈ R^n. Let ζ > 1, η_0 > 0, 0 < c ≤ 1, and 0 < a ≤ 1 ≤ b. At step t of the algorithm, choose a vector g^t restricted only by the condition g^t·∇E(w^t) / (|g^t| |∇E(w^t)|) ≥ c and by the requirement that either 1/|g^t| ∈ [a, b] for all t or |∇E(w^t)|/|g^t| ∈ [a, b] for all t; i. e., the vectors g^t have a minimal positive projection onto the gradient and either have a uniformly bounded length or are uniformly bounded by the length of the gradient. Note that this is always possible by choosing g^t as the gradient or the normalized gradient. \n\nLet e: η ↦ E(w^t - η g^t) denote the one-dimensional error function given by E, w^t, and g^t. Repeat (until the gradient vanishes or an upper limit of t or a lower limit E_min of E is reached) the iteration w^{t+1} = w^t - η_{t+1} g^t with \n\n  η_{t+1} = η*       if e(0) < e(η_t ζ), \n  η_{t+1} = η_t/ζ    if e(η_t/ζ) ≤ e(η_t ζ) ≤ e(0),   (1) \n  η_{t+1} = η_t ζ    otherwise, \n\nwhere η* := (η_t ζ / 2) / (1 + (e(η_t ζ) - e(0)) / (η_t ζ · g^t·∇E(w^t))). \n\nThe first case for η_{t+1} is a stabilizing term η*, which definitely decreases the error when the error surface is quadratic, i. e., near a minimum. η* is put into effect when the error e(η_t ζ), which would occur in the next step if η_{t+1} = η_t ζ were chosen, exceeds the error e(0) produced by the present weight vector w^t. 
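As a concrete illustration, one step of the iteration with rule (1) can be sketched in Python (a minimal sketch, not the author's implementation; the name `adapt_step` is hypothetical, `E` and `grad_E` are assumed to be supplied by the caller, and g^t is taken to be the plain gradient, one of the admissible choices):

```python
import numpy as np

def adapt_step(w, eta, E, grad_E, zeta=1.8):
    """One step w^{t+1} = w^t - eta_{t+1} g^t following rule (1),
    with g^t = grad E(w^t)."""
    g = grad_E(w)
    e = lambda h: E(w - h * g)           # one-dimensional error along -g
    e0, e_up, e_down = e(0.0), e(eta * zeta), e(eta / zeta)
    if e0 < e_up:
        # stabilizing case: eta* from the quadratic model of e; for a
        # quadratic E this is the exact minimizer along the search line
        G = float(g @ g)                 # g . grad E(w), here |g|^2
        eta_new = (eta * zeta / 2) / (1 + (e_up - e0) / (eta * zeta * G))
    elif e_down <= e_up:                 # e(eta/zeta) <= e(eta*zeta) <= e(0)
        eta_new = eta / zeta             # the smaller learning rate wins
    else:
        eta_new = eta * zeta             # the larger learning rate wins
    return w - eta_new * g, eta_new
```

On a positive definite quadratic form, iterating this step decreases the error monotonically, in line with the convergence proof of Section 4.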
By construction, η* takes a value less than η_t ζ/2 if e(η_t ζ) > e(0); hence, given ζ < 2, the learning rate is decreased as expected, no matter what E looks like. Typically (if the values for ζ are not extremely high), the other two cases apply, where η_t ζ and η_t/ζ compete for a lower error. \n\nNote that, instead of gradient descent, this class of algorithms proposes a \"g^t descent,\" and the vectors g^t may differ from the gradient. A particular algorithm is given by a specification of how to choose g^t. \n\n4 PROOF OF ASYMPTOTIC CONVERGENCE \n\nAsymptotic convergence. Let E: w ↦ Σ_{i=1}^n λ_i w_i^2 / 2 with λ_i > 0. For all ζ > 1, 0 < c ≤ 1, 0 < a ≤ 1 ≤ b, η_0 > 0, and w^0 ∈ R^n, every algorithm from Section 3 produces a sequence t ↦ w^t that converges to the minimum 0 of E with an at least exponential decay of t ↦ E(w^t). \n\nProof. This statement follows if a constant q < 1 exists with E(w^{t+1}) ≤ q E(w^t) for all t. Then lim_{t→∞} w^t = 0, since w ↦ √E(w) is a norm on R^n. \n\nFix w^t, η_t, and g^t according to the premise. Since E is a positive definite quadratic form, e: η ↦ E(w^t - η g^t) is a one-dimensional quadratic function with a minimum at, say, η*. Note that e(0) = E(w^t) and e(η_{t+1}) = E(w^{t+1}). e is completely determined by e(0), e'(0) = -g^t·∇E(w^t), η_t ζ, and e(η_t ζ). Omitting the algebra, it follows that η* can be identified with the stabilizing term of (1). \n\nFigure 2: Steps in Estimating a Bound q for the Improvement of E. \n\nIf e(η_t ζ) > e(0), by (1) η_{t+1} will be set to η*; hence, w^{t+1} has the smallest possible error e(η*) along the line given by g^t. 
Otherwise, the three values 0, η_t/ζ, and η_t ζ cannot all have the same error under e, as e is quadratic; e(η_t ζ) or e(η_t/ζ) must be less than e(0), and the argument with the better performance is used as η_{t+1}. The sequence t ↦ E(w^t) is thus strictly decreasing; hence, a q ≤ 1 exists. The rest of the proof shows the existence of a q < 1. \n\nAssume there are two constants 0 < q_e, q_η < 1 with \n\n  η_{t+1}/η* ∈ [q_η, 2 - q_η],   (2) \n  e(η*) ≤ q_e e(0).   (3) \n\nLet η_{t+1} ≥ η*; writing η_{t+1} = ((η_{t+1} - η*)/η*) · 2η* + (1 - (η_{t+1} - η*)/η*) · η* and using first the convexity of e together with e(2η*) = e(0), then (2), and (3), one obtains \n\n  e(η_{t+1}) ≤ ((η_{t+1} - η*)/η*) e(0) + (1 - (η_{t+1} - η*)/η*) e(η*) \n             ≤ (1 - q_η) e(0) + q_η e(η*) \n             ≤ (1 - q_η (1 - q_e)) e(0). \n\nFigure 2 shows how the estimations work. The symmetric case 0 < η_{t+1} ≤ η* has the same result; hence E(w^{t+1}) ≤ q E(w^t) with q := 1 - q_η (1 - q_e) < 1. \n\nLet λ< := min{λ_i} and λ> := max{λ_i}. A straightforward estimation for q_e yields \n\n  q_e := 1 - c^2 λ</λ> < 1. \n\nNote that η* depends on w^t and g^t. A careful analysis of the recursive dependence of η_{t+1}/η*(w^t, g^t) on η_t/η*(w^{t-1}, g^{t-1}) uncovers the estimate \n\n  q_η := min{ 2/(ζ^2 + 1), (c a/((ζ^2 + 1) b)) (λ</λ>)^{3/2}, η_0 λ< / (b max{1, √(2 λ> E(w^0))}) } > 0. ∎ \n\n5 NON-GRADIENT DIRECTIONS CAN IMPROVE CONVERGENCE \n\nIt is well known that the sign-changed gradient of a function is not necessarily the best direction in which to look for a minimum. The momentum term of a modified backpropagation version uses old gradient directions; Newton or quasi-Newton methods explicitly or implicitly exploit second-order derivatives for a change of direction; another choice of direction is given by conjugate gradient methods [5]. \n\nThe algorithms from Section 3 allow almost any direction, as long as it is not nearly perpendicular to the gradient. 
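The admissibility condition of Section 3, a minimal positive projection c of the direction onto the gradient, is cheap to test. The sketch below (illustrative, with a hypothetical function name) falls back to the gradient, which always satisfies the condition, whenever a proposed direction fails it:

```python
import numpy as np

def admissible_direction(d, grad, c=0.1):
    """Return d if d . grad / (|d| |grad|) >= c (the projection
    condition of Section 3); otherwise fall back to the gradient."""
    nd, ng = np.linalg.norm(d), np.linalg.norm(grad)
    if nd > 0 and ng > 0 and float(d @ grad) / (nd * ng) >= c:
        return d
    return grad
```

A direction at 45 degrees to the gradient passes for any c up to √2/2, whereas a perpendicular one is replaced by the gradient itself.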
Since they estimate a good step size, these algorithms can be regarded as a sort of \"trial-and-error\" line search that does not bother to find an exact minimum in the given direction, but utilizes any progress made so far. One could incorporate the Polak-Ribière rule, d^{t+1} = ∇E(w^{t+1}) + α β d^t, for conjugate directions, with d^0 = ∇E(w^0), α = 1, and \n\n  β = (∇E(w^{t+1}) - ∇E(w^t))·∇E(w^{t+1}) / (∇E(w^t))^2, \n\nto propose vectors g^t := d^t/|d^t| for an explicit algorithm from Section 3. As in the conjugate gradient method, one should reset the direction d^t to the gradient direction after every n updates (n being the number of weights). Another reason for resetting the direction arises when g^t does not have the minimal positive projection c onto the normalized gradient. \n\nα = 0 sets the descent direction g^t to the normalized gradient ∇E(w^t)/|∇E(w^t)|; this algorithm proves to exhibit behavior very similar to Salomon's algorithm with normalized gradients. The difference lies in the occurrence of some stabilization steps from time to time, which, in general, improve the convergence. \n\nSince comparisons of Salomon's algorithm with many other methods have been published [7], this paper confines itself to showing that significant improvements are brought about by non-gradient directions, e. g., by Polak-Ribière directions (α = 1). \n\nTable 1: Average Learning Time for Some Problems \n\n  PROBLEM                    E_min    α = 0          α = 1 \n  (a) 3-2-4 regression       10^0     195 ± 95%      58 ± 70% \n  (b) 3-2-4 approximation    10^-4    1070 ± 140%    189 ± 115% \n  (c) Pure square (n = 76)   10^-16   464 ± 17%      118 ± 9% \n  (d) Power 1.8 (n = 76)     10^-4    486 ± 29%      84 ± 23% \n  (e) Power 3.8 (n = 76)     10^-16   37 ± 14%       28 ± 10% \n  (f) 8-3-8 encoder          10^-4    1380 ± 60%     300 ± 60% \n\nTable 1 shows the average number of epochs of the two algorithms for some problems. 
\nThe average was taken over many initial random weight vectors and over values of ζ ∈ [1.7, 2.1]; the root mean square error of the averaging process is shown as a percentage. Note that, owing to the two test steps for η_t/ζ and η_t ζ, one epoch has an overhead of around 50% compared to a corresponding epoch of backpropagation. α ≠ 0 helps; it could itself be chosen by dynamic parameter adaptation. \n\nProblems (a) and (b) represent the approximation of a function known only from some example data. A neural net with 3 input, 2 hidden, and 4 output nodes was used to generate the example data; artificial noise was added for problem (a). The same net with random initial weights was then used to learn an approximation. These problems for feedforward nets are expected to have regular minima. \n\nProblem (c) uses a pure square error function E: w ↦ Σ_{i=1}^n i |w_i|^p / 2 with p = 2 and n = 76. Note that conjugate gradient needs exactly n epochs to arrive at the minimum [5]. However, the few additional epochs that are needed by the α = 1 algorithm to reach a fairly small error (here 118 as opposed to 76) must be compared to the overhead of conjugate gradient (one line search per epoch). \n\nPowers other than 2, as used in (d) and (e), work well as long as, say, p > 1.5. A power p < 1 will (if n ≥ 2) produce a \"trap\" for the weight vector at a location near a coordinate axis, where, owing to an infinite gradient component, no gradient-based algorithm can escape.¹ Problems are expected even for p near 1: the algorithms of Section 3 exploit the fact that the gradient vanishes at a minimum, which in turn is numerically questionable for a power like 1.1. Typical minima, however, employ powers 2, 4, ...; even better convergence is expected and found for large powers. 
\n\nIDynamic parameter adaptation as in (1) can cope with the square-root singularity \n(p = 1/2) in one dimension, because the adaptation rule allows a fast enough decay of \nthe learning rate; the ability to minimize this one-dimensional square-root singularity is \nsomewhat overemphasized in [7]. \n\n\fStable Dynamic Parameter Adaptation \n\n231 \n\nThe 8-3-8 encoder (f) was studied, because the error function has global minima \nat the boundary of the domain (one or more weights with infinite length). These \nminima, though not covered in Section 4, are quickly found. Indeed, the ability \nto increase the learning rate geometrically helps these algorithms to approach the \nboundary in a few steps. \n\n6 CONCLUSIONS \n\nIt has been shown that implementing asymptotic stability does help in the case of the \nbackpropagation learning rate: the theoretical analysis has been simplified, and the \nspeed of convergence has been improved. Moreover, the presented framework allows \ndescent directions to be chosen flexibly, e. g., by the Polak-Ribiere rule. Future work \nincludes studies of how to apply the stability criterion to other parametric learning \nproblems. \n\nReferences \n\n[1] R. Battiti. Accelerated backpropagation learning: Two optimization methods. \n\nComplex Systems, 3:331-342, 1989. \n\n[2] S. Becker and Y. Ie Cun. Improving the convergence of back-propagation learn(cid:173)\n\ning with second order methods. In D. Touretzky, G. Hinton, and T. Sejnowski, \neditors, Proceedings of the 1988 Connectionist Models Summer School, pages \n29-37. Morgan Kaufmann, San Mateo, 1989. \n\n[3] R. Jacobs. Increased rates of convergence through learning rate adaptation. \n\nNeural Networks, 1:295-307, 1988. \n\n[4] A. Kramer and A. Sangiovanni-Vincentelli. Efficient parallel learning algorithms \nfor neural networks. In D. Touretzky, editor, Advances in Neural Information \nProcessing Systems 1, pages 40-48. Morgan Kaufmann, San Mateo, 1989. \n\n[5] W. H. Press, B. P. 
Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. Cambridge University Press, 1988. \n\n[6] R. Salomon. Verbesserung konnektionistischer Lernverfahren, die nach der Gradientenmethode arbeiten [Improving connectionist learning algorithms that work by the gradient method]. PhD thesis, TU Berlin, October 1991. \n\n[7] R. Salomon and J. L. van Hemmen. Accelerating backpropagation through dynamic self-adaptation. Neural Networks, 1996 (in press). \n\n[8] F. M. Silva and L. B. Almeida. Speeding up backpropagation. In Proceedings of NSMS - International Symposium on Neural Networks for Sensory and Motor Systems, Amsterdam, 1990. Elsevier. \n", "award": [], "sourceid": 1061, "authors": [{"given_name": "Stefan", "family_name": "R\u00fcger", "institution": null}]}