{"title": "Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm and Resistence to Local Minima", "book": "Advances in Neural Information Processing Systems", "page_first": 459, "page_last": 466, "abstract": "", "full_text": "Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm and Resistence to Local Minima\n\nWilliam Finnoff\nSiemens AG, Corporate Research and Development\nOtto-Hahn-Ring 6\n8000 Munich 83, Fed. Rep. Germany\n\nAbstract\n\nIn this paper we discuss the asymptotic properties of the most commonly used variant of the backpropagation algorithm, in which network weights are trained by means of a local gradient descent on examples drawn randomly from a fixed training set, and the learning rate η of the gradient updates is held constant (simple backpropagation). Using stochastic approximation results, we show that for η → 0 this training process approaches batch training, and we provide results on the rate of convergence. Further, we show that for small η one can approximate simple backpropagation by the sum of a batch training process and a Gaussian diffusion which is the unique solution to a linear stochastic differential equation. Using this approximation we indicate the reasons why simple backpropagation is less likely to get stuck in local minima than the batch training process and demonstrate this empirically on a number of examples.\n\n1 INTRODUCTION\n\nThe original (simple) backpropagation algorithm, incorporating pattern for pattern learning and a constant learning rate η ∈ (0, ∞), remains, in spite of many real (and imagined) deficiencies, the most widely used network training algorithm, and a vast body of literature documents its general applicability and robustness.
In this paper we will draw on the highly developed literature of stochastic approximation theory to demonstrate several asymptotic properties of simple backpropagation. The close relationship between backpropagation and stochastic approximation methods has long been recognized, and various properties of the algorithm for the case of a decreasing learning rate η_{n+1} < η_n, n ∈ N, were shown for example by White [W,89a], [W,89b] and Darken and Moody [D,M,91]. Hornik and Kuan [H,K,91] used comparable results for the algorithm with constant learning rate to derive weak convergence results.\n\nIn the first part of this paper we will show that simple backpropagation has the same asymptotic dynamics as batch training in the small learning rate limit. As such, anything that can be expected of batch training can also be expected of simple backpropagation as long as the learning rate of the algorithm is very small. In the special situation considered here (in contrast to that in [H,K,91]) we will also be able to provide a result on the speed of convergence. In the next part of the paper, Gaussian approximations for the difference between the actual training process and the limit are derived. It is shown that this difference (properly renormalized) converges to the solution of a linear stochastic differential equation. In the final section of the paper, we combine these results to provide an approximation for the simple backpropagation training process and use it to show why simple backpropagation will be less inclined to get stuck in local minima than batch training. This ability to avoid local minima is then demonstrated empirically on several examples.\n\n2 NOTATION\n\nDefine the parametric version of a single hidden layer network activation function with h inputs, m outputs and q hidden units,\n\nf : R^d × R^h → R^m, (θ, x) ↦ (f^1(θ, x), ..., f^m(θ, x)),\n\nby setting, for x ∈ R^h, x̃ = (x_1, ..., x_h, 1), θ = (γ, β) and u = 1, ..., m,\n\nf^u(θ, x) = f^u((γ, β), x) = ψ(Σ_{j=1}^q γ_j^u ψ(β_j x̃^T) + γ_{q+1}^u),\n\nwhere x^T denotes the transpose of x and d = m(q + 1) + q(h + 1) denotes the number of weights in the network. Let ((y_k, x_k))_{k=1,...,T} be a set of training examples, consisting of targets (y_k)_{k=1,...,T} and inputs (x_k)_{k=1,...,T}. We then define the parametric error function\n\nU(y, x, θ) = ||y − f(θ, x)||^2,\n\nand, for every θ, the cumulative gradient\n\nh(θ) = −(1/T) Σ_{k=1}^T (∂U/∂θ)(y_k, x_k, θ).\n\n3 APPROXIMATION WITH THE ODE\n\nWe will be considering the asymptotic properties of network training processes induced by the starting value θ_0, the gradient (or direction) function −∂U/∂θ, the learning rate η and an infinite training sequence (y_n, x_n)_{n∈N}, where each example (y_n, x_n) is drawn at random from the set {(y_1, x_1), ..., (y_T, x_T)}. One defines the discrete parameter process θ = θ^η = (θ_n^η)_{n∈Z_+} of weight updates by setting\n\nθ_n^η = θ_0 for n = 0,\nθ_n^η = θ_{n−1}^η − η (∂U/∂θ)(y_n, x_n, θ_{n−1}^η) for n ∈ N,\n\nand the corresponding continuous parameter process (θ^η(t))_{t∈[0,∞)} by setting θ^η(t) = θ_n^η for t ∈ [nη, (n + 1)η), n ∈ Z_+. The first question that we will investigate is that of the 'small learning rate limit' of the continuous parameter process θ^η, i.e. the limiting properties of the family θ^η for η → 0. We show that the family of (stochastic) processes (θ^η)_{η>0} converges with probability one to a limit process θ̄, where θ̄ denotes the solution of the cumulative gradient equation\n\nθ̄(t) = θ_0 + ∫_0^t h(θ̄(s)) ds.\n\nHere, for θ_0 = a = constant, this solution is deterministic.
This result corresponds to a 'law of large numbers' for the weight update process, in which the small learning rate (in the limit) averages out the stochastic fluctuations.\n\nCentral to any application of the stochastic approximation results is the derivation of local Lipschitz and linear growth bounds for ∂U/∂θ and h. That is the subject of the following lemma.\n\nLemma (3.1) i) There exists a constant K > 0 so that\n\nsup_{(y,x)} ||(∂U/∂θ)(y, x, θ)|| ≤ K(1 + ||θ||)\n\nand\n\n||h(θ)|| ≤ K(1 + ||θ||).\n\nii) For every G > 0 there exists a constant L_G so that for any θ, θ' ∈ [−G, G]^d,\n\nsup_{(y,x)} ||(∂U/∂θ)(y, x, θ) − (∂U/∂θ)(y, x, θ')|| ≤ L_G ||θ − θ'||\n\nand\n\n||h(θ) − h(θ')|| ≤ L_G ||θ − θ'||.\n\nProof: The calculations on which this result is based are tedious but straightforward, making repeated use of the fact that products and sums of locally Lipschitz continuous functions are themselves locally Lipschitz continuous. It is even possible to provide explicit values for the constants given above. ∎\n\nDenoting with P (resp. E) the probability (resp. mathematical expectation) of the processes defined above, we can present the results on the probability of deviations of the process θ^η from the limit θ̄.\n\nTheorem (3.2) Let r, δ ∈ (0, ∞). Then there exists a constant B_r (which doesn't depend on η) so that\n\ni) E(sup_{s≤r} ||θ^η(s) − θ̄(s)||^2) ≤ B_r η,\n\nii) P(sup_{s≤r} ||θ^η(s) − θ̄(s)|| > δ) ≤ (1/δ^2) B_r η.\n\nProof: The first part of the proof requires that one find bounds for θ^η(t) and θ̄(t) for t ∈ [0, r]. This is accomplished using the results of Lemma (3.1) and Gronwall's Lemma. This places η independent bounds on B_r. The remainder of the proof uses Theorem (9), §1.5, Part II of [Ben,Met,Pri,87]. The required conditions (A1), (A2) follow directly from our hypotheses, and (A3), (A4) from Lemma (3.1). Due to the boundedness of the variables (y_n, x_n)_{n∈N} and θ_0, condition (A5) is trivially fulfilled. ∎
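Theorem (3.2) lends itself to a quick numerical illustration. The following sketch compares the pattern-by-pattern, constant learning rate process θ^η with the corresponding batch (cumulative gradient, Euler) process on a hypothetical scalar least-squares model — not the network of Section 2; all constants are illustrative — and estimates sup_{s≤r} |θ^η(s) − θ̄(s)| for several learning rates. The deviation should shrink as η → 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set for a scalar linear model f(theta, x) = theta * x
# with squared error U(y, x, theta) = (y - theta * x)^2; this stands in for
# the network error function of Section 2 purely for illustration.
T = 50
x = rng.uniform(-1.0, 1.0, T)
y = 2.0 * x + 0.1 * rng.normal(size=T)

def grad_U(theta, xk, yk):
    # dU/dtheta for a single example (y_k, x_k)
    return -2.0 * xk * (yk - theta * xk)

def h(theta):
    # cumulative gradient direction h(theta) = -(1/T) sum_k dU/dtheta
    return -np.mean(grad_U(theta, x, y))

def sup_deviation(eta, r=2.0, theta0=0.0):
    # sup_{s <= r} |theta^eta(s) - theta_bar(s)| for one pattern-by-pattern
    # run; theta_bar is advanced by Euler steps of d theta / dt = h(theta).
    th_sgd, th_ode, dev = theta0, theta0, 0.0
    for _ in range(int(r / eta)):
        k = rng.integers(T)  # draw one training example at random
        th_sgd -= eta * grad_U(th_sgd, x[k], y[k])
        th_ode += eta * h(th_ode)
        dev = max(dev, abs(th_sgd - th_ode))
    return dev

for eta in (0.1, 0.01, 0.001):
    devs = [sup_deviation(eta) for _ in range(20)]
    print(f"eta={eta:6.3f}  mean sup deviation {np.mean(devs):.4f}")
```

Averaged over a handful of runs, the sup deviation decreases with η, consistent with the O(η) bound on its square in part i).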
\n\u2022 \n\nIt should be noted that the constant Br is usually dependent on r and may indeed \nincrease exponentially (in r) unless it is possible to show that the training process \nremains in some bounded region for t -- 00. This is not necessarily due exclusively \nto the difference between the stochastic approximation and the discrete parameter \ncummulative gradient process, but also to the the error between the discrete (Euler \napproximation) and continuous parameter versions of (3.3). \n\n4 GAUSSIAN APPROXIMATIONS \n\nIn this section we will give a Gaussian approximation for the difference between \nthe training process 8\" and the limit O. Although in the limit these coincide, for \n\n71 > \u00b0 the training process fluctuates away from the limit in a stochastic fashion. \n\nThe following Gaussian approximation provides an estimate for the size and nature \n\n\fDiffusion Approximations for the Constant Learning Rate Backpropagation Algorithm \n\n463 \n\nof these fluctuations depending on the second order statistics (variance/covariance \nmatrix) of the weight update process. Define for any t E [0,00), \n\n8'1(t) = O'1(t) - O(t) . \n\n..ft \n\nFurther, for i = 1, ... , d we denote with ~~ i (y, x, 0), (resp. hi (6)) the i-th coordinate \nvector of ~(y,x,O) (resp. h(O)). Then define for i,j = I, ... ,d, 6 E Rd \n\nThus, for any n EN, 6 E R d, R( 0) represents the covariance matrix of the random \nelements ~(Yn, Xn, 6). We can then define for the symmetric matrix R(6) a further \nRdxd valued matrix Ri(6) with the property that R(6) = Ri(6)(R!(6))T. \nThe following result represents a central limit theorem for the training process. This \npermits a type of second order approximation of the fluctuations of the stochastic \ntraining process around its deterministic limit. 
Theorem (4.1): Under the assumptions given above, the distributions of the processes θ̃^η, η > 0, converge weakly (in the sense of weak convergence of measures) for η → 0 to a uniquely defined measure L(θ̃), where θ̃ denotes the solution of the following linear stochastic differential equation,\n\ndθ̃(t) = (∂h/∂θ)(θ̄(t)) θ̃(t) dt + R^{1/2}(θ̄(t)) dW(t), θ̃(0) = 0,\n\nwhere W denotes a standard d-dimensional Brownian motion (i.e. with covariance matrix equal to the identity matrix).\n\nProof: The proof here uses Theorem (7), §4.4, Part II of [Ben,Met,Pri,87]. As noted in the proof of Theorem (3.2), under our hypotheses, the conditions (A1)-(A5) are fulfilled. Define for i, j = 1, ..., d, (y, x) ∈ I^{m+h}, θ ∈ R^d,\n\nw^{ij}(y, x, θ) = (∂U/∂θ_i)(y, x, θ)(∂U/∂θ_j)(y, x, θ) − h_i(θ)h_j(θ).\n\nUnder our hypotheses, h has continuous first and second order derivatives for all θ ∈ R^d, and the functions R = (R^{ij})_{i,j=1,...,d} as well as w = (w^{ij})_{i,j=1,...,d} fulfill the remaining requirements of (A8) as follows: (A8)i) and (A8)ii) are trivial consequences of the definitions of R and w. Finally, setting ρ_3 = ρ_4 = 0 and μ = 1, (A8)iii) can then be derived directly from the definitions of w and R and Lemma (3.1)ii). ∎\n\n5 RESISTENCE TO LOCAL MINIMA\n\nIn this section we combine the results of the two preceding sections to provide a Gaussian approximation of simple backpropagation. Recalling the results and notation of Theorem (3.2) and Theorem (4.1), we have for any t ∈ [0, ∞),\n\nθ^η(t) = θ̄(t) + η^{1/2} θ̃(t) + o(η^{1/2}).\n\nUsing this approximation we have:\n\n- For 'very small' learning rate η, simple backpropagation and batch learning will produce essentially the same results, since the stochastic portion of the process (controlled by η^{1/2}) will be negligible.\n\n- Otherwise, there is a non-negligible stochastic element in the training process which can be approximated by the Gaussian diffusion θ̃.
- This diffusion term gives simple backpropagation a 'quasi-annealing' character, in which the cumulative gradient is continuously perturbed by the Gaussian term θ̃, allowing it to escape local minima with small, shallow basins of attraction.\n\nIt should be noted that the rest term will actually have a better convergence rate than the indicated o(η^{1/2}). The calculation of exact rates, though, would require a generalized version of the Berry-Esseen theorem. To our knowledge, no such results are available which would be applicable to the situation described above.\n\n6 EMPIRICAL RESULTS\n\nThe imperviousness of simple backpropagation to local minima, which is part of neural network 'folklore', is documented here in four examples. A single hidden layer feedforward network with ψ = tanh, ten hidden units and one output was trained with both simple backpropagation and batch training using data generated by four different models. The data consisted of pairs (y_i, x_i), i = 1, ..., T, T ∈ N, with targets y_i ∈ R and inputs x_i = (x_i^1, ..., x_i^K) ∈ [−1, 1]^K, where y_i = g((x_i^1, ..., x_i^j)) + u_i, for j, K ∈ N. The first experiment was based on an additive structure g having the following form with j = 5 and K = 10: g((x_i^1, ..., x_i^5)) = Σ_{k=1}^5 sin(a^k x_i^k), a^k ∈ R. The second model had a product structure g with j = 3, K = 10 and g((x_i^1, ..., x_i^3)) = Π_{k=1}^3 a^k x_i^k, a^k ∈ R. The third structure considered was constructed with j = 5 and K = 10, using sums of Radial Basis Functions (RBFs) as follows: g((x_i^1, ..., x_i^5)) = Σ_{l=1}^5 (−1)^l exp(−Σ_{k=1}^5 (a_l^k − x_i^k)^2 / (2σ^2)), where the centers a_l = (a_l^1, ..., a_l^5) were chosen by independent drawings from a uniform distribution on [−1, 1]^5. The final experiment was conducted using data generated by a feedforward network activation function. For more details concerning the construction of the examples used here consult [F,H,Z,92].\n\nFor each model three training runs were made using the same vector of starting weights for both simple backpropagation and batch training. As can be seen, in all but one example the batch training process got stuck in a local minimum, producing much worse results than those found using simple backpropagation. Due to the wide array of structures used to generate data and the number of data sets used, it would be hard to dismiss the observed phenomena as being example dependent.\n\n[Figure: training error (× 10^{-3}) versus training epochs for simple backpropagation (simple BP) and batch training on the four models (network mapping, product mapping, sums of RBFs, sums of sines).]
, \u2022\u2022\u2022\u2022 \u2022\u2022 :;~;;; ~; ::::';':';':':';':';':':';':';';;';':':';':':':':':';';':';';';'!:';';';';';';':':':':';';';';'; ';':';';'; ':';';';';';';';':';';: \n\nSOiJ,GO ~ ... . - --- -- - . -\n\n.J.C{).OO ------ .-------=~~;;;::~-~--=-\n\n'3 a tC' n' L::'\" .. . \n\n0.00 \n\nifX).OO \n\n~OO,OO \n\n:00.00 \n\nError x 10-.) \n\nSums of sin's \n\n~)().00~~~~r------------------------------\n\n400.00 -t-------~~~~ . . . ~r-\n\n0.00 \n\n100.00 \n\n200.00 \n\n~lmpie SP \n\u00b73ate\u00b7i{\u00b7::: .. \u00b7 ... \n\n-:!pocns \n\n\f466 \n\nFinnoff \n\n7 REFERENCES \n\n[Ben,Met,Pri,87] Benveniste, A., Metivier, M., Priouret, P., Adaptive Algorithms \n\nand Stochastic Approximations, Springer Verlag, 1987. \n\n[Bou,85] Bouton C., Approximation Gaussienne d'algorithmes stochastiques a dy(cid:173)\n\nnamique Markovienne. Thesis, Paris VI, (in French), 1985. \n\n[Da,M,91] Darken C. and Moody J., Note on learning rate schedules for stochastic \noptimization, inAdvances in Neural Information Processing Systems 3, Lipp(cid:173)\nmann, R. Moody, J., and Touretzky, D., ed., Morgan Kaufmann, San Mateo, \n1991. \n\n[F ,H,Z,92] Improving model selection by nonconvergent methods. To appear in \n\nNeural Networks. \n\n[H,K,91], Hornik, K. and Kuan, C.M., Convergence of Learning Algorithms with \nconstant learning rates, IEEE Trans. on Neural Networks 2, pp. 484-489, \n(1991). \n\n[Wh,89a] White, H., Some asymptotic results for learning in single hidden-layer \nfeedforward network models, Jour. Amer. Stat. Ass. 84, no. 408, p. 1003-\n1013, 1989. \n\n[W,89b] White, H., Learning in artificial neural networks: A statistical perspective, \n\nNeural Computation 1, p.425-464, 1989. \n\n\f", "award": [], "sourceid": 650, "authors": [{"given_name": "William", "family_name": "Finnoff", "institution": null}]}