{"title": "Asymptotics of Gradient-based Neural Network Training Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 335, "page_last": 342, "abstract": null, "full_text": "Asymptotics of Gradient-based \n\nNeural Network 'fraining Algorithms \n\nSayandev Mukherjee \nsaymukh~ee.comell.edu \n\nTerrence L. Fine \n\ntlfine~ee.comell.edu \n\nSchool of Electrical Engineering \n\nCornell University \nIthaca, NY 14853 \n\nSchool of Electrical Engineering \n\nCornell University \nIthaca, NY 14853 \n\nAbstract \n\nWe study the asymptotic properties of the sequence of iterates of \nweight-vector estimates obtained by training a multilayer feed for(cid:173)\nward neural network with a basic gradient-descent method using \na fixed learning constant and no batch-processing. In the one(cid:173)\ndimensional case, an exact analysis establishes the existence of a \nlimiting distribution that is not Gaussian in general. For the gen(cid:173)\neral case and small learning constant, a linearization approximation \npermits the application of results from the theory of random ma(cid:173)\ntrices to again establish the existence of a limiting distribution. \nWe study the first few moments of this distribution to compare \nand contrast the results of our analysis with those of techniques of \nstochastic approximation. \n\n1 \n\nINTRODUCTION \n\nThe wide applicability of neural networks to problems in pattern classification and \nsignal processing has been due to the development of efficient gradient-descent al(cid:173)\ngorithms for the supervised training of multilayer feedforward neural networks with \ndifferentiable node functions. A basic version uses a fixed learning constant and up(cid:173)\ndates all weights after each training input is presented (on-line mode) rather than \nafter the entire training set has been presented (batch mode) . The properties of \nthis algorithm as exhibited by the sequence of iterates are not yet well-understood. 
\nThere are at present two major approaches. \n\n\f336 \n\nSayandev Mukherjee, Terrence L. Fine \n\nStochastic approximation techniques (Bucklew, Kurtz, Sethares, 1993; Finnoff, 1993; Kuan, Hornik, 1991; White, 1989) study the limiting behavior of the stochastic process that is the piecewise-constant or piecewise-linear interpolation of the sequence of weight-vector iterates (assuming infinitely many i.i.d. training inputs) as the learning constant approaches zero. It can be shown (Bucklew, Kurtz, Sethares, 1993; Finnoff, 1993) that as the learning constant tends to zero, the fluctuation between the paths and their limit, suitably normalized, tends to a Gaussian diffusion process. \n\nLeen and Moody (1993) and Orr and Leen (1993) have considered the Markov process formed by the sequence of iterates (again, assuming infinitely many i.i.d. training inputs) for a fixed nonzero learning constant. This approach has the merit of dealing with the nonzero-learning-constant case and of linking the study of the training algorithm with the well-developed literature on Markov processes. \n\nIn particular, it is possible to solve (Leen, Moody, 1993) for the asymptotic distribution of the sequence of weight-vector iterates from the Chapman-Kolmogorov equation after certain assumptions have been used to simplify it considerably. However, the assumptions are unrealistic: in particular, the assumption of detailed balance does not hold in more than one dimension. This approach also fails to establish the existence of a limiting distribution in the general case. \n\nThis paper follows the method of considering the sequence of weight-vector iterates as a discrete-time, continuous state-space Markov process, when the learning constant is fixed and nonzero. We shall first seek to establish the existence of an asymptotic distribution, and then examine this distribution through its first few moments. 
\n\nIt can be proved (Mukherjee, 1994), using Foster's criteria (Tweedie, 1976) for the positive-recurrence of a Markov process, that when a single sigmoidal node with one parameter is trained using the iterative form of the basic gradient-descent training algorithm (without batch-processing), the sequence of iterates of the parameter has a limiting distribution which is in general non-Gaussian, thereby qualifying the oft-stated claims in the literature (see, for example, (Bucklew, Kurtz, Sethares, 1993; Finnoff, 1993; White, 1989)). However, this method proves to be intractable in the multiple-parameter case. \n\n2 THE GENERAL CASE AND LINEARIZATION IN W_n \n\nThe general version of this problem for a neural network \\eta with scalar output involves training \\eta with the i.i.d. training sequence {(X_n, Y_n)}, loss function \\ell(x, y, w) = (1/2)[y - \\eta(x, w)]^2 (x \\in R^d, y \\in R, w \\in R^m), and the gradient-descent updating equation for the estimates of the weight vector given by \n\nW_{n+1} = W_n - \\mu \\nabla_w \\ell(x, y, w)|_{(X_{n+1}, Y_{n+1}, W_n)} = W_n + \\mu [Y_{n+1} - \\eta(W_n, X_{n+1})] \\nabla_w \\eta(w, x)|_{(W_n, X_{n+1})}. \n\nAs is customary in this kind of analysis, the training set is assumed infinite, so that {W_n}_{n=0}^{\\infty} forms a homogeneous Markov process in discrete time. In our analysis, the training data is assumed to come from the model \n\nY = \\eta(w^0, X) + Z, \n\nwhere Z and X are independent, and Z has zero mean and variance \\sigma^2. Hence, the unrestricted Bayes estimator of Y given X, E(Y|X) = \\eta(w^0, X), is in the class of neural network estimators, and w^0 is the goal of training. Linearization about w^0 (carried out below) puts the updating equation in the form \n\n\\tilde{W}_{n+1} = A_{n+1} \\tilde{W}_n + B_{n+1},   (2) \n\nwhere \n\nB_{n+1} = \\mu Z_{n+1} \\nabla_w \\eta(w, x)|_{(w^0, X_{n+1})}, \n\nA_{n+1} = I_m - \\mu (\\nabla_w \\eta(w, x)|_{(w^0, X_{n+1})})(\\nabla_w \\eta(w, x)|_{(w^0, X_{n+1})})^T + \\mu Z_{n+1} \\nabla_w \\nabla_w \\eta(w, x)|_{(w^0, X_{n+1})}   (3) \n\n= I_m - \\mu (G_{n+1} - Z_{n+1} J_{n+1}).   (4) 
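As an illustration, the on-line update above can be sketched for a single sigmoidal node \eta(w, x) = 1/(1 + e^{-wx}). This is a minimal sketch, not the paper's experimental code; the settings (w^0 = 3, X ~ N(0,1), \mu = 0.1, \sigma = 0.1) are assumed for concreteness.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_online(w0, mu, sigma, n_steps, seed=0):
    """On-line gradient descent for the scalar node eta(w, x) = sigmoid(w*x):
    W_{n+1} = W_n + mu * (Y_{n+1} - eta(W_n, X_{n+1})) * d/dw eta(W_n, X_{n+1}),
    with training data Y = eta(w0, X) + Z, Z ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    w = 0.0  # arbitrary initial weight
    for _ in range(n_steps):
        x = rng.standard_normal()
        y = sigmoid(w0 * x) + sigma * rng.standard_normal()
        s = sigmoid(w * x)
        w += mu * (y - s) * s * (1.0 - s) * x  # s*(1-s)*x = d/dw sigmoid(w*x)
    return w

w_final = train_online(w0=3.0, mu=0.1, sigma=0.1, n_steps=50000)
```

With a fixed learning constant the iterates do not converge but jitter in a neighborhood of w^0, which is why their limiting distribution, rather than a limit point, is the object of study.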
For convenience, we define \\tilde{W} = w - w^0. \n\nAssuming that \\mu is small and that, after a while, successive iterates with high probability jitter about in a close neighborhood of the optimal value w^0, we make the important assumption that \n\n\\tilde{W}_n = O_p(\\mu^k)   (1) \n\nfor some 0 < k < 1 (see Section 4); i.e., (\\forall \\epsilon > 0)(\\exists M_{\\epsilon})(\\forall n) P(\\mu^{-k} ||\\tilde{W}_n|| \\le M_{\\epsilon}) \\ge 1 - \\epsilon. Applying Taylor series expansions to \\eta and \\nabla_w \\eta and neglecting all terms of order O_p(\\mu^{1+2k}) and higher, we obtain the linearized form (2) of the updating equation, \\tilde{W}_{n+1} = A_{n+1} \\tilde{W}_n + B_{n+1}, where \n\nA_{n+1} = I_m - \\mu (G_{n+1} - Z_{n+1} J_{n+1}), \n\nB_{n+1} = \\mu Z_{n+1} \\nabla_w \\eta(w, x)|_{(w^0, X_{n+1})}, \n\nG_{n+1} = (\\nabla_w \\eta(w, x)|_{(w^0, X_{n+1})})(\\nabla_w \\eta(w, x)|_{(w^0, X_{n+1})})^T, \n\nJ_{n+1} = \\nabla_w \\nabla_w \\eta(w, x)|_{(w^0, X_{n+1})} \n\ndo not depend on W_n. The matrices {(A_{n+1}, B_{n+1})} form an i.i.d. sequence, but A_{n+1} and B_{n+1} are dependent for each n. Hence the linearized \\tilde{W}_n again forms a homogeneous Markov process in discrete time. \n\nIn what follows we analyze this process in the hope that its asymptotics agree with those of the original Markov process. \n\n3 EXISTENCE OF A LIMITING DISTRIBUTION \n\nLet A, B, G, J denote random matrices with the common distributions of the i.i.d. sequences {A_n}, {B_n}, {G_n}, and {J_n} respectively, and let T : R^m \\to R^m be the random affine transformation \n\nw \\mapsto A w + B. \n\nThe following result establishes the existence of a limiting distribution of \\tilde{W}_n. \n\nLemma 1 (Berger, Thm. V, p. 162) Suppose \n\nE[\\log^+ ||A|| + \\log^+ ||B||] < \\infty;   (5) \n\nE \\log ||A_n A_{n-1} \\cdots A_1|| < 0 for some n,   (6) \n\nwhere \\log^+ x = \\log x \\vee 0. Then the following conclusions hold: \n\n1. Unique stationary distribution: there exists a random variable \\tilde{W} \\in R^m, unique up to distribution, that is stationary with respect to T (i.e., \\tilde{W} is independent of T, and T\\tilde{W} has the same distribution as \\tilde{W}). \n\n2. Asymptotic stationarity: we have convergence in distribution, \\tilde{W}_n \\to \\tilde{W}. 
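A minimal numerical sketch of this setting: for an assumed toy case of a linear node \eta(w, x) = w \cdot x (so \nabla_w \eta = x, the Hessian J vanishes, A = I - \mu x x^T, and B = \mu z x), iterating the random affine map long enough produces draws whose empirical variance is close to the stationary value \mu\sigma^2/2 derived in Section 4. The parameters below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, m = 0.05, 0.5, 2  # assumed toy settings

def step(w):
    """One application of the random affine map w -> A w + B, specialized to
    a linear node: A = I - mu * x x^T, B = mu * z * x."""
    x = rng.standard_normal(m)
    z = sigma * rng.standard_normal()
    A = np.eye(m) - mu * np.outer(x, x)
    return A @ w + mu * z * x

w = np.zeros(m)
for _ in range(2000):            # burn-in toward the stationary law
    w = step(w)
samples = []
for _ in range(4000):            # thinned draws from the chain
    for _ in range(25):
        w = step(w)
    samples.append(w[0])
emp_var = float(np.var(samples))
target = mu * sigma**2 / 2       # leading-order stationary variance (Section 4)
```

For this small \mu the empirical variance lands within a few percent of \mu\sigma^2/2 (the exact stationary variance carries an O(\mu) correction).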
\n\nOur choice of norm is the operator norm for the matrix A, \n\n||A|| = \\max |\\lambda(A)|, \n\nwhere {\\lambda(A)} are the eigenvalues of A, and the Euclidean norm for the vector B, \n\n||B||^2 = \\sum_{i=1}^m |B_i|^2. \n\nWe first verify (5). From the inequality \\log^+ x \\le x^2 for all x \\in R, it is easily seen that if \\eta is a feedforward net where all activation functions are twice continuously differentiable in the weights, all hidden-layer activation functions are bounded and have bounded derivatives up to order 2, and if the training sequence (X_n, Y_n) is i.i.d. with finite fourth moments, then (5) holds for the Euclidean norm for B and the Frobenius norm for A, ||A||_F^2 = \\sum_{i=1}^m \\sum_{j=1}^m |A_{ij}|^2. Since \n\n(\\max |\\lambda(A)|)^2 \\le \\sum |\\lambda(A)|^2 \\le \\sum_{i=1}^m \\sum_{j=1}^m |A_{ij}|^2,   (7) \n\nwe see that (5) also holds for the operator norm of A. \n\nAssumption (6) forces the product A_n \\cdots A_1 to tend to 0_{m \\times m} almost surely (Berger, 1993, p. 146) and therefore removes the dependence of the asymptotic distribution of {\\tilde{W}_n} on that of the initial value \\tilde{W}_0. A sufficient condition for (6) is given by the following lemma. \n\nLemma 2 Suppose EG is positive definite (note that it is positive semidefinite by definition), and for all n, E||A_n|| < \\infty. Then (6) holds for sufficiently small, positive \\mu. \n\nProof: By assumption, \\min \\lambda(EG) = \\delta > 0 for some \\delta. Let H_n = (1/n) \\sum_{i=1}^n (G_i - Z_i J_i). By the Strong Law of Large Numbers applied to the i.i.d. random matrices (G_i - Z_i J_i), we have H_n \\to EG a.s., so \n\n\\min \\lambda(H_n) \\to \\min \\lambda(EG) a.s.   (8) \n\nApplying (7) to \\min \\lambda(H_n), it is easily shown that the same conditions on \\eta and the training sequence that are sufficient for (5) also give \\sup_n E[\\min \\lambda(H_n)]^2 < \\infty, which in turn implies that {\\min \\lambda(H_n)} are uniformly integrable. Together with (8), this implies (Loeve, 1977, p. 165) that \\min \\lambda(H_n) \\to \\min \\lambda(EG) in L^1. 
Hence there exists some (nonrandom) N, say, such that E|\\min \\lambda(H_N) - \\min \\lambda(EG)| \\le \\delta/2. Since \n\n|E \\min \\lambda(H_N) - \\min \\lambda(EG)| \\le E|\\min \\lambda(H_N) - \\min \\lambda(EG)| \\le \\delta/2, \n\nwe therefore have \n\nE[\\min \\lambda((1/N) \\sum_{i=1}^N (G_i - Z_i J_i))] \\ge \\min \\lambda(EG) - \\delta/2 = \\delta - \\delta/2 = \\delta/2 > 0.   (9) \n\nWe shall prove that (6) holds for this N (\\ge m) by showing that \n\nE \\log ||A_N A_{N-1} \\cdots A_1||^2 = 2 E \\log ||A_N A_{N-1} \\cdots A_1|| < 0. \n\nFor our choice of norm, we therefore want E[\\log (\\max |\\lambda(A_N \\cdots A_1)|)^2] < 0. From Jensen's inequality, it is sufficient to have \\log E[\\max |\\lambda(A_N \\cdots A_1)|^2] < 0, or equivalently, \n\nE \\max |\\lambda(A_N \\cdots A_1)|^2 < 1.   (10) \n\nNow, since N is fixed, we can choose \\mu small enough that \n\nA_N \\cdots A_1 = I_m - \\mu \\sum_{i=1}^N (G_i - Z_i J_i) + O_p(\\mu^2). \n\nHence \\lambda(A_N \\cdots A_1) = 1 - \\mu \\lambda(\\sum_{i=1}^N (G_i - Z_i J_i)) + O_p(\\mu^2), and N is fixed, so \n\n|\\lambda(A_N \\cdots A_1)|^2 = 1 - 2\\mu \\lambda(\\sum_{i=1}^N (G_i - Z_i J_i)) + O_p(\\mu^2), \n\ngiving \n\n\\max |\\lambda(A_N \\cdots A_1)|^2 \\le 1 - 2\\mu \\min \\lambda(\\sum_{i=1}^N (G_i - Z_i J_i)) + O_p(\\mu^2), \n\nE \\max |\\lambda(A_N \\cdots A_1)|^2 \\le 1 - N\\delta\\mu + o(\\mu),   (11) \n\nwhere we use (9) and the observation that the structure of the last O_p(\\mu^2) term is such that its expectation (guaranteed finite by the hypothesis E||A_N|| < \\infty) is O(\\mu^2), hence o(\\mu), and we also restrict \\mu < 1/(N\\delta) so that \n\n1 - 2\\mu E \\min \\lambda(\\sum_{i=1}^N (G_i - Z_i J_i)) > 0. \n\nFrom (11), it is clear that (10) holds for all sufficiently small, positive \\mu (<< 1/(N\\delta)). Therefore (6) holds for n = N. \n\nWe can combine these two lemmas into the following theorem. \n\nTheorem 1 Let \\eta be a feedforward net where all activation functions are twice continuously differentiable in the weights, all hidden-layer activation functions are bounded and have bounded derivatives up to order 2, and let the training sequence (X_n, Y_n) be i.i.d. with finite fourth moments. Further, assume that EG is positive definite. 
\nThen, for all sufficiently small, positive \\mu, the sequence of random vectors {\\tilde{W}_n}_{n=1}^{\\infty} obtained from the updating equation (2) has a unique limiting distribution. \n\nWe circumvent the generally intractable problem of finding the limiting distribution by calculating and investigating the behavior of its moments. \n\n4 MOMENTS OF THE LIMITING DISTRIBUTION \n\nLet us assume that the mean and variance of the limiting distribution exist, and that Z ~ N(0, \\sigma^2). From (2) and the form of A_{n+1} and B_{n+1}, it is easy to show that E\\tilde{W} = 0, i.e., EW = w^0, so the optimal value w^0 is the mean of the limiting distribution of the sequence of iterates {W_n}. It can also be shown (Mukherjee, 1994) that E\\tilde{W}\\tilde{W}^T = (\\mu\\sigma^2/2) I_m, yielding \\tilde{W} = O_p(\\sqrt{\\mu}). This is consistent with our assumption (1) with k = 1/2. \n\nIn the one-dimensional case (d = m = 1), we have E\\tilde{W} = 0 and E\\tilde{W}^2 = (1/2)\\mu\\sigma^2 if E[X_{n+1} \\eta'(w^0 X_{n+1})]^2 \\neq 0. Using these results, the fact that Z ~ N(0, \\sigma^2), E\\tilde{W} = 0, the independence of Z and X, and assuming that EX^8 < \\infty, it is not difficult to compute the expressions \n\nE\\tilde{W}^3 = \\mu^2\\sigma^4 E[X^3\\eta''\\eta'(1 - \\mu X^2\\eta'^2)] / E[X^2\\eta'^2 - \\mu X^4(\\eta'^4 + \\sigma^2\\eta''^2) + \\mu^2 X^6\\eta'^2(\\eta'^4/3 + \\sigma^2\\eta''^2)] \n\nand \n\nE\\tilde{W}^4 = ( E[X^2\\eta'^2(1 - \\mu X^2\\eta'^2)^2 + \\mu X^4\\eta'^4 + 3\\mu^2 X^6\\eta''^2\\eta'^2] + 18\\mu^2\\sigma^2 E[X^3\\eta''\\eta'(1 - \\mu X^2\\eta'^2)^2 + \\mu^4\\sigma^4 X^7\\eta''^3\\eta'] ) / K(\\mu), \n\nwhere \n\nK(\\mu) = E[X^2\\eta'^2 - (3/2)\\mu X^4(\\eta'^4 + \\sigma^2\\eta''^2(1 - \\mu X^2\\eta'^2)^2) + \\mu^2 X^6\\eta'^6 - (1/2)\\mu^3 X^8(\\eta'^8 + 3\\sigma^4\\eta''^4)], \n\nand \\eta' and \\eta'' are evaluated at the argument w^0 X of \\eta. \n\nFrom the above expressions, it is seen that if \\eta(a) = 1/[1 + e^{-a}] and X has a symmetric distribution (say N(0,1)), then E\\tilde{W}^3 \\neq 0 and E\\tilde{W}^4 \\neq 3(E\\tilde{W}^2)^2, implying that \\tilde{W} is non-Gaussian in general. This result is consistent with that obtained by direct application of Foster's criterion (Mukherjee, 1994). 
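These moment results can be probed by simulation. The sketch below (assumed settings w^0 = 3, X ~ N(0,1), \sigma = 0.1, \mu = 0.1, matching one case from Section 5; thinning rate chosen for illustration) estimates the stationary second moment of \tilde{W} = W - w^0 for the sigmoidal node and compares it with the leading-order prediction \mu\sigma^2/2.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
w0, mu, sigma = 3.0, 0.1, 0.1    # assumed settings (one case from Section 5)
w = w0                            # start at the optimum, near stationarity
devs = []
for n in range(400000):
    x = rng.standard_normal()
    y = sigmoid(w0 * x) + sigma * rng.standard_normal()
    s = sigmoid(w * x)
    w += mu * (y - s) * s * (1.0 - s) * x
    if n >= 50000 and n % 20 == 0:
        devs.append(w - w0)       # sample of tilde-W along the chain
second_moment = float(np.mean(np.square(devs)))
target = mu * sigma**2 / 2        # leading-order prediction for E[tilde-W^2]
```

The agreement holds only to leading order in \mu; it is the third and fourth moments above that expose the departure from Gaussianity.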
\n\n5 RECONCILING LINEARIZATION AND STOCHASTIC APPROXIMATION METHODS \n\nThe results of stochastic approximation analysis give a Gaussian distribution for \\tilde{W} in the limit as \\mu \\to 0 (Bucklew, Kurtz, Sethares, 1993; Finnoff, 1993). However, our results establish that the Gaussian distribution result is not valid for small nonzero \\mu in general. To reconcile these results, recall \\tilde{W} = O_p(\\sqrt{\\mu}). Hence, if we consider only moments of the normalized quantity \\tilde{W}/\\sqrt{\\mu} (and neglect higher-order terms in O_p(\\sqrt{\\mu})), we obtain E(\\tilde{W}/\\sqrt{\\mu})^3 = 0 and E(\\tilde{W}/\\sqrt{\\mu})^4 = 3[E(\\tilde{W}/\\sqrt{\\mu})^2]^2, which suggests that the normalized quantity \\tilde{W}/\\sqrt{\\mu} is Gaussian in the limit of vanishing \\mu, a conclusion also reached from stochastic approximation analysis. \n\nIn support of this theoretical indication that the conclusions of our analysis (based on linearization for small \\mu) might tally with those of stochastic approximation techniques for small values of \\mu, simulations were done on the simple one-dimensional training case of the previous section for 8 cases: \\mu = 0.1, 0.2, 0.3, 0.5, and \\sigma^2 = 0.1, 0.5 for each value of \\mu, with w^0 fixed at 3. For each of the 8 cases, either 5 or 10 runs were made, with lengths (for the given values of \\mu) of 810000, 500000, 300000, and 200000 respectively. Each run gave a pair of sequences {W_n} obtained by starting off at W_0 = 0 and training the network independently twice. Each resulting sequence {W_n} was then downsampled at a large enough rate that the true autocorrelation of the downsampled sequence was less than 0.05, followed by deleting the first 10% of the samples of this downsampled sequence, so as to remove any dependence on initial conditions that might persist. (Autocorrelation at lag unity for this Markov chain was so high that when \\mu = 0.1, a decimation rate of 9000 was required.) 
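The generate-decimate pipeline just described, together with the hypothesis tests reported below, can be sketched as follows. This is a simplified illustration with assumed parameters (much shorter runs and a decimation rate of 500 rather than the paper's 9000), using SciPy's two-sample Kolmogorov-Smirnov, skewness, and kurtosis tests.

```python
import numpy as np
from scipy import stats

def make_chain(seed, n=200000, w0=3.0, mu=0.1, sigma=0.5):
    """Train the 1-d sigmoidal node on-line; return the decimated iterate
    sequence with the first 10% of decimated samples deleted."""
    rng = np.random.default_rng(seed)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    w, path = 0.0, np.empty(n)
    for i in range(n):
        x = rng.standard_normal()
        y = sigmoid(w0 * x) + sigma * rng.standard_normal()
        s = sigmoid(w * x)
        w += mu * (y - s) * s * (1.0 - s) * x
        path[i] = w
    path = path[::500]             # decimate (assumed rate, not the paper's)
    return path[len(path) // 10:]  # drop first 10% of decimated samples

a, b = make_chain(0), make_chain(1)    # two independent training runs
ks_p = stats.ks_2samp(a, b).pvalue     # (a) same limiting distribution?
skew_p = stats.skewtest(a).pvalue      # (b) normality: skewness
kurt_p = stats.kurtosistest(a).pvalue  # (b) normality: kurtosis
```

At this short length the decimated samples remain somewhat correlated, so the p-values are only indicative; the paper's much heavier decimation is what justifies treating the samples as independent.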
\nThis was done to ensure that the elements of the resulting downsampled sequences could be assumed independent for the various hypothesis tests that were to follow. \n\n(a) For each run of each case, the empirical distribution functions of the two downsampled sequences thus generated were compared by means of the Kolmogorov-Smirnov test (Bickel, Doksum, 1977) at level 0.95, with the null hypothesis being that both sequences had the same actual cumulative distribution function (assumed continuous). This test was passed with ease on all trials, thereby showing that a limiting distribution existed and was attained by such a training algorithm. \n\n(b) For each run of each case, a skewness test and a kurtosis test (Bickel, Doksum, 1977) for normality were done at level 0.95. The sequences generated failed both tests for the (\\mu, \\sigma^2) pair (0.1, 0.1) and passed them both for the pairs (0.1, 0.5), (0.3, 0.1), (0.5, 0.1), and (0.5, 0.5). For the pair (0.2, 0.5), the skewness test was passed and the kurtosis test failed, and for the pairs (0.2, 0.1) and (0.3, 0.5), the skewness test was failed and the kurtosis test passed. \n\n(c) All trials cleared the Kolmogorov tests (Bickel, Doksum, 1977) for normality at level 0.95, both when the normal distribution was taken to have the sample mean and variance (computed on the downsampled sequence), and when the normal distribution function had the asymptotic values of mean (zero) and variance (\\mu\\sigma^2/2). \n\nHence we may conclude: \n\n1. The limiting distribution of {W_n} exists. \n\n2. For small values of \\mu, the deviation from Gaussianity is so small that the Gaussian distribution may be taken as a good approximation to the limiting distribution. \n\nIn other words, though stochastic approximation analysis states that \\tilde{W}/\\sqrt{\\mu} is Gaussian only in the limit of vanishing \\mu, our simulation shows that this is a good approximation for small values of \\mu as well. \n\nAcknowledgements \n\nThe research reported here was partially supported by NSF Grant SBR-9413001. \n\nReferences \n\nBerger, Marc A. An Introduction to Probability and Stochastic Processes. Springer-Verlag, New York, 1993. \n\nBickel, Peter, and Doksum, Kjell. Mathematical Statistics: Basic Ideas and Selected Topics. Holden-Day, San Francisco, 1977. \n\nBucklew, J.A., Kurtz, T.G., and Sethares, W.A. \"Weak Convergence and Local Stability Properties of Fixed Step Size Recursive Algorithms,\" IEEE Trans. Inform. Theory, vol. 39, pp. 966-978, 1993. \n\nFinnoff, W. \"Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm and Resistance to Local Minima.\" In Giles, C.L., Hanson, S.J., and Cowan, J.D., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA, 1993, p. 459 ff. \n\nKuan, C.-M., and Hornik, K. \"Convergence of Learning Algorithms with Constant Learning Rates,\" IEEE Trans. Neural Networks, vol. 2, pp. 484-488, 1991. \n\nLeen, T.K., and Moody, J.E. \"Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria,\" Adv. in NIPS 5, Morgan Kaufmann Publishers, San Mateo, CA, 1993, p. 451 ff. \n\nLoeve, M. Probability Theory I, 4th ed. Springer-Verlag, New York, 1977. \n\nMukherjee, Sayandev. Asymptotics of Gradient-based Neural Network Training Algorithms. M.S. thesis, Cornell University, Ithaca, NY, 1994. \n\nOrr, G.B., and Leen, T.K. \"Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times,\" Adv. in NIPS 5, Morgan Kaufmann Publishers, San Mateo, CA, 1993, p. 507 ff. \n\nRumelhart, D.E., Hinton, G.E., and Williams, R.J. 
\"Learning internal representations by error propagation.\" In D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing, Ch. 8, MIT Press, Cambridge, MA, 1986. \n\nTweedie, R.L. \"Criteria for Classifying General Markov Chains,\" Adv. Appl. Prob., vol. 8, pp. 737-771, 1976. \n\nWhite, H. \"Some Asymptotic Results for Learning in Single Hidden-Layer Feedforward Network Models,\" J. Am. Stat. Assn., vol. 84, pp. 1003-1013, 1989. \n", "award": [], "sourceid": 931, "authors": [{"given_name": "Sayandev", "family_name": "Mukherjee", "institution": null}, {"given_name": "Terrence", "family_name": "Fine", "institution": null}]}