{"title": "Green's Function Method for Fast On-Line Learning Algorithm of Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 333, "page_last": 340, "abstract": null, "full_text": "Green's Function Method for Fast On-line Learning Algorithm of Recurrent Neural Networks

Guo-Zheng Sun, Hsing-Hen Chen and Yee-Chun Lee
Institute for Advanced Computer Studies
and
Laboratory for Plasma Research,
University of Maryland
College Park, MD 20742

Abstract

The two well known learning algorithms for recurrent neural networks are back-propagation (Rumelhart et al., Werbos) and forward propagation (Williams and Zipser). The main drawback of back-propagation is its off-line backward path in time for error accumulation, which violates the on-line requirement of many practical applications. Although the forward propagation algorithm can be used in an on-line manner, its annoying drawback is the heavy computational load required to update the high dimensional sensitivity matrix (O(N^4) operations per time step). To develop a fast forward algorithm is therefore a challenging task. In this paper we propose a forward learning algorithm which is one order faster (only O(N^3) operations per time step) than the sensitivity matrix algorithm. The basic idea is that, instead of integrating the high dimensional sensitivity dynamic equation, we solve forward in time for its Green's function to avoid the redundant computations, and then update the weights whenever an error is to be corrected.

A numerical example of classifying state trajectories with a recurrent network is presented. It substantiates the faster speed of the proposed algorithm relative to Williams and Zipser's algorithm.

I. Introduction.

In order to deal with sequential signals, recurrent neural networks are often put forward as a useful model.
A particularly pressing issue concerning recurrent networks is the search for an efficient on-line training algorithm. The error back-propagation method (Rumelhart, Hinton, and Williams [1]) was originally proposed for feedforward networks. It can be applied to train recurrent networks if one unfolds the time sequence of mappings into a multilayer feed-forward net, each layer with identical weights. Due to the nature of the backward path, it is basically an off-line method. Pineda [2] generalized it to recurrent networks with hidden neurons; however, he was mostly interested in time-independent, fixed-point types of behavior. Pearlmutter [3] proposed a scheme to learn temporal trajectories which involves equations to be solved backward in time; it is essentially a generalized version of error back-propagation applied to the problem of learning a target state trajectory. The viable on-line method to date is the RTRL (Real Time Recurrent Learning) algorithm (Williams and Zipser [4]), which propagates a sensitivity matrix forward in time. The main drawback of this algorithm is its high computational cost: it needs O(N^4) operations per time step. A faster (fewer than O(N^4) operations) on-line algorithm is therefore desirable.

Toomarian and Barhen [5] proposed an O(N^2) on-line algorithm. They derived the same equations as Pearlmutter's back-propagation using an adjoint-operator approach, and then tried to convert the backward path into a forward path by adding a Delta function to its source term. But this is not correct. The problem is not merely that it "precludes straightforward numerical implementation", as they acknowledged later [6]; even in theory the result is incorrect. The mistake is in their use of an ill-defined identity for the Delta function integral.
Briefly speaking, the identity ∫_{t0}^{tf} δ(t - tf) f(t) dt = f(tf) does not hold if the function f(t) is discontinuous at t = tf. The value of the left-hand integral depends on the distribution of the function f(t) and is therefore not uniquely defined. If we treat the discontinuity carefully by splitting the time interval from t0 to tf into two segments, t0 to tf - ε and tf - ε to tf, and letting ε → 0, we find that adding a Delta function to the source term does not affect the basic property of the adjoint equation: it still has to be solved backward in time.

Recently, Toomarian and Barhen [6] modified their adjoint-operator approach and proposed an alternative O(N^3) on-line training algorithm. Although their result is in nature very similar to what we present in this paper, it will be seen that our approach is more straightforward and can be implemented numerically with greater ease.

Schmidhuber [7] proposed an O(N^3) algorithm which is a combination of back propagation (within each data block of size N) and forward propagation (between blocks). It is therefore not truly an on-line algorithm.

Sun, Chen and Lee [8] studied this problem using a more general variational approach, in which a constrained optimization problem with Lagrange multipliers was considered. The dynamic equation of the Lagrange multiplier was derived, which is exactly the same as the adjoint equation [5]. By taking advantage of the linearity of this equation an O(N^3) on-line algorithm was derived; but the numerical implementation of the algorithm, and especially its numerical instabilities, was not addressed in that paper.

In this paper we present a new approach to this problem: the Green's function method. The advantages of this method are its simple mathematical formulation and easy numerical implementation.
One numerical example of trajectory classification is presented to substantiate the faster speed of the proposed algorithm. The numerical results are benchmarked against Williams and Zipser's algorithm.

II. Green's Function Approach.
(a) Definition of the Problem

Consider a fully recurrent network with neural activity represented by an N-dimensional vector x(t). The dynamic equations can be written in general as a set of first order differential equations:

dx/dt = F(x(t), w, I(t))    (1)

where w is a matrix representing the set of weights and all other adjustable parameters, and I(t) is a vector representing the neuron units clamped by external input signals at time t. For a simple network connected by first order weights the nonlinear function F may look like

F(x(t), w, I(t)) = -x(t) + g(w · x) + I(t)    (2)

where the scalar function g(u) could be, for instance, the Sigmoid function g(u) = 1/(1 + e^{-u}). Suppose that part of the state neurons {x_i | i ∈ M} are measurable and part of the neurons {x_i | i ∈ H} are hidden. For the measurable units we may have a desired output x̄(t). In order to train the network, an objective functional (or error measure functional) is often given as

E(x, x̄) = ∫_{t0}^{tf} e(x(t), x̄(t)) dt    (3)

where the functional E depends on the weights w implicitly through the measurable neurons {x_i | i ∈ M}. A typical error function is

e(x(t), x̄(t)) = (x(t) - x̄(t))^2    (4)

Gradient descent learning modifies the weights according to

Δw ∝ -∂E/∂w = -∫_{t0}^{tf} (∂e/∂x) · (∂x/∂w) dt    (5)

In order to evaluate the integral in Eq. (5) one needs to know both ∂e/∂x and ∂x/∂w. The first term is easily obtained by taking the derivative of the given error function e(x(t), x̄(t)).
For the second term one needs to solve the differential equation

d/dt (∂x/∂w) = (∂F/∂x) · (∂x/∂w) + ∂F/∂w    (6)

which is easily derived by taking the derivative of Eq. (1) with respect to w. The well known forward algorithm for recurrent networks [4] solves Equation (6) forward in time and makes the weight correction at the end (t = tf) of the input sequence. (This algorithm was developed independently by several researchers, but due to the page limitation we cannot refer to all related papers and simply call it Williams and Zipser's algorithm.) On-line learning makes a weight correction whenever an error is to be corrected during the input sequence:

Δw(t) = -η (∂e/∂x) · (∂x/∂w)    (7)

The proof of convergence of the on-line learning algorithm will be addressed elsewhere.

The main drawback of this forward algorithm is that it requires O(N^4) operations per time step to update the matrix ∂x/∂w. The goal of our Green's function approach is to find an on-line algorithm with a smaller computational load.

(b) Green's Function Solution

First let us analyze the computational complexity of integrating Eq. (6) directly. Rewrite Eq. (6) as

L · (∂x/∂w) = ∂F/∂w    (8)

where the linear operator L is defined as L = d/dt - ∂F/∂x.

Two types of redundancy can be seen in Eq. (8). First, the operator L does not depend on w explicitly, which means that in solving for ∂x/∂w we repeatedly solve the identical differential equation for each component of w. This is redundant, and especially wasteful when higher order connection weights are used. The second redundancy is in the special form of ∂F/∂w for neural computations, where the same activity function (say, the Sigmoid function) is used for every neuron, so that

∂F_k/∂w_ij = g'(Σ_l w_kl x_l) δ_ki x_j    (9)

where δ_ki is the Kronecker delta function.
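To make the cost and the sparsity concrete, the sensitivity recursion of Eq. (6) with the source term of Eq. (9) can be sketched in NumPy as follows. This is an illustrative Euler discretization with hypothetical names, not the implementation used in the paper:

```python
import numpy as np

def g(u):
    # sigmoid activity function, as in Eq. (2)
    return 1.0 / (1.0 + np.exp(-u))

def g_prime(u):
    # derivative of the sigmoid
    s = g(u)
    return s * (1.0 - s)

def rtrl_step(x, P, w, I, dt):
    """One Euler step of the sensitivity recursion, Eq. (6).

    x : state vector, shape (N,)
    P : sensitivity tensor dx/dw, shape (N, N, N), P[m, i, j] = dx_m/dw_ij
    w : weight matrix, shape (N, N)
    """
    N = len(x)
    u = w @ x
    # Jacobian dF/dx for F = -x + g(w.x) + I
    J = -np.eye(N) + g_prime(u)[:, None] * w
    # Source term dF_k/dw_ij = g'(u_k) * delta_ki * x_j, Eq. (9):
    # only the N^2 entries with k = i are nonzero.
    B = np.zeros_like(P)
    k = np.arange(N)
    B[k, k, :] = g_prime(u)[:, None] * x[None, :]
    # dP/dt = J.P + B; the contraction J.P costs O(N^4) per time step
    P_new = P + dt * (np.einsum('mk,kij->mij', J, P) + B)
    x_new = x + dt * (-x + g(u) + I)
    return x_new, P_new
```

The einsum contraction J·P is the O(N^4) step that dominates the forward algorithm; it is exactly this per-time-step cost that the Green's function construction removes.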
It is seen from Eq. (9) that among the N^3 components of this third order tensor most of them, N^2(N-1), are zero (when k ≠ i) and need not be computed repeatedly. The original forward learning scheme pays no attention to this redundancy.

Our Green's function approach avoids the redundancy by solving for the low dimensional Green's function, and then constructing the solution of Eq. (8) as the dot product of ∂F/∂w with the Green's function, which can in turn be reduced to a scalar product due to Eq. (9). The Green's function of the operator L is defined as a two-time tensor function G(t, τ) which satisfies

d/dt G(t, τ) - (∂F/∂x) · G(t, τ) = δ(t - τ) 1    (10)

It is well known that, if the solution of Eq. (10) is known, the solution of the original equation Eq. (6) (or (8)) can be constructed from the source term ∂F/∂w through the integral

(∂x/∂w)(t) = ∫_{t0}^{t} G(t, τ) · (∂F/∂w)(τ) dτ    (11)

To find the Green's function we first introduce a tensor function V(t) that satisfies the homogeneous form of Eq. (10):

d/dt V(t) - (∂F/∂x) · V(t) = 0,  V(t0) = 1    (12)

The solution of Eq. (10), i.e. the Green's function, can then be constructed as

G(t, τ) = V(t) · V^{-1}(τ) H(t - τ)    (13)

where H(t - τ) is the Heaviside function, equal to 1 for t ≥ τ and to 0 for t < τ.

Whenever an error is generated, the weights are updated as

Δw_ij(t) = -η Σ_k v_k(t) Ω_kij(t)    (24)

To summarize the procedure of the Green's function method: we simultaneously integrate Eq. (21) and Eq. (22) for U(t) and Ω forward in time, starting from U_ij(t0) = δ_ij and Ω_ijk(t0) = 0. Whenever an error message is generated, we solve Eq. (23) for v(t) and update the weights according to Eq. (24).

The memory size required by this algorithm is simply N^2 + N^3, for storing U(t) and Ω(t). The speed of the algorithm is analyzed as follows. From Eq. (21) and Eq.
(22) we see that the updates of U(t) and Ω both need N^3 operations per time step. Solving for v(t) and updating w also needs N^3 operations per time step. So the on-line updating of the weights needs in total 4N^3 operations per time step. This is one order of magnitude faster than the current forward learning scheme.

III. Numerical Simulation

We present in this section numerical examples to demonstrate the proposed learning algorithm and benchmark it against Williams and Zipser's algorithm.

Fig. 1. Phase space trajectories (columns: Class 1, Class 2, Class 3). Three different shapes of 2-D trajectory; each class is shown in one column with three examples. Recurrent neural networks are trained to recognize the different shapes of trajectory.

We consider the trajectory classification problem. The input data are time series of two-dimensional coordinate pairs (x(t), y(t)) sampled along three different types of trajectories in the phase space. The sampling is taken uniformly with Δt = 2π/60. The trajectory equations are

Class 1: x(t) = sin(t + β) |sin(t)|,        y(t) = cos(t + β) |sin(t)|
Class 2: x(t) = sin(0.5t + β) sin(1.5t),    y(t) = cos(0.5t + β) sin(1.5t)
Class 3: x(t) = sin(t + β) sin(2t),         y(t) = cos(t + β) sin(2t)

where β is a uniformly distributed random parameter. When β is changed, these trajectories are distorted accordingly. Nine examples (three for each class) are shown in Fig. 1. The neural net used here is a fully recurrent first-order network with dynamics

S_i(t+1) = S_i(t) + Tanh( Σ_{j=1}^{N+6} w_ij (S(t) ⊕ I(t))_j )    (25)

where S and I are the vectors of state and input neurons, the symbol ⊕ represents concatenation, and N is the number of state neurons. Six input neurons are used to represent the normalized vector {1, x(t), y(t), x(t)^2, y(t)^2, x(t)y(t)}. The neural network structure is shown in Fig. 2.

Fig. 2. Recurrent Neural Network for Trajectory Classification. The state neurons are checked at the end of the input sequence, where error = Target - State.

For recognition, each trajectory data sequence is fed to the input neurons and the state neurons evolve according to the dynamics in Eq. (25). At the end of the input series we check the last three state neurons and classify the input trajectory according to the "winner-take-all" rule. For training, we assign the desired final outputs for the three trajectory classes as (1,0,0), (0,1,0) and (0,0,1) respectively. Meanwhile, we simultaneously integrate Eq. (21) for U(t) and Eq. (22) for Ω. At the end, we calculate the error from Eq. (4), solve Eq. (23) for v(t) using an LU decomposition algorithm, and finally update the weights according to Eq. (24). Since the classification error is generated at the end of the input sequence, this learning does not have to be on-line; we present this example only to compare the speed of the proposed fast algorithm against Williams and Zipser's. We run the two algorithms for the same number of iterations and compare the CPU time used. The results are shown in Table 1, where in each iteration we present 150 training patterns, 50 for each class. These patterns are chosen by randomly selecting β values. It is seen that the CPU time ratio is O(1/N), indicating that the Green's function algorithm is one order faster in N.

Another issue to be considered is the error convergence rate (or learning rate, as it is usually called). Although the two algorithms calculate the same weight correction as in Eq. (7), the outcomes may differ due to the different numerical schemes. As a result, the error convergence rates are slightly different even if the same learning rate η is used.
In all the numerical simulations we have conducted, the learning results are very good (in testing the recognition is perfect; not a single misclassification was found), but during training the error convergence rates differ. The numerical experiments show that the proposed fast algorithm converges more slowly than Williams and Zipser's for small neural nets but faster for large neural nets.

Simulation                           Fast Algorithm   Williams & Zipser's   Ratio
N=4  (Number of Iterations = 200)          1607.4                 5020.8     1:3
N=8  (Number of Iterations = 50)           1981.7                10807.0     1:5
N=12 (Number of Iterations = 50)           5947.6                45503.0     1:8

Table 1. The CPU time (in seconds) comparison, implemented on a DEC3100 workstation, for learning the trajectory classification example.

IV. Conclusion

The Green's function method has been used to develop a faster on-line learning algorithm for recurrent neural networks. This algorithm requires O(N^3) operations per time step, which is one order faster than Williams and Zipser's algorithm. The memory required is O(N^3).

One feature of this algorithm is its straightforward formulation, which can be easily implemented numerically. A numerical example of trajectory classification has been used to demonstrate the speed of this fast algorithm compared with Williams and Zipser's algorithm.

References

[1] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, MIT Press, 1986; P. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis, Harvard University, 1974.
[2] F. Pineda, Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett., 59(19):2229, 1987.
[3] B. Pearlmutter, Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263, 1989.
[4] R. Williams and D. Zipser, A learning algorithm for continually running fully recurrent neural networks. Tech. Report ICS-8805, UCSD, La Jolla, CA 92093, November 1988.
[5] N. Toomarian, J. Barhen and S. Gulati, "Application of Adjoint Operators to Neural Learning", Appl. Math. Lett., 3(3):13-18, 1990.
[6] N. Toomarian and J. Barhen, "Adjoint-Functions and Temporal Learning Algorithms in Neural Networks", Advances in Neural Information Processing Systems 3, pp. 113-120, ed. R. Lippmann, J. Moody and D. Touretzky, Morgan Kaufmann, 1991.
[7] J. H. Schmidhuber, "An O(N^3) Learning Algorithm for Fully Recurrent Networks", Tech. Report FKI-151-91, Institut für Informatik, Technische Universität München, May 1991.
[8] Guo-Zheng Sun, Hsing-Hen Chen and Yee-Chun Lee, "A Fast On-line Learning Algorithm for Recurrent Neural Networks", Proceedings of the International Joint Conference on Neural Networks, Seattle, Washington, page II-13, June 1991.
", "award": [], "sourceid": 504, "authors": [{"given_name": "Guo-Zheng", "family_name": "Sun", "institution": null}, {"given_name": "Hsing-Hen", "family_name": "Chen", "institution": null}, {"given_name": "Yee-Chun", "family_name": "Lee", "institution": null}]}