{"title": "Time Dependent Adaptive Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 710, "page_last": 718, "abstract": null, "full_text": "710 \n\nPineda \n\nTime DependentAdaptive Neural Networks \n\nFernando J. Pineda \n\nCenter for Microelectronics Technology \n\nJet Propulsion Laboratory \n\nCalifornia Institute of Technology \n\nPasadena, CA 91109 \n\nABSTRACT \n\nA comparison of algorithms that minimize error functions to train the \ntrajectories of recurrent networks, reveals how complexity is traded off for \ncausality. These algorithms are also related to time-independent \nfonnalisms. It is suggested that causal and scalable algorithms are \npossible when the activation dynamics of adaptive neurons is fast \ncompared to the behavior to be learned. Standard continuous-time \nrecurrent backpropagation is used in an example. \n\n1 INTRODUCTION \n\nTraining the time dependent behavior of a neural network model involves the minimization \nof a function that measures the difference between an actual trajectory and a desired \ntrajectory. The standard method of accomplishing this minimization is to calculate the \ngradient of an error function with respect to the weights of the system and then to use the \ngradient in a minimization algorithm (e.g. gradient descent or conjugate gradient). \n\nTechniques for evaluating gradients and performing minimizations are well developed in the \nfield of optimal control and system identification, but are only now being introduced to the \nneural network community. Not all algorithms that are useful or efficient in control problems \nare realizable as physical neural networks. In particular, physical neural network algorithms \nmust satisfy locality, scaling and causality constraints. Locality simply is the constraint that \none should be able to update each connection using only presynaptic and postsynaptic \ninfonnation. 
There should be no need to use information from neurons or connections that are not in physical contact with a given connection. Scaling, for this paper, refers to the scaling law that governs the amount of computation or hardware that is required to perform the weight updates. For neural networks, where the number of weights can become very large, the amount of hardware or computation required to calculate the gradient must scale linearly with the number of weights. Otherwise, large networks are not possible. Finally, learning algorithms must be causal, since physical neural networks must evolve forwards in time. Many algorithms for learning time-dependent behavior, although they are seductively elegant and computationally efficient, cannot be implemented as physical systems because the gradient evaluation requires time evolution in two directions. In this paper, networks that violate the causality constraint will be referred to as unphysical.

It is useful to understand how scalability and causality trade off in various gradient evaluation algorithms. In the next section three related gradient evaluation algorithms are derived and their scaling and causality properties are compared. The three algorithms demonstrate a natural progression from a causal algorithm that scales poorly to an acausal algorithm that scales linearly.

The difficulties that these exact algorithms exhibit appear to be inescapable. This suggests that approximation schemes that do not calculate exact gradients, or that exploit special properties of the tasks to be learned, may lead to physically realizable neural networks. The final section of this paper suggests an approach that could be exploited in systems where the time scale of the to-be-learned task is much slower than the relaxation time scale of the adaptive neurons.

2 ANALYSIS OF ALGORITHMS

We will begin by reviewing the learning algorithms that apply to time-dependent recurrent networks. The control literature generally derives these algorithms by taking a variational approach (e.g. Bryson and Ho, 1975). Here we will take a somewhat unconventional approach and restrict ourselves to the domain of differential equations and their solutions. To begin with, let us take a concrete example. Consider the neural system given by the equation

\frac{dx_i}{dt} = -x_i + \sum_{j=1}^{n} w_{ij} f(x_j) + I_i    (1)

where f(.) is a sigmoid shaped function (e.g. tanh(.)) and I_i is an external input. This system is a well studied neural model (e.g. Aplevich, 1968; Cowan, 1967; Hopfield, 1984; Malsburg, 1973; Sejnowski, 1977). The goal is to find the weight matrix w that causes the states x(t) of the output units to follow a specified trajectory \xi(t). The actual trajectory depends not only on the weight matrix but also on the external input vector I. To find the weights one minimizes a measure of the difference between the actual trajectory x(t) and the desired trajectory \xi(t). This measure is a functional of the trajectories and a function of the weights. It is given by

E(w, t_0, t_f) = \frac{1}{2} \sum_{i \in O} \int_{t_0}^{t_f} dt \, (x_i(t) - \xi_i(t))^2    (2)

where O is the set of output units. We shall, only for the purpose of algorithm comparison, make the following assumptions: (1) that the networks are fully connected, (2) that the interval [t_0, t_f] is divided into q segments, with numerical integrations performed using the Euler method, and (3) that all the operations are performed with the same precision. This will allow us to easily estimate the amount of computation and memory required for each algorithm relative to the others.

2.1 ALGORITHM A

If the objective function E is differentiated with respect to w_rs, one obtains

\frac{\partial E}{\partial w_{rs}} = -\sum_{i=1}^{n} \int_{t_0}^{t_f} dt \, J_i(t) \, p_{irs}(t)    (3a)

where

J_i = \begin{cases} \xi_i(t) - x_i(t) & \text{if } i \in O \\ 0 & \text{if } i \notin O \end{cases}    (3b)

and where

p_{irs} = \frac{\partial x_i}{\partial w_{rs}}    (3c)

To evaluate p_irs, differentiate equation (1) with respect to w_rs and observe that the time derivative and the partial derivative with respect to w_rs commute. The resulting equation is

\frac{dp_{irs}}{dt} = \sum_{j=1}^{n} L_{ij}(x) \, p_{jrs} + S_{irs}    (4a)

where

L_{ij}(x) = -\delta_{ij} + w_{ij} f'(x_j)    (4b)

and where

S_{irs} = \delta_{ir} f(x_s)    (4c)

The initial condition for eqn. (4a) is p(t_0) = 0. Equations (1), (3) and (4) can be used to calculate the gradient for a learning rule. This is the approach taken by Williams and Zipser (1989) and also discussed by Pearlmutter (1988). Williams and Zipser further observe that one can use the instantaneous value of p(t) and J(t) to update the weights continually, provided the weights change slowly. The computationally intensive part of this algorithm occurs in the integration of equation (4a). There are n^3 components to p, hence there are n^3 equations. Accordingly, the amount of hardware or memory required to perform the calculation will scale like n^3. Each of these equations requires a summation over all the neurons, hence the amount of computation (measured in multiply-accumulates) goes like n^4 per time step, and there are q time steps, hence the total number of multiply-accumulates scales like n^4 q. Clearly, the scaling properties of this approach are very poor and it cannot be practically applied to very large networks.

2.2 ALGORITHM B

Rather than numerically integrate the system of equations (4a) to obtain p(t), suppose we write down the formal solution.
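The forward sensitivity recursion of algorithm A can be sketched numerically before moving on. The following is an illustrative Euler implementation of eqns. (1), (3) and (4), not the original code: f = tanh is assumed, and `I` and `xi` are assumed to be callables returning the input and target vectors at time t.

```python
import numpy as np

def rtrl_gradient(W, I, xi, t_grid, output_units):
    """Euler-integrate eqn (1) and the sensitivity system (4a),
    accumulating the gradient of eqn (3a) online.  The tensor
    p[i, r, s] = dx_i/dw_rs holds n^3 numbers: the storage bottleneck."""
    n = W.shape[0]
    dt = t_grid[1] - t_grid[0]
    x = np.zeros(n)
    p = np.zeros((n, n, n))
    grad = np.zeros((n, n))
    for t in t_grid:
        fx = np.tanh(x)
        L = -np.eye(n) + W * (1.0 - fx**2)      # eqn (4b), f = tanh
        J = np.zeros(n)
        J[output_units] = xi(t)[output_units] - x[output_units]   # eqn (3b)
        grad -= dt * np.einsum('i,irs->rs', J, p)                 # eqn (3a)
        S = np.zeros((n, n, n))
        S[np.arange(n), np.arange(n), :] = fx                     # eqn (4c)
        p = p + dt * (np.einsum('ij,jrs->irs', L, p) + S)         # eqn (4a)
        x = x + dt * (-x + W @ fx + I(t))                         # eqn (1)
    return grad
```

Because the weight update can use the instantaneous p(t) and J(t), the gradient accumulates entirely forwards in time, which is what makes this scheme causal.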
This solution is

p_{irs}(t) = \sum_{j=1}^{n} K_{ij}(t, t_0) \, p_{jrs}(t_0) + \sum_{j=1}^{n} \int_{t_0}^{t} d\tau \, K_{ij}(t, \tau) \, S_{jrs}(\tau)    (5a)

The matrix K is defined by the expression

K(t_2, t_1) = \exp\left( \int_{t_1}^{t_2} d\tau \, L(x(\tau)) \right)    (5b)

This matrix is known as the propagator or transition matrix. The expression for p_irs consists of a homogeneous solution and a particular solution. The choice of initial condition p_irs(t_0) = 0 leaves only the particular solution. If the particular solution is substituted back into eqn. (3a), one eventually obtains the following expression for the gradient

\frac{\partial E}{\partial w_{rs}} = -\sum_{i=1}^{n} \int_{t_0}^{t_f} dt \int_{t_0}^{t} d\tau \, J_i(t) \, K_{ir}(t, \tau) \, f(x_s(\tau))    (6)

To obtain this expression one must observe that S can be expressed in terms of x, i.e. use eqn. (4c). This allows the summation over j to be performed trivially, thus resulting in eqn. (6). The familiar outer product form of backpropagation is not yet manifest in this expression. To uncover it, change the order of the integrations. This requires some care because the limits of the integration are not the same. The result is

\frac{\partial E}{\partial w_{rs}} = -\sum_{i=1}^{n} \int_{t_0}^{t_f} d\tau \int_{\tau}^{t_f} dt \, J_i(t) \, K_{ir}(t, \tau) \, f(x_s(\tau))    (7)

Inspection of this expression reveals that neither the summation over i nor the integration over t includes x_s(\tau), thus it is useful to factor it out. Consequently equation (7) takes on the familiar outer product form of backpropagation

\frac{\partial E}{\partial w_{rs}} = -\int_{t_0}^{t_f} d\tau \, y_r(\tau) \, f(x_s(\tau))    (8)

where y_r(\tau) is defined to be

y_r(\tau) = \sum_{i=1}^{n} \int_{\tau}^{t_f} dt \, J_i(t) \, K_{ir}(t, \tau)    (9)

Equation (8) defines an expression for the gradient, provided we can calculate y_r(\tau) from eqn. (9). In principle, this can be done since the propagator K and the vector J are both completely determined by x(t).
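For illustration, the propagator of eqn. (5b) can be tabulated from the one-step Euler factors 1 + dt L. The sketch below is an assumption-laden illustration, not the paper's code; it fills only the t >= tau half of the table, the backward half being analogous.

```python
import numpy as np

def euler_propagators(L_seq, dt):
    """Tabulate K(t_k, t_l) of eqn (5b) for t_k >= t_l by Euler-stepping
    dK(t,tau)/dt = L(x(t)) K(t,tau) from K(tau,tau) = 1.  L_seq[k] is the
    matrix L(x(t_k)); the table stores n^2 entries for each of the
    q^2 (t, tau) pairs, which is the memory bottleneck of algorithm B."""
    q, n, _ = L_seq.shape
    K = np.zeros((q, q, n, n))
    for l in range(q):
        K[l, l] = np.eye(n)                  # K(tau, tau) = 1
        for k in range(l, q - 1):
            K[k + 1, l] = K[k, l] + dt * L_seq[k] @ K[k, l]
    return K
```

A useful sanity check is the composition property K(t_2, t_0) = K(t_2, t_1) K(t_1, t_0), which the discrete table inherits exactly from the ordered products of Euler factors.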
The computationally intensive part of this algorithm is the calculation of K(t, \tau) for all values of t and \tau. The calculation requires the integration of equations of the form

\frac{dK(t, \tau)}{dt} = L(x(t)) \, K(t, \tau)    (10)

for q different values of \tau. There are n^2 different equations to integrate for each value of \tau. Consequently there are n^2 q integrations to be performed, where the interval from t_0 to t_f is divided into q intervals. The calculation of all the components of K(t, \tau), from t_0 to t_f, scales like n^3 q^2, since each integration requires n multiply-accumulates per time step and there are q time steps. Similarly, the memory requirements scale like n^2 q^2. This is because K has n^2 components for each (t, \tau) pair and there are q^2 such pairs.

Equation (10) must be integrated forwards in time from t = \tau to t = t_f and backwards in time from t = \tau to t = t_0. This is because K must satisfy K(\tau, \tau) = 1 (the identity matrix) for all \tau. This condition follows from the definition of K, eqn. (5b). Finally, we observe that expression (9) is the time-dependent analog of the expression used by Rohwer and Forrest (1987) to calculate the gradient in recurrent networks. The analogy can be made somewhat more explicit by writing K(t, \tau) as the inverse K^{-1}(\tau, t). Thus we see that y(\tau) can be expressed in terms of a matrix inverse, just as in the Rohwer and Forrest algorithm.

2.3 ALGORITHM C

The final algorithm is familiar from continuous time optimal control and identification. The algorithm is usually derived by performing a variation on the functional given by eqn. (2). This results in a two-point boundary value problem. On the other hand, we know that y is given by eqn. (9). So we simply observe that this is the particular solution of the differential equation

-\frac{dy}{dt} = L^T(x(t)) \, y + J    (11)

where L^T is the transpose of the matrix defined in eqn. (4b).
To see this, simply substitute the form for y into eqn. (11) and verify that it is indeed the solution to the equation.

The particular solution to eqn. (11) vanishes only if y(t_f) = 0. In other words: to obtain y(t) we need only integrate eqn. (11) backwards from the final condition y(t_f) = 0. This is just the algorithm introduced to the neural network community by Pearlmutter (1988). It also corresponds to the unfolding-in-time approach discussed by Rumelhart et al. (1986), provided that all the equations are discretized and one takes \Delta t = 1.

The two-point boundary value problem is rather straightforward to solve because the equation for x(t) is independent of y(t). Both x(t) and y(t) can be obtained with n multiply-accumulates per component per time step. There are q time steps from t_0 to t_f and both x(t) and y(t) have n components, hence the calculation of x(t) and y(t) scales like n^2 q. The weight update equation also requires n^2 q multiply-accumulates. Thus the computational requirements of the algorithm as a whole scale like n^2 q. The memory required scales like nq, since it is necessary to save each value of x(t) along the trajectory to compute y(t).

2.4 SCALING VS CAUSALITY

The results of the previous sections are summarized in Table 1 below. We see that we have a progression of tradeoffs between scaling and causality. That is, we must choose between a causal algorithm with exploding computational and storage requirements and an acausal algorithm with modest storage requirements. There is no q dependence in the memory requirements of algorithm A because the integral given in eqn. (3a) can be accumulated at each time step. Algorithm B has some of the worst features of both algorithms.
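The two sweeps of algorithm C can be sketched in a few lines. This is an illustrative Euler implementation, not the original code; f = tanh is assumed, and `I` and `xi` are assumed to be callables returning the input and target vectors at time t.

```python
import numpy as np

def adjoint_gradient(W, I, xi, t_grid, output_units):
    """Forward pass stores the trajectory x(t) (n*q memory); backward pass
    integrates eqn (11), -dy/dt = L^T(x) y + J, from y(t_f) = 0,
    accumulating the outer-product gradient of eqn (8) along the way."""
    n = W.shape[0]
    dt = t_grid[1] - t_grid[0]
    x = np.zeros(n)
    xs = []                          # trajectory "stack" for the backward sweep
    for t in t_grid:                 # forward: eqn (1)
        xs.append(x)
        x = x + dt * (-x + W @ np.tanh(x) + I(t))
    y = np.zeros(n)                  # final condition y(t_f) = 0
    grad = np.zeros((n, n))
    for k in reversed(range(len(t_grid))):
        fx = np.tanh(xs[k])
        L = -np.eye(n) + W * (1.0 - fx**2)      # eqn (4b)
        J = np.zeros(n)
        J[output_units] = xi(t_grid[k])[output_units] - xs[k][output_units]
        grad -= dt * np.outer(y, fx)            # eqn (8)
        y = y + dt * (L.T @ y + J)              # backward Euler step of eqn (11)
    return grad
```

Note that the stored trajectory plays exactly the role of the stack discussed below: the states must be recalled in reverse order, which is what makes the sweep acausal.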
Table 1: Comparison of three algorithms

Algorithm   Memory    Multiply-accumulates   Direction of integrations
A           n^3       n^4 q                  x and p are both forward in time
B           n^2 q^2   n^3 q^2                x is forward, K is forward and backward
C           n q       n^2 q                  x is forward, y is backward in time

Digital hardware has no difficulties (at least over finite time intervals) with acausal algorithms, provided a stack is available to act as a memory that can recall states in reverse order. To the extent that the gradient calculations are carried out on digital machines, it makes sense to use algorithm C because it is the most efficient. In analog VLSI, however, it is difficult to imagine how to build a continually running network that uses an acausal algorithm. Algorithm A is attractive for physical implementation because it could be run continually and in real time (Williams and Zipser, 1989). However, its scaling properties preclude the possibility of building very large networks based on the algorithm. Recently, Zipser (1990) has suggested that a divide and conquer approach may reduce the computational and spatial complexity of the algorithm. This approach, although promising, does not always work and there is as yet no convergence proof. How then is it possible to learn trajectories using local, scalable and causal algorithms? In the next section a possible avenue of attack is suggested.

3 EXPLOITING DISPARATE TIME SCALES

I assert that for some classes of problems there are scalable and causal algorithms that approximate the gradient, and that these algorithms can be found by exploiting the disparity in time scales found in these classes of problems. In particular, I assert that when the time scale of the adaptive units is fast compared to the time scale of the behavior to be learned, it is possible to find scalable and causal adaptive algorithms.
A general formalism for doing this will not be presented here; instead a simple, perhaps artificial, example will be presented. This example minimizes an error function for a time dependent problem.

It is likely that trajectory generation in motor control problems is of this type. The characteristic time scales of the trajectories that need to be generated are determined by inertia and friction. These mechanical time scales are considerably longer than the electronic time scales that occur in VLSI. Thus it seems that for robotic problems, there may be no need to use the completely general algorithms discussed in section 2. Instead, algorithms that take advantage of the disparity between the mechanical and the electronic time scales are likely to be more useful for learning to generate trajectories.

The task is to map from a periodic input I(t) to a periodic output \xi(t). The basic idea is to use the continuous-time recurrent-backpropagation approach with slowly varying time-dependent inputs rather than with static inputs. The learning is done in real-time and in a continuous fashion. Consider a set of n "fast" neurons (i = 1, ..., n), each of which satisfies the additive activation dynamics determined by eqn. (1). Assume that the initial weights are sufficiently small that the dynamics of the network would be convergent if the inputs I were constant. The external input is applied to the network through the vector I. It has been previously shown (Pineda, 1988) that the ij-th component of the gradient of E is equal to -y^f_i f(x^f_j), where x^f is the steady state solution of eqn. (1) and where y^f_i is a component of the steady state solution of

\frac{dy}{dt} = L^T(x^f) \, y + J    (12)

where the components of L^T are given by eqn. (4b). Note that the relative sign between equations (11) and (12) is what enables this algorithm to be causal.
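A minimal simulation sketch of this causal scheme follows; the task, parameters, and function names are assumptions for illustration, not the paper's experiment. Eqns. (1) and (12) are relaxed on a fast time scale while the input drifts slowly and the weights follow the outer product y f(x)^T.

```python
import numpy as np

def adiabatic_online_learning(n, inputs, targets, output_units,
                              dt=0.05, tau_x=0.5, tau_y=0.5,
                              tau_w=2.0, T=40.0, seed=0):
    """Causal online sketch: x (eqn 1) and y (eqn 12) relax quickly toward
    their instantaneous fixed points while the slowly varying input drifts,
    and the weights drift slowly along the approximate negative gradient."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((n, n))   # small initial weights
    x = np.zeros(n)
    y = np.zeros(n)
    errors = []
    for t in np.arange(0.0, T, dt):
        fx = np.tanh(x)
        L = -np.eye(n) + W * (1.0 - fx**2)  # eqn (4b)
        J = np.zeros(n)
        J[output_units] = targets(t)[output_units] - x[output_units]
        errors.append(0.5 * np.sum(J**2))   # instantaneous tracking error
        # note the +L^T here versus the -L^T of eqn (11):
        # this relative sign is what makes the relaxation causal
        y = y + (dt / tau_y) * (L.T @ y + J)
        x = x + (dt / tau_x) * (-x + W @ fx + inputs(t))
        # slow weight drift along the outer product
        W = W + (dt / tau_w) * np.outer(y, fx)
    return W, np.array(errors)
```

Run on a slowly varying periodic task (input period much longer than tau_x), the tracking error should fall as the weights drift down the approximate gradient.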
Now suppose that instead of a fixed input vector I, we use a slowly varying input I(t/\tau), where \tau is the characteristic time scale over which the input changes significantly. We take, as the gradient descent algorithm, the dynamics defined by

\tau_w \frac{dw_{ij}}{dt} = y_i(t) \, f(x_j(t))    (13)

where \tau_w is the time constant that defines the (slow) time scale over which w changes, and where x_j is the instantaneous solution of eqn. (1) and y_i is the instantaneous solution of eqn. (12). Then in the adiabatic limit the outer product y f(x)^T in eqn. (13) approximates the negative gradient of the objective function E, that is

\tau_w \frac{dw}{dt} \approx -\nabla_w E    (14)

This approach can map one continuous trajectory into another continuous trajectory, provided the trajectories change slowly enough. Furthermore, learning occurs causally and scalably. There is no memory in the model, i.e. the output of the adaptive neurons depends only on their input and not on their internal state. Thus, this network can never learn to perform tasks that require memory unless the learning algorithm is modified to learn the appropriate transitions. This is the major drawback of the adiabatic approach. Some state information can be incorporated into this model by using recurrent connections, in which case the network can have multiple basins and the final state will depend on the initial state of the net as well as on the inputs, but this will not be pursued here.

Simple simulations were performed to verify that the approach did indeed perform gradient descent. One simulation is presented here for the benefit of investigators who may wish to verify the results. A feedforward network topology consisting of two input units, five hidden units and two output units was used for the adaptive network. Units were numbered sequentially, 1 through 9, beginning with the input layer and ending in the output layer.
Time dependent external inputs for the two input neurons were generated with time dependence I_1 = sin(2\pi t) and I_2 = cos(2\pi t). The targets for the output neurons were \xi_8 = R sin(2\pi t) and \xi_9 = R cos(2\pi t), where R = 1.0 + 0.1 sin(6\pi t). All the equations were simultaneously integrated using 4th order Runge-Kutta with a time step of 0.1. A relaxation time scale was introduced into the forward and backward propagation equations by multiplying the time derivatives in eqns. (1) and (12) by \tau_x and \tau_y respectively. These time scales were set to \tau_x = \tau_y = 0.5. The adaptive time scale of the weights was \tau_w = 1.0. The error in the network was initially E = 10, and the integration was cut off when the error reached a plateau at E = 0.12. The learning curve is shown in Fig. 1. The trained trajectory did not exactly reach the desired solution. In particular, the network did not learn the odd order harmonic that modulates R. By way of comparison, a conventional backpropagation approach that calculated a cumulative gradient over the trajectory and used conjugate gradient for the descent was able to converge to the global minimum.

Figure 1: Learning curve. One time unit corresponds to a single oscillation.

4 SUMMARY

The key points of this paper are: 1) Exact minimization algorithms for learning time-dependent behavior either scale poorly or else violate causality, and 2) Approximate gradient calculations will likely lead to causal and scalable learning algorithms.
The adiabatic approach should be useful for learning to generate trajectories of the kind encountered when learning motor skills.

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not constitute or imply any endorsement by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology. The work described in this paper was carried out at the Center for Space Microelectronics Technology, Jet Propulsion Laboratory, California Institute of Technology. Support for the work came from the Air Force Office of Scientific Research through an agreement with the National Aeronautics and Space Administration (AFOSR-ISSA-90-0027).

REFERENCES

Aplevich, J. D. (1968). Models of certain nonlinear systems. In E. R. Caianiello (Ed.), Neural Networks (pp. 110-115). Berlin: Springer Verlag.

Bryson, A. E. and Ho, Y. (1975). Applied Optimal Control: Optimization, Estimation, and Control. New York: Hemisphere Publishing Co.

Cowan, J. D. (1967). A mathematical theory of central nervous activity. Unpublished dissertation, Imperial College, University of London.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Nat. Acad. Sci. USA, 81, 3088-3092.

Malsburg, C. van der (1973). Self-organization of orientation sensitive cells in striate cortex. Kybernetik, 14, 85-100.

Pearlmutter, B. A. (1988). Learning state space trajectories in recurrent neural networks: A preliminary report (Tech. Rep. AIP-54). Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Pineda, F. J. (1988). Dynamics and Architecture for Neural Computation. Journal of Complexity, 4, 216-245.

Rohwer, R. and Forrest, B. (1987). Training time dependence in neural networks. In M.
Caudill and C. Butler (Eds.), Proceedings of the IEEE First Annual International Conference on Neural Networks (pp. 701-708). San Diego, California: IEEE.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning Internal Representations by Error Propagation. In D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing (pp. 318-362). Cambridge: M.I.T. Press.

Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4, 303-321.

Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270-280.

Zipser, D. (1990). Subgrouping reduces complexity and speeds up learning in recurrent networks (this volume).
", "award": [], "sourceid": 212, "authors": [{"given_name": "Fernando", "family_name": "Pineda", "institution": null}]}