{"title": "Generalization of Back propagation to Recurrent and Higher Order Neural Networks", "book": "Neural Information Processing Systems", "page_first": 602, "page_last": 611, "abstract": null, "full_text": "602 \n\nGENERALIZATION OF BACKPROPAGATION \n\nTO \n\nRECURRENT AND HIGHER ORDER NEURAL NETWORKS \n\nFernando J. Pineda \n\nApplied Physics Laboratory, Johns Hopkins University \n\nJohns Hopkins Rd., Laurel MD 20707 \n\nAbstract \n\nA general method for deriving backpropagation algorithms for networks \n\nwith recurrent and higher order networks is introduced. The propagation of activation \nin these networks is determined by dissipative differential equations. The error signal \nis backpropagated by integrating an associated differential equation. The method is \nintroduced by applying it to the recurrent generalization of the feedforward \nbackpropagation network. The method is extended to the case of higher order \nnetworks and to a constrained dynamical system for training a content addressable \nmemory. The essential feature of the adaptive algorithms is that adaptive equation has \na simple outer product form. \n\nPreliminary experiments suggest that learning can occur very rapidly in \n\nnetworks with recurrent connections. The continuous formalism makes the new \napproach more suitable for implementation in VLSI. \n\nIntroduction \n\nOne interesting class of neural networks, typified by the Hopfield neural \n\nnetworks (1,2) or the networks studied by Amari(3,4) are dynamical systems with three \nsalient properties. First, they posses very many degrees of freedom, second their \ndynamics are nonlinear and third, their dynamics are dissipative. Systems with these \nproperties can have complicated attractor structures and can exhibit computational \nabilities. \n\nThe identification of attractors with computational objects, e.g. memories at d \nrules, is one of the foundations of the neural network paradigm. 
In this paradigm, programming becomes an exercise in manipulating attractors. A learning algorithm is a rule or dynamical equation which changes the locations of fixed points to encode information. One way of doing this is to minimize, by gradient descent, some function of the system parameters. This general approach is reviewed by Amari (4) and forms the basis of many learning algorithms. The formalism described here is a specific case of this general approach.

The purpose of this paper is to introduce a formalism for obtaining adaptive dynamical systems which are based on backpropagation (5,6,7). These dynamical systems are expressed as systems of coupled first order differential equations. The formalism will be illustrated by deriving adaptive equations for a recurrent network with first order neurons, a recurrent network with higher order neurons, and finally a recurrent first order associative memory.

Example 1: Recurrent backpropagation with first order units

Consider a dynamical system whose state vector x evolves according to the following set of coupled differential equations

    dx_i/dt = -x_i + g_i(Σ_j w_ij x_j) + I_i ,    (1)

where i = 1,...,N. The functions g_i are assumed to be differentiable and may have different forms for various populations of neurons. In this paper we shall make no other requirements on g_i. In the neural network literature it is common to take these functions to be sigmoid shaped. A commonly used form is the logistic function,

    g(ξ) = (1 + e^{-ξ})^{-1} .    (2)

This form is biologically motivated since it attempts to account for the refractory phase of real neurons. However, it is important to stress that there is nothing in the mathematical content of this paper which requires this form -- any differentiable function will suffice in the formalism presented in this paper.

© American Institute of Physics 1988
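As a concrete illustration (my own sketch, not from the paper), the relaxation of equation (1) with the logistic activation (2) can be integrated with simple Euler steps; the network size, weight scale, step size, and iteration count below are arbitrary choices:

```python
import numpy as np

def g(xi):
    # Logistic function of eq. (2): g(xi) = 1 / (1 + exp(-xi))
    return 1.0 / (1.0 + np.exp(-xi))

rng = np.random.default_rng(0)
N = 8                                    # number of units (arbitrary)
w = 0.5 * rng.standard_normal((N, N))    # small random weights; stable in practice
I = rng.uniform(0.0, 1.0, size=N)        # source term I (nonzero on input units)

x = np.full(N, 0.5)                      # initial state
dt = 0.1                                 # Euler time step
for _ in range(2000):                    # integrate eq. (1) toward a fixed point
    x = x + dt * (-x + g(w @ x) + I)

# At a fixed point the right-hand side of eq. (1) vanishes.
residual = float(np.max(np.abs(-x + g(w @ x) + I)))
```

With weights this small the system contracts to a unique fixed point, consistent with the paper's observation that oscillations do not occur unless special weights are chosen.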
For example, a choice which may be of use in signal processing is sin(ξ).

A necessary condition for the learning algorithms discussed here to exist is that the system possesses stable isolated attractors, i.e. fixed points. The attractor structure of (1) is the same as that of the more commonly used equation

    du_i/dt = -u_i + Σ_j w_ij g(u_j) + K_i ,    (3)

because (1) and (3) are related by a simple linear transformation. Therefore results concerning the stability of (3) are applicable to (1). Amari (3) studied the dynamics of equation (3) in networks with random connections. He found that collective variables corresponding to the mean activation and its second moment must exhibit either stable or bistable behaviour. More recently, Hopfield (2) has shown how to construct content addressable memories from symmetrically connected networks with this same dynamical equation. The symmetric connections in the network guarantee global stability. The solution of equation (1) is also globally asymptotically stable if w can be transformed into a lower triangular matrix by row and column exchange operations, because in such a case the network is simply a feedforward network and the output can be expressed as an explicit function of the input. No Liapunov function exists for arbitrary weights, as can be demonstrated by constructing a set of weights which leads to oscillation. In practice, it is found that oscillations are not a problem and that the system converges to fixed points unless special weights are chosen. Therefore it shall be assumed, for the purposes of deriving the backpropagation equations, that the system ultimately settles down to a fixed point.

Consider a system of N neurons, or units, whose dynamics is determined by equation (1). Of all the units in the network we will arbitrarily define some subset of them (A) as input units and some other subset of them (Ω) as output units.
Units which are neither members of A nor Ω are denoted hidden units. A unit may be simultaneously an input unit and an output unit. The external environment influences the system through the source term I. If a unit is an input unit, the corresponding component of I is nonzero. To make this more precise it is useful to introduce a notational convention. Suppose that ** represents some subset of units in the network; then the function Θ_i^** is defined by

    Θ_i^** = 1 if the i-th unit is a member of **, 0 otherwise.    (4)

In terms of this function, the components of the I vector are given by

    I_i = ξ_i Θ_i^A ,    (5)

where ξ_i is determined by the external environment.

Our goal will be to find a local algorithm which adjusts the weight matrix w so that a given initial state x^0 = x(t_0) and a given input I result in a fixed point, x^∞ = x(t_∞), whose components have a desired set of values T_i along the output units. This will be accomplished by minimizing a function E which measures the distance between the desired fixed point and the actual fixed point, i.e.

    E = (1/2) Σ_{i=1}^{N} J_i^2 ,    (6)

where

    J_i = (T_i - x_i^∞) Θ_i^Ω .    (7)

E depends on the weight matrix w through the fixed point x^∞(w). A learning algorithm drives the fixed points towards the manifolds which satisfy x_i^∞ = T_i on the output units. One way of accomplishing this with dynamics is to let the system evolve in the weight space along trajectories which are antiparallel to the gradient of E. In other words,

    dw_ij/dt = -η ∂E/∂w_ij ,    (8)

where η is a numerical constant which defines the (slow) time scale on which w changes. η must be small so that x is always essentially at steady state, i.e. x(t) ≈ x^∞. It is important to stress that the choice of gradient descent for the learning dynamics is by no means unique, nor is it necessarily the best choice.
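A minimal numerical sketch of the error measure (6)-(7) may make the notation concrete; the unit count, output mask, targets, and fixed-point values below are illustrative choices of mine, not the paper's:

```python
import numpy as np

# Theta_i^Omega of eq. (4): 1 on output units, 0 elsewhere
# (here the last two of five units form the output set Omega).
output_mask = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
T = np.array([0.0, 0.0, 0.0, 1.0, 0.2])      # targets T_i (only Omega entries matter)
x_inf = np.array([0.3, 0.7, 0.1, 0.9, 0.4])  # a hypothetical fixed point x_i^infinity

J = (T - x_inf) * output_mask                # eq. (7): error vanishes off Omega
E = 0.5 * float(np.sum(J ** 2))              # eq. (6)
```

Only the two output components contribute to E; the hidden and input units are masked out exactly as Θ_i^Ω dictates.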
Other learning dynamics which employ second order time derivatives (e.g. the momentum method (5)) or which employ second order space derivatives (e.g. second order backpropagation (8)) may be more useful in particular applications. However, equation (8) does have the virtue of being the simplest dynamics which minimizes E.

On performing the differentiations in equation (8), one immediately obtains

    dw_rs/dt = η Σ_k J_k ∂x_k^∞/∂w_rs .    (9)

The derivative of x_k^∞ with respect to w_rs is obtained by first noting that the fixed points of equation (1) satisfy the nonlinear algebraic equation

    x_i^∞ = g_i(Σ_j w_ij x_j^∞) + I_i ,    (10)

differentiating both sides of this equation with respect to w_rs, and finally solving for ∂x_k^∞/∂w_rs. The result is

    ∂x_k^∞/∂w_rs = (L^{-1})_{kr} g_r'(u_r) x_s^∞ ,    (11)

where g_r' is the derivative of g_r and where the matrix L is given by

    L_ij = δ_ij - g_i'(u_i) w_ij .    (12)

δ_ij is the Kronecker δ function (δ_ij = 1 if i = j, otherwise δ_ij = 0). On substituting (11) into (9) one obtains the remarkably simple form

    dw_rs/dt = η y_r x_s^∞ ,    (13)

where

    y_r = g_r'(u_r) Σ_k J_k (L^{-1})_{kr} .    (14)

Equations (13) and (14) specify a formal learning rule. Unfortunately, equation (14) requires a matrix inversion to calculate the error signals y_r. Direct matrix inversions are necessarily nonlocal calculations and therefore this learning algorithm is not suitable for implementation as a neural network. Fortunately, a local method for calculating y_r can be obtained by the introduction of an associated dynamical system. To obtain this dynamical system, first rewrite equation (14) as

    Σ_r L_rk (y_r / g_r'(u_r)) = J_k .    (15)

Then multiply both sides by g_k'(u_k), substitute the explicit form for L, and finally sum over r. The result is

    0 = -y_k + g_k'(u_k) (Σ_r w_rk y_r + J_k) .    (16)

One now makes the observation that the solutions of this linear equation are the fixed points of the dynamical system given by

    dy_k/dt = -y_k + g_k'(u_k) (Σ_r w_rk y_r + J_k) .    (17)

This last step is not unique; equation (16) could be transformed in various ways, leading to related differential equations, cf. Pineda (9). It is not difficult to show that the first order finite difference approximation (with a time step Δt = 1) of equations (1), (13) and (17) has the same form as the conventional backpropagation algorithm.

Equations (1), (13) and (17) completely specify the dynamics for an adaptive neural network, provided that (1) and (17) converge to stable fixed points and provided that both quantities on the right hand side of equation (13) are the steady state solutions of (1) and (17).

It was pointed out by Almeida (10) that the local stability of (1) is a sufficient condition for the local stability of (17). To prove this it suffices to linearize equation (1) about a stable fixed point. The resulting linearized equation depends on the same matrix L whose transpose appears in the derivation of equation (17), cf. equation (15). But L and L^T have the same eigenvalues; hence it follows that the fixed points of (17) must also be locally stable if the fixed points of (1) are locally stable.

Learning multiple associations

It is important to stress that up to this point the entire discussion has assumed that I and T are constant in time; thus no mechanism has been obtained for learning multiple input/output associations. Two methods for training the network to learn multiple associations are now discussed. These methods lead to qualitatively different learning behaviour.

Suppose that each input/output pair is labeled by a pattern label α, i.e. {I^α, T^α}.
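The whole recipe can be sketched numerically (my own construction, with an arbitrary small network and a logistic g): relax eq. (1) to get x^∞, relax the associated system (17) to get the error signal y, check y against the direct but nonlocal inversion formula (14), and form the outer-product update of eq. (13). For simplicity every unit is treated as an output unit here.

```python
import numpy as np

def g(u):
    # Logistic activation, eq. (2); its derivative is g(u) * (1 - g(u)).
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
N = 6
w = 0.4 * rng.standard_normal((N, N))    # small weights, so fixed points are stable
I = rng.uniform(size=N)                  # input term
T = rng.uniform(size=N)                  # targets
dt, steps = 0.1, 3000

# Relax eq. (1) to the fixed point x_infinity; u_i = sum_j w_ij x_j.
x = np.full(N, 0.5)
for _ in range(steps):
    x = x + dt * (-x + g(w @ x) + I)
u = w @ x
gp = g(u) * (1.0 - g(u))                 # g'(u) for the logistic

J = T - x                                # eq. (7) with Theta^Omega = 1 everywhere

# Relax the associated system, eq. (17), to get the error signal y locally.
y = np.zeros(N)
for _ in range(steps):
    y = y + dt * (-y + gp * (w.T @ y + J))

# Direct (nonlocal) formula, eq. (14): y_r = g'_r(u_r) * sum_k J_k (L^-1)_kr,
# with L_ij = delta_ij - g'_i(u_i) w_ij, eq. (12).
L = np.eye(N) - gp[:, None] * w
y_direct = gp * np.linalg.solve(L.T, J)
gap = float(np.max(np.abs(y - y_direct)))

dw = np.outer(y, x)                      # eq. (13): outer-product weight update
```

The small `gap` illustrates Almeida's point: when (1) is stable, (17) relaxes to exactly the error signal that the matrix inversion in (14) would produce.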
Then the energy function which is minimized in the above discussion must also depend on this label, since it is an implicit function of the {I^α, T^α} pairs. In order to learn multiple input/output associations it is necessary to minimize all the E[α] simultaneously. In other words, the function to minimize is

    E_total = Σ_α E[α] ,    (18)

where the sum is over all input/output associations. From (18) it follows that the gradient of E_total is simply the sum of the gradients for each association; hence the corresponding gradient descent equation has the form

    dw_ij/dt = η Σ_α y_i^∞[α] x_j^∞[α] .    (19)

In numerical simulations, each time step of (19) requires relaxing (1) and (17) for each pattern and accumulating the gradient over all the patterns. This form of the algorithm is deterministic and is guaranteed to converge because, by construction, E_total is a Liapunov function for equation (19). However, the system may get stuck in a local minimum. This method is similar to the master/slave approach of Lapedes and Farber (11). Their adaptive equation, which plays the same role as equation (19), also has a gradient form, although it is not strictly descent along the gradient. For a randomly or fully connected network it can be shown that the number of operations required per weight update in the master/slave formalism is proportional to N^3, where N is the number of units. This is because there are O(N^2) update equations and each equation requires O(N) operations (assuming some precomputation). On the other hand, in the backpropagation formalism each update equation requires only O(1) operations because of its trivial outer product form. Also, O(N^2) operations are required to precompute x^∞ and y^∞. The result is that each weight update requires only O(N^2) operations.
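The accumulation in eq. (19), and the outer-product form that makes each weight's update O(1) once the steady states are known, can be sketched as follows; the per-pattern steady states here are random placeholders standing in for the relaxed solutions of (1) and (17), and the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_patterns, eta = 4, 3, 0.05           # arbitrary sizes and learning rate

# Placeholders for the steady states x^inf[alpha], y^inf[alpha] of eqs. (1), (17).
x_inf = rng.uniform(size=(n_patterns, N))
y_inf = rng.uniform(size=(n_patterns, N))

grad = np.zeros((N, N))
for a in range(n_patterns):
    grad += np.outer(y_inf[a], x_inf[a])  # eq. (19): a sum of outer products

dw = eta * grad                           # one deterministic gradient step
```

Each of the N^2 weight components costs a single multiply per pattern, so the whole update is O(N^2) work beyond the precomputation of x^∞ and y^∞, matching the operation count argued in the text.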
It is not possible to conclude from this argument that one or the other approach will be more efficient in a particular application, because there are other factors to consider such as the number of patterns and the number of time steps required for x and y to converge. A detailed comparison of the two methods is in preparation.

A second approach to learning multiple patterns is to use (13) and to change the patterns randomly on each time step. The system therefore receives a sequence of random impulses, each of which attempts to minimize E[α] for a single pattern. One can then define L(w) to be the mean E[α] (averaged over the distribution of patterns),

    L(w) = <E[α]> .    (20)

Amari (4) has pointed out that if the sequence of random patterns is stationary and if L(w) has a unique minimum, then the theory of stochastic approximation guarantees that the solution w(t) of (13) will converge to the minimum point w_min of L(w) to within a small fluctuating term which vanishes as η tends to zero. Evidently η is analogous to the temperature parameter in simulated annealing. This second approach generally converges more slowly than the first, but it will ultimately converge (in a statistical sense) to the global minimum.

In principle the fixed points, to which the solutions of (1) and (17) eventually converge, depend on the initial states. Indeed, Amari's (3) results imply that equation (1) is bistable for certain choices of weights. Therefore the presentation of multiple patterns might seem problematical, since in both approaches the final state of the previous pattern becomes the initial state for the new pattern. The safest approach is to reinitialize the network to the same initial state each time a new pattern is presented, e.g. x_i(t_0) = 0.5 for all i. In practice the system learns robustly even if the initial conditions are chosen randomly.
Example 2: Recurrent higher order networks

It is straightforward to apply the technique of the previous section to a dynamical system with higher order units. Higher order systems have been studied by Sejnowski (12) and Lee et al. (13). Higher order networks may have definite advantages over networks with first order units alone. A detailed discussion of the backpropagation formalism applied to higher order networks is beyond the scope of this paper. Instead, the adaptive equations for a network with purely n-th order units will be presented as an example of the formalism. To this end consider a dynamical system of the form

    dx_i/dt = -x_i + g_i(u_i) + I_i ,    (21)

where

    u_i = Σ_{j1 ... jn} w^(n)_{i j1 ... jn} f(x_{j1}) ... f(x_{jn}) ,    (22)

and where there are n+1 indices and the summations are over all indices except i. The superscript on the weight tensor indicates the order of the correlation. Note that an additional nonlinear function f has been added to illustrate a further generalization. Both f and g must be differentiable and may be chosen to be sigmoids. It is not difficult, although somewhat tedious, to repeat the steps of the previous example to derive the adaptive equations for this system. The objective function in this case is the same as was used in the first example, i.e. equation (6). The n-th order gradient descent equation has the form

    dw^(n)_{i j1 ... jn}/dt = η y^(n)∞_i f(x^∞_{j1}) ... f(x^∞_{jn}) .    (23)

Equation (23) illustrates the major feature of backpropagation which distinguishes it from other gradient descent algorithms or similar algorithms which make use of a gradient: namely, that the gradient of the objective function has a very trivial outer product form. y^(n)∞ is the steady state solution of

    dy^(n)_k/dt = -y^(n)_k + g_k'(u_k) { f_k'(x_k) Σ_r v^(n)_{rk} y^(n)_r + J_k } .    (24)

The matrix v^(n) plays the role of w in the previous example; however, v^(n) now depends on the state of the network according to

    v^(n)_{ij} = Σ ... Σ