{"title": "Phase-Space Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 481, "page_last": 488, "abstract": null, "full_text": "Phase-Space Learning \n\nFu-Sheng Tsung \nChung Tai Ch'an Temple \n56, Yuon-fon Road, Yi-hsin Li, Pu-li \nNan-tou County, Taiwan 545 \nRepublic of China \n\nGarrison W. Cottrell* \nInstitute for Neural Computation \nComputer Science & Engineering \nUniversity of California, San Diego \nLa Jolla, California 92093 \n\nAbstract \n\nExisting recurrent net learning algorithms are inadequate. We introduce the conceptual framework of viewing recurrent training as matching vector fields of dynamical systems in phase space. Phase-space reconstruction techniques make the hidden states explicit, reducing temporal learning to a feed-forward problem. In short, we propose viewing iterated prediction [LF88] as the best way of training recurrent networks on deterministic signals. Using this framework, we can train multiple trajectories, insure their stability, and design arbitrary dynamical systems. \n\n1 INTRODUCTION \n\nExisting general-purpose recurrent algorithms are capable of rich dynamical behavior. Unfortunately, straightforward applications of these algorithms to training fully-recurrent networks on complex temporal tasks have had much less success than their feedforward counterparts. For example, to train a recurrent network to oscillate like a sine wave (the \"hydrogen atom\" of recurrent learning), existing techniques such as Real Time Recurrent Learning (RTRL) [WZ89] perform suboptimally. Williams & Zipser trained a two-unit network with RTRL, with one teacher signal. One unit of the resulting network showed a distorted waveform, the other only half the desired amplitude. [Pea89] needed four hidden units. 
However, our work demonstrates that a two-unit recurrent network with no hidden units can learn the sine wave very well [Tsu94]. Existing methods also have several other limitations. For example, networks often fail to converge even though a solution is known to exist; teacher forcing is usually necessary to learn periodic signals; it is not clear how to train multiple trajectories at once, or how to insure that the trained trajectory is stable (an attractor). \n\n*Correspondence should be addressed to the second author: gary@cs.ucsd.edu \n\nIn this paper, we briefly analyze the algorithms to discover why they have such difficulties, and propose a general solution to the problem. Our solution is based on the simple idea of using the techniques of time series prediction as a methodology for recurrent network training. \n\nFirst, by way of introducing the appropriate concepts, consider a system of coupled autonomous^1 first order network equations: \n\ndx_1/dt = F_1(x_1(t), x_2(t), ..., x_n(t)) \ndx_2/dt = F_2(x_1(t), x_2(t), ..., x_n(t)) \n... \ndx_n/dt = F_n(x_1(t), x_2(t), ..., x_n(t)) \n\nor, in vector notation, \n\ndX/dt = F(X), where X(t) = (x_1(t), x_2(t), ..., x_n(t)) \n\nThe phase space is the space of the dependent variables (X); it does not include t, while the state space incorporates t. The evolution of a trajectory X(t) traces out a phase curve, or orbit, in the n-dimensional phase space of X. For low dimensional systems (2- or 3-D), it is easy to visualize the limit sets in the phase space: a fixed point and a limit cycle become a single point and a closed orbit (closed curve), respectively. In the state space they become an infinite straight line and a spiral. F(X) defines the vector field of X, because it associates a vector with each point in the phase space of X whose direction and magnitude determine the movement of that point in the next instant of time (by definition, the tangent vector). 
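As a concrete illustration (ours, not from the paper), the following sketch integrates a hypothetical 2-D autonomous system by Euler steps, so that each point of the orbit moves along the tangent vector F(X) assigned by the field:

```python
import numpy as np

def F(X):
    # Vector field of a 2-D autonomous system: dX/dt = F(X).
    # A weakly damped rotation, so the origin is a stable fixed point.
    x1, x2 = X
    return np.array([-x2 - 0.1 * x1, x1 - 0.1 * x2])

# Trace a phase curve: each Euler step moves the point a small
# distance along the tangent vector given by the field.
dt, steps = 0.01, 5000
X = np.array([1.0, 0.0])
orbit = [X.copy()]
for _ in range(steps):
    X = X + dt * F(X)
    orbit.append(X.copy())
orbit = np.array(orbit)

# Phase space is (x1, x2) alone; time is implicit.  Plotted against t
# (state space), the same orbit would be a decaying spiral.
final_radius = float(np.linalg.norm(orbit[-1]))
```

Here the limit set is a fixed point; for the limit cycles discussed below, the orbit would close on itself in the phase plane instead of spiraling in.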
\n\n2 ANALYSIS OF CURRENT APPROACHES \n\nTo get a better understanding of why recurrent algorithms have not been very effective, we look at what happens during training with two popular recurrent learning techniques: RTRL and back propagation through time (BPTT). With each, we illustrate a different problem, although the problems apply equally to each technique. \n\nRTRL is a forward-gradient algorithm that keeps a matrix of partial derivatives of the network activation values with respect to every weight. To train a periodic trajectory, it is necessary to teacher-force the visible units [WZ89], i.e., on every iteration, after the gradient has been calculated, the activations of the visible units are replaced by the teacher. To see why, consider learning a pair of sine waves offset by 90\u00b0. In phase space, this becomes a circle (Figure 1a). \n\n1 Autonomous means the right hand side of a differential equation does not explicitly reference t, e.g. dx/dt = 2x is autonomous, even though x is a function of t, but dx/dt = 2x + t is not. Continuous neural networks without inputs are autonomous. A nonautonomous system can always be turned into an autonomous system in a higher dimension. \n\nFigure 1: Learning a pair of sine waves with RTRL learning. (a) Without teacher forcing, the network dynamics (solid arrows) take it far from where the teacher (dotted arrows) assumes it is, so the gradient is incorrect. (b) With teacher forcing, the network's visible units are returned to the trajectory. \n\nInitially the network (thick arrows) is at position X_0 and has arbitrary dynamics. After a few iterations, it wanders far away from where the teacher (dashed arrows) assumes it to be. The teacher then provides an incorrect next target from the network's current position. 
\nTeacher-forcing (Figure 1b) resets the network back on the circle, where the teacher again provides useful information. \n\nHowever, if the network has hidden units, then the phase space of the visible units is just a projection of the actual phase space of the network, and the teaching signal gives no information as to where the hidden units should be in this higher-dimensional phase space. Hence the hidden unit states, unaltered by teacher forcing, may be entirely unrelated to what they should be. This leads to the moving targets problem. During training, every time the visible units re-visit a point, the hidden unit activations will differ; thus the mapping changes during learning. (See [Pin88, WZ89] for other discussions of teacher forcing.) \n\nWith BPTT, the network is unrolled in time (Figure 2). This unrolling reveals another problem: Suppose in the teaching signal, the visible units' next state is a non-linearly separable function of their current state. Then hidden units are needed between the visible unit layers, but there are no intermediate hidden units in the unrolled network. The network must thus take two time steps to get to the hidden units and back. One can deal with this by giving the teaching signal every other iteration, but clearly, this is not optimal (consider that the hidden units must \"bide their time\" on the alternate steps).^2 \n\nThe trajectories trained by RTRL and BPTT tend to be stable in simulations of simple tasks [Pea89, TCS90], but this stability is paradoxical. Using teacher forcing, the networks are trained to go from a point on the trajectory, to a point within the ball defined by the error criterion \u03b5 (see Figure 4 (a)). However, after learning, the networks behave such that from a place near the trajectory, they head for the trajectory (Figure 4 (b)). Hence the paradox. 
Possible reasons are: 1) the hidden unit moving targets provide training off the desired trajectory, so that if the training is successful, the desired trajectory is stable; 2) we would never consider the training successful if the network \"learns\" an unstable trajectory; 3) the stable dynamics in typical situations have simpler equations than the unstable dynamics [Nak93]. To create an unstable periodic trajectory would imply the existence of stable regions both inside and outside the unstable trajectory; dynamically this is more complicated than a simple periodic attractor. In dynamically complex tasks, a stable trajectory may no longer be the simplest solution, and stability could be a problem. \n\n^2 At NIPS, zero-delay connections to the hidden units were suggested, which is essentially part of the solution given here. \n\nFigure 2: A nonlinearly separable mapping must be computed by the hidden units (the leftmost unit here) every other time step. \n\nFigure 3: The network used for iterated prediction training. Dashed connections are used after learning. \n\nFigure 4: The paradox of attractor learning with teacher forcing. (a) During learning, the network learns to move from the trajectory to a point near the trajectory. (b) After learning, the network moves from nearby points towards the trajectory. \n\nIn summary, we have pointed out several problems in the RTRL (forward-gradient) and BPTT (backward-gradient) classes of training algorithms: \n\n1. Teacher forcing with hidden units is at best an approximation, and leads to the moving targets problem. \n2. Hidden units are not placed properly for some tasks. \n3. Stability is paradoxical. 
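The drift that makes teacher forcing necessary (Figure 1a) is easy to reproduce numerically. In this toy sketch (ours, not from the paper), an arbitrary untrained map is iterated freely from a point on the circular teacher trajectory; the state wanders toward the map's own attractor, so later teacher targets are measured from the wrong place, which is exactly what forcing a reset to the teacher position corrects:

```python
import numpy as np

# Teacher: a pair of sine waves offset by 90 degrees, i.e. a circle
# in the (x1, x2) phase plane, sampled at T points.
T = 100
t = np.arange(T)
teacher = np.stack([np.sin(2 * np.pi * t / T),
                    np.cos(2 * np.pi * t / T)], axis=1)

# An untrained network, stood in for by an arbitrary contracting
# linear map (hypothetical; any initial dynamics would do).
W = np.array([[0.5, 0.3], [-0.2, 0.8]])

# Free-running iteration from the teacher's starting point: the state
# heads for the map's own fixed point at the origin, far from where
# the teacher assumes it is, so the gradient information degrades.
X = teacher[0].copy()
drift = []
for k in range(1, T):
    X = W @ X
    drift.append(float(np.linalg.norm(X - teacher[k])))

# With teacher forcing, each step would instead start from
# teacher[k - 1], keeping every target measured from the trajectory.
```

By the end of one period the free-running state is roughly a full radius away from the teacher, while a forced network would never leave the circle.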
\n\n3 PHASE-SPACE LEARNING \n\nThe inspiration for our approach is prediction training [LF88], which at first appears similar to BPTT, but is subtly different. In the standard scheme, a feedforward network is given a time window, a set of previous points on the trajectory to be learned, as inputs. The output is the next point on the trajectory. Then, the inputs are shifted left and the network is trained on the next point (see Figure 3). Once the network has learned, it can be treated as recurrent by iterating on its own predictions. \n\nThe prediction network differs from BPTT in two important ways. First, the visible units encode a selected temporal history of the trajectory (the time window). The point of this delay space embedding is to reconstruct the phase space of the underlying system. [Tak81] has shown that this can always be done for a deterministic system. \n\nFigure 5: Phase-space learning. (a) The training set is a sample of the vector field. (b) Phase-space learning network. Dashed connections are used after learning. \n\nNote that in the reconstructed phase space, the mapping from one point to the next (based on the vector field) is deterministic. Hence what originally appeared to be a recurrent network problem can be converted into an entirely feedforward problem. Essentially, the delay-space reconstruction makes hidden states visible, and recurrent hidden units unnecessary. Crucially, dynamicists have developed excellent reconstruction algorithms that not only automate the choices of delay and embedding dimension but also filter out noise or get a good reconstruction despite noise [FS91, Dav92, KBA92]. 
On the other hand, we clearly cannot deal with non-deterministic systems by this method. \n\nThe second difference from BPTT is that the hidden units are between the visible units, allowing the network to produce nonlinearly separable transformations of the visible units in a single iteration. In the recurrent network produced by iterated prediction, the sandwiched hidden units can be considered \"fast\" units with delays on the input/output links summing to 1. \n\nSince we are now learning a mapping in phase space, stability is easily ensured by adding additional training examples that converge towards the desired orbit.^3 We can also explicitly control convergence speed by the size and direction of the vectors. \n\nThus, phase-space learning (Figure 5) consists of: (1) embedding the temporal signal to recover its phase space structure, (2) generating local approximations of the vector field near the desired trajectory, and (3) functional approximation of the vector field with a feedforward network. Existing methods developed for these three problems can be directly and independently applied. Since feedforward networks are universal approximators [HSW89], we are assured that at least locally, the trajectory can be represented. The trajectory is recovered from the iterated output of the pre-embedded portion of the visible units. Additionally, we may extend the phase-space learning framework to include inputs that affect the output of the system (see [Tsu94] for details).^4 \n\nIn this framework, training multiple attractors is just training orbits in different parts of the phase space, so they simply add more patterns to the training set. In fact, we can now create designer dynamical systems possessing the properties we want, e.g., with combinations of fixed point, periodic, or chaotic attractors. 
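The three steps can be sketched end-to-end on a toy signal (our own illustration, not code from the paper; ordinary least squares stands in for the feedforward approximator, which suffices here only because the embedded map of a pure sinusoid happens to be linear):

```python
import numpy as np

# A scalar signal with hidden state: x(t) = sin(w t).  Its phase
# space is recovered by a delay embedding with window (x_t, x_{t-1}).
w, N = 0.1, 500
x = np.sin(w * np.arange(N))

# (1) Embed: inputs are delay windows, targets are the next points.
X = np.stack([x[1:-1], x[:-2]], axis=1)   # rows are (x_t, x_{t-1})
y = x[2:]                                  # targets x_{t+1}

# (2)+(3) Approximate the phase-space map.  For a sinusoid the exact
# map is linear, x_{t+1} = 2 cos(w) x_t - x_{t-1}, so least squares
# recovers it; a feedforward net plays this role for general signals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Iterated prediction: run the trained map on its own outputs,
# turning the feedforward approximator into a recurrent system.
window = np.array([x[1], x[0]])
preds = []
for _ in range(N - 2):
    nxt = float(coef @ window)
    preds.append(nxt)
    window = np.array([nxt, window[0]])

err = float(np.max(np.abs(np.array(preds) - x[2:])))
```

Because the learned map matches the vector field on the orbit, the iterated output stays on the sine wave for the full run instead of drifting.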
\n\n^3 The common practice of adding noise to the input in prediction training is just a simple-minded way of adding convergence information. \n\n^4 Principe & Kuo (this volume) show that for chaotic attractors, it is better to treat this as a recurrent net and train using the predictions. \n\nAs an example, to store any number of arbitrary periodic attractors z_i(t) with periods T_i in a single recurrent network, create two new coordinates for each z_i(t), (x_i(t), y_i(t)) = (sin(2\u03c0t/T_i), cos(2\u03c0t/T_i)), where (x_i, y_i) and (x_j, y_j) are disjoint circles for i \u2260 j. Then (x_i, y_i, z_i) is a valid embedding of all the periodic attractors in phase space, and the network can be trained. In essence, the first two dimensions form \"clocks\" for their associated trajectories. \n\n4 SIMULATION RESULTS \n\nIn this section we illustrate the method by learning the van der Pol oscillator (a much more difficult problem than learning sine waves), learning four separate periodic attractors, and learning an attractor inside the basin of another attractor. \n\n4.1 LEARNING THE VAN DER POL OSCILLATOR \n\nThe van der Pol equation is defined by: \n\ndx/dt = v \ndv/dt = a(b - x^2)v - w^2 x \n\nWe used the values 0.7, 1, 1 for the parameters a, b, and w, for which there is a global periodic attractor (Figure 6). We used a step size of 0.1, which discretizes the trajectory into 70 points. The network therefore has two visible units. We used two hidden layers with 20 units each, so that the unrolled, feedforward network has a 2-20-20-2 architecture. We generated 1500 training pairs using the vector field near the attractor. The learning rate was 0.01, scaled by the fan-in, momentum was 0.75, and we trained for 15000 epochs. The order of the training pairs was randomized. \n\nFigure 6: Learning the van der Pol oscillator. (a) The training set. (b) Phase space plot of network (solid curve) and teacher (dots). (c) State space plot. 
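The construction of such a training set can be sketched as follows (our own code, not the paper's: it assumes the standard van der Pol form with the parameter names a, b, w, uses plain Euler steps of size 0.1, and jitters points off the orbit so the sampled vectors also encode convergence toward the attractor; the paper does not specify its exact sampling scheme):

```python
import numpy as np

def vdp(X, a=0.7, b=1.0, w=1.0):
    # Assumed standard van der Pol field in the (x, dx/dt) plane.
    x, v = X
    return np.array([v, a * (b - x ** 2) * v - w ** 2 * x])

def step(X, dt=0.1):
    # One Euler step with the discretization step of 0.1.
    return X + dt * vdp(X)

rng = np.random.default_rng(0)

# Settle onto the limit cycle first, so samples lie near the attractor.
X = np.array([0.5, 0.0])
for _ in range(500):
    X = step(X)

# Training pairs: jitter points off the orbit and pair each with its
# image one step later under the true flow, so the learned map also
# sees vectors from a neighborhood of the trajectory, not just on it.
pairs = []
for _ in range(1500):
    X = step(X)
    src = X + rng.normal(scale=0.05, size=2)
    pairs.append((src, step(src)))
```

A 2-20-20-2 feedforward network would then be fit to map each src to its one-step target; any standard feedforward trainer applies.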
\nThe attractor learned by the network is shown in Figure 6(b). Comparison of the temporal trajectories is shown in Figure 6(c); there is a slight frequency difference. The average MSE is 0.000136. Results from a network with two layers of 5 hidden units each with 500 data pairs were similar (MSE = 0.00034). \n\n4.2 LEARNING MULTIPLE PERIODIC ATTRACTORS \n\nFigure 7: Learning multiple attractors. In each case, a 2-20-20-2 network using conjugate gradient is used. Learning 4 attractors: (A) Training set. (B) Eight trajectories of the trained network. (C) Induced vector field of the network. There are five unstable fixed points. (D) State space behavior as the network is \"bumped\" between attractors. Learning 2 attractors, one inside the other: (E) Training set. (F) Four trajectories of the trained network. (G) Induced vector field of the network. There is an unstable limit cycle between the two stable ones. (H) State space behavior with a \"bump\". \n\n[Hop82] showed how to store multiple fixed-point attractors in a recurrent net. [Bai91] can store periodic and chaotic attractors by inverting the normal forms of these attractors into higher order recurrent networks. However, traditional recurrent training offers no obvious method of training multiple attractors. [DY89] were able to store two limit cycles by starting with fixed points stored in a Hopfield net, and training each fixed point locally to become a periodic attractor. Our approach has no difficulty with multiple attractors. Figure 7 (A-D) shows the result of training four coexisting periodic attractors, one in each quadrant of the two-dimensional phase space. 
The network will remain in one of the attractor basins until an external force pushes it into another attractor basin. Figure 7 (E-H) shows a network with two periodic attractors, this time one inside the other. This vector field possesses an unstable limit cycle between the two stable limit cycles. This is a more difficult task, requiring 40 hidden units, whereas 20 suffice for the previous task (not shown). \n\n5 SUMMARY \n\nWe have presented a phase space view of the learning process in recurrent nets. This perspective has helped us to understand and overcome some of the problems of using traditional recurrent methods for learning periodic and chaotic attractors. Our method can learn multiple trajectories, explicitly insure their stability, and avoid overfitting; in short, we provide a practical approach to learning complicated temporal behaviors. The phase-space framework essentially breaks the problem into three sub-problems: (1) embedding a temporal signal to recover its phase space structure, (2) generating local approximations of the vector field near the desired trajectory, and (3) functional approximation in feedforward networks. We have demonstrated that using this method, networks can learn complex oscillations and multiple periodic attractors. \n\nAcknowledgements \n\nThis work was supported by NIH grant R01 MH46899-01A3. Thanks for comments from Steve Biafore, Kenji Doya, Peter Rowat, Bill Hart, and especially Dave DeMers for his timely assistance with simulations. \n\nReferences \n\n[Bai91] W. Baird and F. Eeckman. CAM storage of analog patterns and continuous sequences with 3n^2 weights. In R.P. Lippmann, J.E. Moody, and D.S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 91-97. Morgan Kaufmann, San Mateo, 1991. \n\n[Dav92] M. Davies. Noise reduction by gradient descent. 
International Journal of Bifurcation and Chaos, 3:113-118, 1992. \n\n[DY89] K. Doya and S. Yoshizawa. Memorizing oscillatory patterns in the analog neuron network. In IJCNN, Washington D.C., 1989. IEEE. \n\n[FS91] J.D. Farmer and J.J. Sidorowich. Optimal shadowing and noise reduction. Physica D, 47:373-392, 1991. \n\n[Hop82] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 1982. \n\n[HSW89] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989. \n\n[KBA92] M.B. Kennel, R. Brown, and H. Abarbanel. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A, 45:3403-3411, 1992. \n\n[LF88] A. Lapedes and R. Farber. How neural nets work. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 442-456, Denver 1987, 1988. American Institute of Physics, New York. \n\n[Nak93] Hiroyuki Nakajima. A paradox in learning trajectories in neural networks. Working paper, Dept. of EE II, Kyoto U., Kyoto, Japan, 1993. \n\n[Pea89] B.A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1:263-269, 1989. \n\n[Pin88] F.J. Pineda. Dynamics and architecture for neural computation. Journal of Complexity, 4:216-245, 1988. \n\n[Tak81] F. Takens. Detecting strange attractors in turbulence. In D.A. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, pages 366-381, Warwick 1980, 1981. Springer-Verlag, Berlin. \n\n[TCS90] F-S. Tsung, G. W. Cottrell, and A. I. Selverston. Some experiments on learning stable network oscillations. In IJCNN, San Diego, 1990. IEEE. \n\n[Tsu94] F-S. Tsung. Modelling Dynamical Systems with Recurrent Neural Networks. 
PhD thesis, University of California, San Diego, 1994. \n\n[WZ89] R.J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270-280, 1989.", "award": [], "sourceid": 943, "authors": [{"given_name": "Fu-Sheng", "family_name": "Tsung", "institution": null}, {"given_name": "Garrison", "family_name": "Cottrell", "institution": null}]}