{"title": "Learning to Control an Unstable System with Forward Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 324, "page_last": 331, "abstract": null, "full_text": "324 \n\nJordan and Jacobs \n\nLearning to Control an Unstable System with \n\nForward Modeling \n\nMichael I. Jordan \n\nBrain and Cognitive Sciences \n\nMIT \n\nCambridge, MA 02139 \n\nRobert A. Jacobs \n\nComputer and Information Sciences \n\nUniversity of Massachusetts \n\nAmherst, MA 01003 \n\nABSTRACT \n\nThe forward modeling approach is a methodology for learning con(cid:173)\ntrol when data is available in distal coordinate systems. We extend \nprevious work by considering how this methodology can be applied \nto the optimization of quantities that are distal not only in space \nbut also in time. \n\nIn many learning control problems, the output variables of the controller are not \nthe natural coordinates in which to specify tasks and evaluate performance. Tasks \nare generally more naturally specified in \"distal\" coordinate systems (e.g., endpoint \ncoordinates for manipulator motion) than in the \"proximal\" coordinate system of \nthe controller (e.g., joint angles or torques). Furthermore, the relationship between \nproximal coordinates and distal coordinates is often not known a priori and, if \nknown, not easily inverted. \n\nThe forward modeling approach is a methodology for learning control when train(cid:173)\ning data is available in distal coordinate systems. A forward model is a network \nthat learns the transformation from proximal to distal coordinates so that distal \nspecifications can be used in training the controller (Jordan & Rumelhart, 1990). \nThe forward model can often be learned separately from the controller because it \ndepends only on the dynamics of the controlled system and not on the closed-loop \ndynamics. 
\n\nIn previous work, we studied forward models of kinematic transformations (Jordan, \n1988, 1990) and state transitions (Jordan & Rumelhart, 1990). In the current paper, \n\n\fLearning to Control an Unstable System with Forward Modeling \n\n325 \n\nwe go beyond the spatial credit assignment problems studied in those papers and \nbroaden the application of forward modeling to include cases of temporal credit \nassignment (cf. Barto, Sutton, & Anderson, 1983; Werbos, 1987). As discussed \nbelow, the function to be modeled in such cases depends on a time integral of the \nclosed-loop dynamics. This fact has two important implications. First, the data \nneeded for learning the forward model can no longer be obtained solely by observing \nthe instantaneous state or output of the plant. Second, the forward model is no \nlonger independent of the controller: If the parameters of the controller are changed \nby a learning algorithm, then the closed-loop dynamics change and so does the \nmapping from proximal to distal variables. Thus the learning of the forward model \nand the learning of the controller can no longer be separated into different phases. \n\n1 FORWARD MODELING \nIn this section we briefly summarize our previous work on forward modeling (see \nalso Nguyen & Widrow, 1989 and Werbos, 1987). \n\n1.1 LEARNING A FORWARD MODEL \n\nGiven a fixed control law , the learning of a forward model is a system identification \nproblem. Let z = g(s, u) be a system to be modeled, where z is the output or the \nstate-derivative, s is the state, and u is the control. We require the forward model \nto minimize the cost functional \n\nJm = ~ J (z - z)T(z - z)dt. \n\n(1) \n\n(3) \n\nwhere z = 9(s, u, v) is the parameterized function computed by the model. Once \nthe minimum is found, backpropagation through the model provides an estimate \n\u00a5u of the system Jacobian matrix :~ (cf. Jordan, 1988). 
\n\n1.2 LEARNING A CONTROLLER \n\nOnce the forward model is sufficiently accurate, it can be used in the training of the \ncontroller. Backpropagation through the model provides derivatives that indicate \nhow to change the outputs of the controller. These derivatives can be used to \nchange the parameters of the controller by a further application of back propagation. \nFigure 1 illustrates the general procedure. \n\nThis procedure minimizes the \"distal\" cost functional \n\n(2) \n\nwhere z\u00b7 is a reference signal. To see this, let the controller output be given as a \nfunction u = f(s, z\u00b7, w) of the state s\u00b7, the reference signal z\u00b7, and a parameter \nvector w. Differentiating J with respect to w yields \n\n\"w J = -\n\nJ ouT ozT \now ou (z\u00b7 - z)dt. \n\n\f326 \n\nJordan and Jacobs \n\n\\ \n\n~ \n\nFeedforward \nController \n\nz* \n\nx \n\nPlant \n\nz \n\nForward \n\n- -Model - -\n\n+ \n\n-\n\nFigure 1: Learning a Controller. The Dashed Line Represents Backpropagation. \n\nThe Jacobian matrix \u00a5u cannot be assumed to be available a priori, but can be \nestimated by backpropagation through the forward model. Thus the error signal \navailable for learning the controller is the estimated gradient \n\n.. \nV'wJ = -\n\nJ ou oz \nT 0' T \n-\n-\now OU \n\n\u2022 \n\n(z - z)dt. \n\n(4) \n\nWe now consider a task in which the foregoing framework must be broadened to \nallow a more general form of distal task specification. \n\n2 THE TASK \nThe task is to learn to regulate an unstable nonminimum-phase plant. We have \nchosen the oft-studied (e.g., Barto, Sutton, & Anderson, 1983; \\Vidrow & Smith, \n1964) problem of learning to balance an inverted pendulum on a moving cart. 
The plant dynamics are given by:\n\n\begin{bmatrix} M+m & ml\cos\theta \\ ml\cos\theta & I \end{bmatrix} \begin{bmatrix} \ddot{x} \\ \ddot{\theta} \end{bmatrix} + \begin{bmatrix} -ml\dot{\theta}^2 \sin\theta \\ -mgl\sin\theta \end{bmatrix} = \begin{bmatrix} F \\ 0 \end{bmatrix}\n\nwhere m is the mass of the pole, M is the mass of the cart, l is half the pole length, I is the inertia of the pole around its base, and F is the force applied to the cart. The task we studied is similar to that studied by Barto, Sutton, & Anderson (1983). A state-feedback controller provides forces to the cart, and the system evolves until failure occurs (the cart reaches the end of the track or the pole reaches a critical angle). The system learns from failure; indeed, it is assumed that the only teaching information provided by the environment is the signal that failure has occurred.\n\nFigure 2: The Network Architecture. (The forward model, with its temporal difference output unit, receives the action unit and sign/magnitude encodings of the state variables; the controller receives the state.)\n\nThere are several differences between our task and that studied by Barto, Sutton, & Anderson (1983). First, disturbances (white noise) are provided by the environment rather than by the learning algorithm. This implies that in our experiments the level of noise seen by the controller does not diminish to zero over the course of learning. Second, we used real-valued forces rather than binary forces. Finally, we do not assume the existence of a \"reset button\" that reinitializes the system to the origin of state space; upon failure the system is restarted in a random configuration.\n\n3 OUR APPROACH\n\nIn our approach, the control system learns a model that relates the current state of the plant and the current control signal to a prediction of future failure.
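The cart-pole equations of motion above can be simulated by solving the 2x2 linear system for the accelerations at each step. The sketch below uses forward Euler integration; the parameter values and the uniform-pole inertia formula are illustrative assumptions, not the settings used in our simulations:

```python
import numpy as np

m, M, l, g = 0.1, 1.0, 0.5, 9.8
I = (4.0 / 3.0) * m * l**2   # assumed: uniform pole of length 2l, inertia about its base

def step(state, F, dt=0.02):
    """One Euler step of the cart-pole equations of motion above."""
    x, x_dot, th, th_dot = state
    Mmat = np.array([[M + m, m * l * np.cos(th)],
                     [m * l * np.cos(th), I]])
    # Move the velocity- and gravity-dependent vector to the right-hand side.
    rhs = np.array([F + m * l * np.sin(th) * th_dot**2,
                    m * g * l * np.sin(th)])
    x_ddot, th_ddot = np.linalg.solve(Mmat, rhs)
    return (x + dt * x_dot, x_dot + dt * x_ddot,
            th + dt * th_dot, th_dot + dt * th_ddot)

# With zero force the upright equilibrium is unstable: a small tilt grows.
state = (0.0, 0.0, 0.05, 0.0)
for _ in range(50):
    state = step(state, F=0.0)
```

Running this with F = 0 shows the instability that makes the task nontrivial: the pole angle diverges from vertical rather than returning to it.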
We make use of a temporal difference algorithm (Sutton, 1988) to learn the transformation from (state, action) pairs to an estimate of the inverse of the time until failure. This mapping is then used as a differentiable forward model in the learning of the controller: the controller is changed so as to minimize the output of the model and thereby maximize the time until failure.\n\nThe overall system architecture is shown in Figure 2. We describe each component in detail in the following sections.\n\nAn important feature that distinguishes this architecture from previous work (e.g., Barto, Sutton, & Anderson, 1983) is the path from the action unit into the forward model. This path is necessary for supervised learning algorithms to be used (see also Werbos, 1987).\n\n3.1 LEARNING THE FORWARD MODEL\n\nTemporal difference algorithms learn to make long term predictions by achieving local consistency between predictions at neighboring time steps, and by grounding the chain of predictions when information from the environment is obtained. In our case, if z(t) is the inverse of the time until failure, then consistency is defined by the requirement that z^{-1}(t) = z^{-1}(t+1) + 1. The chain is grounded by defining z(T) = 1, where T is the time step on which failure occurs.\n\nTo learn to estimate the inverse of the time until failure, the following temporal difference error terms are used. For time steps on which failure does not occur,\n\n\hat{e}(t) = \frac{1}{1 + \hat{z}^{-1}(t+1)} - \hat{z}(t),\n\nwhere \hat{z}(t) denotes the output of the forward model. When failure occurs, the target for the forward model is set to unity:\n\n\hat{e}(t) = 1 - \hat{z}(t).\n\nThe error signal \hat{e}(t) is propagated backwards at time t+1 using activations saved from time t. Standard backpropagation is used to compute the changes to the weights.
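The temporal difference error terms above can be sketched directly. The function name and the toy horizon below are illustrative; the point is that a prediction sequence equal to the true inverse time-until-failure makes every error zero, which is exactly the self-consistency the algorithm enforces:

```python
def td_error(z_hat_t, z_hat_next, failed):
    """Temporal difference error for the inverse time-until-failure prediction."""
    if failed:
        # At failure the target is unity: e(t) = 1 - z_hat(t).
        return 1.0 - z_hat_t
    # Otherwise: e(t) = 1 / (1 + z_hat(t+1)^-1) - z_hat(t).
    return 1.0 / (1.0 + 1.0 / z_hat_next) - z_hat_t

# Failure at step T; the true inverse time-until-failure at step t is
# 1 / (T - t + 1), with z(T) = 1 grounding the chain.
T = 5
z_hat = [1.0 / (T - t + 1) for t in range(T + 1)]
errors = [td_error(z_hat[t], z_hat[t + 1], failed=False) for t in range(T)]
errors.append(td_error(z_hat[T], None, failed=True))
```

All entries of `errors` vanish (up to rounding), confirming that the chain of exact predictions is locally consistent and correctly grounded.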
\n\n3.2 LEARNING THE CONTROLLER \nIf the controller is performing as desired, then the output of the forward model \nis zero (that is, the predicted time-until-failure is infinity). This suggests that an \nappropriate distal error signal for the controller is zero minus the output of the \nforward model. \nGiven that the forward model has the control action as an input, the distal error \ncan be propagated backward to the hidden units of the forward model, through the \naction unit, and into the controller where the weights are changed (see Figure 2). \nThus the controller is changed in such a way as to minimize the output of the \nforward model and thereby maximize the time until failure. \n\n3.3 LEARNING THE FORWARD MODEL AND THE CONTROLLER \n\nSIMULTANEOUSLY \n\nAs the controller varies, the mapping that the forward model must learn also varies. \nThus, if the forward model is to provide reasonable derivatives, it must be contin(cid:173)\nuously updated as the controller changes. We find that it is possible to train the \nforward model and the controller simultaneously, provided that we use a larger \nlearning rate for the forward model than for the controller. \n\n\fLearning to Control an Unstable System with Forward Modeling \n\n329 \n\n4 MISCELLANY \n4.1 RESET \n\nAlthough previous studies have assumed the existence of a \"reset button\" that \ncan restart the system at the origin of state space, we prefer not to make such an \nassumption. A reset button implies the existence of a controller that can stabilize \nthe system, and the task of learning is to find such a controller. In our simulations, \nwe restart the system at random points in state space after failure occurs. \n\n4.2 REDUNDANCY \n\nThe mapping learned by the forward model depends on both the state and the ac(cid:173)\ntion. The action, however, is itself a function of the state, so the action unit provides \nredundant information. 
This implies that the forward model could have arbitrary weights in the path from the action unit and yet make reasonable predictions. Such a model, however, would yield meaningless derivatives for learning the controller. Fortunately, backpropagation tends to produce meaningful weights for a path that is correlated with the outcome, even if that path conveys redundant information. To further bias things in our favor, we found it useful to employ a larger learning rate in the path from the action unit to the hidden units of the forward model (0.9) than in the path from the state units (0.3).\n\n4.3 REPRESENTATION\n\nAs seen in Figure 2, we chose input representations that take advantage of symmetries in the dynamics of the cart-pole system. The forward model has even symmetry with respect to the state variables, whereas the controller has odd symmetry.\n\n4.4 LONG-TERM BEHAVIOR\n\nThere is never a need to \"turn off\" the learning of the forward model. Once the pole is being successfully balanced in the presence of fluctuations, the average time until failure goes to infinity. The forward model therefore learns to predict zero in the region of state space around the origin, and the error propagated to the controller also goes to zero.\n\n5 RESULTS\n\nWe ran twenty simulations starting with random initial weights. The learning rate for the controller was 0.05 and the learning rate for the forward model was 0.3, except for the connection from the action unit, where the learning rate was 0.9. Eighteen runs converged to controller configurations that balanced the pole, and two runs converged on local minima. Figure 3 shows representative learning curves for six of the successful runs.
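The symmetry constraints of Section 4.3 can be illustrated with toy encodings. The particular features below are assumptions chosen for illustration, not the exact sign/magnitude encoding of Figure 2; they show what even symmetry of the model inputs and odd symmetry of the controller mean operationally:

```python
import numpy as np

def even_features(state):
    """Even in the state: unchanged when every state variable is negated."""
    return np.abs(state)

def odd_controller(state, w):
    """Odd in the state: the force flips sign when the state is negated."""
    return float(w @ state)

state = np.array([0.2, -0.1, 0.05, 0.3])   # x, x_dot, theta, theta_dot
w = np.array([1.0, 0.5, -2.0, 0.7])        # arbitrary illustrative weights

even_gap = even_features(state) - even_features(-state)          # zero vector
odd_gap = odd_controller(state, w) + odd_controller(-state, w)   # zero
```

Building these symmetries into the input representation means the networks need not learn them from data: a mirror-image state automatically yields the same failure prediction and a mirror-image force.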
\n\nTo obtain some idea of the size of the space of correct solutions, we performed an \nexhaustive search of a lattice in a rectangular region of weight space that contained \n\n\f330 \n\nJordan and Jacobs \n\n1000 \n\n800 \n\n600 \n\n.00 \n\n200 \n\nAverage \n\ntime \n\nuntil failure \n\no \n\n500 \n\n1000 \n\n1500 \n\nBins \n\n(1 bin ., 20 fillur \u2022\u2022 ) \n\nFigure 3: Learning Curves for Six Runs \n\nall of the weight configurations found by our simulations. As shown in Figure 4, \nonly 15 out of 10,000 weight configurations were able to balance the pole. \n\n6 CONCLUSIONS \nPrevious wor k within the forward modeling paradigm focused on models of fixed \nkinematic or dynamic properties of the controlled plant (Jordan, 1988,1990; Jordan \n&, Rumelhart, 1990). In the current paper, the notion of a forward model is broader. \nThe function that must be modeled depends not only on properties of the controlled \nplant, but also on properties of the controller. Nonetheless, the mapping is well(cid:173)\ndefined, and the results demonstrate that it can be used to provide appropriate \nincremental changes for the controller. \n\nThese results provide further demonstration of the applicability of supervised learn(cid:173)\ning algorithms to learning control problems in which explicit target information is \nnot available. \n\nAcknowledgments \n\nThe first author was supported by BRSG 2 S07 RR07047-23 awarded by the Biomed(cid:173)\nical Research Support Grant Program, Division of Research Resources, National \nInstitutes of Health and by a grant from Siemens Corporation. The second au(cid:173)\nthor was supported by the Air Force Office of Scientific Research, through grant \nAFOSR-87 -0030. \n\n\fLearning to Control an Unstable System with Forward Modeling \n\n331 \n\n\u2022 \n\nLog \n\nFrequency \n\n3 \u2022 \n\u2022 \n\u2022 \n2 \u2022 \n\u2022 \u2022 \n). \n-. 
-\u2022\u2022 \n\no \n\n\u2022 \n\n200 \n\n.00 \n\n100 \n\n100 \n\n1000 \n\n0+---44.-----~--.-r_----r_--~ \n\nMedian Time Steps Until Failure \n\nFigure 4: Performance of Population of Controllers \n\nReferences \nBarto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive el(cid:173)\nements that can solve difficult learning control problems. IEEE Transactions on \nSystems, Man, and Cybernetics, SMC.19, 834-846. \nJordan, M. I. (1988). Supervised learning and systems with excess degress of free(cid:173)\ndom. (COINS Tech. Rep. 88-27). Amherst, MA: University of Massachusetts, \nComputer and Information Sciences. \nJordan, M. I. (1990). Motor learning and the degrees of freedom problem. In M. \nJeannerod, (Ed). Attention and Performance, XIII. Hillsdale, NJ: Erlbaum. \nJordan, M. I. & Rumelhart, D. E. (1990). Supervised learning with a distal teacher. \nPaper in preparation. \nNguyen, D. & Widrow, B. (1989). The truck backer-upper: An example of self(cid:173)\nlearning in neural networks. In: Proceedings of the International Joint Conference \non Neural Networks. Piscataway, NJ: IEEE Press. \nSutton, R. S. (1987). Learning to predict by the methods of temporal differences. \nMachine Learning, 9, 9-44. \nWerbos, P. (1987). Building and understanding adaptive systems: A statisti(cid:173)\ncal/numerical approach to factory automation and brain research. IEEE Trans(cid:173)\nactions on Systems, Man, and Cybernetics, 17, 7-20. \nWidrow, B. & Smith, F. W. (1964). Pattern-recognizing control systems. In: Com(cid:173)\nputer and Information Sciences Proceedings, Washington, D.C.: Spartan. \n\n\f", "award": [], "sourceid": 199, "authors": [{"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Robert", "family_name": "Jacobs", "institution": null}]}