{"title": "Fast Learning with Predictive Forward Models", "book": "Advances in Neural Information Processing Systems", "page_first": 563, "page_last": 570, "abstract": null, "full_text": "Fast Learning with Predictive Forward Models \n\nCarlos Brody\u00b7 \n\nDept. of Computer Science \n\nlIMAS, UNAM \n\nMexico D.F. 01000 \n\nMexico. \n\ne-mail: carlos@hope. caltech. edu \n\nAbstract \n\nA method for transforming performance evaluation signals distal both in \nspace and time into proximal signals usable by supervised learning algo(cid:173)\nrithms, presented in [Jordan & Jacobs 90], is examined. A simple obser(cid:173)\nvation concerning differentiation through models trained with redundant \ninputs (as one of their networks is) explains a weakness in the original \narchitecture and suggests a modification: an internal world model that \nencodes action-space exploration and, crucially, cancels input redundancy \nto the forward model is added. Learning time on an example task, cart(cid:173)\npole balancing, is thereby reduced about 50 to 100 times. \n\n1 \n\nINTRODUCTION \n\nIn many learning control problems, the evaluation used to modify (and thus im(cid:173)\nprove) control may not be available in terms of the controller's output: instead, it \nmay be in terms of a spatial transformation of the controller's output variables (in \nwhich case we shall term it as being \"distal in space\"), or it may be available only \nseveral time steps into the future (termed as being \"distal in time\"). For example, \ncontrol of a robot arm may be exerted in terms of joint angles, while evaluation may \nbe in terms of the endpoint cartesian coordinates; furthermore, we may only wish \nto evaluate the endpoint coordinates reached after a certain period of time: the co-\n\n\u00b7Current address: Computation and Neural Systems Program, California Institute of \n563 \n\nTechnology, Pasadena CA. \n\n\f564 \n\nBrody \n\nordinates reached at the end of some motion, for instance. 
In such cases, supervised learning methods are not directly applicable, and other techniques must be used. Here we study one such technique (proposed for cases where the evaluation is distal in both space and time by [Jordan & Jacobs 90]), analyse a source of its problems, and propose a simple solution for them which leads to fast, efficient learning. We first describe two methods, and then combine them into the \"predictive forward modeling\" technique with which we are concerned. \n\n1.1 FORWARD MODELING \n\n\"Forward Modeling\" [Jordan & Rumelhart 90] is useful for dealing with evaluations which are distal in space; it involves the construction of a differentiable model to approximate the controller-action \u2192 evaluation transformation. Let our controller have internal parameters w, output c, and be evaluated in space e, where e = e(c) is an unknown but well-defined transformation. If there is a desired output in space e, called e*, we can write an \"error\" function, that is, an evaluation we wish minimised, and differentiate it w.r.t. the controller's weights to obtain \n\nE = (e* - e)^2, \n\n\u2202E/\u2202w = (\u2202c/\u2202w) \u00b7 (\u2202e/\u2202c) \u00b7 (\u2202E/\u2202e) (1) \n\nUsing a differentiable controller allows us to obtain the first factor in the second equation, and the third factor is also known; but the second factor is not. However, if we construct a differentiable model (called a \"forward model\") of e(c), then we can obtain an approximation to the second term by differentiating the model, and use this to obtain an estimate of the gradient \u2202E/\u2202w through equation (1); this can then be used for comparatively fast minimisation of E, and is what is known as \"forward modeling\". \n\n1.2 PREDICTIVE CRITICS \n\nTo deal with evaluations which are distal in time, we may use a \"critic\" network, as in [Barto, Sutton & Anderson 83]. 
For a particular control policy implemented by the controller network, the critic is trained to predict the final evaluation that will be obtained given the current state - using, for example, Sutton's TD algorithm [Sutton 88]. The estimated final evaluation is then available as soon as we enter a state, and so may in turn be used to improve the control policy. This approach is closely related to dynamic programming [Barto, Sutton & Watkins 89]. \n\n1.3 PREDICTIVE FORWARD MODELS \n\nWhile the estimated evaluation we obtain from the critic is no longer distal in time, it may still be distal in space. A natural proposal in such cases, where the evaluation signal is distal both in space and time, is to combine the two techniques described above: use a differentiable model as a predictive critic [Jordan & Jacobs 90]. If we know the desired final evaluation, we can then proceed as in equation (1) and obtain the gradient of the error w.r.t. the controller's weights. Schematically, this would look like figure 1. \n\nFigure 1: Jordan and Jacobs' predictive forward modeling architecture. Solid lines indicate data paths, the dashed line indicates back propagation. (The controller network maps the state vector to a control vector; the predictive model takes the state and control vectors and outputs the predicted evaluation.) \n\nWhen using a backprop network for the predictive model, we would backpropagate through it, through its control input, and then into the controller to modify the controller network. We should note that since predictions make no sense without a particular control policy, and the controller is only modified through the predictive model, both networks must be trained simultaneously. \n\n[Jordan & Jacobs 90] applied this method to a well-known problem, that of learning to balance an inverted pendulum on a movable cart by exerting appropriate horizontal forces on the cart. 
The same task, without differentiating the critic, was studied in [Barto, Sutton & Anderson 83]. There, reinforcement learning methods were used instead to modify the controller's weights; these perform a search which in some cases may be shown to follow, on average, the gradient of the expected evaluation w.r.t. the network weights. Since differentiating the critic allows this gradient to be found directly, one would expect much faster learning when using the architecture of figure 1. However, Jordan and Jacobs' results show precisely the opposite: it is surprisingly slow. \n\n2 THE REDUNDANCY PROBLEM \n\nWe can explain the above surprising result if we consider the fact that the predictive model network has redundant inputs: the control vector c is a function of the state vector (call this c = \u03b7(s)). Let \u03ba and \u03c3 be the number of components of the control and state vectors, respectively. Instead of drawing its inputs from the entire volume of (\u03ba+\u03c3)-dimensional input space, the predictor is trained only with inputs which lie on the \u03c3-dimensional manifold defined by the relation \u03b7. Away from the manifold the network is free to produce entirely arbitrary outputs. Differentiation of the model will then provide non-arbitrary gradients only for directions tangential to the manifold; this is a condition that the axes of the control dimensions will not, in general, satisfy.\u00b9 This observation, which concerns any model trained with redundant inputs, is the very simple yet principal point of this paper. \n\nOne may argue that since the control policy is continually changing, the redundancy picture sketched out here is not in fact accurate: as the controller is modified, many \n\n\u00b9Note that if \u03b7 is single-valued, there is no way the manifold can \"fold around\" to cover all (or most) of the \u03ba + \u03c3 input space. 
\n\npossible control policies are \"seen\" by the predictor, thus creating volume in input space and leading to correct gradients obtained from the predictor. However, the way in which this modification occurs is significant. An argument based on empirical observations will be made to support this. \n\nFigure 2: The evaluation as a function of control action. Curves A, B, C, D represent possible (wrong) estimates of the \"real\" curve made by the predictive model network. \n\nConsider the example shown in figure 2. The graph shows what the \"real\" evaluation at some point in state space is, as a function of a component of the control action taken at that point; this function is what the predictive network should approximate. Suppose the function implemented by the predictive network initially looks like the curve which crosses the \"real\" evaluation function at point (a); suppose also that the current action taken also corresponds to point (a). Here we see a one-dimensional example of the redundancy problem: though the prediction at this point is entirely accurate, the gradient is not. If we wish to minimise the predicted evaluation, we would change the action in the direction of point (b). Examples of point (a) will no longer be presented to the predictive network, so it could quite plausibly modify itself simply so as to look like the estimated evaluation curve \"B\" which is shown crossing point (b) (a minimal change necessary to continue being correct). Again, the gradient is wrong and minimising the prediction will change the action in the same direction as before, perhaps to point (c); then to (d), and so on. Eventually, the prediction, though accurate, will have zero gradient, as in curve \"D\", and no modifications will occur. 
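The failure mode sketched in figure 2 can be reproduced in a few lines. The snippet below is a toy construction for illustration only (it is not the paper's simulation): a linear predictor is fit to purely on-policy data, where the control input c = \u03b7(s) = s is redundant with the state input and the true evaluation is e = 2c. The fit is exact, yet the gradient the model reports along the control axis is wrong; sampling actions independently of the state recovers the true gradient.

```python
import numpy as np

# Toy redundancy demonstration (my construction, not the paper's task):
# scalar state s, policy c = eta(s) = s, true evaluation e = 2*c.
rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, 200)
c_onpolicy = s.copy()                      # redundant input: c = eta(s) = s
X_red = np.stack([s, c_onpolicy], axis=1)
e = 2 * c_onpolicy

# Least-squares fit on the redundant (rank-1) data: the minimum-norm
# solution splits the weight evenly, so d(e_hat)/dc = 1, not the true 2.
w_red, *_ = np.linalg.lstsq(X_red, e, rcond=None)

# Breaking the redundancy -- sampling c independently of s, i.e. drawing
# from the whole (state x action) space -- makes the gradient correct.
c_explore = rng.uniform(-1, 1, 200)
X_full = np.stack([s, c_explore], axis=1)
w_full, *_ = np.linalg.lstsq(X_full, 2 * c_explore, rcond=None)
```

Both fits predict the training data perfectly; only the off-policy one assigns the evaluation's dependence to the control input rather than arbitrarily mixing it with the state.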
In practice, we have observed networks \"getting stuck\" in this fashion. Though the objective was to minimise the evaluation, the system stops \"learning\" at a point far from optimal. \n\nThe problem may be solved, as Jordan and Jacobs did, by introducing noise in the controller's output, thus breaking the redundancy. Unfortunately, this degrades signal quality and means that since we are predicting future evaluations, we wish to predict the effects of future noise - a notoriously difficult objective. The predictive network eventually outputs the evaluation's expectation value, but this can take a long time. \n\n3 USING AN INTERMEDIATE MODEL \n\n3.1 AN EXTRA WORLD MODEL \n\nAnother way to solve the redundancy problem is through the use of what is here called an \"intermediate model\": a model of the world the controller is interacting with. That is, if s(t) represents the state vector at time t, and c(t) the controller output at time t, it is a model of the function f where s(t + 1) = f(s(t), c(t)). This model is used as represented schematically in figure 3. \n\nFigure 3: The proposed system architecture. Again, solid lines represent data paths while the dashed line represents backpropagation (or differentiation). (The controller network maps the state vector to a control vector; the intermediate (world) model maps state and control to a predicted state; the predictive model maps the predicted state to a predicted evaluation.) \n\nIt helps in modularising the learning task faced by the predictive model [Chrisley 90], but more interestingly, it need not be trained simultaneously with the controller since its output does not depend on future control policy. Hence, it can be trained separately, with examples drawn from its entire (state \u00d7 action) input space, providing gradient signals without arbitrary components when differentiated. 
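As a minimal illustration of this pipeline (the names, toy linear dynamics, and closed-form evaluation below are my own, chosen for transparency; they are not the paper's networks): the world model is fit off-policy over the full (state \u00d7 action) space, frozen, and then differentiated to obtain the gradient of the evaluation with respect to the control.

```python
import numpy as np

rng = np.random.default_rng(1)

def world(s, c):
    # Assumed true dynamics for the sketch: s' = 0.9*s + 0.3*c
    return 0.9 * s + 0.3 * c

# 1. Train the intermediate model on samples drawn from the whole
#    (state x action) space -- no input redundancy here.
s = rng.uniform(-1, 1, 500)
c = rng.uniform(-1, 1, 500)
X = np.stack([s, c], axis=1)
w_model, *_ = np.linalg.lstsq(X, world(s, c), rcond=None)
# w_model is now frozen: it does not depend on the control policy.

# 2. For clarity, take a known evaluation E(s') = s'**2 in place of the
#    trained predictive model.
# 3. The control gradient flows through the frozen world model by the
#    chain rule: dE/dc = (dE/ds') * (ds'/dc).
def dE_dc(s0, c0):
    s_next = w_model[0] * s0 + w_model[1] * c0
    return 2 * s_next * w_model[1]
```

Because the intermediate model saw the full input space during its own training, the gradient it supplies to the controller has no arbitrary component; at s = 0.5, c = 0.2 it agrees with the analytic value 2(0.9\u00b70.5 + 0.3\u00b70.2)\u00b70.3.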
Once trained, we freeze the intermediate model's weights and insert it into the system as in figure 3; we then proceed to train the controller and predictive model as before. The predictive model will no longer have redundant inputs when trained either, so it too will provide correct gradient signals. Since all arbitrary components have been eliminated, the speedup expected from using differentiable predictive models should now be obtainable.\u00b2 \n\n3.2 AN EXAMPLE TASK \n\nThe intermediate model architecture was tested on the same example task as used by Jordan and Jacobs, that of learning to balance a pole which is attached through a hinge on its lower end to a movable cart. The control action is a real-valued force applied to the cart; the evaluation signal is a \"0\" while the pole has not fallen over and the cart has not reached the edge of the finite-sized track it is allowed to move on, and a \"1\" when either of these events happens. A trial is then said to have failed, and terminates.\u00b3 \n\n\u00b2This same architecture was independently proposed in [Werbos 90], but without the explanation as to why the intermediate model is necessary instead of merely desirable. \n\nFigure 4: The evolution of eight different learning networks, using the intermediate model. (Horizontal axis: learning trial.) \n\nWe count the number of learning trials needed before a controller is able to keep the pole balanced for a significant amount of time (measured in simulated seconds). Figure 4 shows the evolution of eight networks; most reach balancing solutions within 100 to 300 failures. 
(These successful networks came from a batch of eleven: the other three never reached solutions.) This is 50 to 100 times faster than without the intermediate model, where 5000 to 30000 trials were needed to achieve similar balancing times [Jordan & Jacobs 90]. \n\nWe must now take into account the overhead needed to train the intermediate model. This was done in 200 seconds of simulated time, while training the whole system typically required some 400 seconds; the overhead is small compared to the improvement achieved through the use of the intermediate model. However, off-line training of the intermediate model requires an additional agency to organise the selection and presentation of training examples. In the real world, we would either need some device which could initialise the system at any point in state space, or we would have to train through \"flailing\": applying random control actions, over many trials, so as to eventually cover all possible states and actions. As the dimensionality of the state representation rises for larger problems, intermediate model training will become more difficult. \n\n\u00b3The differential equations which were used as a model of this system may be found in [Barto, Sutton & Anderson 83]. The parameters of the simulations were identical to those used in [Jordan & Jacobs 90]. \n\n3.3 REMARKS \n\nWe should note that the need for covering all of state space is not merely due to the requirement of training an intermediate model: dynamic-programming based techniques such as the ones mentioned in this paper are guaranteed to lead us to an optimal control solution only if we explore the entire state space during learning. This is due to their generality, since no a priori structure of the state space is assumed. 
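The \"flailing\" scheme for covering the (state \u00d7 action) space can be sketched as follows. The dynamics are the standard cart-pole equations of [Barto, Sutton & Anderson 83] with the usual benchmark parameter values, integrated by Euler's method; the code organisation, thresholds, and function names are my own assumptions rather than the paper's implementation.

```python
import numpy as np

G, M_CART, M_POLE, L, DT = 9.8, 1.0, 0.1, 0.5, 0.02  # L = pole half-length

def step(state, force):
    """One Euler step of the Barto et al. cart-pole dynamics."""
    x, x_dot, th, th_dot = state
    total = M_CART + M_POLE
    tmp = (force + M_POLE * L * th_dot**2 * np.sin(th)) / total
    th_acc = (G * np.sin(th) - np.cos(th) * tmp) / (
        L * (4.0 / 3.0 - M_POLE * np.cos(th)**2 / total))
    x_acc = tmp - M_POLE * L * th_acc * np.cos(th) / total
    return np.array([x + DT * x_dot, x_dot + DT * x_acc,
                     th + DT * th_dot, th_dot + DT * th_acc])

def flail(n_steps, rng):
    """Random real-valued actions over many short trials, collecting
    (state, action, next_state) tuples to train the intermediate model."""
    data, state = [], np.zeros(4)
    for _ in range(n_steps):
        force = rng.uniform(-10.0, 10.0)
        nxt = step(state, force)
        data.append((state, force, nxt))
        # Restart the trial when the pole falls or the cart leaves the track
        # (assumed thresholds: ~12 degrees, 2.4 m).
        failed = abs(nxt[2]) > 0.21 or abs(nxt[0]) > 2.4
        state = np.zeros(4) if failed else nxt
    return data

data = flail(1000, np.random.default_rng(2))
```

Each trial ends quickly under random forcing, so many restarts occur and the collected tuples spread over the reachable region of state space; the tuples are exactly the supervised examples the intermediate model needs.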
It might be possible to interleave the training of the intermediate model with the training of the controller and predictor networks, so as to achieve both concurrently. High-dimensional problems will still be problematic, but not just due to intermediate model training; the curse of dimensionality is not easily avoided! \n\n4 CONCLUSIONS \n\nIf we differentiate through a model trained with redundant inputs, we eliminate possible arbitrary components (which are due to the arbitrary mixing of the inputs that the model may use) only if we differentiate tangentially along the manifold defined by the relationship between the inputs. For the architecture presented in [Jordan & Jacobs 90], this is problematic, since the axes of the control vector will typically not be tangential to the manifold. Once we take this into account, it is clear why the architecture was not as efficient as expected; and we can introduce an \"intermediate\" world model to avoid the problems that it had. \n\nUsing the intermediate model allows us to correctly obtain (through backpropagation, or differentiation) a real-valued vector evaluation on the controller's output. On the example task presented here, this led to a 50- to 100-fold increase in learning speed, and suggests a much better scaling-up performance and applicability to real-world problems than simple reinforcement learning, where real-valued outputs are not permitted, and vector control outputs would train very slowly. \n\nAcknowledgements \n\nMany thanks are due to Richard Rohwer, who supervised the beginning of this project, and to M. I. Jordan and R. Jacobs, who answered questions enlighteningly; thanks are also due to Dr F. Bracho at IIMAS, UNAM, who provided the environment for the project's conclusion. This work was supported by scholarships from CONACYT in Mexico and from Caltech in the U.S. \n\nReferences \n\n[Ackley 88] D. H. 
Ackley, \"Associative Learning via Inhibitory Search\", \n\nin \nD. S. Touretzky, ed., Advances in Neural Information Processing Systems 1, \nMorgan Kaufmann 1989 \n\n[Barto, Sutton & Anderson 83] A. G. Barto, R. S. Sutton, and C. W. Anderson, \n\"Neuronlike Adaptive Elements that can Solve Difficult Control Problems\", \nIEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-13, No.5, \nSept/Oct. 1983 \n\n\f570 \n\nBrody \n\n[Barto, Sutton & Watkins 89] A. G. Barto, R. S. Sutton, and C. J. C. H. Watkins, \n\"Learning and Sequential Decision Making\", University of Massachusetts at \nAmherst COINS Technical Report 89-95, September 1989 \n\n[Chrisley 90] R. L. Chrisley, \"Cognitive Map Construction and Use: A Parallel Dis(cid:173)\ntributed Approach\", in Touretzky, Elman, Sejnowski, and Hinton, eds., Con(cid:173)\nnectionist Models: Proceedings of the 1990 Summer School, Morgan Kaufmann \n1991, \n\n[Jordan & Jacobs 90] M. I. Jordan and R. A. Jacobs, \"Learning to Control an Un(cid:173)\n\nstable System with Forward Modeling\", in D. S. Touretzky, ed., Advances in \nNeural Information Processing Systems 2, Morgan Kaufmann 1990 \n\n[Jordan & Rumelhart 90] M. I. Jordan and D. E. Rumelhart, \"Supervised learning \n\nwith a Distal Teacher\" , preprint. \n\n[Nguyen & Widrow 90] D. Nguyen and B. Widrow, ''The Truck Backer-Upper: An \nExample of Self-Learning in Neural Networks\", in Miller, Sutton and Werbos, \neds., Neural Networks for Control, MIT Press 1990 \n\n[Sutton 88] R. S. Sutton, \"Learning to Predict by the Methods of Temporal Differ(cid:173)\n\nences\", Machine Learning 3: 9-44, 1988 \n\n[Werbos 90] P. Werbos, \"Architectures for Reinforcement Learning\", in Miller, Sut(cid:173)\n\nton and Werbos, eds., Neural Networks for Control, MIT Press 1990 \n\n\f", "award": [], "sourceid": 492, "authors": [{"given_name": "Carlos", "family_name": "Brody", "institution": null}]}