{"title": "Planning with an Adaptive World Model", "book": "Advances in Neural Information Processing Systems", "page_first": 450, "page_last": 456, "abstract": null, "full_text": "Planning with an Adaptive World Model \n\nSebastian B. Thrun \nGerman National Research \nCenter for Computer \nScience (GMD) \nD-5205 St. Augustin, FRG \n\nKnut Moller \n\nUniversity of Bonn \n\nDepartment of \n\nComputer Science \nD-5300 Bonn, FRG \n\nAlexander Linden \nGerman National Research \nCenter for Computer \nScience (GMD) \nD-5205 St. Augustin, FRG \n\nAbstract \n\nWe present a new connectionist planning method [TML90]. By interaction \nwith an unknown environment, a world model is progressively construc(cid:173)\nted using gradient descent. For deriving optimal actions with respect to \nfuture reinforcement, planning is applied in two steps: an experience net(cid:173)\nwork proposes a plan which is subsequently optimized by gradient descent \nwith a chain of world models, so that an optimal reinforcement may be \nobtained when it is actually run. The appropriateness of this method is \ndemonstrated by a robotics application and a pole balancing task. \n\n1 \n\nINTRODUCTION \n\nWhenever decisions are to be made with respect to some events in the future, \nplanning has been proved to be an important and powerful concept in problem \nsolving. Planning is applicable if an autonomous agent interacts with a world, and \nif a reinforcement is available which measures only the over-all performance of the \nagent. Then the problem of optimizing actions yields the temporal credit assignment \nproblem [Sut84], i.e. the problem of assigning particular reinforcements to particular \nactions in the past. The problem becomes more complicated if no knowledge about \nthe world is available in advance. \nMany connectionist approaches so far solve this problem directly, using techniques \nbased on the interaction of an adaptive world model and an adaptive controller \n[Bar89, Jor89, Mun87]. 
Although such controllers are very fast after training, training itself is rather complex, mainly for two reasons: a) Since the future is not considered explicitly, future effects must be encoded directly into the world model, which complicates model training. b) Since the controller is trained with the world model, training of the former lags behind that of the latter. Moreover, if several optimal actions exist, such controllers generate at most one of them, regardless of all others, since they represent many-to-one functions; e.g., changing the objective function implies the need for an expensive retraining. \n\nFigure 1: The training of the model network is a system identification task. Internal parameters are estimated by gradient descent, e.g. by backpropagation. \n\nIn order to overcome these problems, we applied a planning technique to reinforcement learning problems. A model network which approximates the behavior of the world is used for looking ahead into the future and optimizing actions by gradient descent with respect to future reinforcement. In addition, an experience network is trained in order to accelerate and improve planning. \n\n2 LOOK-AHEAD PLANNING \n\n2.1 SYSTEM IDENTIFICATION \n\nPlanning needs a world model. Training of the world model is adopted from [Bar89, Jor89, Mun87]. Formally, the world maps actions to subsequent states and reinforcements (Fig. 1). The world model used here is a standard non-recurrent or a recurrent connectionist network which is trained by backpropagation or related gradient descent algorithms [WZ88, TS90]. Each time an action is performed on the world, the resulting state and reinforcement are compared with the corresponding prediction by the model network. 
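As an illustration of this identification step, the following minimal sketch fits a deliberately simple linear world model by small gradient steps on the prediction error. The dynamics in true_world and all names here are hypothetical stand-ins chosen for illustration; the paper's model is a connectionist network, not a linear map.

```python
import random

# Minimal sketch of system identification (Fig. 1): a linear model
# x_{t+1} ~ a*x_t + b*u_t + c is fitted online by gradient descent
# on the squared prediction error.

def true_world(x, u):
    # Unknown dynamics the agent interacts with (assumed for illustration).
    return 0.9 * x + 0.5 * u

def train_model(steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b, c = 0.0, 0.0, 0.0           # internal model parameters
    x = 0.0
    for _ in range(steps):
        u = rng.uniform(-1.0, 1.0)    # action performed on the world
        x_next = true_world(x, u)     # observed resulting state
        pred = a * x + b * u + c      # model network's prediction
        err = pred - x_next           # the difference drives adaptation
        a -= lr * err * x             # adapt parameters in small steps
        b -= lr * err * u
        c -= lr * err
        x = x_next
    return a, b, c

a, b, c = train_model()
```

A connectionist network replaces the linear map in the paper, but the predict-compare-adapt loop is the same.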
The difference is used for adapting the internal parameters of the model in small steps, in order to improve its accuracy. The resulting model approximates the world's behavior. \nOur planning technique relies mainly on two fundamental steps: first, a plan is proposed, either by some heuristic or by a so-called experience network; second, this plan is optimized progressively by gradient descent in action space. We will consider the second step first. \n\n2.2 PLAN OPTIMIZATION \n\nIn this section we show the optimization of plans by means of gradient descent. For that purpose, let us assume that an initial plan, i.e. a sequence of N actions, is given. The first action of this plan, together with the current state (and, in the case of a recurrent model network, its current context activations), is fed into the model network (Fig. 2). This gives us a prediction for the subsequent state and reinforcement of the world. If we assume that the state prediction is a good estimate of the next state, we can correspondingly predict the next state and reinforcement from the second action of the plan. This procedure is repeated for each of the N stages of the plan. The final output is a sequence of N reinforcement predictions, which represents the quality of the plan. \n\nFigure 2: Looking ahead by the chain of model networks. Model network (1) receives the 1st action of the plan, model network (2) the 2nd action, ..., model network (N) the Nth action; context units are used for recurrent networks only. \n\nIn order to maximize reinforcement, we establish a differentiable reinforcement energy function E_reinf, which measures the deviation of predicted and desired reinforcement. The problem of optimizing plans is thus transformed into the problem of minimizing E_reinf. Since both E_reinf and the chain of model networks are differentiable, the gradients of the plan with respect to E_reinf can be computed. These gradients are used for changing the plan in small steps, which completes the gradient descent optimization. \nThe whole update procedure is repeated either until convergence is observed or, which is more convenient for real-time applications, for a predefined number of iterations - note that in the latter case the computational effort is linear in N. From the planning procedure we obtain the optimized plan, the first action¹ of which is then performed on the world. Then the whole procedure is repeated. \nThe gradients of the plan with respect to E_reinf can be computed either by backpropagation through the chain of models or by a feed-forward algorithm related to [WZ88, TS90]: hand in hand with the activations we also propagate the gradients \n\n e_j^s(τ) = ∂ activation_j(τ) / ∂ action_i(s)   (1) \n\nthrough the chain of models. Here i labels all action input units and j all units of the whole model network, τ (1≤τ≤N) is the time associated with the τth model of the chain, and s (1≤s≤τ) is the time of the action. Thus, the influence of the sth action on the later activations of the chain of networks, including all predictions, is measured by e_j^s(τ). \nIt has been shown in an earlier paper that this gradient can easily be propagated forward through the network [TML90]: \n\n e_j^s(τ) = \n   δ_ij · δ_τs   if j is an action input unit \n   0   if τ=1 ∧ j is a state/context input unit \n   e_j'^s(τ-1)   if τ>1 ∧ j is a state/context input unit (j' the corresponding output unit of the preceding model) \n   logistic'(net_j(τ)) · Σ_{l∈pred(j)} weight_jl · e_l^s(τ)   otherwise   (2) \n\n¹If an unknown world is to be explored, this action might be disturbed by adding a small random variable. 
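The optimization loop can be sketched as follows. For illustration, the chain of model networks is replaced by a known linear world x_{t+1} = x_t + u_t and E_reinf by a squared distance to a goal state, so the forward-propagated gradients ∂x(τ)/∂u_s are simply 1 for τ ≥ s; all names are hypothetical.

```python
# Minimal sketch of plan optimization by gradient descent through a
# chain of (here: known, linear) world models.

def rollout(x0, plan):
    """Chain the model N times; return the predicted states x(1..N)."""
    states, x = [], x0
    for u in plan:
        x = x + u                 # one model network applied at stage tau
        states.append(x)
    return states

def optimize_plan(x0, goal, plan, lr=0.1, iters=200):
    n = len(plan)
    for _ in range(iters):
        states = rollout(x0, plan)
        # Forward-propagated gradients: d x(tau) / d u_s = 1 for tau >= s,
        # so dE/du_s accumulates over all later predictions.
        grad = [0.0] * n
        for s in range(n):
            for tau in range(s, n):
                grad[s] += -2.0 * (goal - states[tau])
        # Change the plan in small steps (gradient descent on the energy).
        plan = [u - lr * g for u, g in zip(plan, grad)]
    return plan

plan = optimize_plan(x0=0.0, goal=1.0, plan=[0.0, 0.0, 0.0])
final_state = rollout(0.0, plan)[-1]
```

Each iteration rolls the plan through the chained model, accumulates the gradient of the energy with respect to every action, and changes the plan in small steps, mirroring the procedure described above.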
\nThe reinforcement energy to be minimized is defined as \n\n E_reinf = 1/2 · Σ_{τ=1..N} Σ_k g_k(τ) · (reinf_k* - activation_k(τ))²   (3) \n\n(Here k numbers the reinforcement output units, reinf_k* is the desired reinforcement value, usually reinf_k* = 1 for all k, and g_k weights the reinforcement with respect to τ and k; in the simplest case g_k(τ) = 1.) Since E_reinf is differentiable, we can compute the gradient of E_reinf with respect to each particular reinforcement prediction. From these gradients and the gradients e_k^s of the reinforcement prediction units, the gradients \n\n ∂E_reinf / ∂action_i(s) = - Σ_{τ=1..N} Σ_k g_k(τ) · (reinf_k* - activation_k(τ)) · e_k^s(τ)   (4) \n\nare derived, which indicate how to change the plan in order to minimize E_reinf. \nVariable plan lengths: The feed-forward manner of the propagation makes it possible to vary the number of look-ahead steps according to the current accuracy of the model network. Intuitively, if the model network has a relatively large error, looking far into the future makes little sense. A good heuristic is to avoid further look-ahead if the current linear error (on the training patterns) of the model network is larger than the effect of the first action of the plan on the current predictions; this effect is exactly the gradients e_j^1(τ). Using variable plan lengths may overcome the difficulty of finding an appropriate plan length N a priori. \n\n2.3 INITIAL PLANS - THE EXPERIENCE NETWORK \n\nIt remains to show how to obtain initial plans. There are several basic strategies which are more or less problem-dependent, e.g. random plans, the average over previous actions, etc. 
Obviously, if some planning took place before, the problem of finding an initial plan reduces to the problem of finding a single action, since the rest of the previous plan is a good candidate for the next initial plan. \nA good way of finding this action is the experience network. This network is trained to predict the result of the planning procedure by observing the world's state and, in the case of recurrent networks, the temporal context information from the model network. The target values are the results of the planning procedure. Although the experience network is trained like a controller [Bar89], it is used in a different way, since the resulting actions are further optimized by the planning procedure. Thus, even if the knowledge of the experience network lags behind the model network's, the derived actions are optimized with respect to the \"knowledge\" of the model network rather than that of the experience network. On the other hand, as the optimization is gradually shifted into the experience network, planning can be progressively shortened. \n\n3 APPROACHING A ROLLING BALL WITH A ROBOT ARM \n\nWe applied planning with an adaptive world model to a simulation of a real-time robotics task: a robot arm in 3-dimensional space was to approach a rolling ball. Both the hand position (i.e. x, y, z and hand angle) and the ball position (i.e. x', y') were observed by a camera system in workspace. Conversely, actions were defined as angular changes of the robot joints in configuration space. Model and experience networks are shown in Fig. 3a. Note that the ball movement was predicted by a recurrent Elman-type network, since only the current ball position was visible at any time. The arm prediction is mathematically more sophisticated, because kinematics and inverse kinematics are required to solve it analytically. \n\nFigure 3: (a) The recurrent model network (white) and the experience network (grey) for the robotics task. (b) Planning in x-y-space (H: current hand position, B: current ball position; 1-10: plans): starting with the initial plan 1, the optimization finally leads to plan 10. The first action of this plan is then performed on the world. \n\nThe reason why planning makes sense for this task is that we did not want the robot arm to minimize the distance between hand and ball at each step - this would obviously yield trajectories in which the hand follows the ball (Fig. 4). \n\nFigure 4: Basic strategy: the arm \"follows\" the ball. \n\nInstead, we wanted the system to find short cuts by making predictions about the ball's next movement. Thus, the reinforcement measured the distance in workspace. Fig. 3b illustrates a \"typical\" planning process with look-ahead N=4, 9 iterations, g_k(τ) = 1.3^τ (cf. (3))², a weighted stepsize η = 0.05 · 0.9^τ, and well-trained model and experience networks. Starting with an initial plan 1 from the experience network, the optimization led to plan 10. It is easy to see that the resulting action surpassed the initial plan, which demonstrates the appropriateness of the optimization. The final trajectory is shown in Fig. 5. \n\nFigure 5: Planning: the arm finds the short cut. \n\n²This exponential function is crucial for minimizing later distances rather than earlier ones. \n\nWe were now interested in modifying the behavior of the arm. Without further training of either the model or the experience network, we wanted the arm to approach the ball from above. For this purpose we changed the energy function (3): before the arm was to approach the ball, the energy was minimal if the arm reached a position exactly above the ball. 
Since the experience network was not trained for that task, we doubled the number of iteration steps. The resulting trajectory is shown in Fig. 6. \n\nFigure 6: The arm approaches from above due to a modified energy function. \n\nA first implementation on a real robot arm with a camera system showed similar results. \n\n4 POLE BALANCING \n\nNext, we applied our planning method to the pole balancing task adopted from [And89]. One main difference to the task described above is the fact that gradient descent is not applicable with binary reinforcement, since the better the approximation by the world model, the more the gradient vanishes. This effect can be prevented by using a second model network with weight decay, which is trained on the same training patterns. Weight decay smoothes the binary mapping. By using the first model network for prediction only and the smoothed network for gradient propagation, the pole balancing problem became solvable. We see this as a general technique for applying gradient descent to binary reinforcement tasks. \nWe were especially interested in the dependency between look-ahead and the duration of balancing. It turned out that for most randomly chosen initial configurations of pole and cart, a look-ahead of N=4 was sufficient to balance the pole for more than 20000 steps. If the cart is moved randomly, the pole falls after 10 movements on average. \n\n5 DISCUSSION \n\nThe planning procedure presented in this paper has two crucial limitations. By using a bounded look-ahead, effects of actions on reinforcement beyond this bound cannot be taken into account. Even if the plan lengths are kept variable (as described above), each particular planning process must use a finite plan. Moreover, using gradient descent as a search heuristic implies the danger of getting stuck in local minima. It might be interesting to investigate other search heuristics. 
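The smoothing technique from Section 4 can be sketched as follows. Here a flat logistic stands in for the weight-decayed second network (a hypothetical substitute, since training two networks is beyond a short sketch); the binary mapping itself yields zero gradient almost everywhere, while the smoothed surrogate supplies usable gradients.

```python
import math

# Sketch of the two-model trick for binary reinforcement: one model
# matches the binary signal for prediction, a second, smoothed model
# supplies the gradients for descent.

def binary_reinf(x):
    return 1.0 if abs(x) < 0.5 else 0.0   # e.g. "pole still balanced?"

def smooth_reinf(x, k=2.0):
    # Smoothed stand-in for the weight-decayed network: a shallow
    # logistic approximation of the binary mapping.
    return 1.0 / (1.0 + math.exp(k * (abs(x) - 0.5)))

def smooth_grad(x, k=2.0, eps=1e-4):
    # Central finite difference of the smoothed model.
    return (smooth_reinf(x + eps, k) - smooth_reinf(x - eps, k)) / (2 * eps)

# Gradient ascent on the smoothed reinforcement; the binary model
# alone would give a zero gradient everywhere and the state would
# never move into the rewarded region.
x = 2.0
for _ in range(200):
    x += 0.1 * smooth_grad(x)
```

After the loop, the state has been driven into the region where the binary reinforcement is 1, which gradient descent on the binary model itself could never achieve.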
\nOn the other hand, this planning algorithm overcomes certain problems of adaptive controller networks, namely: a) Training is relatively fast, since the model network does not have to encode temporal effects. b) Decisions are optimized with respect to the current \"knowledge\" in the system, and no controller lags behind the model network. c) The incorporation of additional constraints into the objective function at runtime is possible, as demonstrated. d) By using a probabilistic experience network, the planning algorithm is able to act as a non-deterministic many-to-many controller. However, we have not yet investigated the latter point. \n\nAcknowledgements \n\nThe authors thank Jörg Kindermann and Frank Smieja for many fruitful discussions, and Michael Contzen and Michael Faßbender for their help with the robot arm. \n\nReferences \n\n[And89] C. W. Anderson. Learning to control an inverted pendulum using neural networks. IEEE Control Systems Magazine, 9(3):31-37, 1989. \n\n[Bar89] A. G. Barto. Connectionist learning for control: An overview. Technical Report COINS TR 89-89, Dept. of Computer and Information Science, University of Massachusetts, Amherst, MA, September 1989. \n\n[Jor89] M. I. Jordan. Generic constraints on underspecified target trajectories. In Proceedings of the First International Joint Conference on Neural Networks, Washington, DC. IEEE TAB NN Committee, San Diego, 1989. \n\n[Mun87] P. Munro. A dual back-propagation scheme for scalar-reward learning. In Ninth Annual Conference of the Cognitive Science Society, pages 165-176, Hillsdale, NJ, 1987. Lawrence Erlbaum. \n\n[Sut84] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, 1984. \n\n[TML90] S. Thrun, K. Möller, and A. Linden. Adaptive look-ahead planning. In G. Dorffner, editor, Proceedings KONNAI/ÖGAI. Springer, Sept. 1990. \n\n[TS90] S. Thrun and F. Smieja. A general feed-forward algorithm for gradient descent in connectionist networks. TR 483, GMD, FRG, Nov. 1990. \n\n[WZ88] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. TR ICS Report 8805, Institute for Cognitive Science, University of California, San Diego, CA, 1988. \n", "award": [], "sourceid": 365, "authors": [{"given_name": "Sebastian", "family_name": "Thrun", "institution": null}, {"given_name": "Knut", "family_name": "M\u00f6ller", "institution": null}, {"given_name": "Alexander", "family_name": "Linden", "institution": null}]}