{"title": "Nonparametric Model-Based Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1008, "page_last": 1014, "abstract": null, "full_text": "Nonparametric Model-Based \n\nReinforcement Learning \n\nChristopher G. Atkeson \n\nCollege of Computing, Georgia Institute of Technology, \n\nAtlanta, GA 30332-0280, USA \n\nATR Human Information Processing, \n\n2-2 Hikaridai, Seiko-cho, Soraku-gun, 619-02 Kyoto, Japan \n\ncga@cc.gatech.edu \n\nhttp://www.cc.gatech.edu/fac/Chris.Atkeson/ \n\nAbstract \n\nThis paper describes some of the interactions of model learning \nalgorithms and planning algorithms we have found in exploring \nmodel-based reinforcement learning. The paper focuses on how lo(cid:173)\ncal trajectory optimizers can be used effectively with learned non(cid:173)\nparametric models. We find that trajectory planners that are fully \nconsistent with the learned model often have difficulty finding rea(cid:173)\nsonable plans in the early stages of learning. Trajectory planners \nthat balance obeying the learned model with minimizing cost (or \nmaximizing reward) often do better, even if the plan is not fully \nconsistent with the learned model. \n\n1 \n\nINTRODUCTION \n\nWe are exploring the use of nonparametric models in robot learning (Atkeson et al., \n1997b; Atkeson and Schaal , 1997). This paper describes the interaction of model \nlearning algorithms and planning algorithms, focusing on how local trajectory opti(cid:173)\nmization can be used effectively with nonparametric models in reinforcement learn(cid:173)\ning. We find that trajectory optimizers that are fully consistent with the learned \nmodel often have difficulty finding reasonable plans in the early stages of learning . \nThe message of this paper is that a planner should not be entirely consistent with \nthe learned model during model-based reinforcement learning. 
Trajectory optimizers that balance obeying the learned model with minimizing cost (or maximizing reward) often do better, even if the plan is not fully consistent with the learned model. \n\nFigure 1: A: Planning in terms of trajectory segments. B: Planning in terms of trajectories all the way to a goal point. \n\nTwo kinds of reinforcement learning algorithms are direct (non-model-based) and indirect (model-based). Direct reinforcement learning algorithms learn a policy or value function without explicitly representing a model of the controlled system (Sutton et al., 1992). Model-based approaches learn an explicit model of the system simultaneously with a value function and policy (Sutton, 1990, 1991a,b; Barto et al., 1995; Kaelbling et al., 1996). We will focus on model-based reinforcement learning, in which the learner uses a planner to derive a policy from a learned model and an optimization criterion. \n\n2 CONSISTENT LOCAL PLANNING \n\nAn efficient approach to dynamic programming, a form of global planning, is to use local trajectory optimizers (Atkeson, 1994). These local planners find a plan for each starting point in a grid in the state space. Figure 1 compares the output of a traditional cell-based dynamic programming process with the output of a planner based on integrating local plans. Traditional dynamic programming generates trajectory segments from each cell to neighboring cells, while the planner we use generates entire trajectories. 
These locally optimal trajectories have local policies and local models of the value function along the trajectories (Dyer and McReynolds, 1970; Jacobson and Mayne, 1970). The locally optimal trajectories are made consistent with their neighbors by using the local value function to predict the value of a neighboring trajectory. If all the local value functions are consistent with their neighbors, the aggregate value function is a unique solution to the Bellman equation and the corresponding trajectories and policy are globally optimal. We would like any local planning algorithm to produce a local model of the value function so we can perform this type of consistency checking. We would also like a local policy from the local planner, so we can respond to disturbances and modeling errors. \n\nDifferential dynamic programming is a local planner that has these characteristics (Dyer and McReynolds, 1970; Jacobson and Mayne, 1970). Differential dynamic programming maintains a local quadratic model of the value function along the current best trajectory x*(t): \n\nV(x, t) = V0(t) + Vx(t)^T (x - x*(t)) + 0.5 (x - x*(t))^T Vxx(t) (x - x*(t))    (1) \n\nas well as a local linear model of the corresponding policy: \n\nu(x, t) = u*(t) + K(t)(x - x*(t))    (2) \n\nu(x, t) is the local policy at time t, the control signal u as a function of state x. u*(t) is the model's estimate of the control signal necessary to follow the current best trajectory x*(t). K(t) are the feedback gains that alter the control signals in response to deviations from the current best trajectory. These gains are also the first derivative of the policy along the current best trajectory. \n\nThe first phase of each optimization iteration is to apply the current local policy to the learned model, integrating the modeled dynamics forward in time and seeing where the simulated trajectory goes. 
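This forward pass can be sketched in a few lines of code. The following is a minimal illustration, not the paper's implementation: the learned dynamics model f, the nominal trajectory (x_star, u_star), and the feedback gains K are all hypothetical placeholders for quantities a differential dynamic programming iteration would supply.

```python
import numpy as np

def ddp_forward_pass(f, x0, x_star, u_star, K):
    """Roll the current locally linear policy through a learned model.

    f      : learned dynamics model, x_{k+1} = f(x_k, u_k)  (placeholder)
    x0     : start state, shape (n,)
    x_star : (T, n) nominal states along the current best trajectory
    u_star : (T, m) nominal controls
    K      : (T, m, n) feedback gains (first derivative of the policy)
    """
    xs, us = [x0], []
    x = x0
    for t in range(len(u_star)):
        # local linear policy, Eq. 2: u(x, t) = u*(t) + K(t)(x - x*(t))
        u = u_star[t] + K[t] @ (x - x_star[t])
        # integrate the modeled dynamics forward one step
        x = f(x, u)
        us.append(u)
        xs.append(x)
    return np.array(xs), np.array(us)
```

With an accurate model and stabilizing gains the rollout tracks the nominal trajectory; with an inaccurate model or an unstable plant this is exactly the step that can blow up, as discussed below.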
The second phase of the differential dynamic programming approach is to calculate the components of the local quadratic model of the value function at each point along the trajectory: the constant term V0(t), the gradient Vx(t), and the Hessian Vxx(t). These terms are constructed by integrating backwards in time along the trajectory. The value function is used to produce a new policy, which is represented using a new x*(t), u*(t), and K(t). \n\nThe availability of a local value function and policy is an attractive feature of differential dynamic programming. However, we have found several problems when applying this method to model-based reinforcement learning with nonparametric models: \n\n1. Methods that enforce consistency with the learned model need an initial trajectory that obeys that model, which is often difficult to produce. \n\n2. The integration of the learned model forward in time often blows up when the learned model is inaccurate or when the plant is unstable and the current policy fails to stabilize it. \n\n3. The backward integration to produce the value function and a corresponding policy uses derivatives of the learned model, which are often quite inaccurate in the early stages of learning, producing inaccurate value function estimates and ineffective policies. \n\n3 INCONSISTENT LOCAL PLANNING \n\nTo avoid the problems of consistent local planners, we developed a trajectory optimization approach that does not integrate the learned model and does not require full consistency with the learned model. Unfortunately, the price of these modifications is that the method does not produce a value function or a policy, just a trajectory (x(t), u(t)). To allow inconsistency with the learned model, we represent the state history x(t) and the control history u(t) separately, rather than calculate x(t) from the learned model and u(t). 
We also modify the original optimization criterion C = Σ_k C(x_k, u_k) by changing the hard constraint that x_{k+1} = f(x_k, u_k) on each time step into a soft constraint: \n\nC_new = Σ_k [ C(x_k, u_k) + λ |x_{k+1} - f(x_k, u_k)|^2 ]    (3) \n\nC(x_k, u_k) is the one-step cost in the original optimization criterion. λ is the penalty on the trajectory being inconsistent with the learned model x_{k+1} = f(x_k, u_k). |x_{k+1} - f(x_k, u_k)| is the magnitude of the mismatch of the trajectory and the model prediction at time step k in the trajectory. λ provides a way to control the amount of inconsistency. A small λ reflects lack of confidence in the model, and allows the optimized trajectory to be inconsistent with the model in favor of reducing C(x_k, u_k). A large λ reflects confidence in the model, and forces the optimized trajectory to be more consistent with the model. λ can increase with time or with the number of learning trials. If we use a model that estimates the confidence level of a prediction, we can vary λ for each lookup based on x_k and u_k. Locally weighted learning techniques provide exactly this type of local confidence estimate (Atkeson et al., 1997a). \n\nFigure 2: The SARCOS robot arm with a pendulum gripped in the hand. The pendulum axis is aligned with the fingers and with the forearm in this arm configuration. \n\nNow that we are not integrating the trajectory we can use more compact representations of the trajectory, such as splines (Cohen, 1992) or wavelets (Liu et al., 1994). We no longer require that x_{k+1} = f(x_k, u_k), which is a condition difficult to fulfill without having x and u represented as independent values on each time step. We can now parameterize the trajectory using the spline knot points, for example. 
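The soft-constraint criterion of Eq. 3 translates directly into code. The sketch below is an illustration under assumed array shapes, not the paper's implementation; the learned model f and the one-step cost function are placeholders.

```python
import numpy as np

def soft_constraint_cost(xs, us, f, one_step_cost, lam):
    """Eq. 3: sum of one-step costs plus lam times the squared mismatch
    between each state and the learned model's one-step prediction.

    xs : (T+1, n) state history, us : (T, m) control history.
    """
    total = 0.0
    for k in range(len(us)):
        # mismatch between the trajectory and the model: x_{k+1} - f(x_k, u_k)
        mismatch = xs[k + 1] - f(xs[k], us[k])
        total += one_step_cost(xs[k], us[k]) + lam * float(mismatch @ mismatch)
    return total
```

A small lam lets the optimizer trade model consistency for lower one-step cost; a large lam drives the trajectory toward the model's predictions, recovering the hard constraint in the limit.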
\nIn this work we used B splines (Cohen, 1992) to represent the trajectory. Other choices for spline basis functions would probably work just as well. We can use any nonlinear programming or function optimization method to minimize the criterion in Eq. 3. In this work we used Powell's method (Press et al., 1988) to optimize the knot points, a method which is convenient to use but not particularly efficient. \n\n4 IMPLEMENTATION ON AN ACTUAL ROBOT \n\nBoth local planning methods work well with learned parametric models. However, differential dynamic programming did not work at all with learned nonparametric models, for reasons already discussed. This section describes how the inconsistent local planning method was used in an application of model-based reinforcement learning: robot learning from demonstration using a pendulum swing up task (Atkeson and Schaal, 1997). The pendulum swing up task is a more complex version of the pole or broom balancing task (Spong, 1995). The hand holds the axis of the pendulum, and the pendulum rotates about this hinge in an angular movement (Figure 2). Instead of starting with the pendulum vertical and above its rotational joint, the pendulum is hanging down from the hand, and the goal of the swing up task is to move the hand so that the pendulum swings up and is then balanced in the inverted position. The swing up task was chosen for study because it is a difficult dynamic maneuver and requires practice for humans to learn, but it is easy to tell if the task is successfully executed (at the end of the task the pendulum is balanced upright and does not fall down). \n\nWe implemented learning from demonstration on a hydraulic seven degree of freedom anthropomorphic robot arm (SARCOS Dextrous Arm located at ATR, Figure 2). The robot observed its own performance with the same stereo vision system that was used to observe the human demonstrations. \n\nFigure 3: The hand and pendulum motion during robot learning from demonstration using a nonparametric model (curves: human demonstration, 1st trial (imitation), 2nd trial, 3rd trial). \n\nThe robot observed a human swinging up a pendulum using a horizontal hand movement (dotted line in Figure 3). The most obvious approach to learning from demonstration is to have the robot imitate the human motion, by following the human hand trajectory. The dashed lines in Figure 3 show the robot hand motion as it attempts to follow the human demonstration of the swing up task, and the corresponding pendulum angles. Because of differences in the task dynamics for the human and for the robot, this direct imitation failed to swing the pendulum up, as the pendulum did not get even halfway up to the vertical position, and then oscillated about the hanging down position. \n\nThe approach we used was to apply a planner to finding a swing up trajectory that worked for the robot, based on learning both a model and a reward function and using the human demonstration to initialize the planning process. The data collected during the initial imitation trial and subsequent trials was used to build a model. Nonparametric models were constructed using locally weighted learning as described in (Atkeson et al., 1997a). 
These models did not use knowledge of the model structure but instead assumed a general relationship: \n\nθ̈ = f(θ, θ̇, x, ẋ, ẍ)    (4) \n\nwhere θ is the pendulum angle and x is the hand position. Training data from the demonstrations was stored in a database, and a local model was constructed to answer each query. Meta-parameters such as distance metrics were tuned using cross validation on the training set. For example, cross validation was able to quickly establish that hand position and velocity (x and ẋ) played an insignificant role in predicting future pendulum angular velocities. \n\nThe planner used a cost function that penalizes deviations from the demonstration trajectory sampled at 60 Hz: \n\nC(x_k, u_k) = (x_k - x_k^d)^T (x_k - x_k^d) + u_k^T u_k    (5) \n\nwhere the state is x = (θ, θ̇, x, ẋ), x^d is the demonstrated motion, k is the sample index, and the control is u = (ẍ). Equation 3 was optimized using B splines to represent x and u. The knot points for x and u were initially separately optimized to minimize \n\nΣ_k (x_k - x_k^d)^T (x_k - x_k^d)    (6) \n\nand \n\nΣ_k |x_{k+1} - f(x_k, u_k)|^2    (7) \n\nThe tolerated inconsistency λ was kept constant during a set of trials and set at values ranging from 100 to 100000. The exact value of λ did not make much difference. Learning failed when λ was set to zero, as there was no way for the learned model to affect the plan. The planning process failed when λ was set too high, enforcing the learned model too strongly. \n\nThe next attempt got the pendulum up a little more. Adding this new data to the database and replanning resulted in a movement that succeeded (trial 3 in Figure 3). The behavior shown in Figure 3 is quite repeatable. The balancing behavior at the end of the trial is learned separately and continues for several minutes, at which point the trial is automatically terminated (Schaal, 1997). \n\n5 DISCUSSION AND CONCLUSION \n\nWe applied locally weighted regression (Atkeson et al. 
, 1997a) in an attempt to avoid the structural modeling errors of idealized parametric models during model-based reinforcement learning, and also to see if a priori knowledge of the structure of the task dynamics was necessary. In an exploration of the swingup task, we found that these nonparametric models required a planner that ignored the learned model to some extent. The fundamental reason for this is that planners amplify modeling error. Mechanisms for this amplification include: \n\n• The planners take advantage of any modeling error to reduce the cost of the planned trajectory, so the planning process seeks out modeling error that reduces apparent cost. \n\n• Some planners use derivatives of the model, which amplifies any noise in the model. \n\nModels that support fast learning will have errors and noise. For example, in order to learn a model of the complexity necessary to accurately model the full robot dynamics between the commanded and actual hand accelerations, a large amount of data is required, independent of modeling technique. The input would be 21 dimensional (robot state and command) ignoring actuator dynamics. Because there are few robot trials during learning, there is not enough data to make such a model even just in the vicinity of a successful trajectory. If it were required that enough data be collected during learning to make an accurate model, robot learning would be greatly slowed down. \n\nOne solution to this error amplification is to bias the nonparametric modeling tools to oversmooth the data. This reduces the benefit of nonparametric modeling, and also ignores the true learned model to some degree. Our solution to this problem is to introduce a controlled amount of inconsistency with the learned model into the planning process. The control parameter λ 
is explicit and can be changed as a function of time, amount of data, or as a function of confidence in the model at the query point. \n\nReferences \n\nAtkeson, C. G. (1994). Using local trajectory optimizers to speed up global optimization in dynamic programming. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 663-670. Morgan Kaufmann, San Mateo, CA. \n\nAtkeson, C. G., Moore, A. W., and Schaal, S. (1997a). Locally weighted learning. Artificial Intelligence Review, 11:11-73. \n\nAtkeson, C. G., Moore, A. W., and Schaal, S. (1997b). Locally weighted learning for control. Artificial Intelligence Review, 11:75-113. \n\nAtkeson, C. G. and Schaal, S. (1997). Robot learning from demonstration. In Proceedings of the 1997 International Conference on Machine Learning. \n\nBarto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81-138. \n\nCohen, M. F. (1992). Interactive spacetime control for animation. Computer Graphics, 26(2):293-302. \n\nDyer, P. and McReynolds, S. (1970). The Computational Theory of Optimal Control. Academic, NY. \n\nJacobson, D. and Mayne, D. (1970). Differential Dynamic Programming. Elsevier, NY. \n\nKaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285. \n\nLiu, Z., Gortler, S. J., and Cohen, M. F. (1994). Hierarchical spacetime control. Computer Graphics (SIGGRAPH '94 Proceedings), pages 35-42. \n\nPress, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1988). Numerical Recipes in C. Cambridge University Press, New York, NY. \n\nSchaal, S. (1997). Learning from demonstration. In Mozer, M. 
C., Jordan, M., and Petsche, T., editors, Advances in Neural Information Processing Systems 9, pages 1040-1046. MIT Press, Cambridge, MA. \n\nSpong, M. W. (1995). The swing up control problem for the acrobot. IEEE Control Systems Magazine, 15(1):49-55. \n\nSutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Seventh International Machine Learning Workshop, pages 216-224. Morgan Kaufmann, San Mateo, CA. http://envy.cs.umass.edu/People/sutton/publications.html. \n\nSutton, R. S. (1991a). Dyna, an integrated architecture for learning, planning and reacting. http://envy.cs.umass.edu/People/sutton/publications.html, Working Notes of the 1991 AAAI Spring Symposium on Integrated Intelligent Architectures pp. 151-155 and SIGART Bulletin 2, pp. 160-163. \n\nSutton, R. S. (1991b). Planning by incremental dynamic programming. In Eighth International Machine Learning Workshop, pages 353-357. Morgan Kaufmann, San Mateo, CA. http://envy.cs.umass.edu/People/sutton/publications.html. \n\nSutton, R. S., Barto, A. G., and Williams, R. J. (1992). Reinforcement learning is direct adaptive optimal control. IEEE Control Systems Magazine, 12:19-22. \n", "award": [], "sourceid": 1476, "authors": [{"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}