{"title": "Using Local Trajectory Optimizers to Speed Up Global Optimization in Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 663, "page_last": 670, "abstract": null, "full_text": "U sing Local Trajectory Optimizers To \n\nSpeed Up Global Optimization In \n\nDynamic Programming \n\nChristopher G. Atkeson \n\nDepartment of Brain and Cognitive Sciences and \n\nthe Artificial Intelligence Laboratory \n\nMassachusetts Institute of Technology, NE43-771 \n545 Technology Square, Cambridge, MA 02139 \n\n617-253-0788, cga@ai.mit.edu \n\nAbstract \n\nDynamic programming provides a methodology to develop planners \nand controllers for nonlinear systems. However, general dynamic \nprogramming is computationally intractable. We have developed \nprocedures that allow more complex planning and control problems \nto be solved. We use second order local trajectory optimization \nto generate locally optimal plans and local models of the value \nfunction and its derivatives. We maintain global consistency of the \nlocal models of the value function, guaranteeing that our locally \noptimal plans are actually globally optimal, up to the resolution of \nour search procedures. \n\nLearning to do the right thing at each instant in situations that evolve over time is \ndifficult, as the future cost of actions chosen now may not be obvious immediately, \nand may only become clear with time. Value functions are a representational tool \nthat makes the consequences of actions explicit. Value functions are difficult to \nlearn directly, but they can be built up from learned models of the dynamics of the \nworld and the cost function. This paper focuses on how fast optimizers that only \nproduce locally optimal answers can playa useful role in speeding up the process \nof computing or learning a globally optimal value function. 
Consider a system with dynamics x_{k+1} = f(x_k, u_k) and a cost function L(x_k, u_k), where x is the state of the system and u is a vector of actions or controls. The subscript k serves as a time index, but will be dropped in the equations that follow. A goal of reinforcement learning and optimal control is to find a policy that minimizes the total cost, which is the sum of the costs for each time step. One approach to doing this is to construct an optimal value function, V(x). The value of this value function at a state x is the sum of all future costs, given that the system started in state x and followed the optimal policy P(x) (chose optimal actions at each time step as a function of the state). A local planner or controller could choose globally optimal actions if it knew the future cost of each action. This cost is simply the sum of the cost of taking the action right now and the future cost of the state that the action leads to, which is given by the value function:

u* = arg min_u ( L(x, u) + V(f(x, u)) )    (1)

Value functions are difficult to learn. The environment does not provide training examples that pair states with their optimal cost (x, V(x)). In fact, the optimal policy depends on the optimal value function, which in turn depends on the optimal policy. Algorithms to compute value functions typically iteratively refine a candidate value function and/or a corresponding policy (dynamic programming). These algorithms are usually expensive. We use local optimization to generate locally optimal plans and local models of the value function and its derivatives. We maintain global consistency of the local models of the value function, guaranteeing that our locally optimal plans are actually globally optimal, up to the resolution of our search procedures.
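Equation (1) can be illustrated with a one-step lookahead over a discretized action set. The dynamics f, cost L, and value function V below are simple hypothetical stand-ins (a double integrator with quadratic cost), not the models used in this paper; the point is only the argmin structure of (1).

```python
import numpy as np

def greedy_action(x, f, L, V, candidate_us):
    """Equation (1): u* = argmin_u [ L(x, u) + V(f(x, u)) ]."""
    costs = [L(x, u) + V(f(x, u)) for u in candidate_us]
    return candidate_us[int(np.argmin(costs))]

# Hypothetical stand-ins: double-integrator dynamics, quadratic one-step
# cost, and a quadratic "value function" -- purely for illustration.
dt = 0.1
f = lambda x, u: np.array([x[0] + dt * x[1], x[1] + dt * u])
L = lambda x, u: x @ x + 0.01 * u * u
V = lambda x: x @ x

x = np.array([1.0, 0.0])
u_star = greedy_action(x, f, L, V, np.linspace(-2.0, 2.0, 41))
```

With a learned or computed V, this greedy rule is how a local controller recovers globally optimal actions from the value function.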
1 A SIMPLE EXAMPLE: A PENDULUM

In this paper we present a simple example to make our ideas clear. Figure 1 shows a simulated set of locally optimal trajectories in phase space for a pendulum being driven by a motor at the joint from the stable to the unstable equilibrium position. S marks the start point, where the pendulum is hanging straight down, and G marks the goal point, where the pendulum is inverted (pointing straight up). The optimization criterion quadratically penalizes deviations from the goal point and the magnitude of the torques applied. In the three locally optimal trajectories shown the pendulum either swings directly up to the goal (1), moves initially away from the goal and then swings up to the goal (2), or oscillates to pump itself and then swings to the goal (3). In what follows we describe how to find these locally optimal trajectories and also how to find the globally optimal trajectory.

2 LOCAL TRAJECTORY OPTIMIZATION

We base our local optimization process on dynamic programming within a tube surrounding our current best estimate of a locally optimal trajectory (Dyer and McReynolds 1970, Jacobson and Mayne 1970). We have a local quadratic model of the cost to get to the goal (V) at each time step along the optimal trajectory (assume a time step index k in everything below unless otherwise indicated):

V(x) ≈ V_0 + V_x x + (1/2) x^T V_xx x    (2)

Figure 1: Locally optimal trajectories for the pendulum swing up task.
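The swing-up example can be set up concretely as follows. This is a hypothetical simulation sketch, not the paper's implementation: the parameter values, the Euler integration step, and the weights Q and R are illustrative choices; the angle convention puts the stable equilibrium S at theta = 0 and the inverted goal G at theta = pi.

```python
import numpy as np

# Hypothetical pendulum model for the swing-up example: theta = 0 is the
# stable (hanging) equilibrium S, theta = pi is the unstable goal G.
m, l, g, dt = 1.0, 1.0, 9.81, 0.01

def f(x, u):
    """One Euler step of the pendulum dynamics, x = (theta, theta_dot)."""
    theta, theta_dot = x
    theta_ddot = (u - m * g * l * np.sin(theta)) / (m * l * l)
    return np.array([theta + dt * theta_dot, theta_dot + dt * theta_ddot])

x_goal = np.array([np.pi, 0.0])

def L(x, u, Q=np.diag([1.0, 0.1]), R=0.001):
    """Quadratic penalty on deviation from the goal and on applied torque."""
    dx = x - x_goal
    return dx @ Q @ dx + R * u * u
```

Because torque is penalized, swinging directly to the goal (trajectory 1 in Figure 1) and pumping energy first (trajectories 2 and 3) trade off torque cost against path length, which is why several local optima coexist.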
A locally optimal policy can be computed using local models of the plant (in this case local linear models) at each time step along the trajectory:

x_{k+1} = f(x, u) ≈ A x + B u + c    (3)

and local quadratic models of the one step cost at each time step along the trajectory:

L(x, u) ≈ (1/2) x^T Q x + (1/2) u^T R u + x^T S u + t^T u    (4)

At each point along the trajectory the optimal policy is given by:

u_opt = -(R + B^T V_xx B)^{-1} (B^T V_xx A x + S^T x + B^T V_xx c + (V_x B)^T + t)

One can integrate the plant dynamics forward in time based on the above policy, and then integrate the value function and its first and second spatial derivatives backwards in time to compute an improved value function, policy, and trajectory. For a one step cost of the form:

L(x, u) ≈ (1/2)(x - x_d)^T Q (x - x_d) + (1/2)(u - u_d)^T R (u - u_d) + (x - x_d)^T S (u - u_d)

the backward sweep takes the following form (in discrete time):

Z_x = V_x A + (x - x_d)^T Q    (5)
Z_u = V_x B + (u - u_d)^T R    (6)
Z_xx = A^T V_xx A + Q    (7)
Z_ux = B^T V_xx A + S^T    (8)
Z_uu = B^T V_xx B + R    (9)
K = Z_uu^{-1} Z_ux    (10)
V_x,k-1 = Z_x - Z_u K    (11)
V_xx,k-1 = Z_xx - Z_ux^T K    (12)

3 STANDARD DYNAMIC PROGRAMMING

A typical implementation of dynamic programming in continuous state spaces discretizes the state space into cells, and assigns a fixed control action to each cell. Larson's state increment dynamic programming (Larson 1968) is a good example of this type of approach. In Figure 2A we see the trajectory segments produced by applying the constant action in each cell, plotted on a phase space for the example problem of swinging up a pendulum.

4 USING LOCAL TRAJECTORY OPTIMIZATION WITH DP

We want to minimize the number of cells used in dynamic programming by making the cells as large as possible.
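Stepping back to Section 2 for a moment, the backward sweep (5)-(12) amounts to one Riccati-style update per time step, which can be sketched directly in NumPy. Two conventions here are assumptions for shape consistency: V_x is treated as a row vector (matching V_x x in equation (2)), and S enters Z_ux transposed.

```python
import numpy as np

def backward_step(Vx, Vxx, A, B, Q, R, S, x, u, xd, ud):
    """One step of the backward sweep, equations (5)-(12).

    Vx is a row vector (shape (n,)); Vxx is n x n. Returns the feedback
    gain K and the value-function model at the previous time step."""
    Zx  = Vx @ A + (x - xd) @ Q            # (5), row-vector convention
    Zu  = Vx @ B + (u - ud) @ R            # (6)
    Zxx = A.T @ Vxx @ A + Q                # (7)
    Zux = B.T @ Vxx @ A + S.T              # (8), S transposed for shape m x n
    Zuu = B.T @ Vxx @ B + R                # (9)
    K   = np.linalg.solve(Zuu, Zux)        # (10): K = Zuu^{-1} Zux
    Vx_prev  = Zx - Zu @ K                 # (11)
    Vxx_prev = Zxx - Zux.T @ K             # (12): Zxu = Zux^T
    return K, Vx_prev, Vxx_prev
```

With S = 0 and zero residuals this reduces to the standard discrete-time Riccati recursion, which is a useful sanity check on the signs in (5)-(12).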
Combining local trajectory optimization with dynamic programming allows us to greatly reduce the resolution of the grid on which we do dynamic programming and still correctly estimate the cost to get to the goal from different parts of the space. Figure 2A shows a dynamic programming approach in which each cell contains a trajectory segment applied to the pendulum problem. Figure 2B shows our approach, which creates a set of locally optimal trajectories to the goal. By performing the local trajectory optimizations on a grid and forcing adjacent trajectories to be consistent, this local optimization process becomes a global optimization process. Forcing adjacent trajectories to be consistent means requiring that all trajectories can be generated from a single underlying policy. A trajectory can be made consistent with a neighbor by using the neighboring trajectory as an initial trajectory in the local optimization process, or by using the value function from the neighboring trajectory to generate the initial trajectory in the local optimization process. Each grid element stores the trajectory that starts at that point and achieves the lowest cost.

The trajectory segments in Figure 2A match the trajectories in 2B. Figures 2C and 2D are low resolution versions of the same problem. Figure 2C shows that some of the trajectory segments are no longer correct. In Figure 2D we see that the locally optimal trajectories to the goal are still consistent with the trajectories in Figure 2B. Using locally optimal trajectories which go all the way to the goal as building blocks for our dynamic programming algorithm allows us to avoid the problem of correctly interpolating the cost-to-goal function on a sparse grid. Instead, the cost to get to the goal is measured directly on the optimal trajectory from each node to the goal. We can use a much sparser grid and still converge.
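The consistency test between neighboring trajectories can be pictured with the quadratic value model of equation (2). This is a hypothetical sketch: the tolerance rule and its default value are illustrative choices, not taken from the paper.

```python
import numpy as np

def consistent(V0, Vx, Vxx, x0, x_neighbor, cost_neighbor, tol=0.05):
    """Check whether the quadratic value model around x0 (equation (2))
    predicts the neighbor's measured cost to get to the goal.

    V0, Vx, Vxx: value, gradient, and Hessian of the model at x0.
    cost_neighbor: cost-to-goal measured on the neighbor's own trajectory."""
    dx = x_neighbor - x0
    predicted = V0 + Vx @ dx + 0.5 * dx @ Vxx @ dx
    return abs(predicted - cost_neighbor) <= tol * max(1.0, abs(cost_neighbor))
```

When the check fails, additional grid points (new trajectory start states) would be inserted between the two inconsistent neighbors, as described in Section 5 below.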
5 ADAPTIVE GRIDS BASED ON CONSTANT COST CONTOURS

We can limit the search by "growing" the volumes searched around the initial and goal states by gradually increasing a cost threshold C_g. We will only consider states around the goal that have a cost less than C_g to get to the goal and states around the initial state that have a cost less than C_g to get from the initial state to that state (Figure 3B). These two regions will increase in size as C_g is increased. We stop increasing C_g as soon as the two regions come into contact. The optimal trajectory has to be entirely within the union of these two regions, and has a cost of 2C_g.

Figure 2: Different dynamic programming techniques (see text).

Figure 3: Volumes defined by a cost threshold.

Instead of having the initial conditions of the trajectories laid out on a grid over the whole space, the initial conditions are laid out on a grid over the surface separating the inside and the outside of the volumes described above. The resolution of this grid is adaptively determined by checking whether the value function of one trajectory correctly predicts the cost of a neighboring trajectory. If it does not, additional grid points are added between the inconsistent trajectories.

During this global optimization we separate the state space into a volume around the goal which has been completely solved and the rest of the state space, in which no exploration or computation has been done. Each iteration of the algorithm enlarges the completely solved volume by performing dynamic programming from a surface of slightly increased cost to the current constant cost surface.
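The stopping rule above can be sketched on a discrete graph as a stand-in for the continuous problem: grow cost-bounded regions from the start and the goal, raising the shared threshold C_g as slowly as possible, and stop as soon as the regions touch. This is a hypothetical Dijkstra-style illustration, not the paper's continuous-state procedure, and on a general graph the first contact need not be exactly optimal; it simply mirrors the stopping rule described above.

```python
import heapq

def bidirectional_threshold(neighbors, start, goal):
    """Grow cost-bounded regions from start and goal until they touch.
    neighbors(s) yields (next_state, step_cost) pairs with step_cost >= 0;
    start and goal are assumed distinct. Returns the cost of the joined
    start-to-goal trajectory through the meeting state."""
    dist = {start: {start: 0.0}, goal: {goal: 0.0}}
    frontier = {start: [(0.0, start)], goal: [(0.0, goal)]}
    while frontier[start] and frontier[goal]:
        # Expand the side whose frontier has the smaller cost: this raises
        # the shared threshold C_g as slowly as possible.
        side = min((start, goal), key=lambda s: frontier[s][0][0])
        c, s = heapq.heappop(frontier[side])
        if c > dist[side].get(s, float("inf")):
            continue                         # stale heap entry
        other = goal if side == start else start
        if s in dist[other]:                 # the two regions touch: done
            return c + dist[other][s]
        for t, step in neighbors(s):
            nc = c + step
            if nc < dist[side].get(t, float("inf")):
                dist[side][t] = nc
                heapq.heappush(frontier[side], (nc, t))
    return float("inf")
```

In the continuous setting of this section, the analogous operation expands the two constant-cost volumes by dynamic programming rather than by enumerating graph neighbors.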
When the solved volume includes a known starting point or contacts a similar solved volume with constant cost to get to the boundary from the starting point, a globally optimal trajectory from the start to the goal has been found.

6 DP BASED ON APPROXIMATING CONSTANT COST CONTOURS

Unfortunately, adaptive grids based on constant cost contours still suffer from the curse of dimensionality, having only reduced the dimensionality of the problem by 1. We are currently exploring methods to approximate constant cost contours. For example, constant cost contours can be approximated by growing "key" trajectories. A version of this is illustrated in Figure 4. Here, trajectories were grown along the "bottoms" of the value function "valleys". The location of a constant cost contour can be estimated by using local quadratic models of the value function produced by the process which optimizes the trajectory. These approximate representations do not suffer from the curse of dimensionality. They require storage on the order of T·D^2, where T is the length of time the trajectory requires to get to the goal, and D is the dimensionality of the state space.

Figure 4: Approximate constant cost contours based on key trajectories.

7 SUMMARY

Dynamic programming provides a methodology to plan trajectories and design controllers and estimators for nonlinear systems. However, general dynamic programming is computationally intractable. We have developed procedures that allow more complex planning problems to be solved. We have modified the State Increment Dynamic Programming approach of Larson (1968) in several ways:

1. In State Increment DP, a constant action is integrated to form a trajectory segment from the center of a cell to its boundary.
We use second order local trajectory optimization (Differential Dynamic Programming) to generate an optimal trajectory and form an optimal policy in a tube surrounding the optimal trajectory within a cell. The trajectory segment and local policy are globally optimal, up to the resolution of the representation of the value function on the boundary of the cell.

2. We use the optimal policy within each cell to guide the local trajectory optimization to form a globally optimal trajectory from the center of each cell all the way to the goal. This helps us avoid the accumulation of interpolation errors as one moves from cell to cell in the state space, and avoid limitations caused by limited resolution of the representation of the value function over the state space.

3. The second order trajectory optimization provides us with estimates of the value function and its first and second spatial derivatives along each trajectory. This provides a natural guide for adaptive grid approaches.

4. During the global optimization we separate the state space into a volume around the goal which has been completely solved and the rest of the state space, in which no exploration or computation has been done. The surface separating these volumes is a surface of constant cost, with respect to achieving the goal.

5. Each iteration of the algorithm enlarges the completely solved volume by performing dynamic programming from a surface of slightly increased cost to the current constant cost surface.

6. When the solved volume includes a known starting point or contacts a similar solved volume with constant cost to get to the boundary from the starting point, a globally optimal trajectory from the start to the goal has been found. No optimal trajectory will ever leave the solved volumes.
This would require the trajectory to increase rather than decrease its cost to get to the goal as it progressed.

7. The surfaces of constant cost can be approximated by a representation that avoids the curse of dimensionality.

8. The true test of this approach lies ahead: Can it produce reasonable solutions to complex problems?

Acknowledgements

Support was provided under Air Force Office of Scientific Research grant AFOSR-89-0500, by the Siemens Corporation, and by the ATR Human Information Processing Research Laboratories. Support for CGA was provided by a National Science Foundation Presidential Young Investigator Award.

References

Bellman, R. (1957) Dynamic Programming, Princeton University Press, Princeton, NJ.

Bertsekas, D.P. (1987) Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, Englewood Cliffs, NJ.

Dyer, P. and S.R. McReynolds (1970) The Computation and Theory of Optimal Control, Academic Press, New York, NY.

Jacobson, D.H. and D.Q. Mayne (1970) Differential Dynamic Programming, Elsevier, New York, NY.

Larson, R.E. (1968) State Increment Dynamic Programming, Elsevier, New York, NY.
", "award": [], "sourceid": 788, "authors": [{"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}