{"title": "Receding Horizon Differential Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1465, "page_last": 1472, "abstract": "The control of high-dimensional, continuous, non-linear systems is a key problem in reinforcement learning and control. Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP) are not directly subject to the curse of dimensionality, but generate only local controllers. In this paper, we introduce Receding Horizon DDP (RH-DDP), an extension to the classic DDP algorithm, which allows us to construct stable and robust controllers based on a library of local-control trajectories. We demonstrate the effectiveness of our approach on a series of high-dimensional control problems using a simulated multi-link swimming robot. These experiments show that our approach effectively circumvents dimensionality issues, and is capable of dealing effectively with problems with (at least) 34 state and 14 action dimensions.", "full_text": "Receding Horizon\n\nDifferential Dynamic Programming\n\nYuval Tassa \u2217\n\nTom Erez & Bill Smart \u2020\n\nAbstract\n\nThe control of high-dimensional, continuous, non-linear dynamical systems is a\nkey problem in reinforcement learning and control. Local, trajectory-based meth-\nods, using techniques such as Differential Dynamic Programming (DDP), are not\ndirectly subject to the curse of dimensionality, but generate only local controllers.\nIn this paper,we introduce Receding Horizon DDP (RH-DDP), an extension to the\nclassic DDP algorithm, which allows us to construct stable and robust controllers\nbased on a library of local-control trajectories. We demonstrate the effective-\nness of our approach on a series of high-dimensional problems using a simulated\nmulti-link swimming robot. 
These experiments show that our approach effectively circumvents dimensionality issues, and is capable of dealing with problems of (at least) 24 state and 9 action dimensions.\n\n1 Introduction\n\nWe are interested in learning controllers for high-dimensional, highly non-linear dynamical systems, continuous in state, action, and time. Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP), are an active field of research in the Reinforcement Learning and Control communities. Local methods avoid modeling the value function or policy over the entire state space by focusing computational effort along likely trajectories. Featuring algorithmic complexity polynomial in the dimension, local methods are not directly affected by dimensionality issues in the way space-filling methods are.\nIn this paper, we introduce Receding Horizon DDP (RH-DDP), a set of modifications to the classic DDP algorithm, which allows us to construct stable and robust controllers based on local-control trajectories in highly non-linear, high-dimensional domains. Our new algorithm is reminiscent of Model Predictive Control, and enables us to form a time-independent value function approximation along a trajectory. We aggregate several such trajectories into a library of locally-optimal linear controllers which we then select from, using a nearest-neighbor rule.\nAlthough we present several algorithmic contributions, a main aspect of this paper is a conceptual one. Unlike much of the recent related work (below), we are not interested in learning to follow a pre-supplied reference trajectory. We define a reward function which represents a global measure of performance relative to a high-level objective, such as swimming towards a target. 
Rather than a reward based on distance from a given desired configuration, a notion which has its roots in the control community\u2019s definition of the problem, this global reward dispenses with a \u201cpath planning\u201d component and requires the controller to solve the entire problem.\nWe demonstrate the utility of our approach by learning controllers for a high-dimensional simulation of a planar, multi-link swimming robot. The swimmer is a model of an actuated chain of links in a viscous medium, with two location and velocity coordinate pairs, and an angle and angular velocity for each link. The controller must determine the applied torque, one action dimension for each articulated joint. We reward controllers that cause the swimmer to swim to a target, brake on approach and come to a stop over it.\nWe synthesize controllers for several swimmers, with state dimensions ranging from 10 to 24. The controllers are shown to exhibit complex locomotive behaviour in response to real-time simulated interaction with a user-controlled target.\n\n\u2217Y. Tassa is with the Hebrew University, Jerusalem, Israel.\n\u2020T. Erez and W.D. Smart are with the Washington University in St. Louis, MO, USA.\n\n1.1 Related work\n\nOptimal control of continuous non-linear dynamical systems is a central research goal of the RL community. Even when important ingredients such as stochasticity and on-line learning are removed, the exponential dependence of computational complexity on the dimensionality of the domain remains a major obstacle. Methods designed to alleviate the curse of dimensionality include adaptive discretizations of the state space [1], and various domain-specific manipulations [2] which reduce the effective dimensionality.\nLocal trajectory-based methods such as DDP were introduced to the NIPS community in [3], where a local-global hybrid method is employed. 
Although DDP is used there, it is considered an aid to the global approximator, and the local controllers are constant rather than locally-linear. In this decade DDP was reintroduced by several authors. In [4] the idea of using the second-order local DDP models to make locally-linear controllers is introduced. In [5] DDP was applied to the challenging high-dimensional domain of autonomous helicopter control, using a reference trajectory. In [6] a minimax variant of DDP is used to learn a controller for bipedal walking, again by designing a reference trajectory and rewarding the walker for tracking it. In [7], trajectory-based methods including DDP are examined as possible models for biological nervous systems. Local methods have also been used for purely policy-based algorithms [8, 9, 10], without explicit representation of the value function.\nThe best known work regarding the swimming domain is that by Ijspeert and colleagues (e.g. [11]) using Central Pattern Generators. While the inherently stable domain of swimming allows for such open-loop control schemes, articulated complex behaviours such as turning and tracking necessitate full feedback control, which CPGs do not provide.\n\n2 Methods\n\n2.1 Definition of the problem\n\nWe consider the discrete-time dynamics x^{k+1} = F(x^k, u^k) with states x \u2208 R^n and actions u \u2208 R^m. In this context we assume F(x^k, u^k) = x^k + \u222b_0^{\u2206t} f(x(t), u^k)dt for a continuous f and a small \u2206t, approximating the continuous problem and identifying with it in the \u2206t \u2192 0 limit. 
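As a concrete sketch of the one-step map F: the dynamics f below is an illustrative pendulum-like system standing in for the swimmer (an assumption for illustration only), integrated over one interval with a single 4th-order Runge-Kutta step, as in the experiments of Section 4.

```python
import math

def f(x, u):
    # Illustrative continuous dynamics (NOT the swimmer): a damped pendulum.
    # x = [angle, angular velocity]; u is a scalar torque.
    return [x[1], u - 0.1 * x[1] - math.sin(x[0])]

def F(x, u, dt=0.05):
    # One 4th-order Runge-Kutta step approximating x + integral_0^dt f(x(t), u) dt,
    # holding u constant over the interval (zero-order hold).
    k1 = f(x, u)
    k2 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)], u)
    k3 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)], u)
    k4 = f([xi + dt * ki for xi, ki in zip(x, k3)], u)
    return [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

x = [0.1, 0.0]
for k in range(40):   # roll out a 40-step (2 second) trajectory
    x = F(x, 0.0)
```

Repeated application of F from a fixed x^1 under a control sequence is exactly how the nominal trajectories below are generated.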
Given some scalar reward function r(x, u) and a fixed initial state x^1 (superscripts indicating the time index), we wish to find the policy which maximizes the total reward1 acquired over a finite temporal horizon:\n\n\u03c0\u2217(x^k, k) = argmax_{\u03c0(\u00b7,\u00b7)} [\u2211_{i=k}^{N} r(x^i, \u03c0(x^i, i))].\n\nThe quantity maximized on the RHS is the value function, which solves Bellman\u2019s equation:\n\nV(x, k) = max_u [r(x, u) + V(F(x, u), k+1)].   (1)\n\nEach of the functions in the sequence {V(x, k)}_{k=1}^{N} describes the optimal reward-to-go of the optimization subproblem from k to N. This is a manifestation of the dynamic programming principle. If N = \u221e, essentially eliminating the distinction between different time-steps, the sequence collapses to a global, time-independent value function V(x).\n\n1We (arbitrarily) choose to use phrasing in terms of reward-maximization, rather than cost-minimization.\n\n2.2 DDP\n\nDifferential Dynamic Programming [12, 13] is an iterative improvement scheme which finds a locally-optimal trajectory emanating from a fixed starting point x^1. At every iteration, an approximation to the time-dependent value function is constructed along the current trajectory {x^k}_{k=1}^{N}, which is formed by iterative application of F using the current control sequence {u^k}_{k=1}^{N}. Every iteration is comprised of two sweeps of the trajectory: a backward and a forward sweep.\nIn the backward sweep, we proceed backwards in time to generate local models of V in the following manner. Given quadratic models of V(x^{k+1}, k+1), F(x^k, u^k) and r(x^k, u^k), we can approximate the unmaximised value function, or Q-function,\n\nQ(x^k, u^k) = r(x^k, u^k) + V^{k+1}(F(x^k, u^k))   (2)\n\nas a quadratic model around the present state-action pair (x^k, u^k):\n\nQ(x^k + \u03b4x, u^k + \u03b4u) \u2248 Q_0 + Q_x\u03b4x + Q_u\u03b4u + 1/2 [\u03b4x^T \u03b4u^T] [Q_xx Q_xu; Q_ux Q_uu] [\u03b4x; \u03b4u]   (3)\n\nwhere the coefficients Q\u2217\u2217 are computed by equating coefficients of similar powers in the second-order expansion of (2):\n\nQ_x = r_x + V^{k+1}_x F^k_x\nQ_u = r_u + V^{k+1}_x F^k_u\nQ_xx = r_xx + F^k_x V^{k+1}_xx F^k_x + V^{k+1}_x F^k_xx\nQ_uu = r_uu + F^k_u V^{k+1}_xx F^k_u + V^{k+1}_x F^k_uu\nQ_xu = r_xu + F^k_x V^{k+1}_xx F^k_u + V^{k+1}_x F^k_xu.   (4)\n\nOnce the local model of Q is obtained, the maximizing \u03b4u is solved for:\n\n\u03b4u\u2217 = argmax_{\u03b4u}[Q(x^k + \u03b4x, u^k + \u03b4u)] = \u2212Q_uu^{\u22121}(Q_u + Q_ux\u03b4x)   (5)\n\nand plugged back into (3) to obtain a quadratic approximation of V^k:\n\nV^k_0 = V^{k+1}_0 \u2212 Q_u(Q_uu)^{\u22121}Q_u   (6a)\nV^k_x = Q_x \u2212 Q_u(Q_uu)^{\u22121}Q_ux   (6b)\nV^k_xx = Q_xx \u2212 Q_xu(Q_uu)^{\u22121}Q_ux.   (6c)\n\nThis quadratic model can now serve to propagate the approximation to V^{k\u22121}. Thus, equations (4), (5) and (6) iterate in the backward sweep, computing a local model of the Value function along with a modification to the policy in the form of an open-loop term \u2212Q_uu^{\u22121}Q_u and a feedback term \u2212Q_uu^{\u22121}Q_ux\u03b4x, essentially solving a local linear-quadratic problem in each step. 
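For intuition, here is a minimal sketch of one backward-sweep step, specialized to scalar state and control so that every quantity is a plain number and the inverse of Q_uu is a division; the dynamics and reward derivatives are supplied directly, and no conditioning of the inverse is applied (a real implementation conditions it, as noted below).

```python
def backward_step(rx, ru, rxx, ruu, rxu, Fx, Fu, Fxx, Fuu, Fxu, Vx, Vxx):
    # One backward-sweep step of DDP, equations (4)-(6), for scalar x and u.
    Qx  = rx  + Vx * Fx
    Qu  = ru  + Vx * Fu
    Qxx = rxx + Fx * Vxx * Fx + Vx * Fxx
    Quu = ruu + Fu * Vxx * Fu + Vx * Fuu
    Qxu = rxu + Fx * Vxx * Fu + Vx * Fxu
    k_open = -Qu / Quu            # open-loop term  -Quu^-1 Qu
    K_fb   = -Qxu / Quu           # feedback gain   -Quu^-1 Qux
    Vx_new  = Qx  - Qu * Qxu / Quu    # eq. (6b)
    Vxx_new = Qxx - Qxu * Qxu / Quu   # eq. (6c)
    return k_open, K_fb, Vx_new, Vxx_new

# Linear-quadratic sanity check at the nominal pair (x, u) = (1, 0):
# reward r = -x^2 - u^2, dynamics F = x + u, terminal value V = -x^2
# evaluated at x' = F(1, 0) = 1, so Vx = -2 and Vxx = -2.
k_open, K_fb, Vx, Vxx = backward_step(
    rx=-2.0, ru=0.0, rxx=-2.0, ruu=-2.0, rxu=0.0,
    Fx=1.0, Fu=1.0, Fxx=0.0, Fuu=0.0, Fxu=0.0,
    Vx=-2.0, Vxx=-2.0)
# k_open = -0.5, K_fb = -0.5, Vx = -3.0, Vxx = -3.0
```

Note that for a maximization problem Q_uu must be negative definite (here Quu = -4), which is exactly what the conditioning schemes mentioned below enforce.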
In some senses, DDP can be viewed as dual to the Extended Kalman Filter (though employing a higher order expansion of F).\nIn the forward sweep of the DDP iteration, both the open-loop and feedback terms are combined to create a new control sequence (\u02c6u^k)_{k=1}^{N} which results in a new nominal trajectory (\u02c6x^k)_{k=1}^{N}:\n\n\u02c6x^1 = x^1   (7a)\n\u02c6u^k = u^k \u2212 Q_uu^{\u22121}Q_u \u2212 Q_uu^{\u22121}Q_ux(\u02c6x^k \u2212 x^k)   (7b)\n\u02c6x^{k+1} = F(\u02c6x^k, \u02c6u^k)   (7c)\n\nWe note that in practice the inversion in (5) must be conditioned. We use a Levenberg-Marquardt-like scheme similar to the ones proposed in [14]. Similarly, the u-update in (7b) is performed with an adaptive line search scheme similar to the ones described in [15].\n\n2.2.1 Complexity and convergence\n\nThe leading complexity term of one iteration of DDP itself, assuming the model of F as required for (4) is given, is O(N m^{\u03b31}) for computing (6) N times, with 2 < \u03b31 < 3 the complexity-exponent of inverting Q_uu. In practice, the greater part of the computational effort is devoted to the measurement of the dynamical quantities in (4) or to the propagation of collocation vectors as described below.\nDDP is a second-order algorithm with convergence properties similar to, or better than, Newton\u2019s method performed on the full vectorial u^k with an exact Nm \u00d7 Nm Hessian [16]. In practice, convergence can be expected after 10-100 iterations, with the stopping criterion easily determined as the size of the policy update plummets near the minimum.\n\n2.2.2 Collocation Vectors\n\nWe use a new method of obtaining the quadratic model of Q (Eq. (2)), inspired by [17]2. 
Instead of using (4), we fit this quadratic model to samples of the value function at a cloud of collocation vectors {x^k_i, u^k_i}_{i=1..p}, spanning the neighborhood of every state-action pair along the trajectory. We can directly measure r(x^k_i, u^k_i) and F(x^k_i, u^k_i) for each point in the cloud, and by using the approximated value function at the next time step, we can estimate the value of (2) at every point:\n\nq(x^k_i, u^k_i) = r(x^k_i, u^k_i) + V^{k+1}(F(x^k_i, u^k_i))\n\nThen, we can insert the values of q(x^k_i, u^k_i) and (x^k_i, u^k_i) on the LHS and RHS of (3) respectively, and solve this set of p linear equations for the Q\u2217\u2217 terms. If p > (3(n + m) + (n + m)^2)/2, and the cloud is in general configuration, the equations are non-singular and can be easily solved by a generic linear algebra package.\nThere are several advantages to using such a scheme. The full nonlinear model of F is used to construct Q, rather than only a second-order approximation. F_xx, which is an n \u00d7 n \u00d7 n tensor, need not be stored. The addition of more vectors can allow the modeling of noise, as suggested in [17]. In addition, this method allows us to more easily apply general coordinate transformations in order to represent V in some internal space, perhaps of lower dimension.\nThe main drawback of this scheme is the additional complexity of an O(N p^{\u03b32}) term for solving the p-equation linear system. 
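A toy instance of this fit, under stated assumptions: scalar state and control (n = m = 1), p = 6 > 5 = (3(n+m) + (n+m)^2)/2 collocation points in general configuration, and a small hand-rolled dense solver standing in for the generic linear-algebra package; the known quadratic q below stands in for the measured values r + V^{k+1}(F(.)).

```python
def solve(A, b):
    # Tiny Gauss-Jordan elimination with partial pivoting (illustration only).
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * x for a, x in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def q(dx, du):
    # Stand-in for measured samples of r + V^{k+1}(F(.)); the true coefficients
    # are Q0=1, Qx=2, Qu=-1, Qxx=-1, Quu=-3, Qxu=0.25.
    return 1.0 + 2.0*dx - 1.0*du - 0.5*dx*dx - 1.5*du*du + 0.25*dx*du

# Six deviations (dx, du) around the nominal pair, chosen in general
# configuration so the 6x6 system below is non-singular.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (-0.1, 0.1), (0.1, -0.1)]
# Each row matches the quadratic form of eq. (3):
# q = Q0 + Qx*dx + Qu*du + 1/2*Qxx*dx^2 + 1/2*Quu*du^2 + Qxu*dx*du
A = [[1.0, dx, du, 0.5*dx*dx, 0.5*du*du, dx*du] for dx, du in points]
b = [q(dx, du) for dx, du in points]
Q0, Qx, Qu, Qxx, Quu, Qxu = solve(A, b)
```

The recovered coefficients match the generating quadratic, which is the non-singularity claim above in miniature; the paper's implementation chooses the cloud so that the analogous (much larger) system is sparse.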
Because we can choose {x^k_i, u^k_i} in a way which makes the linear system sparse, we can enjoy the \u03b32 < \u03b31 of sparse methods and, at least for the experiments performed here, increase the running time only by a small factor.\nIn the same manner that DDP is dually reminiscent of the Extended Kalman Filter, this method bears a resemblance to the test vectors propagated in the Unscented Kalman Filter [18], although we use a quadratic, rather than linear, number of collocation vectors.\n\n2.3 Receding Horizon DDP\n\nWhen seeking to synthesize a global controller from many local controllers, it is essential that the different local components operate synergistically. In our context this means that local models of the value function must all model the same function, which is not the case for the standard DDP solution. The local quadratic models which DDP computes around the trajectory are approximations to V(x, k), the time-dependent value function. The standard method in RL for creating a global value function is to use an exponentially discounted horizon. Here we propose a fixed-length non-discounted Receding Horizon scheme in the spirit of Model Predictive Control [19].\nHaving computed a DDP solution to some problem starting from many different starting points x^1, we can discard all the models computed for points x^{k>1} and save only the ones around the x^1\u2019s. Although in this way we could accumulate a time-independent approximation to V(x, N) only, starting each run of N-step DDP from scratch would be prohibitively expensive. We therefore propose the following: After obtaining the solution starting from x^1, we save the local model at k = 1 and proceed to solve a new N-step problem starting at x^2, this time initialized with the policy obtained on the previous run, shifted by one time-step, and appended with the last control: u_new = [u^2, u^3, ..., u^N, u^N]. 
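The warm-start shift just described is a one-liner; a minimal sketch (the DDP solve itself is elided):

```python
def shift_policy(U):
    # Initialize the next N-step problem with the previous solution shifted
    # by one time-step, repeating the last control:
    # u_new = [u^2, u^3, ..., u^N, u^N].
    return U[1:] + [U[-1]]

U = [0.3, -0.1, 0.5, 0.2]
U_next = shift_policy(U)   # [-0.1, 0.5, 0.2, 0.2]
```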
Because this control sequence is very close to the optimal solution, the second-order convergence of DDP is in full effect and the algorithm converges in 1 or 2 sweeps. Again saving the model at the first time step, we iterate. We stress that without the fast and exact convergence properties of DDP near the maximum, this algorithm would be far less effective.\n\n2.4 Nearest Neighbor control with Trajectory Library\n\nA run of DDP computes a locally quadratic model of V and a locally linear model of u, expressed by the gain term \u2212Q_uu^{\u22121}Q_ux. This term generalizes the open-loop policy to a tube around the trajectory, inside of which a basin-of-attraction is formed. Having lost the dependency on the time k with the receding-horizon scheme, we need some space-based method of determining which local gain model we select at a given state. The simplest choice, which we use here, is to select the nearest Euclidean neighbor.\n\n2Our method is a specific instantiation of a more general algorithm described therein.\n\nOutside of the basin-of-attraction of a single trajectory, we can expect the policy to perform very poorly and lead to numerical divergence if no constraint on the size of u is enforced. A possible solution to this problem is to fill some volume of the state space with a library of local-control trajectories [20], and consider all of them when selecting the nearest linear gain model.\n\n3 Experiments\n\n3.1 The swimmer dynamical system\n\nWe describe a variation of the d-link swimmer dynamical system [21]. A stick or link of length l, lying in a plane at an angle \u03b8 to some direction, parallel to \u02c6t = (cos(\u03b8), sin(\u03b8))^T and perpendicular to \u02c6n = (\u2212sin(\u03b8), cos(\u03b8))^T, moving with velocity \u02d9x in a viscous fluid, is postulated to admit a normal frictional force \u2212k_n l \u02c6n(\u02d9x \u00b7 \u02c6n) and a tangential frictional force \u2212k_t l \u02c6t(\u02d9x \u00b7 \u02c6t), with k_n > k_t > 0. The swimmer is modeled as a chain of d such links of lengths l_i and masses m_i, its configuration described by the generalized coordinates q = (x_cm, \u03b8), of two center-of-mass coordinates and d angles. Letting \u00afx_i = x_i \u2212 x_cm be the positions of the link centers WRT the center of mass, the Lagrangian is\n\nL = 1/2 \u02d9x_cm^2 \u2211_i m_i + 1/2 \u2211_i m_i \u02d9\u00afx_i^2 + 1/2 \u2211_i \u02d9\u03b8_i^2 I_i\n\nwith I_i = 1/12 m_i l_i^2 the moments-of-inertia. The relationship between the relative position vectors and angles of the links is given by the d \u2212 1 equations \u00afx_{i+1} \u2212 \u00afx_i = 1/2 l_{i+1}\u02c6t_{i+1} + 1/2 l_i\u02c6t_i, which express the joining of successive links, and by the equation \u2211_i m_i \u00afx_i = 0, which comes from the definition of the \u00afx_i\u2019s relative to the center-of-mass.\n\nFigure 1: RH-DDP trajectories. (a) Time course of two angular velocities: three snapshots of the receding horizon trajectory (dotted) with the current finite-horizon optimal trajectory (solid) appended, for two state dimensions. (b) State projection: projections of the same receding-horizon trajectories onto the largest three eigenvectors of the full state covariance matrix. As described in Section 3.3, the linear regime of the reward, here applied to a 3-swimmer, compels the RH trajectories to a steady swimming gait \u2013 a limit cycle.\n\n
The function\n\nF = \u2212 1/2 k_n \u2211_i [l_i(\u02d9x_i \u00b7 \u02c6n_i)^2 + 1/12 l_i^3 \u02d9\u03b8_i^2] \u2212 1/2 k_t \u2211_i l_i(\u02d9x_i \u00b7 \u02c6t_i)^2\n\nknown as the dissipation function, is that function whose derivatives WRT the \u02d9q_i\u2019s provide the postulated frictional forces. With these in place, we can obtain \u00a8q from the 2+d Euler-Lagrange equations:\n\nd/dt(\u2202L/\u2202\u02d9q_i) = \u2202F/\u2202\u02d9q_i + u_i\n\nwith u being the external forces and torques applied to the system. By applying d \u2212 1 torques \u03c4_j in action-reaction pairs at the joints, u_i = \u03c4_i \u2212 \u03c4_{i\u22121}, the isolated nature of the dynamical system is preserved. Performing the differentiations, solving for \u00a8q, and letting x = (q, \u02d9q) be the 4 + 2d-dimensional state variable, finally gives the dynamics \u02d9x = (\u02d9q, \u00a8q) = f(x, u).\n\n3.2 Internal coordinates\n\nThe two coordinates specifying the position of the center-of-mass and the d angles are defined relative to an external coordinate system, which the controller should not have access to. We make a coordinate transformation into internal coordinates, where only the d\u22121 relative angles {\u02c6\u03b8_j = \u03b8_{j+1} \u2212 \u03b8_j}_{j=1}^{d\u22121} are given, and the location of the target is given relative to a coordinate system fixed on one of the links. This makes the learning isotropic and independent of a specific location on the plane. The collocation method allows us to perform this transformation directly on the vector cloud without having to explicitly differentiate it, as we would have had to using classical DDP. 
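A hypothetical sketch of such an internal-coordinate map; fixing the frame to the first link and measuring the target from a designated reference point are illustrative assumptions, not the paper's exact choices.

```python
import math

def to_internal(thetas, target_xy, ref_xy):
    # Keep only the d-1 relative joint angles, and express the target
    # relative to a frame rotated with the first link (angle -theta_1).
    rel_angles = [thetas[j + 1] - thetas[j] for j in range(len(thetas) - 1)]
    dx = target_xy[0] - ref_xy[0]
    dy = target_xy[1] - ref_xy[1]
    c, s = math.cos(-thetas[0]), math.sin(-thetas[0])
    target_local = (c * dx - s * dy, s * dx + c * dy)
    return rel_angles, target_local

rel, tgt = to_internal([0.0, 0.5, 1.0], target_xy=(1.0, 0.0), ref_xy=(0.0, 0.0))
```

Translating or rotating the whole swimmer-plus-target configuration leaves these coordinates unchanged, which is the isotropy property the text relies on.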
Note also that this transformation reduces the dimension of the state (one angle less), suggesting the possibility of further dimensionality reduction.\n\n3.3 The reward function\n\nThe reward function we used was\n\nr(x, u) = \u2212c_x ||x_nose||^2 / \u221a(||x_nose||^2 + 1) \u2212 c_u||u||^2   (8)\n\nwhere x_nose = [x_1, x_2]^T is the 2-vector from some designated point on the swimmer\u2019s body to the target (the origin in internal space), and c_x and c_u are positive constants. This reward is maximized when the nose is brought to rest on the target under a quadratic action-cost penalty. It should not be confused with the desired-state reward of classical optimal control, since values are specified only for 2 out of the 2d + 4 coordinates. The functional form of the target-reward term is designed to be linear in ||x_nose|| when far from the target and quadratic when close to it (Figure 2(b)). Because of the differentiation in Eq. (5), the solution is independent of V_0, the constant part of the value. Therefore, in the linear regime of the reward function, the solution is independent of the distance from the target, and all the trajectories are quickly compelled to converge to a one-dimensional manifold in state-space which describes steady-state swimming (Figure 1(b)). Upon nearing the target, the swimmer must initiate a braking maneuver, and bring the nose to a standstill over the target. For targets that are near the swimmer, the behaviour must also include various turns and jerks, quite different from steady-state swimming, which maneuver the nose into contact with the target.\n\nFigure 2: (a) Swimmer: a 5-swimmer with the \u201cnose\u201d point at its tip and a ring-shaped target. (b) Reward: the functional form of the planar reward component r(x_nose) = \u2212||x_nose||^2/\u221a(||x_nose||^2 + 1). This form translates into a steady swimming gait at large distances with a smooth braking and stopping at the goal.\n\n
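The two regimes of the target-reward term are easy to probe numerically; a sketch of Eq. (8) with illustrative placeholder constants c_x and c_u (the paper does not give their values):

```python
import math

def reward(x_nose, u, cx=1.0, cu=0.05):
    # Eq. (8): -cx * ||x_nose||^2 / sqrt(||x_nose||^2 + 1) - cu * ||u||^2.
    # cx and cu here are illustrative placeholders, not the paper's values.
    d2 = x_nose[0] ** 2 + x_nose[1] ** 2
    return -cx * d2 / math.sqrt(d2 + 1.0) - cu * sum(ui * ui for ui in u)

# Far from the target the position term grows like -cx * ||x_nose||
# (steady swimming); near the target it is approximately quadratic (braking).
far = reward((100.0, 0.0), u=())
near = reward((0.01, 0.0), u=())
```

At ||x_nose|| = 100 the term is within 0.01% of the linear asymptote, while at ||x_nose|| = 0.01 it is essentially -c_x ||x_nose||^2, matching the gait/braking behaviour described above.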
Our experience during interaction with the controller, as detailed below, leads us to believe that the behavioral variety that would be exhibited by a hypothetical exact optimal controller for this system would be extremely large.\n\n4 Results\n\nIn order to assess the controllers we constructed a real-time interaction package3. By dragging the target with a cursor, a user can interact with controlled swimmers of 3 to 10 links, with a state dimension varying from 10 to 24, respectively. Even with controllers composed of a single trajectory, the swimmers perform quite well, turning, tracking and braking on approach to the target.\nAll of the controllers in the package control swimmers with unit link lengths and unit masses. The normal-to-tangential drag coefficient ratio was k_n/k_t = 25. The function F computes a single 4th-order Runge-Kutta integration step of the continuous dynamics, F(x^k, u^k) = x^k + \u222b_t^{t+\u2206t} f(x^k, u^k)dt, with \u2206t = 0.05s. The receding horizon window was 40 time-steps, or 2 seconds.\nWhen the state doesn\u2019t gravitate to one of the basins of attraction around the trajectories, numerical divergence can occur. This effect can be initiated by the user by quickly moving the target to a \u201csurprising\u201d location. Because nonlinear viscosity effects are not modeled and the local controllers are also linear, exponentially diverging torques and angular velocities can be produced. When adding as few as 20 additional trajectories, divergence is almost completely avoided.\nAnother claim which may be made is that there is no guarantee that the solutions obtained, even on the trajectories, are in fact optimal. Because DDP is a local optimization method, it is bound to stop in a local minimum. 
An extension of this claim is that even if the solutions are optimal, this has to do with the swimmer domain itself, which might be inherently convex in some sense and therefore an \u201ceasy\u201d problem.\nWhile both divergence and local minima are serious issues, they can both be addressed by appealing to our overarching biological motivation. Real organisms cannot apply unbounded torque. By hard-limiting the torque to large but finite values, non-divergence can be guaranteed4. Similarly, local minima exist even in the motor behaviour of the most complex organisms, famously evidenced by Fosbury\u2019s reinvention of the high jump.\nRegarding the easiness or difficulty of the swimmer problem \u2013 we have made the documented code available and hope that it might serve as a useful benchmark for other algorithms.\n\n5 Conclusions\n\nThe significance of this work lies in its outlining of a new kind of tradeoff in nonlinear motor control design. If biological realism is an accepted design goal, and physical and biological constraints are taken into account, then the expectations we have from our controllers can be more relaxed than those of the control engineer. The unavoidable eventual failure of any specific biological organism makes the design of truly robust controllers a futile endeavor, in effect putting more weight on the mode, rather than the tail, of the behavioral distribution. 
In return for this forfeiture of global guarantees, we gain very high performance in a small but very dense sub-manifold of the state-space.\n\n3Available at http://alice.nc.huji.ac.il/\u223ctassa/\n4We actually constrain angular velocities, since limiting torque would require a stiffer integrator, but theoretical non-divergence is fully guaranteed by the viscous dissipation, which enforces a Lyapunov function on the entire system once torques are limited.\n\nSince we make use of biologically grounded arguments, we briefly outline the possible implications of this work for biological nervous systems. It is commonly acknowledged, due both to theoretical arguments and empirical findings, that some form of dimensionality reduction must be at work in neural control mechanisms. A common object in models which attempt to describe this reduction is the motor primitive, a hypothesized atomic motor program which is combined with other such programs in a small \u201calphabet\u201d to produce complex behaviors in a given context. Our controllers imply a different reduction: a set of complex prototypical motor programs, each of which is near-optimal only in a small volume of the state-space, yet in that space describes the entire complexity of the solution. Giving the simplest building blocks of the model such a high degree of task specificity or context would imply a very large number of these motor prototypes in a real nervous system, an order of magnitude analogous, in our linguistic metaphor, to that of words and concepts.\n\nReferences\n\n[1] Remi Munos and Andrew W. Moore. Variable resolution discretization for high-accuracy solutions of optimal control problems. In International Joint Conference on Artificial Intelligence, pages 1348\u20131355, 1999.\n[2] M. Stilman, C. G. Atkeson, J. J. Kuffner, and G. Zeglin. Dynamic programming in reduced dimensional spaces: Dynamic planning for robust biped locomotion. 
In Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA 2005), pages 2399\u20132404, 2005.\n[3] Christopher G. Atkeson. Using local trajectory optimizers to speed up global optimization in dynamic programming. In NIPS, pages 663\u2013670, 1993.\n[4] C. G. Atkeson and J. Morimoto. Non-parametric representation of policies and value functions: A trajectory-based approach. In Advances in Neural Information Processing Systems 15, 2003.\n[5] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems 19, 2007.\n[6] J. Morimoto and C. G. Atkeson. Minimax differential dynamic programming: An application to robust biped walking. In Advances in Neural Information Processing Systems 14, 2002.\n[7] Emanuel Todorov and Wei-Wei Li. Optimal control methods suitable for biomechanical systems. In 25th Annual Int. Conf. IEEE Engineering in Medicine and Biology Society, 2003.\n[8] R. Munos. Policy gradient in continuous time. Journal of Machine Learning Research, 7:771\u2013791, 2006.\n[9] J. Peters and S. Schaal. Reinforcement learning for parameterized motor primitives. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2006), 2006.\n[10] Tom Erez and William D. Smart. Bipedal walking on rough terrain using manifold control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2007.\n[11] A. Crespi and A. Ijspeert. AmphiBot II: An amphibious snake robot that crawls and swims using a central pattern generator. In Proceedings of the 9th International Conference on Climbing and Walking Robots (CLAWAR 2006), pages 19\u201327, 2006.\n[12] D. Q. Mayne. A second order gradient method for determining optimal trajectories for non-linear discrete-time systems. International Journal of Control, 3:85\u201395, 1966.\n[13] D. H. Jacobson and D. Q. 
Mayne. Differential Dynamic Programming. Elsevier, 1970.\n[14] L.-Z. Liao and C. A. Shoemaker. Convergence in unconstrained discrete-time differential dynamic programming. IEEE Transactions on Automatic Control, 36(6):692\u2013706, 1991.\n[15] S. Yakowitz. Algorithms and computational techniques in differential dynamic programming. Control and Dynamic Systems: Advances in Theory and Applications, 31:75\u201391, 1989.\n[16] L.-Z. Liao and C. A. Shoemaker. Advantages of differential dynamic programming over Newton\u2019s method for discrete-time optimal control problems. Technical Report 92-097, Cornell Theory Center, 1992.\n[17] E. Todorov. Iterative local dynamic programming. Manuscript under review, available at www.cogsci.ucsd.edu/\u223ctodorov/papers/ildp.pdf, 2007.\n[18] S. J. Julier and J. K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th Int. Symp. on Aerospace/Defence Sensing, Simulation and Controls, 1997.\n[19] C. E. Garcia, D. M. Prett, and M. Morari. Model predictive control: theory and practice. Automatica, 25:335\u2013348, 1989.\n[20] M. Stolle and C. G. Atkeson. Policies based on trajectory libraries. In Proceedings of the International Conference on Robotics and Automation (ICRA 2006), 2006.\n[21] R. Coulom. Reinforcement Learning Using Neural Networks, with Applications to Motor Control. PhD thesis, Institut National Polytechnique de Grenoble, 2002.\n", "award": [], "sourceid": 1101, "authors": [{"given_name": "Yuval", "family_name": "Tassa", "institution": null}, {"given_name": "Tom", "family_name": "Erez", "institution": null}, {"given_name": "William", "family_name": "Smart", "institution": null}]}