{"title": "Minimax Differential Dynamic Programming: An Application to Robust Biped Walking", "book": "Advances in Neural Information Processing Systems", "page_first": 1563, "page_last": 1570, "abstract": "", "full_text": "Minimax Differential Dynamic Programming: An Application to Robust Biped Walking

Jun Morimoto
Human Information Science Labs, Department 3, ATR International
Keihanna Science City, Kyoto, JAPAN, 619-0288
xmorimo@atr.co.jp

Christopher G. Atkeson*
The Robotics Institute and HCII, Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, USA, 15213
cga@cs.cmu.edu

Abstract

We developed a robust control policy design method for high-dimensional state spaces by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.

1 Introduction

Reinforcement learning [8] is widely studied because of its promise to automatically generate controllers for difficult tasks from attempts to do the task. However, reinforcement learning requires a great deal of training data and computational resources, and sometimes fails to learn high dimensional tasks. To improve reinforcement learning, we propose using differential dynamic programming (DDP), a second order local trajectory optimization method that generates locally optimal plans and local models of the value function [2, 4]. Dynamic programming requires task models to learn tasks. 
However, when we apply dynamic programming to a real environment, handling inevitable modeling errors is crucial. In this study, we develop minimax differential dynamic programming, which provides robust nonlinear controller designs based on the ideas of H_\infty control [9, 5] and risk sensitive control [6, 1]. We apply the proposed method to a simulated five link biped robot (Fig. 1). Our strategy is to use minimax DDP to find both a low torque biped walk and a policy or control law to handle deviations from the optimized trajectory. We show that both standard DDP and minimax DDP can find a local policy for a lower torque biped walk than a hand-tuned PD servo controller, and that minimax DDP can cope with larger modeling error than standard DDP or the hand-tuned PD controller. The robust controller thus allows us to collect useful training data. In addition, we can use learning to correct modeling errors and model previously unknown disturbances, and design a new, more nearly optimal robust controller using additional iterations of minimax DDP.

*also affiliated with Human Information Science Laboratories, Department 3, ATR International

2 Minimax DDP

2.1 Differential dynamic programming (DDP)

A value function is defined as the sum of the accumulated future penalty r(x_i, u_i, i) from the current state and the terminal penalty \Phi(x_N):

V(x_i, i) = \Phi(x_N) + \sum_{j=i}^{N-1} r(x_j, u_j, j),   (1)

where x_i is the input state, u_i is the control output at the i-th time step, and N is the number of time steps. Differential dynamic programming maintains a second order local model of a Q function (Q(i), Q_x(i), Q_u(i), Q_{xx}(i), Q_{xu}(i), Q_{uu}(i)), where Q(i) = r(x_i, u_i, i) + V(x_{i+1}, i+1), and the subscripts indicate partial derivatives. Then, we can derive the new control output u_i^{new} = u_i + \delta u_i from \arg\max_{\delta u_i} Q(x_i + \delta x_i, u_i + \delta u_i, i). Finally, by using the new control output u_i^{new}, a second order local model of the value function (V(i), V_x(i), V_{xx}(i)) can be derived [2, 4].

2.2 Finding a local policy

DDP finds a locally optimal trajectory x_i^{opt} and the corresponding control trajectory u_i^{opt}. When we apply our control algorithm to a real environment, we usually need a feedback controller to cope with unknown disturbances or modeling errors. Fortunately, DDP provides a local policy along the optimized trajectory:

u^{opt}(x_i, i) = u_i^{opt} + K_i (x_i - x_i^{opt}),   (2)

where K_i is a time dependent gain matrix given by taking the derivative of the optimal policy with respect to the state [2, 4].

2.3 Minimax DDP

Minimax DDP can be derived as an extension of standard DDP [2, 4]. The difference is that the proposed method has an additional disturbance variable w to explicitly represent the existence of disturbances. This representation of the disturbance provides robustness for optimized trajectories and policies [5]. We expand the Q function to second order in terms of \delta u, \delta w and \delta x about the nominal solution:

Q(x_i + \delta x_i, u_i + \delta u_i, w_i + \delta w_i, i) = Q(i) + Q_x(i)\delta x_i + Q_u(i)\delta u_i + Q_w(i)\delta w_i
  + (1/2) [\delta x_i^T \delta u_i^T \delta w_i^T] [Q_{xx}(i) Q_{xu}(i) Q_{xw}(i); Q_{ux}(i) Q_{uu}(i) Q_{uw}(i); Q_{wx}(i) Q_{wu}(i) Q_{ww}(i)] [\delta x_i; \delta u_i; \delta w_i].   (3)

The second order local model of the Q function can be propagated backward in time using:

Q_x(i) = V_x(i+1) F_x + r_x(i)   (4)
Q_u(i) = V_x(i+1) F_u + r_u(i)   (5)
Q_w(i) = V_x(i+1) F_w + r_w(i)   (6)
Q_{xx}(i) = F_x^T V_{xx}(i+1) F_x + V_x(i+1) F_{xx} + r_{xx}(i)   (7)
Q_{xu}(i) = F_x^T V_{xx}(i+1) F_u + V_x(i+1) F_{xu} + r_{xu}(i)   (8)
Q_{xw}(i) = F_x^T V_{xx}(i+1) F_w + V_x(i+1) F_{xw} + r_{xw}(i)   (9)
Q_{uu}(i) = F_u^T V_{xx}(i+1) F_u + V_x(i+1) F_{uu} + r_{uu}(i)   (10)
Q_{ww}(i) = F_w^T V_{xx}(i+1) F_w + V_x(i+1) F_{ww} + r_{ww}(i)   (11)
Q_{uw}(i) = F_u^T V_{xx}(i+1) F_w + V_x(i+1) F_{uw} + r_{uw}(i),   (12)

where x_{i+1} = F(x_i, u_i, w_i) is a model of the task dynamics. Here, \delta u_i and \delta w_i must be chosen to minimize and maximize, respectively, the second order expansion of the Q function in (3), i.e.,

\delta u_i = -Q_{uu}^{-1}(i) [Q_{ux}(i)\delta x_i + Q_{uw}(i)\delta w_i + Q_u(i)]
\delta w_i = -Q_{ww}^{-1}(i) [Q_{wx}(i)\delta x_i + Q_{wu}(i)\delta u_i + Q_w(i)].   (13)

By solving the coupled equations (13), we can derive both \delta u_i and \delta w_i. After updating the control output u_i and the disturbance w_i with the derived \delta u_i and \delta w_i, the second order local model of the value function is given as

V(i) = V(i+1) - Q_u(i) Q_{uu}^{-1}(i) Q_u(i) - Q_w(i) Q_{ww}^{-1}(i) Q_w(i)
V_x(i) = Q_x(i) - Q_u(i) Q_{uu}^{-1}(i) Q_{ux}(i) - Q_w(i) Q_{ww}^{-1}(i) Q_{wx}(i)
V_{xx}(i) = Q_{xx}(i) - Q_{xu}(i) Q_{uu}^{-1}(i) Q_{ux}(i) - Q_{xw}(i) Q_{ww}^{-1}(i) Q_{wx}(i).   (14)

3 Experiment

3.1 Biped robot model

In this paper, we use a simulated five link biped robot (Fig. 1: Left) to explore our approach. Kinematic and dynamic parameters of the simulated robot are chosen to match those of a biped robot we are currently developing (Fig. 1: Right) and which we will use to further explore our approach. The height and total weight of the robot are about 0.4 [m] and 2.0 [kg] respectively. 
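To make the backward recursion of section 2.3 concrete, the following sketch implements one step of equations (4)-(14) in NumPy. It is illustrative only: the dynamics-curvature terms (F_{xx}, F_{xu}, and so on) of equations (7)-(12) are dropped, an iLQR-style simplification the paper does not make; the cost second derivatives are passed in as a dictionary, and all names are our own rather than the paper's.

```python
import numpy as np

def minimax_ddp_backward_step(Vx, Vxx, Fx, Fu, Fw, rx, ru, rw, Q_r):
    """One backward step of the minimax DDP recursion (equations (4)-(14)).

    Vx, Vxx: value-function derivatives at time i+1.
    Fx, Fu, Fw: Jacobians of the dynamics F(x, u, w).
    rx, ru, rw: first derivatives of the running penalty r.
    Q_r: dict of second derivatives of r ("xx", "xu", "xw", "uu", "ww", "uw").
    Dynamics-curvature terms are omitted here for brevity (Gauss-Newton style).
    """
    # First derivatives of Q, equations (4)-(6)
    Qx = Vx @ Fx + rx
    Qu = Vx @ Fu + ru
    Qw = Vx @ Fw + rw
    # Second derivatives, equations (7)-(12), without dynamics curvature
    Qxx = Fx.T @ Vxx @ Fx + Q_r["xx"]
    Qxu = Fx.T @ Vxx @ Fu + Q_r["xu"]
    Qxw = Fx.T @ Vxx @ Fw + Q_r["xw"]
    Quu = Fu.T @ Vxx @ Fu + Q_r["uu"]
    Qww = Fw.T @ Vxx @ Fw + Q_r["ww"]
    Quw = Fu.T @ Vxx @ Fw + Q_r["uw"]

    Quu_inv = np.linalg.inv(Quu)
    Qww_inv = np.linalg.inv(Qww)

    # Solve the coupled stationarity conditions (13) for the feedforward
    # corrections at delta-x = 0 (the delta-x terms give the feedback gains
    # in the same way); Qww must be negative definite for the max to exist.
    nu = Quu.shape[0]
    kkt = np.block([[Quu, Quw], [Quw.T, Qww]])
    duw = np.linalg.solve(kkt, -np.concatenate([Qu, Qw]))
    du, dw = duw[:nu], duw[nu:]

    # Value-function update, equation (14); dV is the additive change to V(i+1)
    dV = -Qu @ Quu_inv @ Qu - Qw @ Qww_inv @ Qw
    Vx_new = Qx - Qu @ Quu_inv @ Qxu.T - Qw @ Qww_inv @ Qxw.T
    Vxx_new = Qxx - Qxu @ Quu_inv @ Qxu.T - Qxw @ Qww_inv @ Qxw.T
    return du, dw, dV, Vx_new, Vxx_new
```

In a full pass this step runs from i = N-1 down to 0, after which the trajectory is rolled forward with the updated controls and disturbances.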
Table 1 shows the parameters of the robot model.

Figure 1: Left: Five link robot model (links 1-5 and joints 1-4, with the ankle at the stance foot), Right: Real robot

Table 1: Physical parameters of the robot model

                                    link1   link2   link3   link4   link5
mass [kg]                           0.05    0.43    1.0     0.43    0.05
length [m]                          0.2     0.2     0.01    0.2     0.2
inertia [kg·m^2 \times 10^{-4}]     1.75    4.29    4.33    4.29    1.75

We can represent the forward dynamics of the biped robot as

x_{i+1} = f(x_i) + b(x_i) u_i,   (15)

where x = \{\theta_1, ..., \theta_5, \dot\theta_1, ..., \dot\theta_5\} denotes the input state vector and u = \{\tau_1, ..., \tau_4\} denotes the control command (each torque \tau_j is applied to joint j (Fig. 1: Left)). In the minimax optimization case, we explicitly represent the existence of the disturbance as

x_{i+1} = f(x_i) + b(x_i) u_i + b_w(x_i) w_i,   (16)

where w = \{w_0, w_1, w_2, w_3, w_4\} denotes the disturbance (w_0 is applied to the ankle, and w_j (j = 1, ..., 4) is applied to joint j (Fig. 1: Left)).

3.2 Optimization criterion and method

We use the following objective function, which is designed to reward energy efficiency and enforce periodicity of the trajectory:

J = \Phi(x_0, x_N) + \sum_{i=0}^{N-1} r(x_i, u_i, i),   (17)

which is applied for half the walking cycle, from one heel strike to the next heel strike. The running cost sums the squared deviations from a nominal trajectory, the squared control magnitudes, and the squared deviations from a desired velocity of the center of mass:

r(x_i, u_i, i) = (x_i - x_i^d)^T Q (x_i - x_i^d) + u_i^T R u_i + (v(x_i) - v^d)^T S (v(x_i) - v^d),   (18)

where x_i is the state vector at the i-th time step, x_i^d is the nominal state vector at the i-th time step (taken from a trajectory generated by a hand-designed walking controller), v(x_i) denotes the velocity of the center of mass at the i-th time step, and v^d denotes the desired velocity of the center of mass. The term (x_i - x_i^d)^T Q (x_i - x_i^d) encourages the robot to follow the nominal trajectory, the term u_i^T R u_i discourages using large control outputs, and the term (v(x_i) - v^d)^T S (v(x_i) - v^d) encourages the robot to achieve the desired velocity. In addition, penalties on the initial (x_0) and final (x_N) states are applied:

\Phi(x_0, x_N) = F(x_0) + \Phi_N(x_0, x_N).   (19)

The term F(x_0) penalizes an initial state where the foot is not on the ground:

F(x_0) = F_h^T(x_0) P_0 F_h(x_0),   (20)

where F_h(x_0) denotes the height of the swing foot at the initial state x_0. The term \Phi_N(x_0, x_N) is used to generate periodic trajectories:

\Phi_N(x_0, x_N) = (x_N - H(x_0))^T P_N (x_N - H(x_0)),   (21)

where x_N denotes the terminal state, x_0 denotes the initial state, and the term (x_N - H(x_0))^T P_N (x_N - H(x_0)) is a measure of terminal control accuracy. 
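As a small illustration of equations (19)-(21), the penalty on the initial and final states can be sketched as below. Here `swing_foot_height` stands in for F_h and `leg_exchange_map` for the function H(); both are caller-supplied placeholders, not the paper's implementation.

```python
import numpy as np

def terminal_penalty(x0, xN, swing_foot_height, leg_exchange_map, P0, PN):
    """Initial/terminal penalty Phi(x_0, x_N) of equations (19)-(21).

    swing_foot_height(x0) plays the role of F_h and leg_exchange_map(x0)
    the role of H(); both are illustrative stand-ins supplied by the caller.
    """
    Fh = np.atleast_1d(swing_foot_height(x0))
    F0 = Fh @ P0 @ Fh                 # equation (20): foot must start on the ground
    e = xN - leg_exchange_map(x0)     # mismatch after support/swing leg exchange
    PhiN = e @ PN @ e                 # equation (21): periodicity of the gait
    return F0 + PhiN
```

The weight matrices P_0 and P_N are the ones given numerically in section 4.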
A function H() represents the coordinate change caused by the exchange of the support leg and the swing leg, and the velocity change caused by the swing foot touching the ground (Appendix A).

We implement the minimax DDP by adding a minimax term to the criterion, using the modified objective function

J_{minimax} = J - \sum_{i=0}^{N-1} w_i^T G w_i,   (22)

where w_i denotes the disturbance vector at the i-th time step, and the term w_i^T G w_i rewards coping with large disturbances. This explicit representation of the disturbance w provides robustness for the controller [5].

4 Results

We compare the optimized controllers with a hand-tuned PD servo controller, which is also the source of the initial and nominal trajectories in the optimization process. We set the parameters for the optimization process as Q = 0.25 I_{10}, R = 3.0 I_4, S = 0.3 I_1, desired velocity v^d = 0.4 [m/s] in equation (18), P_0 = 1000000.0 I_1 in equation (20), and P_N = diag{10000.0, 10000.0, 10000.0, 10000.0, 10000.0, 10.0, 10.0, 10.0, 5.0, 5.0} in equation (21), where I_N denotes the N dimensional identity matrix. For minimax DDP, we set the parameter for the disturbance reward in equation (22) as G = diag{5.0, 20.0, 20.0, 20.0, 20.0} (a G with smaller elements generates more conservative but robust trajectories). Each parameter is set to acquire the best results in terms of both robustness and energy efficiency. When we apply the controllers acquired by standard DDP and minimax DDP to the biped walk, we adopt the local policy introduced in section 2.2.

Results in Table 2 show that the controllers generated by standard DDP and minimax DDP almost halved the cost of the trajectory compared to that of the original hand-tuned PD servo controller. 
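The local policy of equation (2), used above to execute the DDP solutions, is simply feedforward plus time-varying linear feedback; a minimal sketch (array names are illustrative):

```python
import numpy as np

def local_policy(x, i, u_opt, x_opt, K):
    """Equation (2): feedforward control from the optimized trajectory plus
    time-varying linear feedback on the deviation from that trajectory.
    u_opt, x_opt and K are per-time-step arrays from the DDP backward pass."""
    return u_opt[i] + K[i] @ (x - x_opt[i])
```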
However, because minimax DDP is more conservative in taking advantage of the plant dynamics, it has a slightly higher control cost than standard DDP. Note that we defined the control cost as (1/N) \sum_{i=0}^{N-1} ||u_i||^2, where u_i is the control output (torque) vector at the i-th time step, and N denotes the total number of time steps in a one step trajectory.

Table 2: One step control cost (averaged over 100 steps)

                                          PD servo   standard DDP   minimax DDP
control cost [(N·m)^2 \times 10^{-2}]     7.50       3.54           3.86

To test robustness, we assume that there is unknown viscous friction at each joint:

\tau_j^{dist} = -\mu_j \dot\theta_j   (j = 1, ..., 4),   (23)

where \mu_j denotes the viscous friction coefficient at joint j. We used two levels of disturbances in the simulation, with the higher level being 3 times larger than the base level (Table 3).

Table 3: Parameters of the disturbance

         \mu_2, \mu_3 (hip joints)   \mu_1, \mu_4 (knee joints)
base     0.01                        0.05
large    0.03                        0.15

All methods could handle the base level disturbances. Both the standard and the minimax DDP incurred much lower control cost than the hand-tuned PD servo controller (Table 4). However, only the minimax DDP control design could cope with the higher level of disturbances. Figure 2 shows trajectories for the three different methods. Both the simulated robot with the standard DDP and the one with the hand-tuned PD servo controller fell down before achieving 100 steps. The bottom of Figure 2 shows part of a successful biped walking trajectory of the robot with the minimax DDP. Figure 3 shows ankle joint trajectories for the three different methods. Only the minimax DDP successfully kept the ankle joint \theta_1 around 90 degrees for more than 20 seconds. Table 5 shows the number of steps before the robot fell down. 
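The cost metric reported in the tables and the disturbance model of equation (23) are direct transcriptions; a small sketch (function names are ours):

```python
import numpy as np

def one_step_control_cost(torques):
    """Control cost (1/N) sum_i ||u_i||^2 over the N time steps of a one
    step trajectory, as used in Tables 2, 4 and 6 (before the 10^-2 scaling)."""
    torques = np.asarray(torques)
    return float(np.mean(np.sum(torques**2, axis=1)))

def viscous_friction(mu, theta_dot):
    """Unknown disturbance torque of equation (23): tau_j = -mu_j * thetadot_j."""
    return -np.asarray(mu) * np.asarray(theta_dot)
```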
We terminated a trial when the robot achieved 1000 steps.

Table 4: One step control cost with the base setting (averaged over 100 steps)

                                          PD servo   standard DDP   minimax DDP
control cost [(N·m)^2 \times 10^{-2}]     8.97       5.23           5.87

Figure 2: Biped walk trajectories with the three different methods (top: hand-tuned PD servo, middle: standard DDP, bottom: minimax DDP)

5 Learning the unmodeled dynamics

In section 4, we verified that minimax DDP could generate robust biped trajectories and policies: it coped with larger disturbances than the standard DDP and the hand-tuned PD servo controller. However, if there are modeling errors, a robust controller that does not learn is not particularly energy efficient. Fortunately, with minimax DDP, we can collect sufficient data to improve our dynamics model. Here, we propose using Receptive Field Weighted Regression (RFWR) [7] to learn the error dynamics of the biped robot. In this section we present results on learning a simulated modeling error (the disturbances discussed in section 4). We are currently applying this approach to an actual robot.

We can represent the full dynamics as the sum of the known dynamics and the error dynamics \Delta F(x_i, u_i, i):

x_{i+1} = F(x_i, u_i) + \Delta F(x_i, u_i, i).   (24)

We estimate the error dynamics \Delta F by using RFWR:

\Delta \hat{F}(x_i, u_i, i) = \sum_{k=1}^{N_b} \alpha_k^i \phi_k(x_i, u_i, i) / \sum_{k=1}^{N_b} \alpha_k^i,   (25)

\phi_k(x_i, u_i, i) = \beta_k^T \tilde{x}_k^i,   (26)

\alpha_k^i = \exp(-(1/2)(i - c_k) D_k (i - c_k)),   (27)

where N_b denotes the number of basis functions, c_k denotes the center of the k-th basis function, D_k denotes the distance metric of the k-th basis function, \beta_k denotes the parameters of the k-th basis function used to approximate the error dynamics, and \tilde{x}_k^i = (x_i, u_i, 1, i - c_k) denotes the augmented state vector for the k-th basis function. 
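The normalized RFWR prediction of equations (25)-(27) can be sketched as follows, assuming a scalar output per basis for brevity; the learned parameters beta_k are placeholders, and the incremental fitting procedure of RFWR [7] is not shown.

```python
import numpy as np

def rfwr_predict(x, u, i, centers, D, betas):
    """Normalized prediction of the error dynamics, equations (25)-(27).

    Each basis k has a time center c_k, a scalar distance metric D_k and
    linear parameters beta_k over the augmented state (x, u, 1, i - c_k);
    the kernels are Gaussian in the phase variable i.  All inputs are
    illustrative placeholders for a learned RFWR model."""
    num, den = 0.0, 0.0
    for c_k, D_k, beta_k in zip(centers, D, betas):
        alpha = np.exp(-0.5 * (i - c_k) * D_k * (i - c_k))   # equation (27)
        x_aug = np.concatenate([x, u, [1.0, i - c_k]])       # augmented state
        phi = beta_k.T @ x_aug                               # equation (26)
        num = num + alpha * phi
        den = den + alpha
    return num / den                                          # equation (25)
```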
We align 20 basis functions (N_b = 20) at even intervals along the biped trajectories.

The learning strategy uses the following sequence: 1) Design the initial controller using minimax DDP applied to the nominal model. 2) Apply that controller. 3) Learn the actual dynamics using RFWR. 4) Redesign the biped controller using minimax DDP with the learned model.

Figure 3: Ankle joint trajectories with the three different methods (ankle angle [deg] versus time [sec]; top: PD servo, middle: standard DDP, bottom: minimax DDP)

Table 5: Number of steps with the large disturbances

                  PD servo   standard DDP   minimax DDP
Number of steps   49         24             > 1000

We compare the efficiency of the controller with the learned model to that of the controller without the learned model. Results in Table 6 show that the controller obtained after learning the error dynamics used lower torque to produce stable biped walking trajectories.

Table 6: One step control cost with the large disturbances (averaged over 100 steps)

                                          without learned model   with learned model
control cost [(N·m)^2 \times 10^{-2}]     17.1                    11.3

6 Discussion

In this study, we developed an optimization method to generate biped walking trajectories by using differential dynamic programming (DDP). 
We showed that 1) DDP and minimax DDP can be applied to high dimensional problems, 2) minimax DDP can design more robust controllers, and 3) learning can be used to reduce modeling error and unknown disturbances in the context of minimax DDP control design.

Both standard DDP and minimax DDP generated low torque biped trajectories. We showed that the minimax DDP control design was more robust than the controllers designed by standard DDP and the hand-tuned PD servo. Given a robust controller, we could collect sufficient data to learn the error dynamics using RFWR [7] without the robot falling down all the time. We also showed that after learning the error dynamics, the biped robot could find a lower torque trajectory.

DDP provides a feedback controller, which is important in coping with unknown disturbances and modeling errors. However, as shown in equation (2), the feedback controller is indexed by time, and development of a time independent feedback controller is a future goal.

Appendix

A Ground contact model

The function H() in equation (21) includes the mapping (velocity change) caused by ground contact. To derive the first derivative of the value function V_x(x_N) and the second derivative V_{xx}(x_N), where x_N denotes the terminal state, the function H() should be analytical. Thus, we used an analytical ground contact model [3]:

\dot\theta^+ - \dot\theta^- = M^{-1}(\theta) D(\theta) f \Delta t,   (28)

where \theta denotes the joint angles of the robot, \dot\theta^- denotes the angular velocities before ground contact, \dot\theta^+ denotes the angular velocities after ground contact, M denotes the inertia matrix, D denotes the Jacobian matrix which converts the ground contact force f to the torque at each joint, and \Delta t denotes the time step of the simulation.

References

[1] S. P. Coraluppi and S. I. Marcus. Risk-Sensitive and Minimax Control of Discrete-Time Finite-State Markov Decision Processes. Automatica, 35:301-309, 1999.
[2] P. Dyer and S. R. McReynolds. The Computation and Theory of Optimal Control. Academic Press, New York, NY, 1970.
[3] Y. Hurmuzlu and D. B. Marghitu. Rigid body collisions of planar kinematic chains with multiple contact points. International Journal of Robotics Research, 13(1):82-92, 1994.
[4] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. Elsevier, New York, NY, 1970.
[5] J. Morimoto and K. Doya. Robust Reinforcement Learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 1061-1067. MIT Press, Cambridge, MA, 2001.
[6] R. Neuneier and O. Mihatsch. Risk Sensitive Reinforcement Learning. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 1031-1037. MIT Press, Cambridge, MA, 1998.
[7] S. Schaal and C. G. Atkeson. Constructive incremental learning from only local information. Neural Computation, 10(8):2047-2084, 1998.
[8] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
[9] K. Zhou, J. C. Doyle, and K. Glover. Robust Optimal Control. Prentice Hall, New Jersey, 1996.
", "award": [], "sourceid": 2304, "authors": [{"given_name": "Jun", "family_name": "Morimoto", "institution": null}, {"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}