{"title": "Probabilistic Differential Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1907, "page_last": 1915, "abstract": "We present a data-driven, probabilistic trajectory optimization framework for systems with unknown dynamics, called Probabilistic Differential Dynamic Programming (PDDP). PDDP takes into account uncertainty explicitly for dynamics models using Gaussian processes (GPs). Based on the second-order local approximation of the value function, PDDP performs Dynamic Programming around a nominal trajectory in Gaussian belief spaces. Different from typical gradient-based policy search methods, PDDP does not require a policy parameterization and learns a locally optimal, time-varying control policy. We demonstrate the effectiveness and efficiency of the proposed algorithm using two nontrivial tasks. Compared with the classical DDP and a state-of-the-art GP-based policy search method, PDDP offers a superior combination of data-efficiency, learning speed, and applicability.", "full_text": "Probabilistic Differential Dynamic Programming\n\nYunpeng Pan and Evangelos A. Theodorou\n\nAutonomous Control and Decision Systems Laboratory\nDaniel Guggenheim School of Aerospace Engineering\n\nInstitute for Robotics and Intelligent Machines\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nypan37@gatech.edu, evangelos.theodorou@ae.gatech.edu\n\nAbstract\n\nWe present a data-driven, probabilistic trajectory optimization framework for sys-\ntems with unknown dynamics, called Probabilistic Differential Dynamic Program-\nming (PDDP). PDDP takes into account uncertainty explicitly for dynamics mod-\nels using Gaussian processes (GPs). Based on the second-order local approxi-\nmation of the value function, PDDP performs Dynamic Programming around a\nnominal trajectory in Gaussian belief spaces. 
Different from typical gradient-\nbased policy search methods, PDDP does not require a policy parameterization\nand learns a locally optimal, time-varying control policy. We demonstrate the ef-\nfectiveness and ef\ufb01ciency of the proposed algorithm using two nontrivial tasks.\nCompared with the classical DDP and a state-of-the-art GP-based policy search\nmethod, PDDP offers a superior combination of learning speed, data ef\ufb01ciency\nand applicability.\n\n1\n\nIntroduction\n\nDifferential Dynamic Programming (DDP) is a powerful trajectory optimization approach. Origi-\nnally introduced in [1], DDP generates locally optimal feedforward and feedback control policies\nalong with an optimal state trajectory. Compared with global optimal control approaches, the lo-\ncal optimal DDP shows superior computational ef\ufb01ciency and scalability to high-dimensional prob-\nlems. In the last decade, variations of DDP have been proposed in both control and machine learning\ncommunities [2][3][4][5][6]. Recently, DDP was applied for high-dimensional policy search which\nachieved promising results in challenging control tasks [7].\nDDP is derived based on linear approximations of the nonlinear dynamics along state and control\ntrajectories, therefore it relies on accurate and explicit dynamics models. However, modeling a\ndynamical system is in general a challenging task and model uncertainty is one of the principal\nlimitations of model-based methods. Various parametric and semi-parametric approaches have been\ndeveloped to address these issues, such as minimax DDP using Receptive Field Weighted Regression\n(RFWR) by Morimoto and Atkeson [8], and DDP using expert-demonstrated trajectories by Abbeel\net al. [9]. 
Motivated by the complexity of the relationships between states, controls and observations in autonomous systems, in this work we take a Bayesian non-parametric approach using Gaussian Processes (GPs).\n\nOver the last few years, GP-based control and Reinforcement Learning (RL) algorithms have drawn increasing attention in the control theory and machine learning communities. For instance, the works by Rasmussen et al. [10], Nguyen-Tuong et al. [11], Deisenroth et al. [12][13][14] and Hemakumara et al. [15] have demonstrated the remarkable applicability of GP-based control and RL methods in robotics. In particular, a recently proposed GP-based policy search framework called PILCO, developed by Deisenroth and Rasmussen [13] (an improved version has been developed by Deisenroth, Fox and Rasmussen [14]), has achieved unprecedented performance in terms of data efficiency and policy learning speed. PILCO, as well as most gradient-based policy search algorithms, requires iterative methods (e.g., CG or BFGS) for solving non-convex optimization problems to obtain optimal policies.\n\nThe proposed approach does not require a policy parameterization. Instead, PDDP finds a linear, time-varying control policy based on a Bayesian non-parametric representation of the dynamics and outperforms PILCO in terms of control learning speed while maintaining comparable data efficiency.\n\n2 Proposed Approach\n\nThe proposed PDDP framework consists of 1) a Bayesian non-parametric representation of the unknown dynamics; 2) local approximations of the dynamics and value functions; 3) locally optimal controller learning.\n\n2.1 Problem formulation\n\nWe consider a general unknown dynamical system described by the following differential equation\n\n$$dx = f(x, u)\\,dt + C\\,d\\omega, \\quad x(t_0) = x_0, \\quad d\\omega \\sim \\mathcal{N}(0, \\Sigma_\\omega), \\qquad (1)$$\n\nwhere $x \\in \\mathbb{R}^n$ is the state, $u \\in \\mathbb{R}^m$ is the control and $\\omega \\in \\mathbb{R}^p$ is standard Brownian motion noise. The trajectory optimization problem is defined as finding a sequence of states and controls that minimize the expected cost\n\n$$J^\\pi(x(t_0)) = \\mathbb{E}_x\\Big[ h\\big(x(T)\\big) + \\int_{t_0}^{T} \\mathcal{L}\\big(x(t), \\pi(x(t)), t\\big)\\,dt \\Big], \\qquad (2)$$\n\nwhere $h(x(T))$ is the terminal cost, $\\mathcal{L}(x(t), \\pi(x(t)), t)$ is the instantaneous cost rate, and $u(t) = \\pi(x(t))$ is the control policy. The cost $J^\\pi(x(t_0))$ is defined as the expectation of the total cost accumulated from $t_0$ to $T$. For the rest of our analysis, we denote $x_k = x(t_k)$ in discrete time, where $k = 0, 1, \\ldots, H$ is the time step; we use this subscript rule for other variables as well.\n\n2.2 Probabilistic model learning\n\nThe continuous functional mapping from a state-control pair $\\tilde{x} = (x, u) \\in \\mathbb{R}^{n+m}$ to the state transition $dx$ can be viewed as an inference with the goal of inferring $dx$ given $\\tilde{x}$. We view this inference as a nonlinear regression problem. In this subsection, we introduce the Gaussian process (GP) approach to learning the dynamics model in (1). A GP is defined as a collection of random variables, any finite subset of which has a joint Gaussian distribution. Given a sequence of state-control pairs $\\tilde{X} = \\{(x_0, u_0), \\ldots, (x_H, u_H)\\}$ and the corresponding state transitions $dX = \\{dx_0, \\ldots, dx_H\\}$, a GP is completely defined by a mean function and a covariance function. The joint distribution of the observed output and the output corresponding to a given test state-control pair $\\tilde{x}^* = (x^*, u^*)$ can be written as\n\n$$p\\begin{pmatrix} dX \\\\\\\\ dx^* \\end{pmatrix} \\sim \\mathcal{N}\\left(0, \\begin{bmatrix} K(\\tilde{X}, \\tilde{X}) + \\sigma_n I & K(\\tilde{X}, \\tilde{x}^*) \\\\\\\\ K(\\tilde{x}^*, \\tilde{X}) & K(\\tilde{x}^*, \\tilde{x}^*) \\end{bmatrix}\\right).$$\n\nThe covariance of this multivariate Gaussian distribution is defined via a kernel matrix $K(x_i, x_j)$. In particular, in this paper we consider the Gaussian kernel $K(x_i, x_j) = \\sigma_s^2 \\exp\\big(-\\frac{1}{2}(x_i - x_j)^\\top W (x_i - x_j)\\big)$, with $\\sigma_s$, $\\sigma_n$, $W$ the hyper-parameters. The kernel function can be interpreted as a similarity measure of random variables. More specifically, if the training pairs $\\tilde{X}_i$ and $\\tilde{X}_j$ are close to each other in the kernel space, their outputs $dx_i$ and $dx_j$ are highly correlated. The posterior distribution, which is also a Gaussian, can be obtained by constraining the joint distribution to contain the output $dx^*$ that is consistent with the observations. Assuming independent outputs (no correlation between output dimensions) and given a test input $\\tilde{x}_k = (x_k, u_k)$ at time step $k$, the one-step predictive mean and variance of the state transition are specified as\n\n$$\\mathbb{E}_f[dx_k] = K(\\tilde{x}_k, \\tilde{X})\\big(K(\\tilde{X}, \\tilde{X}) + \\sigma_n I\\big)^{-1} dX, \\quad \\mathrm{VAR}_f[dx_k] = K(\\tilde{x}_k, \\tilde{x}_k) - K(\\tilde{x}_k, \\tilde{X})\\big(K(\\tilde{X}, \\tilde{X}) + \\sigma_n I\\big)^{-1} K(\\tilde{X}, \\tilde{x}_k). \\qquad (3)$$\n\nThe state distribution at $k = 1$ is $p(x_1) \\sim \\mathcal{N}(\\mu_1, \\Sigma_1)$, where the state mean and variance are $\\mu_1 = x_0 + \\mathbb{E}_f[dx_0]$ and $\\Sigma_1 = \\mathrm{VAR}_f[dx_0]$. When propagating the GP-based dynamics over a trajectory of time horizon $H$, the input state-control pair $\\tilde{x}_k$ becomes uncertain with a Gaussian distribution (initially $\\tilde{x}_0$ is deterministic). Here we define the joint distribution over the state-control pair at $k$ as $p(\\tilde{x}_k) = p(x_k, u_k) \\sim \\mathcal{N}(\\tilde{\\mu}_k, \\tilde{\\Sigma}_k)$. Thus the distribution over the state transition becomes $p(dx_k) = \\int p(f(\\tilde{x}_k) | \\tilde{x}_k)\\, p(\\tilde{x}_k)\\, d\\tilde{x}_k$. Generally, this predictive distribution cannot be computed analytically because the nonlinear mapping of an input Gaussian distribution leads to a non-Gaussian predictive distribution. However, the predictive distribution can be approximated by a Gaussian $p(dx_k) \\sim \\mathcal{N}(d\\mu_k, d\\Sigma_k)$ [16]. Thus the state distribution at $k + 1$ is also a Gaussian $\\mathcal{N}(\\mu_{k+1}, \\Sigma_{k+1})$ [14]\n\n$$\\mu_{k+1} = \\mu_k + d\\mu_k, \\quad \\Sigma_{k+1} = \\Sigma_k + d\\Sigma_k + \\mathrm{COV}_{f,\\tilde{x}_k}[x_k, dx_k] + \\mathrm{COV}_{f,\\tilde{x}_k}[dx_k, x_k]. \\qquad (4)$$\n\nGiven an input joint distribution $\\mathcal{N}(\\tilde{\\mu}_k, \\tilde{\\Sigma}_k)$, we employ the moment matching approach [16][14] to compute the predictive distribution. The predictive mean $d\\mu_k$ is evaluated as\n\n$$d\\mu_k = \\mathbb{E}_{\\tilde{x}_k}\\big[\\mathbb{E}_f[dx_k]\\big] = \\int \\mathbb{E}_f[dx_k]\\, \\mathcal{N}(\\tilde{\\mu}_k, \\tilde{\\Sigma}_k)\\, d\\tilde{x}_k. \\qquad (5)$$\n\nNext, we compute the predictive covariance matrix\n\n$$d\\Sigma_k = \\begin{bmatrix} \\mathrm{VAR}_{f,\\tilde{x}_k}[dx_{k_1}] & \\cdots & \\mathrm{COV}_{f,\\tilde{x}_k}[dx_{k_1}, dx_{k_n}] \\\\\\\\ \\vdots & \\ddots & \\vdots \\\\\\\\ \\mathrm{COV}_{f,\\tilde{x}_k}[dx_{k_n}, dx_{k_1}] & \\cdots & \\mathrm{VAR}_{f,\\tilde{x}_k}[dx_{k_n}] \\end{bmatrix}, \\qquad (6)$$\n\nwhere the variance term on the diagonal for output dimension $i$ is obtained as\n\n$$\\mathrm{VAR}_{f,\\tilde{x}_k}[dx_{k_i}] = \\mathbb{E}_{\\tilde{x}_k}\\big[\\mathrm{VAR}_f[dx_{k_i}]\\big] + \\mathbb{E}_{\\tilde{x}_k}\\big[\\mathbb{E}_f[dx_{k_i}]^2\\big] - \\mathbb{E}_{\\tilde{x}_k}\\big[\\mathbb{E}_f[dx_{k_i}]\\big]^2,$$\n\nand the off-diagonal covariance term for output dimensions $i, j$ is given by the expression\n\n$$\\mathrm{COV}_{f,\\tilde{x}_k}[dx_{k_i}, dx_{k_j}] = \\mathbb{E}_{\\tilde{x}_k}\\big[\\mathbb{E}_f[dx_{k_i}]\\, \\mathbb{E}_f[dx_{k_j}]\\big] - \\mathbb{E}_{\\tilde{x}_k}\\big[\\mathbb{E}_f[dx_{k_i}]\\big]\\, \\mathbb{E}_{\\tilde{x}_k}\\big[\\mathbb{E}_f[dx_{k_j}]\\big].$$\n\nThe input-output cross-covariance is formulated as\n\n$$\\mathrm{COV}_{f,\\tilde{x}_k}[\\tilde{x}_k, dx_k] = \\mathbb{E}_{\\tilde{x}_k}\\big[\\tilde{x}_k\\, \\mathbb{E}_f[dx_k]^\\top\\big] - \\mathbb{E}_{\\tilde{x}_k}[\\tilde{x}_k]\\, \\mathbb{E}_{f,\\tilde{x}_k}[dx_k]^\\top. \\qquad (7)$$\n\n$\\mathrm{COV}_{f,\\tilde{x}_k}[x_k, dx_k]$ can be easily obtained as a sub-matrix of (7). The kernel or hyper-parameters $\\Theta = (\\sigma_n, \\sigma_s, W)$ are learned by maximizing the log-likelihood of the training outputs given the inputs\n\n$$\\Theta^* = \\arg\\max_{\\Theta} \\Big\\{ \\log\\Big( p\\big(dX \\,|\\, \\tilde{X}, \\Theta\\big) \\Big) \\Big\\}. \\qquad (8)$$\n\nThis optimization problem can be solved using numerical methods such as conjugate gradient [17].\n\n2.3 Local dynamics model\n\nIn DDP-related algorithms, a local model along a nominal trajectory $(\\bar{x}_k, \\bar{u}_k)$ is created based on: i) a first- or second-order approximation of the dynamics model; ii) a second-order local approximation of the value function. In our proposed PDDP framework, we will create a local model along a trajectory of state distribution-control pairs $(p(\\bar{x}_k), \\bar{u}_k)$. In order to incorporate uncertainty explicitly in the local model, we introduce the Gaussian augmented state (or the belief over $x_k$) $z_k = [\\mu_k \\;\\; \\mathrm{vec}(\\Sigma_k)]^\\top \\in \\mathbb{R}^{n+n^2}$, where $\\mathrm{vec}(\\Sigma_k)$ is the vectorization of $\\Sigma_k$. Now we create a local linear model of the dynamics. Based on eq. (4), the dynamics model with the augmented state is\n\n$$z_{k+1} = \\mathcal{F}(z_k, u_k). \\qquad (9)$$\n\nDefine the state and control variations $\\delta z_k = z_k - \\bar{z}_k$ and $\\delta u_k = u_k - \\bar{u}_k$. In this work we consider the first-order expansion of the dynamics. More precisely we have\n\n$$\\delta z_{k+1} = F_k^z \\delta z_k + F_k^u \\delta u_k, \\qquad (10)$$\n\nwhere the Jacobian matrices $F_k^z$ and $F_k^u$ are specified as\n\n$$F_k^z = \\nabla_{z_k} \\mathcal{F} = \\begin{bmatrix} \\frac{\\partial \\mu_{k+1}}{\\partial \\mu_k} & \\frac{\\partial \\mu_{k+1}}{\\partial \\Sigma_k} \\\\\\\\ \\frac{\\partial \\Sigma_{k+1}}{\\partial \\mu_k} & \\frac{\\partial \\Sigma_{k+1}}{\\partial \\Sigma_k} \\end{bmatrix} \\in \\mathbb{R}^{(n+n^2) \\times (n+n^2)}, \\quad F_k^u = \\nabla_{u_k} \\mathcal{F} = \\begin{bmatrix} \\frac{\\partial \\mu_{k+1}}{\\partial u_k} \\\\\\\\ \\frac{\\partial \\Sigma_{k+1}}{\\partial u_k} \\end{bmatrix} \\in \\mathbb{R}^{(n+n^2) \\times m}. \\qquad (11)$$\n\nThe partial derivatives $\\frac{\\partial \\mu_{k+1}}{\\partial \\mu_k}$, $\\frac{\\partial \\mu_{k+1}}{\\partial \\Sigma_k}$, $\\frac{\\partial \\Sigma_{k+1}}{\\partial \\mu_k}$, $\\frac{\\partial \\Sigma_{k+1}}{\\partial \\Sigma_k}$, $\\frac{\\partial \\mu_{k+1}}{\\partial u_k}$ and $\\frac{\\partial \\Sigma_{k+1}}{\\partial u_k}$ are computed analytically. Their forms are provided in the supplementary document of this work. For numerical implementation, the dimension of the augmented state can be reduced by eliminating the redundancy of $\\Sigma_k$, and the principal square root of $\\Sigma_k$ may be used for numerical robustness.\n\n2.4 Cost function\n\nIn the classical DDP and many optimal control problems, the following quadratic cost function is used\n\n$$\\mathcal{L}(x_k, u_k) = (x_k - x_k^{goal})^\\top Q (x_k - x_k^{goal}) + u_k^\\top R u_k, \\qquad (12)$$\n\nwhere $x_k^{goal}$ is the target state. Given the distribution $p(x_k) \\sim \\mathcal{N}(\\mu_k, \\Sigma_k)$, the expectation of the original quadratic cost function is formulated as\n\n$$\\mathbb{E}_x\\big[\\mathcal{L}(x_k, u_k)\\big] = \\mathrm{tr}(Q \\Sigma_k) + (\\mu_k - x_k^{goal})^\\top Q (\\mu_k - x_k^{goal}) + u_k^\\top R u_k. \\qquad (13)$$\n\nIn PDDP, we use the cost function $\\mathcal{L}(z_k, u_k) = \\mathbb{E}_{x_k}[\\mathcal{L}(x_k, u_k)]$. The analytic expressions of the partial derivatives $\\frac{\\partial}{\\partial z_k} \\mathcal{L}(z_k, u_k)$ and $\\frac{\\partial}{\\partial u_k} \\mathcal{L}(z_k, u_k)$ can be easily obtained. The cost function (13) scales linearly with the state covariance, therefore the exploration strategy of PDDP is balanced between the distance from the target and the variance of the state. This strategy fits well with DDP-related frameworks that rely on local approximations of the dynamics. A locally optimal controller obtained from high-risk explorations in uncertain regions might be highly undesirable.\n\n2.5 Control policy\n\nThe Bellman equation for the value function in discrete time is specified as follows\n\n$$V(z_k, k) = \\min_{u_k} \\underbrace{\\mathbb{E}\\Big[\\mathcal{L}(z_k, u_k) + V\\big(\\mathcal{F}(z_k, u_k), k+1\\big) \\,\\Big|\\, x_k\\Big]}_{Q(z_k, u_k)}. \\qquad (14)$$\n\nWe create a quadratic local model of the value function by expanding the Q-function up to the second order\n\n$$Q_k(z_k + \\delta z_k, u_k + \\delta u_k) \\approx Q_k^0 + Q_k^z \\delta z_k + Q_k^u \\delta u_k + \\frac{1}{2} \\begin{bmatrix} \\delta z_k \\\\\\\\ \\delta u_k \\end{bmatrix}^\\top \\begin{bmatrix} Q_k^{zz} & Q_k^{zu} \\\\\\\\ Q_k^{uz} & Q_k^{uu} \\end{bmatrix} \\begin{bmatrix} \\delta z_k \\\\\\\\ \\delta u_k \\end{bmatrix}, \\qquad (15)$$\n\nwhere the superscripts of the Q-function indicate derivatives. For instance, $Q_k^z = \\nabla_z Q_k(z_k, u_k)$. For the rest of the paper, we will use this superscript rule for $\\mathcal{L}$ and $V$ as well. To find the optimal control policy, we compute the local variations in control $\\delta\\hat{u}_k$ that minimize the Q-function, consistent with the minimization in (14)\n\n$$\\delta\\hat{u}_k = \\arg\\min_{\\delta u_k} \\big[ Q_k(z_k + \\delta z_k, u_k + \\delta u_k) \\big] = \\underbrace{-(Q_k^{uu})^{-1} Q_k^u}_{I_k} \\; \\underbrace{-\\,(Q_k^{uu})^{-1} Q_k^{uz}}_{L_k} \\delta z_k = I_k + L_k \\delta z_k. \\qquad (16)$$\n\nThe optimal control can be found as $\\hat{u}_k = \\bar{u}_k + \\delta\\hat{u}_k$. The quadratic expansion of the value function is backward propagated based on the equations that follow\n\n$$\\begin{aligned} & Q_k^z = \\mathcal{L}_k^z + V_k^z F_k^z, \\quad Q_k^u = \\mathcal{L}_k^u + V_k^z F_k^u, \\\\\\\\ & Q_k^{zz} = \\mathcal{L}_k^{zz} + (F_k^z)^\\top V_k^{zz} F_k^z, \\quad Q_k^{uz} = \\mathcal{L}_k^{uz} + (F_k^u)^\\top V_k^{zz} F_k^z, \\quad Q_k^{uu} = \\mathcal{L}_k^{uu} + (F_k^u)^\\top V_k^{zz} F_k^u, \\\\\\\\ & V_{k-1} = V_k + Q_k^u I_k, \\quad V_{k-1}^z = Q_k^z + Q_k^u L_k, \\quad V_{k-1}^{zz} = Q_k^{zz} + Q_k^{zu} L_k. \\end{aligned} \\qquad (17)$$\n\nThe second-order local approximation of the value function is propagated backward in time iteratively. We use the learned controller to generate a locally optimal trajectory by propagating the dynamics forward in time. The control policy is a linear function of the augmented state $z_k$, therefore the controller is deterministic. The state propagations have been discussed in Sec. 2.2.\n\n2.6 Summary of algorithm\n\nThe proposed algorithm is summarized in Algorithm 1. The algorithm consists of 8 modules. In Model learning (Steps 1-2) we sample trajectories from the original physical system in order to collect training data and learn a probabilistic model. In Local approximation (Step 4) we obtain a local linear approximation (10) of the learned probabilistic model along a nominal trajectory by computing Jacobian matrices (11). In Controller learning (Step 5) we compute a locally optimal control sequence (16) by backward propagation of the value function (17). To ensure convergence, we employ the line search strategy as in [2]. We compute the control law as $\\delta\\hat{u}_k = \\alpha I_k + L_k \\delta z_k$. Initially $\\alpha = 1$; we then decrease it until the expected cost is smaller than the previous one. In Forward propagation (Step 6), we apply the control sequence from the last step and obtain a new nominal trajectory for the next iteration. In Convergence condition (Step 7), we set a threshold $J^*$ on the accumulated cost such that when $J^\\pi < J^*$, the algorithm is terminated with the optimized state and control trajectory. In Interaction condition (Step 8), when the state covariance $\\Sigma_k$ exceeds a threshold $\\Sigma_{tol}$, we sample new trajectories from the physical system using the control obtained in Step 5, and go back to Step 2 to learn a new model. The old GP training data points are removed from the training set to keep its size fixed.
Finally, in Nominal trajectory update (Step 9), the trajectory obtained in Step 6 or 8 becomes the new nominal trajectory for the next iteration. A simple illustration of the algorithm is shown in Fig. 3a. Intuitively, PDDP requires interactions with the physical system only if the GP model no longer represents the true dynamics around the nominal trajectory.\n\nGiven: A system with unknown dynamics, target states\nGoal: An optimized trajectory of state and control\n1 Generate $N$ state trajectories by applying random control sequences to the physical system (1);\n2 Obtain state and control training pairs from sampled trajectories and optimize the hyper-parameters of GP (8);\n3 for $i = 1$ to $I_{max}$ do\n4     Compute a linear approximation of the dynamics along $(\\bar{z}_k, \\bar{u}_k)$ (10);\n5     Backpropagate in time to get the locally optimal control $\\hat{u}_k = \\bar{u}_k + \\delta\\hat{u}_k$ and value function $V(z_k, k)$ according to (16) (17);\n6     Forward propagate the dynamics (9) by applying the optimal control $\\hat{u}_k$, obtain a new trajectory $(z_k, u_k)$;\n7     if Converge then Break the for loop;\n8     if $\\Sigma_k > \\Sigma_{tol}$ then Apply the optimal control to the physical system to generate a new nominal trajectory $(z_k, u_k)$ and $N - 1$ additional trajectories by applying small variations of the learned controller, and go back to step 2;\n9     Set $\\bar{z}_k = z_k$, $\\bar{u}_k = u_k$ and $i = i + 1$, go back to step 4;\n10 end\n11 Apply the optimized controller to the physical system, obtain the optimized trajectory.\nAlgorithm 1: PDDP algorithm\n\n2.7 Computational complexity\n\nDynamics propagation: The major computational effort is devoted to GP inferences. In particular, the complexity of one-step moment matching (Sec. 2.2) is $O\\big((N)^2 n^2 (n+m)\\big)$ [14], which is fixed during the iterative process of PDDP.
We found that a small number of sampled trajectories ($N \\leq 5$) is able to provide good performance for a system of moderate size (6-12 state dimensions). However, for higher dimensional problems, sparse or local approximation of GP (e.g. [11][18][19], etc.) may be used to reduce the computational cost of GP dynamics propagation.\n\nController learning: According to (16), learning policy parameters $I_k$ and $L_k$ requires computing the inverse of $Q_k^{uu}$, which has the computational complexity of $O(m^3)$, where $m$ is the dimension of control input. As a local trajectory optimization method, PDDP offers comparable scalability to the classical DDP.\n\n2.8 Relation to existing works\n\nHere we summarize the novel features of PDDP in comparison with some notable DDP-related frameworks for stochastic systems (see also Table 1). First, PDDP shares some similarities with the belief space iLQG [6] framework, which approximates the belief dynamics using an extended Kalman filter. Belief space iLQG assumes a dynamics model is given and the stochasticity comes from the process noises. PDDP, however, is a data-driven approach that learns the dynamics models and controls from sampled data, and it takes into account model uncertainties by using GPs. Second, PDDP is also comparable with iLQG-LD [5], which applies Locally Weighted Projection Regression (LWPR) to represent the dynamics. iLQG-LD does not incorporate model uncertainty and therefore requires a large amount of data to learn an accurate model. Third, PDDP does not suffer from the high computational cost of finite differences used to numerically compute the first-order expansions [2][6] and second-order expansions [4] of the underlying stochastic dynamics.
PDDP computes Jacobian matrices analytically (11).\n\nTable 1: Comparison with DDP-related frameworks\nFramework | PDDP | Belief space iLQG [6] | iLQG-LD [5] | iLQG [2] / sDDP [4]\nState | $\\mu_k, \\Sigma_k$ | $\\mu_k, \\Sigma_k$ | $x_k$ | $x_k$\nDynamics model | Unknown | Known | Unknown | Known\nLinearization | Analytic Jacobian | Finite differences | Analytic Jacobian | Finite differences\n\n3 Experimental Evaluation\n\nWe evaluate the PDDP framework using two nontrivial simulated examples: i) cart-double inverted pendulum swing-up; ii) six-link robotic arm reaching. We also compare the learning efficiency of PDDP with the classical DDP [1] and PILCO [13][14]. All experiments were performed in MATLAB.\n\n3.1 Cart-double inverted pendulum swing-up\n\nCart-Double Inverted Pendulum (CDIP) swing-up is a challenging control problem because the system is highly underactuated with 3 degrees of freedom and only 1 control input. The system has 6 state dimensions (cart position/velocity, link 1 and 2 angles and angular velocities). The goal of the swing-up problem is to find a sequence of control inputs to force both pendulums from the initial position $(\\pi, \\pi)$ to the inverted position $(2\\pi, 2\\pi)$. The balancing task requires the velocity of the cart and the angular velocities of both pendulums to be zero. We sample 4 initial trajectories with time horizon $H = 50$. The CDIP swing-up problem has been solved by two controllers for swing-up and balancing, respectively [20]. PILCO [14] is one of the few RL methods that is able to complete this task without knowing the dynamics. The results are shown in Fig. 1.\n\nFigure 1: Results for the CDIP task. (a) Optimized state trajectories of PDDP. Solid lines indicate means, errorbars indicate variances. (b) Cost comparison of PDDP, DDP and PILCO. Costs (eq.
13) were computed based on sampled trajectories by applying the final controllers.\n[Figure 1 plots. (a) CDIP state trajectories versus time steps: cart position, cart velocity, link 1 and link 2 angles, link 1 and link 2 angular velocities. (b) CDIP cost versus time steps for PDDP, DDP and PILCO.]\n\n3.2 Six-link robotic arm\n\nThe six-link robotic arm model consists of six links of equal length and mass, connected in an open chain with revolute joints. The system has 6 degrees of freedom and 12 state dimensions (angle and angular velocity for each joint). The goal for the first 3 joints is to move to the target angle $\\pi/4$ and for the remaining 3 joints to $-\\pi/4$. The desired velocities for all 6 joints are zero. We sample 2 initial trajectories with time horizon $H = 50$. The results are shown in Fig. 2.\n\nFigure 2: Results for the 6-link arm task. (a) Optimized state trajectories of PDDP. Solid lines indicate means, errorbars indicate variances. (b) Cost comparison of PDDP, DDP and PILCO. Costs (eq. 13) were computed based on sampled trajectories by applying the final controllers.\n\n3.3 Comparative analysis\n\nDDP: Originally introduced in the 1970s, the classical DDP [1] is still one of the most effective and efficient trajectory optimization approaches. The major differences between DDP and PDDP can be summarized as follows: firstly, DDP relies on a given accurate dynamics model, while PDDP is a data-driven framework that learns a locally accurate model by forward sampling; secondly, DDP does not deal with model uncertainty, while PDDP takes into account model uncertainty using GPs and performs local dynamic programming in Gaussian belief spaces; thirdly, in applications of DDP linearizations are generally performed using finite differences, while in PDDP Jacobian matrices are computed analytically (11).\n\nPILCO: The recently proposed PILCO [14] framework has demonstrated state-of-the-art learning efficiency compared with other methods such as [21][22]. The proposed PDDP is different from PILCO in several ways. Firstly, based on a local linear approximation of the dynamics and a quadratic approximation of the value function, PDDP finds a linear, time-varying feedforward and feedback policy, whereas PILCO requires an a priori policy parameterization and an extra optimization solver. Secondly, PDDP keeps a fixed size of training data for GP inferences, while PILCO adds new data to the training set after each trial (recently, the authors applied sparse GP approximation [19] in an improved version of PILCO when the data size reached a threshold). Thirdly, by using the Gaussian belief and cost function (13), PDDP's exploration scheme is balanced between the distance from the target and the variance of the state. PILCO employs a saturating cost function which leads to automatic explorations in the high-variance regions in the early stages of learning.\n\nIn both tasks, PDDP, DDP and PILCO bring the system to the desired states. The resulting trajectories for PDDP are shown in Fig. 1a and 2a.
The reason for the low variances of some optimized trajectories is that during the final stage of learning, interactions with the physical systems (forward samplings using the locally optimal controller) reduce the variances significantly. The costs are shown in Fig. 1b and 2b. For both tasks, PDDP and DDP perform similarly and differ slightly from PILCO in terms of cost reduction. The major reasons for this difference are: i) different cost functions used by these methods; ii) we did not impose a convergence condition for the optimized trajectories on PILCO. We now compare PDDP with DDP and PILCO in terms of data efficiency and controller learning speed.\n\nData efficiency: As shown in Fig. 4a, in both tasks, PDDP performs slightly worse than PILCO in terms of data efficiency based on the number of interactions required with the physical systems. For the systems used for testing, PDDP requires around 15%-25% more interactions than PILCO. The number of interactions indicates the amount of sampled trajectories required from the physical system. At each trial we sample $N$ trajectories from the physical systems (Algorithm 1). Possible reasons for the slightly worse performance are: i) PDDP's policy is linear, which is restrictive, while PILCO yields nonlinear policy parameterizations; ii) PDDP's exploration scheme is more conservative than PILCO's in the early stages of learning. We believe PILCO is the more data efficient for these tasks. However, PDDP is able to offer close performance thanks to the probabilistic representation of the dynamics as well as the use of the Gaussian belief (augmented state).\n\nLearning speed: In terms of the total computational time required to obtain the final controller, PDDP outperforms PILCO significantly as shown in Fig. 4b.
For the 6 and 12 dimensional systems used for testing, PILCO requires an iterative method (e.g., CG or BFGS) to solve high dimensional optimization problems (depending on the policy parameterization), while PDDP computes locally optimal controls (16) without an extra optimizer. In terms of computational time per iteration, as shown in Fig. 3b, PDDP is slower than the classical DDP due to the high computational cost of GP dynamics propagations. However, for DDP, the time dedicated to linearizing the dynamics model is around 70%-90% of the total time per iteration for the two tasks considered in this work. PDDP avoids the high computational cost of finite differences by evaluating all Jacobian matrices analytically; the time dedicated to linearization is less than 10% of the total time per iteration.\n\n[Figure 2 plots. (a) Angle and angular velocity versus time steps. (b) 6-link arm cost versus time steps for PDDP, DDP and PILCO.]\n\nFigure 3: (a) An intuitive illustration of the PDDP framework. (b) Comparison of PDDP and DDP in terms of the computational time per iteration (in seconds) for the CDIP (left subfigure) and 6-link arm (right subfigure) tasks. Green indicates time for performing linearization, cyan indicates time for forward and backward sweeps (Sec. 2.6).\n\nFigure 4: Comparison of PDDP and PILCO in terms of data efficiency and controller learning speed. (a) Number of interactions with the physical systems required to obtain the final results in Fig. 1 and 2.
(b) Total computational time (in minutes) consumed to obtain the final controllers.\n[Figure 3 plots. (a) Block diagram with the labels Control policy, GP dynamics, Local model, Cost function and Physical system. (b) Time per iteration (sec) for CDIP and for the 6-link arm, DDP versus PDDP, split into dynamics linearization and forward/backward pass. Figure 4 plots. (a) Number of interactions for CDIP and the 6-link arm, PDDP versus PILCO. (b) Total time (minutes) for CDIP and the 6-link arm, PDDP versus PILCO.]\n\n4 Conclusions\n\nIn this work we have introduced a probabilistic model-based control and trajectory optimization method for systems with unknown dynamics based on Differential Dynamic Programming (DDP) and Gaussian processes (GPs), called Probabilistic Differential Dynamic Programming (PDDP). PDDP takes model uncertainty into account explicitly by representing the dynamics using GPs and performing local Dynamic Programming in Gaussian belief spaces. Based on the quadratic approximation of the value function, PDDP yields a linear, locally optimal control policy and features a more efficient control improvement scheme compared with typical gradient-based policy search methods. Thanks to the probabilistic representation of the dynamics, PDDP offers reasonable data efficiency comparable to a state-of-the-art GP-based policy search method [14]. In general, local trajectory optimization is a powerful approach to challenging control and RL problems. Due to its model-based nature, model inaccuracy has always been the major obstacle for advanced applications. Grounded on the solid developments of classical trajectory optimization and Bayesian machine learning, the proposed PDDP has demonstrated encouraging performance and potential for many applications.\n\nAcknowledgments\n\nWe thank reviewers for their constructive feedback and helpful comments.\n\nReferences\n[1] D.H. Jacobson and D.Q. Mayne. Differential dynamic programming. Elsevier Sci.
Publ., 1970.\n\n[2] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control\nof constrained nonlinear stochastic systems. In American Control Conference, pages 300\u2013306,\nJune 2005.\n\n[3] Y. Tassa, T. Erez, and W. D. Smart. Receding horizon differential dynamic programming. In\nNIPS, pages 1465\u20131472.\n\n[4] E. Theodorou, Y. Tassa, and E. Todorov. Stochastic differential dynamic programming. In\nAmerican Control Conference, pages 1125\u20131132, June 2010.\n\n[5] D. Mitrovic, S. Klanke, and S. Vijayakumar. Adaptive optimal feedback control with learned\ninternal dynamics models. In From Motor Learning to Interaction Learning in Robots, pages\n65\u201384. Springer, 2010.\n\n[6] J. Van Den Berg, S. Patil, and R. Alterovitz. Motion planning under uncertainty using\niterative local optimization in belief space. The International Journal of Robotics Research,\n31(11):1263\u20131278, 2012.\n\n[7] S. Levine and V. Koltun. Variational policy search via trajectory optimization. In NIPS, pages\n207\u2013215, 2013.\n\n[8] J. Morimoto and C. G. Atkeson. Minimax differential dynamic programming: An application\nto robust biped walking. In NIPS, pages 1539\u20131546, 2002.\n\n[9] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An application of reinforcement learning to\naerobatic helicopter flight. In NIPS, pages 1\u20138, 2007.\n\n[10] C. E. Rasmussen and M. Kuss. Gaussian processes in reinforcement learning. In NIPS, pages\n751\u2013759, 2003.\n\n[11] D. Nguyen-Tuong, J. Peters, and M. Seeger. Local Gaussian process regression for real time\nonline model learning. In NIPS, pages 1193\u20131200, 2008.\n\n[12] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian process dynamic programming.\nNeurocomputing, 72(7):1508\u20131524, 2009.\n\n[13] M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to\npolicy search. In ICML, pages 465\u2013472, 2011.\n\n[14] M. P. 
Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning\nin robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence,\n27:75\u201390, 2014.\n\n[15] P. Hemakumara and S. Sukkarieh. Learning UAV stability and control derivatives using Gaussian\nprocesses. IEEE Transactions on Robotics, 29:813\u2013824, 2013.\n\n[16] J. Quinonero Candela, A. Girard, J. Larsen, and C. E. Rasmussen. Propagation of uncertainty in\nBayesian kernel models - application to multiple-step ahead forecasting. In IEEE International\nConference on Acoustics, Speech, and Signal Processing, 2003.\n\n[17] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for machine learning. MIT Press,\n2006.\n\n[18] L. Csat\u00f3 and M. Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641\u2013668,\n2002.\n\n[19] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS, pages\n1257\u20131264, 2005.\n\n[20] W. Zhong and H. Rock. Energy and passivity based control of the double inverted pendulum\non a cart. In International Conference on Control Applications, pages 896\u2013901, Sept 2001.\n\n[21] T. Raiko and M. Tornio. Variational Bayesian learning of nonlinear hidden state-space models\nfor model predictive control. Neurocomputing, 72(16):3704\u20133712, 2009.\n\n[22] H. van Hasselt. Insights in reinforcement learning. Hado van Hasselt, 2011.", "award": [], "sourceid": 1053, "authors": [{"given_name": "Yunpeng", "family_name": "Pan", "institution": "Georgia Institute of Technology"}, {"given_name": "Evangelos", "family_name": "Theodorou", "institution": "Georgia Institute of Technology"}]}