{"title": "Variational Policy Search via Trajectory Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 207, "page_last": 215, "abstract": "In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and high-dimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks.", "full_text": "Variational Policy Search via Trajectory Optimization\n\nSergey Levine\n\nStanford University\n\nsvlevine@cs.stanford.edu\n\nVladlen Koltun\n\nStanford University and Adobe Research\n\nvladlen@cs.stanford.edu\n\nAbstract\n\nIn order to learn effective control policies for dynamical systems, policy search\nmethods must be able to discover successful executions of the desired task.\nWhile random exploration can work well in simple domains, complex and high-\ndimensional tasks present a serious challenge, particularly when combined with\nhigh-dimensional policies that make parameter-space exploration infeasible. We\npresent a method that uses trajectory optimization as a powerful exploration strat-\negy that guides the policy search. 
A variational decomposition of a maximum\nlikelihood policy objective allows us to use standard trajectory optimization al-\ngorithms such as differential dynamic programming, interleaved with standard\nsupervised learning for the policy itself. We demonstrate that the resulting algo-\nrithm can outperform prior methods on two challenging locomotion tasks.\n\n1\n\nIntroduction\n\nDirect policy search methods have the potential to scale gracefully to complex, high-dimensional\ncontrol tasks [12]. However, their effectiveness depends on discovering successful executions of the\ndesired task, usually through random exploration. As the dimensionality and complexity of a task\nincreases, random exploration can prove inadequate, resulting in poor local optima. We propose to\ndecouple policy optimization from exploration by using a variational decomposition of a maximum\nlikelihood policy objective. In our method, exploration is performed by a model-based trajectory\noptimization algorithm that is not constrained by the policy parameterization, but attempts to mini-\nmize both the cost and the deviation from the current policy, while the policy is simply optimized to\nmatch the resulting trajectory distribution. Since direct model-based trajectory optimization is usu-\nally much easier than policy search, this method can discover low cost regions much more easily.\nIntuitively, the trajectory optimization \u201cguides\u201d the policy search toward regions of low cost.\nThe trajectory optimization can be performed by a variant of the differential dynamic programming\nalgorithm [4], and the policy is optimized with respect to a standard maximum likelihood objective.\nWe show that this alternating optimization maximizes a well-de\ufb01ned policy objective, and demon-\nstrate experimentally that it can learn complex tasks in high-dimensional domains that are infeasible\nfor methods that rely on random exploration. 
Our evaluation shows that the proposed algorithm produces good results on two challenging locomotion problems, outperforming prior methods.

2 Preliminaries

In standard policy search, we seek to find a distribution over actions ut in each state xt, denoted πθ(ut|xt), so as to minimize the sum of expected costs E[c(ζ)] = E[∑_{t=1}^T c(xt, ut)], where ζ is a sequence of states and actions. The expectation is taken with respect to the system dynamics p(xt+1|xt, ut) and the policy πθ(ut|xt), which is typically parameterized by a vector θ.

An alternative to this standard formulation is to convert the task into an inference problem, by introducing a binary random variable Ot at each time step that serves as the indicator for "optimality." We follow prior work and define the probability of Ot as p(Ot = 1|xt, ut) ∝ exp(−c(xt, ut)) [19]. Using the dynamics distribution p(xt+1|xt, ut) and the policy πθ(ut|xt), we can define a dynamic Bayesian network that relates states, actions, and the optimality indicator. By setting Ot = 1 at all time steps and learning the maximum likelihood values for θ, we can perform policy optimization [20]. The corresponding optimization problem has the objective

p(O|θ) = ∫ p(O|ζ) p(ζ|θ) dζ ∝ ∫ exp(−∑_{t=1}^T c(xt, ut)) p(x1) ∏_{t=1}^T πθ(ut|xt) p(xt+1|xt, ut) dζ.   (1)

Although this objective differs from the classical minimum average cost objective, previous work showed that it is nonetheless useful for policy optimization and planning [20, 19]. In Section 5, we discuss how this objective relates to the classical objective in more detail.

3 Variational Policy Search

Following prior work [11], we can decompose log p(O|θ) by using a variational distribution q(ζ):

log p(O|θ) = L(q, θ) + DKL(q(ζ) ‖ p(ζ|O, θ)),

where the variational lower bound L is given by

L(q, θ) = ∫ q(ζ) log [p(O|ζ) p(ζ|θ) / q(ζ)] dζ,

and the second term is the Kullback-Leibler (KL) divergence

DKL(q(ζ) ‖ p(ζ|O, θ)) = −∫ q(ζ) log [p(ζ|O, θ) / q(ζ)] dζ = −∫ q(ζ) log [p(O|ζ) p(ζ|θ) / (q(ζ) p(O|θ))] dζ.   (2)

We can then optimize the maximum likelihood objective in Equation 1 by iteratively minimizing the KL divergence with respect to q(ζ) and maximizing the bound L(q, θ) with respect to θ. This is the standard formulation for expectation maximization [9], and has been applied to policy optimization in previous work [8, 21, 3, 11]. However, prior policy optimization methods typically represent q(ζ) by sampling trajectories from the current policy πθ(ut|xt) and reweighting them, for example by the exponential of their cost. While this can improve policies that already visit regions of low cost, it relies on random policy-driven exploration to discover those low cost regions. We propose instead to directly optimize q(ζ) to minimize both its expected cost and its divergence from the current policy πθ(ut|xt) when a model of the dynamics is available.
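To make the decomposition concrete, the identity log p(O|θ) = L(q, θ) + DKL(q ‖ p(ζ|O, θ)) can be checked numerically on a toy discrete set of "trajectories" (the probabilities and costs below are made up purely for illustration; this is a sketch, not the paper's implementation):

```python
import math

# Toy discrete trajectory space: p(zeta|theta), costs c(zeta), p(O|zeta) ∝ exp(-c)
p_traj = [0.5, 0.3, 0.2]            # p(zeta | theta), hypothetical values
cost = [1.0, 0.2, 2.0]              # c(zeta), hypothetical values
p_O_given = [math.exp(-c) for c in cost]   # p(O | zeta)

# Evidence: p(O | theta) = sum_zeta p(O | zeta) p(zeta | theta)
p_O = sum(po * pz for po, pz in zip(p_O_given, p_traj))

# Posterior p(zeta | O, theta) by Bayes' rule
post = [po * pz / p_O for po, pz in zip(p_O_given, p_traj)]

# An arbitrary variational distribution q(zeta)
q = [0.6, 0.25, 0.15]

# Variational lower bound L(q, theta) and KL(q || posterior)
L_bound = sum(qi * math.log(po * pz / qi)
              for qi, po, pz in zip(q, p_O_given, p_traj))
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, post))
```

Here L_bound + kl recovers log p(O|θ) exactly, and the KL term is nonnegative, so L is indeed a lower bound that becomes tight as q approaches the posterior.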
In the next section, we show that, for a Gaussian distribution q(ζ), the KL divergence in Equation 2 can be minimized by a variant of the differential dynamic programming (DDP) algorithm [4].

4 Trajectory Optimization

DDP is a trajectory optimization algorithm based on Newton's method [4]. We build off of a variant of DDP called iterative LQR, which linearizes the dynamics around the current trajectory, computes the optimal linear policy under linear-quadratic assumptions, executes this policy, and repeats the process around the new trajectory until convergence [17]. We show how this procedure can be used to minimize the KL divergence in Equation 2 when q(ζ) is a Gaussian distribution over trajectories. This derivation follows previous work [10], but is repeated here and expanded for completeness.

Iterative LQR is a dynamic programming algorithm that recursively computes the value function backwards through time. Because of the linear-quadratic assumptions, the value function is always quadratic, and the dynamics are Gaussian with the mean at f(xt, ut) and noise ε. Given a trajectory (x̄1, ū1), …, (x̄T, ūT) and defining x̂t = xt − x̄t and ût = ut − ūt, the dynamics and cost function are then approximated as follows, with subscripts x and u denoting partial derivatives:

x̂t+1 ≈ fxt x̂t + fut ût + ε

c(xt, ut) ≈ x̂tᵀ cxt + ûtᵀ cut + ½ x̂tᵀ cxxt x̂t + ½ ûtᵀ cuut ût + ûtᵀ cuxt x̂t + c(x̄t, ūt).

Under this approximation, we can recursively compute the Q-function as follows:

Qxxt = cxxt + fxtᵀ Vxxt+1 fxt    Quut = cuut + futᵀ Vxxt+1 fut    Quxt = cuxt + futᵀ Vxxt+1 fxt
Qxt = cxt + fxtᵀ Vxt+1    Qut = cut + futᵀ Vxt+1,

as well as the value function and linear policy terms:

Vxt = Qxt − Quxtᵀ Quut⁻¹ Qut    kt = −Quut⁻¹ Qut
Vxxt = Qxxt − Quxtᵀ Quut⁻¹ Quxt    Kt = −Quut⁻¹ Quxt.

The deterministic optimal policy is then given by

g(xt) = ūt + kt + Kt(xt − x̄t).

By repeatedly computing the optimal policy around the current trajectory and updating x̄t and ūt based on the new policy, iterative LQR converges to a locally optimal solution [17]. In order to use this algorithm to minimize the KL divergence in Equation 2, we introduce a modified cost function c̄(xt, ut) = c(xt, ut) − log πθ(ut|xt). The optimal trajectory for this cost function approximately¹ minimizes the KL divergence when q(ζ) is a Dirac delta function, since

DKL(q(ζ) ‖ p(ζ|O, θ)) = ∫ q(ζ) [∑_{t=1}^T c(xt, ut) − log πθ(ut|xt) − log p(xt+1|xt, ut)] dζ + const.

However, we can also obtain a Gaussian q(ζ) by using the framework of linearly solvable MDPs [16] and the closely related concept of maximum entropy control [23].
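For reference, when the cost is purely quadratic in x and u and the policy term is omitted (cx = cu = cux = 0), the backward recursion above reduces to the standard finite-horizon LQR pass. A minimal numpy sketch with toy dynamics and cost matrices (chosen for illustration, not taken from the paper's experiments):

```python
import numpy as np

# Backward pass of the Q-function recursion for a linear system
# x_{t+1} = A x_t + B u_t with cost 0.5 x'Qx + 0.5 u'Ru.
# Notation: f_x = A, f_u = B, c_xx = Q, c_uu = R, c_ux = 0.
def lqr_backward(A, B, Q, R, T):
    n = A.shape[0]
    Vxx = np.zeros((n, n))                 # terminal value function is zero
    gains = []
    for _ in range(T):                     # recurse backwards through time
        Qxx = Q + A.T @ Vxx @ A
        Quu = R + B.T @ Vxx @ B
        Qux = B.T @ Vxx @ A
        K = -np.linalg.solve(Quu, Qux)     # K_t = -Quu^{-1} Qux
        Vxx = Qxx - Qux.T @ np.linalg.solve(Quu, Qux)  # V_xx update
        gains.append(K)
    return gains[::-1], Vxx                # gains[0] is the earliest-time gain

# Toy double-integrator-like system (hypothetical matrices)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])
gains, Vxx = lqr_backward(A, B, Q, R, T=50)
```

With a long enough horizon the gain converges, and the closed-loop system A + B K is stabilized, which is the behavior iterative LQR exploits when it re-linearizes around the updated trajectory.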
The optimal policy πG under this framework minimizes an augmented cost function, given by

c̃(xt, ut) = c̄(xt, ut) − H(πG),

where H(πG) is the entropy of a stochastic policy πG(ut|xt), and c̄(xt, ut) includes log πθ(ut|xt) as above. Ziebart [23] showed that the optimal policy can be written as

πG(ut|xt) = exp(−Qt(xt, ut) + Vt(xt)),

where V is a "softened" value function given by

Vt(xt) = −log ∫ exp(−Qt(xt, ut)) dut.

Under linear dynamics and quadratic costs, V has the same form as in the LQR derivation above, which means that πG(ut|xt) is a linear Gaussian with mean g(xt) and covariance Quut⁻¹ [10]. Together with the linearized dynamics, the resulting policy specifies a Gaussian distribution over trajectories with Markovian independence:

q(ζ) = p̃(x1) ∏_{t=1}^T πG(ut|xt) p̃(xt+1|xt, ut),

where πG(ut|xt) = N(g(xt), Quut⁻¹), p̃(x1) is an initial state distribution, and p̃(xt+1|xt, ut) = N(fxt x̂t + fut ût + x̄t+1, Σft) is the linearized dynamics with Gaussian noise Σft. This distribution also corresponds to a Laplace approximation for p(ζ|O, θ), which is formed from the exponential of the second order Taylor expansion of log p(ζ|O, θ) [15].

Once we compute πG(ut|xt) using iterative LQR/DDP, it is straightforward to obtain the marginal distributions q(xt), which will be useful in the next section for minimizing the variational bound L(q, θ).
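The softened value function has a closed form when Q is quadratic in the action: with the cost-based convention πG ∝ exp(−Q), the integral is Gaussian, with mean at the minimizer of Q and covariance given by the inverse curvature. A scalar numeric check, with arbitrarily chosen coefficients (a sketch, not the paper's code):

```python
import numpy as np

# Scalar quadratic action cost Q(u) = 0.5*a*u^2 - b*u (hypothetical a, b).
# Then V = -log ∫ exp(-Q(u)) du has the closed form below, and the
# maximum-entropy policy pi(u) = exp(-Q(u) + V) is N(b/a, 1/a).
a, b = 2.0, 0.7
V_analytic = -(0.5 * np.log(2 * np.pi / a) + b ** 2 / (2 * a))

# Check by brute-force quadrature on a wide grid
u = np.linspace(-20.0, 20.0, 400001)
du = u[1] - u[0]
Z = np.sum(np.exp(-(0.5 * a * u ** 2 - b * u))) * du
V_numeric = -np.log(Z)

# The resulting policy normalizes to 1 and has mean b/a
pi_u = np.exp(-(0.5 * a * u ** 2 - b * u) + V_analytic)
mean_u = np.sum(u * pi_u) * du
```

This mirrors why, under linear-quadratic assumptions, πG(ut|xt) comes out as a linear Gaussian with mean g(xt) and covariance Quut⁻¹.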
Using µt and Σt to denote the mean and covariance of the marginal at time t and assuming that the initial state distribution at t = 1 is given, the marginals can be computed recursively as

µt+1 = [fxt fut] [ µt ; ūt + kt + Kt(µt − x̄t) ]

Σt+1 = [fxt fut] [ Σt, Σt Ktᵀ ; Kt Σt, Quut⁻¹ + Kt Σt Ktᵀ ] [fxt fut]ᵀ + Σft.

¹The minimization is not exact if the dynamics p(xt+1|xt, ut) are not deterministic, but the result is very close if the dynamics have much lower entropy than the policy and exponentiated cost, which is often the case.

Algorithm 1 Variational Guided Policy Search
1: Initialize q(ζ) using DDP with cost c̄(xt, ut) = α0 c(xt, ut)
2: for iteration k = 1 to K do
3:   Compute marginals (µ1, Σ1), …, (µT, ΣT) for q(ζ)
4:   Optimize L(q, θ) with respect to θ using standard nonlinear optimization methods
5:   Set αk based on annealing schedule, for example αk = exp((K − k)/K log α0 + k/K log αK)
6:   Optimize q(ζ) using DDP with cost c̄(xt, ut) = αk c(xt, ut) − log πθ(ut|xt)
7: end for
8: Return optimized policy πθ(ut|xt)

When the dynamics are nonlinear or the modified cost c̄(xt, ut) is nonquadratic, this solution only approximates the minimum of the KL divergence. In practice, the approximation is quite good when the dynamics and the cost c(xt, ut) are smooth. Unfortunately, the policy term log πθ(ut|xt) in the modified cost c̄(xt, ut) can be quite jagged early on in the optimization, particularly for nonlinear policies. To mitigate this issue, we compute the derivatives of the policy not only along the current trajectory, but also at samples drawn from the current marginals q(xt), and average them together.
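The marginal recursion can be sketched directly in numpy. The matrices below are small random stand-ins for the linearization terms fxt, fut, the gains kt, Kt, the policy covariance Quut⁻¹, and the noise Σft (none of them come from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 3, 2, 20                       # state dim, action dim, horizon (toy sizes)
fx = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # f_xt (near-identity)
fu = 0.1 * rng.standard_normal((n, m))                # f_ut
K = 0.1 * rng.standard_normal((m, n))                 # feedback gain K_t
k = 0.01 * np.ones(m)                                 # open-loop term k_t
xbar = np.zeros(n)                                    # nominal state \bar{x}_t
ubar = np.zeros(m)                                    # nominal action \bar{u}_t
Quu_inv = 0.05 * np.eye(m)                            # policy covariance Quu^{-1}
Sigma_f = 0.01 * np.eye(n)                            # dynamics noise Sigma_ft

F = np.hstack([fx, fu])                               # [f_xt f_ut]
mu = np.ones(n)                                       # mu_1
Sigma = np.eye(n)                                     # Sigma_1
for t in range(T):
    # Mean update: mu_{t+1} = F [mu_t; ubar + k + K (mu_t - xbar)]
    mu = F @ np.concatenate([mu, ubar + k + K @ (mu - xbar)])
    # Covariance update: joint state-action covariance pushed through F
    joint = np.block([[Sigma, Sigma @ K.T],
                      [K @ Sigma, Quu_inv + K @ Sigma @ K.T]])
    Sigma = F @ joint @ F.T + Sigma_f
    Sigma = 0.5 * (Sigma + Sigma.T)                   # keep numerically symmetric
```

The inner block matrix is exactly the covariance of the joint state-action pair (xt, ut) under the linear Gaussian policy, which is why the propagated Σt stays positive definite.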
This averages out local perturbations in log πθ(ut|xt) and improves the approximation. In Section 8, we discuss more sophisticated techniques that could be used in future work to handle highly nonlinear dynamics for which this approximation may be inadequate.

5 Variational Guided Policy Search

The variational guided policy search (variational GPS) algorithm alternates between minimizing the KL divergence in Equation 2 with respect to q(ζ) as described in the previous section, and maximizing the bound L(q, θ) with respect to the policy parameters θ. Minimizing the KL divergence reduces the difference between L(q, θ) and log p(O|θ), so that the maximization of L(q, θ) becomes a progressively better approximation for the maximization of log p(O|θ). The method is summarized in Algorithm 1. The bound L(q, θ) can be maximized by a variety of standard optimization methods, such as stochastic gradient descent (SGD) or LBFGS. The gradient is given by

∇L(q, θ) = ∫ q(ζ) ∑_{t=1}^T ∇log πθ(ut|xt) dζ ≈ (1/M) ∑_{i=1}^M ∑_{t=1}^T ∇log πθ(u_t^i|x_t^i),   (3)

where the samples (x_t^i, u_t^i) are drawn from the marginals q(xt, ut). When using SGD, new samples can be drawn at every iteration, since sampling from q(xt, ut) only requires the precomputed marginals from the preceding section. Because the marginals are computed using linearized dynamics, we can be assured that the samples will not deviate drastically from the optimized trajectory, regardless of the true dynamics. The resulting SGD optimization is analogous to a supervised learning task with an infinite training set. When using LBFGS, a new sample set can be generated every n LBFGS iterations. We found that values of n from 20 to 50 produced good results.

When choosing the policy class, it is common to use deterministic policies with additive Gaussian noise.
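A toy sketch of the sample-based gradient estimate in Equation 3, for a one-dimensional linear-Gaussian policy whose expected gradient is available in closed form (all distributions below are made-up stand-ins for the DDP marginals, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.2                                   # policy mean parameter: pi = N(theta*x, 1)
M = 200000                                    # number of samples from the marginals

# Hypothetical marginals q(x_t, u_t): x ~ N(0, 1), u | x ~ N(0.5*x, 0.1)
x = rng.standard_normal(M)
u = 0.5 * x + np.sqrt(0.1) * rng.standard_normal(M)

# d/dtheta log N(u; theta*x, 1) = (u - theta*x) * x, averaged over samples
grad_mc = np.mean((u - theta * x) * x)

# Closed form: E[x*u] - theta * E[x^2] = 0.5 - theta
grad_exact = 0.5 - theta
```

Because the samples come from precomputed Gaussian marginals rather than policy rollouts, fresh samples are essentially free, which is what makes the SGD analogy to supervised learning with an infinite training set apt.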
In this case, we can optimize the policy more quickly and with many fewer samples by only sampling states and evaluating the integral over actions analytically. Letting µ^θ_{xt}, Σ^θ_{xt} and µ^q_{xt}, Σ^q_{xt} denote the means and covariances of πθ(ut|xt) and q(ut|xt), we can write L(q, θ) as

L(q, θ) ≈ (1/M) ∑_{i=1}^M ∑_{t=1}^T ∫ q(ut|x_t^i) log πθ(ut|x_t^i) dut + const
= (1/M) ∑_{i=1}^M ∑_{t=1}^T [ −½ (µ^θ_{x_t^i} − µ^q_{x_t^i})ᵀ (Σ^θ_{x_t^i})⁻¹ (µ^θ_{x_t^i} − µ^q_{x_t^i}) − ½ log |Σ^θ_{x_t^i}| − ½ tr((Σ^θ_{x_t^i})⁻¹ Σ^q_{x_t^i}) ] + const.

Two additional details should be taken into account in order to obtain the best results. First, although model-based trajectory optimization is more powerful than random exploration, complex tasks such as bipedal locomotion, which we address in the following section, are too difficult to solve entirely with trajectory optimization. To solve such tasks, we can initialize the procedure from a good initial trajectory, typically provided by a demonstration. This trajectory is only used for initialization and need not be reproducible by any policy, since it will be modified by subsequent DDP invocations.

Second, unlike the average cost objective, the maximum likelihood objective is sensitive to the magnitude of the cost. Specifically, the logarithm of Equation 1 corresponds to a soft minimum over all likely trajectories under the current policy, with the softness of the minimum inversely proportional to the cost magnitude.
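The analytic action integral can be sanity-checked in the scalar case, where ∫ q(u) log πθ(u) du has the Gaussian cross-entropy form above. The means and variances below are arbitrary illustration values:

```python
import numpy as np

# Scalar Gaussians: q(u|x) = N(mu_q, var_q), pi_theta(u|x) = N(mu_p, var_p)
mu_q, var_q = 0.3, 0.4        # hypothetical marginal action distribution
mu_p, var_p = -0.1, 0.8       # hypothetical policy distribution

# Closed form: -0.5*log(2*pi*var_p) - ((mu_p - mu_q)^2 + var_q) / (2*var_p)
analytic = (-0.5 * np.log(2 * np.pi * var_p)
            - ((mu_p - mu_q) ** 2 + var_q) / (2 * var_p))

# Monte Carlo check: sample u ~ q, average log pi_theta(u)
rng = np.random.default_rng(2)
u = mu_q + np.sqrt(var_q) * rng.standard_normal(500000)
log_pi = -0.5 * np.log(2 * np.pi * var_p) - (u - mu_p) ** 2 / (2 * var_p)
mc = log_pi.mean()
```

The three terms in the closed form correspond exactly to the quadratic mean-mismatch, log-determinant, and trace terms in the expression for L(q, θ), which is why state samples alone suffice once the action integral is done analytically.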
As the magnitude increases, this objective scores policies based primarily\non their best-case cost, rather than the average case. As the magnitude decreases, the objective be-\ncomes more similar to the classic average cost. Because of this, we found it bene\ufb01cial to gradually\nanneal the cost by multiplying it by \u03b1k at the kth iteration, starting with a high magnitude to favor\naggressive exploration, and ending with a low magnitude to optimize average case performance. In\nour experiments, \u03b1k begins at 1 and is reduced exponentially to 0.1 by the 50th iteration.\nSince our method produces both a parameterized policy \u03c0\u03b8(ut|xt) and a DDP solution \u03c0G(ut|xt),\none might wonder why the DDP policy itself is not a suitable controller. The issue is that \u03c0\u03b8(ut|xt)\ncan have an arbitrary parameterization, and admits constraints on available information, stationarity,\netc., while \u03c0G(ut|xt) is always a nonstationary linear feedback policy. This has three major advan-\ntages: \ufb01rst, only the learned policy may be usable at runtime if the information available at runtime\ndiffers from the information during training, for example if the policy is trained in simulation and\nexecuted on a physical system with limited sensors. Second, if the policy class is chosen carefully,\nwe might hope that the learned policy would generalize better than the DDP solution, as shown in\nprevious work [10]. Third, multiple trajectories can be used to train a single policy from different\ninitial states, creating a single controller that can succeed in a variety of situations.\n\n6 Experimental Evaluation\n\nWe evaluated our method on two simulated planar locomotion tasks: swimming and bipedal walk-\ning. For both tasks, the policy sets joint torques on a simulated robot consisting of rigid links. The\nswimmer has 3 links and 5 degrees of freedom, including the root position, and a 10-dimensional\nstate space that includes joint velocities. 
The walker has 7 links, 9 degrees of freedom, and 18 state dimensions. Due to the high dimensionality and nonlinear dynamics, these tasks represent a significant challenge for direct policy learning. The cost function for the walker was given by

c(x, u) = wu ‖u‖² + (vx − vx⋆)² + (py − py⋆)²,

where vx and vx⋆ are the current and desired horizontal velocities, py and py⋆ are the current and desired heights of the hips, and the torque penalty was set to wu = 10⁻⁴. The swimmer cost excludes the height term and uses a lower torque penalty of wu = 10⁻⁵. As discussed in the previous section, the magnitude of the cost was decreased by a factor of 10 during the first 50 iterations, and then remained fixed. Following previous work [10], the trajectory for the walker was initialized with a demonstration from a hand-crafted locomotion system [22].

The policy was represented by a neural network with one hidden layer and a soft rectifying nonlinearity of the form a = log(1 + exp(z)), with Gaussian noise at the output. Both the weights of the neural network and the diagonal covariance of the output noise were learned as part of the policy optimization. The number of policy parameters ranged from 63 for the 5-unit swimmer to 246 for the 10-unit walker. Due to its complexity and nonlinearity, this policy class presents a challenge to traditional policy search algorithms, which often focus on compact, linear policies [8].

Figure 1 shows the average cost of the learned policies on each task, along with visualizations of the swimmer and walker. Methods that sample from the current policy use 10 samples per iteration, unless noted otherwise. To ensure a fair comparison, the vertical axis shows the average cost E[c(ζ)] rather than the maximum likelihood objective log p(O|θ).
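As an illustration of this policy class, a minimal numpy forward pass with the soft rectifier a = log(1 + exp(z)) might look as follows. The sizes and weight values here are placeholders chosen for the sketch, not the learned parameters from the experiments:

```python
import numpy as np

def softplus(z):
    # log(1 + exp(z)), computed stably via logaddexp
    return np.logaddexp(0.0, z)

rng = np.random.default_rng(3)
n_in, n_hidden, n_out = 10, 5, 2       # e.g. a swimmer-sized state, 5 hidden units
W1 = 0.1 * rng.standard_normal((n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = 0.1 * rng.standard_normal((n_out, n_hidden))
b2 = np.zeros(n_out)
log_std = np.zeros(n_out)              # diagonal log-stddev of the output noise

def policy_mean(x):
    return W2 @ softplus(W1 @ x + b1) + b2

# Sample an action: mean network output plus Gaussian noise
x = rng.standard_normal(n_in)
u = policy_mean(x) + np.exp(log_std) * rng.standard_normal(n_out)
```

The soft rectifier is smooth everywhere, which matters here because the DDP step differentiates log πθ(ut|xt) with respect to the state.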
The cost was evaluated for both the actual stochastic policy (solid line), and a deterministic policy obtained by setting the variance of the Gaussian noise to zero (dashed line). Each plot also shows the cost of the initial DDP solution. Policies with costs significantly above this amount do not succeed at the task, either falling in the case of the walker, or failing to make forward progress in the case of the swimmer. Our method learned successful policies for each task, and often converged faster than previous methods, though performance during early iterations was often poor. We believe this is because the variational bound L(q, θ) does not become a good proxy for log p(O|θ) until after several invocations of DDP, at which point the algorithm is able to rapidly improve the policy.

Figure 1: Comparison of variational guided policy search (VGPS) with prior methods. The average cost of the stochastic policy is shown with a solid line, and the average cost of the deterministic policy without Gaussian noise is shown with a dashed line. The bottom-right panel shows plots of the swimmer and walker, with the center of mass trajectory under the learned policy shown in blue, and the initial DDP solution shown in black.

The first method we compare to is guided policy search (GPS), which uses importance sampling to introduce samples from the DDP solution into a likelihood ratio policy search [10]. The GPS algorithm first draws a fixed number of samples from the DDP solution, and then adds on-policy samples at each iteration. Like our method, GPS uses DDP to explore regions of low cost, but the policy optimization is done using importance sampling, which can be susceptible to degenerate weights in high dimensions. Since standard GPS only samples from the initial DDP solution, these samples are only useful if they can be reproduced by the policy class.
Otherwise, GPS must rely on random\nexploration to improve the solution. On the easier swimmer task, the GPS policy can reproduce the\ninitial trajectory and succeeds immediately. However, GPS is unable to \ufb01nd a successful walking\npolicy with only 5 hidden units, which requires modi\ufb01cations to the initial trajectory. In addition, al-\nthough the deterministic GPS policy performs well on the walker with 10 hidden units, the stochastic\npolicy fails more often. This suggests that the GPS optimization is not learning a good variance for\nthe Gaussian policy, possibly because the normalized importance sampled estimator places greater\nemphasis on the relative probability of the samples than their absolute probability.\nThe adaptive variant of GPS runs DDP at every iteration and adapts to the current policy, in the same\nmanner as our method. However, samples from this adapted DDP solution are then included in the\npolicy optimization with importance sampling, while our approach optimizes the variational bound\nL(q, \u03b8).\nIn the GPS estimator, each sample \u03b6i is weighted by an importance weight dependent\non \u03c0\u03b8(\u03b6i), while the samples in our optimization are not weighted. When a sample has a low\nprobability under the current policy, it is ignored by the importance sampled optimizer. Because of\nthis, although the adaptive variant of GPS improves on the standard variant, it is still unable to learn\na walking policy with 5 hidden units, while our method quickly discovers an effective policy.\nWe also compared to an imitation learning method called DAGGER. DAGGER aims to learn a pol-\nicy that imitates an oracle [14], which in our case is the DDP solution. At each iteration, DAGGER\nadds samples from the current policy to a dataset, and then optimizes the policy to take the oracle\naction at each dataset state. 
While adjusting the current policy to match the DDP solution may appear similar to our approach, we found that DAGGER performed poorly on these tasks, since the on-policy samples initially visited states that were very far from the DDP solution, and therefore the DDP action at these states was large and highly suboptimal. To reduce the impact of these poor states, we implemented a variant of DAGGER which weighted the samples by their probability under the DDP marginals. This variant succeeded on the swimming tasks and eventually found a good deterministic policy for the walker with 10 hidden units, though the learned stochastic policy performed very poorly. We also implemented an adapted variant, where the DDP solution is reoptimized at each iteration to match the policy (in addition to weighting), but this variant performed worse.

[Figure 1 plots: panels for the swimmer and walker with 5 and 10 hidden units, average cost vs. iteration; legend: DDP solution, variational GPS, GPS, adapted GPS, cost-weighted, cost-weighted 1000, DAGGER, weighted DAGGER, adapted DAGGER.]

Unlike DAGGER, our method samples from a Gaussian distribution around the current DDP solution, ensuring that all samples are drawn from good parts of the state space. Because of this, our method is much less sensitive to poor or unstable initial policies.

Finally, we compare to an alternative variational policy search algorithm analogous to PoWER [8]. Although PoWER requires a linear policy parameterization and a specific exploration strategy, we can construct an analogous non-linear algorithm by replacing the analytic M-step with nonlinear optimization, as in our method.
This algorithm is identical to ours, except that instead of using DDP\nto optimize q(\u03b6), the variational distribution is formed by taking samples from the current policy and\nreweighting them by the exponential of their cost. We call this method \u201ccost-weighted.\u201d The policy\nis still initialized with supervised training to resemble the initial DDP solution, but otherwise this\nmethod does not bene\ufb01t from trajectory optimization and relies entirely on random exploration. This\nkind of exploration is generally inadequate for such complex tasks. Even if the number of samples\nper iteration is increased to 103 (denoted as \u201ccost-weighted 1000\u201d), this method still fails to solve\nthe harder walking task, suggesting that simply taking more random samples is not the solution.\nThese results show that our algorithm outperforms prior methods because of two advantages: we use\na model-based trajectory optimization algorithm instead of random exploration, which allows us to\noutperform model-free methods such as the \u201ccost-weighted\u201d PoWER analog, and we decompose the\npolicy search into two simple optimization problems that can each be solved ef\ufb01ciently by standard\nalgorithms, which leaves us less vulnerable to local optima than more complex methods like GPS.\n\n7 Previous Work\n\nIn optimizing a maximum likelihood objective, our method builds on previous work that frames\ncontrol as inference [20, 19, 13]. Such methods often rede\ufb01ne optimality in terms of a log evidence\nprobability, as in Equation 1. Although this de\ufb01nition differs from the classical expected return, our\nevaluation suggests that policies optimized with respect to this measure also exhibit a good average\nreturn. As we discuss in Section 5, this objective is risk seeking when the cost magnitude is high, and\nannealing can be used to gradually transition from an objective that favors aggressive exploration\nto one that resembles the average return. 
Other authors have also proposed alternative de\ufb01nitions\nof optimality that include appealing properties like maximization of entropy [23] or computational\nbene\ufb01ts [16]. However, our work is the \ufb01rst to our knowledge to show how trajectory optimization\ncan be used to guide policy learning within the control-as-inference framework.\nOur variational decomposition follows prior work on policy search with variational inference [3, 11]\nand expectation maximization [8, 21]. Unlike these methods, our approach aims to \ufb01nd a variational\ndistribution q(\u03b6) that is best suited for control and leverages a known dynamics model. We present an\ninterpretation of the KL divergence minimization in Equation 2 as model-based exploration, which\ncan be performed with a variant of DDP. As shown in our evaluation, this provides our method\nwith a signi\ufb01cant advantage over methods that rely on model-free random exploration, though at the\ncost of requiring a differentiable model of the dynamics. Interestingly, our algorithm never requires\nsamples to be drawn from the current policy. This can be an advantage in applications where running\nan unstable, incompletely optimized policy can be costly or dangerous.\nOur use of DDP to guide the policy search parallels our previous Guided Policy Search (GPS)\nalgorithm [10]. Unlike the proposed method, GPS incorporates samples from DDP directly into\nan importance-sampled estimator of the return. These samples are therefore only useful when the\npolicy class can reproduce them effectively. As shown in the evaluation of the walker with 5 hidden\nunits, GPS may be unable to discover a good policy when the policy class cannot reproduce the\ninitial DDP solution. 
Adaptive GPS addresses this issue by reoptimizing the trajectory to resemble the current policy, but the policy is still optimized with respect to an importance-sampled return estimate, which leaves it highly prone to local optima, and the theoretical justification for adaptation is unclear. The proposed method justifies the reoptimization of the trajectory under a variational framework, and uses standard maximum likelihood in place of the complex importance-sampled objective.

We also compared our method to DAGGER [14], which uses a general-purpose supervised training algorithm to train the current policy to match an oracle, which in our case is the DDP solution. DAGGER matches actions from the oracle policy at states visited by the current policy, under the assumption that the oracle can provide good actions in all states. This assumption does not hold for DDP, which is only valid in a narrow region around the trajectory. To mitigate the locality of the DDP solution, we weighted the samples by their probability under the DDP marginals, which allowed DAGGER to solve the swimming task, but it was still outperformed by our method on the walking task, even with adaptation of the DDP solution. Unlike DAGGER, our approach is relatively insensitive to the instability of the learned policy, since the learned policy is not sampled.

Several prior methods also propose to improve policy search by using a distribution over high-value states, which might come from a DDP solution [6, 1]. Such methods generally use this "restart" distribution as a new initial state distribution, and show that optimizing a policy from such a restart distribution also optimizes the expected return.
Unlike our approach, such methods only use the\nstates from the DDP solution, not the actions, and tend to suffer from the increased variance of the\nrestart distribution, as shown in previous work [10].\n\n8 Discussion and Future Work\n\nWe presented a policy search algorithm that employs a variational decomposition of a maximum\nlikelihood objective to combine trajectory optimization with policy search. The variational distri-\nbution is obtained using differential dynamic programming (DDP), and the policy can be optimized\nwith a standard nonlinear optimization algorithm. Model-based trajectory optimization effectively\ntakes the place of random exploration, providing a much more effective means for \ufb01nding low cost\nregions that the policy is then trained to visit. Our evaluation shows that this algorithm outperforms\nprior variational methods and prior methods that use trajectory optimization to guide policy search.\nOur algorithm has several interesting properties that distinguish it from prior methods. First, the pol-\nicy search does not need to sample the learned policy. This may be useful in real-world applications\nwhere poor policies might be too risky to run on a physical system. More generally, this prop-\nerty improves the robustness of our method in the face of unstable initial policies, where on-policy\nsamples have extremely high variance. By sampling directly from the Gaussian marginals of the\nDDP-induced distribution over trajectories, our approach also avoids some of the issues associated\nwith unstable dynamics, requiring only that the task permit effective trajectory optimization.\nBy optimizing a maximum likelihood objective, our method favors policies with good best-case\nperformance. Obtaining good best-case performance is often the hardest part of policy search, since\na policy that achieves good results occasionally is easier to improve with standard on-policy search\nmethods than one that fails outright. 
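The alternating scheme described above can be sketched as plain supervised regression: sample states from the per-timestep Gaussian marginals of the DDP-induced trajectory distribution, query the DDP feedback law for actions at those states, and fit the policy by maximum likelihood. The marginals, feedback gains, and linear-Gaussian policy class below are illustrative assumptions rather than the paper's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DDP solution: per-timestep Gaussian state marginals N(mu_t, Sigma_t)
# and a time-varying linear feedback law u = k_t + K_t (x - mu_t).
T, dx, du = 20, 4, 2
mu = rng.standard_normal((T, dx))
Sigma = np.stack([0.1 * np.eye(dx)] * T)
k = rng.standard_normal((T, du))
K = rng.standard_normal((T, du, dx))

# Exploration phase: draw states from the DDP marginals (the learned policy
# itself is never executed) and record the DDP actions at those states.
n_samples = 50
X, U = [], []
for t in range(T):
    xs = rng.multivariate_normal(mu[t], Sigma[t], size=n_samples)
    X.append(xs)
    U.append(k[t] + (xs - mu[t]) @ K[t].T)
X, U = np.vstack(X), np.vstack(U)

# Policy phase: maximum likelihood for a stationary linear-Gaussian policy
# pi(u|x) = N(W^T [x; 1], sigma^2 I) reduces to least-squares regression.
Xb = np.hstack([X, np.ones((len(X), 1))])
W, *_ = np.linalg.lstsq(Xb, U, rcond=None)
mse = np.mean((Xb @ W - U) ** 2)
print("policy regression MSE:", mse)
```

Because the regression targets come from the DDP solution rather than from on-policy rollouts, an unstable intermediate policy never has to be run, and the maximum likelihood fit inherits the best-case flavor noted above.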
However, modifying the algorithm to optimize the standard\naverage cost criterion could produce more robust controllers in the future.\nThe use of local linearization in DDP results in only approximate minimization of the KL divergence\nin Equation 2 in nonlinear domains or with nonquadratic policies. While we mitigate this by averag-\ning the policy derivatives over multiple samples from the DDP marginals, this approach could still\nbreak down in the presence of highly nonsmooth dynamics or policies. An interesting avenue for\nfuture work is to extend the trajectory optimization method to nonsmooth domains by using samples\nrather than linearization, perhaps analogously to the unscented Kalman \ufb01lter [5, 18]. This could also\navoid the need to differentiate the policy with respect to the inputs, allowing for richer policy classes\nto be used. Another interesting avenue for future work is to apply model-free trajectory optimiza-\ntion techniques [7], which would avoid the need for a model of the system dynamics, or to learn the\ndynamics from data, for example by using Gaussian processes [2]. It would also be straightforward\nto use multiple trajectories optimized from different initial states to learn a single policy that is able\nto succeed under a variety of initial conditions.\nOverall, we believe that trajectory optimization is a very useful tool for policy search. By separating\nthe policy optimization and exploration problems into two separate phases, we can employ simpler\nalgorithms such as SGD and DDP that are better suited for each phase, and can achieve superior\nperformance on complex tasks. We believe that additional research into augmenting policy learning\nwith trajectory optimization can further advance the performance of policy search techniques.\n\nAcknowledgments\n\nWe thank Emanuel Todorov, Tom Erez, and Yuval Tassa for providing the simulator used in our\nexperiments. 
Sergey Levine was supported by NSF Graduate Research Fellowship DGE-0645962.

References

[1] A. Bagnell, S. Kakade, A. Ng, and J. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems (NIPS), 2003.
[2] M. Deisenroth and C. Rasmussen. PILCO: a model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), 2011.
[3] T. Furmston and D. Barber. Variational methods for reinforcement learning. Journal of Machine Learning Research, 9:241–248, 2010.
[4] D. Jacobson and D. Mayne. Differential Dynamic Programming. Elsevier, 1970.
[5] S. Julier and J. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In International Symposium on Aerospace/Defense Sensing, Simulation, and Control, 1997.
[6] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), 2002.
[7] M. Kalakrishnan, S. Chitta, E. Theodorou, P. Pastor, and S. Schaal. STOMP: stochastic trajectory optimization for motion planning. In International Conference on Robotics and Automation, 2011.
[8] J. Kober and J. Peters. Learning motor primitives for robotics. In International Conference on Robotics and Automation, 2009.
[9] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[10] S. Levine and V. Koltun. Guided policy search. In International Conference on Machine Learning (ICML), 2013.
[11] G. Neumann. Variational inference for policy search in changing situations. In International Conference on Machine Learning (ICML), 2011.
[12] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
[13] K. Rawlik, M. Toussaint, and S. Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems, 2012.
[14] S. Ross, G. Gordon, and A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research, 15:627–635, 2011.
[15] L. Tierney and J. B. Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, 1986.
[16] E. Todorov. Policy gradients in linearly-solvable MDPs. In Advances in Neural Information Processing Systems (NIPS 23), 2010.
[17] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, 2005.
[18] E. Todorov and Y. Tassa. Iterative local dynamic programming. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2009.
[19] M. Toussaint. Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML), 2009.
[20] M. Toussaint, L. Charlin, and P. Poupart. Hierarchical POMDP controller optimization by likelihood maximization. In Uncertainty in Artificial Intelligence (UAI), 2008.
[21] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis. Learning model-free robot control by a Monte Carlo EM algorithm. Autonomous Robots, 27(2):123–130, 2009.
[22] K. Yin, K. Loken, and M. van de Panne. SIMBICON: simple biped locomotion control. ACM Transactions on Graphics, 26(3), 2007.
[23] B. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.