{"title": "Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 1071, "page_last": 1079, "abstract": "We present a policy search method that uses iteratively refitted local linear models to optimize trajectory distributions for large, continuous problems. These trajectory distributions can be used within the framework of guided policy search to learn policies with an arbitrary parameterization. Our method fits time-varying linear dynamics models to speed up learning, but does not rely on learning a global model, which can be difficult when the dynamics are complex and discontinuous. We show that this hybrid approach requires many fewer samples than model-free methods, and can handle complex, nonsmooth dynamics that can pose a challenge for model-based techniques. We present experiments showing that our method can be used to learn complex neural network policies that successfully execute simulated robotic manipulation tasks in partially observed environments with numerous contact discontinuities and underactuation.", "full_text": "Learning Neural Network Policies with Guided Policy\n\nSearch under Unknown Dynamics\n\nDepartment of Electrical Engineering and Computer Science\n\nUniversity of California, Berkeley\n\nSergey Levine and Pieter Abbeel\n\n{svlevine, pabbeel}@eecs.berkeley.edu\n\nBerkeley, CA 94709\n\nAbstract\n\nWe present a policy search method that uses iteratively re\ufb01tted local linear models\nto optimize trajectory distributions for large, continuous problems. These tra-\njectory distributions can be used within the framework of guided policy search\nto learn policies with an arbitrary parameterization. 
Our method \ufb01ts time-varying\nlinear dynamics models to speed up learning, but does not rely on learning a global\nmodel, which can be dif\ufb01cult when the dynamics are complex and discontinuous.\nWe show that this hybrid approach requires many fewer samples than model-free\nmethods, and can handle complex, nonsmooth dynamics that can pose a challenge\nfor model-based techniques. We present experiments showing that our method\ncan be used to learn complex neural network policies that successfully execute\nsimulated robotic manipulation tasks in partially observed environments with nu-\nmerous contact discontinuities and underactuation.\n\n1\n\nIntroduction\n\nPolicy search methods can be divided into model-based algorithms, which use a model of the system\ndynamics, and model-free techniques, which rely only on real-world experience without learning a\nmodel [10]. Although model-free methods avoid the need to model system dynamics, they typically\nrequire policies with carefully designed, low-dimensional parameterizations [4]. On the other hand,\nmodel-based methods require the ability to learn an accurate model of the dynamics, which can\nbe very dif\ufb01cult for complex systems, especially when the algorithm imposes restrictions on the\ndynamics representation to make the policy search ef\ufb01cient and numerically stable [5].\nIn this paper, we present a hybrid method that \ufb01ts local, time-varying linear dynamics models, which\nare not accurate enough for standard model-based policy search. However, we can use these local\nlinear models to ef\ufb01ciently optimize a time-varying linear-Gaussian controller, which induces an\napproximately Gaussian distribution over trajectories. The key to this procedure is to restrict the\nchange in the trajectory distribution at each iteration, so that the time-varying linear model remains\nvalid under the new distribution. 
Since the trajectory distribution is approximately Gaussian, this\ncan be done ef\ufb01ciently, in terms of both sample count and computation time.\nTo then learn general parameterized policies, we combine this trajectory optimization method with\nguided policy search. Guided policy search optimizes policies by using trajectory optimization in\nan iterative fashion, with the policy optimized to match the trajectory, and the trajectory optimized\nto minimize cost and match the policy. Previous guided policy search methods used model-based\ntrajectory optimization algorithms that required known, differentiable system dynamics [12, 13, 14].\nUsing our algorithm, guided policy search can be performed under unknown dynamics.\nThis hybrid guided policy search method has several appealing properties. First, the parameterized\npolicy never needs to be executed on the real system \u2013 all system interaction during training is done\n\n1\n\n\fusing the time-varying linear-Gaussian controllers. Stabilizing linear-Gaussian controllers is easier\nthan stabilizing arbitrary policies, and this property can be a notable safety bene\ufb01t when the initial\nparameterized policy is unstable. Second, although our algorithm relies on \ufb01tting a time-varying\nlinear dynamics model, we show that it can handle contact-rich tasks where the true dynamics are\nnot only nonlinear, but even discontinuous. This is because the learned linear models average the\ndynamics from both sides of a discontinuity in proportion to how often each side is visited, unlike\nstandard linearization methods that differentiate the dynamics. This effectively transforms a discon-\ntinuous deterministic problem into a smooth stochastic one. Third, our algorithm can learn policies\nfor partially observed tasks by training a parameterized policy that is only allowed to observe some\nparts of the state space, using a fully observed formulation for the trajectory optimizer. 
This corresponds to full state observation during training (for example in an instrumented environment), but only partial observation at test time, making policy search for partially observed tasks significantly easier. In our evaluation, we demonstrate this capability by training a policy for inserting a peg into a hole when the precise position of the hole is unknown at test time. The learned policy, represented by a neural network, acquires a strategy that searches for and finds the hole regardless of its position.

The main contribution of our work is an algorithm for optimizing trajectories under unknown dynamics. We show that this algorithm outperforms prior methods in terms of both sample complexity and the quality of the learned trajectories. We also show that our method can be integrated with guided policy search, which previously required known models, to learn policies with an arbitrary parameterization, and again demonstrate that the resulting policy search method outperforms prior methods that optimize the parameterized policy directly. Our experimental evaluation includes simulated peg-in-hole insertion, high-dimensional octopus arm control, swimming, and bipedal walking.

2 Preliminaries

Policy search consists of optimizing the parameters θ of a policy π_θ(u_t|x_t), which is a distribution over actions u_t conditioned on states x_t, with respect to the expectation of a cost ℓ(x_t, u_t), denoted $E_{\pi_\theta}[\sum_{t=1}^T \ell(x_t, u_t)]$. The expectation is under the policy and the dynamics p(x_{t+1}|x_t, u_t), which together form a distribution over trajectories τ. We will use $E_{\pi_\theta}[\ell(\tau)]$ to denote the expected cost. Our algorithm optimizes a time-varying linear-Gaussian policy $p(u_t|x_t) = N(K_t x_t + k_t, C_t)$, which allows for a particularly efficient optimization method when the initial state distribution is narrow and approximately Gaussian.
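To make the setup concrete, the expected cost of a time-varying linear-Gaussian controller can be estimated by Monte Carlo rollouts. The sketch below is our own toy numpy illustration, not the paper's implementation: the double-integrator dynamics, gains, and cost are illustrative assumptions. It samples u_t = K_t x_t + k_t + N(0, C_t) and averages the trajectory cost.

```python
import numpy as np

def rollout_cost(K, k, C, A, B, x0, cost, T, rng):
    """Sample one trajectory under u_t = K_t x_t + k_t + N(0, C_t)
    on linear dynamics x_{t+1} = A x_t + B u_t, accumulating the cost."""
    x, total = x0.copy(), 0.0
    for t in range(T):
        noise = rng.multivariate_normal(np.zeros(len(k[t])), C[t])
        u = K[t] @ x + k[t] + noise
        total += cost(x, u)
        x = A @ x + B @ u
    return total

# Toy double integrator: state (position, velocity), scalar torque action.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
T = 20
K = [np.array([[-2.0, -1.0]])] * T       # time-invariant gains for simplicity
k = [np.zeros(1)] * T
C = [0.01 * np.eye(1)] * T               # exploration noise covariance
cost = lambda x, u: float(x @ x + 0.01 * (u @ u))

rng = np.random.default_rng(0)
costs = [rollout_cost(K, k, C, A, B, np.array([1.0, 0.0]), cost, T, rng)
         for _ in range(100)]
expected_cost = float(np.mean(costs))    # Monte Carlo estimate of E[l(tau)]
```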
Arbitrary parameterized policies \u03c0\u03b8 are optimized using the\nguided policy search technique, in which \u03c0\u03b8 is trained to match one or more Gaussian policies p. In\nthis way, we can learn a policy that succeeds from many initial states by training a single stationary,\nnonlinear policy \u03c0\u03b8, which might be represented (for example) by a neural network, from multiple\nGaussian policies. As we show in Section 5, this approach can outperform methods that search\nfor the policy parameters \u03b8 directly, by taking advantage of the linear-Gaussian structure of p to\naccelerate learning. For clarity, we will refer to p as a trajectory distribution since, for a narrow\nCt and well-behaved dynamics, it induces an approximately Gaussian distribution over trajectories,\nwhile the term \u201cpolicy\u201d will be reserved for the parameterized policy \u03c0\u03b8.\nTime-varying linear-Gaussian policies have previously been used in a number of model-based and\nmodel-free methods [25, 16, 14] due to their close connection with linear feedback controllers, which\nare frequently used in classic deterministic trajectory optimization. The algorithm we will describe\nbuilds on the iterative linear-Gaussian regulator (iLQG), which optimizes trajectories by iteratively\nconstructing locally optimal linear feedback controllers under a local linearization of the dynamics\nand a quadratic expansion of the cost [15]. Under linear dynamics and quadratic costs, the value\nor cost-to-go function is quadratic, and can be computed with dynamic programming. 
The iLQG algorithm alternates between computing the quadratic value function around the current trajectory, and updating the trajectory using a rollout of the corresponding linear feedback controller.

We will use subscripts to denote derivatives, so that $\ell_{xu,t}$ is the derivative of the cost at time step t with respect to $(x_t, u_t)^T$, $\ell_{xu,xu,t}$ is the Hessian, $\ell_{x,t}$ is the derivative with respect to $x_t$, and so forth. Using $N(f_{x,t} x_t + f_{u,t} u_t, F_t)$ to denote the local linear-Gaussian approximation to the dynamics, iLQG computes the first and second derivatives of the Q and value functions as follows:

$$Q_{xu,xu,t} = \ell_{xu,xu,t} + f_{xu,t}^T V_{x,x,t+1} f_{xu,t} \qquad Q_{xu,t} = \ell_{xu,t} + f_{xu,t}^T V_{x,t+1}$$
$$V_{x,x,t} = Q_{x,x,t} - Q_{u,x,t}^T Q_{u,u,t}^{-1} Q_{u,x,t} \qquad V_{x,t} = Q_{x,t} - Q_{u,x,t}^T Q_{u,u,t}^{-1} Q_{u,t} \qquad (1)$$

The linear controller $g(x_t) = \hat{u}_t + k_t + K_t(x_t - \hat{x}_t)$ can be shown to minimize this quadratic Q-function, where $\hat{x}_t$ and $\hat{u}_t$ are the states and actions of the current trajectory, $K_t = -Q_{u,u,t}^{-1} Q_{u,x,t}$, and $k_t = -Q_{u,u,t}^{-1} Q_{u,t}$. We can also construct a linear-Gaussian controller with the mean given by the deterministic optimal solution, and the covariance proportional to the curvature of the Q-function:

$$p(u_t|x_t) = N(\hat{u}_t + k_t + K_t(x_t - \hat{x}_t),\ Q_{u,u,t}^{-1})$$

Prior work has shown that this distribution optimizes a maximum entropy objective [12], given by

$$p(\tau) = \arg\min_{p(\tau) \in N(\tau)} E_p[\ell(\tau)] - H(p(\tau)) \quad \text{s.t.} \quad p(x_{t+1}|x_t, u_t) = N(x_{t+1};\ f_{x,t} x_t + f_{u,t} u_t,\ F_t), \qquad (2)$$

where H is the differential entropy. This means that the linear-Gaussian controller produces the widest, highest-entropy distribution that also minimizes the expected cost, subject to the linearized dynamics and quadratic cost function.
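The backward recursion in Equation (1) is straightforward to implement. The following sketch is our own numpy illustration (function and variable names are assumptions, not the authors' code): it runs the dynamic programming pass for known linear dynamics and quadratic cost, returning the gains and offsets of the linear controller, and sanity-checks the result on a scalar system whose stationary LQR gain is known in closed form.

```python
import numpy as np

def lqr_backward(fx, fu, lxx, luu, lux, lx, lu, T):
    """Dynamic programming pass of Equation (1): propagate the quadratic
    value function backward, returning K_t = -Quu^-1 Qux and k_t = -Quu^-1 Qu."""
    n = fx[0].shape[0]
    Vxx, Vx = np.zeros((n, n)), np.zeros(n)
    K, k = [None] * T, [None] * T
    for t in reversed(range(T)):
        Qxx = lxx[t] + fx[t].T @ Vxx @ fx[t]
        Quu = luu[t] + fu[t].T @ Vxx @ fu[t]
        Qux = lux[t] + fu[t].T @ Vxx @ fx[t]
        Qx = lx[t] + fx[t].T @ Vx
        Qu = lu[t] + fu[t].T @ Vx
        K[t] = -np.linalg.solve(Quu, Qux)
        k[t] = -np.linalg.solve(Quu, Qu)
        Vxx = Qxx - Qux.T @ np.linalg.solve(Quu, Qux)
        Vx = Qx - Qux.T @ np.linalg.solve(Quu, Qu)
    return K, k

# Sanity check: scalar system x' = x + u with cost x^2 + u^2 (Hessians = 2);
# the stationary LQR gain for this system is -0.618... (1 / golden ratio).
T = 50
one = [np.eye(1)] * T
K, k = lqr_backward(one, one, [2 * np.eye(1)] * T, [2 * np.eye(1)] * T,
                    [np.zeros((1, 1))] * T, [np.zeros(1)] * T, [np.zeros(1)] * T, T)
```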
Although this objective differs from the expected cost, it is useful as an intermediate step in algorithms that optimize the more standard expected cost objective [20, 12]. Our method similarly uses the maximum entropy objective as an intermediate step, and converges to a trajectory distribution with the optimal expected cost. However, unlike iLQG, our method operates on systems where the dynamics are unknown.

3 Trajectory Optimization under Unknown Dynamics

When the dynamics $N(f_{x,t} x_t + f_{u,t} u_t, F_t)$ are unknown, we can estimate them using samples $\{(x_{t,i}, u_{t,i}), x_{t+1,i}\}$ from the real system under the previous linear-Gaussian controller p(u_t|x_t), where $\tau_i = \{x_{1,i}, u_{1,i}, \dots, x_{T,i}, u_{T,i}\}$ is the ith rollout. Once we estimate the linear-Gaussian dynamics at each time step, we can simply run the dynamic programming algorithm in the preceding section to obtain a new linear-Gaussian controller. However, the fitted dynamics are only valid in a local region around the samples, while the new controller generated by iLQG can be arbitrarily different from the old one. The fully model-based iLQG method addresses this issue with a line search [23], which is impractical when the rollouts must be stochastically sampled from the real system. Without the line search, large changes in the trajectory will cause the algorithm to quickly fall into unstable, costly parts of the state space, preventing convergence. We address this issue by limiting the change in the trajectory distribution in each dynamic programming pass by imposing a constraint on the KL-divergence between the old and new trajectory distribution.

3.1 KL-Divergence Constraints

Under linear-Gaussian controllers, a KL-divergence constraint against the previous trajectory distribution p̂(τ) can be enforced with a simple modification of the cost function.
Omitting the dynamics constraint for clarity, the constrained problem is given by

$$\min_{p(\tau) \in N(\tau)} E_p[\ell(\tau)] \quad \text{s.t.} \quad D_{KL}(p(\tau) \| \hat{p}(\tau)) \le \epsilon.$$

This type of policy update has previously been proposed by several authors in the context of policy search [1, 19, 17]. The objective of this optimization is the standard expected cost objective, and solving this problem repeatedly, each time setting p̂(τ) to the last p(τ), will minimize $E_{p(x_t,u_t)}[\ell(x_t, u_t)]$. Using η to represent the dual variable, the Lagrangian of this problem is

$$L_{traj}(p(\tau), \eta) = E_p[\ell(\tau)] + \eta [D_{KL}(p(\tau) \| \hat{p}(\tau)) - \epsilon].$$

Since $p(x_{t+1}|x_t, u_t) = \hat{p}(x_{t+1}|x_t, u_t) = N(f_{x,t} x_t + f_{u,t} u_t, F_t)$ due to the linear-Gaussian dynamics assumption, the Lagrangian can be rewritten as

$$L_{traj}(p(\tau), \eta) = \left[ \sum_t E_{p(x_t,u_t)}[\ell(x_t, u_t) - \eta \log \hat{p}(u_t|x_t)] \right] - \eta H(p(\tau)) - \eta \epsilon.$$

Dividing both sides of this equation by η gives us an objective of the same form as Equation (2), which means that under linear dynamics we can minimize the Lagrangian with respect to p(τ) using the dynamic programming algorithm from the preceding section, with an augmented cost function $\tilde{\ell}(x_t, u_t) = \frac{1}{\eta}\ell(x_t, u_t) - \log \hat{p}(u_t|x_t)$. We can therefore solve the original constrained problem by using dual gradient descent [2], alternating between using dynamic programming to minimize the Lagrangian with respect to p(τ), and adjusting the dual variable according to the amount of constraint violation.
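The structure of this dual gradient descent can be sketched on a toy problem where the inner minimization has a closed form. The example below is entirely ours (a hypothetical 1D Gaussian problem standing in for the paper's dynamic programming pass): the inner step minimizes the Lagrangian over the distribution, and the outer step adjusts η in proportion to the constraint violation.

```python
import numpy as np

# Toy instance: minimize E_p[x^2] over p = N(mu, 1),
# subject to KL(p || phat) <= eps with phat = N(2, 1).
def solve_augmented(eta):
    # argmin_mu of mu^2 + eta * (mu - 2)^2 / 2  (the "DP pass" for this toy)
    return 2.0 * eta / (2.0 + eta)

def kl(mu):
    # KL divergence between N(mu, 1) and N(2, 1)
    return (mu - 2.0) ** 2 / 2.0

eps, eta, step = 0.5, 1.0, 0.5
for _ in range(200):
    mu = solve_augmented(eta)                     # minimize Lagrangian over p
    eta = max(eta + step * (kl(mu) - eps), 1e-4)  # dual ascent on the violation
# At convergence the constraint is tight: kl(mu) approaches eps (here eta -> 2).
```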
Using a bracketing line search with quadratic interpolation [7], this procedure usually converges within a few iterations, especially if we accept approximate constraint satisfaction, for example by stopping when the KL-divergence is within 10% of ε. Empirically, we found that the line search tends to require fewer iterations in log space, treating the dual as a function of ν = log η, which also has the convenient effect of enforcing the positivity of η.

The dynamic programming pass does not guarantee that $Q_{u,u,t}^{-1}$, which is the covariance of the linear-Gaussian controller, will always remain positive definite, since nonconvex cost functions can introduce negative eigenvalues into Equation (1) [23]. To address this issue, we can simply increase η until each $Q_{u,u,t}$ becomes positive definite, which is always possible, since the positive definite precision matrix of p̂(u_t|x_t), multiplied by η, enters additively into $Q_{u,u,t}$. This might sometimes result in the KL-divergence being lower than ε, though this happens rarely in practice.

The step size ε can be adaptively adjusted based on the discrepancy between the improvement in total cost predicted under the linear dynamics and quadratic cost approximation, and the actual improvement, which can be estimated using the new linear dynamics and quadratic cost. Since these quantities only involve expectations of quadratics under Gaussians, they can be computed analytically.

The amount of improvement obtained from optimizing p(τ) depends on the accuracy of the estimated dynamics. In general, the sample complexity of this estimation depends on the dimensionality of the state.
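The per-time-step dynamics fit itself is ordinary linear regression. A minimal sketch (our own numpy illustration, without the GMM prior introduced in Section 3.2; the test system is an assumption): it fits $x_{t+1} \approx F[x_t; u_t] + f$ with noise covariance estimated from the residuals.

```python
import numpy as np

def fit_linear_dynamics(X, U, Xn):
    """Fit x_{t+1} ~ N(F [x; u] + f, Sigma) at one time step from N samples:
    X (N, n) states, U (N, m) actions, Xn (N, n) next states."""
    N = X.shape[0]
    Z = np.hstack([X, U, np.ones((N, 1))])      # regressors [x; u; 1]
    W, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
    F, f = W[:-1].T, W[-1]
    resid = Xn - Z @ W
    Sigma = resid.T @ resid / max(N - Z.shape[1], 1)
    return F, f, Sigma

# Recover known dynamics x' = A x + B u + c + noise from random samples.
rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.5]])
c = np.array([0.1, -0.2])
X = rng.normal(size=(200, 2))
U = rng.normal(size=(200, 1))
Xn = X @ A.T + U @ B.T + c + 0.01 * rng.normal(size=(200, 2))
F, f, Sigma = fit_linear_dynamics(X, U, Xn)     # F ~ [A B], f ~ c
```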
However, the dynamics at nearby time steps and even successive iterations are\ncorrelated, and we can exploit this correlation to reduce the required number of samples.\n\n3.2 Background Dynamics Distribution\n\nWhen \ufb01tting the dynamics, we can use priors to greatly reduce the number of samples required\nat each iteration. While these priors can be constructed using domain knowledge, a more general\napproach is to construct the prior from samples at other time steps and iterations, by \ufb01tting a back-\nground dynamics distribution as a kind of crude global model. For physical systems such as robots,\na good choice for this distribution is a Gaussian mixture model (GMM), which corresponds to softly\npiecewise linear dynamics. The dynamics of a robot can be reasonably approximated with such\npiecewise linear functions [9], and they are well suited for contacts, which are approximately piece-\nwise linear with a hard boundary. If we build a GMM over vectors (xt, ut, xt+1)T, we see that\nwithin each cluster ci, the conditional ci(xt+1|xt, ut) represents a linear-Gaussian dynamics model,\nwhile the marginal ci(xt, ut) speci\ufb01es the region of the state-action space where this model is valid.\nAlthough the GMM models (softly) piecewise linear dynamics, it is not necessarily a good forward\nmodel, since the marginals ci(xt, ut) will not always delineate the correct boundary between two\nlinear modes. In the case of contacts, the boundary might have a complex shape that is not well\nmodeled by a GMM. However, if we use the GMM to obtain a prior for linear regression, it is easy\nto determine the correct linear mode from the covariance of (xti, uti) with xt+1i in the current\nsamples at time step t. 
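Conditioning a joint Gaussian over $(x_t, u_t, x_{t+1})$ on $(x_t, u_t)$, the operation used both for the fitted Gaussian and within each GMM cluster, can be sketched as follows (our own numpy illustration; the scalar example numbers are assumptions):

```python
import numpy as np

def condition_dynamics(mu, Sigma, nxu):
    """Given a joint Gaussian over (x_t, u_t, x_{t+1}) with mean mu and
    covariance Sigma, condition on z = (x_t, u_t) to obtain
    x_{t+1} | x_t, u_t ~ N(F z + f, Cov)."""
    mu_z, mu_y = mu[:nxu], mu[nxu:]
    Szz = Sigma[:nxu, :nxu]
    Syz = Sigma[nxu:, :nxu]
    Syy = Sigma[nxu:, nxu:]
    F = Syz @ np.linalg.inv(Szz)      # regression matrix [A B]
    f = mu_y - F @ mu_z               # bias term
    Cov = Syy - F @ Syz.T             # conditional (noise) covariance
    return F, f, Cov

# Scalar example: x' = 2x + 3u + N(0, 0.1), with x, u ~ N(0, 1) independent,
# so the joint covariance over (x, u, x') is the matrix below.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 3.0],
                  [2.0, 3.0, 13.1]])
F, f, Cov = condition_dynamics(mu, Sigma, 2)    # F = [[2, 3]], Cov = [[0.1]]
```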
The time-varying linear dynamics can then capture different linear modes at different time steps depending on the actual observed transitions, even if the states are very similar.

To use the GMM to construct a prior for the dynamics, we refit the GMM at each iteration to all of the samples at all time steps from the current iteration, as well as several prior iterations, in order to ensure that sufficient samples are available. We then estimate the time-varying linear dynamics by fitting a Gaussian to the samples $\{x_{t,i}, u_{t,i}, x_{t+1,i}\}$ at each time step, which can be conditioned on $(x_t, u_t)$ to obtain linear-Gaussian dynamics. The GMM is used to produce a normal-inverse-Wishart prior for the mean and covariance of this Gaussian at each time step. To obtain the prior, we infer the cluster weights for the samples at the current time step, and then use the weighted mean and covariance of these clusters as the prior parameters. We found that the best results were produced by large mixtures that modeled the dynamics in high detail. In practice, the GMM allowed us to reduce the samples at each iteration by a factor of 4 to 8, well below the dimensionality of the system.

4 General Parameterized Policies

The algorithm in the preceding section optimizes time-varying linear-Gaussian controllers.
To learn arbitrary parameterized policies, we combine this algorithm with a guided policy search (GPS) approach.

Algorithm 1 Guided policy search with unknown dynamics
1: for iteration k = 1 to K do
2:   Generate samples $\{\tau_i^j\}$ from each linear-Gaussian controller $p_i(\tau)$ by performing rollouts
3:   Fit the dynamics $p_i(x_{t+1}|x_t, u_t)$ to the samples $\{\tau_i^j\}$
4:   Minimize $\sum_{i,t} \lambda_{i,t} D_{KL}(p_i(x_t)\pi_\theta(u_t|x_t) \| p_i(x_t, u_t))$ with respect to θ using samples $\{\tau_i^j\}$
5:   Update $p_i(u_t|x_t)$ using the algorithm in Section 3 and the supplementary appendix
6:   Increment dual variables $\lambda_{i,t}$ by $\alpha D_{KL}(p_i(x_t)\pi_\theta(u_t|x_t) \| p_i(x_t, u_t))$
7: end for
8: return optimized policy parameters θ

In GPS methods, the parameterized policy is trained in supervised fashion to match samples from a trajectory distribution, and the trajectory distribution is optimized to minimize both its cost and difference from the current policy, thereby creating a good training set for the policy. By turning policy optimization into a supervised problem, GPS algorithms can train complex policies with thousands of parameters [12, 14], and since our trajectory optimization algorithm exploits the structure of linear-Gaussian controllers, it can optimize the individual trajectories with fewer samples than general-purpose model-free methods. As a result, the combined approach can learn complex policies that are difficult to train with prior methods, as shown in our evaluation.

We build on the recently proposed constrained GPS algorithm, which enforces agreement between the policy and trajectory by means of a soft KL-divergence constraint [14].
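For Gaussian policies with fixed covariance, the supervised step (step 4 of Algorithm 1) reduces to weighted regression onto the controller means at the sampled states. The sketch below is a deliberate simplification of ours, not the authors' implementation: it fits a single linear policy instead of a neural network and keeps only the mean term of the KL.

```python
import numpy as np

def fit_policy_to_controller(X, K, k, lam):
    """Fit one linear policy u = Theta [x; 1] by lambda-weighted least squares
    to the controller means K_t x + k_t at the sampled states (the mean term
    of the KL objective in step 4)."""
    Z, Y, W = [], [], []
    for t, Xt in enumerate(X):                  # X[t]: (N, n) samples at step t
        Z.append(np.hstack([Xt, np.ones((len(Xt), 1))]))
        Y.append(Xt @ K[t].T + k[t])            # controller means as targets
        W.append(np.full(len(Xt), lam[t]))
    Z, Y, W = np.vstack(Z), np.vstack(Y), np.concatenate(W)
    sw = np.sqrt(W)[:, None]
    Theta, *_ = np.linalg.lstsq(sw * Z, sw * Y, rcond=None)
    return Theta.T                              # (m, n + 1)

rng = np.random.default_rng(2)
K = [np.array([[-1.0, -0.5]])] * 2
k = [np.array([0.3])] * 2
X = [rng.normal(size=(10, 2)) for _ in range(2)]
Theta = fit_policy_to_controller(X, K, k, lam=[1.0, 2.0])  # recovers [K k]
```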
Constrained GPS optimizes the maximum entropy objective $E_{\pi_\theta}[\ell(\tau)] - H(\pi_\theta)$, but our trajectory optimization method allows us to use the more standard expected cost objective, resulting in the following optimization:

$$\min_{\theta, p(\tau)} E_{p(\tau)}[\ell(\tau)] \quad \text{s.t.} \quad D_{KL}(p(x_t)\pi_\theta(u_t|x_t) \| p(x_t, u_t)) = 0 \ \ \forall t.$$

If the constraint is enforced exactly, the policy π_θ(u_t|x_t) is identical to p(u_t|x_t), and the optimization minimizes the cost under π_θ, given by $E_{\pi_\theta}[\ell(\tau)]$. Constrained GPS enforces these constraints softly, so that π_θ and p gradually come into agreement over the course of the optimization. In general, we can use multiple distributions p_i(τ), with each trajectory starting from a different initial state or in different conditions, but we will omit the subscript for simplicity, since each p_i(τ) is treated identically and independently. The Lagrangian of this problem is given by

$$L_{GPS}(\theta, p, \lambda) = E_{p(\tau)}[\ell(\tau)] + \sum_{t=1}^T \lambda_t D_{KL}(p(x_t)\pi_\theta(u_t|x_t) \| p(x_t, u_t)).$$

The GPS Lagrangian is minimized with respect to θ and p(τ) in alternating fashion, with the dual variables λ_t updated to enforce constraint satisfaction. Optimizing L_GPS with respect to p(τ) corresponds to trajectory optimization, which in our case involves dual gradient descent on L_traj in Section 3.1, and optimizing with respect to θ corresponds to supervised policy optimization to minimize the weighted sum of KL-divergences. The constrained GPS method also uses dual gradient descent to update the dual variables, but we found that in practice, it is unnecessary (and, in the unknown model setting, extremely inefficient) to optimize L_GPS with respect to p(τ) and θ to convergence prior to each dual variable update.
Instead, we increment the dual variables after each iteration with a multiple α of the KL-divergence (α = 10 works well), which corresponds to a penalty method. Note that the dual gradient descent on L_traj during trajectory optimization is unrelated to the policy constraints, and is treated as an inner loop black-box optimizer by GPS.

Pseudocode for our modified constrained GPS method is provided in Algorithm 1. The policy KL-divergence terms in the objective also necessitate a modified dynamic programming method, which can be found in prior work [14], but the step size constraints are still enforced as described in the preceding section, by modifying the cost. The same samples that are used to fit the dynamics are also used to train the policy, with the policy trained to minimize $\lambda_t D_{KL}(\pi_\theta(u_t|x_{t,i}) \| p(u_t|x_{t,i}))$ at each sampled state $x_{t,i}$. Further details about this algorithm can be found in the supplementary appendix.

Although this method optimizes the expected cost of the policy, due to the alternating optimization, its entropy tends to remain high, since both the policy and trajectory must decrease their entropy together to satisfy the constraint, which requires many alternating steps. To speed up this process, we found it useful to regularize the policy by penalizing its entropy directly, which accelerates convergence and produces more deterministic policies.

[Figure 1 image panels, labeled by iteration: itr 1, 2, 4; itr 1, 5, 10; itr 1, 20, 40.]

Figure 1: Results for learning linear-Gaussian controllers for 2D and 3D insertion, octopus arm, and swimming. Our approach uses fewer samples and finds better solutions than prior methods, and the GMM further reduces the required sample count.
Images in the lower-right show the last time step\nfor each system at several iterations of our method, with red lines indicating end effector trajectories.\n\n5 Experimental Evaluation\n\nWe evaluated both the trajectory optimization method and general policy search on simulated robotic\nmanipulation and locomotion tasks. The state consisted of joint angles and velocities, and the actions\ncorresponded to joint torques. The parameterized policies were neural networks with one hidden\nlayer and a soft recti\ufb01er nonlinearity of the form a = log(1 + exp(z)), with learned diagonal\nGaussian noise added to the outputs to produce a stochastic policy. This policy class was chosen for\nits expressiveness, to allow the policy to learn a wide range of strategies. However, due to its high\ndimensionality and nonlinearity, it also presents a serious challenge for policy search methods.\nThe tasks are 2D and 3D peg insertion, octopus arm control, and planar swimming and walking.\nThe insertion tasks require \ufb01tting a peg into a narrow slot, a task that comes up, for example, when\ninserting a key into a keyhole, or assembly with screws or nails. The dif\ufb01culty stems from the need\nto align the peg with the slot and the complex contacts between the peg and the walls, which result\nin discontinuous dynamics. Control in the presence of contacts is known to be challenging, and\nthis experiment is important for ascertaining how well our method can handle such discontinuities.\nOctopus arm control involves moving the tip of a \ufb02exible arm to a goal position [6]. The challenge in\nthis task stems from its high dimensionality: the arm has 25 degrees of freedom, corresponding to 50\nstate dimensions. The swimming task requires controlling a three-link snake, and the walking task\nrequires a seven-link biped to maintain a target velocity. The challenge in these tasks comes from\nunderactuation. 
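The policy class described above is easy to state in code. The sketch below is our own numpy illustration (weights and layer sizes are placeholders), using a numerically stable form of the soft rectifier a = log(1 + exp(z)):

```python
import numpy as np

def softplus(z):
    """Numerically stable soft rectifier a = log(1 + exp(z))."""
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def policy_mean(x, W1, b1, W2, b2):
    """One-hidden-layer network; a diagonal-Gaussian policy would add
    learned per-dimension noise to this mean."""
    return W2 @ softplus(W1 @ x + b1) + b2

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # 4 inputs, 8 hidden units
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)   # 2 action dimensions
u = policy_mean(rng.normal(size=4), W1, b1, W2, b2)
```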
Details of the simulation and cost for each task are in the supplementary appendix.\n\n5.1 Trajectory Optimization\n\nFigure 1 compares our method with prior work on learning linear-Gaussian controllers for peg in-\nsertion, octopus arm, and swimming (walking is discussed in the next section). The horizontal axis\nshows the total number of samples, and the vertical axis shows the minimum distance between the\nend of the peg and the bottom of the slot, the distance to the target for the octopus arm, or the total\ndistance travelled by the swimmer. Since the peg is 0.5 units long, distances above this amount\ncorrespond to controllers that cannot perform an insertion.\nWe compare to REPS [17], reward-weighted regression (RWR) [18, 11], the cross-entropy method\n(CEM) [21], and PILCO [5]. We also use iLQG [15] with a known model as a baseline, shown\nas a black horizontal line. REPS is a model-free method that, like our approach, enforces a KL-\ndivergence constraint between the new and old policy. We compare to a variant of REPS that also \ufb01ts\nlinear dynamics to generate 500 pseudo-samples [16], which we label \u201cREPS (20 + 500).\u201d RWR is\nan EM algorithm that \ufb01ts the policy to previous samples weighted by the exponential of their reward,\nand CEM \ufb01ts the policy to the best samples in each batch. With Gaussian trajectories, CEM and\nRWR only differ in the weights. 
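The difference between the two weighting schemes can be made explicit. The sketch below is ours (the temperature and elite count are illustrative assumptions): RWR weights samples by the exponentiated negative cost, while CEM places uniform weight on the best samples only.

```python
import numpy as np

def rwr_weights(costs, temperature=1.0):
    """RWR: exponentiated (shifted) negative cost, normalized to sum to 1."""
    w = np.exp(-(costs - costs.min()) / temperature)
    return w / w.sum()

def cem_weights(costs, n_elite):
    """CEM: uniform weight on the n_elite lowest-cost samples, zero elsewhere."""
    w = np.zeros_like(costs)
    w[np.argsort(costs)[:n_elite]] = 1.0 / n_elite
    return w

costs = np.array([3.0, 1.0, 2.0, 5.0])
w_rwr = rwr_weights(costs)          # smooth weights, best sample weighted most
w_cem = cem_weights(costs, 2)       # 0.5 on each of the two best samples
```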
These methods represent a class of RL algorithms that fit the policy to weighted samples, including PoWER and PI2 [11, 24, 22]. PILCO is a model-based method that uses a Gaussian process to learn a global dynamics model that is used to optimize the policy. REPS and PILCO require solving large nonlinear optimizations at each iteration, while our method does not. Our method used 5 rollouts with the GMM, and 20 without. Due to its computational cost, PILCO was provided with 5 rollouts per iteration, while other prior methods used 20 and 100.

[Figure 1 plots target distance (2D insertion, 3D insertion, octopus arm) or distance travelled (swimming) against sample count; legend: iLQG (true model), REPS (100 samp), REPS (20 + 500 samp), CEM (100 samp), CEM (20 samp), RWR (100 samp), RWR (20 samp), PILCO (5 samp), ours (20 samp), ours (with GMM, 5 samp).]

[Figure 2 image panels: slot positions #1 through #4, training and test.]

Figure 2: Comparison on neural network policies. For insertion, the policy was trained to search for an unknown slot position on four slot positions (shown above), and generalization to new positions is graphed with dashed lines. Note how the end effector (in red) follows the surface to find the slot, and how the swimming gait is smoother due to the stationary policy (also see supplementary video).

Our method learned much more effective controllers with fewer samples, especially when using the GMM. On 3D insertion, it outperformed the iLQG baseline, which used a known model. Contact discontinuities cause problems for derivative-based methods like iLQG, as well as methods like PILCO that learn a smooth global dynamics model. We use a time-varying local model, which preserves more detail, and fitting the model to samples has a smoothing effect that mitigates discontinuity issues.
Prior policy search methods could servo to the hole, but were unable to insert the peg.\nOn the octopus arm, our method succeeded despite the high dimensionality of the state and action\nspaces.1 Prior work used simpli\ufb01ed \u201cmacro-actions\u201d to solve this task, while our method directly\ncontrolled each degree of freedom [6]. Our method also successfully learned a swimming gait, while\nprior model-free methods could not initiate forward motion.2 PILCO also learned an effective gait\ndue to the smooth dynamics of this task, but its GP-based optimization required orders of magnitude\nmore computation time than our method, taking about 50 minutes per iteration.\nThese results suggest that our method combines the sample ef\ufb01ciency of model-based methods with\nthe versatility of model-free techniques. However, this method is designed speci\ufb01cally for linear-\nGaussian controllers. In the next section, we present results for learning more general policies with\nour method, using the linear-Gaussian controllers within the framework of guided policy search.\n\n5.2 Neural Network Policy Learning with Guided Policy Search\n\nBy using our method with guided policy search, we can learn arbitrary parameterized policies. Fig-\nure 2 shows results for training neural network policies for each task, with comparisons to prior\nmethods that optimize the policy parameters directly.3 On swimming, our method achieved similar\nperformance to the linear-Gaussian case, but since the neural network policy was stationary, the re-\nsulting gait was much smoother. Previous methods could only solve this task with 100 samples per\niteration, with RWR eventually obtaining a distance of 0.5m after 4000 samples, and CEM reaching\n2.1m after 3000. 
Our method was able to reach such distances with many fewer samples.

1 The high dimensionality of the octopus arm made it difficult to run PILCO, though in principle, such methods should perform well on this task given the arm's smooth dynamics.
2 Even iLQG requires many iterations to initiate any forward motion, but then makes rapid progress. This suggests that prior methods were simply unable to get over the initial threshold of initiating forward movement.
3 PILCO cannot optimize neural network controllers, and we could not obtain reasonable results with REPS. Prior applications of REPS generally focus on simpler, lower-dimensional policy classes [17, 16].

[Figure 2 plots distance travelled (walking, swimming) or target distance (2D and 3D insertion) against sample count for the neural network policies; legend: CEM (100 samp), CEM (20 samp), RWR (100 samp), RWR (20 samp), ours (20 samp), ours (with GMM, 5 samp).]

Generating walking from scratch is extremely challenging even with a known model. We therefore initialize the gait from demonstration, as in prior work [12]. The supplementary website also shows some gaits generated from scratch. To generate the initial samples, we assume that the demonstration can be stabilized with a linear feedback controller. Building such controllers around examples has been addressed in prior work [3]. The RWR and CEM policies were initialized with samples from this controller to provide a fair comparison. The walker used 5 samples per iteration with the GMM, and 40 without it.
The graph shows the average distance travelled on rollouts that did not fall, and shows that only our method was able to learn walking policies that succeeded consistently.

On the insertion tasks, the neural network was trained to insert the peg without precise knowledge of the position of the hole, making this a partially observed problem. The holes were placed in a region of radius 0.2 units in 2D and 0.1 units in 3D. The policies were trained on four different hole positions, and then tested on four new hole positions to evaluate generalization. The generalization results are shown with dashed lines in Figure 2. The position of the hole was not provided to the neural network, and the policies therefore had to find the hole by "feeling" for it, with only joint angles and velocities as input. Only our method could acquire a successful strategy to locate both the training and test holes, although RWR was eventually able to insert the peg into one of the four holes in 2D. This task illustrates one of the advantages of learning expressive neural network policies, since no single trajectory-based policy can represent such a search strategy. Videos of the learned policies can be viewed at http://rll.berkeley.edu/nips2014gps/.

6 Discussion

We presented an algorithm that can optimize linear-Gaussian controllers under unknown dynamics by iteratively fitting local linear dynamics models, with a background dynamics distribution acting as a prior to reduce the sample complexity. We showed that this approach can be used to train arbitrary parameterized policies within the framework of guided policy search, where the parameterized policy is optimized to match the linear-Gaussian controllers.
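As a rough illustration of the dynamics-fitting step summarized above, the sketch below fits time-varying linear dynamics x_{t+1} ≈ A_t x_t + B_t u_t + c_t with an independent least-squares regression at each timestep. The function name and array layout are our own choices, and the background GMM prior that reduces sample complexity is omitted for brevity; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def fit_time_varying_linear_dynamics(X, U):
    """Fit x_{t+1} ~ A_t x_t + B_t u_t + c_t by least squares at each
    timestep, pooling over N sampled trajectories.

    X: states, shape (N, T, dx); U: actions, shape (N, T-1, du).
    Returns a list of (A_t, B_t, c_t) tuples for t = 0, ..., T-2.
    """
    N, T, dx = X.shape
    du = U.shape[2]
    dynamics = []
    for t in range(T - 1):
        # Stack regressors [x_t, u_t, 1] for all N samples at this timestep.
        Z = np.hstack([X[:, t], U[:, t], np.ones((N, 1))])  # (N, dx + du + 1)
        Y = X[:, t + 1]                                      # (N, dx)
        # Plain least squares; conditioning on a background distribution
        # (e.g. a GMM over (x, u, x') tuples) would regularize this fit
        # when N is small.
        W, _, _, _ = np.linalg.lstsq(Z, Y, rcond=None)       # (dx+du+1, dx)
        A_t, B_t, c_t = W[:dx].T, W[dx:dx + du].T, W[-1]
        dynamics.append((A_t, B_t, c_t))
    return dynamics
```

On noiseless data generated from fixed linear dynamics, this recovers A_t, B_t, and c_t exactly once the number of samples exceeds dx + du + 1; with noisy samples from a real system, each fit is only valid near the current trajectory distribution, which is why the update must be kept close to the previous one.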
In our evaluation, we show that this method can train complex neural network policies that act intelligently in partially observed environments, even for tasks that cannot be solved with direct model-free policy search.

By using local linear models, our method is able to outperform model-free policy search methods. On the other hand, the learned models are highly local and time-varying, in contrast to model-based methods that rely on learning an effective global model [4]. This allows our method to handle even the complicated and discontinuous dynamics encountered in the peg insertion task, which we show present a challenge for model-based methods that use smooth dynamics models [5]. Our approach occupies a middle ground between model-based and model-free techniques, allowing it to learn rapidly, while still succeeding in domains where the true model is difficult to learn.

Our use of a KL-divergence constraint during trajectory optimization parallels several prior model-free methods [1, 19, 17, 20, 16]. Trajectory-centric policy learning has also been explored in detail in robotics, with a focus on dynamic movement primitives (DMPs) [8, 24]. Time-varying linear-Gaussian controllers are in general more expressive, though they incorporate less prior information: DMPs constrain the final state to a goal state, and only encode target states, relying on an existing controller to track those states with suitable controls.

The improved performance of our method is due in part to the use of stronger assumptions about the task, compared to general policy search methods. For instance, we assume that time-varying linear-Gaussians are a reasonable local approximation for the dynamics. While this assumption is sensible for physical systems, it would require additional work to extend to hybrid discrete-continuous tasks.

Our method also suggests some promising future directions.
Since the parameterized policy is trained directly on samples from the real world, it can incorporate sensory information that is difficult to simulate but useful in partially observed domains, such as force sensors on a robotic gripper, or even camera images. The linear-Gaussian controllers, meanwhile, are trained directly on the true state under known, controlled conditions, as in our peg insertion experiments. This could provide for superior generalization on partially observed tasks that are otherwise extremely challenging to learn.

Acknowledgments This research was partly funded by a DARPA Young Faculty Award #D13AP0046 and the United States Army Research Laboratory (ARL) under the Micro-Autonomous Science and Technology Collaborative Technology (MAST) Alliance.

References
[1] J. A. Bagnell and J. Schneider. Covariant policy search. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, 2004.
[3] A. Coates, P. Abbeel, and A. Ng. Learning for control from multiple demonstrations. In International Conference on Machine Learning (ICML), 2008.
[4] M. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.
[5] M. Deisenroth and C. Rasmussen. PILCO: a model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), 2011.
[6] Y. Engel, P. Szabó, and D. Volkinshtein. Learning to control an octopus arm with Gaussian process temporal difference methods. In Advances in Neural Information Processing Systems (NIPS), 2005.
[7] R. Fletcher. Practical Methods of Optimization. Wiley-Interscience, New York, NY, 1987.
[8] A. Ijspeert, J. Nakanishi, and S. Schaal. Learning attractor landscapes for learning motor primitives.
In Advances in Neural Information Processing Systems (NIPS), 2003.
[9] S. M. Khansari-Zadeh and A. Billard. BM: An iterative algorithm to learn stable non-linear dynamical systems with Gaussian mixture models. In International Conference on Robotics and Automation (ICRA), 2010.
[10] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. International Journal of Robotic Research, 32(11):1238–1274, 2013.
[11] J. Kober and J. Peters. Learning motor primitives for robotics. In International Conference on Robotics and Automation (ICRA), 2009.
[12] S. Levine and V. Koltun. Guided policy search. In International Conference on Machine Learning (ICML), 2013.
[13] S. Levine and V. Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems (NIPS), 2013.
[14] S. Levine and V. Koltun. Learning complex neural network policies with trajectory optimization. In International Conference on Machine Learning (ICML), 2014.
[15] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pages 222–229, 2004.
[16] R. Lioutikov, A. Paraschos, G. Neumann, and J. Peters. Sample-based information-theoretic stochastic optimal control. In International Conference on Robotics and Automation (ICRA), 2014.
[17] J. Peters, K. Mülling, and Y. Altün. Relative entropy policy search. In AAAI Conference on Artificial Intelligence, 2010.
[18] J. Peters and S. Schaal. Applying the episodic natural actor-critic architecture to motor primitive learning. In European Symposium on Artificial Neural Networks (ESANN), 2007.
[19] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
[20] K. Rawlik, M. Toussaint, and S. Vijayakumar.
On stochastic optimal control and reinforcement learning by approximate inference. In Robotics: Science and Systems, 2012.
[21] R. Rubinstein and D. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Springer, 2004.
[22] F. Stulp and O. Sigaud. Path integral policy improvement with covariance matrix adaptation. In International Conference on Machine Learning (ICML), 2012.
[23] Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[24] E. Theodorou, J. Buchli, and S. Schaal. Reinforcement learning of motor skills in high dimensions. In International Conference on Robotics and Automation (ICRA), 2010.
[25] M. Toussaint. Robot trajectory optimization using approximate inference. In International Conference on Machine Learning (ICML), 2009.