{"title": "Guided Policy Search via Approximate Mirror Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 4008, "page_last": 4016, "abstract": "Guided policy search algorithms can be used to optimize complex nonlinear policies, such as deep neural networks, without directly computing policy gradients in the high-dimensional parameter space. Instead, these methods use supervised learning to train the policy to mimic a \u201cteacher\u201d algorithm, such as a trajectory optimizer or a trajectory-centric reinforcement learning method. Guided policy search methods provide asymptotic local convergence guarantees by construction, but it is not clear how much the policy improves within a small, finite number of iterations. We show that guided policy search algorithms can be interpreted as an approximate variant of mirror descent, where the projection onto the constraint manifold is not exact. We derive a new guided policy search algorithm that is simpler and provides appealing improvement and convergence guarantees in simplified convex and linear settings, and show that in the more general nonlinear setting, the error in the projection step can be bounded. We provide empirical results on several simulated robotic manipulation tasks that show that our method is stable and achieves similar or better performance when compared to prior guided policy search methods, with a simpler formulation and fewer hyperparameters.", "full_text": "Guided Policy Search via Approximate Mirror\n\nDescent\n\nWilliam Montgomery\n\nSergey Levine\n\nDept. of Computer Science and Engineering\n\nDept. of Computer Science and Engineering\n\nUniversity of Washington\n\nwmonty@cs.washington.edu\n\nUniversity of Washington\n\nsvlevine@cs.washington.edu\n\nAbstract\n\nGuided policy search algorithms can be used to optimize complex nonlinear poli-\ncies, such as deep neural networks, without directly computing policy gradients\nin the high-dimensional parameter space. 
Instead, these methods use supervised\nlearning to train the policy to mimic a \u201cteacher\u201d algorithm, such as a trajectory\noptimizer or a trajectory-centric reinforcement learning method. Guided policy\nsearch methods provide asymptotic local convergence guarantees by construction,\nbut it is not clear how much the policy improves within a small, \ufb01nite number of\niterations. We show that guided policy search algorithms can be interpreted as an\napproximate variant of mirror descent, where the projection onto the constraint\nmanifold is not exact. We derive a new guided policy search algorithm that is sim-\npler and provides appealing improvement and convergence guarantees in simpli\ufb01ed\nconvex and linear settings, and show that in the more general nonlinear setting, the\nerror in the projection step can be bounded. We provide empirical results on several\nsimulated robotic navigation and manipulation tasks that show that our method is\nstable and achieves similar or better performance when compared to prior guided\npolicy search methods, with a simpler formulation and fewer hyperparameters.\n\n1\n\nIntroduction\n\nPolicy search algorithms based on supervised learning from a computational or human \u201cteacher\u201d have\ngained prominence in recent years due to their ability to optimize complex policies for autonomous\n\ufb02ight [16], video game playing [15, 4], and bipedal locomotion [11]. Among these methods, guided\npolicy search algorithms [6] are particularly appealing due to their ability to adapt the teacher to\nproduce data that is best suited for training the \ufb01nal policy with supervised learning. Such algorithms\nhave been used to train complex deep neural network policies for vision-based robotic manipulation\n[6], as well as a variety of other tasks [19, 11]. 
However, convergence results for these methods\ntypically follow by construction from their formulation as a constrained optimization, where the\nteacher is gradually constrained to match the learned policy, and guarantees on the performance of\nthe \ufb01nal policy only hold at convergence if the constraint is enforced exactly. This is problematic in\npractical applications, where such algorithms are typically executed for a small number of iterations.\nIn this paper, we show that guided policy search algorithms can be interpreted as approximate variants\nof mirror descent under constraints imposed by the policy parameterization, with supervised learning\ncorresponding to a projection onto the constraint manifold. Based on this interpretation, we can\nderive a new, simpli\ufb01ed variant of guided policy search, which corresponds exactly to mirror descent\nunder linear dynamics and convex policy spaces. When these convexity and linearity assumptions do\nnot hold, we can show that the projection step is approximate, up to a bound that depends on the step\nsize of the algorithm, which suggests that for a small enough step size, we can achieve continuous\nimprovement. The form of this bound provides us with intuition about how to adjust the step size in\npractice, so as to obtain a simple algorithm with a small number of hyperparameters.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fAlgorithm 1 Generic guided policy search method\n1: for iteration k \u2208 {1, . . . 
, K} do
2:   C-step: improve each pi(ut|xt) based on surrogate cost ℓ̃i(xt, ut), return samples Di
3:   S-step: train πθ(ut|xt) with supervised learning on the dataset D = ∪i Di
4:   Modify ℓ̃i(xt, ut) to enforce agreement between πθ(ut|xt) and each pi(ut|xt)
5: end for

The main contribution of this paper is a simple new guided policy search algorithm that can train complex, high-dimensional policies by alternating between trajectory-centric reinforcement learning and supervised learning, as well as a connection between guided policy search methods and mirror descent. We also extend previous work on bounding policy cost in terms of KL divergence [15, 17] to derive a bound on the cost of the policy at each iteration, which provides guidance on how to adjust the step size of the method. We provide empirical results on several simulated robotic navigation and manipulation tasks that show that our method is stable and achieves similar or better performance when compared to prior guided policy search methods, with a simpler formulation and fewer hyperparameters.

2 Guided Policy Search Algorithms

We first review guided policy search methods and background. Policy search algorithms aim to optimize a parameterized policy πθ(ut|xt) over actions ut conditioned on the state xt. Given stochastic dynamics p(xt+1|xt, ut) and cost ℓ(xt, ut), the goal is to minimize the expected cost under the policy's trajectory distribution, given by

J(θ) = ∑_{t=1}^T E_{πθ(xt,ut)}[ℓ(xt, ut)],

where we overload notation to use πθ(xt, ut) to denote the marginals of πθ(τ) = p(x1) ∏_{t=1}^T p(xt+1|xt, ut) πθ(ut|xt), where τ = {x1, u1, . . . , xT, uT} denotes a trajectory. A standard reinforcement learning (RL) approach to policy search is to compute the gradient ∇θJ(θ) and use it to improve J(θ) [18, 14].
The gradient is typically estimated using samples obtained from the real physical system being controlled, and recent work has shown that such methods can be applied to very complex, high-dimensional policies such as deep neural networks [17, 10]. However, for complex, high-dimensional policies, such methods tend to be inefficient, and practical real-world applications of such model-free policy search techniques are typically limited to policies with about one hundred parameters [3].
Instead of directly optimizing J(θ), guided policy search algorithms split the optimization into a "control phase" (which we'll call the C-step) that finds multiple simple local policies pi(ut|xt) that can solve the task from different initial states x^i_1 ∼ p(x1), and a "supervised phase" (S-step) that optimizes the global policy πθ(ut|xt) to match all of these local policies using standard supervised learning. In fact, a variational formulation of guided policy search [7] corresponds to the EM algorithm, where the C-step is actually the E-step, and the S-step is the M-step. The benefit of this approach is that the local policies pi(ut|xt) can be optimized separately using domain-specific local methods. Trajectory optimization might be used when the dynamics are known [19, 11], while local RL methods might be used with unknown dynamics [5, 6], which still requires samples from the real system, though substantially fewer than the direct approach, due to the simplicity of the local policies. This sample efficiency is the main advantage of guided policy search, which can train policies with nearly a hundred thousand parameters for vision-based control using under 200 episodes [6], in contrast to direct deep RL methods that might require orders of magnitude more experience [17, 10]. A generic guided policy search method is shown in Algorithm 1.
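The alternation in Algorithm 1 can be illustrated with a minimal toy instantiation (hypothetical stand-ins for the C-step and S-step, not the paper's LQR-based machinery): each local policy is a scalar action for one initial state, the global policy is linear, and the surrogate cost adds a quadratic agreement penalty that is tightened every iteration.

```python
import numpy as np

def c_step(x, theta, nu):
    # Local policy update: minimize (a - a_opt(x))^2 + nu * (a - theta * x)^2
    # in closed form, where the task optimum is a_opt(x) = -x.
    return (-x + nu * theta * x) / (1.0 + nu)

def s_step(xs, actions):
    # Supervised learning: least-squares fit of the linear global policy u = theta * x.
    return float(np.dot(xs, actions) / np.dot(xs, xs))

xs = np.array([1.0, 2.0, 3.0])   # initial states, one local policy each
theta, nu = 0.0, 0.1
for k in range(20):
    actions = np.array([c_step(x, theta, nu) for x in xs])  # C-step (line 2)
    theta = s_step(xs, actions)                             # S-step (line 3)
    nu *= 1.5         # tighten the local/global agreement penalty (line 4)
```

As the agreement penalty grows, the global parameter theta converges to the optimal linear policy (theta = -1 in this toy problem), mirroring how the constrained formulation drives local and global policies together.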
The C-step invokes a local policy optimizer (trajectory optimization or local RL) for each pi(ut|xt) on line 2, and the S-step uses supervised learning to optimize the global policy πθ(ut|xt) on line 3 using samples from each pi(ut|xt), which are generated during the C-step. On line 4, the surrogate cost ℓ̃i(xt, ut) for each pi(ut|xt) is adjusted to ensure convergence. This step is crucial, because supervised learning does not in general guarantee that πθ(ut|xt) will achieve similar long-horizon performance to pi(ut|xt) [15]. The local policies might not even be reproducible by a single global policy in general. To address this issue, most guided policy search methods have some mechanism to force the local policies to agree with the global policy, typically by framing the entire algorithm as a constrained optimization that seeks at convergence to enforce equality between πθ(ut|xt) and each pi(ut|xt). The form of the overall optimization problem resembles dual decomposition, and usually looks something like this:

min_{θ,p1,...,pN} ∑_{i=1}^N ∑_{t=1}^T E_{pi(xt,ut)}[ℓ(xt, ut)] such that pi(ut|xt) = πθ(ut|xt) ∀ xt, ut, t, i.   (1)

Since x^i_1 ∼ p(x1), we have J(θ) ≈ (1/N) ∑_{i=1}^N ∑_{t=1}^T E_{pi(xt,ut)}[ℓ(xt, ut)] when the constraints are enforced exactly. The particular form of the constraint varies depending on the method: prior works have used dual gradient descent [8], penalty methods [11], ADMM [12], and Bregman ADMM [6]. We omit the derivation of these prior variants due to space constraints.

2.1 Efficiently Optimizing Local Policies

A common and simple choice for the local policies pi(ut|xt) is to use time-varying linear-Gaussian controllers of the form pi(ut|xt) = N(Kt xt + kt, Ct), though other options are also possible [12, 11, 19].
Linear-Gaussian controllers represent individual trajectories with linear stabilization and Gaussian noise, and are convenient in domains where each local policy can be trained from a different (but consistent) initial state x^i_1 ∼ p(x1). This represents an additional assumption beyond standard RL, but allows for an extremely efficient and convenient local model-based RL algorithm based on iterative LQR [9]. The algorithm proceeds by generating N samples on the real physical system from each local policy pi(ut|xt) during the C-step, using these samples to fit local linear-Gaussian dynamics for each local policy of the form pi(xt+1|xt, ut) = N(fxt xt + fut ut + fct, Ft) using linear regression, and then using these fitted dynamics to improve the linear-Gaussian controller via a modified LQR algorithm [5]. This modified LQR method solves the following optimization problem:

min_{Kt,kt,Ct} ∑_{t=1}^T E_{pi(xt,ut)}[ℓ̃i(xt, ut)] such that DKL(pi(τ) ‖ p̄i(τ)) ≤ ε,   (2)

where we again use pi(τ) to denote the trajectory distribution induced by pi(ut|xt) and the fitted dynamics pi(xt+1|xt, ut). Here, p̄i(ut|xt) denotes the previous local policy, and the constraint ensures that the change in the local policy is bounded, as proposed also in prior works [1, 14, 13]. This is particularly important when using linearized dynamics fitted to local samples, since these dynamics are not valid outside of a small region around the current controller.
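The dynamics-fitting step can be sketched as an ordinary least-squares regression at a single time step (synthetic data stands in for real system samples here, and the Gaussian mixture model prior used in the full method is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
# True (unknown) dynamics at one time step: x' = A x + B u + c + noise
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
c = np.array([0.0, -0.05])

N = 200
X = rng.normal(size=(N, 2))   # sampled states
U = rng.normal(size=(N, 1))   # sampled actions
Xnext = X @ A.T + U @ B.T + c + 1e-3 * rng.normal(size=(N, 2))

# Fit x' ~ f_x x + f_u u + f_c by least squares on the regressors [x, u, 1]
Z = np.hstack([X, U, np.ones((N, 1))])
W, *_ = np.linalg.lstsq(Z, Xnext, rcond=None)
f_x, f_u, f_c = W[:2].T, W[2:3].T, W[3]
```

The fitted matrices recover A, B, and c up to the noise level; the residual covariance would give the Gaussian noise term Ft.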
In the case of linear-Gaussian dynamics and policies, the KL-divergence constraint DKL(pi(τ)‖p̄i(τ)) ≤ ε can be shown to simplify, as shown in prior work [5] and Appendix A:

DKL(pi(τ) ‖ p̄i(τ)) = ∑_{t=1}^T DKL(pi(ut|xt) ‖ p̄i(ut|xt)) = ∑_{t=1}^T −E_{pi(xt,ut)}[log p̄i(ut|xt)] − H(pi(ut|xt)),

and the resulting Lagrangian of the problem in Equation (2) can be optimized with respect to the primal variables using the standard LQR algorithm, which suggests a simple method for solving the problem in Equation (2) using dual gradient descent [5]. The surrogate objective ℓ̃i(xt, ut) = ℓ(xt, ut) + φi(θ) typically includes some term φi(θ) that encourages the local policy pi(ut|xt) to stay close to the global policy πθ(ut|xt), such as a KL-divergence of the form DKL(pi(ut|xt)‖πθ(ut|xt)).

2.2 Prior Convergence Results

Prior work on guided policy search typically shows convergence by construction, by framing the C-step and S-step as block coordinate ascent on the (augmented) Lagrangian of the problem in Equation (1), with the surrogate cost ℓ̃i(xt, ut) for the local policies corresponding to the (augmented) Lagrangian, and the overall algorithm being an instance of dual gradient descent [8], ADMM [12], or Bregman ADMM [6]. Since these methods enforce the constraint pi(ut|xt) = πθ(ut|xt) at convergence (up to linearization or sampling error, depending on the method), we know that (1/N) ∑_{i=1}^N ∑_{t=1}^T E_{pi(xt,ut)}[ℓ(xt, ut)] ≈ ∑_{t=1}^T E_{πθ(xt,ut)}[ℓ(xt, ut)] at convergence.1 However, prior work does not say anything about πθ(ut|xt) at intermediate iterations, and the constraints of policy search in the real world might often preclude running the method to full convergence.
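The per-step identity above, DKL(p‖p̄) = −E_p[log p̄] − H(p), is easy to verify numerically for one-dimensional Gaussian policies (an illustrative check only):

```python
import numpy as np

# Two 1-D Gaussian "policies": p = N(m1, s1^2), pbar = N(m2, s2^2)
m1, s1, m2, s2 = 0.3, 0.5, -0.1, 0.8

# Closed-form KL divergence between the two Gaussians
kl_closed = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Identity: KL(p || pbar) = -E_p[log pbar] - H(p)
cross_entropy = 0.5 * np.log(2 * np.pi * s2**2) + (s1**2 + (m1 - m2)**2) / (2 * s2**2)
entropy = 0.5 * np.log(2 * np.pi * np.e * s1**2)   # differential entropy of p
kl_identity = cross_entropy - entropy
```

Both expressions agree exactly; this is the decomposition that lets the KL constraint be expressed as an expected cost plus an entropy term, and hence be handled by the LQR backward pass.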
We propose a simplified variant of guided policy search, and present an analysis that sheds light on the performance of both the new algorithm and prior guided policy search methods.

1As mentioned previously, the initial state x^i_1 of each local policy pi(ut|xt) is assumed to be drawn from p(x1), hence the outer sum corresponds to Monte Carlo integration of the expectation under p(x1).

Algorithm 2 Mirror descent guided policy search (MDGPS): convex linear variant
1: for iteration k ∈ {1, . . . , K} do
2:   C-step: pi ← arg min_{pi} E_{pi(τ)}[∑_{t=1}^T ℓ(xt, ut)] such that DKL(pi(τ)‖πθ(τ)) ≤ ε
3:   S-step: πθ ← arg min_θ ∑_i DKL(pi(τ)‖πθ(τ)) (via supervised learning)
4: end for

3 Mirror Descent Guided Policy Search

In this section, we propose our new simplified guided policy search, which we term mirror descent guided policy search (MDGPS). This algorithm uses the constrained LQR optimization in Equation (2) to optimize each of the local policies, but instead of constraining each local policy pi(ut|xt) against the previous local policy p̄i(ut|xt), we instead constrain it directly against the global policy πθ(ut|xt), and simply set the surrogate cost to be the true cost, such that ℓ̃i(xt, ut) = ℓ(xt, ut). The method is summarized in Algorithm 2. In the case of linear dynamics and a quadratic cost (i.e. the LQR setting), and assuming that supervised learning can globally solve a convex optimization problem, we can show that this method corresponds to an instance of mirror descent [2] on the objective J(θ). In this formulation, the optimization is performed on the space of trajectory distributions, with a constraint that the policy must lie on the manifold of policies with the chosen parameterization.
Let ΠΘ be the set of all possible policies πθ for a given parameterization, where we overload notation to also let ΠΘ denote the set of trajectory distributions that are possible under the chosen parameterization. The return J(θ) can be optimized according to πθ ← arg min_{π∈ΠΘ} E_{π(τ)}[∑_{t=1}^T ℓ(xt, ut)]. Mirror descent solves this optimization by alternating between two steps at each iteration k:

p^k ← arg min_p E_{p(τ)}[∑_{t=1}^T ℓ(xt, ut)] s.t. D(p, π^k) ≤ ε,        π^{k+1} ← arg min_{π∈ΠΘ} D(p^k, π).

The first step finds a new distribution p^k that minimizes the cost and is close to the previous policy π^k in terms of the divergence D(p, π^k), while the second step projects this distribution onto the constraint set ΠΘ, with respect to the divergence D(p^k, π). In the linear-quadratic case with a convex supervised learning phase, this corresponds exactly to Algorithm 2: the C-step optimizes p^k, while the S-step is the projection. Monotonic improvement of the global policy πθ follows from the monotonic improvement of mirror descent [2]. In the case of linear-Gaussian dynamics and policies, the S-step, which minimizes KL-divergence between trajectory distributions, in fact only requires minimizing the KL-divergence between policies. Using the identity in Appendix A, we know that

DKL(pi(τ)‖πθ(τ)) = ∑_{t=1}^T E_{pi(xt)}[DKL(pi(ut|xt)‖πθ(ut|xt))].   (3)

3.1 Implementation for Nonlinear Global Policies and Unknown Dynamics

In practice, we aim to optimize complex policies for nonlinear systems with unknown dynamics. This requires a few practical considerations.
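Before turning to those details, the two-step mirror descent structure can be illustrated on a toy problem: a distribution over three discrete actions with a linear cost and KL divergence as D. In this toy, the policy class contains all distributions, so the projection step is the identity; in MDGPS the projection is instead the supervised S-step. This is a sketch of the alternation only, not the paper's trajectory-level algorithm:

```python
import numpy as np

costs = np.array([1.0, 0.3, 0.7])   # expected cost of each action
pi = np.ones(3) / 3.0               # initial "global policy"
eta = 0.5                           # plays the role of the step size epsilon
for k in range(50):
    # Step 1: reduce cost while staying KL-close to the current policy; the
    # KL-regularized minimizer is an exponentiated-gradient update.
    p = pi * np.exp(-eta * costs)
    p /= p.sum()
    # Step 2: project onto the policy class (trivial here, since the class
    # is unconstrained; in MDGPS this is the supervised S-step).
    pi = p
expected_cost = float(costs @ pi)
```

The iterates concentrate on the cheapest action while each update stays close to its predecessor in KL divergence, which is the mechanism behind the monotonic improvement guarantee.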
The C-step requires a local quadratic cost function, which can be obtained via Taylor expansion, as well as local linear-Gaussian dynamics p(xt+1|xt, ut) = N(fxt xt + fut ut + fct, Ft), which we can fit to samples as in prior work [5]. We also need a local time-varying linear-Gaussian approximation to the global policy πθ(ut|xt), denoted π̄θi(ut|xt). This can be obtained either by analytically differentiating the policy, or by using the same linear regression method that we use to estimate p(xt+1|xt, ut), which is the approach in our implementation. In both cases, we get a different global policy linearization around each local policy. Following prior work [5], we use a Gaussian mixture model prior for both the dynamics and global policy fit.
The S-step can be performed approximately in the nonlinear case by using the samples collected for dynamics fitting to also train the global policy. Following prior work [6], our S-step minimizes2

∑_{i,t} E_{pi(xt)}[DKL(πθ(ut|xt)‖pi(ut|xt))] ≈ (1/|Di|) ∑_{i,t,j} DKL(πθ(ut|xt,i,j)‖pi(ut|xt,i,j)),

2Note that we flip the KL-divergence inside the expectation, following [6]. We found that this produced better results. The intuition behind this is that, because log pi(ut|xt) is proportional to the Q-function of pi(ut|xt) (see Appendix B.1), DKL(πθ(ut|xt,i,j)‖pi(ut|xt,i,j)) minimizes the cost-to-go under pi(ut|xt) with respect to πθ(ut|xt), which provides for a more informative objective than the unweighted likelihood in Equation (3).

Algorithm 3 Mirror descent guided policy search (MDGPS): unknown nonlinear dynamics
1: for iteration k ∈ {1, . . .
, K} do
2:   Generate samples Di = {τi,j} by running either pi or πθi
3:   Fit linear-Gaussian dynamics pi(xt+1|xt, ut) using samples in Di
4:   Fit linearized global policy π̄θi(ut|xt) using samples in Di
5:   C-step: pi ← arg min_{pi} E_{pi(τ)}[∑_{t=1}^T ℓ(xt, ut)] such that DKL(pi(τ)‖π̄θi(τ)) ≤ ε
6:   S-step: πθ ← arg min_θ ∑_{t,i,j} DKL(πθ(ut|xt,i,j)‖pi(ut|xt,i,j)) (via supervised learning)
7:   Adjust ε (see Section 4.2)
8: end for

where xt,i,j is the jth sample from pi(xt) obtained by running pi(ut|xt) on the real system. For linear-Gaussian pi(ut|xt) and (nonlinear) conditionally Gaussian πθ(ut|xt) = N(μπ(xt), Σπ(xt)), where μπ and Σπ can be any function (such as a deep neural network), the KL-divergence DKL(πθ(ut|xt,i,j)‖pi(ut|xt,i,j)) can easily be evaluated and differentiated in closed form [6]. However, in the nonlinear setting, minimizing this objective no longer minimizes the KL-divergence between trajectory distributions DKL(πθ(τ)‖pi(τ)) exactly, which means that MDGPS does not correspond exactly to mirror descent: although the C-step can still be evaluated exactly, the S-step now corresponds to an approximate projection onto the constraint manifold. In the next section, we discuss how we can bound the error in this projection. A summary of the nonlinear MDGPS method is provided in Algorithm 3, and additional details are in Appendix B. The samples for linearizing the dynamics and policy can be obtained by running either the last local policy pi(ut|xt), or the last global policy πθ(ut|xt).
Both variants produce good results, and we compare them in Section 6.\n\n3.2 Analysis of Prior Guided Policy Search Methods as Approximate Mirror Descent\n\nThe main distinction between the proposed method and prior guided policy search methods is that\nthe constraint DKL(pi(\u03c4 )(cid:107)\u00af\u03c0\u03b8i(\u03c4 )) \u2264 \u0001 is enforced on the local policies at each iteration, while in\nprior methods, this constraint is iteratively enforced via a dual descent procedure over multiple\niterations. This means that the prior methods perform approximate mirror descent with step sizes\nthat are adapted (by adjusting the Lagrange multipliers) but not constrained exactly. In our empirical\nevaluation, we show that our approach is somewhat more stable, though sometimes slower than these\nprior methods. This empirical observation agrees with our intuition: prior methods can sometimes be\nfaster, because they do not exactly constrain the step size, but our method is simpler, requires less\ntuning, and always takes bounded steps on the global policy in trajectory space.\n\n4 Analysis in the Nonlinear Case\n\nAlthough the S-step under nonlinear dynamics is not an optimal projection onto the constraint man-\nifold, we can bound the additional cost incurred by this projection in terms of the KL-divergence\nbetween pi(ut|xt) and \u03c0\u03b8(ut|xt). This analysis also reveals why prior guided policy search algo-\nrithms, which only have asymptotic convergence guarantees, still attain good performance in practice\neven after a small number of iterations. We will drop the subscript i from pi(ut|xt) in this section\nfor conciseness, though the same analysis can be repeated for multiple local policies pi(ut|xt).\n\n4.1 Bounding the Global Policy Cost\n\nThe analysis in this section is based on the following lemma, which we prove in Appendix C.1,\nbuilding off of earlier results by Ross et al. [15] and Schulman et al. 
[17]:

Lemma 4.1 Let εt = max_{xt} DKL(p(ut|xt)‖πθ(ut|xt)). Then DTV(p(xt)‖πθ(xt)) ≤ 2 ∑_{t=1}^T √(2εt).

This means that if we can bound the KL-divergence between the policies, then the total variation divergence between their state marginals (given by DTV(p(xt)‖πθ(xt)) = ½‖p(xt) − πθ(xt)‖1) will also be bounded. This bound allows us in turn to relate the total expected costs of the two policies to each other according to the following lemma, which we prove in Appendix C.2:

Lemma 4.2 If DTV(p(xt)‖πθ(xt)) ≤ 2 ∑_{t=1}^T √(2εt), then we can bound the total cost of πθ as

∑_{t=1}^T E_{πθ(xt,ut)}[ℓ(xt, ut)] ≤ ∑_{t=1}^T [ E_{p(xt,ut)}[ℓ(xt, ut)] + √(2εt) max_{xt,ut} ℓ(xt, ut) + 2√(2εt) Qmax,t ],

where Qmax,t = ∑_{t'=t}^T max_{xt',ut'} ℓ(xt', ut'), the maximum total cost from time t to T.

This bound on the cost of πθ(ut|xt) tells us that if we update p(ut|xt) so as to decrease its total cost or decrease its KL-divergence against πθ(ut|xt), we will eventually reduce the cost of πθ(ut|xt). For the MDGPS algorithm, this bound suggests that we can ensure improvement of the global policy within a small number of iterations by appropriately choosing the constraint ε during the C-step. Recall that the C-step constrains ∑_{t=1}^T εt ≤ ε, so if we choose ε to be small enough, we can close the gap between the local and global policies. Optimizing the bound directly turns out to produce very slow learning in practice, because the bound is very loose.
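Lemma 4.1 rests on a Pinsker-style relationship between KL divergence and total variation. The classical scalar inequality DTV(p, q) ≤ √(DKL(p‖q)/2), with DTV = ½‖p − q‖1, can be checked numerically on random discrete distributions (this illustrates only the classical inequality, not the lemma's trajectory-level constants):

```python
import numpy as np

rng = np.random.default_rng(0)
tvs, kls = [], []
for _ in range(100):
    # Random pair of distributions over 5 outcomes
    p = rng.random(5); p /= p.sum()
    q = rng.random(5); q /= q.sum()
    tvs.append(0.5 * np.abs(p - q).sum())          # DTV(p, q)
    kls.append(float(np.sum(p * np.log(p / q))))   # DKL(p || q)
tvs, kls = np.array(tvs), np.array(kls)
```

Every sampled pair satisfies the inequality, which is what lets a KL constraint on the policies control the gap between the state marginals, and hence between the expected costs.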
However, it tells us that we should decrease ε toward the end of the optimization process, or whenever we observe the global policy performing much worse than the local policies. We discuss how this idea can be put into action in the next section.

4.2 Step Size Selection

Setting the local policy step size ε is important for proper convergence of guided policy search methods. Since we are approximating the true unknown dynamics with time-varying linear dynamics, setting ε too large can produce unstable local policies, which cause the method to fail. However, setting ε too small will prevent the local policies from improving significantly between iterations, leading to slower learning rates.
In prior work [8], the step size ε in the local policy optimization is dynamically adjusted by considering the difference between the predicted change in the cost of the local policy p(ut|xt) under the fitted dynamics, and the actual cost obtained when sampling from that policy. The intuition is that, because the linearized dynamics are local, we incur a larger cost the further we deviate from the previous policy. We can adjust the step size by estimating the rate at which the additional cost is incurred and choosing the optimal tradeoff.
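A simple version of this predicted-versus-actual comparison can be written as a multiplicative update on ε. The constants and thresholds below are illustrative placeholders, not the exact rule derived in Appendix B.3:

```python
def adjust_step(epsilon, predicted_improvement, actual_improvement,
                shrink=0.5, grow=1.2, accept_ratio=0.5):
    """Shrink the KL step size when the fitted linear model over-promises,
    grow it when the actual improvement tracks the prediction."""
    if predicted_improvement <= 0.0:
        return epsilon * shrink  # model predicts no progress: be conservative
    ratio = actual_improvement / predicted_improvement
    return epsilon * (grow if ratio >= accept_ratio else shrink)

# The step size grows when the model is accurate, shrinks when it is not
eps_good = adjust_step(0.1, predicted_improvement=1.0, actual_improvement=0.9)
eps_bad = adjust_step(0.1, predicted_improvement=1.0, actual_improvement=0.1)
```

This captures the qualitative behavior described above: a large mismatch between the model's predicted cost change and the sampled cost change signals that the linearization is being trusted too far, so ε is reduced.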
In Appendix B.3 we describe the step size adjustment rule used for BADMM in prior work, and use it to derive two step size adjustment rules for MDGPS: "classic" and "global." The classic step size adjustment is a direct reinterpretation of the BADMM step rule for MDGPS, while the global step rule is a more conservative rule that takes the difference between the global and local policies into account.

5 Relation to Prior Work

While we've discussed the connections between MDGPS and prior guided policy search methods, in this section we'll also discuss the connections between our method and other policy search methods. One popular supervised policy learning method is DAGGER [15], which also trains the policy using supervised learning, but does not attempt to adapt the teacher to provide better training data. MDGPS removes the assumption in DAGGER that the supervised learning stage has bounded error against an arbitrary teacher policy: it does not need this assumption, since the teacher can be adapted to the limitations of the global policy learner. This is particularly important when the global policy has computational or observational limitations, such as when learning to use camera images for partially observed control tasks or, as shown in our evaluation, blind peg insertion.
When we sample from the global policy πθ(ut|xt), our method resembles policy gradient methods with KL-divergence constraints [14, 13, 17]. However, policy gradient methods update the policy πθ(ut|xt) at each iteration by linearizing with respect to the policy parameters, which often requires small steps for complex, nonlinear policies, such as neural networks. In contrast, we linearize in the space of time-varying linear dynamics, while the policy is optimized at each iteration with many steps of supervised learning (e.g. stochastic gradient descent).
This makes MDGPS much better suited for\nquickly and ef\ufb01ciently training highly nonlinear, high-dimensional policies.\n\n6\n\n\fFigure 1: Results for MDGPS variants and BADMM on each task. MDGPS is tested with local policy (\u201coff\npolicy\u201d) and global policy (\u201con policy\u201d) sampling (see Section 3.1), and both the \u201cclassic\u201d and \u201cglobal\u201d step\nsizes (see Section 4.2). The vertical axis for the obstacle task shows the average distance between the point mass\nand the target. The vertical axis for the peg tasks shows the average distance between the bottom of the peg\nand the hole. Distances above 0.1, which is the depth of the hole (shown as a dotted line) indicate failure. All\nexperiments are repeated ten times, with the average performance and standard deviation shown in the plots.\n\n6 Experimental Evaluation\n\nWe compare several variants of MDGPS and a prior guided policy search method based on Bregman\nADMM (BADMM) [6]. We evaluate all methods on one simulated robotic navigation task and\ntwo manipulation tasks. For MDGPS, during training we sample from either the local policies\n(\u201coff-policy\u201d sampling) or the global policy (\u201con-policy\u201d sampling), and we use both forms of the\nstep rule described in Section 4.2 (\u201cclassic\u201d and \u201cglobal\u201d). 3\nObstacle Navigation.\nIn this task, a 2D point mass (grey) must navigate\naround obstacles to reach a target (shown in green), using velocities and\npositions relative to the target. We use N = 5 initial states, with 5\nsamples per initial state per iteration. The target and obstacles are \ufb01xed,\nbut the starting position varies.\nPeg Insertion. This task, which is more complex, requires controlling\na 7 DoF 3D arm to insert a tight-\ufb01tting peg into a hole. The hole can\nbe in different positions, and the state consists of joint angles, velocities,\nand end-effector positions relative to the target. 
This task is substantially\nmore challenging physically. We use N = 9 different hole positions, with\n5 samples per initial state per iteration.\nBlind Peg Insertion. The last task is a blind variant of the peg insertion\ntask, where the target-relative end effector positions are provided to the\nlocal policies, but not to the global policy \u03c0\u03b8(ut|xt). This requires the\nglobal policy to search for the hole, since no input to the global policy\ncan distinguish between the different initial state xi\n1. This makes it much\nmore challenging to adapt the global and local policies to each other, and\nmakes it impossible for the global learner to succeed without adaptation\nof the local policies. We use N = 4 different hole positions, with 5\nsamples per initial state per iteration.\nThe global policy for each task consists of a fully connected neural network with two hidden layers\nwith 40 recti\ufb01ed linear units. The same settings are used for MDGPS and the prior BADMM-based\nmethod, except for the difference in surrogate costs, constraints, and step size adjustment methods\ndiscussed in the paper. Results are presented in Figure 1 and Table 1. On the easier point mass\nnavigation task all methods achieve similar performance, but the on-policy variants of MDGPS\noutperform the off-policy variants. This suggests that we can bene\ufb01t from directly sampling from\nthe global policy during training, which is not possible in the BADMM formulation. Although\nperformance is similar among all methods, the MDGPS methods are all substantially easier to\napply to these tasks, since they have very few free hyperparameters. An initial step size must be\nselected, but the adaptive step size adjustment rules make this choice less important. 
3 Guided policy search code, including BADMM and MDGPS methods, is available at https://www.github.com/cbfinn/gps.

           Itr.  Off/Classic     On/Classic      Off/Global      On/Global       BADMM
Peg        3     11.1% ± 9.9%    6.7% ± 7.4%     6.7% ± 7.4%     1.1% ± 3.3%     6.7% ± 7.4%
           6     51.1% ± 10.2%   62.2% ± 17.4%   64.4% ± 19.1%   68.9% ± 18.5%   63.3% ± 20.0%
           9     72.2% ± 14.3%   82.2% ± 11.3%   71.1% ± 24.0%   90.0% ± 10.5%   85.6% ± 8.7%
           12    74.4% ± 19.3%   83.3% ± 11.4%   84.4% ± 15.1%   90.0% ± 11.6%   87.8% ± 13.6%
Blind Peg  3     20.0% ± 31.2%   2.5% ± 7.5%     15.0% ± 30.0%   7.5% ± 16.0%    2.5% ± 7.5%
           6     65.0% ± 22.9%   62.5% ± 32.1%   70.0% ± 21.8%   72.5% ± 28.4%   70.0% ± 35.0%
           9     82.5% ± 25.1%   80.0% ± 24.5%   60.0% ± 32.0%   80.0% ± 35.0%   82.5% ± 19.5%
           12    82.5% ± 16.1%   95.0% ± 10.0%   85.0% ± 22.9%   85.0% ± 20.0%   85.0% ± 12.2%

Table 1: Success rates of each method on each peg insertion task. Success is defined as inserting the peg into the hole with a final distance of less than 0.06. Results are averaged over ten runs.

In contrast, the BADMM method requires choosing an initial weight on the augmented Lagrangian term, an adjustment schedule for this term, a step size on the dual variables, and a step size for local policies, all of which have a substantial impact on the final performance of the method (the reported results are for the best setting of these parameters, identified with a hyperparameter sweep).

On the peg insertion tasks, all variants of MDGPS consistently outperform BADMM, as the success rates in Table 1 show: the MDGPS policies succeed at actually inserting the peg into the hole more often and under more conditions.
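The Table 1 metric can be reproduced mechanically: a condition counts as a success when the final peg-to-hole distance falls below 0.06, the success rate is averaged over the conditions within one run, and the mean and standard deviation are taken across the ten repeated runs. A small sketch with fabricated distances (the helper names are ours, not from the released code):

```python
import numpy as np

def success_rate(final_distances, threshold=0.06):
    """Fraction of conditions whose final peg-to-hole distance beats the threshold."""
    return float(np.mean(np.asarray(final_distances) < threshold))

def summarize_runs(per_run_distances, threshold=0.06):
    """Mean and standard deviation of the success rate across repeated runs."""
    rates = [success_rate(d, threshold) for d in per_run_distances]
    return float(np.mean(rates)), float(np.std(rates))

# Fabricated final distances for 10 runs of 9 hole positions each:
runs = [[0.02, 0.05, 0.08, 0.01, 0.03, 0.12, 0.04, 0.02, 0.05] for _ in range(10)]
mean, std = summarize_runs(runs)  # 7 of 9 conditions succeed in every run
```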
This suggests that our method is better able to improve global policies, particularly in situations where informational or representational constraints make naïve imitation of the local policies insufficient to solve the task. On both tasks, we see faster learning from the on-policy variants, although this is less noticeable on the harder blind peg insertion task, where the best final policy is the off-policy variant with classic step size adjustment. Sampling from the global policies may be desirable in practice, since the global policies can directly use observations at runtime instead of requiring access to the state [6]. The global step size also tends to be more conservative than the classic step size, but produces more consistent and monotonic improvement.

7 Discussion and Future Work

We presented a new guided policy search method that corresponds to mirror descent under linearity and convexity assumptions, and showed how prior guided policy search methods can be seen as approximating mirror descent. We provide a bound on the return of the global policy in the nonlinear case, and argue that an appropriate step size can provide improvement of the global policy in this case as well. Our analysis provides the intuition needed to design an automated step size adjustment rule, and we illustrate empirically that our method achieves good results on complex simulated robotic manipulation tasks while requiring substantially less tuning and hyperparameter optimization than prior guided policy search methods. Manual tuning and hyperparameter searches are a major challenge across a range of deep reinforcement learning algorithms, and developing scalable policy search methods that are simple and reliable is vital to enable further progress.

As discussed in Section 5, MDGPS has interesting connections to other policy search methods.
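For reference, the mirror descent structure invoked here (an unconstrained improvement step followed by a projection back onto the constraint set) looks as follows in the textbook convex setting: entropic mirror descent on the probability simplex for a linear cost. This is purely illustrative and is not the MDGPS algorithm itself.

```python
import numpy as np

def mirror_descent_simplex(cost, steps=100, lr=0.5):
    """Entropic mirror descent minimizing <cost, p> over the probability simplex."""
    p = np.full(len(cost), 1.0 / len(cost))  # start at the uniform distribution
    for _ in range(steps):
        p = p * np.exp(-lr * cost)           # gradient step in the mirror (log) space
        p = p / p.sum()                      # "projection": renormalize onto the simplex
    return p

c = np.array([0.3, 0.1, 0.7])
p = mirror_descent_simplex(c)
# p concentrates on the minimum-cost coordinate (index 1)
```

In MDGPS, by this analogy, the improvement step is taken in trajectory space by the local policy optimizer, and the projection onto the set of behaviors representable by πθ is performed only approximately, via supervised learning.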
Like DAGGER [15], MDGPS uses supervised learning to train the policy, but unlike DAGGER, MDGPS does not assume that the learner is able to reproduce an arbitrary teacher's behavior with bounded error, which makes it very appealing for tasks with partial observability or other limits on information, such as learning to use camera images for robotic manipulation [6]. When sampling directly from the global policy, MDGPS also has close connections to policy gradient methods that take steps of fixed KL-divergence [14, 17], but with the steps taken in the space of trajectories rather than policy parameters, followed by a projection step. In future work, it would be interesting to explore this connection further, so as to develop new model-free policy gradient methods.

Acknowledgments

We thank the anonymous reviewers for their helpful and constructive feedback. This research was supported in part by an ONR Young Investigator Program award.

References

[1] J. A. Bagnell and J. Schneider. Covariant policy search. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.

[2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, May 2003.

[3] M. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.

[4] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems (NIPS), 2014.

[5] S. Levine and P. Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), 2014.

[6] S. Levine, C. Finn, T. Darrell, and P. Abbeel.
End-to-end training of deep visuomotor policies. Journal of Machine Learning Research (JMLR), 17, 2016.

[7] S. Levine and V. Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems (NIPS), 2013.

[8] S. Levine, N. Wagener, and P. Abbeel. Learning contact-rich manipulation skills with guided policy search. In International Conference on Robotics and Automation (ICRA), 2015.

[9] W. Li and E. Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pages 222–229, 2004.

[10] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2016.

[11] I. Mordatch, K. Lowrey, G. Andrew, Z. Popovic, and E. Todorov. Interactive control of diverse complex characters with neural networks. In Advances in Neural Information Processing Systems (NIPS), 2015.

[12] I. Mordatch and E. Todorov. Combining the benefits of function approximation and trajectory optimization. In Robotics: Science and Systems (RSS), 2014.

[13] J. Peters, K. Mülling, and Y. Altün. Relative entropy policy search. In AAAI Conference on Artificial Intelligence, 2010.

[14] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

[15] S. Ross, G. Gordon, and A. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research, 15:627–635, 2011.

[16] S. Ross, N. Melik-Barkhudarov, K. Shaurya Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert. Learning monocular reactive UAV control in cluttered natural environments. In International Conference on Robotics and Automation (ICRA), 2013.

[17] J. Schulman, S.
Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.

[18] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, May 1992.

[19] T. Zhang, G. Kahn, S. Levine, and P. Abbeel. Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In International Conference on Robotics and Automation (ICRA), 2016.