{"title": "Policy gradients in linearly-solvable MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 2298, "page_last": 2306, "abstract": "We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.", "full_text": "Policy gradients in linearly-solvable MDPs\n\nEmanuel Todorov\n\nApplied Mathematics and Computer Science & Engineering\n\nUniversity of Washington\n\ntodorov@cs.washington.edu\n\nAbstract\n\nWe present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.\n\n1 Introduction\n\nPolicy gradient methods [18] in Reinforcement Learning have gained popularity, due to the guaranteed improvement in control performance over iterations (which is often lacking in approximate policy or value iteration) as well as the discovery of more efficient gradient estimation methods. In particular it has been shown that one can replace the true advantage function with a compatible function approximator without affecting the gradient [8, 14], and that a natural policy gradient (with respect to Fisher information) can be computed [2, 5, 11].\n\nThe goal of this paper is to apply policy gradient ideas to the linearly-solvable MDPs (or LMDPs) we have recently developed [15, 
16], as well as to a class of continuous stochastic systems with similar properties [4, 7, 16]. This framework has already produced a number of unique results, such as linear Bellman equations, general estimation-control dualities, compositionality of optimal control laws, and path-integral methods for optimal control. The present results with regard to policy gradients are also unique, as summarized in the Abstract. While the contribution is mainly theoretical and scaling to large problems is left for future work, we provide simulations demonstrating rapid convergence. The paper is organized in two sections, treating discrete and continuous problems.\n\n2 Discrete problems\n\nSince a number of papers on LMDPs have already been published, we will not repeat the general development and motivation here, but instead only summarize the background needed for the present paper. We will then develop the new results regarding policy gradients.\n\n2.1 Background on LMDPs\n\nAn LMDP is defined by a state cost q(x) over a (discrete for now) state space X, and a transition probability density p(x'|x) corresponding to the notion of passive dynamics. In this paper we focus on infinite-horizon average-cost problems where p(x'|x) is assumed to be ergodic, i.e. it has a unique stationary density. The admissible \"actions\" are all transition probability densities π(x'|x) which are ergodic and satisfy π(x'|x) = 0 whenever p(x'|x) = 0. 
The cost function is\n\n    ℓ(x, π(·|x)) = q(x) + D_KL(π(·|x) || p(·|x))    (1)\n\nThus the controller is free to modify the default/passive dynamics in any way it wishes, but incurs a control cost related to the amount of modification.\n\nThe average cost c and differential cost-to-go v(x) for given π(x'|x) satisfy the Bellman equation\n\n    c + v(x) = q(x) + Σ_{x'} π(x'|x) ( log[π(x'|x) / p(x'|x)] + v(x') )    (2)\n\nwhere v(x) is defined up to a constant. The optimal c* and v*(x) can be shown to satisfy\n\n    c* + v*(x) = q(x) − log Σ_{x'} p(x'|x) exp(−v*(x'))    (3)\n\nand the optimal π*(x'|x) can be found in closed form given v*(x):\n\n    π*(x'|x) = p(x'|x) exp(−v*(x')) / Σ_y p(y|x) exp(−v*(y))    (4)\n\nExponentiating equation (3) makes it linear in exp(−v*(x)), although this will not be used here.\n\n2.2 Policy gradient for a general parameterization\n\nConsider a parameterization π(x'|x, w) which is valid in the sense that it satisfies the above conditions and ∇_w π := ∂π/∂w exists for all w ∈ ℝ^n. Let μ(x, w) be the corresponding stationary density. We will also need the pair-wise density μ(x, x', w) = μ(x, w) π(x'|x, w). To avoid notational clutter we will suppress the dependence on w in most of the paper; keep in mind that all quantities that depend on π are functions of w.\n\nOur objective here is to compute ∇_w c. This is done by differentiating the Bellman equation (2) and following the template from [14]. The result (see Supplement) is given by\n\nTheorem 1. The LMDP policy gradient for any valid parameterization is\n\n    ∇_w c = Σ_x μ(x) Σ_{x'} ∇_w π(x'|x) ( log[π(x'|x) / p(x'|x)] + v(x') )    (5)\n\nLet us now compare (5) to the policy gradient in traditional MDPs [14], which is\n\n    ∇_w c = Σ_x μ(x) Σ_a ∇_w π(a|x) Q(x, a)    (6)\n\nHere π(a|x) is a stochastic policy over actions (parameterized by w) and Q(x, a) is the corresponding state-action cost-to-go. The general form of (5) and (6) is similar, however the term log(π/p) + v in (5) cannot be interpreted as a Q-function. Indeed it is not clear what a Q-function means in the LMDP setting. On the other hand, while in traditional MDPs one has to estimate Q (or rather the advantage function) in order to compute the policy gradient, it will turn out that in LMDPs it is sufficient to estimate v.\n\n2.3 A suitable policy parameterization\n\nThe relation (4) between the optimal policy π* and the optimal cost-to-go v* suggests parameterizing π as a p-weighted Gibbs distribution. Since linear function approximators have proven very successful, we will use an energy function (for the Gibbs distribution) which is linear in w:\n\n    π(x'|x, w) := p(x'|x) exp(−wᵀf(x')) / Σ_y p(y|x) exp(−wᵀf(y))    (7)\n\nHere f(x) ∈ ℝ^n is a vector of features. One can verify that (7) is a valid parameterization. We will also need the π-expectation operator\n\n    Π[f](x) := Σ_y π(y|x) f(y)    (8)\n\ndefined for both scalar and vector functions over X. The general result (5) is now specialized as\n\nTheorem 2. The LMDP policy gradient for parameterization (7) is\n\n    ∇_w c = Σ_{x,x'} μ(x, x') (Π[f](x) − f(x')) (v(x') − wᵀf(x'))    (9)\n\nAs expected from (4), we see that the energy function wᵀf(x) and the cost-to-go v(x) are related. Indeed if they are equal the gradient vanishes (the converse is not true).\n\n2.4 Compatible cost-to-go function approximation\n\nOne of the more remarkable aspects of policy gradient results [8, 14] in traditional MDPs is that, when the true Q function is replaced with a compatible approximation satisfying certain conditions, the gradient remains unchanged. Key to obtaining such results is making sure that the approximation error is orthogonal to the remaining terms in the expression for the policy gradient. Our goal in this section is to construct a compatible function approximator for LMDPs. The procedure is somewhat elaborate and unusual, so we provide the derivation before stating the result in Theorem 3 below.\n\nGiven the form of (9), it makes sense to approximate v(x) as a linear combination of the same features f(x) used to represent the energy function: v̂(x, r) := rᵀf(x). Let us also define the approximation error ε_r(x) := v(x) − v̂(x, r). 
If the policy gradient ∇_w c is to remain unchanged when v is replaced with v̂ in (9), the following quantity must be zero:\n\n    d(r) := Σ_{x,x'} μ(x, x') (Π[f](x) − f(x')) ε_r(x')    (10)\n\nExpanding (10) and using the stationarity of μ, we can simplify d as\n\n    d(r) = Σ_x μ(x) ( Π[f](x) Π[ε_r](x) − f(x) ε_r(x) )    (11)\n\nOne can also incorporate an x-dependent baseline in (9), such as v(x) which is often used in traditional MDPs. However the baseline vanishes after the simplification, and the result is again (11).\n\nNow we encounter a complication. Suppose we were to fit v̂ to v in a least-squares sense, i.e. minimize the squared error weighted by μ. Denote the resulting weight vector r_LS:\n\n    r_LS := arg min_r Σ_x μ(x) (v(x) − rᵀf(x))²    (12)\n\nThis is arguably the best fit one can hope for. The error ε_r is now orthogonal to the features f, thus for r = r_LS the second term in (11) vanishes, but the first term does not. Indeed we have verified numerically (on randomly-generated LMDPs) that d(r_LS) ≠ 0.\n\nIf the best fit is not good enough, what are we to do? Recall that we do not actually need a good fit, but rather a vector r such that d(r) = 0. Since d(r) and r are linearly related and have the same dimensionality, we can directly solve this equation for r. 
Replacing ε_r(x) with v(x) − rᵀf(x) and using the fact that Π is a linear operator, we have d(r) = Ar − k where\n\n    A := Σ_x μ(x) ( f(x) f(x)ᵀ − Π[f](x) Π[f](x)ᵀ )\n    k := Σ_x μ(x) ( f(x) v(x) − Π[f](x) Π[v](x) )    (13)\n\nWe are not done yet because k still depends on v. The goal now is to approximate v in such a way that k remains unchanged. To this end we use (2) and express Π[v] in terms of v:\n\n    c + v(x) − ℓ(x) = Π[v](x)    (14)\n\nHere ℓ(x) is shortcut notation for ℓ(x, π(·|x, w)). Thus the vector k becomes\n\n    k = Σ_x μ(x) ( g(x) v(x) + Π[f](x) (ℓ(x) − c) )    (15)\n\nwhere the policy-specific auxiliary features g(x) are related to the original features f(x) as\n\n    g(x) := f(x) − Π[f](x)    (16)\n\nThe second term in (15) does not depend on v; it only depends on c = Σ_x μ(x) ℓ(x). The first term in (15) involves the projection of v on the auxiliary features g. This projection can be computed by defining the auxiliary function approximator ṽ(x, s) := sᵀg(x) and fitting it to v in a least-squares sense, as in (12) but using g(x) rather than f(x). The approximation error is now orthogonal to the auxiliary features g(x), and so replacing v(x) with ṽ(x, s) in (15) does not affect k. Thus we have\n\nTheorem 3. The following procedure yields the exact LMDP policy gradient:\n\n1. fit ṽ(x, s) to v(x) in a least-squares sense, and also compute c\n2. compute A from (13), and k from (15) by replacing v(x) with ṽ(x, s)\n3. \"fit\" v̂(x, r) by solving Ar = k\n4. the policy gradient is\n\n    ∇_w c = Σ_{x,x'} μ(x, x') (f(x') − Π[f](x)) f(x')ᵀ (w − r)    (17)\n\nThis is the first policy gradient result with compatible function approximation over the state space rather than the state-action space. The computations involve averaging over μ, which in practice will be done through sampling (see below). The requirement that v − ṽ be orthogonal to g is somewhat restrictive, however an equivalent requirement arises in traditional MDPs [14].\n\n2.5 Natural policy gradient\n\nWhen the parameter space has a natural metric G(w), optimization algorithms tend to work better if the gradient of the objective function is pre-multiplied by G(w)⁻¹. This yields the so-called natural gradient [1]. In the context of policy gradient methods [5, 11] where w parameterizes a probability density, the natural metric is given by Fisher information (which depends on x because w parameterizes the conditional density). Averaging over μ yields the metric\n\n    G(w) := Σ_{x,x'} μ(x, x') ∇_w log π(x'|x) ∇_w log π(x'|x)ᵀ    (18)\n\nWe then have the following result (see Supplement):\n\nTheorem 4. With the vector r computed as in Theorem 3, the LMDP natural policy gradient is\n\n    G(w)⁻¹ ∇_w c = w − r    (19)\n\nLet us compare this result to the natural gradient in traditional MDPs [11], which is\n\n    G(w)⁻¹ ∇_w c = r    (20)\n\nIn traditional MDPs one maximizes reward while in LMDPs one minimizes cost, thus the sign difference. 
Recall that in traditional MDPs the policy π is parameterized using features over the state-action space while in LMDPs we only need features over the state space. Thus the vectors w, r will usually have lower dimensionality in (19) compared to (20).\n\nAnother difference is that in LMDPs the (regular as well as natural) policy gradient vanishes when w = r, which is a sensible fixed-point condition. In traditional MDPs the policy gradient vanishes when r = 0, which is peculiar because it corresponds to the advantage function approximation being identically 0. The true advantage function is of course different, but if the policy becomes deterministic and only one action is sampled per state, the resulting data can be fit with r = 0. Thus any deterministic policy is a local maximum in traditional MDPs. At these local maxima the policy gradient theorem cannot actually be applied because it requires a stochastic policy. When the policy becomes near-deterministic, the number of samples needed to obtain accurate estimates increases because of the lack of exploration [6]. These issues do not seem to arise in LMDPs.\n\n2.6 A Gauss-Newton method for approximating the optimal cost-to-go\n\nInstead of using policy gradient, we can solve (3) for the optimal v* directly. One option is approximate policy iteration, which in our context takes on a simple form. Given the policy parameters w^(i) at iteration i, approximate the cost-to-go function and obtain the feature weights r^(i), and then set w^(i+1) = r^(i). This is equivalent to the above natural gradient method with step size 1, using a biased approximator instead of the compatible approximator given by Theorem 3.\n\nThe other option is approximate value iteration, which is a fixed-point method for solving (3) while replacing v*(x) with wᵀf(x). We can actually do better than value iteration here. 
Since (3) has already been optimized over the controls and is differentiable, we can apply an efficient Gauss-Newton method. Up to an additive constant c, the Bellman error from (3) is\n\n    e(x, w) := wᵀf(x) − q(x) + log Σ_y p(y|x) exp(−wᵀf(y))    (21)\n\nInterestingly, the gradient of this Bellman error coincides with our auxiliary features g:\n\n    ∇_w e(x, w) = f(x) − Σ_y [ p(y|x) exp(−wᵀf(y)) / Σ_s p(s|x) exp(−wᵀf(s)) ] f(y) = f(x) − Π[f](x) = g(x)    (22)\n\nwhere Π and g are the same as in (16, 8). We now linearize: e(x, w + δw) ≈ e(x, w) + δwᵀg(x) and proceed to minimize (with respect to c and δw) the quantity\n\n    Σ_x μ(x) (c + e(x, w) + δwᵀg(x))²    (23)\n\nFigure 1: (A) Learning curves for a random LMDP. \"resid\" is the Gauss-Newton method. The sampling versions use 400 samples per evaluation: 20 trajectories with 20 steps each, starting from the stationary distribution. (B) Cost-to-go functions for the metronome LMDP. The numbers show the average costs obtained. There are 2601 discrete states and 25 features (Gaussians). Convergence was observed in about 10 evaluations (of the objective and the gradient) for both algorithms, exact and sampling versions. The sampling version of the Gauss-Newton method worked well with 400 samples per evaluation; the natural gradient needed around 2500 samples.\n\nNormally the density μ(x) would be fixed, however we have found empirically that the resulting algorithm yields better policies if we set μ(x) to the policy-specific stationary density μ(x, w) at each iteration. It is not clear how to guarantee convergence of this algorithm given that the objective function itself is changing over iterations, but in practice we observed that simple damping is sufficient to make it convergent (e.g. w ← w + δw/2).\n\nIt is notable that minimization of (23) is closely related to policy evaluation via Bellman residual minimization. More precisely, using (14, 16) it is easy to see that TD(0) applied to our problem would seek to minimize\n\n    Σ_x μ(x, w) (c − ℓ(x, w) + rᵀg(x))²    (24)\n\nThe similarity becomes even more apparent if we write −ℓ(x, w) more explicitly as\n\n    −ℓ(x, w) = wᵀΠ[f](x) − q(x) + log Σ_y p(y|x) exp(−wᵀf(y))    (25)\n\nThus the only difference from (21) is that one expression has the term wᵀf(x) at the place where the other expression has the term wᵀΠ[f](x). Note that the Gauss-Newton method proposed here would be expected to have second-order convergence, even though the amount of computation/sampling per iteration is the same as in a policy gradient method.\n\n2.7 Numerical experiments\n\nWe compared the natural policy gradient and the Gauss-Newton method, both in exact form and with sampling, on two classes of LMDPs: randomly generated, and a discretization of a continuous \"metronome\" problem taken from [17]. 
Fitting the auxiliary approximator ṽ(x, s) was done using the LSTD(λ) algorithm [3]. Note that Theorem 3 guarantees compatibility only for λ = 1, however lower values of λ reduce variance and still provide good descent directions in practice (as one would expect). We ended up using λ = 0.2 after some experimentation. The natural gradient was used with the BFGS minimizer \"minFunc\" [12].\n\nFigure 1A shows typical learning curves on a random LMDP with 100 states, 20 random features, and random passive dynamics with 50% sparsity. In this case the algorithms had very similar performance. On other examples we observed one or the other algorithm being slightly faster or producing better minima, but overall they were comparable. The average cost of the policies found by the Gauss-Newton method occasionally increased towards the end of the iteration.\n\nFigure 1B compares the optimal cost-to-go v*, the least-squares fit to the known v* using our features (which were a 5-by-5 grid of Gaussians), and the solution of the policy gradient method initialized with w = 0. Note that the latter has lower cost compared to the least-squares fit. In this case both algorithms converged in about 10 iterations, although the Gauss-Newton method needed about 5 times fewer samples in order to achieve similar performance to the exact version.\n\n3 Continuous problems\n\nUnlike the discrete case where we focused exclusively on LMDPs, here we begin with a very general problem formulation and present interesting new results. 
These results are then specialized to a narrower class of problems which are continuous (in space and time) but nevertheless have similar properties to LMDPs.\n\n3.1 Policy gradient for general controlled diffusions\n\nConsider the controlled Ito diffusion\n\n    dx = b(x, u) dt + C(x) dω    (26)\n\nwhere ω(t) is a standard multidimensional Brownian motion process, and u is now a traditional control vector. Let ℓ(x, u) be a cost function. As before we focus on infinite-horizon average-cost optimal control problems. Given a policy u = π(x), the average cost c and differential cost-to-go v(x) satisfy the Hamilton-Jacobi-Bellman (HJB) equation\n\n    c = ℓ(x, π(x)) + ℒ[v](x)    (27)\n\nwhere ℒ is the following 2nd-order linear differential operator:\n\n    ℒ[v](x) := b(x, π(x))ᵀ ∇_x v(x) + (1/2) trace( C(x) C(x)ᵀ ∇_xx v(x) )    (28)\n\nIt can be shown [10] that ℒ coincides with the infinitesimal generator of (26), i.e. it computes the expected directional derivative of v along trajectories generated by (26). We will need\n\nLemma 1. Let ℒ be the infinitesimal generator of an Ito diffusion which has a stationary density μ, and let f be a twice-differentiable function. Then\n\n    ∫ μ(x) ℒ[f](x) dx = 0    (29)\n\nProof: The adjoint ℒ* of the infinitesimal generator ℒ is known to be the Fokker-Planck operator, which computes the time-evolution of a density under the diffusion [10]. Since μ is the stationary density, ℒ*[μ](x) = 0 for all x, and so ⟨ℒ*[μ], f⟩ = 0. Since ℒ and ℒ* are adjoint, ⟨ℒ*[μ], f⟩ = ⟨μ, ℒ[f]⟩. Thus ⟨μ, ℒ[f]⟩ = 0.\n\nThis lemma seems important-yet-obvious so we would not be surprised if it was already known, but we have not seen it in the literature. 
Note that many diffusions lack stationary densities. For example the density of Brownian motion initialized at the origin is a zero-mean Gaussian whose covariance grows linearly with time, thus there is no stationary density. If however the diffusion is controlled and the policy tends to keep the state within some region, then a stationary density would normally exist. The existence of a stationary density may actually be a sensible definition of stability for stochastic systems (although this point will not be pursued in the present paper).\n\nNow consider any policy parameterization u = π(x, w) such that (for the current value of w) the diffusion (26) has a stationary density μ and ∇_w π exists. Differentiating (27), and using the shortcut notation b(x) in place of b(x, π(x, w)) and similarly for ℓ(x), we have\n\n    ∇_w c = ∇_w ℓ(x) + ∇_w b(x)ᵀ ∇_x v(x) + ℒ[∇_w v](x)    (30)\n\nHere ℒ[∇_w v] is meant component-wise. If we now average over μ, the last term will vanish due to Lemma 1. This is essential for a policy gradient procedure which seeks to avoid finite differencing; indeed ∇_w v could not be estimated while sampling from a single policy. Thus we have\n\nTheorem 5. The policy gradient of the controlled diffusion (26) is\n\n    ∇_w c = ∫ μ(x) ( ∇_w ℓ(x) + ∇_w b(x)ᵀ ∇_x v(x) ) dx    (31)\n\nUnlike most other results in stochastic optimal control, equation (31) does not involve the Hessian ∇_xx v, although we can obtain a ∇_xx v-dependent term here if we allow C to depend on u. We now illustrate Theorem 5 on a linear-quadratic-Gaussian (LQG) control problem.\n\nExample (LQG). Consider dynamics dx = u dt + dω and cost ℓ(x, u) = x² + u². Let u = −wx be the parameterized policy with w > 0. 
The differential cost-to-go is known to be in the form v(x) = s x². Substituting in the HJB equation and matching powers of x yields c = s = (w² + 1)/(2w), and so the policy gradient can be computed directly as ∇_w c = 1 − (w² + 1)/(2w²). The stationary density μ(x) is a zero-mean Gaussian with variance σ² = 1/(2w). One can now verify that the gradient given by Theorem 5 is identical to the ∇_w c computed above.\n\nAnother interesting aspect of Theorem 5 is that it is a natural generalization of classic results from finite-horizon deterministic optimal control [13], even though it cannot be derived from those results. Suppose we have an open-loop control trajectory u(t), 0 ≤ t ≤ T, the resulting state trajectory (starting from a given x_0) is x(t), and the corresponding co-state trajectory (obtained by integrating Pontryagin's ODE backwards in time) is λ(t). It is known that the gradient of the total cost J w.r.t. u is ∇_u ℓ + ∇_u bᵀ λ. Now suppose u(t) is parameterized by some vector w. Then\n\n    ∇_w J = ∫ ∇_w u(t)ᵀ ∇_{u(t)} J dt = ∫ ( ∇_w ℓ(x(t), u(t)) + ∇_w b(x(t), u(t))ᵀ λ(t) ) dt    (32)\n\nThe co-state λ(t) is known to be equal to the gradient ∇_x v(x, t) of the cost-to-go function for the (closed-loop) deterministic problem. Thus (31) and (32) are very similar. Of course in finite-horizon settings there is no stationary density, and instead the integral in (32) is over the trajectory. An RL method for estimating ∇_w J in deterministic problems was developed in [9].\n\nTheorem 5 suggests a simple procedure for estimating the policy gradient via sampling: fit a function approximator v̂ to v, and use ∇_x v̂ in (31). Alternatively, a compatible approximation scheme can be obtained by fitting ∇_x v̂ to ∇_x v in a least-squares sense, using a linear approximator with features ∇_w b(x). This however is not practical because learning targets for ∇_x v are difficult to obtain. Ideally we would construct a compatible approximation scheme which involves fitting v̂ rather than ∇_x v̂. It is not clear how to do that for general diffusions, but it can be done for a restricted problem class as shown next.\n\n3.2 Natural gradient and compatible approximation for linearly-solvable diffusions\n\nWe now focus on a more restricted family of stochastic optimal control problems which arise in many situations (e.g. most mechanical systems can be described in this form):\n\n    dx = (a(x) + B(x) u) dt + C(x) dω\n    ℓ(x, u) = q(x) + (1/2) uᵀ R(x) u    (33)\n\nSuch problems have been studied extensively [13]. The optimal control law u* and the optimal differential cost-to-go v*(x) are known to be related as u* = −R⁻¹Bᵀ∇_x v*. As in the discrete case we use this relation to motivate the choice of policy parameterization and cost-to-go function approximator. Choosing some features f(x), we define v̂(x, r) := rᵀf(x) as before, and\n\n    π(x, w) := −R(x)⁻¹ B(x)ᵀ ∇_x (wᵀf(x))    (34)\n\nIt is convenient to also define the matrix F(x) := ∇_x f(x)ᵀ, so that ∇_x v̂(x, r) = F(x) r. We can now substitute these definitions in the general result (31), replace v with the approximation v̂, and, skipping the algebra, obtain the corresponding approximation to the policy gradient:\n\n    ∇̃_w c = ∫ μ(x) F(x)ᵀ B(x) R(x)⁻¹ B(x)ᵀ F(x) (w − r) dx    (35)\n\nBefore addressing the issue of compatibility (i.e. whether ∇̃_w c = ∇_w c), we seek a natural gradient version of (35). To this end we need to interpret FᵀBR⁻¹BᵀF as Fisher information for the (infinitesimal) transition probability density of our parameterized diffusion. We do this by discretizing the time axis with time step h, and then dividing by h. The h-step explicit Euler discretization of the stochastic dynamics (33) is given by the Gaussian\n\n    p_h(·|x, w) = N( x + h a(x) − h B(x) R(x)⁻¹ B(x)ᵀ F(x) w ; h C(x) C(x)ᵀ )    (36)\n\nSuppressing the dependence on x, Fisher information becomes\n\n    (1/h) ∫ p_h ∇_w log p_h ∇_w log p_hᵀ dx' = FᵀBR⁻¹Bᵀ (CCᵀ)⁻¹ BR⁻¹BᵀF    (37)\n\nComparing to (35) we see that a natural gradient result is obtained when\n\n    C(x) C(x)ᵀ = B(x) R(x)⁻¹ B(x)ᵀ    (38)\n\nAssuming (38) is satisfied, and defining G(w) as the average of Fisher information over μ(x),\n\n    G(w)⁻¹ ∇̃_w c = w − r    (39)\n\nCondition (38) is rather interesting. Elsewhere we have shown [16] that the same condition is needed to make problem (33) linearly-solvable. More precisely, the exponentiated HJB equation for the optimal v* in problem (33, 38) is linear in exp(−v*). We have also shown [16] that the continuous problem (33, 38) is the limit (when h → 0) of continuous-state discrete-time LMDPs constructed via Euler discretization as above. The compatible function approximation scheme from Theorem 3 can then be applied to these LMDPs. Recall (8). Since ℒ is the infinitesimal generator, for any twice-differentiable function f we have\n\n    Π[f](x) = f(x) + h ℒ[f](x) + o(h²)    (40)\n\nSubstituting in (13), dividing by h and taking the limit h → 0, the matrix A and vector k become\n\n    A = ∫ μ(x) ( −ℒ[f](x) f(x)ᵀ − f(x) ℒ[f](x)ᵀ ) dx\n    k = ∫ μ(x) ( −ℒ[f](x) v(x) + f(x) (ℓ(x) − c) ) dx    (41)\n\nCompatibility is therefore achieved when the approximation error in v is orthogonal to ℒ[f]. Thus the auxiliary function approximator is now ṽ(x, s) := sᵀℒ[f](x), and we have\n\nTheorem 6. The following procedure yields the exact policy gradient for problem (33, 38):\n\n1. fit ṽ(x, s) to v(x) in a least-squares sense, and also compute c\n2. compute A and k from (41), replacing v(x) with ṽ(x, s)\n3. \"fit\" v̂(x, r) by solving Ar = k\n4. the policy gradient is (35), and the natural policy gradient is (39)\n\nThis is the first policy gradient result with compatible function approximation for continuous stochastic systems. It is very similar to the corresponding results in the discrete case (Theorems 3, 4) except it involves the differential operator ℒ rather than the integral operator Π.\n\n4 Summary\n\nHere we developed compatible function approximators and natural policy gradients which only require estimation of the cost-to-go function. This was possible due to the unique properties of the LMDP framework. The resulting approximation scheme is unusual, using policy-specific auxiliary features derived from the primary features. In continuous time we also obtained a new policy gradient result for control problems that are not linearly-solvable, and showed that it generalizes results from deterministic optimal control. 
We also derived a somewhat heuristic but nevertheless promising Gauss-Newton method for solving for the optimal cost-to-go directly; it appears to be a hybrid between value iteration and policy gradient.\nOne might wonder why we need policy gradients here given that the (exponentiated) Bellman equation is linear, and approximating its solution using features is faster than any other procedure in Reinforcement Learning and Approximate Dynamic Programming. The answer is that minimizing Bellman error does not always give the best policy, as illustrated in Figure 1B. Indeed a combined approach may be optimal: solve the linear Bellman equation approximately [17], and then use the solution to initialize the policy gradient method. This idea will be explored in future work.\nOur new methods require a model, as do all RL methods that rely on state values rather than state-action values. We do not see this as a shortcoming because, despite all the effort that has gone into model-free RL, the resulting methods do not seem applicable to truly complex optimal control problems. Our methods involve model-based sampling which combines the best of both worlds: computational speed, and grounding in reality (assuming we have a good model of reality).\n\nAcknowledgements.\nThis work was supported by the US National Science Foundation. Thanks to Guillaume Lajoie and Jan Peters for helpful discussions.\n\nReferences\n[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251\u2013276, 1998.\n[2] J. Bagnell and J. Schneider. Covariant policy search. In International Joint Conference on Artificial Intelligence, 2003.\n[3] J. Boyan. Least-squares temporal difference learning. In International Conference on Machine Learning, 1999.\n[4] W. Fleming and S. Mitter. Optimal control and nonlinear filtering for nondegenerate diffusion processes. Stochastics, 8:226\u2013261, 1982.\n[5] S. Kakade. 
A natural policy gradient. In Advances in Neural Information Processing Systems, 2002.\n[6] S. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.\n[7] H. Kappen. Linear theory for control of nonlinear stochastic systems. Physical Review Letters, 95, 2005.\n[8] V. Konda and J. Tsitsiklis. Actor-critic algorithms. SIAM Journal on Control and Optimization, pages 1008\u20131014, 2001.\n[9] R. Munos. Policy gradient in continuous time. The Journal of Machine Learning Research, 7:771\u2013791, 2006.\n[10] B. Oksendal. Stochastic Differential Equations (4th Ed). Springer-Verlag, Berlin, 1995.\n[11] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71:1180\u20131190, 2008.\n[12] M. Schmidt. minfunc. online material, 2005.\n[13] R. Stengel. Optimal Control and Estimation. Dover, New York, 1994.\n[14] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.\n[15] E. Todorov. Linearly-solvable Markov decision problems. Advances in Neural Information Processing Systems, 2006.\n[16] E. Todorov. Efficient computation of optimal actions. PNAS, 106:11478\u201311483, 2009.\n[17] E. Todorov. Eigen-function approximation methods for linearly-solvable optimal control problems. IEEE ADPRL, 2009.\n[18] R. Williams. Simple statistical gradient following algorithms for connectionist reinforcement learning. Machine Learning, pages 229\u2013256, 1992.\n", "award": [], "sourceid": 525, "authors": [{"given_name": "Emanuel", "family_name": "Todorov", "institution": null}]}