{"title": "Total stochastic gradient algorithms and applications in reinforcement learning", "book": "Advances in Neural Information Processing Systems", "page_first": 10204, "page_last": 10214, "abstract": "Backpropagation and the chain rule of derivatives have been prominent; however, the total derivative rule has not enjoyed the same amount of attention. In this work we show how the total derivative rule leads to an intuitive visual framework for creating gradient estimators on graphical models. In particular, previous “policy gradient theorems” are easily derived. We derive new gradient estimators based on density estimation, as well as a likelihood ratio gradient, which “jumps” to an intermediate node, not directly to the objective function. We evaluate our methods on model-based policy gradient algorithms, achieve good performance, and present evidence towards demystifying the success of the popular PILCO algorithm.", "full_text": "Total stochastic gradient algorithms and applications in reinforcement learning

Paavo Parmas
Neural Computation Unit
Okinawa Institute of Science and Technology Graduate University
Okinawa, Japan
paavo.parmas@oist.jp

Abstract

Backpropagation and the chain rule of derivatives have been prominent; however, the total derivative rule has not enjoyed the same amount of attention. In this work we show how the total derivative rule leads to an intuitive visual framework for creating gradient estimators on graphical models. In particular, previous “policy gradient theorems” are easily derived. We derive new gradient estimators based on density estimation, as well as a likelihood ratio gradient, which “jumps” to an intermediate node, not directly to the objective function.
We evaluate our methods on model-based policy gradient algorithms, achieve good performance, and present evidence towards demystifying the success of the popular PILCO algorithm [5].

1 Introduction

A central problem in machine learning is estimating the gradient of the expectation of a random variable with respect to the parameters of the distribution, (d/dζ) E_{x∼p(x;ζ)}[φ(x)]. Some examples include: the gradient of the expected classification error of a model over the data generating distribution, the gradient of the expected evidence lower bound w.r.t. the variational parameters in variational inference [9], or the gradient of the expected reward w.r.t. the policy parameters in reinforcement learning [20]. Usually, such an estimator is needed not just through a single computation, but through a computation graph; a good overview of related problems is given by [18]. Previously, Schulman et al. provided a method to obtain gradient estimators on stochastic computation graphs by differentiating a surrogate loss [18]. While that work provided an elegant method to obtain gradient estimators using automatic differentiation, the resulting stochastic computation graph framework has formal rules, which uniquely define one specific type of estimator, and it is not suitable for describing general gradient estimation techniques. For example, deterministic policy gradients [19] and total propagation [14] are not covered by the framework. In contrast, in probabilistic inference, the successful probabilistic graphical model framework [15] only describes the structure of a model, while there are many different choices of algorithms to perform inference. We aim for a similar framework for gradient computation, which we call probabilistic computation graphs.
Our framework uses the total derivative rule df/da = ∂f/∂a + (∂f/∂b)(db/da) to decompose the gradient into a sum of partial derivatives along different computational paths, while leaving open the choice of estimator for the partial derivatives. We begin by introducing typical gradient estimators in the literature, then explain our new theorem, novel estimators using a non-standard decomposition of the total derivative, and experimental results.

Nomenclature: All variables will be considered as column vectors, and gradients are represented as matrices where each row corresponds to one output variable, and each column corresponds to one input variable; this allows applying the chain rule by simple matrix multiplication, i.e. df(x)/dy = (∂f/∂x)(∂x/∂y). Matrices are vectorised with the vec(∗) operator, i.e. dΣ/dx means dvec(Σ)/dx.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Background: Gradients of expectations

2.1 Pathwise derivative estimators

This type of estimator relies on gradients of φ w.r.t. x, e.g. the Gaussian gradient identities

(d/dµ) E_{x∼N(µ,Σ)}[φ(x)] = E_{x∼N(µ,Σ)}[dφ(x)/dx],   (d/dΣ) E_{x∼N(µ,Σ)}[φ(x)] = (1/2) E_{x∼N(µ,Σ)}[d²φ(x)/dx²],

cited in [17]. The most prominent type of pathwise derivative estimator are reparameterization (RP) gradients. We focus our discussion on RP gradients, but we mentioned the Gaussian identities to emphasize that RP gradients are not the only possible pathwise estimators, e.g. the derivative w.r.t. Σ given above does not correspond to an RP gradient.
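As a quick numerical illustration (not from the paper; the choice φ(x) = x³ and all constants are for the example only), both Gaussian identities can be checked by Monte Carlo in the univariate case, where Σ reduces to σ²:

```python
import numpy as np

# Monte Carlo check of the Gaussian gradient identities for phi(x) = x^3,
# x ~ N(mu, sigma^2). Closed forms used for comparison:
#   d/dmu    E[phi] = E[phi'(x)]      = 3*mu^2 + 3*sigma^2
#   d/dSigma E[phi] = E[phi''(x)] / 2 = 3*mu        (here Sigma = sigma^2)
rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.3
x = rng.normal(mu, sigma, size=400_000)

grad_mu = np.mean(3 * x**2)       # pathwise estimate of d/dmu E[x^3]
grad_Sigma = np.mean(6 * x) / 2   # pathwise estimate of d/dSigma E[x^3]

print(grad_mu, 3 * mu**2 + 3 * sigma**2)
print(grad_Sigma, 3 * mu)
```

With 4 × 10⁵ samples the two estimates agree with the closed forms to about two decimal places.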
See [17] for an overview of various options.

RP gradient for a univariate Gaussian: To sample from N(µ, σ²), sample from a standard normal ε ∼ N(0, 1), then transform this: x = µ + σε. The gradients are dx/dµ = 1 and dx/dσ = ε. The gradient can then be estimated by sampling: (d/dζ) E[φ(x)] = E[(dφ(x)/dx)(dx/dζ)]. For multivariate Gaussians, one can use the Cholesky factor L of Σ = LLᵀ instead of σ. To differentiate the Cholesky decomposition see [12]. See [17] for other distributions. For a general distribution p(x; ζ), the RP gradient defines a sampling procedure ε ∼ p(ε) and a transformation x = f(ζ, ε), which allows moving the derivative inside the expectation: (d/dζ) E_{x∼p(x;ζ)}[φ(x)] = E_{ε∼p(ε)}[(dφ/df)(df/dζ)]. The RP gradient allows backpropagating the gradient through sampling operations in a graph. It computes partial derivatives through a specific operation.

2.2 Jump gradient estimators

We introduce the categorization of jump gradient estimators. Unlike pathwise derivatives, which compute local partial derivatives and apply the chain rule through numerous computations, jump gradient estimators can estimate the total derivative directly using only local computations; hence the naming: the gradient estimator jumps over multiple nodes in a graph without having to differentiate the nodes in between (this will become clearer in later sections of the paper).

Likelihood ratio estimators (LR): Any function f(x) can be stochastically integrated by sampling from an arbitrary distribution q(x): ∫ f(x)dx = ∫ q(x)(f(x)/q(x))dx = E_{x∼q}[f(x)/q(x)]. The gradient of an expectation can be written as ∫ φ(x)(dp(x;ζ)/dζ)dx. By picking q(x) = p(x), and stochastically integrating, one obtains the LR gradient estimator: E[((dp(x;ζ)/dζ)/p(x;ζ)) φ(x)]. One must subtract a baseline from the φ(x) values for this estimator to have acceptable variance: E[((dp(x;ζ)/dζ)/p(x;ζ))(φ(x) − b)]. In practice using b = E[φ] is a reasonable choice. If b does not depend on the samples, then this leads to an unbiased gradient estimator. Leave-one-out baseline estimates can be performed to achieve an unbiased gradient estimator [11]. Other control variate techniques also exist, and this is an active area of research [7].

In our recent work [14], we introduced the batch importance weighted LR estimator (BIW-LR) and baselines, where we use a mixture distribution q = Σ_{i=1}^P p(x; ζ_i)/P, and each ζ_i depends on another set of parameters θ (in our case the policy parameters):

BIW-LR: (1/P) Σ_{j=1}^P Σ_{i=1}^P ( (dp(x_j; ζ_i(θ))/dθ) / Σ_{k=1}^P p(x_j; ζ_k) ) (φ(x_j) − b_i),
BIW-Baseline: b_i = ( Σ_{j≠i} c_{j,i} φ(x_j) ) / Σ_{j≠i} c_{j,i}, where the importance weights are c_{j,i} = p(x_j; ζ_i)/Σ_{k=1}^P p(x_j; ζ_k).

Value function based estimators: Instead of using φ(x) directly, one can learn an approximator φ̂(x). The approximator will often require less computational time to evaluate, and could be used for estimating the derivatives. Both LR gradients and pathwise derivatives could be used with evaluations from the approximator.
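The pathwise and jump families can be contrasted on a toy problem; a sketch (illustrative constants, φ(x) = x², not from the paper) comparing an RP estimate against a baselined LR estimate of the same gradient:

```python
import numpy as np

# RP vs LR estimates of d/dmu E[phi(x)], phi(x) = x**2, x ~ N(mu, sigma^2).
# Analytic answer: d/dmu (mu^2 + sigma^2) = 2*mu. The baseline b = mean(phi)
# is an empirical stand-in for E[phi].
rng = np.random.default_rng(1)
mu, sigma, N = 1.0, 0.5, 200_000

eps = rng.normal(size=N)
x = mu + sigma * eps             # reparameterized sample

rp = 2 * x                       # RP: (d phi/dx) * (dx/dmu), with dx/dmu = 1
score = (x - mu) / sigma**2      # d log N(x; mu, sigma^2) / dmu
phi = x**2
b = phi.mean()                   # baseline
lr = score * (phi - b)

print(rp.mean(), lr.mean())      # both close to 2*mu = 2.0
print(rp.var(), lr.var())        # per-sample variance; LR is larger here
```

Both estimators are consistent for the same quantity, but their per-sample variances differ, which is what motivates combining them later in the paper.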
Moreover, it is not necessary to evaluate the estimator at just one point x: one could either use a larger number of samples, or try to directly compute the expectation; this leads to a Rao-Blackwellized estimator, which is known to have lower variance. Such estimators have been considered for example in RL in expected sarsa [24, 20] as well as in the stochastic variational inference literature [2, 23], and also in policy gradients [3, 1].

3 Total stochastic gradient theorem

Sec. 2 explained how to obtain estimators of the expectation through a single computation, while here we explain how to decompose the gradient of a complicated graph of computations into smaller sections, which can be readily estimated using the methods in Sec. 2. In our framework, we work with the gradient of the marginal distribution. This more general problem directly gives one the gradient of the expectation as well, as the expectation is just a function of the marginal distribution.

3.1 Explanation of framework

We define probabilistic computation graphs (PCG). The definition is exactly equivalent to the definition of a standard directed graphical model, but it highlights our methods better, and emphasizes our interest in computing gradients, rather than performing inference. The main difference is the explicit inclusion of the distribution parameters ζ, e.g. for a Gaussian, the mean µ and covariance Σ.

Definition 1 (Probabilistic computation graph (PCG)) An acyclic graph with nodes/vertices V and edges E, which satisfy the following properties:

1. Each node i ∈ V corresponds to a collection of random variables with marginal joint probability density p(x_i; ζ_i), where ζ_i are the possibly infinite parameters of the distribution. Note that the parameterization is not unique, and any parameterization is acceptable.
2. The probability density at each node is conditionally dependent on the parent nodes: p(x_i|Pa_i), where Pa_i are the random variables at the direct parents of node i.
3. The joint probability density satisfies p(x_1, ..., x_n) = Π_{i=1}^n p(x_i|Pa_i). In particular: p(x_i; ζ_i) = ∫ p(x_i|Pa_i) p(Pa_i; Pz_i) dPa_i.
4. Each ζ_i is a function of its parents: ζ_i = f(Pz_i), where Pz_i are the distribution parameters at the parents of node i.

We emphasize that there is nothing stochastic in our formulation. Each computation is deterministic, although they may be analytically intractable. We also emphasize that this definition does not exclude deterministic nodes, i.e. the distribution at a node may be a Dirac delta distribution (a point mass). Later we will use this formulation to derive stochastic estimates of the gradients.

3.2 Derivation of theorem

We are interested in computing the total derivative of the distribution parameters at one node ζ_i w.r.t. the parameters at another node, dζ_i/dζ_j; e.g. nodes i and j could correspond to φ and x in Sec. 2 respectively. By the total derivative rule: dζ_i/dζ_j = Σ_{ζ_m∈Pz_i} (∂ζ_i/∂ζ_m)(dζ_m/dζ_j). Iterating this equation on the dζ_m/dζ_j terms leads to a sum over paths from node j to node i:

dζ_i/dζ_j = Σ_{Paths(j→i)} Π_{Edges(k,l)∈Path} ∂ζ_l/∂ζ_k    (1)

This equation holds for any deterministic computation graph, and is also well known in e.g. the OJA community [13]. This equation trivially leads to our total stochastic gradient theorem, which states that the sum over paths from A to B can be written as a sum over paths from A to intermediate nodes and from the intermediate nodes to B. Fig. 1 provides examples of the paths in Eq. 2 below.

Theorem 1 (Total stochastic gradient theorem) Let i and j be distinct nodes in a probabilistic computation graph, and let IN be any set of intermediate nodes which block the paths from j to i, i.e. IN is such that there does not exist a path from j to i which does not pass through a node in IN. We denote by {a → b} the set of paths from a to b, and by {a → b}/c the set of paths from a to b where no node along the path except for b is allowed to be in set c. Then the total derivative dζ_i/dζ_j can be written with the equation below:

dζ_i/dζ_j = Σ_{m∈IN} ( Σ_{s∈{m→i}} Π_{(k,l)∈s} ∂ζ_l/∂ζ_k ) ( Σ_{r∈{j→m}/IN} Π_{(p,t)∈r} ∂ζ_t/∂ζ_p )    (2)

[Figure 1: Example paths in Equation 2. The green nodes correspond to the intermediate nodes IN. (a) {j → m} paths may not pass through green nodes. (b) {m → i} paths may pass through green nodes.]

Equations 1 and 2 can be combined to give:

dζ_i/dζ_j = Σ_{m∈IN} (dζ_i/dζ_m) ( Σ_{r∈{j→m}/IN} Π_{(p,t)∈r} ∂ζ_t/∂ζ_p )    (3)

Note that an analogous theorem could be derived by swapping r ∈ {j → m}/IN and s ∈ {m → i} with r ∈ {j → m} and s ∈ {m → i}/IN respectively.
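The path-sum decomposition (Eq. 1) and the blocked form (Eq. 3) above can be verified numerically on a tiny deterministic graph; an illustrative example (not from the paper) with scalar nodes a = θ², b = 3a + θ, c = ab, and blocking set IN = {a, b}:

```python
# Numeric check of Eq. 1 (explicit path enumeration) against Eq. 3
# (total derivatives after the blocking set, restricted paths before it).
theta = 0.7
a = theta**2
b = 3 * a + theta
c = a * b

# Eq. 1: the three paths theta -> c.
paths = (2 * theta) * b + (2 * theta) * 3 * a + 1 * a
#        th->a->c        th->a->b->c          th->b->c

# Eq. 3 with IN = {a, b}: dc/da and dc/db are TOTAL derivatives
# (dc/da includes the route via b), while the paths from theta to IN
# may not pass through IN, leaving only the direct edges.
dc_da = b + 3 * a
dc_db = a
eq3 = dc_da * (2 * theta) + dc_db * 1

# Finite-difference ground truth on the composed function.
h = 1e-6
def f(t): return t**2 * (3 * t**2 + t)
fd = (f(theta + h) - f(theta - h)) / (2 * h)

print(paths, eq3, fd)   # all approximately equal
```

The two decompositions agree exactly, and both match the finite-difference value.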
The swapped decomposition leads to the equation below:

dζ_i/dζ_j = Σ_{m∈IN} ( Σ_{r∈{m→i}/IN} Π_{(p,t)∈r} ∂ζ_t/∂ζ_p ) (dζ_m/dζ_j)    (4)

We will refer to Equations 3 and 4 as the second and first half total gradient equations respectively.

3.3 Gradient estimation on a graph

Here we clarify one method by which the partial derivatives through the nodes m ∈ IN in the previous section can be estimated. We use the following properties of the estimators in Sec. 2:

• Pathwise derivative estimators compute partial derivatives through a single edge, e.g. ∂ζ_m/∂ζ_j.
• Jump gradient estimators sum the gradients across all computational paths between two nodes and directly compute total derivatives, e.g. dζ_i/dζ_m.

The task is to estimate the derivative of the expectation at a distal node i w.r.t. the parameters at an earlier node j, (d/dζ_j) E_{x_i∼p(x_i;ζ_i)}[x_i], through an intermediate node m. Note that E[x_i] can be picked as one of the distribution parameters in ζ_i. The true ζ are intractable, so we perform an ancestral sampling based estimate ζ̂, i.e. we sample sequentially from each p(x_∗|Pa_∗) to get a sample through the whole graph; then ζ̂_∗ will simply be the parameters of p(x_∗|Pa_∗). We refer to one such sample as a particle. We use a batch of P such particles ζ̂_∗ = {ζ̂_{∗,c}}_{c=1}^P to obtain a mixture distribution as an approximation to the true distribution. Such a sampling procedure has the properties p(x; ζ) = ∫ p(x; ζ̂)p(ζ̂)dζ̂ and E_{x_i∼p(x_i;ζ_i)}[x_i] = E_{ζ̂_i∼p(ζ̂_i;ζ_j)}[ E_{x_i∼p(x_i;ζ̂_i)}[x_i] ]. For simplicity in the explanation, we further assume that the sampling is reparameterizable, i.e. p(ζ̂_m; ζ_j) = ∫ f(ζ̂_m; ζ_j, ε_m)p(ε_m)dε_m. We can write (d/dζ_j) E_{ζ̂_i∼p(ζ̂_i;ζ_j)}[ E_{x_i∼p(x_i;ζ̂_i)}[x_i] ] = E_{ε_m∼p(ε_m)}[ (∂ζ̂_m/∂ζ_j) (d/dζ̂_m) E_{x_i∼p(x_i;ζ̂_i)}[x_i] ]. The term ∂ζ̂_m/∂ζ_j will be estimated with a pathwise derivative estimator. The remaining term (d/dζ̂_m) E_{x_i∼p(x_i;ζ̂_i)}[x_i] will be estimated with any other estimator, e.g. a jump estimator could be used.

We summarize the procedure for creating gradient estimators from j to i on the whole graph:

1. Choose a set of intermediate nodes IN, which block the paths from j to i.
2. Construct pathwise derivative estimators from j to the intermediate nodes IN.
3. Construct total derivative estimators from IN to i, and apply Eq. 3 to combine the gradients.

[Figure 2: Probabilistic computation graphs for model-based and model-free LR gradient estimation. (a) Classical model-free policy gradient. (b) Model-based state-space LR gradient.]

4 Relationship to policy gradient theorems

In typical model-free RL problems [20] an agent performs actions u ∼ π(u_t|x_t; θ) according to a stochastic policy π, transitions through states x_t, and obtains costs c_t (or conversely rewards). The agent's goal is to find the policy parameters θ which optimize the expected return G = Σ_{t=0}^H c_t for each episode. The corresponding probabilistic computation graph is provided in Fig. 2a. In the literature, two “gradient theorems” are widely applied: the policy gradient theorem [21], and the deterministic policy gradient theorem [19].
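Before stating the two theorems formally, their difference can be previewed on a one-step problem with a Gaussian policy u ∼ N(θ, σ²) and a known Q(u) = −(u − 1)²; a sketch with illustrative constants (not from the paper), where the true gradient is −2(θ − 1):

```python
import numpy as np

# The score-function (policy gradient theorem) form uses d log pi / d theta
# times Q; the deterministic/pathwise form differentiates Q directly through
# a reparameterized action. Both target d/dtheta E[Q] = -2*(theta - 1).
rng = np.random.default_rng(2)
theta, sigma, N = 0.3, 0.5, 400_000
u = theta + sigma * rng.normal(size=N)
Q = -(u - 1.0)**2

pg = np.mean((u - theta) / sigma**2 * (Q - Q.mean()))  # LR with a baseline
dpg = np.mean(-2 * (u - 1.0))                          # pathwise via dQ/du

print(pg, dpg, -2 * (theta - 1.0))   # all approximately 1.4
```

The two estimators agree in expectation; the choice between them is a choice of jump estimator at the action nodes, exactly as described below.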
These two are equivalent in the limit of no noise [19].

Policy gradient theorem:
(d/dθ) E[G] = E[ Σ_{t=0}^{H−1} (d log π(u_t|x_t; θ)/dθ) Q̂_t(u_t, x_t) ]    (5)

Deterministic policy gradient theorem:
(d/dθ) E[G] = E[ Σ_{t=0}^{H−1} (du_t/dθ)(dQ̂_t(u_t, x_t)/du_t) ]    (6)

Q̂_t corresponds to an estimator of the remaining return Σ_{h=t}^{H−1} c_{h+1} from a particular state x when choosing action u. For Eq. 5 any estimator is acceptable; even a sample based estimate could be used. For Eq. 6, Q̂ is usually a differentiable surrogate model. Fig. 2a shows how these two theorems correspond to the same probabilistic computation graph. The intermediate nodes are the actions selected at each time step. The difference lies in the choice of jump estimator to estimate the total derivative following the intermediate nodes: the policy gradient theorem uses an LR gradient, whereas the deterministic policy gradient theorem uses a pathwise derivative to a surrogate model. We believe that the derivation based on a PCG is more intuitive than previous algebraic proofs [21, 19].

5 Novel algorithms

In Sec. 3.3 we explained how a particle-based mixture distribution is used for creating gradient estimators. In the following sections, we instead take advantage of these particles to estimate a different parameterization Γ, directly for the marginal distribution. Although the algorithms have general applicability, to make a concrete example, we explain them in reference to model-based policy gradients using a differentiable model considered in our previous work [14], for which the PCG is given in Fig. 2b. Stochastic value gradients [8], for example, share the same PCG.

5.1 Density estimation LR (DEL)

Following the explanation in Sec. 5, one could attempt to estimate the distribution parameters Γ from a set of sampled particles, then apply the LR gradient using the estimated distribution q(x; Γ). In particular, we will approximate the density as a Gaussian by estimating the mean µ̂ = Σ_i^P x_i/P and variance Σ̂ = Σ_i^P (x_i − µ̂)²/(P − 1). Then, using the standard LR trick, one can estimate the gradient Σ_i^P (d log q(x_i)/dθ)(G_i − b), where q(x) = N(µ̂, Σ̂). To use this method, one must compute derivatives of µ̂ and Σ̂ w.r.t. the particles x_i, then carry the gradient to the policy parameters using the chain rule while differentiating through the model, which is straightforward. We refer to our new method as the DEL estimator. Importantly, note that while q(x) is used for estimating the gradient, it is not in any way used for modifying the trajectory sampling.

Advantages of DEL: One can use LR gradients even if no noise is injected into the computations.
Disadvantages of DEL: The estimator is biased, and density estimation can be difficult.

5.2 Gaussian shaping gradient (GS)

[Figure 3: Computational paths in the Gaussian shaping gradient.]

Until now, all RL methods have used the second half total gradient equation (Eq. 3). Might one create estimators that use the first half equation (Eq. 4)? Fig. 3 gives an example of how this might be done. We propose to estimate the density at x_m by fitting a Gaussian on the particles. Then dE[c_m]/dΓ_m (the pink edges) will be estimated by sampling from this distribution (or by any other method of integration). This leaves the question of how to estimate dΓ_m/dθ (all paths from θ to x_m). Using the RP method is straightforward.
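Stepping back to the DEL estimator of Sec. 5.1, a minimal single-state sketch (the linear "model" x_i = θ + c_i, the frozen offsets c_i, and all constants are hypothetical stand-ins for simulated trajectories):

```python
import numpy as np

# DEL sketch: no noise is injected and the particles are deterministic in
# theta, yet an LR-style gradient is obtained by fitting a Gaussian
# q(x; mu_hat, var_hat) to the particle cloud.
rng = np.random.default_rng(3)
theta, P = 0.5, 5_000
c = rng.normal(0.0, 0.3, size=P)     # frozen particle offsets
x = theta + c                        # deterministic map, dx_i/dtheta = 1

mu_hat, var_hat = x.mean(), x.var()  # fitted density parameters
G = x**2                             # per-particle return phi(x_i)
b = G.mean()                         # baseline

score_mu = (x - mu_hat) / var_hat    # d log q(x_i) / d mu_hat
# Chain rule back to theta: d mu_hat/d theta = 1 here, and d var_hat/d theta
# = 0 because the offsets are fixed, so only the mean parameter carries
# gradient.
del_grad = np.mean(score_mu * (G - b)) * 1.0

print(del_grad, 2 * mu_hat)  # DEL estimate vs the analytic gradient under q
```

As the section notes, the estimate is biased by the density fit, but no sampling noise had to be injected into the computation.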
To estimate dΓ_m/dθ with the LR method instead, we first apply the second half total gradient equation on dΓ_m/dθ to obtain terms Σ_{r∈{θ→x_k}/IN} Π_{(p,t)∈r} ∂ζ_t/∂ζ_p (blue edges) and dΓ_m/dζ_{x_k} (red edges). In the scenarios we consider, the first of these terms is a single path, and will be estimated using RP. The second term is more interesting, and we will estimate this using an LR method.

As we are using a Gaussian approximation, the distribution parameters Γ_m are the mean and variance of x_m, which can be estimated as µ_m = E[x_m] and Σ_m = E[x_m x_mᵀ] − µ_m µ_mᵀ. We can obtain LR gradient estimates of these terms:

(d/dζ_{x_k}) E[x_m] = E_{x_k∼p(x_k;ζ_{x_k})}[ (d log p(x_k; ζ_{x_k})/dζ_{x_k}) (x_m − b_µ) ],
(d/dζ_{x_k}) E[x_m x_mᵀ] = E_{x_k∼p(x_k;ζ_{x_k})}[ (d log p(x_k; ζ_{x_k})/dζ_{x_k}) (x_m x_mᵀ − b_Σ) ],
(d/dζ_{x_k}) (µµᵀ) = 2µ (d/dζ_{x_k}) E[x_mᵀ].

In practice, we perform a sampling based estimate ζ̂_{x_k}, and one might be concerned that the estimators are conditional on the sample ζ̂_{x_k}, but we are interested in unconditional estimates. We will explain that the conditional estimate is equivalent. For the variance, note that µ_m is an estimate of the unconditional mean, so the whole estimate directly corresponds to an estimate of the unconditional variance. For the mean, apply the rule of iterated expectations: E_{x_k∼p(x_k;ζ_{x_k})}[x_m] = E_{ζ̂_{x_k}∼p(ζ̂_{x_k})}[ E_{x_k∼p(x_k;ζ̂_{x_k})}[x_m] ], from which it is clear that the conditional gradient estimate is an unbiased estimator for the gradient of the unconditional mean.

Efficient algorithm for accumulating gradients: In Fig. 3, for each x_k node, we want to perform an LR jump to every x_m node after k and compute a gradient with the Gaussian approximation of the distribution at node m. We accumulate across all nodes during a backwards pass in a backpropagation like manner. Note that for each k and each m, we can write the gradient as (dE[c_m]/dΓ_m)(dΓ_m/dζ_{x_k})(dζ_{x_k}/du_{k−1})(du_{k−1}/dθ), where dΓ_m/dζ_{x_k} is estimated as (d log p(x_k; ζ_{x_k})/dζ_{x_k}) z_m, and z_m corresponds to a vector summarizing the x_m − b_µ, etc. terms above. Note that (dE[c_m]/dΓ_m) z_m is just a scalar quantity g_m. We thus use an algorithm which accumulates a sum of all g during a backwards pass, and sums over all m nodes at each k node. See Alg. 1 for a detailed explanation of how it fits together with total propagation [14]. The final algorithm essentially just replaces the usual cost/reward with a modified value, and such an approach would also be applicable in model-free policy gradient algorithms using a stochastic policy and LR gradients.

Algorithm 1: Gaussian shaping gradient with total propagation
Gaussian shaping gradient for model-based policy search while combining both LR and RP variants using total propagation, an algorithm introduced in our previous work [14].
Forward pass: Sample a set of particle trajectories.
Backward pass:
Initialise: dG_{T+1}/dζ_{T+1} = 0, dJ/dθ = 0, G_{T+1} = 0   ▷ ζ are the distribution parameters, e.g. all of the µ and σ for each particle
for t = T to 1 do
    µ_t = E[x_t]; Σ_t = E[x_t x_tᵀ] − µ_t µ_tᵀ   ▷ Estimate the marginal distribution as a Gaussian
    Compute dE[c_t]/dµ_t and dE[c_t]/dΣ_t, e.g. by sampling from this Gaussian and using the RP gradient
    for each particle i do
        m_{i,t} = x_{i,t} − µ_t; v_{i,t} = vec(x_{i,t}x_{i,t}ᵀ − E[x_t x_tᵀ]); w_{i,t} = vec(m_{i,t} µ_tᵀ)   ▷ vec(∗) stacks the elements of a matrix/tensor into a column vector
        g_{i,t} = (dE[c_t]/dµ_t) m_{i,t} + (dE[c_t]/dΣ_t)(v_{i,t} − 2w_{i,t})   ▷ g is a scalar replacing the usual cost/reward
        G_{i,t} = G_{i,t+1} + g_{i,t}   ▷ G is the return (the cost of the remaining trajectory)
        dE[c_t]/dx_{i,t} = (dE[c_t]/dµ_t)(dµ_t/dx_{i,t}) + (dE[c_t]/dΣ_t)(dΣ_t/dx_{i,t})   ▷ Direct derivative of expected cost for the RP gradient
        dζ_{i,t+1}/dx_{i,t} = ∂ζ_{i,t+1}/∂x_{i,t} + (∂ζ_{i,t+1}/∂u_{i,t})(du_{i,t}/dx_{i,t})
        dG^{RP}_{i,t}/dζ_{i,t} = ( (dG_{i,t+1}/dζ_{i,t+1})(dζ_{i,t+1}/dx_{i,t}) + dE[c_t]/dx_{i,t} )(dx_{i,t}/dζ_{i,t})
        dG^{LR}_{i,t}/dζ_{i,t} = G_{i,t} (d log p(x_{i,t})/dζ_{i,t})   ▷ In principle, one could further subtract a baseline from G
        dG^{RP}_{i,t}/dθ = (dG^{RP}_{i,t}/dζ_{i,t})(dζ_{i,t}/du_{i,t−1})(du_{i,t−1}/dθ)
        dG^{LR}_{i,t}/dθ = (dG^{LR}_{i,t}/dζ_{i,t})(dζ_{i,t}/du_{i,t−1})(du_{i,t−1}/dθ)
    end for
    σ²_RP = trace(V[dG^{RP}_{i,t}/dθ]); σ²_LR = trace(V[dG^{LR}_{i,t}/dθ])   ▷ The sample variance of the particles
    k_LR = 1/(1 + σ²_LR/σ²_RP)   ▷ Weight to combine LR and RP estimators
    dJ/dθ = dJ/dθ + k_LR (1/P) Σ_i^P dG^{LR}_{i,t}/dθ + (1 − k_LR)(1/P) Σ_i^P dG^{RP}_{i,t}/dθ   ▷ Combine LR and RP in θ space
    for each particle i do
        dG_{i,t}/dζ_{i,t} = k_LR dG^{LR}_{i,t}/dζ_{i,t} + (1 − k_LR) dG^{RP}_{i,t}/dζ_{i,t}   ▷ Combine LR and RP in state space
    end for
end for

Two interpretations of GS: 1. We are making a Gaussian approximation of the marginal distribution at a node. 2. We are performing a type of reward shaping based on the distribution of the particles. In particular we are essentially promoting the trajectory distributions to stay unimodal, such that all of the particles concentrate at one “island” of reward rather than splitting the distribution between multiple regions of reward; this may simplify optimization.

6 Experiments

We performed model-based RL simulation experiments from the PILCO papers [5, 4]. We tested the cart-pole swing-up and balancing problems to test our GS approach, as well as combinations with total propagation [14]. We also tested the DEL approach on the simpler cart-pole balancing-only problem to show the feasibility of the idea. We compared particle-based gradients with our new estimators to PILCO. In our previous work [14], we had to change the cost function to obtain reliable results using particles; one of the primary motivations of the current experiments was to match PILCO's results using the same cost as the original PILCO had used (this is explained in greater detail in Section 6.4).

6.1 Model-based policy search background

We consider a model-based analogue to the model-free policy search methods introduced in Section 4. The corresponding probabilistic computation graph is given in Fig. 2b. Our notation follows our previous work [14]. After each episode all of the data is used to learn separate Gaussian process models [16] of each dimension of the dynamics, s.t. p(∆x^a_{t+1}) = GP(x̃_t), where x̃ = [x_tᵀ, u_tᵀ]ᵀ and x ∈ R^D, u ∈ R^F. This model is then used to perform “mental simulations” between the episodes to optimise the policy by gradient descent. We used a squared exponential covariance function k_a(x̃, x̃′) = s²_a exp(−(x̃ − x̃′)ᵀ Λ⁻¹_a (x̃ − x̃′)), and a Gaussian likelihood function with noise hyperparameter σ²_{n,a}.
The hyperparameters {s, Λ, σ_n} are trained by maximizing the marginal likelihood. The predictions have the form p(x^a_{t+1}) = N(µ(x̃_t), σ²_f(x̃_t) + σ²_n), where σ²_f(x̃_t) is an uncertainty about the model, and depends on the availability of data in a region of the state-space.

6.2 Setup

The cart-pole consists of a cart that can be pushed back and forth, and an attached pole. The state space is [s, β, ṡ, β̇], where s is the cart position and β the angle. The control is a force on the cart. The dynamics were the same as in a PILCO paper [4]. The setup follows our prior work [14].

Common properties in tasks: The experiments consisted of 1 random episode followed by 15 episodes with a learned policy, where the policy is optimized between episodes. Each episode length was 3 s, with a 10 Hz control frequency. Each task was evaluated separately 100 times with different random number seeds to test repeatability. The random number seeds were shared across different algorithms. Each episode was evaluated 30 times, and the cost was averaged, but note that this was done only for evaluation purposes; the algorithms only had access to 1 episode. The policy was optimized using an RMSprop-like learning rule [22] from our previous work [14], which normalizes the gradients using the sample variance of the gradients from different particles. In the model-based policy optimization, we performed 600 gradient steps using 300 particles for each policy gradient evaluation. The learning rate and momentum parameters were α = 5 × 10⁻⁴ and γ = 0.9 respectively, the same as in our previous work. The output from the policy was saturated by sat(u) = 9 sin(u)/8 + sin(3u)/8, where u = π̃(x).
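The squared exponential covariance and the saturation function described above can be sketched as follows (the saturation constants are from the text; the length-scale matrix Λ and the test points are illustrative assumptions):

```python
import numpy as np

# Sketch of the two fixed nonlinearities: the SE covariance of the dynamics
# model, k(x, x') = s^2 * exp(-(x - x')^T Lambda^{-1} (x - x')), and the
# policy saturation sat(u) = 9*sin(u)/8 + sin(3u)/8.
def k_se(x, x2, s2=1.0, lam=np.array([0.5, 0.5])):
    d = x - x2
    return s2 * np.exp(-d @ np.diag(1.0 / lam) @ d)

def sat(u):
    # algebraically equal to (3*sin(u) - sin(u)**3 * ... )/2 form; squashes
    # any input into [-1, 1], with sat(pi/2) = 1 exactly
    return 9 * np.sin(u) / 8 + np.sin(3 * u) / 8

x = np.array([0.1, -0.2])
y = np.array([0.3, 0.4])
print(k_se(x, x))                  # s^2 at zero distance
print(k_se(x, y), k_se(y, x))      # symmetric
u = np.linspace(-10, 10, 1001)
print(np.abs(sat(u)).max())        # bounded by 1
```

Writing sat(u) = (3 sin u − sin³u)/2 shows it is monotone in sin u and attains ±1 at u = ±π/2, which is why it gives a smooth, bounded control signal.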
The policy π̃ was a radial basis function network (a sum of Gaussians) with 50 basis functions and a total of 254 parameters. The cost functions were of the type 1 − exp(−(x − t)^T Q (x − t)), where t is the target. We considered two types of cost functions: 1) Angle Cost, a cost where Q = diag([1, 1, 0, 0]) is a diagonal matrix; 2) Tip Cost, a cost from the original PILCO papers, which depends on the distance of the tip of the pendulum to the position of the tip when it is balanced. These cost functions are conceptually different: with the Tip Cost the pendulum could be swung up from either direction, while with the Angle Cost there is only one correct direction. The base observation noise levels were σ_s = 0.01 m, σ_β = 1 deg, σ_ṡ = 0.1 m/s, σ_β̇ = 10 deg/s, and these were modified with a multiplier k ∈ {10^-2, 1}, such that σ^2 = kσ^2_base.

Cart-pole swing-up and balancing. In this task the pendulum starts hanging downwards, and must be swung up and balanced. We took some results from our previous work [14]: PILCO; reparameterization gradients (RP); Gaussian resampling (GR); batch importance weighted LR with a batch importance weighted baseline (LR); and total propagation combining BIW-LR and RP (TP). We compared these to the new methods: Gaussian shaping gradients using the BIW-LR component (GLR), and Gaussian shaping gradients combining BIW-LR and RP variants using total propagation (GTP). Moreover, we tested GTP when the model noise variance was multiplied by 25 (GTP+σ_n).

Cart-pole balancing with DEL estimator. This task is much simpler: the pole starts upright and must be balanced. The experiment was devised to show that DEL is feasible and may be useful if further developed. The Angle Cost and the base noise level were used.

6.3 Results

The results are presented in Table 1 and in Fig. 4.
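For concreteness, the RBF policy and the saturating cost from the setup above can be sketched as below. The shapes and parameterization are illustrative assumptions (the actual policy has 50 bases and 254 parameters, and its basis widths are part of the learned parameters).

```python
import numpy as np

def rbf_policy(x, centers, W_inv, weights):
    # Sum-of-Gaussians policy: pi~(x) = sum_i w_i exp(-(c_i - x)^T W^{-1} (c_i - x)).
    # centers: (n_basis, state_dim); weights: (n_basis,). Parameterization is an
    # illustrative assumption, not the exact one used in the paper.
    d = centers - x
    phi = np.exp(-np.einsum('ij,jk,ik->i', d, W_inv, d))
    return phi @ weights

def saturating_cost(x, target, Q):
    # Cost 1 - exp(-(x - t)^T Q (x - t)), bounded in [0, 1).
    e = x - target
    return 1.0 - np.exp(-e @ Q @ e)

# Angle Cost from the experiments: penalize cart position and pole angle only.
Q_angle = np.diag([1.0, 1.0, 0.0, 0.0])
```

The Tip Cost would instead apply the same saturating form to the Euclidean distance between the pendulum tip and its balanced position.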
Similarly to our previous work [14], with low noise, methods which include LR components do not work well. However, the GTP+σ_n experiments show that injecting more noise into the model predictions can solve the problem. The main important result is that GTP matches PILCO in the Tip Cost scenarios. In our previous work [14], one of the concerns was that TP had not matched PILCO in this scenario. Looking only at the costs in Fig. 4b and 4c does not adequately display the difference. In contrast, the success rates show that TP did not perform as well. The success rates were measured both by a threshold which was calibrated in previous work (final loss below 15) as well as by visually classifying all experimental runs. Both methods agreed.

Table 1: Success rate of learning cart-pole swing-up

Cost func.   σ^2_o multiplier   PILCO   RP     GR     LR     TP     GTP    GLR    GTP+σ_n
Angle Cost   k = 10^-2          0.88    0.69   0.63   0.57   0.82   0.65   0.42   0.88
Angle Cost   k = 1              0.79    0.74   0.89   0.96   0.99   0.9    0.93   -
Tip Cost     k = 10^-2          0.92    0.44   0.47   0.36   0.54   0.6    0.45   0.8
Tip Cost     k = 1              0.73    0.15   0.68   0.28   0.48   0.69   0.35   -

Figure 4: Data-efficiency and performance of learning algorithms on cart-pole tasks. (a) Cart-pole balancing only; (b) swing-up and balancing, all experimental runs; (c) swing-up and balancing, top 40 experimental runs. Figures 4b and 4c correspond to the k = 1, Tip Cost case.

The losses of the peak performers at the final episode were TP: 11.14 ± 1.73, GTP: 9.78 ± 0.40, PILCO: 9.10 ± 0.22, which also show that TP was significantly worse. While the peak performers were still improving, the remaining experiments had converged. PILCO still appears slightly more data-efficient; however, the difference has little practical significance as the required amount of data is low. Also note that in Fig. 4b TP has smaller variance.
The larger variance of GTP and PILCO is caused by outliers with a large loss. These outliers converged to a local minimum, which takes advantage of the tail of the Gaussian approximation of the state distribution; this contrasts with prior suggestions that PILCO performs exploration using the tail of the Gaussian [5].

6.4 Discussion

Our work demystifies the factors which contributed to the success of PILCO. It was previously suggested that the Gaussian approximations in PILCO smooth the reward, and cause unimodal trajectory distributions, simplifying the optimization problem [10, 6]. In our previous work [14], we showed that the main advantage was actually that it prevents the curse of chaos/exploding gradients. In the current work we decoupled the gradient and reward effects, and provided evidence that both factors contributed to the success of Gaussian distributions. While GR often has similar performance to GTP, there is an important conceptual difference: GR performs resampling, hence the trajectory distribution is not an estimate of the true trajectory distribution. Moreover, unlike resampling, GTP does not remove the temporal dependence in particles, which may be important in some applications.

7 Conclusions & future work

We have created an intuitive graphical framework for visualizing and deriving gradient estimators in a graph of probabilistic computations. Our method provides new insights towards previous policy gradient theorems in the literature. We derived new gradient estimators based on density estimation (DEL), as well as based on the idea to perform a jump estimation to an intermediate node, not directly to the expected cost (GS). The DEL estimator needs to be further developed, but it has good conceptual properties, as it should not suffer from the curse of chaos, nor does it require injecting noise into computations.
The GS estimator allows differentiating through discrete computations in a manner that will still allow backpropagating pathwise derivatives. Finally, we provided additional evidence towards demystifying the success of the popular PILCO algorithm. We hope that our work could lead towards new automatic gradient estimation software frameworks which are concerned not only with computational speed, but also with the accuracy of the estimated gradients.

Acknowledgments

We thank the anonymous reviewers for useful comments. This work was supported by OIST Graduate School funding and by JSPS KAKENHI Grant Numbers JP16H06563 and JP16K21738.

References

[1] Asadi, K., Allen, C., Roderick, M., Mohamed, A.-r., Konidaris, G., and Littman, M. (2017). Mean actor critic. arXiv preprint arXiv:1709.00503.

[2] Titsias, M. K. and Lázaro-Gredilla, M. (2015). Local expectation gradients for black box variational inference. In Advances in Neural Information Processing Systems, pages 2638-2646.

[3] Ciosek, K. and Whiteson, S. (2017). Expected policy gradients. arXiv preprint arXiv:1706.05374.

[4] Deisenroth, M. P., Fox, D., and Rasmussen, C. E. (2015). Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408-423.

[5] Deisenroth, M. P. and Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465-472.

[6] Gal, Y., McAllister, R., and Rasmussen, C. (2016). Improving PILCO with Bayesian neural network dynamics models. In Workshop on Data-efficient Machine Learning, ICML.

[7] Greensmith, E., Bartlett, P. L., and Baxter, J.
(2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471-1530.

[8] Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. (2015). Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944-2952.

[9] Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347.

[10] McHutchon, A. (2014). Modelling nonlinear dynamical systems with Gaussian processes. PhD thesis, University of Cambridge.

[11] Mnih, A. and Rezende, D. (2016). Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188-2196.

[12] Murray, I. (2016). Differentiation of the Cholesky decomposition. arXiv preprint arXiv:1602.07527.

[13] Naumann, U. (2008). Optimal Jacobian accumulation is NP-complete. Mathematical Programming, 112(2):427-441.

[14] Parmas, P., Rasmussen, C. E., Peters, J., and Doya, K. (2018). PIPPS: Flexible model-based policy search robust to the curse of chaos. In International Conference on Machine Learning.

[15] Pearl, J. (2014). Probabilistic reasoning in intelligent systems: networks of plausible inference. Elsevier.

[16] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.

[17] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.

[18] Schulman, J., Heess, N., Weber, T., and Abbeel, P. (2015). Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528-3536.

[19] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning.

[20] Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge.

[21] Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057-1063.

[22] Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26-31.

[23] Tokui, S. and Sato, I. (2017). Evaluating the variance of likelihood-ratio gradient estimators. In International Conference on Machine Learning, pages 3414-3423.

[24] Van Seijen, H., Van Hasselt, H., Whiteson, S., and Wiering, M. (2009). A theoretical and empirical analysis of expected SARSA. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 177-184. IEEE.