{"title": "On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient", "book": "Advances in Neural Information Processing Systems", "page_first": 1000, "page_last": 1008, "abstract": "Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (a) using the past experience to estimate {\\em only} the gradient of the expected return $U(\\theta)$ at the current policy parameterization $\\theta$, rather than to obtain a more complete estimate of $U(\\theta)$, and (b) using past experience under the current policy {\\em only} rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines---a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds.", "full_text": "On a Connection between Importance Sampling and\n\nthe Likelihood Ratio Policy Gradient\n\nDepartment of Electrical Engineering and Computer Science\n\nUniversity of California, Berkeley\n\nJie Tang and Pieter Abbeel\n\n{jietang, pabbeel}@eecs.berkeley.edu\n\nBerkeley, CA 94709\n\nAbstract\n\nLikelihood ratio policy gradient methods have been some of the most successful\nreinforcement learning algorithms, especially for learning on physical systems.\nWe describe how the likelihood ratio policy gradient can be derived from an im-\nportance sampling perspective. 
This derivation highlights how likelihood ratio\nmethods under-use past experience by (i) using the past experience to estimate\nonly the gradient of the expected return U (\u03b8) at the current policy parameteri-\nzation \u03b8, rather than to obtain a more complete estimate of U (\u03b8), and (ii) using\npast experience under the current policy only rather than using all past experience\nto improve the estimates. We present a new policy search method, which lever-\nages both of these observations as well as generalized baselines\u2014a new technique\nwhich generalizes commonly used baseline techniques for policy gradient meth-\nods. Our algorithm outperforms standard likelihood ratio policy gradient algo-\nrithms on several testbeds.\n\n1\n\nIntroduction\n\nPolicy gradient methods have been some of the most effective learning algorithms for dynamic con-\ntrol tasks in robotics. They have been applied to a variety of complex real-world reinforcement\nlearning problems, such as hitting a baseball with an articulated arm robot [1], constrained hu-\nmanoid robotic motion planning [2], and learning gaits for legged robots [3, 4, 5]. For such robotics\ntasks real-world trials are typically the most time consuming factor in the learning process. Making\nef\ufb01cient use of limited experience is crucial for good performance.\nIn this paper we describe a novel connection between likelihood ratio based policy gradient methods\nand importance sampling. Speci\ufb01cally, we show that the likelihood ratio policy gradient estimate\nis equivalent to the gradient of an importance sampled estimate of the expected return function\nestimated using only data from the current policy. 
This insight indicates that likelihood ratio policy gradients are quite naive in terms of data use, and it suggests an opportunity for novel algorithms which use all past data more efficiently by working with the importance sampled expected return function directly.
Our main contributions are as follows. First, we develop algorithms for global search over the importance sampled expected return function, allowing us to make more progress for a given amount of experience. Our approach uses estimates of the importance sampling variance to constrain the search in a principled way. Second, we derive generalizations of optimal policy gradient baselines which are applicable to the importance sampled expected return function.
Section 2 describes preliminaries on Markov decision processes (MDPs), policy gradient methods and importance sampling. Section 3 describes the novel connection between importance sampling and likelihood ratio policy gradients, and Section 4 examines our novel minimum variance baselines. Section 5 outlines our proposed method. Section 6 relates our method to prior work. Sections 7 and 8 describe the experimental setup and demonstrate the effectiveness of the proposed methods on standard reinforcement learning testbeds.

2 Preliminaries
Markov Decision Processes. A Markov decision process (MDP) is a tuple $(S, A, T, R, D, \gamma, H)$, where $S$ is a set of states; $A$ is a set of actions/inputs; $T = \{P(\cdot|s,u)\}_{s,u}$ is a set of state transition probabilities ($P(\cdot|s,u)$ is the state transition distribution upon taking action $u$ in state $s$); $R : S \times A \mapsto \mathbb{R}$ is the reward function; $D$ is a distribution over states from which the initial state $s_0$ is drawn; $0 < \gamma < 1$ is the discount factor; and $H$ is the horizon time of the MDP, so that the MDP terminates after $H$ steps.¹ A policy $\pi$ is a mapping from states $S$ to a probability distribution over the set of actions $A$. We will consider policies parameterized by a vector $\theta \in \mathbb{R}^n$. 
We denote the expected return of a policy $\pi_\theta$ by

$$U(\theta) = E_{P(\tau;\theta)}\Big[\sum_{t=0}^H \gamma^t R(s_t, u_t) \,\Big|\, \pi_\theta\Big] = \sum_\tau P(\tau;\theta)\, R(\tau). \qquad (2.1)$$

Here $P(\tau;\theta)$ is the probability distribution induced by the policy $\pi_\theta$ over all possible state-action trajectories $\tau = (s_0, u_0, s_1, u_1, \ldots, s_H, u_H)$. We overload notation and let $R(\tau) = \sum_{t=0}^H \gamma^t R(s_t, u_t)$ be the (discounted) sum of rewards accumulated along the state-action trajectory $\tau$.
Likelihood Ratio Policy Gradient. Likelihood ratio policy gradient methods perform a (stochastic) gradient ascent over the policy parameter space $\Theta$ to find a local optimum of $U(\theta)$. One well-known technique called REINFORCE [6, 7] expresses the gradient $\nabla_\theta U(\theta)$ as follows:

$$g = \nabla_\theta U(\theta) = E_{P(\tau;\theta)}[\nabla_\theta \log P(\tau;\theta)\, R(\tau)] \approx \hat g = \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)};\theta)\, R(\tau^{(i)}),$$

where the rightmost expression provides us an unbiased estimate of the policy gradient from $m$ sample paths $\{\tau^{(1)}, \ldots, \tau^{(m)}\}$ obtained from acting under policy $\pi_\theta$. Using the Markov assumption, we can decompose $P(\tau;\theta)$ into a product of conditional probabilities and we obtain $\nabla_\theta \log P(\tau^{(i)};\theta) = \sum_{t=0}^H \nabla_\theta \log \pi_\theta(u_t^{(i)} | s_t^{(i)})$. Hence no access to a dynamics model is required to compute an unbiased estimate of the policy gradient. 
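As a concrete illustration (our own, not from the paper), the likelihood ratio estimator above can be sketched for a one-step Gaussian "bandit" policy $u \sim N(\theta, 1)$, where $\nabla_\theta \log \pi_\theta(u) = u - \theta$; the reward function and parameter values below are arbitrary illustrative choices:

```python
import random

def reinforce_gradient_estimate(theta, reward, m, seed=0):
    """Likelihood ratio estimate of dU/dtheta:
    (1/m) * sum_i grad_theta log pi_theta(u_i) * R(u_i),
    for the one-step Gaussian policy u ~ N(theta, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        u = rng.gauss(theta, 1.0)   # sample an action from pi_theta
        grad_log_pi = u - theta     # gradient of log N(u; theta, 1) w.r.t. theta
        total += grad_log_pi * reward(u)
    return total / m

# Illustrative reward: R(u) = -(u - 3)^2, so U(theta) = -((theta - 3)^2 + 1)
# and the analytic gradient at theta = 0 is -2 * (0 - 3) = 6.
reward = lambda u: -(u - 3.0) ** 2
g_hat = reinforce_gradient_estimate(theta=0.0, reward=reward, m=200000)
```

With enough samples the estimate concentrates around the analytic gradient $-2(\theta - 3)$, without ever differentiating the reward or the dynamics.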
REINFORCE has been shown to be moderately efficient in terms of the number of samples used [6, 7]. To reduce the variance it is common to use baselines. Since

$$E_P[\nabla_\theta \log P(\tau;\theta)] = \sum_\tau \nabla_\theta P(\tau;\theta) = \nabla_\theta 1 = 0,$$

we can add $b^\top \nabla_\theta \log P(\tau;\theta)$ (where $b$ is a vector which can be optimized to minimize variance) to the REINFORCE gradient estimate without biasing it [8, 9]. Past work often used a scalar $b$, resulting in:

$$\nabla_\theta U(\theta) = E_{P(\tau;\theta)}[\nabla_\theta \log P(\tau;\theta)(R(\tau) - b)] \approx \hat g = \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log P(\tau^{(i)};\theta)\,(R(\tau^{(i)}) - b).$$

Importance Sampling. For a general function $f$ and a probability measure $P$, computing a quantity of interest of the form

$$E_{P(X)}[f(X)] = \int_x P(x) f(x)\, dx$$

can be computationally challenging. The expectation is often approximated with a sample-based estimate. However, samples from $P$ could be difficult to obtain, or $P$ might have very low probability where $f$ takes its largest values. Importance sampling provides an alternative solution which uses samples from a different distribution $Q$. Given samples from $Q$, we can estimate the expectation w.r.t. $P$ as:

$$E_{P(X)}[f(X)] = E_{Q(X)}\Big[\frac{P(X)}{Q(X)} f(X)\Big] \approx \frac{1}{m}\sum_{i=1}^m \frac{P(x^{(i)})}{Q(x^{(i)})} f(x^{(i)}) \quad \text{with } x^{(i)} \sim Q.$$

In the above, we assume $Q(x) = 0 \Rightarrow P(x) = 0$. Hence, one can sample from a different distribution $Q$ and then simply re-weight the samples to obtain an unbiased estimate. This can be readily leveraged to estimate the expected return of a stochastic policy [10] as follows:

$$\hat U(\theta) = \frac{1}{m}\sum_{i=1}^m \frac{P(\tau^{(i)};\theta)}{Q(\tau^{(i)})} R(\tau^{(i)}), \qquad \tau^{(i)} \sim Q, \qquad (2.2)$$

where we assume $Q(\tau) = 0 \Rightarrow P(\tau;\theta) = 0$. If we choose $Q(\tau) = P(\tau;\theta')$, then we are estimating the return of a policy $\pi_\theta$ from sample paths obtained from acting according to a policy $\pi_{\theta'}$. 
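To make the re-weighting concrete, here is a small sketch (our own illustration; the choices of $P$, $Q$, and $f$ are arbitrary) that estimates $E_P[f(X)]$ for $P = N(0, 1)$ using samples from a wider proposal $Q = N(0, 2^2)$:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_sampling_estimate(f, m, seed=0):
    """Estimate E_P[f(X)] for P = N(0,1) from samples of Q = N(0,4):
    (1/m) * sum_i (P(x_i)/Q(x_i)) * f(x_i), with x_i ~ Q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        x = rng.gauss(0.0, 2.0)                                  # sample from Q
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)    # weight P/Q
        total += w * f(x)
    return total / m

# E_P[X^2] is the variance of N(0,1), i.e. 1.
estimate = importance_sampling_estimate(lambda x: x * x, m=200000)
```

Because $Q$ is wider than $P$, the importance weights stay bounded and the estimator is well behaved; a proposal narrower than $P$ would give heavy-tailed weights and high variance.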
Evaluating the importance weights does not require a dynamics model:

$$\frac{P(\tau^{(i)};\theta)}{P(\tau^{(i)};\theta')} = \frac{\prod_{t=0}^H \pi_\theta(u_t|s_t)}{\prod_{t=0}^H \pi_{\theta'}(u_t|s_t)}.$$

If we have samples from many different distributions $P(\tau;\theta^{(j)})$, a standard technique is to create a fused empirical distribution $Q(\tau) = \frac{1}{m}\sum_{j=1}^m P(\tau;\theta^{(j)})$ to enable use of all past data [10].

¹ Any infinite horizon MDP with discounted rewards can be $\epsilon$-approximated by a finite horizon MDP, using a horizon $H_\epsilon = \lceil \log_\gamma(\epsilon(1-\gamma)/R_{\max}) \rceil$, where $R_{\max} = \max_s |R(s)|$.

3 Likelihood Ratio Policy Gradient via Importance Sampling
We now outline a novel connection between policy gradients and importance sampling. A set of trajectories $\{\tau^{(1)}, \ldots, \tau^{(m)}\}$ sampled from policy $\pi_{\theta^*}$ induces a distribution over paths $Q(\tau) = P(\tau;\theta^*)$. Let $\hat U(\theta^*)$ denote the importance sampled estimate of $U(\theta)$ at $\theta^*$. Using Equation (2.2), we have:

$$\frac{\partial \hat U}{\partial \theta_j}(\theta^*) = \frac{1}{m}\sum_{i=1}^m \frac{1}{Q(\tau^{(i)})} \frac{\partial P(\tau^{(i)};\theta^*)}{\partial \theta_j} R(\tau^{(i)}) = \frac{1}{m}\sum_{i=1}^m \frac{P(\tau^{(i)};\theta^*)}{Q(\tau^{(i)})} \frac{\partial \log P(\tau^{(i)};\theta^*)}{\partial \theta_j} R(\tau^{(i)}) = \frac{1}{m}\sum_{i=1}^m \frac{\partial \log P(\tau^{(i)};\theta^*)}{\partial \theta_j} R(\tau^{(i)}) \qquad (3.1)$$

(using $Q(\tau) = P(\tau;\theta^*)$). Equation (3.1) is the $j$'th entry of the likelihood ratio based estimate of the gradient of $U(\theta)$ at $\theta^*$. This analysis shows that the standard likelihood ratio policy gradient can be interpreted as forming an importance sampling based estimate of the expected return based on the runs under the current policy $\pi_{\theta^*}$, and then using this estimate of the expected return function only to estimate a gradient at $\theta^*$. In doing so, it fails to make efficient use of the trials from past policies: (i) it only uses the gradient of the function $\hat U(\theta)$ at the point $\theta^*$, rather than all information provided by the function $\hat U(\theta)$, and (ii) it only uses the runs under the most recent policy $\pi_{\theta^*}$, rather than using a more informed importance sampling based estimate that uses all past data.
Instead of only using local information from a single policy to drive our learning, we can use global information provided by $\hat U(\theta)$ using trials run under all past policies. Such importance sampling based methods (as have been proposed in [10]) should be able to learn from fewer trial runs than the currently widely popular likelihood ratio based methods.
Generalization to G(PO)MDP / Policy Gradient Theorem formulation. The observation that past rewards do not depend on future states or actions is leveraged by the G(PO)MDP [8] and the Policy Gradient Theorem [11] variations on REINFORCE to reduce the variance on their gradient estimates. This same observation can also be leveraged when estimating the expected return function itself. 
Let $\tau_{1:t}$ denote the state-action sequence experienced from time 1 through time $t$; then we have

$$U(\theta) = \sum_\tau P(\tau;\theta) R(\tau) = \sum_{t=0}^H \sum_{\tau_{1:t}} P(\tau_{1:t};\theta)\, \gamma^t R(s_t, u_t). \qquad (3.2)$$

For simplicity of notation we will continue to describe our approach in terms of the expression for $U(\theta)$ given in Equation (2.1), but our generalization of baselines and our policy search algorithm are equally applicable when using the expression for $U(\theta)$ we present in Equation (3.2).

4 Generalized Unbiased Baselines
Previous work has shown that the REINFORCE gradient estimate benefits greatly from the addition of an optimal baseline term [12, 9, 8]. In this section, we show that policy gradient baselines are special cases of a more general variance reduction technique. Our result generalizes policy gradient baselines in three ways: (i) it applies to estimating expectations of any random quantity, not just policy gradients; (ii) it allows for baseline matrices and higher-dimensional tensors, not just vectors; and (iii) it can be applied recursively to yield baseline terms for baselines, since baselines are themselves expectations.
Minimum Variance Unbiased Baselines. Given a random variable $X \sim P_\theta(X)$, where $P_\theta$ is a parametric probability distribution with parameter $\theta$, we have that $E_{P_\theta}[\nabla_\theta \log P_\theta(X)] = 0$. Hence for any constant vector $b$ and any scalar function $h(X)$, the quantity $\frac{1}{m}\sum_{i=1}^m \left(h(x^{(i)}) - b^\top \nabla_\theta \log P_\theta(x^{(i)})\right)$ with $x^{(i)}$ drawn from $P_\theta$ is an unbiased estimator of the scalar quantity $E_{P_\theta}[h(X)]$. The variance of this estimator is minimized when the variance of the random variable $g(X) = h(X) - b^\top \nabla_\theta \log P_\theta(X)$ is minimized. 
This variance is given by:

$$\mathrm{Var}_{P_\theta}[h(X) - b^\top \nabla_\theta \log P_\theta(X)] = E_{P_\theta}\left[\left(h(X) - b^\top \nabla_\theta \log P_\theta(X)\right)^2\right] - \left(E_{P_\theta}[h(X) - b^\top \nabla_\theta \log P_\theta(X)]\right)^2.$$

As $b^\top E_{P_\theta}[\nabla_\theta \log P_\theta(X)] = 0$, the second term is independent of $b$. Setting the gradient of the first term with respect to $b$ equal to zero yields the minimum variance baseline

$$b = E_{P_\theta}\left[\nabla_\theta \log P_\theta(X) \nabla_\theta \log P_\theta(X)^\top\right]^{-1} E_{P_\theta}\left[\nabla_\theta \log P_\theta(X)\, h(X)\right]. \qquad (4.1)$$

The baselines commonly employed with REINFORCE, GPOMDP, and other likelihood ratio policy gradient methods can be derived as special cases of this generalized baseline [12].
Minimum Variance Unbiased Baselines with Importance Sampling. When using importance sampling with $x^{(i)}$ drawn from $Q$, we have an unbiased estimator of the form $\frac{1}{m}\sum_{i=1}^m \frac{P_\theta(x^{(i)})}{Q(x^{(i)})}\left(h(x^{(i)}) - b^\top \nabla_\theta \log P_\theta(x^{(i)})\right)$ with a minimum variance baseline vector

$$b = E_Q\left[\frac{P_\theta(X)}{Q(X)} \nabla_\theta \log P_\theta(X)\, \frac{P_\theta(X)}{Q(X)} \nabla_\theta \log P_\theta(X)^\top\right]^{-1} E_Q\left[\frac{P_\theta(X)}{Q(X)} \nabla_\theta \log P_\theta(X)\, \frac{P_\theta(X)}{Q(X)} h(X)\right]. \qquad (4.2)$$

Baselines. The minimum variance technique is naturally extended to vector-valued or matrix-valued random variables $h(X)$. For each entry in $h(X)$ we can compute a minimum variance baseline vector $b$ using Equation (4.1) or (4.2). In general, if $h(X)$ is an $n$-dimensional tensor, we can stack these baseline vectors into an $(n+1)$-dimensional tensor. Indeed, in the case of REINFORCE we would obtain a baseline matrix, rather than a baseline scalar (as in the original work [7]) and rather than a vector baseline (as described in later work, such as [12]). The baselines themselves are estimated from sample data. 
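For intuition, here is Equation (4.1) in the scalar case (our own toy example, not from the paper): with score $g(x) = \nabla_\theta \log P_\theta(x)$, the minimum variance baseline is $b = E[g\,h]/E[g^2]$, and subtracting $b\,g$ from $h$ reduces the variance of the estimate of $E_{P_\theta}[h(X)]$ without biasing it:

```python
import random

rng = random.Random(0)
theta = 0.0

# Samples from P_theta = N(theta, 1); score g(x) = x - theta; target h(x) = (x - 2)^2.
xs = [rng.gauss(theta, 1.0) for _ in range(5000)]
g = [x - theta for x in xs]
h = [(x - 2.0) ** 2 for x in xs]

# Sample-based minimum variance baseline, scalar case of Equation (4.1):
# b = E[g h] / E[g g].
b = sum(gi * hi for gi, hi in zip(g, h)) / sum(gi * gi for gi in g)

def variance(vals):
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

var_plain = variance(h)                                     # no baseline
var_base = variance([hi - b * gi for gi, hi in zip(g, h)])  # with baseline
```

In this example the population baseline is $b = -4$, and the per-sample variance drops from 18 to 2; the control variate removes the component of $h$ that is linearly predictable from the score.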
Using standard policy gradient methods, it can be impractical to run enough trials to accurately fit such baselines. By using importance sampling to reuse data we can use richer baseline terms in our estimators.
Recursive Baselines. The baselines are themselves composed of expectations. It is possible to recursively insert minimum variance unbiased baseline terms into these expectations in order to reduce the variance on the baseline estimates. However, the number of baseline parameters being estimated increases rapidly in this recursive process. Moreover, if we estimate multiple expectations from the same set of samples, these estimates become correlated and the final result is no longer unbiased. In practice, these baselines can be regularized to match the amount of available data. In Section 8 we empirically investigate the performance of several different baseline schemes.

5 Policy Search Using $\hat U$
We propose the algorithm outlined in Figure 1. It uses importance sampling with optimal generalized baselines to obtain estimates $\hat U(\theta)$ of the expected return function based on the data gathered so far. This estimator allows us to search for a $\theta$ which improves the expected return. The algorithm maintains a list of candidate policy parameters from which it searches for improvements. Memory-based search allows backtracking away from unpromising parts of the search space without taking additional, costly trials on the real platform.

Input: domain of policy parameters $\Theta$, initial policy $\pi_{\hat\theta_0}$
for i = 0 to ... do
    1. Run M trials under policy $\pi_{\hat\theta_i}$
    2. Search within ESS region:
    for j = 1 : i do
        $\theta_j \leftarrow \hat\theta_j$
        while $\hat U(\theta_j)$ is improving do
            $g_j \leftarrow$ step direction($\hat U(\theta_j)$)
            $\alpha_j \leftarrow$ ESS aware line search($\hat U(\theta_j)$, $g_j$)
            $\theta_j \leftarrow \theta_j + \alpha_j g_j$
        end while
    end for
    3. Update policy: $\hat\theta_{i+1} = \arg\max_{\theta_j} \hat U(\theta_j)$
end for

Figure 1: Our policy search algorithm.

Estimate of Expected Returns: We use weighted importance sampling, and add a baseline to Equation (2.2):

$$\hat U(\theta) = \frac{1}{Z}\sum_{i=1}^m \frac{P(\tau^{(i)};\theta)}{Q(\tau^{(i)})}\left(R(\tau^{(i)}) - b^\top \nabla_\theta \log P(\tau^{(i)};\theta)\right), \qquad Z = \sum_{i=1}^m \frac{P(\tau^{(i)};\theta)}{Q(\tau^{(i)})}, \qquad (5.1)$$

where $i$ indexes over all past trials, and $Q$ is the empirical distribution over past trials (see Section 2).
Optimal Baseline: Applying Equation (4.2) we get the following sample based estimate of the optimal baseline $b$ for the estimate of the expected return function:²

$$b = \left(\frac{1}{m}\sum_{i=1}^m \left(\frac{P(\tau^{(i)};\theta)}{Q(\tau^{(i)})}\right)^2 \nabla_\theta \log P(\tau^{(i)};\theta)\, \nabla_\theta \log P(\tau^{(i)};\theta)^\top\right)^{-1} \left(\frac{1}{m}\sum_{i=1}^m \left(\frac{P(\tau^{(i)};\theta)}{Q(\tau^{(i)})}\right)^2 \nabla_\theta \log P(\tau^{(i)};\theta)\, R(\tau^{(i)})\right). \qquad (5.2)$$

ESS Search Region: As our policy search steps away from areas of $\Theta$ where we have gathered sample data, the variance of our estimator $\hat U$ increases and our function estimate becomes unreliable. The effective sample size $\mathrm{ESS} = \frac{M}{1 + \mathrm{Var}(w_i)}$ is commonly used to measure the quality of an importance sampled estimate [13]. Here the $w_i$ are the normalized importance weights and $M$ is the number of trials. Our policy search only considers parameter values $\theta$ with sufficiently high ESS.
Step Direction: We use the finite-difference gradient of $\hat U$ as the step direction for the inner loop of the policy search. 
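A sketch of the ESS computation (our own implementation; we use the common form $\mathrm{ESS} = 1/\sum_i \bar w_i^2$ with weights normalized to sum to one, which is algebraically equivalent to the $M/(1+\mathrm{Var}(w_i))$ expression above up to the normalization convention [13]):

```python
def effective_sample_size(weights):
    """ESS = 1 / sum_i wbar_i^2, with wbar the weights normalized to sum to 1.
    Equals M when all weights are equal, and approaches 1 when one weight dominates."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)

ess_uniform = effective_sample_size([1.0] * 10)            # all trials equally informative
ess_skewed = effective_sample_size([100.0] + [0.001] * 9)  # one trial dominates
```

A search procedure would reject candidate parameters $\theta$ whose weights against the fused past-trial distribution yield an ESS below some threshold, since the estimate $\hat U(\theta)$ is then effectively supported by very few trials.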
In theory, since every outer iteration searches for a local optimum within the ESS region, the choice of step direction affects only the amount of computation and not the number of trials required for convergence.³
Line Search: One issue with gradient based optimization methods is the need to choose the right step size. One solution is to use adaptive line search based step size rules like the Armijo rule [15].⁴ For traditional likelihood ratio policy search methods this would require additional trials. By contrast, no new trials are required when using importance sampling.⁵

6 Prior Work
Various past approaches use the idea of constructing a model of the system from sample data, which can be used to search for the optimal policy, e.g., [16], [10], [17]. In contrast to Sutton's DYNA, our method attempts to directly optimize the expected return function by varying policy parameters rather than building a model of the environment. Cao [17] also uses importance sampling to reuse past data for estimating policy gradients, but focuses on estimating local gradient information rather than global surface information. The work of Peshkin and Shelton [10] is most similar in spirit to our policy search method. They use importance sampling to construct a "proxy" environment from sampled data which can be used to evaluate the expected return at arbitrary policies. They apply a hill-climbing policy search to this "proxy" surface. This technique does not use estimates of the importance sampling variance to restrict the search, does not use generalized minimum variance baselines, and does not use memory. Our experiments show that these improvements are necessary to outperform standard policy gradient methods across our test domains.
Our general approach of estimating and optimizing the expected return function instead of the gradient of the expected return function allows for non-local policy steps. 
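Since $\hat U$ can be evaluated at any $\theta$ without gathering new trials, a backtracking line search is cheap. Below is a generic Armijo-style backtracking sketch for maximization (our own illustration; the parameter values are arbitrary, and the real algorithm would additionally enforce the ESS constraint on each candidate step):

```python
def armijo_step_size(f, x, direction, slope, alpha0=1.0, beta=0.5, c=1e-4, max_iter=50):
    """Backtracking line search for maximizing f: shrink alpha until the
    Armijo sufficient-increase condition f(x + a*d) >= f(x) + c*a*slope holds.
    `slope` is the directional derivative of f at x along `direction` (positive
    for an ascent direction)."""
    fx = f(x)
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * direction) >= fx + c * alpha * slope:
            return alpha
        alpha *= beta  # backtrack; costs only evaluations of f, no new trials
    return alpha

# Illustrative objective: f(x) = -(x - 2)^2, starting at x = 0 with ascent
# direction +1 (directional derivative f'(0) = 4).
f = lambda x: -(x - 2.0) ** 2
alpha = armijo_step_size(f, x=0.0, direction=1.0, slope=4.0)
```

Each trial-free evaluation of `f` here stands in for an evaluation of $\hat U$ on the reused past data, which is what makes adaptive step sizes affordable in this setting.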
Recent EM-based policy search methods [18, 14] are able to make larger steps by optimizing a local lower bound on the expected return function. These methods can use importance sampling to make better use of data. This lower bound objective function and update step could be used in our memory based approach instead of following the finite difference gradient step.
We explained throughout the paper the relationship with earlier methods such as REINFORCE [7, 6] and GPOMDP [8, 9]. PEGASUS [19] is an efficient alternative policy search method but can only be used if a simulation model is available.
Recent work has suggested following the natural gradient direction [20, 21, 22]. The natural gradient approach is a parameterization invariant second order method which finds the direction which

² Estimating the baseline from the same data as the other terms in Equation (5.1) results in a biased estimator. This is often done in policy gradient methods and we do so in our experiments. It is however possible to retain an unbiased estimate by data splitting, which could include averaging over resamplings.
³ In practice, since we cannot always find the true optimum of $\hat U$ within the ESS region, differences in step direction do affect which policies are sampled. Other step directions or policy improvement rules may be substituted for the finite difference gradient step. For example, we could follow the natural gradient direction, or use an EM-based policy update [14].
⁴ Though the Armijo rule has its own free parameters to choose, performance is much less sensitive to these hyper-parameters. We use the same Armijo rule parameters for all of our experiments.
⁵ We can extend standard likelihood ratio policy gradient methods to use the importance sampled expected return estimate. 
In our experience this approach yields results comparable to the best fixed hand-tuned step size for each problem, hence alleviating these methods' need for step size tuning.

Figure 2: (a) Performance of various choices for the higher level baselines in our approach. We have a matrix baseline (MAT), and a recursive baseline (REC). For reference, we also plot our approach without an optimal baseline (GLO), GPOMDP (GP), and IS GPOMDP (ISGP). (b), (c) Performance evaluation on LQR and Cartpole. The algorithms considered are the GPOMDP likelihood ratio policy gradient method (GP), GPOMDP with importance sampling (ISGP), Peshkin and Shelton's algorithm (PS), and our approach (OUR).

maximizes the ratio of the improvement of the objective function over the change in the distribution over trajectories. Our approach exploits a similar intuition through consideration of variance through the effective sample size (ESS), preferring regions for which the past experience gives a good estimate.
Natural actor critic (NAC) approaches have enjoyed substantial success on real-life robotics tasks [1, 23]. In the episodic setting, which we consider in this paper, the only difference between episodic NAC and the natural gradient is in the estimate of the baseline. Episodic NAC computes a scalar baseline by solving an LSTD-Q type regression rather than, e.g., using a minimum variance baseline criterion.⁶

7 Experimental Setup
We present experiments on four testbeds: LQR, cartpole, mountaincar, and acrobot. The details of each experimental testbed can be found in the appendix. Though the systems are simulated, the learning algorithms cannot make use of the simulation dynamics except by gathering trials. For each testbed we randomly generated a pool of initial policies until one is found that does not achieve the worst case return. We then used our policy gradient algorithms to optimize performance. 
The same set of initial policies is used across learning algorithms. We focus on an analysis of performance when only a small number of trials is allowed: in each of the following experiments we run 50 iterations of policy search, running M trials for each policy at each iteration.

8 Experimental Results
In our experimental results, we first evaluate several generalized baselines in the context of our policy search algorithm. We then break down the effectiveness of each component of our algorithm: memory based search, optimal baselines, and the ESS search region. Our policy search outperforms likelihood ratio methods on two of the testbeds and performs equally well on the two remaining ones. Performance is reported as the expected return (plotted on the y-axis) versus the number of sampled trials (plotted on the x-axis). Error bars are shown based on running each instance with 10 initial policies.
Generalized Baseline Experiments: There are a variety of choices in our generalized baseline technique: we can vary the dimensionality of the baseline terms to add, the depth of the recursive baseline, and what (if any) regularization to use.
We implemented our policy search using three different baseline techniques: a vector baseline, a matrix baseline, and a recursive tensor baseline on top of the matrix baseline. Figure 2 (a) shows the average reward received plotted against the number of trials run for the matrix (MAT) and recursive tensor (REC) baselines. The vector baseline was not able to improve the initial policies. The matrix baseline outperforms the other baselines and we use it going forward.
Components of Our Approach: Figure 3 examines each of the central contributions of our algorithm (memory based search, baselines, and ESS). 
We tested our approach without any of the three components, which is equivalent to Peshkin and Shelton's algorithm [10]; we label it PS. We added each one of the three components individually, labeled PS+M, PS+B, and PS+E. We also tested the performance with two out of three components, labeled OUR-M, OUR-B, and OUR-E respectively. Finally we tested the performance of our approach with all three components. The results indicate that each of the three components improves performance, with ESS and memory based search being the most important components. Without any one of the components our approach has difficulty outperforming importance sampled GPOMDP.
Comparison With Likelihood Ratio Policy Gradients: We have compared several episodic likelihood ratio algorithms against our global policy search algorithm. We run M = 10 trials per iteration, and repeat each experiment 10 times. For the likelihood ratio algorithms, we use the appropriate optimal baselines [12] and hand-tune the step size. As a comparison, we have also implemented policy gradient algorithms which use importance sampling to estimate the gradient of $\hat U$. 

⁶ The difference in performance due to the different estimation procedure for the scalar baseline has been observed to be so small that only one plot is shown rather than both in [1].

Figure 3: This figure demonstrates the effect of (a) memory based search, (b) optimal baselines, and (c) the ESS search region on cartpole performance. In each figure, we show the performance of Peshkin and Shelton's approach (PS) and our approach (OUR). In addition, we show the performance with memory only (PS+M), baselines only (PS+B), and ESS only (PS+E), and our approach with memory (OUR-M), baselines (OUR-B), and ESS (OUR-E) removed. GPOMDP (GP) and IS GPOMDP (ISGP) are also plotted for reference purposes.
Figure 2 plots the reward received as a function of the number of real trials sampled from the system. We plot our global search approach against GPOMDP, an importance sampled GPOMDP (IS GPOMDP), and an implementation of Peshkin and Shelton's global search.⁷ Our approach is consistently able to improve its initial policy, outperforming likelihood ratio policy gradient methods on both the cartpole and LQR testbeds. In general, importance sampling based methods outperform non-importance sampling based algorithms, which work poorly when given few trials. All algorithms in consideration performed poorly on the mountaincar and acrobot testbeds, with none of them showing significant improvement in performance through learning.

9 Conclusion
We have shown that policy gradient methods are a special case of gradient ascent over the importance sampled expected return function $\hat U$. Since our approach provides a full approximation of the expected return function, we can use global information in addition to gradient information to achieve faster learning. We have also shown that optimal baselines for standard policy gradient methods can be seen as special cases of a more general variance reduction technique. Our importance sampling approach allows us to leverage more data to fit generalized baseline terms in our estimators. 
Our experiments show that our algorithm requires fewer trials than current policy gradient methods on several testbeds and no more trials on the remaining testbeds, making it appealing for robotic learning tasks for which trials are expensive.

⁷ We do not plot REINFORCE as our experiments indicate that GPOMDP outperforms REINFORCE on these testbeds, a fact consistent with existing literature [1].

Acknowledgments
The authors thank Jan Peters and Hamid Reza Maei for insightful discussions and the anonymous reviewers for their feedback. This work was supported in part by NSF under award IIS-0931463. Jie Tang is supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.

Appendix
(i) LQR: We use the formulation given in [21]. We use a linear parameterized policy with parameters $K \in \mathbb{R}^2$, given by $u(t) \sim N(Lx(t), \sigma)$, with $L = -1.999 + \frac{1.998}{1+e^{K_1}}$ and $\sigma = 0.001 + \frac{1}{1+e^{K_2}}$.⁸ The initial state is drawn from $x(0) \sim N(0.3, 0.1)$, and the dynamics are given by $x(t+1) = 0.7\, x(t) + u(t) + N(0, 0.01)$. The system incurs a penalty of $-(x(t)^2 + u(t)^2)$ at each time step. Each episode was 20 time steps.
(ii) Cartpole: This task consists of a cart moving along a track while balancing a pole. The goal of this task is to move the cartpole back to the origin as quickly as possible while keeping the pole upright. Following the formulation given in [24], our control input is drawn from the policy $u \sim N(K^\top x, \sigma)$, with state $x = [x, \dot x, \theta, \dot\theta]$ and policy parameters $K = [K_1, K_2, K_3, K_4, \sigma]$. The dynamics are given by

$$\ddot x = \frac{F - m_p l\,(\ddot\theta \cos\theta - \dot\theta^2 \sin\theta)}{m_c + m_p}, \qquad \ddot\theta = \frac{g \sin\theta\,(m_c + m_p) - (u_t + m_p l \dot\theta^2 \sin\theta)\cos\theta}{\frac{4}{3} l (m_c + m_p) - m_p l \cos^2\theta}.$$

Here $m_p = 0.1$, $m_c = 1.0$, $l = 0.5$, $g = 9.81$. 
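The appendix dynamics can be integrated as described. Here is a minimal sketch (our own code, using an equivalent standard rearrangement of the cartpole equations of motion with the appendix constants; the zero control force and the initial state are arbitrary illustrative choices) of fourth order Runge-Kutta integration with a 0.02 s control interval:

```python
import math

MP, MC, L, G = 0.1, 1.0, 0.5, 9.81  # pole mass, cart mass, pole length parameter, gravity

def cartpole_deriv(state, force):
    """Time derivative of [x, xdot, theta, thetadot] for the cartpole dynamics."""
    x, xdot, theta, thetadot = state
    total = MC + MP
    temp = (force + MP * L * thetadot ** 2 * math.sin(theta)) / total
    theta_acc = (G * math.sin(theta) - math.cos(theta) * temp) / (
        L * (4.0 / 3.0 - MP * math.cos(theta) ** 2 / total))
    x_acc = temp - MP * L * theta_acc * math.cos(theta) / total
    return [xdot, x_acc, thetadot, theta_acc]

def rk4_step(state, force, dt=0.02):
    """One fourth order Runge-Kutta step with the force held constant over dt."""
    def add(s, k, scale):
        return [si + scale * ki for si, ki in zip(s, k)]
    k1 = cartpole_deriv(state, force)
    k2 = cartpole_deriv(add(state, k1, dt / 2), force)
    k3 = cartpole_deriv(add(state, k2, dt / 2), force)
    k4 = cartpole_deriv(add(state, k3, dt), force)
    return [s + dt / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

# With no control force, a slightly perturbed pole falls away from upright.
state = [0.0, 0.0, 0.01, 0.0]
for _ in range(100):
    state = rk4_step(state, force=0.0)
```

Holding the sampled policy force constant across the four substeps corresponds to a zero-order hold over the control interval, which is the usual discretization for such learned controllers.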
The control interval was 0.02s. We solve the dynamics using a fourth order Runge-Kutta method. We run each episode for 200 time steps, though the episode terminates once the cartpole has failed (defined as whenever |x| > 2.4m or |θ| > 0.7rad). The reward function is −2 for every time step after the failure occurs, 0 if the cartpole is balanced and satisfies |x| < 0.05, and −1 otherwise.

(iii) Mountain Car: The mountain car testbed [25] models a simulated car, which starts in a valley and must climb the hill to the right as quickly as possible. The task involves two states [x, ẋ] and three policy parameters [K1, K2, σ]. Our control inputs for this problem are restricted to {−1, 1}. Our parameterized policy is given by π(ut = 1|xt, ẋt) = P(K1 sign(ẋt)ẋt² + K2 + εt < xt), where εt ∼ N(0, σ). Our initial acceleration is f0 = +1; ft+1 = ut ft. The dynamics are given by ẋt+1 = ẋt + 0.001ft − 0.0025 cos(3(xt − 0.5)), and xt+1 = xt + ẋt+1. We run for 200 time steps, though the episode terminates once the mountain car reaches its target at x = 1.0. The reward function is 0 if the car is at its target and −1 otherwise.

(iv) Acrobot: The acrobot [25] is a robot with 2 rotational links connected by an actuated motor. It has four states [θ1, θ̇1, θ2, θ̇2] and parameters K = [K1, . . . , K8, σ]. The acrobot is initialized to be close to [π, 0, 0, 0] (pointing straight up), and the goal is to keep the acrobot balanced upright for as long as possible. Our control input is drawn from the policy u ∼ N(Lx + K⊤φ(x), σ). Here L is the optimal LQR controller for the acrobot linearized around the stationary point, and

φ(x) = [(π − θ1)θ2, (π − θ1)|π − θ1|, θ̇1|θ̇1|, θ2|θ2|, θ̇2, (π − θ1)θ̇1, θ2θ̇1, θ̇2|θ̇2|].

The dynamics are given by

θ̈1 = −(d2θ̈2 + φ1)/d1,
θ̈2 = (u + (d2/d1)φ1 − m2l1lc2 θ̇1² sin θ2 − φ2) / (m2lc2² + I2 − d2²/d1),
d1 = m1lc1² + m2(l1² + lc2² + 2l1lc2 cos θ2) + I1 + I2,
d2 = m2(lc2² + l1lc2 cos θ2) + I2,
φ1 = −m2l1lc2 θ̇2² sin θ2 − 2m2l1lc2 θ̇2θ̇1 sin θ2 + (m1lc1 + m2l1)g cos(θ1 − π/2) + φ2,
φ2 = m2lc2 g cos(θ1 + θ2 − π/2).

Here m1 = 1, m2 = 1, l1 = 1, l2 = 2, lc1 = 0.5, lc2 = 1, I1 = 0.0833, I2 = 0.33, g = 9.81. The control interval was 0.02s. We solve the dynamics using a fourth order Runge-Kutta method. Each episode is run for 400 time steps, though the episode terminates once the acrobot has failed (defined as whenever the height of the second link, −cos(θ1) − cos(θ1 + θ2), drops below 0.5). The reward function is −2 for every time step after the failure occurs, and −(1 − (−cos(θ1) − cos(θ1 + θ2))/2)² otherwise.

8We followed standard formulations of the control policy for LQR and cartpole. All policies are designed as functions of a linear combination of the policy parameters and hand-selected features.

References

[1] J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In Proceedings of the European Conference on Machine Learning (ECML), 2005.

[2] T. Mori, Y. Nakamura, M. Sato, and S. Ishii. Reinforcement learning for CPG-driven biped robot. In AAAI, 2004.

[3] R. Tedrake, T. W. Zhang, and H. S. Seung. Learning to walk in 20 minutes. In Proceedings of the Fourteenth Yale Workshop on Adaptive and Learning Systems, 2005.

[4] N. Kohl and P.
Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2004.

[5] J. Zico Kolter and Andrew Y. Ng. Learning omnidirectional path following using dimensionality reduction. In Proceedings of Robotics: Science and Systems (RSS), 2007.

[6] P. Glynn. Likelihood ratio gradient estimation: an overview. In Proceedings of the 1987 Winter Simulation Conference, Atlanta, GA, 1987.

[7] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

[8] J. Baxter and P. Bartlett. Direct gradient-based reinforcement learning. Journal of Artificial Intelligence Research, 1999.

[9] E. Greensmith, P. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 2004.

[10] Leonid Peshkin and Christian R. Shelton. Learning from scarce experience. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.

[11] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, 2000.

[12] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), 2006.

[13] A. Kong, J. S. Liu, and W. H. Wong. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89:278–288, 1994.

[14] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. In NIPS, 2008.

[15] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2004.

[16] Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 1991.

[17] Xi-Ren Cao. A basic formula for on-line policy-gradient algorithms. IEEE Transactions on Automatic Control, 50:696–699, 2005.

[18] Jan Peters and Stefan Schaal.
Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the International Conference on Machine Learning (ICML), pages 745–750, 2007.

[19] Andrew Ng and Michael Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 406–415, 2000.

[20] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10, 1998.

[21] S. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14, 2001.

[22] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In NIPS, 2007.

[23] Jan Peters. Machine Learning of Motor Skills for Robotics. PhD thesis, University of Southern California, 2007.

[24] M. Riedmiller, J. Peters, and S. Schaal. Evaluation of policy gradient methods and variants on the cart-pole benchmark. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.

[25] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.