Bayesian Policy Gradient Algorithms
Advances in Neural Information Processing Systems (pp. 457–464)

Mohammad Ghavamzadeh    Yaakov Engel
Department of Computing Science, University of Alberta
Edmonton, Alberta, Canada T6E 4Y8
{mgh,yaki}@cs.ualberta.ca

Abstract

Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte-Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost.

1 Introduction

Policy Gradient (PG) methods are Reinforcement Learning (RL) algorithms that maintain a parameterized action-selection policy and update the policy parameters by moving them in the direction of an estimate of the gradient of a performance measure. Early examples of PG algorithms are the class of REINFORCE algorithms of Williams [1], which are suitable for solving problems in which the goal is to optimize the average reward. Subsequent work (e.g., [2, 3]) extended these algorithms to the cases of infinite-horizon Markov decision processes (MDPs) and partially observable MDPs (POMDPs), and provided much needed theoretical analysis. However, both the theoretical results and empirical evaluations have highlighted a major shortcoming of these algorithms, namely, the high variance of the gradient estimates.
This problem may be traced to the fact that in most interesting cases, the time-average of the observed rewards is a high-variance (although unbiased) estimator of the true average reward, resulting in the sample-inefficiency of these algorithms.

One solution proposed for this problem was to use a small (i.e., smaller than 1) discount factor in these algorithms [2, 3]; however, this creates another problem by introducing bias into the gradient estimates. Another solution, which does not involve biasing the gradient estimate, is to subtract a reinforcement baseline from the average reward estimate in the updates of PG algorithms (e.g., [4, 1]). Another approach for speeding up policy gradient algorithms was recently proposed in [5] and extended in [6, 7]. The idea is to replace the policy-gradient estimate with an estimate of the so-called natural policy gradient. This is motivated by the requirement that a change in the way the policy is parameterized should not influence the result of the policy update. In terms of the policy update rule, the move to a natural-gradient rule amounts to linearly transforming the gradient using the inverse Fisher information matrix of the policy.

However, both conventional and natural policy gradient methods rely on Monte-Carlo (MC) techniques to estimate the gradient of the performance measure. Monte-Carlo estimation is a frequentist procedure, and as such violates the likelihood principle [8].¹ Moreover, although MC estimates are unbiased, they tend to produce high variance estimates, or alternatively, require excessive sample sizes (see [9] for a discussion).

¹ The likelihood principle states that in a parametric statistical model, all the information about a data sample that is required for inferring the model parameters is contained in the likelihood function of that sample.

In [10] a Bayesian alternative to MC estimation is proposed.
The idea is to model integrals of the form ∫ f(x)p(x)dx as Gaussian Processes (GPs). This is done by treating the first term f in the integrand as a random function, the randomness of which reflects our subjective uncertainty concerning its true identity. This allows us to incorporate our prior knowledge on f into its prior distribution. Observing (possibly noisy) samples of f at a set of points (x₁, x₂, …, x_M) allows us to employ Bayes' rule to compute a posterior distribution of f, conditioned on these samples. This, in turn, induces a posterior distribution over the value of the integral. In this paper, we propose a Bayesian framework for policy gradient, by modeling the gradient as a GP. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient and the gradient covariance are provided at little extra cost.

2 Reinforcement Learning and Policy Gradient Methods

Reinforcement Learning (RL) [11, 12] is a class of learning problems in which an agent interacts with an unfamiliar, dynamic and stochastic environment, where the agent's goal is to optimize some measure of its long-term performance. This interaction is conventionally modeled as an MDP. Let P(S) be the set of probability distributions on (Borel) subsets of a set S. An MDP is a tuple (X, A, q, P, P₀), where X and A are the state and action spaces, respectively; q(·|a, x) ∈ P(ℝ) is the probability distribution over rewards; P(·|a, x) ∈ P(X) is the transition probability distribution (we assume that P and q are stationary); and P₀(·) ∈ P(X) is the initial state distribution. We denote the random variable distributed according to q(·|a, x) as r(x, a). In addition, we need to specify the rule according to which the agent selects actions at each possible state. We assume that this rule does not depend explicitly on time.
A stationary policy μ(·|x) ∈ P(A) is a probability distribution over actions, conditioned on the current state. The MDP controlled by the policy μ induces a Markov chain over state-action pairs. We generically denote by ξ = (x₀, a₀, x₁, a₁, …, x_{T−1}, a_{T−1}, x_T) a path generated by this Markov chain. The probability (or density) of such a path is given by

    Pr(ξ|μ) = P₀(x₀) ∏_{t=0}^{T−1} μ(a_t|x_t) P(x_{t+1}|x_t, a_t).                              (1)

We denote by R(ξ) = Σ_{t=0}^{T} γ^t r(x_t, a_t) the (possibly discounted, γ ∈ [0, 1]) cumulative return of the path ξ. R(ξ) is a random variable both because the path ξ is a random variable, and because, even for a given path, each of the rewards sampled in it may be stochastic. The expected value of R(ξ) for a given ξ is denoted by R̄(ξ). Finally, let us define the expected return,

    η(μ) = E(R(ξ)) = ∫ R̄(ξ) Pr(ξ|μ) dξ.                                                       (2)

Gradient-based approaches to policy search in RL have recently received much attention as a means to sidestep problems of partial observability, and of policy oscillation and even divergence, encountered in value-function based methods (see [11], Sec. 6.4.2 and 6.5.3). In policy gradient (PG) methods, we define a class of smoothly parameterized stochastic policies {μ(·|x; θ), x ∈ X, θ ∈ Θ}, estimate the gradient of the expected return (2) with respect to the policy parameters θ from observed system trajectories, and then improve the policy by adjusting the parameters in the direction of the gradient [1, 2, 3].
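As a concrete illustration of the path score u(ξ) defined in Eq. 4 below (a sum of per-step log-policy gradients, with the dynamics terms dropping out), here is a minimal sketch for a one-dimensional Gaussian policy μ(·|x; θ) = N(θ₁x, θ₂²). The policy form and all function names here are our own illustrative choices, not part of the paper:

```python
import numpy as np

def gauss_log_policy_grad(a, x, theta):
    """Per-step score ∇_θ log μ(a|x;θ) for an illustrative 1-D Gaussian
    policy μ(·|x;θ) = N(θ1·x, θ2²), with θ = (θ1, θ2)."""
    mean, sigma = theta[0] * x, theta[1]
    d_mean = (a - mean) * x / sigma**2                # ∂/∂θ1 of log-density
    d_sigma = ((a - mean)**2 - sigma**2) / sigma**3   # ∂/∂θ2 of log-density
    return np.array([d_mean, d_sigma])

def path_score(path, theta):
    """u(ξ) = Σ_t ∇ log μ(a_t|x_t;θ): the P0 and P terms of Eq. 1 do not
    depend on θ, so only the policy terms survive (Eq. 4)."""
    return sum(gauss_log_policy_grad(a, x, theta) for x, a in path)
```

For a single step (x, a) = (1, 0.5) at θ = (0, 1) this gives u = [0.5, −0.75], matching the per-step log-density derivatives directly.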
The gradient of the expected return η(θ) = η(μ(·|·; θ)) is given by²

    ∇η(θ) = ∫ R̄(ξ) (∇Pr(ξ; θ) / Pr(ξ; θ)) Pr(ξ; θ) dξ,                                        (3)

where Pr(ξ; θ) = Pr(ξ|μ(·|·; θ)). The quantity ∇Pr(ξ; θ)/Pr(ξ; θ) = ∇ log Pr(ξ; θ) is known as the score function or likelihood ratio. Since the initial state distribution P₀ and the transition distribution P are independent of the policy parameters θ, we can write the score of a path ξ using Eq. 1 as

    u(ξ) = ∇Pr(ξ; θ)/Pr(ξ; θ) = Σ_{t=0}^{T−1} ∇μ(a_t|x_t; θ)/μ(a_t|x_t; θ) = Σ_{t=0}^{T−1} ∇ log μ(a_t|x_t; θ).      (4)

² Throughout the paper, we use the notation ∇ to denote ∇_θ — the gradient w.r.t. the policy parameters.

Previous work on policy gradient methods used classical Monte-Carlo to estimate the gradient in Eq. 3. These methods generate i.i.d. sample paths ξ₁, …, ξ_M according to Pr(ξ; θ), and estimate the gradient ∇η(θ) using the MC estimator

    ∇̂η_MC(θ) = (1/M) Σ_{i=1}^{M} R(ξ_i) ∇ log Pr(ξ_i; θ) = (1/M) Σ_{i=1}^{M} R(ξ_i) Σ_{t=0}^{T_i−1} ∇ log μ(a_{t,i}|x_{t,i}; θ).      (5)

3 Bayesian Quadrature

Bayesian quadrature (BQ) [10] is a Bayesian method for evaluating an integral using samples of its integrand. We consider the problem of evaluating the integral

    ρ = ∫ f(x)p(x)dx.                                                                          (6)

If p(x) is a probability density function, this becomes the problem of evaluating the expected value of f(x). In MC estimation of such expectations, samples (x₁, x₂, …, x_M) are drawn from p(x), and the integral is estimated as ρ̂_MC = (1/M) Σ_{i=1}^{M} f(x_i). ρ̂_MC is an unbiased estimate of ρ, with variance that diminishes to zero as M → ∞. However, as O'Hagan points out, MC estimation is fundamentally unsound, as it violates the likelihood principle and, moreover, does not make full use of the data at hand [9].

The alternative proposed in [10] is based on the following reasoning: in the Bayesian approach, f(·) is random simply because it is numerically unknown. We are therefore uncertain about the value of f(x) until we actually evaluate it. In fact, even then, our uncertainty is not always completely removed, since measured samples of f(x) may be corrupted by noise. Modeling f as a Gaussian process (GP) means that our uncertainty is completely accounted for by specifying a Normal prior distribution over functions. This prior distribution is specified by its mean and covariance, and is denoted by f(·) ~ N{f₀(·), k(·, ·)}. This is shorthand for the statement that f is a GP with prior mean E(f(x)) = f₀(x) and covariance Cov(f(x), f(x′)) = k(x, x′), respectively. The choice of kernel function k allows us to incorporate prior knowledge on the smoothness properties of the integrand into the estimation procedure. When we are provided with a set of samples D_M = {(x_i, y_i)}_{i=1}^{M}, where y_i is a (possibly noisy) sample of f(x_i), we apply Bayes' rule to condition the prior on these sampled values. If the measurement noise is normally distributed, the result is a Normal posterior distribution of f|D_M.
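This conditioning step uses the standard GP posterior formulas (Eq. 7 below). A minimal sketch, with an illustrative RBF kernel; the kernel choice, function names, and defaults are our own assumptions, not the paper's:

```python
import numpy as np

def gp_posterior(x_train, y_train, x_test, k, f0=lambda x: 0.0, noise=1e-2):
    """Posterior mean/cov of a GP f ~ N{f0, k} given noisy samples (Eq. 7):
    C_M = (K_M + σ²I)⁻¹; mean(x) = f0(x) + k_M(x)ᵀ C_M (y_M − f_0)."""
    K = np.array([[k(a, b) for b in x_train] for a in x_train])   # K_M
    C = np.linalg.inv(K + noise * np.eye(len(x_train)))           # C_M
    kxs = np.array([[k(a, b) for b in x_train] for a in x_test])  # rows k_M(x)ᵀ
    resid = np.array(y_train) - np.array([f0(a) for a in x_train])
    mean = np.array([f0(a) for a in x_test]) + kxs @ C @ resid
    Ktt = np.array([[k(a, b) for b in x_test] for a in x_test])
    cov = Ktt - kxs @ C @ kxs.T
    return mean, cov

# Illustrative RBF kernel; with near-zero noise the posterior at an
# observed point collapses onto the observation.
rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
```

For example, conditioning on the single sample (x, y) = (0, 1) with noise 10⁻⁶ gives a posterior mean at x = 0 within 10⁻³ of 1 and a posterior variance near zero.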
The expressions for the posterior mean and covariance are standard:

    E(f(x)|D_M) = f₀(x) + k_M(x)ᵀ C_M (y_M − f_0),
    Cov(f(x), f(x′)|D_M) = k(x, x′) − k_M(x)ᵀ C_M k_M(x′).                                     (7)

Here and in the sequel, we make use of the definitions:

    f_0 = (f₀(x₁), …, f₀(x_M))ᵀ ;   y_M = (y₁, …, y_M)ᵀ ;   k_M(x) = (k(x₁, x), …, k(x_M, x))ᵀ ;
    [K_M]_{i,j} = k(x_i, x_j) ;   C_M = (K_M + Σ_M)⁻¹ ;

and [Σ_M]_{i,j} is the measurement noise covariance between the ith and jth samples. Typically, it is assumed that the measurement noise is i.i.d., in which case Σ_M = σ²I, where σ² is the noise variance and I is the identity matrix.

Since integration is a linear operation, the posterior distribution of the integral in Eq. 6 is also Gaussian, and the posterior moments are given by

    E(ρ|D_M) = ∫ E(f(x)|D_M) p(x) dx ,   Var(ρ|D_M) = ∫∫ Cov(f(x), f(x′)|D_M) p(x) p(x′) dx dx′.   (8)

Substituting Eq. 7 into Eq. 8, we get

    E(ρ|D_M) = ρ₀ + z_Mᵀ C_M (y_M − f_0) ,   Var(ρ|D_M) = z₀ − z_Mᵀ C_M z_M ,                  (9)

where we made use of the definitions:

    ρ₀ = ∫ f₀(x) p(x) dx ,   z_M = ∫ k_M(x) p(x) dx ,   z₀ = ∫∫ k(x, x′) p(x) p(x′) dx dx′.    (10)

Note that ρ₀ and z₀ are the prior mean and variance of ρ, respectively.

                      Model 1                                     Model 2
Known part            p(ξ; θ) = Pr(ξ; θ)                          p(ξ; θ) = ∇Pr(ξ; θ)
Uncertain part        f(ξ; θ) = R̄(ξ) ∇log Pr(ξ; θ)              f(ξ) = R̄(ξ)
Measurement           y(ξ) = R(ξ) ∇log Pr(ξ; θ)                  y(ξ) = R(ξ)
Prior mean of f       E(f(ξ; θ)) = 0                              E(f(ξ)) = 0
Prior cov. of f       Cov(f(ξ; θ), f(ξ′; θ)) = k(ξ, ξ′) I        Cov(f(ξ), f(ξ′)) = k(ξ, ξ′)
E(∇η_B(θ)|D_M)        Y_M C_M z_M                                 Z_M C_M y_M
Cov(∇η_B(θ)|D_M)      (z₀ − z_Mᵀ C_M z_M) I                       Z_0 − Z_M C_M Z_Mᵀ = G − U_M C_M U_Mᵀ
Kernel function       k(ξ_i, ξ_j) = (1 + u(ξ_i)ᵀ G⁻¹ u(ξ_j))²     k(ξ_i, ξ_j) = u(ξ_i)ᵀ G⁻¹ u(ξ_j)
z_M / Z_M             (z_M)_i = 1 + u(ξ_i)ᵀ G⁻¹ u(ξ_i)            Z_M = U_M
z₀ / Z_0              z₀ = 1 + n                                   Z_0 = G

Table 1: Summary of the Bayesian policy gradient Models 1 and 2.

In order to prevent the problem from "degenerating into infinite regress," as phrased by O'Hagan [10], we should choose the functions p, k, and f₀ so as to allow us to solve the integrals in Eq. 10 analytically. For instance, O'Hagan provides the analysis required for the case where the integrands in Eq. 10 are products of multivariate Gaussians and polynomials, referred to as Bayes-Hermite quadrature. One of the contributions of the present paper is in providing analogous analysis for kernel functions that are based on the Fisher kernel [13, 14]. It is important to note that in MC estimation, samples must be drawn from the distribution p(x), whereas in the Bayesian approach, samples may be drawn from arbitrary distributions. This affords us flexibility in the choice of sample points, allowing us, for instance, to actively design the samples (x₁, x₂, …, x_M).

4 Bayesian Policy Gradient

In this section, we use Bayesian quadrature to estimate the gradient of the expected return with respect to the policy parameters, and propose Bayesian policy gradient (BPG) algorithms. In the frequentist approach to policy gradient, our performance measure was η(θ) from Eq.
2, which is the result of averaging the cumulative return R(ξ) over all possible paths ξ and all possible returns accumulated in each path. In the Bayesian approach we have an additional source of randomness, which is our subjective Bayesian uncertainty concerning the process generating the cumulative returns. Let us denote

    η_B(θ) = ∫ R(ξ) Pr(ξ; θ) dξ.                                                              (11)

η_B(θ) is a random variable both because of the noise in R(ξ) and the Bayesian uncertainty. Under the quadratic loss, our Bayesian performance measure is E(η_B(θ)|D_M). Since we are interested in optimizing performance rather than evaluating it, we evaluate the posterior distribution of the gradient of η_B(θ). For the mean we have

    ∇E(η_B(θ)|D_M) = E(∇η_B(θ)|D_M) = E( ∫ R(ξ) (∇Pr(ξ; θ)/Pr(ξ; θ)) Pr(ξ; θ) dξ | D_M ).      (12)

Consequently, in BPG we cast the problem of estimating the gradient of the expected return in the form of Eq. 6. As described in Sec. 3, we partition the integrand into two parts, f(ξ; θ) and p(ξ; θ). We will place the GP prior over f and assume that p is known. We will then proceed by calculating the posterior moments of the gradient ∇η_B(θ) conditioned on the observed data. Next, we investigate two different ways of partitioning the integrand in Eq. 12, resulting in two distinct Bayesian models. Table 1 summarizes the two models we use in this work. Our choice of Fisher-type kernels was motivated by the notion that a good representation should depend on the data-generating process (see [13, 14] for a thorough discussion). Our particular choices of linear and quadratic Fisher kernels were guided by the requirement that the posterior moments of the gradient be analytically tractable.
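With the quantities of Table 1 in hand, the Model 1 posterior moments reduce to a few lines of linear algebra. A sketch, assuming scores are stacked as columns of an n × M matrix; the array-layout conventions, function names, and noise default are our own:

```python
import numpy as np

def bpg_model1_gradient(U, R, G, sigma2=1e-4):
    """Posterior mean and covariance of ∇η_B(θ) under Model 1 (Table 1):
    quadratic Fisher kernel k(ξi, ξj) = (1 + u_iᵀ G⁻¹ u_j)²,
    (z_M)_i = 1 + u_iᵀ G⁻¹ u_i, z0 = 1 + n, C_M = (K_M + σ²I)⁻¹,
    Y_M has columns R(ξ_i) u(ξ_i).
    U: (n, M) path scores; R: (M,) returns; G: (n, n) Fisher matrix."""
    Ginv = np.linalg.inv(G)
    Q = U.T @ Ginv @ U                       # Q[i, j] = u_iᵀ G⁻¹ u_j
    K = (1.0 + Q) ** 2                       # quadratic Fisher kernel matrix
    z = 1.0 + np.diag(Q)                     # z_M
    C = np.linalg.inv(K + sigma2 * np.eye(U.shape[1]))
    Y = U * R                                # column i is R(ξ_i) u(ξ_i)
    mean = Y @ C @ z                         # E(∇η_B(θ)|D_M) = Y_M C_M z_M
    n = U.shape[0]
    z0 = 1.0 + n
    cov = (z0 - z @ C @ z) * np.eye(n)       # (z0 − z_Mᵀ C_M z_M) I
    return mean, cov
```

For instance, with two unit-return paths whose scores are the standard basis vectors of ℝ², G = I, and no measurement noise, the kernel matrix is [[4, 1], [1, 4]], giving posterior mean (0.4, 0.4)ᵀ and posterior covariance 1.4·I.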
In Table 1 we made use of the following definitions: F_M = (f(ξ₁; θ), …, f(ξ_M; θ)) ~ N(0, K_M), Y_M = (y(ξ₁), …, y(ξ_M)) ~ N(0, K_M + σ²I), U_M = [u(ξ₁), u(ξ₂), …, u(ξ_M)], Z_M = ∫ ∇Pr(ξ; θ) k_M(ξ)ᵀ dξ, and Z_0 = ∫∫ k(ξ, ξ′) ∇Pr(ξ; θ) ∇Pr(ξ′; θ)ᵀ dξ dξ′. Finally, n is the number of policy parameters, and G = E(u(ξ)u(ξ)ᵀ) is the Fisher information matrix.

We can now use Models 1 and 2 to define algorithms for evaluating the gradient of the expected return with respect to the policy parameters. The pseudo-code for these algorithms is shown in Alg. 1. The generic algorithm (for either model) takes a set of policy parameters θ and a sample size M as input, and returns an estimate of the posterior moments of the gradient of the expected return.

Algorithm 1: A Bayesian Policy Gradient Evaluation Algorithm
 1: BPG_Eval(θ, M)                       // policy parameters θ ∈ ℝⁿ, sample size M > 0 //
 2: Set G = G(θ) ;  D₀ = ∅
 3: for i = 1 to M do
 4:    Sample a path ξ_i using the policy μ(θ)
 5:    D_i = D_{i−1} ∪ {ξ_i}
 6:    Compute u(ξ_i) = Σ_{t=0}^{T_i−1} ∇log μ(a_t|x_t; θ)  and  R(ξ_i) = Σ_{t=0}^{T_i−1} r(x_t, a_t)
 7:    Update K_i using K_{i−1} and ξ_i
 8:    y(ξ_i) = R(ξ_i) u(ξ_i)   (Model 1)      or   y(ξ_i) = R(ξ_i)     (Model 2)
 9:    (z_M)_i = 1 + u(ξ_i)ᵀ G⁻¹ u(ξ_i)  (Model 1)   or   Z_M(:, i) = u(ξ_i)  (Model 2)
10: end for
11: C_M = (K_M + σ²I)⁻¹
12: Compute the posterior mean and covariance:
       E(∇η_B(θ)|D_M) = Y_M C_M z_M ,  Cov(∇η_B(θ)|D_M) = (z₀ − z_Mᵀ C_M z_M) I     (Model 1)
    or E(∇η_B(θ)|D_M) = Z_M C_M y_M ,  Cov(∇η_B(θ)|D_M) = Z_0 − Z_M C_M Z_Mᵀ        (Model 2)
13: return E(∇η_B(θ)|D_M) , Cov(∇η_B(θ)|D_M)

The kernel functions used in Models 1 and 2 are both based on the Fisher information matrix G(θ). Consequently, every time we update the policy parameters we need to recompute G. In Alg. 1 we assume that G is known; however, in most practical situations this will not be the case. Let us briefly outline two possible approaches for estimating the Fisher information matrix.

MC Estimation: At each step j, our BPG algorithm generates M sample paths using the current policy parameters θ_j in order to estimate the gradient ∇η_B(θ_j). We can use these generated sample paths to estimate the Fisher information matrix G(θ_j) by replacing the expectation in G with empirical averaging:

    Ĝ_MC(θ_j) = (1 / Σ_{i=1}^{M} T_i) Σ_{i=1}^{M} Σ_{t=0}^{T_i−1} ∇log μ(a_t|x_t; θ_j) ∇log μ(a_t|x_t; θ_j)ᵀ.

Model-Based Policy Gradient: The Fisher information matrix depends on the probability distribution over paths. This distribution is a product of two factors, one corresponding to the current policy, and the other corresponding to the MDP dynamics P₀ and P (see Eq. 1). Thus, if the MDP dynamics are known, the Fisher information matrix can be evaluated off-line. We can model the MDP dynamics using some parameterized model, and estimate the model parameters using maximum likelihood or Bayesian methods. This would be a model-based approach to policy gradient, which would allow us to transfer information between different policies.

Alg. 1 can be made significantly more efficient, both in time and memory, by sparsifying the solution. Such sparsification may be performed incrementally, and helps to numerically stabilize the algorithm when the kernel matrix is singular, or nearly so.
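Returning to the MC estimate Ĝ_MC of the Fisher information matrix outlined above: it is simply a per-step empirical average of score outer products. A minimal sketch; the nested-list data layout is an assumption of ours:

```python
import numpy as np

def fisher_mc(step_scores):
    """MC estimate of the Fisher information matrix G(θ).
    step_scores: one list per sampled path; each path is a list of the
    per-step score vectors ∇ log μ(a_t|x_t; θ).
    Returns (Σ_i T_i)⁻¹ Σ_i Σ_t  g_t g_tᵀ."""
    total_steps = sum(len(path) for path in step_scores)
    n = len(step_scores[0][0])
    G = np.zeros((n, n))
    for path in step_scores:
        for g in path:
            g = np.asarray(g)
            G += np.outer(g, g)   # accumulate ∇log μ ∇log μᵀ per step
    return G / total_steps
```

As a quick check, a single two-step path with scores (1, 0) and (0, 1) averages to 0.5·I.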
Here we use an on-line sparsification method from [15] to selectively add a new observed path to a set of dictionary paths D_M, which are used as a basis for approximating the full solution. Lack of space prevents us from discussing this method in further detail (see Chapter 2 in [15] for a thorough discussion).

The Bayesian policy gradient (BPG) algorithm is described in Alg. 2. This algorithm starts with an initial vector of policy parameters θ₀ and updates the parameters in the direction of the posterior mean of the gradient of the expected return, computed by Alg. 1. This is repeated N times, or alternatively, until the gradient estimate is sufficiently close to zero.

Algorithm 2: A Bayesian Policy Gradient Algorithm
1: BPG(θ₀, α, N, M)   // initial policy parameters θ₀, learning rates (α_j)_{j=0}^{N−1}, number of policy updates N > 0, BPG_Eval sample size M > 0 //
2: for j = 0 to N − 1 do
3:    Δθ_j = E(∇η_B(θ_j)|D_M) from BPG_Eval(θ_j, M)
4:    θ_{j+1} = θ_j + α_j Δθ_j   (regular gradient)   or   θ_{j+1} = θ_j + α_j G⁻¹(θ_j) Δθ_j   (natural gradient)
5: end for
6: return θ_N

5 Experimental Results

In this section, we compare the BQ and MC gradient estimators in a continuous-action bandit problem and a continuous state and action linear quadratic regulation (LQR) problem. We also evaluate the performance of the BPG algorithm (Alg. 2) on the LQR problem, and compare it with a standard MC-based policy gradient (MCPG) algorithm.

5.1 A Bandit Problem

In this simple example, we compare the BQ and MC estimates of the gradient (for a fixed set of policy parameters) using the same samples. Our simple bandit problem has a single state and A = ℝ. Thus, each path ξ_i consists of a single action a_i. The policy, and therefore also the distribution over paths, is given by a ~ N(θ₁ = 0, θ₂² = 1). The score function of the path ξ = a and the Fisher information matrix are given by u(ξ) = [a, a² − 1]ᵀ and G = diag(1, 2), respectively.

Table 2 shows the exact gradient of the expected return and its MC and BQ estimates (using 10 and 100 samples) for two versions of the simple bandit problem, corresponding to the two deterministic reward functions r(a) = a and r(a) = a². The averages over 10⁴ runs of the MC and BQ estimates and their standard deviations are reported in Table 2. The true gradient is analytically tractable and is reported as "Exact" in Table 2 for reference.

r(a) = a:   Exact: (1, 0)ᵀ
            MC (10):  (0.9950 ± 0.438, −0.0011 ± 0.977)ᵀ    BQ (10):  (0.9856 ± 0.050, 0.0006 ± 0.060)ᵀ
            MC (100): (1.0004 ± 0.140,  0.0040 ± 0.317)ᵀ    BQ (100): (1.000 ± 0.000001, 0.000 ± 0.000004)ᵀ
r(a) = a²:  Exact: (0, 2)ᵀ
            MC (10):  (0.0136 ± 1.246,  2.0336 ± 2.831)ᵀ    BQ (10):  (0.0010 ± 0.082, 1.9250 ± 0.226)ᵀ
            MC (100): (0.0051 ± 0.390,  1.9869 ± 0.857)ᵀ    BQ (100): (0.000 ± 0.000003, 2.000 ± 0.000011)ᵀ

Table 2: The true gradient of the expected return and its MC and BQ estimates for two bandit problems.

As shown in Table 2, the BQ estimate has much lower variance than the MC estimate for both small and large sample sizes.
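The MC columns of such a comparison can be reproduced in a few lines. A sketch of the MC estimator of Eq. 5 for this bandit, using the stated score u(ξ) = [a, a² − 1]ᵀ at θ = (0, 1); the BQ columns would additionally require the Fisher-kernel machinery of Table 1, and the function name and defaults here are ours:

```python
import numpy as np

def bandit_mc_gradient(reward, theta1=0.0, theta2=1.0, M=100000, seed=0):
    """MC policy-gradient estimate (Eq. 5) for the one-state bandit with
    Gaussian policy a ~ N(θ1, θ2²). The single-step path score is
    u(ξ) = [(a−θ1)/θ2², ((a−θ1)² − θ2²)/θ2³]ᵀ, which reduces to
    [a, a² − 1]ᵀ at θ = (0, 1)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(theta1, theta2, size=M)
    u = np.stack([(a - theta1) / theta2**2,
                  ((a - theta1)**2 - theta2**2) / theta2**3])
    return (u * reward(a)).mean(axis=1)   # (1/M) Σ_i R(ξ_i) u(ξ_i)
```

With a large M this lands close to the exact gradients, (1, 0)ᵀ for r(a) = a and (0, 2)ᵀ for r(a) = a²; at small M the estimate is unbiased but noisy, which is the high MC variance visible in Table 2.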
The BQ estimate also has a lower bias than the MC estimate for the large sample size (M = 100), and almost the same bias for the small sample size (M = 10).

5.2 A Linear Quadratic Regulator

In this section, we consider the following linear system, in which the goal is to minimize the expected return over 20 steps. Thus, it is an episodic problem with paths of length 20.

System                                                    Policy
Initial state:  x₀ ~ N(0.3, 0.001)                        Actions:     a_t ~ μ(·|x_t; θ) = N(λx_t, σ²)
Rewards:        r_t = x_t² + 0.1a_t²                      Parameters:  θ = (λ, σ)ᵀ
Transitions:    x_{t+1} = x_t + a_t + n_x ,  n_x ~ N(0, 0.01)

We first compare the BQ and MC estimates of the gradient of the expected return for the policy induced by the parameters λ = −0.2 and σ = 1. We use several different sample sizes (numbers of paths used for gradient estimation), M = 5j, j = 1, …, 20, for the BQ and MC estimates. For each sample size, we compute both the MC and BQ estimates 10⁴ times, using the same samples. The true gradient is estimated using MC with 10⁷ sample paths for comparison purposes.

Figure 1 shows the mean squared error (MSE) (first column) and the mean absolute angular error (second column) of the MC and BQ estimates of the gradient for several different sample sizes. The absolute angular error is the absolute value of the angle between the true gradient and the estimated gradient. In this figure, the BQ gradient estimate was calculated using Model 1 without sparsification. With a good choice of sparsification threshold, we can attain almost identical results much faster and more efficiently with sparsification. These results are not shown here due to space limitations.
To give an intuition concerning the speed and the efficiency attained by sparsification, we should mention that the dimension of the feature space for the kernel used in Model 1 is 6 (Proposition 9.2 in [14]). Therefore, with sparsification we deal with a kernel matrix of size 6, versus a kernel matrix of size M = 5j, j = 1, …, 20, without sparsification.

We ran another set of experiments, in which we add i.i.d. Gaussian noise to the rewards: r_t = x_t² + 0.1a_t² + n_r , n_r ~ N(0, σ_r² = 0.1). In Model 2, we can model this by the measurement noise covariance matrix Σ = Tσ_r² I, where T = 20 is the path length. Since each reward r_t is a Gaussian random variable with variance σ_r², the return R(ξ) = Σ_{t=0}^{T−1} r_t will also be a Gaussian random variable with variance Tσ_r². The results are presented in the third and fourth columns of Figure 1. These experiments indicate that the BQ gradient estimate has lower variance than its MC counterpart. In fact, whereas the performance of the MC estimate improves as 1/M, the performance of the BQ estimate improves at a higher rate.

Figure 1: Results for the LQR problem using Model 1 (left) and Model 2 (right), without sparsification. The Model 2 results are for an LQR problem in which the rewards are corrupted by i.i.d. Gaussian noise. For each algorithm, we show the MSE (left) and the mean absolute angular error (right), as functions of the number of sample paths M. Note that the errors are plotted on a logarithmic scale. All results are averages over 10⁴ runs.

Next, we use BPG to optimize the policy parameters in the LQR problem. Figure 2 shows the performance of the BPG algorithm with the regular (BPG) and the natural (BPNG) gradient estimates, versus an MC-based policy gradient (MCPG) algorithm, for the sample sizes (number of sample paths used for estimating the gradient of a policy) M = 5, 10, 20, and 40. We use Alg. 2 with the number of updates set to N = 100, and Model 1 for the BPG and BPNG methods. Since Alg. 2 computes the Fisher information matrix for each set of policy parameters, an estimate of the natural gradient is provided at little extra cost at each step. The returns obtained by these methods are averaged over 10⁴ runs for sample sizes 5 and 10, and over 10³ runs for sample sizes 20 and 40. The policy parameters are initialized randomly at each run. In order to ensure that the learned parameters do not exceed an acceptable range, the policy parameters are defined as λ = −1.999 + 1.998/(1 + e^{ν₁}) and σ = 0.001 + 1/(1 + e^{ν₂}). The optimal solution is λ* ≈ −0.92 and σ* = 0.001 (η_B(λ*, σ*) = 0.1003), corresponding to ν₁* ≈ −0.16 and ν₂* → ∞.

Figure 2: A comparison of the average expected returns of BPG using regular (BPG) and natural (BPNG) gradient estimates, with the average expected return of the MCPG algorithm, for sample sizes 5, 10, 20, and 40.

Figure 2 shows that MCPG performs better than the BPG algorithm for the smallest sample size (M = 5), whereas for larger samples BPG dominates MCPG. This phenomenon is also reported in [16]. We use two different learning rates for the two components of the gradient. For a fixed sample size, each method starts with an initial learning rate and decreases it according to the schedule α_j = α₀(20/(20 + j)). Table 3 summarizes the best initial learning rates for each algorithm.
The selected learning rates for BPNG are signif-\nicantly larger than those for BPG and MCPG, which explains why BPNG initially learns\nfaster than BPG and MCPG, but contrary to our expectations, eventually performs worse.\n\nM = 10 M = 20 M = 40\nM = 5\n0.10, 0.15\n0.05, 0.10\nMCPG 0.01, 0.05\n0.10, 0.30\n0.07, 0.10\nBPG\n0.01, 0.03\n0.09, 0.30\n0.80, 0.90\nBPNG 0.03, 0.50\nFigure 3: Initial learning rates used by the PG algorithms.\n\nSo far we have assumed that the Fisher\ninformation matrix is known.\nIn the\nnext experiment, we estimate it us-\ning both MC and maximum likelihood\n(ML) methods as described in Sec. 4.\nIn ML estimation, we assume that the\ntransition probability function is P (xt+1jxt; at) = N ((cid:12)1xt + (cid:12)2at + (cid:12)3; (cid:12)2\n4 ), and then estimate its\nparameters by observing state transitions. Figure 4 shows that when the Fisher information matrix\nis estimated using MC (BPG-MC), the BPG algorithm still performs better than MCPG, and outper-\nforms the BPG algorithm in which the Fisher information matrix is estimated using ML (BPG-ML).\nMoreover, as we increase the sample size, its performance converges to the performance of the BPG\nalgorithm in which the Fisher information matrix is known (BPG).\n\n0.05, 0.10\n0.15, 0.20\n0.45, 0.90\n\n\ft\n\nn\nr\nu\ne\nR\nd\ne\n\n \n\n \n\nt\nc\ne\np\nx\nE\ne\ng\na\nr\ne\nv\nA\n\n \n\nMC\nBPG\nBPG\u2212ML\nBPG\u2212MC\nOptimal\n\n10\u22120.1\n\n10\u22120.2\n\n10\u22120.3\n\n10\u22120.4\n\n10\u22120.5\n0\n100\nNumber of Updates (Sample Size = 10)\n\n20\n\n40\n\n60\n\n80\n\n \n\nt\n\nn\nr\nu\ne\nR\nd\ne\n\n \n\n \n\nt\nc\ne\np\nx\nE\ne\ng\na\nr\ne\nv\nA\n\n \n\nMC\nBPG\nBPG\u2212ML\nBPG\u2212MC\nOptimal\n\n10\u22120.1\n\n10\u22120.2\n\n10\u22120.3\n\n10\u22120.4\n\n10\u22120.5\n0\n100\nNumber of Updates (Sample Size = 20)\n\n20\n\n40\n\n60\n\n80\n\n \n\nt\n\nn\nr\nu\ne\nR\nd\ne\n\n \n\n \n\nt\nc\ne\np\nx\nE\ne\ng\na\nr\ne\nv\nA\n\n 
[Figure 4 appears here: three panels plotting average expected return against the number of policy updates for MC, BPG, BPG-ML, BPG-MC, and the optimal policy.]

Figure 4: A comparison of the average return of BPG when the Fisher information matrix is known (BPG), and when it is estimated using MC (BPG-MC) and ML (BPG-ML) methods, for sample sizes 10, 20, and 40 (from left to right). The average return of the MCPG algorithm is also provided for comparison.

6 Discussion

In this paper we proposed an alternative approach to conventional frequentist policy gradient estimation procedures, which is based on the Bayesian view. Our algorithms use GPs to define a prior distribution over the gradient of the expected return, and compute the posterior, conditioned on the observed data. The experimental results are encouraging, but we conjecture that even higher gains may be attained using this approach. This calls for additional theoretical and empirical work. Although the proposed policy updating algorithm (Alg. 2) uses only the posterior mean of the gradient in its updates, we hope that more elaborate algorithms can be devised that would make judicious use of the covariance information provided by the gradient estimation algorithm (Alg. 1). Two obvious possibilities are: 1) risk-aware selection of the update step-size and direction, and 2) using the variance in a termination condition for Alg. 1.
Other interesting directions include 1) investigating other possible partitions of the integrand in the expression for ∇η_B(θ) into a GP term f and a known term p, 2) using other types of kernel functions, such as sequence kernels, 3) combining our approach with MDP model estimation, to allow transfer of learning between different policies, 4) investigating methods for learning the Fisher information matrix, and 5) extending the Bayesian approach to Actor-Critic type algorithms, possibly by combining BPG with the Gaussian process temporal difference (GPTD) algorithms of [15].

Acknowledgments We thank Rich Sutton and Dale Schuurmans for helpful discussions. M.G. would like to thank Shie Mannor for his useful comments at the early stages of this work. M.G. is supported by iCORE and Y.E. is partially supported by an Alberta Ingenuity fellowship.

References
[1] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.
[2] P. Marbach. Simulation-Based Methods for Markov Decision Processes. PhD thesis, MIT, 1998.
[3] J. Baxter and P. Bartlett. Infinite-horizon policy-gradient estimation. JAIR, 15:319-350, 2001.
[4] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of NIPS 12, pages 1057-1063, 2000.
[5] S. Kakade. A natural policy gradient. In Proceedings of NIPS 14, 2002.
[6] J. Bagnell and J. Schneider. Covariant policy search. In Proceedings of the 18th IJCAI, 2003.
[7] J. Peters, S. Vijayakumar, and S. Schaal. Reinforcement learning for humanoid robotics. In Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots, 2003.
[8] J. Berger and R. Wolpert. The Likelihood Principle. Inst. of Mathematical Statistics, Hayward, CA, 1984.
[9] A. O'Hagan. Monte Carlo is fundamentally unsound.
The Statistician, 36:247-249, 1987.
[10] A. O'Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29, 1991.
[11] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[12] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[13] T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Proceedings of NIPS 11. MIT Press, 1998.
[14] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[15] Y. Engel. Algorithms and Representations for Reinforcement Learning. PhD thesis, The Hebrew University of Jerusalem, Israel, 2005.
[16] C. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In Proceedings of NIPS 15. MIT Press, 2003.