{"title": "Analysis and Improvement of Policy Gradient Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 262, "page_last": 270, "abstract": "Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments.", "full_text": "Analysis and Improvement of Policy Gradient Estimation\n\nTingting Zhao, Hirotaka Hachiya, Gang Niu, and Masashi Sugiyama\n\n{tingting@sg., hachiya@sg., gang@sg., sugiyama@}cs.titech.ac.jp\n\nTokyo Institute of Technology\n\nAbstract\n\nPolicy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. 
Finally, we demonstrate the usefulness of the improved PGPE method through experiments.\n\n1 Introduction\n\nThe goal of reinforcement learning (RL) is to find an optimal decision-making policy that maximizes the return (i.e., the sum of discounted rewards) through interaction with an unknown environment [13]. Model-free RL is a flexible framework in which decision-making policies are directly learned without going through explicit modeling of the environment. Policy iteration and policy search are two popular formulations of model-free RL.\n\nIn the policy iteration approach [6], the value function is first estimated and then policies are determined based on the learned value function. Policy iteration was demonstrated to work well in many real-world applications, especially in problems with discrete states and actions [14, 17, 1]. Although policy iteration can naturally deal with continuous states by function approximation [8], continuous actions are hard to handle due to the difficulty of finding maximizers of value functions with respect to actions. Moreover, since policies are indirectly determined via value function approximation, misspecification of value function models can lead to inappropriate policies even in very simple problems [15, 2]. Another limitation of policy iteration, especially in physical control tasks, is that control policies can vary drastically in each iteration. This causes severe instability in the physical system and is thus not favorable in practice.\n\nPolicy search is another approach to model-free RL that can overcome the limitations of policy iteration [18, 4, 7]. In the policy search approach, control policies are directly learned so that the return is maximized, for example, via a gradient method (the REINFORCE method) [18], an EM algorithm [4], or a natural gradient method [7]. 
Among them, the gradient-based method is particularly useful in physical control tasks since policies are changed gradually, which ensures the stability of the physical system.\n\nHowever, since the REINFORCE method tends to have a large variance in the estimation of gradient directions, its naive implementation converges slowly [9, 10, 12]. Subtraction of the optimal baseline [16, 5] can ease this problem to some extent, but the variance of gradient estimates is still large. Furthermore, the performance heavily depends on the choice of the initial policy, and appropriate initialization is not straightforward in practice.\n\nTo cope with this problem, a novel policy gradient method called policy gradients with parameter-based exploration (PGPE) was proposed recently [12]. In PGPE, an initial policy is drawn from a prior probability distribution, and then actions are chosen deterministically. This construction contributes to mitigating the problem of initial policy choice and to stabilizing gradient estimates. Moreover, by subtracting a moving-average baseline, the variance of gradient estimates can be further reduced. Through robot-control experiments, PGPE was demonstrated to achieve more stable performance than existing policy-gradient methods.\n\nThe goal of this paper is to theoretically support the usefulness of PGPE, and to further improve its performance. More specifically, we first give bounds on the gradient estimates of the REINFORCE and PGPE methods. Our theoretical analysis shows that gradient estimates for PGPE have smaller variance than those for REINFORCE under a mild condition. We then show that the moving-average baseline for PGPE adopted in the original paper [12] has excess variance; we give the optimal baseline for PGPE that minimizes the variance, following the line of [16, 5]. 
We further theoretically show that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, the usefulness of the improved PGPE method is demonstrated through experiments.\n\n2 Policy Gradients for Reinforcement Learning\n\nIn this section, we review policy gradient methods.\n\n2.1 Problem Formulation\n\nLet us consider a Markov decision problem specified by $(\mathcal{S}, \mathcal{A}, P_T, P_I, r, \gamma)$, where $\mathcal{S}$ is a set of $\ell$-dimensional continuous states, $\mathcal{A}$ is a set of continuous actions, $P_T(s'|s, a)$ is the transition probability density from current state $s$ to next state $s'$ when action $a$ is taken, $P_I(s)$ is the probability density of initial states, $r(s, a, s')$ is an immediate reward for the transition from $s$ to $s'$ by taking action $a$, and $0 < \gamma < 1$ is the discount factor for future rewards. Let $p(a|s, \theta)$ be a stochastic policy with parameter $\theta$, which represents the conditional probability density of taking action $a$ in state $s$.\n\nLet $h = [s_1, a_1, \ldots, s_T, a_T]$ be a trajectory of length $T$. Then the return (i.e., the discounted sum of future rewards) along $h$ is given by\n\n$R(h) := \sum_{t=1}^T \gamma^{t-1} r(s_t, a_t, s_{t+1})$.\n\nThe expected return for parameter $\theta$ is defined by\n\n$J(\theta) := \int p(h|\theta) R(h) \, dh$, where $p(h|\theta) = p(s_1) \prod_{t=1}^T p(s_{t+1}|s_t, a_t) p(a_t|s_t, \theta)$.\n\nThe goal of reinforcement learning is to find the optimal policy parameter $\theta^*$ that maximizes the expected return $J(\theta)$:\n\n$\theta^* := \arg\max_\theta J(\theta)$.\n\n2.2 Review of the REINFORCE Algorithm\n\nIn the REINFORCE algorithm [18], the policy parameter $\theta$ is updated via gradient ascent:\n\n$\theta \leftarrow \theta + \varepsilon \nabla_\theta J(\theta)$,\n\nwhere $\varepsilon$ is a small positive constant. 
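The return $R(h)$ and the Monte Carlo approximation of $J(\theta)$ defined above can be sketched in a few lines; this is a minimal illustration, and the function names are ours, not the paper's:

```python
import numpy as np

def discounted_return(rewards, gamma=0.9):
    """R(h) = sum_{t=1}^T gamma^(t-1) r_t for one trajectory h."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def expected_return(reward_sequences, gamma=0.9):
    """Monte Carlo approximation of J(theta): average R(h) over N trajectories."""
    return float(np.mean([discounted_return(rs, gamma) for rs in reward_sequences]))

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9*1 + 0.81*1 = 2.71
```

The gradient-ascent update then repeatedly replaces $\theta$ by $\theta + \varepsilon$ times an estimate of $\nabla_\theta J(\theta)$, with the estimators reviewed next.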
The gradient $\nabla_\theta J(\theta)$ is given by\n\n$\nabla_\theta J(\theta) = \int \nabla_\theta p(h|\theta) R(h) \, dh = \int p(h|\theta) \nabla_\theta \log p(h|\theta) R(h) \, dh = \int p(h|\theta) \sum_{t=1}^T \nabla_\theta \log p(a_t|s_t, \theta) R(h) \, dh$,\n\nwhere we used the so-called 'log trick': $\nabla_\theta p(h|\theta) = p(h|\theta) \nabla_\theta \log p(h|\theta)$. Since $p(h|\theta)$ is unknown, the expectation is approximated by the empirical average:\n\n$\nabla_\theta \widehat{J}(\theta) = \frac{1}{N} \sum_{n=1}^N \sum_{t=1}^T \nabla_\theta \log p(a_t^n|s_t^n, \theta) R(h^n)$,\n\nwhere $h^n := [s_1^n, a_1^n, \ldots, s_T^n, a_T^n]$ is a roll-out sample.\n\nLet us employ the Gaussian policy model with parameter $\theta = (\mu, \sigma)$, where $\mu$ is the mean vector and $\sigma$ is the standard deviation:\n\n$p(a|s, \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(a - \mu^\top s)^2}{2\sigma^2}\right)$.\n\nThen the policy gradients are explicitly given as\n\n$\nabla_\mu \log p(a|s, \theta) = \frac{a - \mu^\top s}{\sigma^2} s$ and $\nabla_\sigma \log p(a|s, \theta) = \frac{(a - \mu^\top s)^2 - \sigma^2}{\sigma^3}$.\n\nA drawback of REINFORCE is that the variance of the above policy gradients is large [10, 11], which leads to slow convergence.\n\n2.3 Review of the PGPE Algorithm\n\nOne of the reasons for the large variance of policy gradients in the REINFORCE algorithm is that the empirical average is taken at each time step, which is caused by the stochasticity of policies. In order to mitigate this problem, another method called policy gradients with parameter-based exploration (PGPE) was proposed recently [11]. In PGPE, a linear deterministic policy,\n\n$a = \theta^\top s$,\n\nis adopted, and stochasticity is introduced by considering a prior distribution over the policy parameter $\theta$ with hyper-parameter $\rho$: $p(\theta|\rho)$. 
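For the Gaussian policy above, the empirical REINFORCE estimate $\nabla_\theta \widehat{J}(\theta)$ can be sketched as follows; this is a hedged illustration (the function name and the trajectory layout are ours), not the paper's implementation:

```python
import numpy as np

def reinforce_gradient(trajectories, mu, sigma, gamma=0.9):
    """Baseline-free REINFORCE estimates of the mu- and sigma-gradients of J
    for the Gaussian policy p(a|s) = N(mu^T s, sigma^2).
    Each trajectory is a list of (state, action, reward) tuples."""
    g_mu = np.zeros_like(mu)
    g_sigma = 0.0
    for traj in trajectories:
        # discounted return R(h) of this roll-out
        R = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))
        for s, a, _ in traj:
            d = a - mu @ s                          # a - mu^T s
            g_mu += (d / sigma ** 2) * s * R        # eligibility for mu, weighted by R(h)
            g_sigma += (d ** 2 - sigma ** 2) / sigma ** 3 * R
    n = len(trajectories)
    return g_mu / n, g_sigma / n
```

Note that the eligibility is summed over all $T$ time steps of each roll-out, which is exactly the source of the large variance discussed above.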
Since the entire history $h$ is solely determined by a single sample of the parameter $\theta$ in this formulation, it is expected that the variance of gradient estimates can be reduced. The expected return for hyper-parameter $\rho$ is expressed as\n\n$J(\rho) = \int\!\!\int p(h|\theta) p(\theta|\rho) R(h) \, dh \, d\theta$.\n\nDifferentiating this with respect to $\rho$, we have\n\n$\nabla_\rho J(\rho) = \int\!\!\int p(h|\theta) \nabla_\rho p(\theta|\rho) R(h) \, dh \, d\theta = \int\!\!\int p(h|\theta) p(\theta|\rho) \nabla_\rho \log p(\theta|\rho) R(h) \, dh \, d\theta$,\n\nwhere the log trick for $\nabla_\rho p(\theta|\rho)$ is used. We then approximate the expectation over $h$ and $\theta$ by the empirical average:\n\n$\nabla_\rho \widehat{J}(\rho) = \frac{1}{N} \sum_{n=1}^N \nabla_\rho \log p(\theta^n|\rho) R(h^n)$,\n\nwhere each trajectory sample $h^n$ is drawn from $p(h|\theta^n)$ and the parameter $\theta^n$ is drawn from $p(\theta^n|\rho)$.\n\nLet us employ the Gaussian prior distribution with hyper-parameter $\rho = (\eta, \tau)$ to draw the parameter vector $\theta$, where $\eta$ is the mean vector and $\tau$ is the vector consisting of the standard deviation in each element:\n\n$p(\theta_i|\rho_i) = \frac{1}{\tau_i\sqrt{2\pi}} \exp\left(-\frac{(\theta_i - \eta_i)^2}{2\tau_i^2}\right)$.\n\nThen the derivatives of $\log p(\theta|\rho)$ with respect to $\eta_i$ and $\tau_i$ are given as\n\n$\nabla_{\eta_i} \log p(\theta|\rho) = \frac{\theta_i - \eta_i}{\tau_i^2}$ and $\nabla_{\tau_i} \log p(\theta|\rho) = \frac{(\theta_i - \eta_i)^2 - \tau_i^2}{\tau_i^3}$.\n\n3 Variance of Gradient Estimates\n\nIn this section, we theoretically investigate the variance of gradient estimates in REINFORCE and PGPE. For a multi-dimensional state space, we consider the trace of the covariance matrix of gradient vectors. That is, for a random vector $A = (A_1, \ldots
, A_\ell)^\top$, we define\n\n$\mathrm{Var}(A) = \mathrm{tr}\left(\mathbb{E}\left[(A - \mathbb{E}[A])(A - \mathbb{E}[A])^\top\right]\right) = \sum_{m=1}^\ell \mathbb{E}\left[(A_m - \mathbb{E}[A_m])^2\right]$, (1)\n\nwhere $\mathbb{E}$ denotes the expectation and $\ell$ is the dimensionality of state $s$. Let $B = \sum_{i=1}^\ell \tau_i^{-2}$.\n\nBelow, we consider a subset of the following assumptions:\n\nAssumption (A): $r(s, a, s') \in [-\beta, \beta]$ for $\beta > 0$.\nAssumption (B): $r(s, a, s') \in [\alpha, \beta]$ for $0 < \alpha < \beta$.\nAssumption (C): For $\delta > 0$, there exist two series $\{c_t\}_{t=1}^T$ and $\{d_t\}_{t=1}^T$ such that $\|s_t\|_2 \geq c_t$ and $\|s_t\|_2 \leq d_t$ hold with probability at least $(1-\delta)^{1/(2N)}$, respectively, over the choice of sample paths, where $\|\cdot\|_2$ denotes the $\ell_2$-norm.\n\nNote that Assumption (B) is stronger than Assumption (A). Let\n\n$L(T) = C_T \alpha^2 - D_T \beta^2/(2\pi)$, where $C_T = \sum_{t=1}^T c_t^2$ and $D_T = \sum_{t=1}^T d_t^2$.\n\nFirst, we analyze the variance of gradient estimates in PGPE (the proofs of all the theorems are provided in the supplementary material):\n\nTheorem 1. Under Assumption (A), we have the following upper bounds:\n\n$\mathrm{Var}\left[\nabla_\eta \widehat{J}(\rho)\right] \leq \frac{\beta^2 (1-\gamma^T)^2 B}{N(1-\gamma)^2}$ and $\mathrm{Var}\left[\nabla_\tau \widehat{J}(\rho)\right] \leq \frac{2\beta^2 (1-\gamma^T)^2 B}{N(1-\gamma)^2}$.\n\nThis theorem means that the upper bound of the variance of $\nabla_\eta \widehat{J}(\rho)$ is proportional to $\beta^2$ (the upper bound of squared rewards), $B$ (the trace of the inverse Gaussian covariance), and $(1-\gamma^T)^2/(1-\gamma)^2$, and is inversely proportional to the sample size $N$. The upper bound of the variance of $\nabla_\tau \widehat{J}(\rho)$ is twice that of $\nabla_\eta \widehat{J}(\rho)$. When $T$ goes to infinity, $(1-\gamma^T)^2$ converges to 1.\n\nNext, we analyze the variance of gradient estimates in REINFORCE:\n\nTheorem 2. Under Assumptions (B) and (C), we have the following lower bound with probability at least $1 - \delta$:\n\n$\mathrm{Var}\left[\nabla_\mu \widehat{J}(\theta)\right] \geq \frac{(1-\gamma^T)^2}{N\sigma^2(1-\gamma)^2} L(T)$.\n\nUnder Assumptions (A) and (C), we have the following upper bound with probability at least $(1-\delta)^{1/2}$:\n\n$\mathrm{Var}\left[\nabla_\mu \widehat{J}(\theta)\right] \leq \frac{D_T \beta^2 (1-\gamma^T)^2}{N\sigma^2(1-\gamma)^2}$.\n\nUnder Assumption (A), we have\n\n$\mathrm{Var}\left[\nabla_\sigma \widehat{J}(\theta)\right] \leq \frac{2T \beta^2 (1-\gamma^T)^2}{N\sigma^2(1-\gamma)^2}$.\n\nThe upper bounds for REINFORCE are similar to those for PGPE, but they are monotone increasing with respect to the trajectory length $T$. The lower bound for the variance of $\nabla_\mu \widehat{J}(\theta)$ is non-trivial if it is positive, i.e., $L(T) > 0$. This can be fulfilled, e.g., if $\alpha$ and $\beta$ satisfy $2\pi C_T \alpha^2 > D_T \beta^2$. Deriving a lower bound of the variance of $\nabla_\sigma \widehat{J}(\theta)$ is left open as future work.\n\nFinally, we compare the variance of gradient estimates in REINFORCE and PGPE:\n\nTheorem 3. In addition to Assumptions (B) and (C), we assume that $L(T)$ is positive and monotone increasing with respect to $T$. If there exists $T_0$ such that $L(T_0) \geq \beta^2 B \sigma^2$, then we have $\mathrm{Var}[\nabla_\mu \widehat{J}(\theta)] > \mathrm{Var}[\nabla_\eta \widehat{J}(\rho)]$ for all $T > T_0$, with probability at least $1 - \delta$.\n\nThe above theorem means that PGPE is more favorable than REINFORCE in terms of the variance of gradient estimates of the mean, if the trajectory length $T$ is large. 
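The variance gap predicted by Theorem 3 can be probed numerically. The sketch below reuses the illustrative setting of Section 5.1 ($s_{t+1} = s_t + a_t + \varepsilon$, $r = \exp(-s^2/2 - a^2/2) + 1$) and compares the empirical variance of the two mean-gradient estimators over repeated runs; the constants and helper names here are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, sigma, tau, T, N = 0.9, 1.0, 1.0, 10, 20
mu, eta = -1.5, -1.5            # policy mean (REINFORCE) / prior mean (PGPE)

def reward(s, a):
    return np.exp(-s ** 2 / 2 - a ** 2 / 2) + 1    # bounded in (1, 2]

def reinforce_estimate():
    """Average of N single-trajectory REINFORCE estimates of dJ/d mu."""
    g = 0.0
    for _ in range(N):
        s, R, elig = rng.standard_normal(), 0.0, 0.0
        for t in range(T):
            a = mu * s + sigma * rng.standard_normal()   # stochastic policy
            elig += (a - mu * s) / sigma ** 2 * s        # per-step eligibility
            R += gamma ** t * reward(s, a)
            s = s + a + 0.5 * rng.standard_normal()
        g += elig * R
    return g / N

def pgpe_estimate():
    """Average of N single-trajectory PGPE estimates of dJ/d eta."""
    g = 0.0
    for _ in range(N):
        theta = eta + tau * rng.standard_normal()        # sample a linear policy
        s, R = rng.standard_normal(), 0.0
        for t in range(T):
            a = theta * s                                # deterministic action
            R += gamma ** t * reward(s, a)
            s = s + a + 0.5 * rng.standard_normal()
        g += (theta - eta) / tau ** 2 * R
    return g / N

runs = [(reinforce_estimate(), pgpe_estimate()) for _ in range(100)]
v_rf = np.var([r for r, _ in runs])
v_pg = np.var([p for _, p in runs])
print(v_rf > v_pg)   # Theorem 3 predicts the PGPE variance is the smaller one
```

The exact numbers depend on the random seed, but with Table 1 reporting variances on the order of 13 (REINFORCE) versus 1 (PGPE) for $T = 10$, the comparison is expected to come out in PGPE's favor.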
This theoretical result would partially support the experimental success of the PGPE method [12].\n\n4 Variance Reduction by Subtracting a Baseline\n\nIn this section, we give a method to reduce the variance of gradient estimates in PGPE and analyze its theoretical properties.\n\n4.1 Basic Idea of Introducing a Baseline\n\nIt is known that the variance of gradient estimates can be reduced by subtracting a baseline $b$; for REINFORCE and PGPE, the modified gradient estimates are given by\n\n$\nabla_\theta \widehat{J}^b(\theta) = \frac{1}{N} \sum_{n=1}^N (R(h^n) - b) \sum_{t=1}^T \nabla_\theta \log p(a_t^n|s_t^n, \theta)$,\n\n$\nabla_\rho \widehat{J}^b(\rho) = \frac{1}{N} \sum_{n=1}^N (R(h^n) - b) \nabla_\rho \log p(\theta^n|\rho)$.\n\nThe adaptive reinforcement baseline [18] was derived as the exponential moving average of past experience:\n\n$b(n) = \gamma R(h^{n-1}) + (1 - \gamma) b(n-1)$,\n\nwhere $0 < \gamma \leq 1$. Based on this, an empirical gradient estimate with the moving-average baseline was proposed for REINFORCE [18] and PGPE [12].\n\nThe above moving-average baseline contributes to reducing the variance of gradient estimates. However, it was shown [5, 16] that the moving-average baseline is not optimal; the optimal baseline is, by definition, the minimizer of the variance of gradient estimates with respect to the baseline. Following this formulation, the optimal baseline for REINFORCE is given as follows [10]:\n\n$b^*_{\mathrm{REINFORCE}} := \arg\min_b \mathrm{Var}[\nabla_\theta \widehat{J}^b(\theta)] = \frac{\mathbb{E}\left[R(h) \left\|\sum_{t=1}^T \nabla_\theta \log p(a_t|s_t, \theta)\right\|^2\right]}{\mathbb{E}\left[\left\|\sum_{t=1}^T \nabla_\theta \log p(a_t|s_t, \theta)\right\|^2\right]}$.\n\nHowever, only the moving-average baseline has been introduced to PGPE so far [12], which is suboptimal. 
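The optimal baseline is a ratio of two expectations and can be estimated from the same roll-out samples used for the gradient. A minimal sketch follows (the function name and data layout are ours; the same formula applies to PGPE with the per-trajectory eligibility $\sum_t \nabla_\theta \log p(a_t|s_t, \theta)$ replaced by $\nabla_\rho \log p(\theta|\rho)$):

```python
import numpy as np

def optimal_baseline(returns, eligibilities):
    """Empirical b* = E[R(h) ||g||^2] / E[||g||^2], where g is the
    per-trajectory eligibility vector (summed over time for REINFORCE)."""
    w = np.array([g @ g for g in eligibilities])   # squared norms ||g||^2
    return float(np.dot(returns, w) / np.sum(w))

# When all eligibility norms are equal, b* reduces to the plain mean return:
b = optimal_baseline(np.array([1.0, 3.0]), [np.array([1.0]), np.array([-1.0])])
print(b)  # 2.0
```

When the squared eligibility norms differ across trajectories, $b^*$ is a weighted mean of the returns rather than the plain average, which is exactly the distinction drawn in Section 4.2.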
Below, we derive the optimal baseline for PGPE and study its theoretical properties.\n\n4.2 Optimal Baseline for PGPE\n\nLet $b^*_{\mathrm{PGPE}}$ be the optimal baseline for PGPE that minimizes the variance:\n\n$b^*_{\mathrm{PGPE}} := \arg\min_b \mathrm{Var}[\nabla_\rho \widehat{J}^b(\rho)]$.\n\nThen the following theorem gives the optimal baseline for PGPE:\n\nTheorem 4. The optimal baseline for PGPE is given by\n\n$b^*_{\mathrm{PGPE}} = \frac{\mathbb{E}[R(h) \|\nabla_\rho \log p(\theta|\rho)\|^2]}{\mathbb{E}[\|\nabla_\rho \log p(\theta|\rho)\|^2]}$,\n\nand the excess variance for a baseline $b$ is given by\n\n$\mathrm{Var}[\nabla_\rho \widehat{J}^b(\rho)] - \mathrm{Var}[\nabla_\rho \widehat{J}^{b^*_{\mathrm{PGPE}}}(\rho)] = \frac{(b - b^*_{\mathrm{PGPE}})^2}{N} \mathbb{E}[\|\nabla_\rho \log p(\theta|\rho)\|^2]$.\n\nThe above theorem gives an analytic-form expression of the optimal baseline for PGPE. When the return $R(h)$ and the squared norm of the characteristic eligibility $\|\nabla_\rho \log p(\theta|\rho)\|^2$ are independent of each other, the optimal baseline reduces to the average expected return $\mathbb{E}[R(h)]$; in general, however, the optimal baseline is different from the average expected return. The above theorem also shows that the excess variance is proportional to the squared difference of the baselines $(b - b^*_{\mathrm{PGPE}})^2$ and the expected squared norm of the characteristic eligibility $\mathbb{E}[\|\nabla_\rho \log p(\theta|\rho)\|^2]$, and is inversely proportional to the sample size $N$.\n\nNext, we analyze the contribution of the optimal baseline to the variance with respect to the mean parameter $\eta$ in PGPE:\n\nTheorem 5. 
If $r(s, a, s') \geq \alpha > 0$, we have the following lower bound:\n\n$\mathrm{Var}[\nabla_\eta \widehat{J}(\rho)] - \mathrm{Var}[\nabla_\eta \widehat{J}^{b^*_{\mathrm{PGPE}}}(\rho)] \geq \frac{\alpha^2 (1-\gamma^T)^2 B}{N(1-\gamma)^2}$.\n\nUnder Assumption (A), we have the following upper bound:\n\n$\mathrm{Var}[\nabla_\eta \widehat{J}(\rho)] - \mathrm{Var}[\nabla_\eta \widehat{J}^{b^*_{\mathrm{PGPE}}}(\rho)] \leq \frac{\beta^2 (1-\gamma^T)^2 B}{N(1-\gamma)^2}$.\n\nThis theorem shows that the lower and upper bounds of the excess variance are proportional to $\alpha^2$ and $\beta^2$ (the bounds of squared immediate rewards), $B$ (the trace of the inverse Gaussian covariance), and $(1-\gamma^T)^2/(1-\gamma)^2$, and are inversely proportional to the sample size $N$. When $T$ goes to infinity, $(1-\gamma^T)^2$ converges to 1.\n\n4.3 Comparison with REINFORCE\n\nNext, we analyze the contribution of the optimal baseline for REINFORCE, and compare it with that for PGPE. It was shown [5, 16] that the excess variance for a baseline $b$ in REINFORCE is given by\n\n$\mathrm{Var}[\nabla_\theta \widehat{J}^b(\theta)] - \mathrm{Var}[\nabla_\theta \widehat{J}^{b^*_{\mathrm{REINFORCE}}}(\theta)] = \frac{(b - b^*_{\mathrm{REINFORCE}})^2}{N} \mathbb{E}\left[\left\|\sum_{t=1}^T \nabla_\theta \log p(a_t|s_t, \theta)\right\|^2\right]$.\n\nBased on this, we have the following theorem:\n\nTheorem 6. 
Under Assumptions (B) and (C), we have the following bounds with probability at least $1 - \delta$:\n\n$\frac{C_T \alpha^2 (1-\gamma^T)^2}{N\sigma^2(1-\gamma)^2} \leq \mathrm{Var}[\nabla_\mu \widehat{J}(\theta)] - \mathrm{Var}[\nabla_\mu \widehat{J}^{b^*_{\mathrm{REINFORCE}}}(\theta)] \leq \frac{\beta^2 (1-\gamma^T)^2 D_T}{N\sigma^2(1-\gamma)^2}$.\n\nThe above theorem shows that the lower and upper bounds of the excess variance are monotone increasing with respect to the trajectory length $T$. In terms of the amount of reduction in the variance of gradient estimates, Theorems 5 and 6 show that the optimal baseline for REINFORCE contributes more than that for PGPE.\n\nFinally, based on Theorems 1 and 5 and on Theorems 2 and 6, we have the following theorem:\n\nTheorem 7. Under Assumptions (B) and (C), we have\n\n$\mathrm{Var}[\nabla_\eta \widehat{J}^{b^*_{\mathrm{PGPE}}}(\rho)] \leq \frac{(1-\gamma^T)^2}{N(1-\gamma)^2} (\beta^2 - \alpha^2) B$,\n\n$\mathrm{Var}[\nabla_\mu \widehat{J}^{b^*_{\mathrm{REINFORCE}}}(\theta)] \leq \frac{(1-\gamma^T)^2}{N\sigma^2(1-\gamma)^2} (\beta^2 D_T - \alpha^2 C_T)$,\n\nwhere the latter inequality holds with probability at least $1 - \delta$.\n\nThis theorem shows that the upper bound of the variance of gradient estimates for REINFORCE with the optimal baseline is still monotone increasing with respect to the trajectory length $T$. On the other hand, since $(1-\gamma^T)^2 \leq 1$, the above upper bound of the variance of gradient estimates in PGPE with the optimal baseline can be further upper-bounded as $\mathrm{Var}[\nabla_\eta \widehat{J}^{b^*_{\mathrm{PGPE}}}(\rho)] \leq \frac{(\beta^2 - \alpha^2) B}{N(1-\gamma)^2}$, which is independent of $T$. Thus, when the trajectory length $T$ is large, the variance of gradient estimates in REINFORCE with the optimal baseline may be significantly larger than the variance of gradient estimates in PGPE with the optimal baseline.\n\n5 Experiments\n\nIn this section, we experimentally investigate the usefulness of the proposed method, PGPE with the optimal baseline.\n\n5.1 Illustrative Data\n\nLet the state space $\mathcal{S}$ be one-dimensional and continuous, and let the initial state be randomly chosen from the standard normal distribution. The action space $\mathcal{A}$ is also set to be one-dimensional and continuous. The transition dynamics of the environment is set at $s_{t+1} = s_t + a_t + \varepsilon$, where $\varepsilon \sim N(0, 0.5^2)$ is stochastic noise. The immediate reward is defined as $r = \exp(-s^2/2 - a^2/2) + 1$, which is bounded as $1 < r \leq 2$. The discount factor is set at $\gamma = 0.9$.\n\nHere, we compare the following five methods: REINFORCE without any baseline, REINFORCE with the optimal baseline (OB), PGPE without any baseline, PGPE with the moving-average baseline (MB), and PGPE with the optimal baseline (OB). For fair comparison, all of these methods use the same parameter setup: the mean and standard deviation of the Gaussian distribution are set at $\mu = -1.5$ and $\sigma = 1$, the number of episodic samples is set at $N = 100$, and the length of the trajectory is set at $T = 10$ or $50$. We then calculate the variance of gradient estimates over 100 runs.\n\nTable 1 summarizes the results, showing that the variance of REINFORCE is overall larger than that of PGPE. 
Table 1: Variance and bias of estimated gradients for the illustrative data.\n\nT = 10:\nMethod         Variance ($\mu$/$\eta$)  Variance ($\sigma$/$\tau$)  Bias ($\mu$/$\eta$)  Bias ($\sigma$/$\tau$)\nREINFORCE       13.2570   26.9173   -0.3102   -1.5098\nREINFORCE-OB     0.0914    0.1203    0.0672    0.1286\nPGPE             0.9707    1.6855   -0.0691    0.1319\nPGPE-MB          0.2127    0.3238    0.0828   -0.1295\nPGPE-OB          0.0372    0.0685   -0.0164    0.0512\n\nT = 50:\nMethod         Variance ($\mu$/$\eta$)  Variance ($\sigma$/$\tau$)  Bias ($\mu$/$\eta$)  Bias ($\sigma$/$\tau$)\nREINFORCE      188.3860  278.3095   -1.8126   -5.1747\nREINFORCE-OB     0.5454    0.8996   -0.2988   -0.2008\nPGPE             1.6572    3.3720   -0.1048   -0.3293\nPGPE-MB          0.4123    0.8332    0.0925   -0.2556\nPGPE-OB          0.0850    0.1815    0.0480   -0.0779\n\nFigure 1: Variance of gradient estimates with respect to the mean parameter through policy-update iterations for the illustrative data ((a) REINFORCE and REINFORCE-OB; (b) PGPE, PGPE-MB, and PGPE-OB; (c) REINFORCE and PGPE; (d) REINFORCE-OB and PGPE-OB).\n\nFigure 2: Return as a function of the number of episodic samples $N$ for the illustrative data ((a) good initial policy; (b) poor initial policy).\n\nA notable difference between REINFORCE and PGPE is that the variance of REINFORCE grows significantly as $T$ increases, whereas that of PGPE is not influenced much by $T$. This agrees well with our theoretical analysis in Section 3. The results also show that the variance of PGPE-OB (the proposed method) is much smaller than that of PGPE-MB. REINFORCE-OB contributes highly to reducing the variance especially when $T$ is large, which also agrees well with our theory. However, PGPE-OB still provides much smaller variance than REINFORCE-OB.\n\nWe also investigate the bias of gradient estimates of each method. We regard gradients estimated with $N = 1000$ as true gradients, and compute the bias of gradient estimates when $N = 100$. 
The results are also included in Table 1, showing that the introduction of baselines does not increase the bias; rather, it tends to reduce the bias.\n\nNext, we investigate the variance of gradient estimates when policy parameters are updated over iterations. In this experiment, we set $N = 10$ and $T = 20$, and the variance is computed from 50 runs. Policies are updated over 50 iterations. In order to evaluate the variance in a stable manner, we repeat the above experiments 20 times with a random choice of the initial mean parameter $\mu$ from $[-3.0, -0.1]$, and investigate the average variance of gradient estimates with respect to the mean parameter $\mu$ over the 20 trials, in $\log_{10}$-scale.\n\nThe results are summarized in Figure 1. Figure 1(a) compares the variance of REINFORCE with/without baselines, whereas Figure 1(b) compares the variance of PGPE with/without baselines. These plots show that the introduction of baselines contributes highly to the reduction of the variance over iterations. Figure 1(c) compares the variance of REINFORCE and PGPE without baselines, showing that PGPE provides much more stable gradient estimates than REINFORCE. Figure 1(d) compares the variance of REINFORCE and PGPE with the optimal baselines, showing that the variance of gradient estimates obtained by PGPE-OB is much smaller than that of REINFORCE-OB. Overall, in terms of the variance of gradient estimates, the proposed PGPE-OB compares favorably with the other methods.\n\nNext, we evaluate the returns obtained by each method. The trajectory length is fixed at $T = 20$, and the maximum number of policy-update iterations is set at 50. We investigate average returns over 20 runs as functions of the number of episodic samples $N$. We report two experimental results for different initial policies. 
Figure 2(a) shows the results when the initial mean parameter $\mu$ is chosen randomly from $[-1.6, -0.1]$, which tends to perform well. The graph shows that PGPE-OB performs the best, especially when $N < 5$; REINFORCE-OB follows with a small margin. PGPE-MB and plain PGPE also work reasonably well, although they are slightly unstable due to larger variance. Plain REINFORCE is highly unstable, which is caused by the huge variance of its gradient estimates (see Figure 1 again).\n\nFigure 2(b) shows the results when the initial mean parameter $\mu$ is chosen randomly from $[-3.0, -0.1]$, which tends to result in poorer performance. In this setup, the difference among the compared methods is more significant than in the case with good initial policies. Overall, plain REINFORCE performs very poorly, and even REINFORCE-OB tends to be outperformed by the PGPE methods. This means that REINFORCE is very sensitive to the choice of initial policies. Among the PGPE methods, the proposed PGPE-OB works very well and converges quickly.\n\n5.2 Cart-Pole Balancing\n\nFinally, we evaluate the performance of our proposed method in the more complex task of cart-pole balancing [3]. 
A pole is hinged to a cart, and the goal is to swing up the pole by moving the cart properly and to keep the pole at the top. The state space $\mathcal{S}$ is two-dimensional and continuous, consisting of the angle $\varphi \in [0, 2\pi]$ and the angular velocity $\dot{\varphi} \in [-3\pi, 3\pi]$ of the pole. The action space $\mathcal{A}$ is one-dimensional and continuous, corresponding to the force applied to the cart (note that we cannot directly control the pole, but only indirectly through moving the cart). We use the Gaussian policy model for REINFORCE and the linear policy model for PGPE, where state $s$ is non-linearly transformed to a feature space via a basis function vector. We use 20 Gaussian kernels with standard deviation $\sigma = 0.5$ as the basis functions, where the kernel centers are distributed over the following grid points: $\{0, \pi/2, \pi, 3\pi/2\} \times \{-3\pi, -3\pi/2, 0, 3\pi/2, 3\pi\}$. The dynamics of the pole (i.e., the update rule of the angle and the angular velocity) is given by\n\n$\varphi_{t+1} = \varphi_t + \dot{\varphi}_{t+1} \Delta t$ and $\dot{\varphi}_{t+1} = \dot{\varphi}_t + \frac{9.8 \sin(\varphi_t) - \alpha w l \dot{\varphi}_t^2 \sin(2\varphi_t)/2 + \alpha \cos(\varphi_t) a_t}{4l/3 - \alpha w l \cos^2(\varphi_t)} \Delta t$,\n\nwhere $\alpha = 1/(W + w)$ and $a_t$ is the action taken at time $t$. We set the problem parameters as follows: the mass of the cart $W = 8$ [kg], the mass of the pole $w = 2$ [kg], and the length of the pole $l = 0.5$ [m]. We set the time step $\Delta t$ at 0.01 [s] for the position and velocity updates and at 0.1 [s] for action selection. The reward function is defined as $r(s_t, a_t, s_{t+1}) = \cos(\varphi_{t+1})$; that is, the higher the pole is, the more reward we obtain. The initial policy is chosen randomly, and the initial-state probability density is set to be uniform. The agent collects $N = 100$ episodic samples with trajectory length $T = 40$. 
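The pole dynamics above translate directly into a small simulation step. This is a sketch with names of our choosing; we read the constant printed as "$\alpha = 1/W + w$" as $\alpha = 1/(W + w)$, the standard cart-pole form:

```python
import numpy as np

def pole_step(phi, phi_dot, a, W=8.0, w=2.0, l=0.5, dt=0.01):
    """One Euler step of the pole dynamics: the angular velocity is updated
    first, then the angle uses the new velocity, matching the update rule."""
    alpha = 1.0 / (W + w)
    accel = (9.8 * np.sin(phi)
             - alpha * w * l * phi_dot ** 2 * np.sin(2 * phi) / 2
             + alpha * np.cos(phi) * a) / (4 * l / 3 - alpha * w * l * np.cos(phi) ** 2)
    phi_dot_next = phi_dot + accel * dt
    phi_next = phi + phi_dot_next * dt
    return phi_next, phi_dot_next

# The reward for the resulting transition is then cos(phi_next).
```

At $\varphi = \dot{\varphi} = 0$ with no force the pole stays put, and pushing the cart ($a > 0$) accelerates the pole, which matches the intended indirect control of the pole through the cart.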
The discount factor is set at $\gamma = 0.9$.\n\nWe investigate average returns over 10 trials as functions of the number of policy-update iterations. The return at each trial is computed over 100 test episodic samples (which are not used for policy learning). The experimental results are plotted in Figure 3, showing that the improvement of both plain REINFORCE and REINFORCE-OB tends to be slow, and that all PGPE methods outperform the REINFORCE methods overall. Among the PGPE methods, the proposed PGPE-OB converges faster than the others.\n\nFigure 3: Performance of policy learning for the cart-pole balancing task (average return over policy-update iterations).\n\n6 Conclusion\n\nIn this paper, we analyzed and improved the stability of the policy gradient method called PGPE (policy gradients with parameter-based exploration). We theoretically showed that, under a mild condition, PGPE provides more stable gradient estimates than the classical REINFORCE method. We also derived the optimal baseline for PGPE, and theoretically showed that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrated the usefulness of PGPE with the optimal baseline through experiments.\n\nAcknowledgments: TZ and GN were supported by the MEXT scholarship and the GCOE program, HH was supported by the FIRST program, and MS was supported by MEXT KAKENHI 23120004.\n\nReferences\n\n[1] N. Abe, P. Melville, C. Pendus, C. K. Reddy, D. L. Jensen, V. P. Thomas, J. J. Bennett, G. F. Anderson, B. R. Cooley, M. Kowalczyk, M. Domick, and T. Gardinier. Optimizing debt collections using constrained reinforcement learning. In Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 75\u201384, 2010.\n\n[2] J. Baxter, P. Bartlett, and L. Weaver. Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351\u2013381, 2001.\n\n[3] M. Bugeja. Non-linear swing-up and stabilizing control of an inverted pendulum system. In Proceedings of IEEE Region 8 EUROCON, volume 2, pages 437\u2013441, 2003.\n\n[4] P. Dayan and G. E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271\u2013278, 1997.\n\n[5] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5:1471\u20131530, 2004.\n\n[6] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237\u2013285, 1996.\n\n[7] S. Kakade. A natural policy gradient. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1531\u20131538, Cambridge, MA, 2002. MIT Press.\n\n[8] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107\u20131149, 2003.\n\n[9] P. Marbach and J. N. Tsitsiklis. Approximate gradient methods in policy-space optimization of Markov reward processes. Discrete Event Dynamic Systems, 13(1-2):111\u2013148, 2004.\n\n[10] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006.\n\n[11] F. Sehnke, C. Osendorfer, T. R\u00fcckstie\u00df, A. Graves, J. Peters, and J. Schmidhuber. Policy gradients with parameter-based exploration for control. In Proceedings of the 18th International Conference on Artificial Neural Networks, pages 387\u2013396, 2008.\n\n[12] F. Sehnke, C. Osendorfer, T. R\u00fcckstie\u00df, A. Graves, J. Peters, and J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551\u2013559, 2010.\n\n[13] R. S. Sutton and A. G. Barto. 
Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.\n\n[14] G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215\u2013219, 1994.\n\n[15] L. Weaver and J. Baxter. Reinforcement learning from state and temporal differences. Technical report, Department of Computer Science, Australian National University, 1999.\n\n[16] L. Weaver and N. Tao. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 538\u2013545, 2001.\n\n[17] J. D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2):393\u2013422, 2007.\n\n[18] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229\u2013256, 1992.\n", "award": [], "sourceid": 197, "authors": [{"given_name": "Tingting", "family_name": "Zhao", "institution": null}, {"given_name": "Hirotaka", "family_name": "Hachiya", "institution": null}, {"given_name": "Gang", "family_name": "Niu", "institution": null}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": null}]}