{"title": "Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 1660, "page_last": 1668, "abstract": "In this paper, we propose an efficient algorithm for estimating the natural policy gradient with parameter-based exploration; this algorithm samples directly in the parameter space. Unlike previous methods based on natural gradients, our algorithm calculates the natural policy gradient using the inverse of the exact Fisher information matrix. The computational cost of this algorithm is equal to that of conventional policy gradients whereas previous natural policy gradient methods have a prohibitive computational cost. Experimental results show that the proposed method outperforms several policy gradient methods.", "full_text": "Natural Policy Gradient Methods with\n\nParameter-based Exploration for Control Tasks\n\nAtsushi Miyamae\u2020\u2021, Yuichi Nagata\u2020, Isao Ono\u2020, Shigenobu Kobayashi\u2020\n\n{miyamae@fe., nagata@fe., isao@, kobayasi@}dis.titech.ac.jp\n\n\u2020: Department of Computational Intelligence and Systems Science\n\u2021: Research Fellow of the Japan Society for the Promotion of Science\n\nTokyo Institute of Technology, Kanagawa, Japan\n\nAbstract\n\nIn this paper, we propose an ef\ufb01cient algorithm for estimating the natural policy\ngradient using parameter-based exploration; this algorithm samples directly in the\nparameter space. Unlike previous methods based on natural gradients, our algo-\nrithm calculates the natural policy gradient using the inverse of the exact Fisher\ninformation matrix. The computational cost of this algorithm is equal to that of\nconventional policy gradients whereas previous natural policy gradient methods\nhave a prohibitive computational cost. 
Experimental results show that the pro-\nposed method outperforms several policy gradient methods.\n\n1 Introduction\n\nReinforcement learning can be used to handle policy search problems in unknown environments.\nPolicy gradient methods [22, 20, 5] train parameterized stochastic policies by climbing the gradient\nof the average reward. The advantage of such methods is that one can easily deal with continuous\nstate-action spaces and continuing (not episodic) tasks. Policy gradient methods have thus been successfully\napplied to several practical tasks [11, 21, 16].\n\nIn the domain of control, a policy is often constructed with a controller and an exploration strat-\negy. The controller is represented by a domain-appropriate pre-structured parametric function. The\nexploration strategy is required to seek the parameters of the controller. Instead of directly perturb-\ning the parameters of the controller, conventional exploration strategies perturb the resulting control\nsignal. However, a signi\ufb01cant problem with such sampling strategies is that the high variance of their\ngradient estimates leads to slow convergence. Recently, parameter-based exploration [18] strategies\nthat search the controller parameter space by direct parameter perturbation have been proposed, and\nthese have been demonstrated to work more ef\ufb01ciently than conventional strategies [17, 18, 13].\nAnother approach to speeding up policy gradient methods is to replace the gradient with the natural\ngradient [2], the so-called natural policy gradient [9, 4, 15]; this is motivated by the intuition that\na change in the policy parameterization should not in\ufb02uence the result of the policy update. 
The\ncombination of parameter-based exploration strategies and the natural policy gradient is expected\nto result in improvements in the convergence rate; however, such an algorithm has not yet been\nproposed.\n\nA further dif\ufb01culty is that natural policy gradients with parameter-based exploration strategies\nhave a high computational cost. The natural policy gradient requires the computation of\nthe inverse of the Fisher information matrix (FIM) of the policy distribution; this is prohibitively\nexpensive, especially for a high-dimensional policy. Unfortunately, parameter-based exploration\nstrategies tend to have higher dimensions than control-based ones. Therefore, such a method\nis dif\ufb01cult to apply to realistic control tasks.\n\n1\n\n\fIn this paper, we propose a new reinforcement learning method that combines the natural policy\ngradient and parameter-based exploration. We derive an ef\ufb01cient algorithm for estimating the natu-\nral policy gradient with a particular exploration strategy implementation. Our algorithm calculates\nthe natural policy gradient using the inverse of the exact FIM and the Monte Carlo-estimated gra-\ndient. The resulting algorithm, called natural policy gradients with parameter-based exploration\n(NPGPE), has a computational cost similar to that of conventional policy gradient algorithms. Nu-\nmerical experiments show that the proposed method outperforms several policy gradient methods,\nincluding the current state-of-the-art NAC [15] with control-based exploration.\n\n2 Policy Search Framework\n\nWe consider the standard reinforcement learning framework in which an agent interacts with a\nMarkov decision process. 
In this section, we review the estimation of policy gradients and describe\nthe difference between control- and parameter-based exploration.\n\n2.1 Markov Decision Process Notation\nAt each discrete time t, the agent observes state st \u2208 S, selects action at \u2208 A, and then receives an\ninstantaneous reward rt \u2208 \u211d resulting from a state transition in the environment. The state S and the\naction A are both de\ufb01ned as continuous spaces in this paper. The next state st+1 is chosen according\nto the transition probability pT (st+1|st, at), and the reward rt is given randomly according to the\nexpectation R(st, at). The agent does not know pT (st+1|st, at) and R(st, at) in advance.\nThe objective of the reinforcement learning agent is to construct a policy that maximizes the agent\u2019s\nperformance. A parameterized policy \u03c0(a|s, \u03b8) is de\ufb01ned as a probability distribution over an action\nspace under a given state with parameters \u03b8. We assume that each \u03b8 \u2208 \u211dd has a unique well-de\ufb01ned\nstationary distribution pD(s|\u03b8). Under this assumption, a natural performance measure for in\ufb01nite\nhorizon tasks is the average reward\n\n\u03b7(\u03b8) = \u222bS pD(s|\u03b8) \u222bA \u03c0(a|s, \u03b8)R(s, a) da ds.\n\n2.2 Policy Gradients\n\nPolicy gradient methods update policies by estimating the gradient of the average reward w.r.t. the\npolicy parameters. The state-action value is Q\u03b8(s, a) = E[\u2211\u221et=1 (rt \u2212 \u03b7(\u03b8)) | s1 = s, a1 = a, \u03b8], and\nit is assumed that \u03c0(a|s, \u03b8) is differentiable w.r.t. \u03b8. 
The exact gradient of the average reward (see\n[20]) is given by\n\n\u03c0(a|s, \u03b8)\u2207\u03b8 log \u03c0(a|s, \u03b8)Q\u03b8(s, a)dads.\n\n(1)\n\n(cid:90)\n\n(cid:90)\n\n\u2207\u03b8\u03b7(\u03b8) =\n\npD(s|\u03b8)\n\nS\n\nA\n\nThe natural gradient [2] has a basis in information geometry, which studies the Riemannian geomet-\nric structure of the manifold of probability distributions. A result in information geometry states that\nthe FIM de\ufb01nes a Riemannian metric tensor on the space of probability distributions [3] and that the\ndirection of the steepest descent on a Riemannian manifold is given by the natural gradient, given by\nthe conventional gradient premultiplied by the inverse matrix of the Riemannian metric tensor [2].\nThus, the natural gradient can be computed from the gradient and the FIM, and it tends to converge\nfaster than the conventional gradient.\n\nKakade [9] applied the natural gradient to policy search; this was called as the natural policy gra-\ndient. If the FIM is invertible, the natural policy gradient \u02dc\u2207\u03b8\u03b7(\u03b8) \u2261 F\u22121\n\u03b8 \u2207\u03b8\u03b7(\u03b8) is given by the\npolicy gradient premultiplied by the inverse matrix of the FIM F\u03b8. In this paper, we employ the FIM\nproposed by Kakade [9], de\ufb01ned as\n\n(cid:90)\n\n(cid:90)\n\nF\u03b8 =\n\npD(s|\u03b8)\n\nS\n\nA\n\n\u03c0(a|s, \u03b8)\u2207\u03b8 log \u03c0(a|s, \u03b8)\u2207\u03b8 log \u03c0(a|s, \u03b8)Tdads.\n\n2\n\n\fFigure 1: Illustration of the main difference between control-based exploration and parameter-based\nexploration. The controller \u03c8(u|s, w) is represented by a single-layer perceptron. 
While the control-\nbased exploration strategy (left) perturbs the resulting control signal, the parameter-based explo-\nration strategy (right) perturbs the parameters of the controller.\n\n2.3 Learning from Samples\nThe calculation of (1) requires knowledge of the underlying transition probabilities pD(s|\u03b8).\nThe GPOMDP algorithm [5] instead computes a Monte Carlo approximation of (1): the\nagent interacts with the environment, producing an observation, action, and reward sequence\n{s1, a1, r1, s2, ..., sT , aT , rT}. Under mild technical assumptions, the policy gradient approxima-\ntion is\n\n\u2207\u03b8\u03b7(\u03b8) \u2248 (1/T) \u2211Tt=1 rt zt,\n\nwhere zt = \u03b2zt\u22121 + \u2207\u03b8 log \u03c0(at|st, \u03b8) is called the eligibility trace [12], \u2207\u03b8 log \u03c0(at|st, \u03b8) is\ncalled the characteristic eligibility [22], and \u03b2 denotes the discount factor (0 \u2264 \u03b2 < 1). As \u03b2 \u2192 1,\nthe estimation approaches the true gradient1, but the variance increases (\u03b2 is set to 0.9 in all\nexperiments). We de\ufb01ne \u02dc\u2207\u03b8 log \u03c0(at|st, \u03b8) \u2261 F\u03b8\u22121\u2207\u03b8 log \u03c0(at|st, \u03b8). Therefore, the natural policy\ngradient approximation is\n\n\u02dc\u2207\u03b8\u03b7(\u03b8) \u2248 (1/T) \u2211Tt=1 F\u03b8\u22121 rt zt = (1/T) \u2211Tt=1 rt \u02dczt,\n\n(2)\n\nwhere \u02dczt = \u03b2\u02dczt\u22121 + \u02dc\u2207\u03b8 log \u03c0(at|st, \u03b8). 
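As a concrete toy illustration of the estimator above — not taken from the paper — the following sketch runs the eligibility-trace update zt = \u03b2zt\u22121 + \u2207\u03b8 log \u03c0 on a one-dimensional Gaussian policy whose reward is simply the sampled action; the environment, function names, and constants are hypothetical stand-ins.

```python
import numpy as np

def gpomdp_estimate(theta, n_steps=20000, beta=0.0, baseline=0.0, seed=0):
    """Monte Carlo policy gradient with an eligibility trace:
    z_t = beta * z_{t-1} + grad log pi(a_t | theta); estimate = mean((r_t - b) * z_t)."""
    rng = np.random.default_rng(seed)
    z, total = 0.0, 0.0
    for _ in range(n_steps):
        a = rng.normal(theta, 1.0)   # a_t ~ N(theta, 1)
        r = a                        # toy reward: r_t = a_t
        g = a - theta                # grad_theta log pi(a_t | theta) for unit variance
        z = beta * z + g             # eligibility trace
        total += (r - baseline) * z
    return total / n_steps

# For r = a with a ~ N(theta, 1), the true gradient d/dtheta E[r] is exactly 1.
est = gpomdp_estimate(theta=0.5, baseline=0.5)
```

With beta = 0 and the baseline set to the mean reward, the estimate should be close to the true gradient (1 in this toy case), up to Monte Carlo noise.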
To estimate the natural policy gradient, the heuristic sug-\ngested by Kakade [9] used\n\nF\u03b8,t = (1 \u2212 1/t)F\u03b8,t\u22121 + (1/t)(\u2207\u03b8 log \u03c0(at|st, \u03b8)\u2207\u03b8 log \u03c0(at|st, \u03b8)T + \u03bbI),\n\n(3)\n\nthe online estimate of the FIM, where \u03bb is a small positive constant.\n\n2.4 Parameter-based Exploration\nIn most control tasks, we attempt to have a (deterministic or stochastic) controller \u03c8(u|s, w) and\nan exploration strategy, where u \u2208 U \u2286 \u211dm denotes control and w \u2208 W \u2286 \u211dn, the parameters\nof the controller. The objective of learning is to seek suitable values of the parameters w, and\nthe exploration strategy is required to carry out stochastic sampling near the current parameters. A\ntypical exploration strategy model, which we call control-based exploration, would be a normal distribution\nfor the control space (Figure 1 (left)). In this case, the action of the agent is control, and the policy is\nrepresented by\n\n\u03c0U (u|s, \u03b8) = (1/((2\u03c0)m/2|\u03a3|1/2)) exp(\u2212(1/2)(u \u2212 \u03c8(s, w))T\u03a3\u22121(u \u2212 \u03c8(s, w))) : S \u2192 U,\n\nwhere \u03a3 is the m \u00d7 m covariance matrix and the agent seeks \u03b8 = \u27e8w, \u03a3\u27e9. The control at time t is\ngenerated by\n\n\u02dcut = \u03c8(st, w),\nut \u223c N (\u02dcut, \u03a3).\n\n1 [5] showed that the approximation error is proportional to (1\u2212\u03b2)/(1\u2212|\u03ba2|), where \u03ba2 is the sub-dominant\neigenvalue of the Markov chain.\n\n3\n\n\fOne useful feature of such a Gaussian unit [22] is that the agent can potentially control its degree of\nexploratory behavior.\n\nThe control-based exploration strategy samples near the output of the controller. However, the\nstructures of the parameter space and the control space are not always identical. 
Therefore, the\nsampling strategy generates controls that are not likely to be generated from the current controller,\neven if the exploration variances decrease. This property leads to gradient estimates with large variance.\nThis might be one reason why the policy improvement gets stuck.\n\nTo address this issue, Sehnke et al. [18] introduced a different exploration strategy for policy gradient\nmethods called policy gradients with parameter-based exploration (PGPE). In this approach, the\naction of the agent is the parameters of the controller, and the policy is represented by\n\n\u03c0W ( \u02dcw|s, \u03b8) = (1/((2\u03c0)n/2|\u02dc\u03a3|1/2)) exp(\u2212(1/2)( \u02dcw \u2212 w)T \u02dc\u03a3\u22121( \u02dcw \u2212 w)) : S \u2192 W,\n\nwhere \u02dc\u03a3 is the n \u00d7 n covariance matrix and the agent seeks \u03b8 = \u27e8w, \u02dc\u03a3\u27e9. The controller is included\nin the dynamics of the environment, and the control at time t is generated by\n\n\u02dcwt \u223c N (w, \u02dc\u03a3),\nut = \u03c8(st, \u02dcwt).\n\nGPOMDP-based methods can estimate policy gradients even in partially observable settings; note that the\npolicy \u03c0W ( \u02dcw|s, \u03b8) does not depend on the observation of the current state. Because this exploration strategy\ndirectly perturbs the parameters (Figure 1 (right)), the samples are generated near the current param-\neters under small exploration variances. Note that the advantage of this framework is that because\nthe gradient is estimated directly by sampling the parameters of the controller, the implementation\nof the policy gradient algorithms does not require \u2202\u03c8/\u2202\u03b8, which is dif\ufb01cult to derive from complex\ncontrollers.\n\nSehnke et al. [18] demonstrated that PGPE can yield faster convergence than the control-based ex-\nploration strategy in several challenging episodic tasks. 
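A minimal sketch of the two strategies, assuming a hypothetical linear controller \u03c8(s, w) = wTs (the controller, state, and variance values here are illustrative, not from the paper): control-based exploration adds noise to the control signal, whereas PGPE draws a perturbed parameter vector once and then acts deterministically with it.

```python
import numpy as np

rng = np.random.default_rng(1)

def psi(s, w):
    """Hypothetical linear controller: u = w^T s."""
    return w @ s

w = np.array([0.3, -0.7])   # current controller parameters
s = np.array([1.0, 2.0])    # current state

# Control-based exploration: perturb the resulting control signal.
sigma_u = 0.1
u_ctrl = psi(s, w) + sigma_u * rng.standard_normal()

# Parameter-based exploration (PGPE): perturb the parameters,
# then run the perturbed controller deterministically.
Sigma_w = 0.1 * np.eye(2)
w_tilde = rng.multivariate_normal(w, Sigma_w)
u_param = psi(s, w_tilde)
```

The design point illustrated: in the parameter-based case the stochasticity lives entirely in w_tilde, so the executed control is always one the (perturbed) controller can actually produce.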
However, the parameter-based exploration\ntends to have a higher dimension than the control-based one. Therefore, because of the computa-\ntional cost of the inverse of F\u03b8 calculated by (3), natural policy gradients \ufb01nd limited applications.\n\n3 Natural Policy Gradients with Parameter-based Exploration\n\nIn this section, we propose a new algorithm called natural policy gradients with parameter-based\nexploration (NPGPE) for the ef\ufb01cient estimation of the natural policy gradient.\n\n3.1 Implementation of Gaussian-based Exploration Strategy\nWe employ the policy representation model \u00b5( \u02dcw|\u03b8), a multivariate normal distribution with parame-\nters \u03b8 = \u27e8w, C\u27e9, where w represents the mean and C, the Cholesky decomposition of the covariance\nmatrix \u02dc\u03a3 such that C is an n \u00d7 n upper triangular matrix and \u02dc\u03a3 = CTC. Sun et al. [19] noted\ntwo advantages of this implementation: C makes explicit the n(n + 1)/2 independent parameters\ndetermining the covariance matrix \u02dc\u03a3; in addition, the diagonal elements of C are the square roots of\nthe eigenvalues of \u02dc\u03a3, and therefore, CTC is always positive semide\ufb01nite. In the remainder of the\ntext, we consider \u03b8 to be an [n(n + 3)/2]-dimensional column vector consisting of the elements of\nw and the upper-right elements of C, i.e.,\n\n\u03b8 = [wT, (C1:n,1)T, (C2:n,2)T, ..., (Cn:n,n)T]T.\n\nHere, Ck:n,k is the sub-matrix in C at row k to n and column k.\n\n3.2 Inverse of Fisher Information Matrix\n\nPrevious natural policy gradient methods [9] use the empirical FIM, which is estimated from a\nsample path. For \u00b5( \u02dcw|\u03b8), such methods are highly inef\ufb01cient because they must invert the empirical FIM, a matrix\nwith O(n4) elements. 
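The dimension counting above can be checked with a short sketch (variable names are illustrative, not from the paper): \u03b8 = \u27e8w, C\u27e9 has n(n + 3)/2 entries, so the corresponding FIM has O(n4) elements, and \u02dc\u03a3 = CTC is symmetric positive semidefinite by construction.

```python
import numpy as np

n = 3
rng = np.random.default_rng(0)

w = rng.standard_normal(n)                  # mean of the exploration distribution
C = np.triu(rng.standard_normal((n, n)))    # upper triangular Cholesky factor
Sigma = C.T @ C                             # covariance: symmetric PSD by construction

# theta packs w (n entries) plus the n(n+1)/2 upper-triangular entries of C.
theta_dim = n + n * (n + 1) // 2            # = n(n+3)/2
fim_entries = theta_dim ** 2                # the FIM is theta_dim x theta_dim, i.e. O(n^4)
```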
We avoid this problem by directly computing the exact FIM.\n\n4\n\n\fAlgorithm 1 Natural Policy Gradient Method with Parameter-based Exploration\nRequire: \u03b8 = \u27e8w, C\u27e9: policy parameters, \u03c8(u|s, w): controller, \u03b1: step size, \u03b2: discount rate, b: baseline.\n1: Initialize \u02dcz0 = 0, observe s1.\n2: for t = 1, ... do\n3: Draw \u03bet \u223c N (0, I), compute action \u02dcwt = CT\u03bet + w.\n4: Execute ut \u223c \u03c8(ut|st, \u02dcwt), obtain observation st+1 and reward rt.\n5: \u02dc\u2207w log \u00b5( \u02dcwt|\u03b8) = \u02dcwt \u2212 w, \u02dc\u2207C log \u00b5( \u02dcwt|\u03b8) = {triu(\u03bet\u03betT) \u2212 (1/2)diag(\u03bet\u03betT) \u2212 (1/2)I}C\n6: \u02dczt = \u03b2\u02dczt\u22121 + \u02dc\u2207\u03b8 log \u00b5( \u02dcwt|\u03b8)\n7: \u03b8 \u2190 \u03b8 + \u03b1(rt \u2212 b)\u02dczt\n8: end for\n\nSubstituting \u03c0 = \u00b5( \u02dcw|\u03b8) into (1), we can rewrite the policy gradient to obtain\n\n\u2207\u03b8\u03b7(\u03b8) = \u222bS pD(s|\u03b8) \u222bW \u00b5( \u02dcw|\u03b8)\u2207\u03b8 log \u00b5( \u02dcw|\u03b8)Q\u03b8(s, \u02dcw) d \u02dcw ds.\n\nFurthermore, the FIM of this distribution is\n\nF\u03b8 = \u222bS pD(s|\u03b8) \u222bW \u00b5( \u02dcw|\u03b8)\u2207\u03b8 log \u00b5( \u02dcw|\u03b8)\u2207\u03b8 log \u00b5( \u02dcw|\u03b8)T d \u02dcw ds = \u222bW \u00b5( \u02dcw|\u03b8)\u2207\u03b8 log \u00b5( \u02dcw|\u03b8)\u2207\u03b8 log \u00b5( \u02dcw|\u03b8)T d \u02dcw.\n\nBecause F\u03b8 is independent of pD(s|\u03b8), we can use the exact FIM.\nSun et al. 
[19] proved that the precise FIM of the Gaussian distribution N (w, CTC) becomes a\nblock-diagonal matrix diag(F0, ..., Fn) whose \ufb01rst block F0 is identical to \u02dc\u03a3\u22121 and whose k-th\n(1 \u2264 k \u2264 n) block Fk is given by\n\nFk = [1/ck,k\u00b2 0; 0 0] + (\u02dc\u03a3\u22121)k:n,k:n = [0 I\u00afk] C\u22121 (vkvkT + I) C\u2212T [0 I\u00afk]T,\n\nwhere vk denotes an n-dimensional column vector of which the only nonzero element is the k-th\nelement that is one, and I\u00afk is the [n \u2212 k + 1]-dimensional identity matrix.\nFurther, Akimoto et al. [1] derived the inverse matrix of the k-th diagonal block Fk of the FIM.\nBecause F\u03b8 is a block-diagonal matrix and C is upper triangular, it is easy to verify that the inverse\nmatrix of the FIM is given blockwise by\n\nFk\u22121 = [0 I\u00afk] CT (\u2212(1/2)vkvkT + [0 0; 0 I\u00afk]) C [0 I\u00afk]T,\n\n(4)\n\nwhere we use vkT C [0 0; 0 I\u00afk] C\u22121 = vkT and [0 I\u00afk] C [0 0; 0 I\u00afk] C\u22121 = [0 I\u00afk].\n\n3.3 Natural Policy Gradient\nNow, we derive the eligibility premultiplied by the inverse matrix of the FIM, \u02dc\u2207\u03b8 log \u00b5( \u02dcwt|\u03b8) = F\u03b8\u22121\u2207\u03b8 log \u00b5( \u02dcwt|\u03b8), in the same manner as [1]. The characteristic eligibility w.r.t. w is given by\n\n\u2207w log \u00b5( \u02dcwt|\u03b8) = \u02dc\u03a3\u22121( \u02dcwt \u2212 w).\n\nObviously, F0\u22121 = \u02dc\u03a3 and \u02dc\u2207w log \u00b5( \u02dcwt|\u03b8) = F0\u22121\u2207w log \u00b5( \u02dcwt|\u03b8) = \u02dcwt \u2212 w. 
The characteristic\n\n(cid:161)\n\n\u2202\n\n\u2202ci,j\n\nlog \u00b5( \u02dcwt|\u03b8) = vT\n\ni\n\n(cid:162)\n\nvj,\n\ntriu(YtC\u2212T) \u2212 diag(C\u22121)\n\n5\n\n\fthe k-th block of F\u22121\n\u02dc\u2207ck log \u00b5( \u02dcwt|\u03b8) = F\u22121\n\n(cid:161)\n\u2207ck log \u00b5( \u02dcwt|\u03b8) = [0 I\u00afk]\n(cid:183)\n\nk,kvk and\n\n(cid:184)\n\n0\n0\n0 I\u00afk\n\n= [0 I\u00afk] CT\n\nk \u2207ck log \u00b5( \u02dcwt|\u03b8)\nvkvT\n\n\u03b8 \u2207\u03b8 log \u00b5( \u02dcwt|\u03b8) is therefore\n(cid:183)\n(cid:183)\n(cid:180)\n\n(cid:181)\n(cid:181)\n\u22121\n2\n\u22121\n2\n\n= [0 I\u00afk] CT\n\nvkvT\n\nk +\n\nk +\n\n(cid:179)\n\n0\n0\n0 I\u00afk\n0\n0\n0 I\u00afk\n\nFigure 2: Performance of NPG(w) as compared to that of NPG(u), VPG(w), and VPG(u) in the\nlinear quadratic regulation task averaged over 100 trials. Left: The empirical optimum denotes the\nmean return under the optimum gain. Center and Right: Illustration of the main difference between\ncontrol- and parameter-based exploration. The sampling area of 1\u03c3 in the state-control space (center)\nand the state-parameter space (right) is plotted.\n\nwhere triu(YtC\u2212T) denotes the upper triangular matrix whose (i, j) element is identical to the\n(i, j) element of YtC\u2212T if i \u2264 j and zero otherwise, and Yt = C\u2212T( \u02dcwt \u2212 w)( \u02dcwt \u2212 w)TC\u22121 is\na symmetric matrix.\nLet ck = (ck,k, ..., ck,n)T (of dimension n + 1 \u2212 k); then, the characteristic eligibility w.r.t. 
ck is\nexpressed as\n\n(cid:162)\n\nC\u22121Yt \u2212 diag(C\u22121)\n\nvk.\n\nAccording to (4), diag(C\u22121)vk = c\u22121\n\nk Cvk = ck,k and\nvT\n\nCvk = ck,kvk,\n\n(cid:184)(cid:182)\n(cid:184)(cid:182)\n\n(cid:184)(cid:161)\n\n(cid:183)\n\nC\n\n0\n0\n0 I\u00afk\n\n(Yt \u2212 I)vk.\n\n(cid:162)\n\nC\u22121Yt \u2212 diag(C\u22121)\n\nvk\n\n(5)\n\nBecause \u02dc\u2207ck log \u00b5( \u02dcwt|\u03b8)T =\n\n\u02dc\u2207C log \u00b5( \u02dcwt|\u03b8)\n\n, we obtain\n\n(cid:181)\ntriu(Yt) \u2212 1\n2\n\nk,k:n\n\ndiag(Yt) \u2212 1\n2\n\n(cid:182)\n\nI\n\nC.\n\n\u02dc\u2207C log \u00b5( \u02dcwt|\u03b8) =\n\nTherefore, the time complexity of computing\n\n\u02dc\u2207\u03b8 log \u00b5( \u02dcwt|\u03b8) = [ \u02dc\u2207w log \u00b5( \u02dcwt|\u03b8)T, \u02dc\u2207c1 log \u00b5( \u02dcwt|\u03b8)T, ..., \u02dc\u2207cn log \u00b5( \u02dcwt|\u03b8)T]T\n\nis O(n3), which is of the same order as the computation of \u2207\u03b8 log \u00b5( \u02dcwt|\u03b8). This is a signi\ufb01cant im-\nprovement over the current natural policy gradient estimation using (2) and (3) with parameter-based\nexploration, whose complexity is O(n6). Note that more simple forms for exploration distribution\ncould be used. When we use the exploration strategy that is represented as an independent normal\ndistribution for each parameter wi in w, the natural policy gradient is estimated in O(n) time. This\nlimited form ignores the relationship between parameters, but it is practical for high-dimensional\ncontrollers.\n\n3.4 An Algorithm\nFor a parameterized class of controllers \u03c8(u|s, w), we can use the exploration strategy \u00b5( \u02dcw|\u03b8). An\nonline version based on the GPOMDP algorithm of this implementation is shown in Algorithm 1. In\npractice, the parameters of the controller \u02dcwt are generated by \u02dcwt = CT\u03bet + w, where \u03bet \u223c N (0, I)\nare normal random numbers. 
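A sketch of the closed-form natural-gradient eligibility (step 5 of Algorithm 1), with an added one-dimensional sanity check that is not in the paper: for \u00b5 = N(w, c\u00b2), the Fisher information of the scale parameter c is 2/c\u00b2 and the vanilla gradient is (\u03be\u00b2 \u2212 1)/c, so the natural gradient w.r.t. c should equal c(\u03be\u00b2 \u2212 1)/2.

```python
import numpy as np

def natural_gradient_eligibility(xi, C):
    """Closed-form natural-gradient eligibility used in Algorithm 1 (step 5):
    grad_w = C^T xi (= w_tilde - w),
    grad_C = (triu(xi xi^T) - 1/2 diag(xi xi^T) - 1/2 I) C."""
    n = len(xi)
    Y = np.outer(xi, xi)            # Y_t = xi_t xi_t^T
    grad_w = C.T @ xi               # equals w_tilde - w
    grad_C = (np.triu(Y) - 0.5 * np.diag(np.diag(Y)) - 0.5 * np.eye(n)) @ C
    return grad_w, grad_C

# 1-D sanity check against the known scalar result c (xi^2 - 1) / 2.
c = 1.7
xi = np.array([0.4])
gw, gC = natural_gradient_eligibility(xi, np.array([[c]]))
```

Note that the whole computation is a handful of O(n\u00b2)–O(n\u00b3) matrix operations, which is the point of the closed form: no FIM is ever built or inverted explicitly.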
Now, we can instead use Yt = C\u2212T( \u02dcwt \u2212 w)( \u02dcwt \u2212 w)TC\u22121 = \u03bet\u03betT.\nTo reduce the variance of the gradient estimation, we employ variance reduction techniques [6] to\nadapt the reinforcement baseline b.\n\n6\n\n\fFigure 3: Simulator of a two-link arm robot.\n\n4 Experiments\n\nIn this section, we evaluate the performance of our proposed NPGPE method. The ef\ufb01ciency of\nparameter-based exploration has been reported for episodic tasks [18]. We compare parameter- and\ncontrol-based exploration strategies with natural gradients and conventional \u201cvanilla\u201d gradients using\na simple continuing task as an example of a linear control problem. We also demonstrate NPGPE\u2019s\nusefulness for a physically realistic locomotion task using a two-link arm robot simulator.\n\n4.1 Implementation\n\nWe compare two different exploration strategies. The \ufb01rst is the parameter-based exploration strat-\negy \u00b5( \u02dcw|\u03b8) presented in Section 3.1. The second is the control-based exploration strategy \u03bd(u|\u02dcu, D)\nrepresented by a normal distribution for a control space, where \u02dcu is the mean vector of the control\ngenerated by controller \u03c8 and D represents the Cholesky decomposition of the covariance matrix\n\u03a3 such that D is an m \u00d7 m upper triangular matrix and \u03a3 = DTD. 
The parameters of the policy\n\u03c0U (u|s, \u03b8) are \u03b8 = \u27e8w, D\u27e9, an [n + m(m + 1)/2]-dimensional column vector consisting of the\nelements of w and the upper-right elements of D.\n\n4.2 Linear Quadratic Regulator\n\nThe following linear control problem can serve as a benchmark of delayed reinforcement tasks [10].\nThe dynamics of the environment is\n\nst+1 = st + ut + \u03b4,\n\nwhere s \u2208 \u211d, u \u2208 \u211d, and \u03b4 \u223c N (0, 0.5\u00b2). The immediate reward is given by rt = \u2212st\u00b2 \u2212 ut\u00b2. In\nthis experiment, the set of possible states is constrained to lie in the range [\u22124, 4], and st is truncated.\nWhen the agent chooses an action that does not lie in the range [\u22124, 4], the action executed in the\nenvironment is also truncated. The controller is represented by \u03c8(u|s, w) = s \u00b7 w, where w \u2208 \u211d.\nThe optimal parameter is given by w\u2217 = 2/(1 + 2\u03b2 + \u221a(4\u03b2\u00b2 + 1)) \u2212 1 from the Riccati equation.\n\nFor clari\ufb01cation, we write NPG for methods that employ the natural policy gradient and VPG for methods that em-\nploy the \u201cvanilla\u201d policy gradient. Therefore, NPG(w) and VPG(w) denote the use of the parameter-\nbased exploration strategy, and NPG(u) and VPG(u) denote the use of the control-based exploration\nstrategy. Our proposed NPGPE method is NPG(w).\n\nFigure 2 (left) shows the performance of all compared methods. We can see that the algorithm using\nparameter-based exploration had better performance than that using control-based exploration in the\ncontinuing task. The natural policy gradient also improved the convergence speed, and a combina-\ntion with parameter-based exploration outperformed all other methods. The reason for the accel-\neration in learning in this case may be the fact that the samples generated by the parameter-based\nexploration strategy allow effective search. 
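The benchmark dynamics and the quoted optimal gain can be sketched as follows; the truncation and noise levels follow the description above, while the function and variable names are mine:

```python
import numpy as np

def lqr_step(s, u, rng):
    """One step of the benchmark: s' = s + u + delta, r = -s^2 - u^2,
    with delta ~ N(0, 0.5^2) and states/actions truncated to [-4, 4]."""
    u = float(np.clip(u, -4.0, 4.0))
    r = -s**2 - u**2
    s_next = float(np.clip(s + u + rng.normal(0.0, 0.5), -4.0, 4.0))
    return s_next, r

# Optimal gain quoted in the text, for discount factor beta = 0.9.
beta = 0.9
w_star = 2.0 / (1.0 + 2.0 * beta + np.sqrt(4.0 * beta**2 + 1.0)) - 1.0
```

With beta = 0.9 this gives w_star of roughly -0.59, i.e., a stabilizing negative feedback gain, which is what the learners should converge toward.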
Figure 2 (center and right) shows plots of the sampling\narea in the state-control space and the state-parameter space, respectively. Because control-based\nexploration maintains the sampling area in the control space, the sampling is almost uniform in the\nparameter space at around s = 0, where the agent visits frequently. Therefore, the parameter-based\nexploration may realize more ef\ufb01cient sampling than the control-based exploration.\n\n4.3 Locomotion Task on a Two-link Arm Robot\n\nWe applied the algorithm to the robot shown in Figure 3 of Kimura et al. [11]. The objective of\nlearning is to \ufb01nd control rules to move forward. The joints are controlled by servo motors that react\nto angular-position commands. At each time step, the agent observes the angular position of two\nmotors, where each observation o1, o2 is normalized to [0, 1], and selects an action. The immediate\nreward is the distance of the body movement caused by the previous action. When the robot moves\nbackward, the agent receives a negative reward. The state vector is expressed as s = [o1, o2, 1]T.\nThe control for motor i is generated by ui = 1/(1 + exp(\u2212\u2211j sjwi,j)). The dimension of the\nparameters of the policies is dW = n(n + 3)/2 = 27 and dU = n + m(m + 1)/2 = 9 for the\nparameter- and control-based exploration strategies, respectively.\n\n7\n\n\fFigure 4: Performance of NPG(w) as compared to that of NPG(u) and NAC(u) in the locomotion\ntask averaged over 100 trials. Left: Mean performance of all compared methods. Center: Parameters\nof controller for NPG(w). Right: Parameters of controller for NPG(u). The parameters of the\ncontroller are normalized by gain i = \u221a(\u2211j wi,j\u00b2) and weight i,j = wi,j/gain i, where wi,j denotes\nthe j-th parameter of the i-th joint. Arrows in the center and right denote the changing points of the\nrelation between two important parameters.\n\nWe compared NPG(w), i.e., NPGPE, with NPG(u) and NAC(u). 
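The controller and the policy dimensionalities given above (dW = 27 and dU = 9 with m = 2 motors and n = 6 controller parameters) can be checked with a short sketch; the names and the zero-weight test input are illustrative:

```python
import numpy as np

def controller(s, W):
    """Sigmoid controller from the locomotion task: u_i = 1/(1 + exp(-sum_j s_j w_ij))."""
    return 1.0 / (1.0 + np.exp(-(W @ s)))

m, n_inputs = 2, 3              # 2 motors; state s = [o1, o2, 1]
n = m * n_inputs                # 6 controller parameters in total
d_W = n * (n + 3) // 2          # parameter-based exploration policy dimension
d_U = n + m * (m + 1) // 2      # control-based exploration policy dimension

s = np.array([0.2, 0.8, 1.0])
u = controller(s, np.zeros((m, n_inputs)))   # zero weights -> every u_i = 0.5
```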
NAC is the state-of-the-art policy\ngradient algorithm [15] that combines natural policy gradients, the actor-critic framework, and least-\nsquares temporal-difference Q-learning. NAC computes the inverse of a d \u00d7 d matrix to estimate the\nnatural steepest ascent direction. Because NAC(w) has O(dW\u00b3) time complexity for each iteration,\nwhich is prohibitively expensive, we apply NAC only to control-based exploration.\n\nFigure 4 (left) shows our results. Initially, NPG(w) is outperformed by NAC(u); however, it then\nreaches good solutions with fewer steps. Furthermore, at a later stage, NAC(u) matches NPG(u).\nFigure 4 (center and right) show the path of the relation between the parameters of the controller.\nNPG(w) is much slower than NPG(u) to adapt the relation at an early stage; however, it can seek the\nrelations of important parameters (indicated by arrows in the \ufb01gures) faster, whereas NPG(u) gets\nstuck because of inef\ufb01cient sampling.\n\n5 Conclusions\n\nThis paper proposed a novel natural policy gradient method combined with parameter-based ex-\nploration to cope with high-dimensional reinforcement learning domains. The proposed algorithm,\nNPGPE, is very simple and quickly calculates the estimation of the natural policy gradient. More-\nover, the experimental results demonstrate a signi\ufb01cant improvement in the control domain.\n\nFuture work will focus on developing actor-critic versions of NPGPE that might encourage perfor-\nmance improvements at an early stage, and on combining other gradient methods such as natural\nconjugate gradient methods [8].\n\nIn addition, a comparison with other direct parameter perturbation methods such as \ufb01nite difference\ngradient methods [14], CMA-ES [7], and NES [19] will be necessary to gain a better understanding\nof the properties and ef\ufb01cacy of the combination of parameter-based exploration strategies and the\nnatural policy gradient. 
Furthermore, the application of the algorithm to real-world problems is\nrequired to assess its utility.\n\nAcknowledgments\n\nThis work was supported by the Japan Society for the Promotion of Science (22 9031).\n\n8\n\n\fReferences\n\n[1] Youhei Akimoto, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi. Bidirectional relation\nbetween CMA evolution strategies and natural evolution strategies. In Parallel Problem Solving\nfrom Nature XI, pages 154\u2013163, 2010.\n\n[2] S. Amari. Natural gradient works ef\ufb01ciently in learning. Neural Computation, 10(2):251\u2013\n276, 1998.\n\n[3] S. Amari and H. Nagaoka. Methods of Information Geometry. American Mathematical Society,\n2007.\n\n[4] J. Andrew Bagnell and Jeff Schneider. Covariant policy search. In IJCAI\u201903: Proceedings of\nthe 18th International Joint Conference on Arti\ufb01cial Intelligence, pages 1019\u20131024, 2003.\n\n[5] Jonathan Baxter and Peter L. Bartlett. In\ufb01nite-horizon policy-gradient estimation. Journal of\nArti\ufb01cial Intelligence Research, 15:319\u2013350, 2001.\n\n[6] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for\ngradient estimates in reinforcement learning. The Journal of Machine Learning Research,\n5:1471\u20131530, 2004.\n\n[7] V. Heidrich-Meisner and C. Igel. Variable metric reinforcement learning methods applied to\nthe noisy mountain car problem. In EWRL 2008, pages 136\u2013150, 2008.\n\n[8] Antti Honkela, Matti Tornio, Tapani Raiko, and Juha Karhunen. Natural conjugate gradient in\nvariational inference. In ICONIP 2007, pages 305\u2013314, 2008.\n\n[9] S. A. Kakade. A natural policy gradient. In Advances in Neural Information Processing\nSystems, pages 1531\u20131538, 2001.\n\n[10] H. Kimura and S. Kobayashi. 
Reinforcement learning for continuous action using stochastic\ngradient ascent. In Intelligent Autonomous Systems (IAS-5), pages 288\u2013295, 1998.\n\n[11] Hajime Kimura, Kazuteru Miyazaki, and Shigenobu Kobayashi. Reinforcement learning in\nPOMDPs with function approximation. In ICML \u201997: Proceedings of the Fourteenth Interna-\ntional Conference on Machine Learning, pages 152\u2013160, 1997.\n\n[12] Hajime Kimura, Masayuki Yamamura, and Shigenobu Kobayashi. Reinforcement learning by\nstochastic hill climbing on discounted reward. In ICML, pages 295\u2013303, 1995.\n\n[13] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. In Advances in\nNeural Information Processing Systems 21, pages 849\u2013856, 2009.\n\n[14] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In 2006 IEEE/RSJ\nInternational Conference on Intelligent Robots and Systems, pages 2219\u20132225, 2006.\n\n[15] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7\u20139):1180\u20131190, 2008.\n\n[16] Silvia Richter, Douglas Aberdeen, and Jin Yu. Natural actor-critic for road traf\ufb01c optimisa-\ntion. In Advances in Neural Information Processing Systems 19, pages 1169\u20131176. MIT Press,\nCambridge, MA, 2007.\n\n[17] Thomas R\u00fcckstie\u00df, Martin Felder, and J\u00fcrgen Schmidhuber. State-dependent exploration for\npolicy gradient methods. In ECML PKDD \u201908: Proceedings of the European Conference on\nMachine Learning and Knowledge Discovery in Databases - Part II, pages 234\u2013249, 2008.\n\n[18] Frank Sehnke, C. Osendorfer, T. R\u00fcckstie\u00df, A. Graves, J. Peters, and J. Schmidhuber. Policy\ngradients with parameter-based exploration for control. In Proceedings of the International\nConference on Arti\ufb01cial Neural Networks (ICANN), pages 387\u2013396, 2008.\n\n[19] Yi Sun, Daan Wierstra, Tom Schaul, and Juergen Schmidhuber. Ef\ufb01cient natural evolution\nstrategies. 
In GECCO \u201909: Proceedings of the 11th Annual Conference on Genetic and Evolu-\ntionary Computation, pages 539\u2013546, 2009.\n\n[20] R. S. Sutton. Policy gradient methods for reinforcement learning with function approximation.\nIn Advances in Neural Information Processing Systems, volume 12, pages 1057\u20131063, 2000.\n\n[21] Daan Wierstra, Er Foerster, Jan Peters, and Juergen Schmidhuber. Solving deep memory\nPOMDPs with recurrent policy gradients. In International Conference on Arti\ufb01cial Neural\nNetworks, 2007.\n\n[22] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist rein-\nforcement learning. Machine Learning, 8:229\u2013256, 1992.\n\n9\n\n\f", "award": [], "sourceid": 606, "authors": [{"given_name": "Atsushi", "family_name": "Miyamae", "institution": null}, {"given_name": "Yuichi", "family_name": "Nagata", "institution": null}, {"given_name": "Isao", "family_name": "Ono", "institution": null}, {"given_name": "Shigenobu", "family_name": "Kobayashi", "institution": null}]}