{"title": "A Natural Policy Gradient", "book": "Advances in Neural Information Processing Systems", "page_first": 1531, "page_last": 1538, "abstract": "", "full_text": "A Natural Policy Gradient \n\nSham Kakade \n\nGatsby Computational Neuroscience Unit \n17 Queen Square, London, UK WC1N 3AR \n\nhttp://www.gatsby.ucl.ac.uk \n\nsham@gatsby.ucl.ac.uk \n\nAbstract \n\nWe provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. \n\n1 Introduction \n\nThere has been a growing interest in direct policy-gradient methods for approximate planning in large Markov decision problems (MDPs). Such methods seek to find a good policy π among some restricted class of policies, by following the gradient of the future reward. Unfortunately, the standard gradient descent rule is non-covariant. Crudely speaking, the rule Δθ_i ∝ ∂η/∂θ_i is dimensionally inconsistent, since the left hand side has units of θ_i and the right hand side has units of 1/θ_i (and all θ_i do not necessarily have the same dimensions). \n\nIn this paper, we present a covariant gradient by defining a metric based on the underlying structure of the policy. We make the connection to policy iteration by showing that the natural gradient is moving toward choosing a greedy optimal action. 
We then analyze the performance of the natural gradient in both simple and complicated MDPs. Consistent with Amari's findings [1], our work suggests that the plateau phenomenon might not be as severe using this method. \n\n2 A Natural Gradient \n\nA finite MDP is a tuple (S, s_0, A, R, P) where: S is a finite set of states, s_0 is a start state, A is a finite set of actions, R is a reward function R : S × A → [0, R_max], and P is the transition model. The agent's decision making procedure is characterized by a stochastic policy π(a; s), which is the probability of taking action a in state s (a semi-colon is used to distinguish the random variables from the parameters of the distribution). We make the assumption that every policy π is ergodic, ie has a well-defined stationary distribution ρ^π. Under this assumption, the average reward (or undiscounted reward) is η(π) ≡ Σ_{s,a} ρ^π(s)π(a; s)R(s, a), the state-action value is Q^π(s, a) ≡ E_π{Σ_{t=0}^∞ (R(s_t, a_t) − η(π)) | s_0 = s, a_0 = a}, and the value function is J^π(s) ≡ E_{π(a'; s)}{Q^π(s, a')}, where s_t and a_t are the state and action at time t. We consider the more difficult case where the goal of the agent is to find a policy that maximizes the average reward over some restricted class of smoothly parameterized policies, Π = {π_θ : θ ∈ ℝ^m}, where π_θ represents the policy π(a; s, θ). \n\nThe exact gradient of the average reward (see [8, 9]) is: \n\n∇η(θ) = Σ_{s,a} ρ^π(s) ∇π(a; s, θ) Q^π(s, a)   (1) \n\nwhere we abuse notation by using η(θ) instead of η(π_θ). The steepest descent direction of η(θ) is defined as the vector dθ that minimizes η(θ + dθ) under the constraint that the squared length |dθ|² is held to a small constant. This squared length is defined with respect to some positive-definite matrix G(θ), ie |dθ|² ≡ Σ_{ij} G_{ij}(θ) dθ_i dθ_j = dθ^T G(θ) dθ (using vector notation). The steepest descent direction is then given by G^{-1}∇η(θ) [1]. 
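As a concrete check of equation (1), the gradient can be computed exactly in a small tabular MDP and compared against a finite-difference estimate of η(θ). The sketch below is illustrative only: the 2-state MDP, its transition probabilities, and the softmax parameterization are assumptions for the example, not the paper's experimental setup.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative; not the paper's experiments).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s']
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],                  # R[s, a]
              [0.0, 2.0]])

def policy(theta):
    # Tabular softmax policy pi(a; s, theta) with one logit per (s, a).
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def avg_reward_Q(theta):
    pi = policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)            # state transition matrix
    # Stationary distribution: rho P_pi = rho, sum(rho) = 1.
    A = np.vstack([P_pi.T - np.eye(2), np.ones(2)])
    rho = np.linalg.lstsq(A, np.array([0., 0., 1.]), rcond=None)[0]
    eta = np.einsum('s,sa,sa->', rho, pi, R)         # average reward
    # Differential values: (I - P_pi) V = r_pi - eta (defined up to a constant).
    r_pi = np.einsum('sa,sa->s', pi, R)
    V = np.linalg.lstsq(np.eye(2) - P_pi, r_pi - eta, rcond=None)[0]
    Q = R - eta + np.einsum('sat,t->sa', P, V)       # Q(s,a) = R - eta + E[V(s')]
    return eta, Q, pi, rho

def exact_gradient(theta):
    # Equation (1): grad eta = sum_{s,a} rho(s) grad pi(a; s, theta) Q(s, a).
    eta, Q, pi, rho = avg_reward_Q(theta)
    grad = np.zeros_like(theta)
    for s in range(2):
        for a in range(2):
            for b in range(2):  # d pi(a|s) / d theta[s, b] for a softmax
                grad[s, b] += rho[s] * pi[s, a] * ((a == b) - pi[s, b]) * Q[s, a]
    return grad

theta = np.array([[0.3, -0.2], [0.1, 0.4]])
grad = exact_gradient(theta)

# Central finite differences of eta(theta) should match equation (1).
fd = np.zeros_like(theta)
eps = 1e-6
for i in np.ndindex(theta.shape):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps
    tm[i] -= eps
    fd[i] = (avg_reward_Q(tp)[0] - avg_reward_Q(tm)[0]) / (2 * eps)
print(np.allclose(grad, fd, atol=1e-5))
```

Note that the constant-offset ambiguity in the differential values V does not affect the gradient, since Σ_a ∇π(a; s, θ) = 0 for each state.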
Standard gradient descent follows the direction ∇η(θ), which is the steepest descent under the assumption that G(θ) is the identity matrix, I. However, this ad hoc choice of a metric is not necessarily appropriate. As suggested by Amari [1], it is better to define a metric based not on the choice of coordinates but rather on the manifold (ie the surface) that these coordinates parameterize. This metric defines the natural gradient. \n\nThough we slightly abuse notation by writing η(θ), the average reward is technically a function on the set of distributions {π_θ : θ ∈ ℝ^m}. To each state s, there corresponds a probability manifold, where the distribution π(a; s, θ) is a point on this manifold with coordinates θ. The Fisher information matrix of this distribution π(a; s, θ) is \n\n[F_s(θ)]_{ij} = E_{π(a; s, θ)}[ ∂log π(a; s, θ)/∂θ_i  ∂log π(a; s, θ)/∂θ_j ]   (2) \n\nand it is clearly positive definite. As shown by Amari (see [1]), the Fisher information matrix, up to a scale, is an invariant metric on the space of the parameters of probability distributions. It is invariant in the sense that it defines the same distance between two points regardless of the choice of coordinates (ie the parameterization) used, unlike G = I. \n\nSince the average reward is defined on a set of these distributions, the straightforward choice we make for the metric is: \n\nF(θ) = E_{ρ^π(s)}[F_s(θ)]   (3) \n\nwhere the expectation is with respect to the stationary distribution of π_θ. Notice that although each F_s is independent of the parameters of the MDP's transition model, the weighting by the stationary distribution introduces dependence on these parameters. Intuitively, F_s(θ) measures distance on the probability manifold corresponding to state s, and F(θ) is the average such distance. 
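For a single state with a tabular softmax policy (an illustrative assumption for the example), the per-state Fisher matrix of equation (2) can be formed directly from outer products of ∇log π and checked against the closed form diag(π) − ππ^T:

```python
import numpy as np

# Softmax distribution over actions with logits theta (illustrative example).
theta = np.array([0.5, -1.0, 0.2])
z = np.exp(theta - theta.max())
pi = z / z.sum()

# psi_a = grad_theta log pi(a) = e_a - pi for a softmax parameterization.
n = len(theta)
F = np.zeros((n, n))
for a in range(n):
    psi = np.eye(n)[a] - pi
    F += pi[a] * np.outer(psi, psi)   # equation (2): E_pi[psi psi^T]

# Closed form for the softmax Fisher matrix.
F_closed = np.diag(pi) - np.outer(pi, pi)
print(np.allclose(F, F_closed))

# F is positive semidefinite (here actually singular, since softmax logits
# are invariant to a constant shift -- one reason to regularize in practice).
print(np.all(np.linalg.eigvalsh(F) >= -1e-12))
```

The singular direction (the all-ones vector) reflects the redundant shift parameter of the tabular softmax; for a minimal parameterization F would be strictly positive definite.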
The steepest descent direction this gives is: \n\n∇̃η(θ) ≡ F(θ)^{-1} ∇η(θ)   (4) \n\n3 The Natural Gradient and Policy Iteration \n\nWe now compare policy improvement under the natural gradient to policy iteration. For an appropriate comparison, consider the case in which Q^π(s, a) is approximated by some compatible function approximator f^π(s, a; w) parameterized by w [9, 6]. \n\n3.1 Compatible Function Approximation \n\nFor vectors θ, w ∈ ℝ^m, we define: \n\nψ^π(s, a) = ∇log π(a; s, θ),   f^π(s, a; w) = w^T ψ^π(s, a)   (5) \n\nwhere [∇log π(a; s, θ)]_i = ∂log π(a; s, θ)/∂θ_i. Let w̃ minimize the squared error ε(w, π_θ) ≡ Σ_{s,a} ρ^π(s)π(a; s, θ)(f^π(s, a; w) − Q^π(s, a))². This function approximator is compatible with the policy in the sense that if we use the approximations f^π(s, a; w̃) in lieu of their true values to compute the gradient (equation 1), then the result would still be exact [9, 6] (and is thus a sensible choice to use in actor-critic schemes). \n\nTheorem 1. Let w̃ minimize the squared error ε(w, π_θ). Then \n\nw̃ = ∇̃η(θ). \n\nProof. Since w̃ minimizes the squared error, it satisfies the condition ∂ε/∂w_i = 0, which implies: \n\nΣ_{s,a} ρ^π(s)π(a; s, θ)ψ^π(s, a)(ψ^π(s, a)^T w̃ − Q^π(s, a)) = 0, \n\nor equivalently: \n\nΣ_{s,a} ρ^π(s)π(a; s, θ)ψ^π(s, a)ψ^π(s, a)^T w̃ = Σ_{s,a} ρ^π(s)π(a; s, θ)ψ^π(s, a)Q^π(s, a). \n\nBy definition of ψ^π, ∇π(a; s, θ) = π(a; s, θ)ψ^π(s, a), and so the right hand side is equal to ∇η. Also by definition of ψ^π, F(θ) = Σ_{s,a} ρ^π(s)π(a; s, θ)ψ^π(s, a)ψ^π(s, a)^T. Substitution leads to: \n\nF(θ)w̃ = ∇η(θ). \n\nSolving for w̃ gives w̃ = F(θ)^{-1}∇η(θ), and the result follows from the definition of the natural gradient. □ \n\nThus, sensible actor-critic frameworks (those using f^π(s, a; w̃)) are forced to use the natural gradient as the weights of a linear function approximator. If the function approximation is accurate, then good actions (ie those with large state-action values) have feature vectors that have a large inner product with the natural gradient. 
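Theorem 1 can be verified numerically: fitting the compatible approximator f^π = w^T ψ^π by weighted least squares recovers the natural gradient. The sketch below uses a pseudoinverse in place of F^{-1} because the tabular softmax Fisher matrix is singular; the toy 2-state MDP and the parameterization are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Toy 2-state, 2-action MDP with a tabular softmax policy (illustrative).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
theta = np.array([[0.3, -0.2], [0.1, 0.4]])

z = np.exp(theta - theta.max(axis=1, keepdims=True))
pi = z / z.sum(axis=1, keepdims=True)
P_pi = np.einsum('sa,sat->st', pi, P)
A = np.vstack([P_pi.T - np.eye(2), np.ones(2)])
rho = np.linalg.lstsq(A, np.array([0., 0., 1.]), rcond=None)[0]
eta = np.einsum('s,sa,sa->', rho, pi, R)
r_pi = np.einsum('sa,sa->s', pi, R)
V = np.linalg.lstsq(np.eye(2) - P_pi, r_pi - eta, rcond=None)[0]
Q = R - eta + np.einsum('sat,t->sa', P, V)

def psi(s, a):
    # Flattened grad of log pi(a; s, theta): (e_a - pi[s]) in state s's block.
    g = np.zeros((2, 2))
    g[s] = np.eye(2)[a] - pi[s]
    return g.ravel()

# F(theta) and grad eta(theta) from their definitions.
F = sum(rho[s] * pi[s, a] * np.outer(psi(s, a), psi(s, a))
        for s in range(2) for a in range(2))
grad = sum(rho[s] * pi[s, a] * psi(s, a) * Q[s, a]
           for s in range(2) for a in range(2))

# Weighted least squares fit of f(s, a; w) = w^T psi(s, a) to Q(s, a),
# weighted by rho(s) pi(a|s) as in the squared error of section 3.1.
X = np.array([np.sqrt(rho[s] * pi[s, a]) * psi(s, a)
              for s in range(2) for a in range(2)])
y = np.array([np.sqrt(rho[s] * pi[s, a]) * Q[s, a]
              for s in range(2) for a in range(2)])
w = np.linalg.lstsq(X, y, rcond=None)[0]

# Theorem 1: the fitted weights equal the natural gradient.
print(np.allclose(w, np.linalg.pinv(F) @ grad, atol=1e-8))
```

Both `lstsq` and `pinv` return minimum-norm solutions of the same consistent normal equations F w = ∇η, which is why they agree despite the singularity.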
\n\n3.2 Greedy Policy Improvement \n\nA greedy policy improvement step using our function approximator would choose action a in state s if a ∈ argmax_{a'} f^π(s, a'; w̃). In this section, we show that the natural gradient tends to move toward this best action, rather than just a good action. \n\nLet us first consider policies in the exponential family (π(a; s, θ) ∝ exp(θ^T φ_{sa}), where φ_{sa} is some feature vector in ℝ^m). The motivation for the exponential family is that it has affine geometry (ie the flat geometry of a plane), so a translation of a point by a tangent vector will keep the point on the manifold. In general, crudely speaking, the probability manifold of π(a; s, θ) could be curved, so a translation of a point by a tangent vector would not necessarily keep the point on the manifold (such as on a sphere). We consider the general (non-exponential) case later. \n\nWe now show, for the exponential family, that a sufficiently large step in the natural gradient direction will lead to a policy that is equivalent to a policy found after a greedy policy improvement step. \n\nTheorem 2. For π(a; s, θ) ∝ exp(θ^T φ_{sa}), assume that ∇̃η(θ) is non-zero and that w̃ minimizes the approximation error. Let π_∞(a; s) = lim_{α→∞} π(a; s, θ + α∇̃η(θ)). Then π_∞(a; s) ≠ 0 if and only if a ∈ argmax_{a'} f^π(s, a'; w̃). \n\nProof. By the previous result, f^π(s, a; w̃) = ∇̃η(θ)^T ψ^π(s, a). By definition of π(a; s, θ), ψ^π(s, a) = φ_{sa} − E_{π(a'; s, θ)}(φ_{sa'}). Since E_{π(a'; s, θ)}(φ_{sa'}) is not a function of a, it follows that \n\nargmax_{a'} f^π(s, a'; w̃) = argmax_{a'} ∇̃η(θ)^T φ_{sa'}. \n\nAfter a gradient step, π(a; s, θ + α∇̃η(θ)) ∝ exp(θ^T φ_{sa} + α∇̃η(θ)^T φ_{sa}). Since ∇̃η(θ) ≠ 0, it is clear that as α → ∞ the term α∇̃η(θ)^T φ_{sa} dominates, and so π_∞(a; s) = 0 if and only if a ∉ argmax_{a'} ∇̃η(θ)^T φ_{sa'}. □ \n\nIt is in this sense that the natural gradient tends to move toward choosing the best action. 
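The mechanism behind Theorem 2 is easy to see numerically for an exponential-family policy: adding a large multiple of a fixed direction w to θ drives all probability mass onto the actions maximizing w^T φ_{sa}. In the sketch below the logits and the direction are arbitrary illustrative choices with tabular features (φ_{sa} = e_{sa}), not a computed natural gradient:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Exponential-family policy pi(a; s, theta) ~ exp(theta^T phi_sa) with
# tabular features, so the score of action a in this state is just w[a].
theta = np.array([0.4, -0.3, 0.1, 0.0])   # arbitrary logits (illustrative)
w = np.array([0.2, 0.7, -0.1, 0.3])       # arbitrary update direction

for alpha in [0.0, 1.0, 10.0, 1000.0]:
    pi = softmax(theta + alpha * w)
    print(alpha, pi.round(3))

# As alpha grows, the policy concentrates on argmax_a w[a], mirroring the
# greedy improvement step of Theorem 2; the original logits theta become
# irrelevant in the limit.
pi_inf = softmax(theta + 1000.0 * w)
print(pi_inf.argmax() == w.argmax(), pi_inf.max() > 0.999)
```

With the standard (non-covariant) gradient, the same construction would only concentrate mass on actions scoring above the mean, as noted in the text that follows.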
It is straightforward to show that if the standard non-covariant gradient rule is used instead, then π_∞(a; s) will select only a better action (not necessarily the best), ie it will choose an action a such that f^π(s, a; w̃) > E_{π(a'; s)}{f^π(s, a'; w̃)}. Our use of the exponential family was only to demonstrate this point in the extreme case of an infinite learning rate. \n\nLet us return to the case of a general parameterized policy. The following theorem shows that the natural gradient is locally moving toward the best action, as determined by the local linear approximator for Q^π(s, a). \n\nTheorem 3. Assume that w̃ minimizes the approximation error and let the update to the parameter be θ' = θ + α∇̃η(θ). Then \n\nπ(a; s, θ') = π(a; s, θ)(1 + αf^π(s, a; w̃)) + O(α²). \n\nProof. The change in θ, Δθ, is α∇̃η(θ), so by Theorem 1, Δθ = αw̃. To first order, \n\nπ(a; s, θ') = π(a; s, θ) + (∂π(a; s, θ)/∂θ)^T Δθ + O(Δθ²) \n= π(a; s, θ)(1 + ψ^π(s, a)^T Δθ) + O(Δθ²) \n= π(a; s, θ)(1 + αψ^π(s, a)^T w̃) + O(α²) \n= π(a; s, θ)(1 + αf^π(s, a; w̃)) + O(α²), \n\nwhere we have used the definitions of ψ^π and f^π. □ \n\nIt is interesting to note that choosing the greedy action will not in general improve the policy, and many detailed studies have gone into understanding this failure [3]. However, with the overhead of a line search, we can guarantee improvement and move toward this greedy one step improvement. Initial improvement is guaranteed since F is positive definite. \n\n4 Metrics and Curvatures \n\nObviously, our choice of F is not unique and the question arises as to whether or not there is a better metric to use than F. In the different setting of parameter estimation, the Fisher information converges to the Hessian, so it is asymptotically efficient [1], ie attains the Cramer-Rao bound. 
Our situation is more similar to the blind source separation case, where a metric is chosen based on the underlying parameter space [1] (of non-singular matrices) and is not necessarily asymptotically efficient (ie does not attain second order convergence). As argued by Mackay [7], one strategy is to pull a metric out of the data-independent terms of the Hessian (if possible), and in fact, Mackay [7] arrives at the same result as Amari for the blind source separation case. \n\nAlthough the previous sections argued that our choice is appropriate, we would like to understand how F relates to the Hessian ∇²η(θ), which, as shown in [5], has the form: \n\n∇²η(θ) = Σ_{s,a} ρ^π(s)(∇²π(a; s, θ)Q^π(s, a) + ∇π(a; s, θ)∇Q^π(s, a)^T + ∇Q^π(s, a)∇π(a; s, θ)^T)   (6) \n\nUnfortunately, all terms in this Hessian are data-dependent (ie are coupled to state-action values). It is clear that F does not capture any information from the last two terms, due to their ∇Q^π dependence. The first term might have some relation to F due to the factor of ∇²π. However, the Q values weight this curvature of our policy and our metric is neglecting such weighting. \n\nSimilar to the blind source separation case, our metric clearly does not necessarily converge to the Hessian and so it is not necessarily asymptotically efficient (ie does not attain a second order convergence rate). However, in general, the Hessian will not be positive definite and so the curvature it provides could be of little use until θ is close to a local maximum. Conjugate methods would be expected to be more efficient near a local maximum. \n\n5 Experiments \n\nWe first look at the performance of the natural gradient in a few simple MDPs before examining its performance in the more challenging MDP of Tetris. It is straightforward to estimate F in an online manner, since the derivatives ∇log π must be computed anyway to estimate ∇η(θ). 
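Such an online estimate can be sketched as follows: accumulate outer products of ∇log π along a sampled trajectory, normalize by the trajectory length, and add the small regularizer used in the paper's simulations. The 2-state MDP and tabular softmax policy below are illustrative assumptions for the example, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP and tabular softmax policy (illustrative).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
theta = np.array([[0.3, -0.2], [0.1, 0.4]])
z = np.exp(theta - theta.max(axis=1, keepdims=True))
pi = z / z.sum(axis=1, keepdims=True)

def psi(s, a):
    # Flattened grad log pi(a; s, theta) for the softmax parameterization.
    g = np.zeros((2, 2))
    g[s] = np.eye(2)[a] - pi[s]
    return g.ravel()

# Online estimate: F_hat <- F_hat + grad log pi grad log pi^T along a
# trajectory, then divide by its length T.
T = 100_000
F_hat = np.zeros((4, 4))
s = 0
for _ in range(T):
    a = rng.choice(2, p=pi[s])
    g = psi(s, a)
    F_hat += np.outer(g, g)
    s = rng.choice(2, p=P[s, a])
F_hat /= T
F_hat += 1e-3 * np.eye(4)   # the paper's regularizer, keeping F non-singular

# Exact F(theta), weighting states by the stationary distribution.
P_pi = np.einsum('sa,sat->st', pi, P)
A = np.vstack([P_pi.T - np.eye(2), np.ones(2)])
rho = np.linalg.lstsq(A, np.array([0., 0., 1.]), rcond=None)[0]
F = sum(rho[s_] * pi[s_, a_] * np.outer(psi(s_, a_), psi(s_, a_))
        for s_ in range(2) for a_ in range(2))
print(np.abs(F_hat - F).max() < 0.02)   # close, up to sampling noise + 1e-3 I
```

Because the trajectory visits states with frequency ρ^π(s), the empirical average converges to the ρ^π-weighted metric of equation (3) rather than a uniform average over states.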
If the update rule \n\nF̂ ← F̂ + ∇log π(a_t; s_t, θ)∇log π(a_t; s_t, θ)^T \n\nis used in a T-length trajectory, then F̂/T is a consistent estimate of F. In our first two examples, we do not concern ourselves with sampling issues and instead numerically integrate the exact derivative (θ_t = θ_0 + ∫_0^t ∇η(θ_τ)dτ). In all of our simulations, the policies tend to become deterministic (∇log π → 0) and to prevent F from becoming singular, we add about 10^{-3} I at every step in all our simulations. \n\nWe simulated the natural policy gradient in a simple 1-dimensional linear quadratic regulator with dynamics x(t + 1) = .7x(t) + u(t) + ε(t) and noise distribution ε ∼ G(0, 1). The goal is to apply a control signal u to keep the system at x = 0 (incurring a cost of x(t)² at each step). The parameterized policy used was π(u; x, θ) ∝ exp(θ_1 x² + θ_2 x). Figure 1A shows the performance improvement when the units of the parameters are scaled by a factor of 10 (see figure text). Notice that the time to obtain a score of about 22 is about three orders of magnitude \n\nFigure 1: A) The cost vs. log_10(time) for an LQG (with 20 time step trajectories). The policy used was π(u; x, θ) ∝ exp(θ_1 s_1 x² + θ_2 s_2 x) where the rescaling constants, s_1 and s_2, are shown in the legend. 
Under equivalent starting distributions (θ_1 s_1 = θ_2 s_2 = −.8), the right-most three curves are generated using the standard gradient method and the rest use the natural gradient. B) See text. C top) The average reward vs. time (on a 10^7 scale) of a policy under standard gradient descent using the sigmoidal policy parameterization (π(1; s, θ_i) = exp(θ_i)/(1 + exp(θ_i))), with the initial conditions π(i, 1) = .8 and π(j, 1) = .1. C bottom) The average reward vs. time (unscaled) under standard gradient descent (solid line) and natural gradient descent (dashed line) for an early window of the above plot. D) Phase space plot for the standard gradient case (the solid line) and the natural gradient case (dashed line). \n\nfaster. Also notice that the curves under different rescaling are not identical. This is because F is not an invariant metric due to the weighting by ρ^π(s). \n\nThe effects of the weighting by ρ(s) are particularly clear in a simple 2-state MDP (Figure 1B), which has self- and cross-transition actions and rewards as shown. Increasing the chance of a self-loop at i decreases the stationary probability of j. Using a sigmoidal policy parameterization (see figure text) and initial conditions corresponding to ρ(i) = .8 and ρ(j) = .2, both self-loop action probabilities will initially be increased under a gradient rule (since one step policy improvement chooses the self-loop for each state). Since the standard gradient weights the learning to each parameter by ρ(s) (see equation 1), the self-loop action at state i is increased faster than the self-loop probability at j, which has the effect of decreasing the effective learning-rate to state j even further. This leads to an extremely flat plateau with average reward 1 (shown in Figure 1C top), where the learning for state j is thwarted by its low stationary probability. 
This problem is so severe that before the optimal policy is reached, ρ(j) drops as low as 10^{-7} from its initial value of .2, which is disastrous for sampling methods. Figure 1C bottom shows the performance of the natural gradient (in a very early time window of Figure 1C top). Not only is the time to the optimal policy decreased by a factor of 10^7, the stationary distribution of state j never drops below .05. Note though that the standard gradient does increase the average reward faster at the start, but only to be seduced by sticking at state i. The phase space plot in Figure 1D shows the uneven learning to the different parameters, which is at the heart of the problem. In general, if a table lookup Boltzmann policy is used (ie π(a; s, θ) ∝ exp(θ_{sa})), it is straightforward to show that the natural gradient weights the components of ∇η uniformly (instead of using ρ(s)), thus evening out the learning to all parameters. \n\nThe game of Tetris provides a challenging high dimensional problem. As shown in [3], greedy policy iteration methods using a linear function approximator exhibit drastic performance degradation after providing impressive improvement (see [3] for a description of the game, methods, and results). The upper curve in Figure 2A replicates these results. Tetris provides an interesting case to test gradient methods, \n\nFigure 2: A) Points vs. log(Iterations). The top curve duplicates the results in [3] using the same features (which were simple functions of the heights of each column and the number of holes in the game). We have no explanation for this performance degradation (nor does [3]). 
The lower curve shows the poor performance of the standard gradient rule. B) The curve on the right shows the natural policy gradient method (and uses the biased gradient method of [2], though this method alone gave poor performance). We found we could obtain faster improvement and higher asymptotes if the robustifying factor of 10^{-3} I that we added to F was more carefully controlled (we did not carefully control the parameters). C) Due to the intensive computational power required of these simulations, we ran the gradient in a smaller Tetris game (height of 10 rather than 20) to demonstrate that the standard gradient updates (right curve) would eventually reach the same performance as the natural gradient (left curve). \n\nwhich are guaranteed not to degrade the policy. We consider a policy compatible with the linear function approximator used in [3] (ie π(a; s, θ) ∝ exp(θ^T φ_{sa}) where φ_{sa} are the same feature vectors). The features used in [3] are the heights of each column, the differences in height between adjacent columns, the maximum height, and the number of 'holes'. The lower curve in Figure 2A shows the particularly poor performance of the standard gradient method. In an attempt to speed learning, we tried a variety of more sophisticated methods to no avail, such as conjugate methods, weight decay, annealing, the variance reduction method of [2], the Hessian in equation 6, etc. Figure 2B shows a drastic improvement using the natural gradient (note that the timescale is linear). This performance is consistent with our theoretical results in section 3, which showed that the natural gradient is moving toward the solution of a greedy policy improvement step. The performance is somewhat slower than the greedy policy iteration (left curve in Figure 2B), which is to be expected using smaller steps. However, the policy does not degrade with a gradient method. 
Figure 2C shows that the performance of the standard gradient rule (right curve) eventually reaches the same performance as the natural gradient, in a scaled down version of the game (see figure text). \n\n6 Discussion \n\nAlthough gradient methods cannot make large policy changes compared to greedy policy iteration, section 3 implies that these two methods might not be that disparate, since a natural gradient method is moving toward the solution of a policy improvement step. With the overhead of a line search, the methods are even more similar. The benefit is that performance improvement is now guaranteed, unlike in a greedy policy iteration step. \n\nIt is interesting, and unfortunate, to note that F does not asymptotically converge to the Hessian, so conjugate gradient methods might be more sensible asymptotically. However, far from the convergence point, the Hessian is not necessarily informative, and the natural gradient could be more efficient (as demonstrated in Tetris). The intuition as to why the natural gradient could be efficient far from the maximum is that it is pushing the policy toward choosing greedy optimal actions. Often, the region (in parameter space) far from the maximum is where large performance changes could occur. Sufficiently close to the maximum, little performance change occurs (due to the small gradient), so although conjugate methods might converge faster near the maximum, the corresponding performance change might be negligible. More experimental work is necessary to further understand the effectiveness of the natural gradient. \n\nAcknowledgments \n\nWe thank Emo Todorov and Peter Dayan for many helpful discussions. Funding is from the NSF and the Gatsby Charitable Foundation. \n\nReferences \n\n[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998. \n\n[2] J. Baxter and P. Bartlett. 
Direct gradient-based reinforcement learning. Technical report, Australian National University, Research School of Information Sciences and Engineering, July 1999. \n\n[3] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996. \n\n[4] P. Dayan and G. Hinton. Using EM for reinforcement learning. Neural Computation, 9:271-278, 1997. \n\n[5] S. Kakade. Optimizing average reward using discounted reward. COLT, in press, 2001. \n\n[6] V. Konda and J. Tsitsiklis. Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 2000. \n\n[7] D. MacKay. Maximum likelihood and covariant algorithms for independent component analysis. Technical report, University of Cambridge, 1996. \n\n[8] P. Marbach and J. Tsitsiklis. Simulation-based optimization of Markov reward processes. Technical report, Massachusetts Institute of Technology, 1998. \n\n[9] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, 13, 2000. \n\n[10] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8(1):129-151, 1996. \n", "award": [], "sourceid": 2073, "authors": [{"given_name": "Sham", "family_name": "Kakade", "institution": null}]}