{"title": "Policy Gradient Methods for Reinforcement Learning with Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1057, "page_last": 1063, "abstract": null, "full_text": "Policy Gradient Methods for \n\nReinforcement Learning with Function \n\nApproximation \n\nRichard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour \n\nAT&T Labs - Research, 180 Park Avenue, Florham Park, NJ 07932 \n\nAbstract \n\nFunction approximation is essential to reinforcement learning, but \nthe standard approach of approximating a value function and deter(cid:173)\nmining a policy from it has so far proven theoretically intractable. \nIn this paper we explore an alternative approach in which the policy \nis explicitly represented by its own function approximator, indepen(cid:173)\ndent of the value function, and is updated according to the gradient \nof expected reward with respect to the policy parameters. Williams's \nREINFORCE method and actor-critic methods are examples of this \napproach. Our main new result is to show that the gradient can \nbe written in a form suitable for estimation from experience aided \nby an approximate action-value or advantage function. Using this \nresult, we prove for the first time that a version of policy iteration \nwith arbitrary differentiable function approximation is convergent to \na locally optimal policy. \n\nLarge applications of reinforcement learning (RL) require the use of generalizing func(cid:173)\ntion approximators such neural networks, decision-trees, or instance-based methods. \nThe dominant approach for the last decade has been the value-function approach, in \nwhich all function approximation effort goes into estimating a value function, with \nthe action-selection policy represented implicitly as the \"greedy\" policy with respect \nto the estimated values (e.g., as the policy that selects in each state the action with \nhighest estimated value). 
The value-function approach has worked well in many applications, but has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e.g., see Singh, Jaakkola, and Jordan, 1994). Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of "best" is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods.

In this paper we explore an alternative approach to function approximation in RL. Rather than approximating a value function and using that to compute a deterministic policy, we approximate a stochastic policy directly using an independent function approximator with its own parameters. For example, the policy might be represented by a neural network whose input is a representation of the state, whose output is action selection probabilities, and whose weights are the policy parameters. Let $\theta$ denote the vector of policy parameters and $\rho$ the performance of the corresponding policy (e.g., the average reward per step).
Then, in the policy gradient approach, the policy parameters are updated approximately proportional to the gradient:

$$\Delta\theta \approx \alpha \frac{\partial \rho}{\partial \theta}, \qquad (1)$$

where $\alpha$ is a positive-definite step size. If the above can be achieved, then $\theta$ can usually be assured to converge to a locally optimal policy in the performance measure $\rho$. Unlike the value-function approach, here small changes in $\theta$ can cause only small changes in the policy and in the state-visitation distribution.

In this paper we prove that an unbiased estimate of the gradient (1) can be obtained from experience using an approximate value function satisfying certain properties. Williams's (1988, 1992) REINFORCE algorithm also finds an unbiased estimate of the gradient, but without the assistance of a learned value function. REINFORCE learns much more slowly than RL methods using value functions and has received relatively little attention. Learning a value function and using it to reduce the variance of the gradient estimate appears to be essential for rapid learning. Jaakkola, Singh, and Jordan (1995) proved a result very similar to ours for the special case of function approximation corresponding to tabular POMDPs. Our result strengthens theirs and generalizes it to arbitrary differentiable function approximators. Konda and Tsitsiklis (in prep.) independently developed a very similar result to ours. See also Baxter and Bartlett (in prep.) and Marbach and Tsitsiklis (1998).

Our result also suggests a way of proving the convergence of a wide variety of algorithms based on "actor-critic" or policy-iteration architectures (e.g., Barto, Sutton, and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998). In this paper we take the first step in this direction by proving for the first time that a version of policy iteration with general differentiable function approximation is convergent to a locally optimal policy.
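As a deliberately simplified sketch of update (1), the snippet below performs gradient ascent on a hypothetical closed-form performance measure standing in for an MDP's performance; the target parameters and step size are invented for illustration. In practice $\partial\rho/\partial\theta$ must be estimated from experience, which is the subject of this paper.

```python
import numpy as np

# Toy illustration of update (1): step the policy parameters theta
# along the gradient of the performance measure rho.
# rho here is a hypothetical concave stand-in, not a real MDP's performance.

TARGET = np.array([1.0, -2.0])   # hypothetical local optimum

def rho(theta):
    # performance peaks at theta = TARGET
    return -np.sum((theta - TARGET) ** 2)

def grad_rho(theta):
    return -2.0 * (theta - TARGET)

theta = np.zeros(2)
alpha = 0.1                      # positive step size
for _ in range(200):
    theta = theta + alpha * grad_rho(theta)   # Delta theta = alpha * d rho / d theta

print(np.round(theta, 3))        # converges toward the local optimum (1, -2)
```

Because $\rho$ here is concave, the iteration settles at the optimum; with a sampled gradient estimate the same update converges only to a locally optimal policy, as the paper's results establish.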
Baird and Moore (1999) obtained a weaker but superficially similar result for their VAPS family of methods. Like policy-gradient methods, VAPS includes separately parameterized policy and value functions updated by gradient methods. However, VAPS methods do not climb the gradient of performance (expected long-term reward), but of a measure combining performance and value-function accuracy. As a result, VAPS does not converge to a locally optimal policy, except in the case that no weight is put upon value-function accuracy, in which case VAPS degenerates to REINFORCE. Similarly, Gordon's (1995) fitted value iteration is also convergent and value-based, but does not find a locally optimal policy.

1 Policy Gradient Theorem

We consider the standard reinforcement learning framework (see, e.g., Sutton and Barto, 1998), in which a learning agent interacts with a Markov decision process (MDP). The state, action, and reward at each time $t \in \{0, 1, 2, \ldots\}$ are denoted $s_t \in S$, $a_t \in A$, and $r_t \in \Re$ respectively. The environment's dynamics are characterized by state transition probabilities, $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$, and expected rewards $R^a_s = E\{r_{t+1} \mid s_t = s, a_t = a\}$, $\forall s, s' \in S, a \in A$. The agent's decision making procedure at each time is characterized by a policy, $\pi(s, a, \theta) = \Pr\{a_t = a \mid s_t = s, \theta\}$, $\forall s \in S, a \in A$, where $\theta \in \Re^l$, for $l \ll |S|$, is a parameter vector. We assume that $\pi$ is differentiable with respect to its parameter, i.e., that $\frac{\partial \pi(s,a)}{\partial \theta}$ exists. We also usually write just $\pi(s, a)$ for $\pi(s, a, \theta)$.

With function approximation, two ways of formulating the agent's objective are useful. One is the average reward formulation, in which policies are ranked according to their long-term expected reward per step, $\rho(\pi)$:

$$\rho(\pi) = \lim_{n \to \infty} \frac{1}{n} E\{r_1 + r_2 + \cdots + r_n \mid \pi\} = \sum_s d^\pi(s) \sum_a \pi(s, a) R^a_s,$$

where $d^\pi(s) = \lim_{t \to \infty} \Pr\{s_t = s \mid s_0, \pi\}$ is the stationary distribution of states under $\pi$, which we assume exists and is independent of $s_0$ for all policies. In the average reward formulation, the value of a state-action pair given a policy is defined as

$$Q^\pi(s, a) = \sum_{t=1}^{\infty} E\{r_t - \rho(\pi) \mid s_0 = s, a_0 = a, \pi\}, \qquad \forall s \in S, a \in A.$$

The second formulation we cover is that in which there is a designated start state $s_0$, and we care only about the long-term reward obtained from it. We will give our results only once, but they will apply to this formulation as well under the definitions

$$\rho(\pi) = E\Big\{\sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_0, \pi\Big\} \qquad \text{and} \qquad Q^\pi(s, a) = E\Big\{\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \;\Big|\; s_t = s, a_t = a, \pi\Big\},$$

where $\gamma \in [0, 1]$ is a discount rate ($\gamma = 1$ is allowed only in episodic tasks). In this formulation, we define $d^\pi(s)$ as a discounted weighting of states encountered starting at $s_0$ and then following $\pi$: $d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr\{s_t = s \mid s_0, \pi\}$.

Our first result concerns the gradient of the performance metric with respect to the policy parameter:

Theorem 1 (Policy Gradient). For any MDP, in either the average-reward or start-state formulations,

$$\frac{\partial \rho}{\partial \theta} = \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} Q^\pi(s, a). \qquad (2)$$

Proof: See the appendix.

This way of expressing the gradient was first discussed for the average-reward formulation by Marbach and Tsitsiklis (1998), based on a related expression in terms of the state-value function due to Jaakkola, Singh, and Jordan (1995) and Cao and Chen (1997). We extend their results to the start-state formulation and provide simpler and more direct proofs. Williams's (1988, 1992) theory of REINFORCE algorithms can also be viewed as implying (2).
In any event, the key aspect of both expressions for the gradient is that there are no terms of the form $\frac{\partial d^\pi(s)}{\partial \theta}$: the effect of policy changes on the distribution of states does not appear. This is convenient for approximating the gradient by sampling. For example, if $s$ was sampled from the distribution obtained by following $\pi$, then $\sum_a \frac{\partial \pi(s, a)}{\partial \theta} Q^\pi(s, a)$ would be an unbiased estimate of $\frac{\partial \rho}{\partial \theta}$. Of course, $Q^\pi(s, a)$ is also not normally known and must be estimated. One approach is to use the actual returns, $R_t = \sum_{k=1}^{\infty} r_{t+k} - \rho(\pi)$ (or $R_t = \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k}$ in the start-state formulation) as an approximation for each $Q^\pi(s_t, a_t)$. This leads to Williams's episodic REINFORCE algorithm, $\Delta\theta_t \propto \frac{\partial \pi(s_t, a_t)}{\partial \theta} R_t \frac{1}{\pi(s_t, a_t)}$ (the $\frac{1}{\pi(s_t, a_t)}$ corrects for the oversampling of actions preferred by $\pi$), which is known to follow $\frac{\partial \rho}{\partial \theta}$ in expected value (Williams, 1988, 1992).

2 Policy Gradient with Approximation

Now consider the case in which $Q^\pi$ is approximated by a learned function approximator. If the approximation is sufficiently good, we might hope to use it in place of $Q^\pi$ in (2) and still point roughly in the direction of the gradient. For example, Jaakkola, Singh, and Jordan (1995) proved that for the special case of function approximation arising in a tabular POMDP one could assure positive inner product with the gradient, which is sufficient to ensure improvement for moving in that direction. Here we extend their result to general function approximation and prove equality with the gradient.

Let $f_w : S \times A \to \Re$ be our approximation to $Q^\pi$, with parameter $w$. It is natural to learn $f_w$ by following $\pi$ and updating $w$ by a rule such as $\Delta w_t \propto -\frac{\partial}{\partial w}\big[\hat{Q}^\pi(s_t, a_t) - f_w(s_t, a_t)\big]^2 \propto \big[\hat{Q}^\pi(s_t, a_t) - f_w(s_t, a_t)\big] \frac{\partial f_w(s_t, a_t)}{\partial w}$, where $\hat{Q}^\pi(s_t, a_t)$ is some unbiased estimator of $Q^\pi(s_t, a_t)$, perhaps $R_t$.
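The episodic REINFORCE update can be sketched on the simplest possible case: a single-state task (a two-armed bandit), where the return $R_t$ reduces to the immediate reward. The arm reward means below are invented for illustration. For a softmax policy, the factor $\frac{\partial \pi}{\partial \theta}\frac{1}{\pi(s_t,a_t)}$ is exactly $\nabla_\theta \log \pi(s_t, a_t)$.

```python
import numpy as np

# Minimal sketch of Williams's episodic REINFORCE on a 2-armed bandit
# (single state, so the return is just the immediate reward).
# The 1/pi(s_t, a_t) factor shows up as the gradient of log pi.

rng = np.random.default_rng(0)
true_means = np.array([0.2, 1.0])    # hypothetical arm reward means
theta = np.zeros(2)                  # one logit per action
alpha = 0.05

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(5000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                       # sample action from pi
    r = true_means[a] + rng.normal(0, 0.1)        # observed return R_t
    grad_log_pi = (np.arange(2) == a) - pi        # (dpi/dtheta) / pi for softmax
    theta += alpha * r * grad_log_pi              # follows d rho / d theta in expectation

print(softmax(theta))   # probability mass shifts toward the better arm
```

As the surrounding text notes, this estimator is unbiased but high-variance; subtracting a learned value baseline (the subject of Section 2) reduces the variance without introducing bias.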
When such a process has converged to a local optimum, then

$$\sum_s d^\pi(s) \sum_a \pi(s, a) \left[Q^\pi(s, a) - f_w(s, a)\right] \frac{\partial f_w(s, a)}{\partial w} = 0. \qquad (3)$$

Theorem 2 (Policy Gradient with Function Approximation). If $f_w$ satisfies (3) and is compatible with the policy parameterization in the sense that¹

$$\frac{\partial f_w(s, a)}{\partial w} = \frac{\partial \pi(s, a)}{\partial \theta} \frac{1}{\pi(s, a)}, \qquad (4)$$

then

$$\frac{\partial \rho}{\partial \theta} = \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} f_w(s, a). \qquad (5)$$

Proof: Combining (3) and (4) gives

$$\sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} \left[Q^\pi(s, a) - f_w(s, a)\right] = 0, \qquad (6)$$

which tells us that the error in $f_w(s, a)$ is orthogonal to the gradient of the policy parameterization. Because the expression above is zero, we can subtract it from the policy gradient theorem (2) to yield

$$\begin{aligned}
\frac{\partial \rho}{\partial \theta} &= \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} Q^\pi(s, a) - \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} \left[Q^\pi(s, a) - f_w(s, a)\right] \\
&= \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} \left[Q^\pi(s, a) - Q^\pi(s, a) + f_w(s, a)\right] \\
&= \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} f_w(s, a). \qquad \text{Q.E.D.}
\end{aligned}$$

3 Application to Deriving Algorithms and Advantages

Given a policy parameterization, Theorem 2 can be used to derive an appropriate form for the value-function parameterization. For example, consider a policy that is a Gibbs distribution in a linear combination of features:

$$\pi(s, a) = \frac{e^{\theta^\top \phi_{sa}}}{\sum_b e^{\theta^\top \phi_{sb}}}, \qquad \forall s \in S, a \in A,$$

¹Tsitsiklis (personal communication) points out that $f_w$ being linear in the features given on the right-hand side may be the only way to satisfy this condition.

where each