{"title": "Incremental Natural Actor-Critic Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 105, "page_last": 112, "abstract": null, "full_text": "Incremental Natural Actor-Critic Algorithms\n\nShalabh Bhatnagar\n\nDepartment of Computer Science & Automation, Indian Institute of Science, Bangalore, India\n\nRichard S. Sutton, Mohammad Ghavamzadeh, Mark Lee\n\nDepartment of Computing Science, University of Alberta, Edmonton, Alberta, Canada\n\nAbstract\n\nWe present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. 
Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.\n\n1 Introduction\n\nActor-critic (AC) algorithms are based on the simultaneous online estimation of the parameters of two structures, called the actor and the critic. The actor corresponds to a conventional action-selection policy, mapping states to actions in a probabilistic manner. The critic corresponds to a conventional value function, mapping states to expected cumulative future reward. Thus, the critic addresses a problem of prediction, whereas the actor is concerned with control. These problems are separable, but are solved simultaneously to find an optimal policy, as in policy iteration. A variety of methods can be used to solve the prediction problem, but the ones that have proved most effective in large applications are those based on some form of temporal difference (TD) learning (Sutton, 1988) in which estimates are updated on the basis of other estimates. Such bootstrapping methods can be viewed as a way of accelerating learning by trading bias for variance.\n\nActor-critic methods were among the earliest to be investigated in reinforcement learning (Barto et al., 1983; Sutton, 1984). They were largely supplanted in the 1990s by methods that estimate action-value functions and use them directly to select actions without an explicit policy structure. This approach was appealing because of its simplicity, but when combined with function approximation was found to have theoretical difficulties including in some cases a failure to converge. 
These problems led to renewed interest in methods with an explicit representation of the policy, which came to be known as policy gradient methods (Marbach, 1998; Sutton et al., 2000; Konda & Tsitsiklis, 2000; Baxter & Bartlett, 2001). Policy gradient methods without bootstrapping can be easily proved convergent, but converge slowly because of the high variance of their gradient estimates. Combining them with bootstrapping is a promising avenue toward a more effective method.\n\nAnother approach to speeding up policy gradient algorithms was proposed by Kakade (2002) and then refined and extended by Bagnell and Schneider (2003) and by Peters et al. (2003). The idea was to replace the policy gradient with the so-called natural policy gradient. This was motivated by the intuition that a change in the policy parameterization should not influence the result of the policy update. In terms of the policy update rule, the move to the natural gradient amounts to linearly transforming the gradient using the inverse Fisher information matrix of the policy.\n\nIn this paper, we introduce four new AC algorithms, three of which incorporate natural gradients. All the algorithms are for the average reward setting and use function approximation in the state-value function. For all four methods we prove convergence of the parameters of the policy and state-value function to a local maximum of a performance function that corresponds to the average reward plus a measure of the TD error inherent in the function approximation. Due to space limitations, we do not present the convergence analysis of our algorithms here; it can be found, along with some empirical results using our algorithms, in the extended version of this paper (Bhatnagar et al., 2007). Our results extend prior AC methods, especially those of Konda and Tsitsiklis (2000) and of Peters et al. (2005). We discuss these relationships in detail in Section 6. 
Our analysis does not cover the use of eligibility traces but we believe the extension to that case would be straightforward.\n\n2 The Policy Gradient Framework\n\nWe consider the standard reinforcement learning framework (e.g., see Sutton & Barto, 1998), in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time Markov decision process. The state, action, and reward at each time $t \\in \\{0, 1, 2, \\ldots\\}$ are denoted $s_t \\in \\mathcal{S}$, $a_t \\in \\mathcal{A}$, and $r_t \\in \\mathbb{R}$ respectively. We assume the reward is random, real-valued, and uniformly bounded. The environment\u2019s dynamics are characterized by state-transition probabilities $p(s'|s, a) = \\Pr(s_{t+1} = s' \\mid s_t = s, a_t = a)$, and single-stage expected rewards $r(s, a) = E[r_{t+1} \\mid s_t = s, a_t = a]$, $\\forall s, s' \\in \\mathcal{S}$, $\\forall a \\in \\mathcal{A}$. The agent selects an action at each time $t$ using a randomized stationary policy $\\pi(a|s) = \\Pr(a_t = a \\mid s_t = s)$. We assume\n\n(B1) The Markov chain induced by any policy is irreducible and aperiodic.\n\nThe long-term average reward per step under policy $\\pi$ is defined as\n\n$$J(\\pi) = \\lim_{T \\to \\infty} \\frac{1}{T} E\\Big[\\sum_{t=0}^{T-1} r_{t+1} \\mid \\pi\\Big] = \\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\pi(a|s) r(s, a),$$\n\nwhere $d^\\pi(s)$ is the stationary distribution of state $s$ under policy $\\pi$. The limit here is well-defined under (B1). Our aim is to find a policy $\\pi^*$ that maximizes the average reward, i.e., $\\pi^* = \\arg\\max_\\pi J(\\pi)$. In the average reward formulation, a policy $\\pi$ is assessed according to the expected differential reward associated with states $s$ or state\u2013action pairs $(s, a)$. 
For all states $s \\in \\mathcal{S}$ and actions $a \\in \\mathcal{A}$, the differential action-value function and the differential state-value function under policy $\\pi$ are defined as$^1$\n\n$$Q^\\pi(s, a) = \\sum_{t=0}^{\\infty} E[r_{t+1} - J(\\pi) \\mid s_0 = s, a_0 = a, \\pi], \\qquad V^\\pi(s) = \\sum_{a \\in \\mathcal{A}} \\pi(a|s) Q^\\pi(s, a). \\qquad (1)$$\n\nIn policy gradient methods, we define a class of parameterized stochastic policies $\\{\\pi(\\cdot|s; \\theta), s \\in \\mathcal{S}, \\theta \\in \\Theta\\}$, estimate the gradient of the average reward with respect to the policy parameters $\\theta$ from the observed states, actions, and rewards, and then improve the policy by adjusting its parameters in the direction of the gradient. Since in this setting a policy $\\pi$ is represented by its parameters $\\theta$, policy dependent functions such as $J(\\pi)$, $d^\\pi(\\cdot)$, $V^\\pi(\\cdot)$, and $Q^\\pi(\\cdot, \\cdot)$ can be written as $J(\\theta)$, $d(\\cdot; \\theta)$, $V(\\cdot; \\theta)$, and $Q(\\cdot, \\cdot; \\theta)$, respectively. We assume\n\n(B2) For any state\u2013action pair $(s, a)$, policy $\\pi(a|s; \\theta)$ is continuously differentiable in the parameters $\\theta$.\n\nPrevious works (Marbach, 1998; Sutton et al., 2000; Baxter & Bartlett, 2001) have shown that the gradient of the average reward for parameterized policies that satisfy (B1) and (B2) is given by$^2$\n\n$$\\nabla J(\\pi) = \\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\nabla \\pi(a|s) Q^\\pi(s, a). \\qquad (2)$$\n\n$^1$ From now on in the paper, we use the terms state-value function and action-value function instead of differential state-value function and differential action-value function.\n\n$^2$ Throughout the paper, we use notation $\\nabla$ to denote $\\nabla_\\theta$ \u2013 the gradient w.r.t. 
the policy parameters.\n\nObserve that if $b(s)$ is any given function of $s$ (also called a baseline), then\n\n$$\\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\nabla \\pi(a|s) b(s) = \\sum_{s \\in \\mathcal{S}} d^\\pi(s) b(s) \\nabla \\Big( \\sum_{a \\in \\mathcal{A}} \\pi(a|s) \\Big) = \\sum_{s \\in \\mathcal{S}} d^\\pi(s) b(s) \\nabla (1) = 0,$$\n\nand thus, for any baseline $b(s)$, the gradient of the average reward can be written as\n\n$$\\nabla J(\\pi) = \\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\nabla \\pi(a|s) (Q^\\pi(s, a) \\pm b(s)). \\qquad (3)$$\n\nThe baseline can be chosen in such a way that the variance of the gradient estimates is minimized (Greensmith et al., 2004).\n\nThe natural gradient, denoted $\\tilde{\\nabla} J(\\pi)$, can be calculated by linearly transforming the regular gradient, using the inverse Fisher information matrix of the policy: $\\tilde{\\nabla} J(\\pi) = G^{-1}(\\theta) \\nabla J(\\pi)$. The Fisher information matrix $G(\\theta)$ is positive definite and symmetric, and is given by\n\n$$G(\\theta) = E_{s \\sim d^\\pi, a \\sim \\pi}[\\nabla \\log \\pi(a|s) \\nabla \\log \\pi(a|s)^\\top]. \\qquad (4)$$\n\n3 Policy Gradient with Function Approximation\n\nNow consider the case in which the action-value function for a fixed policy $\\pi$, $Q^\\pi$, is approximated by a learned function approximator. If the approximation is sufficiently good, we might hope to use it in place of $Q^\\pi$ in Eqs. 2 and 3, and still point roughly in the direction of the true gradient. Sutton et al. (2000) showed that if the approximation $\\hat{Q}^\\pi_w$ with parameters $w$ is compatible, i.e., $\\nabla_w \\hat{Q}^\\pi_w(s, a) = \\nabla \\log \\pi(a|s)$, and minimizes the mean squared error\n\n$$\\mathcal{E}^\\pi(w) = \\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\pi(a|s) [Q^\\pi(s, a) - \\hat{Q}^\\pi_w(s, a)]^2 \\qquad (5)$$\n\nfor parameter value $w^*$, then we can replace $Q^\\pi$ with $\\hat{Q}^\\pi_{w^*}$ in Eqs. 2 and 3. Thus, we work with a linear approximation $\\hat{Q}^\\pi_w(s, a) = w^\\top \\psi(s, a)$, in which the $\\psi(s, a)$\u2019s are compatible features defined according to $\\psi(s, a) = \\nabla \\log \\pi(a|s)$. 
Note that compatible features are well defined under (B2). The Fisher information matrix of Eq. 4 can be written using the compatible features as\n\n$$G(\\theta) = E_{s \\sim d^\\pi, a \\sim \\pi}[\\psi(s, a) \\psi(s, a)^\\top]. \\qquad (6)$$\n\nSuppose $\\mathcal{E}^\\pi(w)$ denotes the mean squared error\n\n$$\\mathcal{E}^\\pi(w) = \\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\pi(a|s) [Q^\\pi(s, a) - w^\\top \\psi(s, a) - b(s)]^2 \\qquad (7)$$\n\nof our compatible linear parameterized approximation $w^\\top \\psi(s, a)$ and an arbitrary baseline $b(s)$. Let $w^* = \\arg\\min_w \\mathcal{E}^\\pi(w)$ denote the optimal parameter. Lemma 1 shows that the value of $w^*$ does not depend on the given baseline $b(s)$; as a result the mean squared error problems of Eqs. 5 and 7 have the same solutions. Lemma 2 shows that if the parameter is set to be equal to $w^*$, then the resulting mean squared error $\\mathcal{E}^\\pi(w^*)$ (now treated as a function of the baseline $b(s)$) is further minimized when $b(s) = V^\\pi(s)$. In other words, the variance in the action-value-function estimator is minimized if the baseline is chosen to be the state-value function itself.$^3$\n\n$^3$ It is important to note that Lemma 2 is not about the minimum variance baseline for gradient estimation. It is about the minimum variance baseline of the action-value-function estimator.\n\nLemma 1 The optimum weight parameter $w^*$ for any given $\\theta$ (policy $\\pi$) satisfies$^4$\n\n$$w^* = G^{-1}(\\theta) E_{s \\sim d^\\pi, a \\sim \\pi}[Q^\\pi(s, a) \\psi(s, a)].$$\n\n$^4$ This lemma is similar to Kakade\u2019s (2002) Theorem 1.\n\nProof Note that\n\n$$\\nabla_w \\mathcal{E}^\\pi(w) = -2 \\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\pi(a|s) [Q^\\pi(s, a) - w^\\top \\psi(s, a) - b(s)] \\psi(s, a). \\qquad (8)$$\n\nEquating the above to zero, one obtains\n\n$$\\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\pi(a|s) \\psi(s, a) \\psi(s, a)^\\top w^* = \\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\pi(a|s) Q^\\pi(s, a) \\psi(s, a) - \\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\pi(a|s) b(s) \\psi(s, a).$$\n\nThe last term on the right-hand side equals zero because $\\sum_{a \\in \\mathcal{A}} \\pi(a|s) \\psi(s, a) = \\sum_{a \\in \\mathcal{A}} \\nabla \\pi(a|s) = 0$ for any state $s$. Now, from Eq. 8, the Hessian $\\nabla^2_w \\mathcal{E}^\\pi(w)$ evaluated at $w^*$ can be seen to be $2G(\\theta)$. The claim follows because $G(\\theta)$ is positive definite for any $\\theta$. $\\Box$\n\nNext, given the optimum weight parameter $w^*$, we obtain the minimum variance baseline in the action-value-function estimator corresponding to policy $\\pi$. Thus we consider now $\\mathcal{E}^\\pi(w^*)$ as a function of the baseline $b$, and obtain $b^* = \\arg\\min_b \\mathcal{E}^\\pi(w^*)$.\n\nLemma 2 For any given policy $\\pi$, the minimum variance baseline $b^*(s)$ in the action-value-function estimator corresponds to the state-value function $V^\\pi(s)$.\n\nProof For any $s \\in \\mathcal{S}$, let $\\mathcal{E}^{\\pi,s}(w^*) = \\sum_{a \\in \\mathcal{A}} \\pi(a|s) [Q^\\pi(s, a) - w^{*\\top} \\psi(s, a) - b(s)]^2$. Then $\\mathcal{E}^\\pi(w^*) = \\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\mathcal{E}^{\\pi,s}(w^*)$. Note that by (B1), the Markov chain corresponding to any policy $\\pi$ is positive recurrent because the number of states is finite. Hence, $d^\\pi(s) > 0$ for all $s \\in \\mathcal{S}$. Thus, one needs to find the baseline $b(s)$ that minimizes $\\mathcal{E}^{\\pi,s}(w^*)$ for each $s \\in \\mathcal{S}$. For any $s \\in \\mathcal{S}$,\n\n$$\\frac{\\partial \\mathcal{E}^{\\pi,s}(w^*)}{\\partial b(s)} = -2 \\sum_{a \\in \\mathcal{A}} \\pi(a|s) [Q^\\pi(s, a) - w^{*\\top} \\psi(s, a) - b(s)].$$\n\nEquating the above to zero, we obtain\n\n$$b^*(s) = \\sum_{a \\in \\mathcal{A}} \\pi(a|s) Q^\\pi(s, a) - \\sum_{a \\in \\mathcal{A}} \\pi(a|s) w^{*\\top} \\psi(s, a).$$\n\nThe rightmost term equals zero because $\\sum_{a \\in \\mathcal{A}} \\pi(a|s) \\psi(s, a) = 0$. Hence $b^*(s) = \\sum_{a \\in \\mathcal{A}} \\pi(a|s) Q^\\pi(s, a) = V^\\pi(s)$. The second derivative of $\\mathcal{E}^{\\pi,s}(w^*)$ w.r.t. $b(s)$ equals 2. The claim follows. 
$\\Box$\n\nFrom Lemmas 1 and 2, $w^{*\\top} \\psi(s, a)$ is a least-squared optimal parametric representation for the advantage function $A^\\pi(s, a) = Q^\\pi(s, a) - V^\\pi(s)$ as well as for the action-value function $Q^\\pi(s, a)$. However, because $E_{a \\sim \\pi}[w^\\top \\psi(s, a)] = \\sum_{a \\in \\mathcal{A}} \\pi(a|s) w^\\top \\psi(s, a) = 0$, $\\forall s \\in \\mathcal{S}$, it is better to think of $w^\\top \\psi(s, a)$ as an approximation of the advantage function rather than of the action-value function.\n\nThe TD error $\\delta_t$ is a random quantity that is defined according to $\\delta_t = r_{t+1} - \\hat{J}_{t+1} + \\hat{V}(s_{t+1}) - \\hat{V}(s_t)$, where $\\hat{V}$ and $\\hat{J}$ are consistent estimates of the state-value function and the average reward, respectively. Thus, these estimates satisfy $E[\\hat{V}(s_t) \\mid s_t, \\pi] = V^\\pi(s_t)$ and $E[\\hat{J}_{t+1} \\mid s_t, \\pi] = J(\\pi)$, for any $t \\geq 0$. The next lemma shows that $\\delta_t$ is a consistent estimate of the advantage function $A^\\pi$.\n\nLemma 3 Under given policy $\\pi$, we have $E[\\delta_t \\mid s_t, a_t, \\pi] = A^\\pi(s_t, a_t)$.\n\nProof Note that\n\n$$E[\\delta_t \\mid s_t, a_t, \\pi] = E[r_{t+1} - \\hat{J}_{t+1} + \\hat{V}(s_{t+1}) - \\hat{V}(s_t) \\mid s_t, a_t, \\pi] = r(s_t, a_t) - J(\\pi) + E[\\hat{V}(s_{t+1}) \\mid s_t, a_t, \\pi] - V^\\pi(s_t).$$\n\nNow\n\n$$E[\\hat{V}(s_{t+1}) \\mid s_t, a_t, \\pi] = E[E[\\hat{V}(s_{t+1}) \\mid s_{t+1}, \\pi] \\mid s_t, a_t, \\pi] = E[V^\\pi(s_{t+1}) \\mid s_t, a_t] = \\sum_{s_{t+1} \\in \\mathcal{S}} p(s_{t+1}|s_t, a_t) V^\\pi(s_{t+1}).$$\n\nAlso $r(s_t, a_t) - J(\\pi) + \\sum_{s_{t+1} \\in \\mathcal{S}} p(s_{t+1}|s_t, a_t) V^\\pi(s_{t+1}) = Q^\\pi(s_t, a_t)$. The claim follows. $\\Box$\n\nBy setting the baseline $b(s)$ equal to the value function $V^\\pi(s)$, Eq. 3 can be written as $\\nabla J(\\pi) = \\sum_{s \\in \\mathcal{S}} d^\\pi(s) \\sum_{a \\in \\mathcal{A}} \\pi(a|s) \\psi(s, a) A^\\pi(s, a)$. From Lemma 3, $\\delta_t$ is a consistent estimate of the advantage function $A^\\pi(s, a)$. Thus, $\\widehat{\\nabla J(\\pi)} = \\delta_t \\psi(s_t, a_t)$ is a consistent estimate of $\\nabla J(\\pi)$. However, calculating $\\delta_t$ requires having estimates, $\\hat{J}$, $\\hat{V}$, of the average reward and the value function. While an average reward estimate is simple enough to obtain given the single-stage reward function, the same is not necessarily true for the value function. We use function approximation for the value function as well. Suppose $f(s)$ is a feature vector for state $s$. One may then approximate $V^\\pi(s)$ with $v^\\top f(s)$, where $v$ is a parameter vector that can be tuned (for a fixed policy $\\pi$) using a TD algorithm. In our algorithms, we use $\\delta_t = r_{t+1} - \\hat{J}_{t+1} + v_t^\\top f(s_{t+1}) - v_t^\\top f(s_t)$ as an estimate for the TD error, where $v_t$ corresponds to the value function parameter at time $t$.\n\nLet $\\bar{V}^\\pi(s) = \\sum_{a \\in \\mathcal{A}} \\pi(a|s) [r(s, a) - J(\\pi) + \\sum_{s' \\in \\mathcal{S}} p(s'|s, a) v^{\\pi\\top} f(s')]$, where $v^{\\pi\\top} f(s')$ is an estimate of the value function $V^\\pi(s')$ that is obtained upon convergence viz., $\\lim_{t \\to \\infty} v_t = v^\\pi$ with probability one. Also, let $\\delta^\\pi_t = r_{t+1} - \\hat{J}_{t+1} + v^{\\pi\\top} f(s_{t+1}) - v^{\\pi\\top} f(s_t)$, where $\\delta^\\pi_t$ corresponds to a stationary estimate of the TD error with function approximation under policy $\\pi$.\n\nLemma 4 $E[\\delta^\\pi_t \\psi(s_t, a_t) \\mid \\theta] = \\nabla J(\\pi) + \\sum_{s \\in \\mathcal{S}} d^\\pi(s) [\\nabla \\bar{V}^\\pi(s) - \\nabla v^{\\pi\\top} f(s)]$.\n\nThe proof of this lemma can be found in the extended version of this paper (Bhatnagar et al., 2007). Note that $E[\\delta_t \\psi(s_t, a_t) \\mid \\theta] = \\nabla J(\\pi)$, provided $\\delta_t$ is defined as $\\delta_t = r_{t+1} - \\hat{J}_{t+1} + \\hat{V}(s_{t+1}) - \\hat{V}(s_t)$ (as was considered in Lemma 3). 
For the case with function approximation that we study, from Lemma 4, the quantity $\\sum_{s \\in \\mathcal{S}} d^\\pi(s) [\\nabla \\bar{V}^\\pi(s) - \\nabla v^{\\pi\\top} f(s)]$ may be viewed as the error or bias in the estimate of the gradient of average reward that results from the use of function approximation.\n\n4 Actor-Critic Algorithms\n\nWe present four new AC algorithms in this section. These algorithms are in the general form shown in Table 1. They update the policy parameters along the direction of the average-reward gradient. While estimates of the regular gradient are used for this purpose in Algorithm 1, natural gradient estimates are used in Algorithms 2\u20134. While critic updates in our algorithms can be easily extended to the case of TD($\\lambda$), $\\lambda > 0$, we restrict our attention to the case when $\\lambda = 0$. In addition to assumptions (B1) and (B2), we make the following assumption:\n\n(B3) The step-size schedules for the critic $\\{\\alpha_t\\}$ and the actor $\\{\\beta_t\\}$ satisfy\n\n$$\\sum_t \\alpha_t = \\sum_t \\beta_t = \\infty, \\qquad \\sum_t \\alpha_t^2, \\; \\sum_t \\beta_t^2 < \\infty, \\qquad \\lim_{t \\to \\infty} \\frac{\\beta_t}{\\alpha_t} = 0. \\qquad (9)$$\n\nAs a consequence of Eq. 9, $\\beta_t \\to 0$ faster than $\\alpha_t$. 
Hence the critic has uniformly higher increments than the actor beyond some $t_0$, and thus it converges faster than the actor.\n\nTable 1: A Template for Incremental AC Algorithms.\n\n1: Input:\n   \u2022 Randomized parameterized policy $\\pi(\\cdot|\\cdot; \\theta)$,\n   \u2022 Value function feature vector $f(s)$.\n2: Initialization:\n   \u2022 Policy parameters $\\theta = \\theta_0$,\n   \u2022 Value function weight vector $v = v_0$,\n   \u2022 Step sizes $\\alpha = \\alpha_0$, $\\beta = \\beta_0$, $\\xi = c\\alpha_0$,\n   \u2022 Initial state $s_0$.\n3: for $t = 0, 1, 2, \\ldots$ do\n4:   Execution:\n     \u2022 Draw action $a_t \\sim \\pi(a_t|s_t; \\theta_t)$,\n     \u2022 Observe next state $s_{t+1} \\sim p(s_{t+1}|s_t, a_t)$,\n     \u2022 Observe reward $r_{t+1}$.\n5:   Average Reward Update: $\\hat{J}_{t+1} = (1 - \\xi_t) \\hat{J}_t + \\xi_t r_{t+1}$\n6:   TD error: $\\delta_t = r_{t+1} - \\hat{J}_{t+1} + v_t^\\top f(s_{t+1}) - v_t^\\top f(s_t)$\n7:   Critic Update: algorithm specific (see the text)\n8:   Actor Update: algorithm specific (see the text)\n9: endfor\n10: return Policy and value-function parameters $\\theta$, $v$\n\nWe now present the critic and the actor updates of our four AC algorithms.\n\nAlgorithm 1 (Regular-Gradient AC):\n\nCritic Update: $v_{t+1} = v_t + \\alpha_t \\delta_t f(s_t)$,\nActor Update: $\\theta_{t+1} = \\theta_t + \\beta_t \\delta_t \\psi(s_t, a_t)$.\n\nThis is the only AC algorithm presented in the paper that is based on the regular gradient estimate. This algorithm stores two parameter vectors $\\theta$ and $v$. Its per time-step computational cost is linear in the number of policy and value-function parameters.\n\nThe next algorithm is based on the natural-gradient estimate $\\tilde{\\nabla} J(\\theta_t) = G^{-1}(\\theta_t) \\delta_t \\psi(s_t, a_t)$ in place of the regular-gradient estimate in Algorithm 1. We derive a procedure for recursively estimating $G^{-1}(\\theta)$ and show in Lemma 5 that our estimate $G_t^{-1}$ converges to $G^{-1}(\\theta)$ as $t \\to \\infty$ with probability one. This is required for proving convergence of this algorithm. The Fisher information matrix can be estimated in an online manner as $G_{t+1} = \\frac{1}{t+1} \\sum_{i=0}^{t} \\psi(s_i, a_i) \\psi^\\top(s_i, a_i)$. One may obtain recursively $G_{t+1} = (1 - \\frac{1}{t+1}) G_t + \\frac{1}{t+1} \\psi(s_t, a_t) \\psi^\\top(s_t, a_t)$, or more generally\n\n$$G_{t+1} = (1 - \\zeta_t) G_t + \\zeta_t \\psi(s_t, a_t) \\psi^\\top(s_t, a_t). \\qquad (10)$$\n\nUsing the Sherman-Morrison matrix inversion lemma, one obtains\n\n$$G_{t+1}^{-1} = \\frac{1}{1 - \\zeta_t} \\left[ G_t^{-1} - \\zeta_t \\frac{G_t^{-1} \\psi(s_t, a_t) (G_t^{-1} \\psi(s_t, a_t))^\\top}{1 - \\zeta_t + \\zeta_t \\psi(s_t, a_t)^\\top G_t^{-1} \\psi(s_t, a_t)} \\right]. \\qquad (11)$$\n\nFor our Alg. 2 and 4, we require the following additional assumption for the convergence analysis:\n\n(B4) The iterates $G_t$ and $G_t^{-1}$ satisfy $\\sup_{t,\\theta,s,a} \\| G_t \\|$ and $\\sup_{t,\\theta,s,a} \\| G_t^{-1} \\| < \\infty$.\n\nLemma 5 For any given parameter $\\theta$, $G_t^{-1}$ in Eq. 11 satisfies $G_t^{-1} \\to G^{-1}(\\theta)$ as $t \\to \\infty$ with probability one.\n\nProof It is easy to see from Eq. 10 that $G_t \\to G(\\theta)$ as $t \\to \\infty$ with probability one, for any given $\\theta$ held fixed. For a fixed $\\theta$,\n\n$$\\| G_t^{-1} - G^{-1}(\\theta) \\| = \\| G^{-1}(\\theta) (G(\\theta) G_t^{-1} - I) \\| = \\| G^{-1}(\\theta) (G(\\theta) - G_t) G_t^{-1} \\| \\leq \\sup_\\theta \\| G^{-1}(\\theta) \\| \\, \\sup_{t,\\theta,s,a} \\| G_t^{-1} \\| \\cdot \\| G(\\theta) - G_t \\| \\to 0 \\quad \\text{as } t \\to \\infty$$\n\nby assumption (B4). The claim follows. $\\Box$\n\nOur second algorithm stores a matrix $G^{-1}$ and two parameter vectors $\\theta$ and $v$. Its per time-step computational cost is linear in the number of value-function parameters and quadratic in the number of policy parameters.\n\nAlgorithm 2 (Natural-Gradient AC with Fisher Information Matrix):\n\nCritic Update: $v_{t+1} = v_t + \\alpha_t \\delta_t f(s_t)$,\nActor Update: $\\theta_{t+1} = \\theta_t + \\beta_t G_{t+1}^{-1} \\delta_t \\psi(s_t, a_t)$,\n\nwith the estimate of the inverse Fisher information matrix updated according to Eq. 11. We let $G_0^{-1} = kI$, where $k$ is a positive constant. Thus $G_0^{-1}$ and $G_0$ are positive definite and symmetric matrices. From Eq. 10, $G_t$, $t > 0$ can be seen to be positive definite and symmetric because these are convex combinations of positive definite and symmetric matrices. Hence, $G_t^{-1}$, $t > 0$, are positive definite and symmetric as well.\n\nAs mentioned in Section 3, it is better to think of the compatible approximation $w^\\top \\psi(s, a)$ as an approximation of the advantage function rather than of the action-value function. In our next algorithm we tune the parameters $w$ in such a way as to minimize an estimate of the least-squared error $\\mathcal{E}^\\pi(w) = E_{s \\sim d^\\pi, a \\sim \\pi}[(w^\\top \\psi(s, a) - A^\\pi(s, a))^2]$. The gradient of $\\mathcal{E}^\\pi(w)$ is thus $\\nabla_w \\mathcal{E}^\\pi(w) = 2 E_{s \\sim d^\\pi, a \\sim \\pi}[(w^\\top \\psi(s, a) - A^\\pi(s, a)) \\psi(s, a)]$, which can be estimated as $\\widehat{\\nabla_w \\mathcal{E}^\\pi(w)} = 2[\\psi(s_t, a_t) \\psi(s_t, a_t)^\\top w - \\delta_t \\psi(s_t, a_t)]$. Hence, we update advantage parameters $w$ along with value-function parameters $v$ in the critic update of this algorithm. As with Peters et al. (2005), we use the natural gradient estimate $\\tilde{\\nabla} J(\\theta_t) = w_{t+1}$ in the actor update of Alg. 3. This algorithm stores three parameter vectors, $v$, $w$, and $\\theta$. 
Its per time-step computational cost is linear in the number of value-function parameters and quadratic in the number of policy parameters.\n\nAlgorithm 3 (Natural-Gradient AC with Advantage Parameters):\n\nCritic Update: $v_{t+1} = v_t + \\alpha_t \\delta_t f(s_t)$,\n$w_{t+1} = [I - \\alpha_t \\psi(s_t, a_t) \\psi(s_t, a_t)^\\top] w_t + \\alpha_t \\delta_t \\psi(s_t, a_t)$,\nActor Update: $\\theta_{t+1} = \\theta_t + \\beta_t w_{t+1}$.\n\nAlthough an estimate of $G^{-1}(\\theta)$ is not explicitly computed and used in Algorithm 3, the convergence analysis of this algorithm shows that the overall scheme still moves in the direction of the natural gradient of average reward. In Algorithm 4, however, we explicitly estimate $G^{-1}(\\theta)$ (as in Algorithm 2), and use it in the critic update for $w$. The overall scheme is again seen to follow the direction of the natural gradient of average reward. Here, we let $\\tilde{\\nabla}_w \\mathcal{E}^\\pi(w) = 2 G_t^{-1} [\\psi(s_t, a_t) \\psi(s_t, a_t)^\\top w - \\delta_t \\psi(s_t, a_t)]$ be the estimate of the natural gradient of the least-squared error $\\mathcal{E}^\\pi(w)$. This also simplifies the critic update for $w$. Algorithm 4 stores a matrix $G^{-1}$ and three parameter vectors, $v$, $w$, and $\\theta$. Its per time-step computational cost is linear in the number of value-function parameters and quadratic in the number of policy parameters.\n\nAlgorithm 4 (Natural-Gradient AC with Advantage Parameters and Fisher Information Matrix):\n\nCritic Update: $v_{t+1} = v_t + \\alpha_t \\delta_t f(s_t)$,\n$w_{t+1} = (1 - \\alpha_t) w_t + \\alpha_t G_{t+1}^{-1} \\delta_t \\psi(s_t, a_t)$,\nActor Update: $\\theta_{t+1} = \\theta_t + \\beta_t w_{t+1}$,\n\nwhere the estimate of the inverse Fisher information matrix is updated according to Eq. 11.\n\n5 Convergence of Our Actor-Critic Algorithms\n\nSince our algorithms are gradient-based, one cannot expect to prove convergence to a globally optimal policy. 
The best that one could hope for is convergence to a local maximum of $J(\\theta)$. However, because the critic will generally converge to an approximation of the desired projection of the value function (defined by the value function features $f$) in these algorithms, the corresponding convergence results are necessarily weaker, as indicated by the following theorem.\n\nTheorem 1 For the parameter iterations in Algorithms 1\u20134,$^5$ we have $(\\hat{J}_t, v_t, \\theta_t) \\to \\{(J(\\theta^*), v^{\\theta^*}, \\theta^*) \\mid \\theta^* \\in \\mathcal{Z}\\}$ as $t \\to \\infty$ with probability one, where the set $\\mathcal{Z}$ corresponds to the set of local maxima of a performance function whose gradient is $E[\\delta^\\pi_t \\psi(s_t, a_t) \\mid \\theta]$ (cf. Lemma 4).\n\nFor the proof of this theorem, please refer to Section 6 (Convergence Analysis) of the extended version of this paper (Bhatnagar et al., 2007). This theorem indicates that the policy and state-value-function parameters converge to a local maximum of a performance function that corresponds to the average reward plus a measure of the TD error inherent in the function approximation.\n\n6 Relation to Previous Algorithms\n\nActor-Critic Algorithm of Konda and Tsitsiklis (2000): Unlike our Alg. 2\u20134, their algorithm does not use estimates of the natural gradient in its actor\u2019s update. Their algorithm is similar to our Alg. 1, but with some key differences. 1) Konda\u2019s algorithm uses the Markov process of state\u2013action pairs, and thus its critic update is based on an action-value function. Alg. 1 uses the state process, and therefore its critic update is based on a state-value function. 2) Whereas Alg. 1 uses a TD error in both critic and actor recursions, Konda\u2019s algorithm uses a TD error only in its critic update. The actor recursion in Konda\u2019s algorithm uses an action-value estimate instead. 
Because the TD error is a consistent estimate of the advantage function (Lemma 3), the actor recursion in Alg. 1 uses estimates of advantages instead of action-values, which may result in lower variances. 3) The convergence analysis of Konda\u2019s algorithm is based on the martingale approach and aims at bounding error terms and directly showing convergence; convergence to a local optimum is shown when a TD(1) critic is used. For the case where $\\lambda < 1$, they show that given an $\\epsilon > 0$, there exists $\\lambda$ close enough to one such that when a TD($\\lambda$) critic is used, one gets $\\liminf_t |\\nabla J(\\theta_t)| < \\epsilon$ with probability one. Unlike Konda and Tsitsiklis, we primarily use the ordinary differential equation (ODE) based approach for our convergence analysis. Though we use martingale arguments in our analysis, these are restricted to showing that the noise terms asymptotically diminish; the resulting scheme can be viewed as an Euler-discretization of the associated ODE.\n\n$^5$ The proof of this theorem requires another assumption viz., (A3) in the extended version of this paper (Bhatnagar et al., 2007), in addition to (B1)-(B3) (resp. (B1)-(B4)) for Algorithm 1 and 3 (resp. for Algorithm 2 and 4). This was not included in this paper due to space limitations.\n\nNatural Actor-Critic Algorithm of Peters et al. (2005): Our Algorithms 2\u20134 extend their algorithm by being fully incremental and in that we provide convergence proofs. 
Peters\u2019s algorithm uses a least-squares TD method in its critic\u2019s update, whereas all our algorithms are fully incremental. It is not clear how to satisfactorily incorporate least-squares TD methods in a context in which the policy is changing, and our proof techniques do not immediately extend to this case.\n\n7 Conclusions and Future Work\n\nWe have introduced and analyzed four AC algorithms utilizing both linear function approximation and bootstrapping, a combination which seems essential to large-scale applications of reinforcement learning. All of the algorithms are based on existing ideas such as TD-learning, natural policy gradients, and two-timescale stochastic approximation, but combined in new ways. The main contribution of this paper is proving convergence of the algorithms to a local maximum in the space of policy and value-function parameters. Our Alg. 2\u20134 are explorations of the use of natural gradients within an AC architecture. The way we use natural gradients is distinctive in that it is totally incremental: the policy is changed on every time step, yet the gradient computation is never reset as it is in the algorithm of Peters et al. (2005). Alg. 3 is perhaps the most interesting of the three natural-gradient algorithms. It never explicitly stores an estimate of the inverse Fisher information matrix and, as a result, it requires less computation. In empirical experiments using our algorithms (not reported here) we observed that it is easier to find good parameter settings for Alg. 3 than it is for the other natural-gradient algorithms and, perhaps because of this, it converged more rapidly than the others and than Konda\u2019s algorithm. All our algorithms performed better than Konda\u2019s algorithm.\n\nThere are a number of ways in which our results are limited and suggest future work. 
1) It is important to characterize the quality of the converged solutions, either by bounding the performance loss due to bootstrapping and approximation error, or through a thorough empirical study. 2) The algorithms can be extended to incorporate eligibility traces and least-squares methods. As discussed earlier, the former seems straightforward whereas the latter requires more fundamental extensions. 3) Application of the algorithms to real-world problems is needed to assess their ultimate utility.\n\nReferences\n\nBagnell, J., & Schneider, J. (2003). Covariant policy search. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence.\n\nBarto, A. G., Sutton, R. S., & Anderson, C. (1983). Neuron-like elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, 13, 835\u2013846.\n\nBaxter, J., & Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. JAIR, 15, 319\u2013350.\n\nBhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2007). Natural actor-critic algorithms. Submitted to Automatica.\n\nGreensmith, E., Bartlett, P., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471\u20131530.\n\nKakade, S. (2002). A natural policy gradient. Proceedings of NIPS 14.\n\nKonda, V., & Tsitsiklis, J. (2000). Actor-critic algorithms. Proceedings of NIPS 12 (pp. 1008\u20131014).\n\nMarbach, P. (1998). Simulation-based methods for Markov decision processes. Doctoral dissertation, MIT.\n\nPeters, J., Vijayakumar, S., & Schaal, S. (2003). Reinforcement learning for humanoid robotics. Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots.\n\nPeters, J., Vijayakumar, S., & Schaal, S. (2005). Natural actor-critic. Proceedings of the Sixteenth European Conference on Machine Learning (pp. 280\u2013291).\n\nSutton, R. S. (1984). 
Temporal credit assignment in reinforcement learning. Doctoral dissertation, UMass Amherst.\n\nSutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9\u201344.\n\nSutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.\n\nSutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Proceedings of NIPS 12 (pp. 1057\u20131063).", "award": [], "sourceid": 294, "authors": [{"given_name": "Shalabh", "family_name": "Bhatnagar", "institution": null}, {"given_name": "Mohammad", "family_name": "Ghavamzadeh", "institution": null}, {"given_name": "Mark", "family_name": "Lee", "institution": null}, {"given_name": "Richard", "family_name": "Sutton", "institution": null}]}