{"title": "Actor-Critic Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1008, "page_last": 1014, "abstract": null, "full_text": "Actor-Critic Algorithms \n\nVijay R. Konda \n\nJohn N. Tsitsiklis \n\nLaboratory for Information and Decision Systems , \n\nMassachusetts Institute of Technology, \n\nCambridge, MA, 02139. \n\nkonda@mit.edu, jnt@mit.edu \n\nAbstract \n\nWe propose and analyze a class of actor-critic algorithms for \nsimulation-based optimization of a Markov decision process over \na parameterized family of randomized stationary policies. These \nare two-time-scale algorithms in which the critic uses TD learning \nwith a linear approximation architecture and the actor is updated \nin an approximate gradient direction based on information pro(cid:173)\nvided by the critic. We show that the features for the critic should \nspan a subspace prescribed by the choice of parameterization of the \nactor. We conclude by discussing convergence properties and some \nopen problems. \n\n1 \n\nIntroduction \n\nThe vast majority of Reinforcement Learning (RL) [9J and Neuro-Dynamic Pro(cid:173)\ngramming (NDP) [lJ methods fall into one of the following two categories: \n\n(a) Actor-only methods work with a parameterized family of policies. The gra(cid:173)\n\ndient of the performance, with respect to the actor parameters, is directly \nestimated by simulation, and the parameters are updated in a direction of \nimprovement [4, 5, 8, 13J. A possible drawback of such methods is that the \ngradient estimators may have a large variance. Furthermore, as the pol(cid:173)\nicy changes, a new gradient is estimated independently of past estimates. \nHence, there is no \"learning,\" in the sense of accumulation and consolida(cid:173)\ntion of older information. 
\n\n(b) Critic-only methods rely exclusively on value function approximation and \naim at learning an approximate solution to the Bellman equation, which will \nthen hopefully prescribe a near-optimal policy. Such methods are indirect \nin the sense that they do not try to optimize directly over a policy space. A \nmethod of this type may succeed in constructing a \"good\" approximation of \nthe value function, yet lack reliable guarantees in terms of near-optimality \nof the resulting policy. \n\nActor-critic methods aim at combining the strong points of actor-only and critic(cid:173)\nonly methods. The critic uses an approximation architecture and simulation to \nlearn a value function, which is then used to update the actor's policy parameters \n\n\fActor-Critic Algorithms \n\n1009 \n\nin a direction of performance improvement. Such methods, as long as they are \ngradient-based, may have desirable convergence properties, in contrast to critic(cid:173)\nonly methods for which convergence is guaranteed in very limited settings. They \nhold the promise of delivering faster convergence (due to variance reduction), when \ncompared to actor-only methods. On the other hand, theoretical understanding of \nactor-critic methods has been limited to the case of lookup table representations of \npolicies [6]. \nIn this paper, we propose some actor-critic algorithms and provide an overview of \na convergence proof. The algorithms are based on an important observation. Since \nthe number of parameters that the actor has to update is relatively small (compared \nto the number of states), the critic need not attempt to compute or approximate \nthe exact value function, which is a high-dimensional object. In fact, we show that \nthe critic should ideally compute a certain \"projection\" of the value function onto a \nlow-dimensional subspace spanned by a set of \"basis functions,\" that are completely \ndetermined by the parameterization of the actor. 
Finally, as the analysis in [11] suggests for TD algorithms, our algorithms can be extended to the case of arbitrary state and action spaces, as long as certain ergodicity assumptions are satisfied. \n\nWe close this section by noting that ideas similar to ours have been presented in the simultaneous and independent work of Sutton et al. [10]. \n\n2 Markov decision processes and parameterized families of RSPs \n\nConsider a Markov decision process with finite state space $S$ and finite action space $A$. Let $g : S \times A \to \mathbb{R}$ be a given cost function. A randomized stationary policy (RSP) is a mapping $\mu$ that assigns to each state $x$ a probability distribution over the action space $A$. We consider a set of randomized stationary policies $\mathbb{P} = \{\mu_\theta; \theta \in \mathbb{R}^n\}$, parameterized in terms of a vector $\theta$. For each pair $(x,u) \in S \times A$, $\mu_\theta(x,u)$ denotes the probability of taking action $u$ when the state $x$ is encountered, under the policy corresponding to $\theta$. Let $p_{xy}(u)$ denote the probability that the next state is $y$, given that the current state is $x$ and the current action is $u$. Note that under any RSP, the sequence of states $\{X_n\}$ and of state-action pairs $\{X_n, U_n\}$ of the Markov decision process form Markov chains with state spaces $S$ and $S \times A$, respectively. We make the following assumptions about the family of policies $\mathbb{P}$. \n\n(A1) For all $x \in S$ and $u \in A$, the map $\theta \mapsto \mu_\theta(x,u)$ is twice differentiable with bounded first and second derivatives. Furthermore, there exists an $\mathbb{R}^n$-valued function $\psi_\theta(x,u)$ such that $\nabla \mu_\theta(x,u) = \mu_\theta(x,u)\,\psi_\theta(x,u)$, where the mapping $\theta \mapsto \psi_\theta(x,u)$ is bounded and has bounded first derivatives for any fixed $x$ and $u$. \n\n(A2) For each $\theta \in \mathbb{R}^n$, the Markov chains $\{X_n\}$ and $\{X_n, U_n\}$ are irreducible and aperiodic, with stationary probabilities $\pi_\theta(x)$ and $\eta_\theta(x,u) = \pi_\theta(x)\,\mu_\theta(x,u)$, respectively, under the RSP $\mu_\theta$.
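Assumption (A1) holds, for example, for a Boltzmann (softmax) parameterization of the policy. The following minimal sketch (our own illustrative features and numbers, not from the paper) checks the identity $\nabla \mu_\theta(x,u) = \mu_\theta(x,u)\,\psi_\theta(x,u)$, with $\psi_\theta = \nabla \ln \mu_\theta$, against finite differences:

```python
import numpy as np

# Boltzmann (softmax) RSP over 2 actions with per-(x,u) feature f(x,u):
#   mu_theta(x, u) = exp(theta . f(x,u)) / sum_u' exp(theta . f(x,u'))
# Then psi_theta(x, u) = grad ln mu = f(x,u) - sum_u' mu_theta(x,u') f(x,u').
f = np.array([[[1.0, 0.0], [0.0, 1.0]],    # features f(x,u) for state 0
              [[1.0, 1.0], [0.5, -0.5]]])  # features f(x,u) for state 1

def mu(theta, x):
    logits = f[x] @ theta
    e = np.exp(logits - logits.max())
    return e / e.sum()

def psi(theta, x, u):
    # gradient of ln mu_theta(x, u) with respect to theta
    return f[x, u] - mu(theta, x) @ f[x]

theta = np.array([0.3, -0.7])
eps = 1e-6
for x in range(2):
    for u in range(2):
        # central finite-difference gradient of mu_theta(x, u)
        grad_fd = np.array([
            (mu(theta + eps * np.eye(2)[i], x)[u]
             - mu(theta - eps * np.eye(2)[i], x)[u]) / (2 * eps)
            for i in range(2)])
        assert np.allclose(grad_fd, mu(theta, x)[u] * psi(theta, x, u), atol=1e-6)
print("A1 verified: grad mu = mu * psi")
```

A useful side effect of this form is that $\sum_u \mu_\theta(x,u)\psi_\theta(x,u) = \nabla \sum_u \mu_\theta(x,u) = 0$ for every state.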
\n\nIn reference to Assumption (AI) , note that whenever 1-\"9 (x, u) is nonzero we have \n\n'l/J9 (x, u) = \n\n\\l1-\"9(X, u) \n1-\"9 x,u \n\n( \n\n) = \\lIn 1-\"9 (x, u). \n\nConsider the average cost function>. : ffi.n t-t ffi., given by \n>.(e) = L g(x, U)'T}9(X, u) . \n\nxES,uEA \n\n\f1010 \n\nV. R. Konda and J. N. Tsitsiklis \n\nWe are interested in minimizing >'(19) over all 19. For each 19 E Rn , let Ve : S t--7 R \nbe the \"differential\" cost function, defined as solution of Poisson equation: \n\n>'(19) + Ve(x) = L \n\nI-'e(x,u) [g(X,U) + LPxY(U)Ve(Y)]. \n\nuEA \n\nY \n\nIntuitively, Ve(x) can be viewed as the \"disadvantage\" of state x: it is the expected \nexcess cost - on top of the average cost - incurred if we start at state x. It plays a \nrole similar to that played by the more familiar value function that arises in total \nor discounted cost Markov decision problems. Finally, for every 19 E Rn, we define \nthe q-function qe : S x A -+ R, by \n\nqe(x, u) = g(x, u) - >'(19) + LPxy(u)Ve(y). \n\nY \n\nWe recall the following result, as stated in [8]. (Different versions of this result have \nbeen established in [3, 4, 5].) \nTheorem 1. \n\n8 \n819. >'(19) = L..J 1]e(x, u)qe(x , u)1/;o(x, u) \n\n. \n\n'\"' \nX,U \n\n1. \n\n(1) \n\nwhere 1/;b (x, u) stands for the i th component of 1/;e . \n\nIn [8], the quantity qe(x,u) in the above formula is interpreted as the expected \nexcess cost incurred over a certain renewal period of the Markov chain {Xn, Un}, \nunder the RSP I-'e, and is then estimated by means of simulation, leading to actor(cid:173)\nonly algorithms. Here, we provide an alternative interpretation of the formula in \nTheorem 1, as an inner product, and thus derive a different set of algorithms, which \nreadily generalize to the case of an infinite space as well. \nFor any 19 E Rn , we define the inner product (', .) 
 of two real-valued functions $q_1, q_2$ on $S \times A$, viewed as vectors in $\mathbb{R}^{|S||A|}$, by $$\langle q_1, q_2 \rangle_\theta = \sum_{x,u} \eta_\theta(x,u)\, q_1(x,u)\, q_2(x,u).$$ With this notation we can rewrite formula (1) as $$\frac{\partial}{\partial \theta^i} \lambda(\theta) = \langle q_\theta, \psi_\theta^i \rangle_\theta, \qquad i = 1, \ldots, n.$$ Let $\|\cdot\|_\theta$ denote the norm induced by this inner product on $\mathbb{R}^{|S||A|}$. For each $\theta \in \mathbb{R}^n$, let $\Psi_\theta$ denote the span of the vectors $\{\psi_\theta^i;\ 1 \le i \le n\}$ in $\mathbb{R}^{|S||A|}$. (This is the same as the set of all functions $f$ on $S \times A$ of the form $f(x,u) = \sum_{i=1}^n a_i \psi_\theta^i(x,u)$, for some scalars $a_1, \ldots, a_n$.) \n\nNote that although the gradient of $\lambda$ depends on the $q$-function, which is a vector in a possibly very high-dimensional space $\mathbb{R}^{|S||A|}$, the dependence is only through its inner products with vectors in $\Psi_\theta$. Thus, instead of \"learning\" the function $q_\theta$, it would suffice to learn the projection of $q_\theta$ on the subspace $\Psi_\theta$. \n\nIndeed, let $\Pi_\theta : \mathbb{R}^{|S||A|} \to \Psi_\theta$ be the projection operator defined by $$\Pi_\theta q = \arg\min_{\hat{q} \in \Psi_\theta} \|q - \hat{q}\|_\theta.$$ Since $$\langle q_\theta, \psi_\theta^i \rangle_\theta = \langle \Pi_\theta q_\theta, \psi_\theta^i \rangle_\theta, \qquad (2)$$ it is enough to compute the projection of $q_\theta$ onto $\Psi_\theta$. \n\n3 Actor-critic algorithms \n\nWe view actor-critic algorithms as stochastic gradient algorithms on the parameter space of the actor. When the actor parameter vector is $\theta$, the job of the critic is to compute an approximation of the projection $\Pi_\theta q_\theta$ of $q_\theta$ onto $\Psi_\theta$. The actor uses this approximation to update its policy in an approximate gradient direction. The analysis in [11, 12] shows that this is precisely what TD algorithms try to do, i.e., to compute the projection of an exact value function onto a subspace spanned by feature vectors. This allows us to implement the critic by using a TD algorithm. (Note, however, that other types of critics are possible, e.g., based on batch solution of least squares problems, as long as they aim at computing the same projection.)
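Both Theorem 1 and the projection identity (2) can be checked numerically on a small instance. The sketch below (a toy two-state, two-action MDP of our own choosing, with a Boltzmann policy) computes $\eta_\theta$, solves the Poisson equation for $V_\theta$, forms $q_\theta$, and compares the gradient formula (1) against finite differences of $\lambda(\theta)$:

```python
import numpy as np

# Toy MDP (our own illustrative numbers): 2 states, 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[x][u][y] transition probabilities
              [[0.5, 0.5], [0.7, 0.3]]])
g = np.array([[1.0, 2.0], [0.5, 3.0]])     # cost g(x, u)
f = np.array([[[1.0, 0.0], [0.0, 1.0]],    # policy features for Boltzmann RSP
              [[1.0, 1.0], [0.5, -0.5]]])

def mu(theta):                             # mu[x, u]
    logits = f @ theta
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def psi(theta):                            # psi[x, u, i] = grad ln mu
    m = mu(theta)
    return f - (m[..., None] * f).sum(axis=1, keepdims=True)

def stationary(theta):
    m = mu(theta)
    Px = np.einsum('xu,xuy->xy', m, P)     # state transition matrix under mu
    w, v = np.linalg.eig(Px.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    pi = pi / pi.sum()                     # stationary distribution pi_theta
    return pi, pi[:, None] * m             # pi, eta[x, u]

def avg_cost(theta):
    _, eta = stationary(theta)
    return (eta * g).sum()                 # lambda(theta)

def q_function(theta):
    m, lam = mu(theta), avg_cost(theta)
    Px = np.einsum('xu,xuy->xy', m, P)
    gbar = (m * g).sum(axis=1)
    # Poisson equation (I - Px) V = gbar - lam, normalized so that V[0] = 0:
    A = np.eye(2) - Px
    A[:, 0] = 1.0                          # pin down the additive constant
    V = np.linalg.solve(A, gbar - lam)
    V[0] = 0.0                             # first entry comes out (numerically) zero
    return g - lam + P @ V                 # q[x, u]

theta = np.array([0.2, -0.4])
_, eta = stationary(theta)
q, ps = q_function(theta), psi(theta)
grad = np.einsum('xu,xu,xui->i', eta, q, ps)        # Theorem 1, formula (1)
eps = 1e-6
grad_fd = np.array([(avg_cost(theta + eps * np.eye(2)[i])
                     - avg_cost(theta - eps * np.eye(2)[i])) / (2 * eps)
                    for i in range(2)])
print(grad, grad_fd)                       # the two gradients should agree
```

Identity (2) is just the normal-equation property of the weighted projection: the residual $q_\theta - \Pi_\theta q_\theta$ is $\langle \cdot,\cdot \rangle_\theta$-orthogonal to every $\psi_\theta^i$.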
\n\nWe note some minor differences with the common usage of TD. In our context, \nwe need the projection of q-functions, rather than value functions. But this is \neasily achieved by replacing the Markov chain {xt} in [11, 12] by the Markov chain \n{Xn, Un}. A further difference is that [11, 12] assume that the control policy and \nthe feature vectors are fixed. In our algorithms, the control policy as well as the \nfeatures need to change as the actor updates its parameters. As shown in [6, 2], \nthis need not pose any problems, as long as the actor parameters are updated on a \nslower time scale. \nWe are now ready to describe two actor-critic algorithms, which differ only as far \nas the critic updates are concerned. In both variants, the critic is a TD algorithm \nwith a linearly parameterized approximation architecture for the q-function, of the \nform \n\nQ~(x, u) = I: r j 4>~(x, u), \n\nm \n\nj=l \n\nwhere r = (rl, ... , rm) E ]Rm denotes the parameter vector of the critic. The \nfeatures 4>~, j = 1, ... ,m, used by the critic are dependent on the actor parameter \nvector 0 and are chosen such that their span in ]RlsIIAI, denoted by ~ = 't/J~ for each i. Nevertheless, we allow the possibility that m > nand 91c (Xk+l' Uk+l) , \n\u00a2>9,. (Xk+l, Uk+d, \n\nif Xk+l ::/= x*, \n\notherwise. \n\nTD(a) Critic, 0 ~ a < 1: \n\nActor: Finally, the actor updates its parameter vector by letting \n\n(Jk+l = (Jk - rhf(rk)Q~~ (Xk+l' Uk+l)1/!9,. (Xk+l' Uk+l). \n\nHere, 13k is a positive stepsize and r(rk) > 0 is a normalization factor satisfying: \n\n(A3) f(\u00b7) is Lipschitz continuous. \n(A4) There exists C > 0 such that \n\nC \n\nr(r) ~ 1 + Ilrll' \n\nThe above presented algorithms are only two out of many variations. 
For instance, one could also consider \"episodic\" problems, in which one starts from a given initial state and runs the process until a random termination time (at which time the process is reinitialized at $x^*$), with the objective of minimizing the expected cost until termination. In this setting, the average cost estimate $\lambda_k$ is unnecessary and is removed from the critic update formula. If the critic parameter $r_k$ were to be reinitialized each time that $x^*$ is entered, one would obtain a method closely related to Williams' REINFORCE algorithm [13]. Such a method does not involve any value function learning, because the observations during one episode do not affect the critic parameter $r$ during another episode. In contrast, in our approach, the observations from all past episodes affect the current critic parameter $r$, and in this sense the critic is \"learning.\" This can be advantageous because, as long as $\theta$ changes slowly, the observations from recent episodes carry useful information on the $q$-function under the current policy. \n\n4 Convergence of actor-critic algorithms \n\nSince our actor-critic algorithms are gradient-based, one cannot expect to prove convergence to a globally optimal policy (within the given class of RSPs). The best that one could hope for is the convergence of $\nabla \lambda(\theta)$ to zero; in practical terms, this will usually translate to convergence to a local minimum of $\lambda(\theta)$. Actually, because the TD($\alpha$) critic will generally converge to an approximation of the desired projection of the value function, the corresponding convergence result is necessarily weaker, only guaranteeing that $\nabla \lambda(\theta_k)$ becomes small (infinitely often). Let us now introduce some further assumptions. \n\n(A5) For each $\theta \in \mathbb{R}^n$, we define an $m \times m$ matrix $G(\theta)$ by $$G(\theta) = \sum_{x,u} \eta_\theta(x,u)\, \phi_\theta(x,u)\, \phi_\theta(x,u)^T.$$
\n\nx,u \n\nWe assume that G(O) is uniformly positive definite, that is, there exists \nsome fl > 0 such that for all r E ~m and 0 E ~n \n\nrTG(O)r ~ fdlrW\u00b7 \n\n(A6) We assume that the stepsize sequences bk}, {th} are positive, nonincreas(cid:173)\n\ning, and satisfy \n\n15k > 0, Vk, L 15k = 00, L 15k < 00, \n\nwhere 15k stands for either /h or 'Yk. We also assume that \n\nk \n\nk \n\n13k --+ o. \n'Yk \n\nNote that the last assumption requires that the actor parameters be updated at a \ntime scale slower than that of critic. \nTheorem 2. In an actor-critic algorithm with a TD(l) critic, \n\nliminf IIV'A(Ok)11 = 0 \n\nk \n\nw.p. 1. \n\nFurthermore, if {Od is bounded w.p. 1 then \n\nlim IIV' A(Ok)11 = 0 \nk \n\nw.p. 1. \n\nTheorem 3. For every f > 0, there exists a: sufficiently close to 1, such that \nliminfk IIV'A(Ok)11 ::; f w.p. 1. \n\nNote that the theoretical guarantees appear to be stronger in the case of the TD(l) \ncritic. However, we expect that TD(a:) will perform better in practice because of \nmuch smaller variance for the parameter rk. (Similar issues arise when considering \nactor-only algorithms. The experiments reported in [7] indicate that introducing a \nforgetting factor a: < 1 can result in much faster convergence, with very little loss of \nperformance.) We now provide an overview of the proofs of these theorems. Since \n13k/'Yk --+ 0, the size of the actor updates becomes negligible compared to the size \nof the critic updates. Therefore the actor looks stationary, as far as the critic is \nconcerned. Thus, the analysis in [1] for the TD(l) critic and the analysis in [12] \nfor the TD(a:) critic (with a: < 1) can be used, with appropriate modifications, to \nconclude that the critic's approximation of IIokqok will be \"asymptotically correct\". 
\nIf r(O) denotes the value to which the critic converges when the actor parameters \nare fixed at 0, then the update for the actor can be rewritten as \n\nOk+l = Ok - 13kr(r(Ok))Q~(Ok) (Xk+l, Uk+l)'l/JOk (Xk+1 , Uk+d + 13kek, \n\nwhere ek is an error that becomes asymptotically negligible. At this point, standard \nproof techniques for stochastic approximation algorithms can be used to complete \nthe proof. \n\n5 Conclusions \n\nThe key observation in this paper is that in actor-critic methods, the actor pa(cid:173)\nrameterization and the critic parameterization need not, and should not be chosen \n\n\f1014 \n\nV. R. Konda and J. N. Tsitsiklis \n\nindependently. Rather, an appropriate approximation architecture for the critic is \ndirectly prescribed by the parameterization used in actor. \n\nCapitalizing on the above observation, we have presented a class of actor-critic algo(cid:173)\nrithms, aimed at combining the advantages of actor-only and critic-only methods. In \ncontrast to existing actor-critic methods, our algorithms apply to high-dimensional \nproblems (they do not rely on lookup table representations), and are mathematically \nsound in the sense that they possess certain convergence properties. \n\nAcknowledgments: This research was partially supported by the NSF under \ngrant ECS-9873451, and by the AFOSR under grant F49620-99-1-0320. \n\nReferences \n\n[1] D. P. Bertsekas and J. N. Tsitsiklis. Neurodynamic Programming. Athena \n\nScientific, Belmont, MA, 1996. \n\n[2] V. S. Borkar. Stochastic approximation with two time scales. Systems and \n\nControl Letters, 29:291-294, 1996. \n\n[3] X. R. Cao and H. F. Chen. Perturbation realization, potentials, and sensitiv(cid:173)\nity analysis of Markov processes. IEEE Transactions on Automatic Control, \n42:1382-1393,1997. \n\n[4] P. W. Glynn. Stochastic approximation for monte carlo optimization. In Pro(cid:173)\n\nceedings of the 1986 Winter Simulation Conference, pages 285-289, 1986. \n\n[5] T . 
 Jaakkola, S. P. Singh, and M. I. Jordan. Reinforcement learning algorithms for partially observable Markov decision problems. In Advances in Neural Information Processing Systems, volume 7, pages 345-352, San Francisco, CA, 1995. Morgan Kaufmann. \n\n[6] V. R. Konda and V. S. Borkar. Actor-critic like learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94-123, 1999. \n\n[7] P. Marbach. Simulation-based optimization of Markov reward processes. PhD thesis, Massachusetts Institute of Technology, 1998. \n\n[8] P. Marbach and J. N. Tsitsiklis. Simulation-based optimization of Markov reward processes. Submitted to IEEE Transactions on Automatic Control. \n\n[9] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1995. \n\n[10] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In this proceedings. \n\n[11] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997. \n\n[12] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799-1808, 1999. \n\n[13] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992.", "award": [], "sourceid": 1786, "authors": [{"given_name": "Vijay", "family_name": "Konda", "institution": null}, {"given_name": "John", "family_name": "Tsitsiklis", "institution": null}]}