{"title": "TD(0) Leads to Better Policies than Approximate Value Iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 1377, "page_last": 1384, "abstract": null, "full_text": "TD(0) Leads to Better Policies than Approximate Value Iteration\n\nBenjamin Van Roy\nManagement Science and Engineering and Electrical Engineering\nStanford University\nStanford, CA 94305\nbvr@stanford.edu\n\nAbstract\nWe consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective.\n\n1\n\nPreliminaries\n\nConsider a discrete-time communicating Markov decision process (MDP) with a finite state space S = {1, ..., |S|}. At each state x ∈ S, there is a finite set U_x of admissible actions. If the current state is x and an action u ∈ U_x is selected, a cost of g_u(x) is incurred, and the system transitions to a state y ∈ S with probability p_xy(u). For any x ∈ S and u ∈ U_x, Σ_{y∈S} p_xy(u) = 1. Costs are discounted at a rate of α ∈ (0, 1) per period. Each instance of such an MDP is defined by a quintuple (S, U, g, p, α). A (stationary deterministic) policy is a mapping μ that assigns an action u ∈ U_x to each state x ∈ S. If actions are selected based on a policy μ, the state follows a Markov process with transition matrix P_μ, where each (x, y)th entry is equal to p_xy(μ(x)). The restriction to communicating MDPs ensures that it is possible to reach any state from any other state. 
Each policy μ is associated with a cost-to-go function J_μ ∈ ℜ^|S|, defined by J_μ = Σ_{t=0}^∞ α^t P_μ^t g_μ = (I − αP_μ)^{-1} g_μ, where, with some abuse of notation, g_μ(x) = g_{μ(x)}(x) for each x ∈ S. A policy μ is said to be greedy with respect to a function J if μ(x) ∈ argmin_{u∈U_x} (g_u(x) + α Σ_{y∈S} p_xy(u) J(y)) for all x ∈ S.\n\nThe optimal cost-to-go function J* ∈ ℜ^|S| is defined by J*(x) = min_μ J_μ(x), for all x ∈ S. A policy μ* is said to be optimal if J_μ* = J*. It is well-known that an optimal policy exists. Further, a policy μ* is optimal if and only if it is greedy with respect to J*. Hence, given the optimal cost-to-go function, optimal actions can be computed by minimizing the right-hand side of the above inclusion.\n\nValue iteration generates a sequence J_ℓ converging to J* according to J_{ℓ+1} = T J_ℓ, where T is the dynamic programming operator, defined by (T J)(x) = min_{u∈U_x} (g_u(x) + α Σ_{y∈S} p_xy(u) J(y)), for all x ∈ S and J ∈ ℜ^|S|. This sequence converges to J* for any initialization of J_0.\n\n2\n\nApproximate Value Iteration\n\nThe state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. One approach to dealing with this obstacle involves partitioning the state space S into a manageable number K of disjoint subsets S_1, ..., S_K and approximating the optimal cost-to-go function with a function that is constant over each partition. This can be thought of as a form of state aggregation: all states within a given partition are assumed to share a common optimal cost-to-go. To represent an approximation, we define a matrix Φ ∈ ℜ^{|S|×K} such that each kth column is an indicator function for the kth partition S_k. Hence, for any r ∈ ℜ^K, k, and x ∈ S_k, (Φr)(x) = r_k. In this paper, we study variations of value iteration, each of which computes a vector r so that Φr approximates J*. 
The use of a policy μ_r that is greedy with respect to Φr is justified by the following result (see [10] for a proof):\n\nTheorem 1 If μ is a greedy policy with respect to a function J̃ ∈ ℜ^|S| then ‖J_μ − J*‖_∞ ≤ (2α/(1 − α)) ‖J* − J̃‖_∞.\n\nOne common way of approximating a function J ∈ ℜ^|S| with a function of the form Φr involves projection with respect to a weighted Euclidean norm ‖·‖_{2,π}. The weighted Euclidean norm is defined by ‖J‖_{2,π} = (Σ_{x∈S} π(x) J²(x))^{1/2}. Here, π ∈ ℜ_+^|S| is a vector of weights that assign relative emphasis among states. The projection Π_π J is the function Φr that attains the minimum of ‖J − Φr‖_{2,π}; if there are multiple functions Φr that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm ‖Φr‖_{2,π}. Note that in our context, where each kth column of Φ represents an indicator function for the kth partition, for any π, J, and x ∈ S_k, (Π_π J)(x) = Σ_{y∈S_k} π(y) J(y) / Σ_{y∈S_k} π(y).\n\nApproximate value iteration begins with a function Φr^(0) and generates a sequence according to Φr^(ℓ+1) = Π_π T Φr^(ℓ). It is well-known that the dynamic programming operator T is a contraction mapping with respect to the maximum norm. Further, Π_π is maximum-norm nonexpansive [16, 7, 8]. (This is not true for general Φ, but is true in our context in which columns of Φ are indicator functions for partitions.) It follows that the composition Π_π T is a contraction mapping. By the contraction mapping theorem, Π_π T has a unique fixed point Φr̃, which is the limit of the sequence Φr^(ℓ). Further, the following result holds:\n\nTheorem 2 For any MDP, partition, and weights π with support intersecting every partition, if Φr̃ = Π_π T Φr̃ then ‖Φr̃ − J*‖_∞ ≤ (2/(1 − α)) min_{r∈ℜ^K} ‖J* − Φr‖_∞, and (1 − α) ‖J_{μ_r̃} − J*‖_∞ ≤ (4α/(1 − α)) min_{r∈ℜ^K} ‖J* − Φr‖_∞.\n\nThe first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. 
The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1.\n\nNote that J_{μ_r̃}(x) ≥ J*(x) for all x, so the left-hand side of the performance loss bound is the maximal increase in cost-to-go, normalized by 1 − α. This normalization is natural, since a cost-to-go function is a linear combination of expected future costs, with coefficients 1, α, α², ..., which sum to 1/(1 − α).\n\nOur motivation of the normalizing constant raises the question of whether, for fixed MDP parameters (S, U, g, p) and fixed Φ, min_r ‖J* − Φr‖_∞ also grows with 1/(1 − α). It turns out that min_r ‖J* − Φr‖_∞ = O(1). To see why, note that for any μ, J_μ = (I − αP_μ)^{-1} g_μ = (1/(1 − α)) λ_μ + h_μ, where λ_μ(x) is the expected average cost if the process starts in state x and is controlled by policy μ, λ_μ = lim_{τ→∞} (1/τ) Σ_{t=0}^{τ−1} P_μ^t g_μ, and h_μ is the discounted differential cost function h_μ = (I − αP_μ)^{-1} (g_μ − λ_μ). Both λ_μ and h_μ converge to finite vectors as α approaches 1 [3]. For an optimal policy μ*, lim_{α↑1} λ_μ*(x) does not depend on x (in our context of a communicating MDP). Since constant functions lie in the range of Φ, lim_{α↑1} min_{r∈ℜ^K} ‖J* − Φr‖_∞ ≤ lim_{α↑1} ‖h_μ*‖_∞ < ∞.\n\nThe performance loss bound still exhibits an undesirable dependence on α through the coefficient 4α/(1 − α). In most relevant contexts, α is close to 1; a representative value might be 0.99. Consequently, 4α/(1 − α) can be very large. Unfortunately, the bound is sharp, as expressed by the following theorem. We will denote by 1 the vector with every component equal to 1.\n\nTheorem 3 For any δ > 0, α ∈ (0, 1), and Δ ≥ 0, there exist MDP parameters (S, U, g, p) and a partition such that min_{r∈ℜ^K} ‖J* − Φr‖_∞ = Δ and, if Φr̃ = Π_π T Φr̃ with π = 1, (1 − α) ‖J_{μ_r̃} − J*‖_∞ ≥ (4α/(1 − α)) min_{r∈ℜ^K} ‖J* − Φr‖_∞ − δ.\n\nThis theorem is established through an example in [22]. The choice of uniform weights (π = 1) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp. 
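\n\nAs a rough numerical illustration of the fixed point of the projected operator and of the two inequalities of Theorem 2, the following sketch runs value iteration and approximate value iteration with state aggregation on a small randomly generated MDP. It is not taken from the paper: the toy MDP, the partition, the number of sweeps, and every function name (bellman, project, greedy, evaluate) are hypothetical choices made only for this example, and uniform projection weights are assumed.

```python
import random

def bellman(J, g, p, alpha):
    # (T J)(x) = min_u ( g[x][u] + alpha * sum_y p[x][u][y] * J[y] )
    n = len(J)
    return [min(g[x][u] + alpha * sum(p[x][u][y] * J[y] for y in range(n))
                for u in range(len(g[x])))
            for x in range(n)]

def project(J, partition, weights):
    # Weighted average of J over each partition, copied back to its states
    out = [0.0] * len(J)
    for block in partition:
        total = sum(weights[y] for y in block)
        avg = sum(weights[y] * J[y] for y in block) / total
        for x in block:
            out[x] = avg
    return out

def greedy(J, g, p, alpha):
    # One greedy action per state with respect to J
    n = len(J)
    return [min(range(len(g[x])),
                key=lambda u: g[x][u] + alpha * sum(p[x][u][y] * J[y] for y in range(n)))
            for x in range(n)]

def evaluate(mu, g, p, alpha, sweeps):
    # Cost-to-go of the deterministic policy mu by repeated substitution
    n = len(mu)
    J = [0.0] * n
    for _ in range(sweeps):
        J = [g[x][mu[x]] + alpha * sum(p[x][mu[x]][y] * J[y] for y in range(n))
             for x in range(n)]
    return J

# Hypothetical toy MDP: 6 states, 2 actions, strictly positive transitions
rng = random.Random(0)
n_states, n_actions, alpha, sweeps = 6, 2, 0.9, 500
g = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
p = []
for x in range(n_states):
    rows = []
    for u in range(n_actions):
        raw = [rng.random() + 0.01 for _ in range(n_states)]
        rows.append([v / sum(raw) for v in raw])
    p.append(rows)

partition = [[0, 1], [2, 3], [4, 5]]
weights = [1.0] * n_states  # uniform projection weights (assumption)

J_star = [0.0] * n_states
J_agg = [0.0] * n_states
for _ in range(sweeps):
    J_star = bellman(J_star, g, p, alpha)                             # value iteration
    J_agg = project(bellman(J_agg, g, p, alpha), partition, weights)  # approximate VI

# Smallest max-norm error of any piecewise-constant fit: half the spread per block
best_err = max((max(J_star[x] for x in b) - min(J_star[x] for x in b)) / 2
               for b in partition)
approx_err = max(abs(a - b) for a, b in zip(J_agg, J_star))
J_mu = evaluate(greedy(J_agg, g, p, alpha), g, p, alpha, sweeps)
loss = max(abs(a - b) for a, b in zip(J_mu, J_star))

print('approximation error', approx_err, 'bound', 2 / (1 - alpha) * best_err)
print('normalized loss', (1 - alpha) * loss, 'bound', 4 * alpha / (1 - alpha) * best_err)
```

On a single random instance the observed errors will typically sit well inside the bounds; the sharpness discussed above concerns specially constructed examples, not typical instances.\n\n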
Based on Theorems 2 and 3, one might expect that there exist MDP parameters (S, U, g, p) and a partition such that, with π = 1, (1 − α) ‖J_{μ_r̃} − J*‖_∞ = Θ((1/(1 − α)) min_{r∈ℜ^K} ‖J* − Φr‖_∞). In other words, that the performance loss is both lower and upper bounded by 1/(1 − α) times the smallest possible approximation error. It turns out that this is not true, at least if we restrict to a finite state space. However, as the following theorem establishes, the coefficient multiplying min_{r∈ℜ^K} ‖J* − Φr‖_∞ can grow arbitrarily large as α increases, keeping all else fixed.\n\nTheorem 4 For any L and Δ ≥ 0, there exist MDP parameters (S, U, g, p) and a partition such that lim_{α↑1} min_{r∈ℜ^K} ‖J* − Φr‖_∞ = Δ and, if Φr̃ = Π_π T Φr̃ with π = 1, lim inf_{α↑1} (1 − α)(J_{μ_r̃}(x) − J*(x)) ≥ L lim_{α↑1} min_{r∈ℜ^K} ‖J* − Φr‖_∞, for all x ∈ S.\n\nThis theorem is also established through an example [22]. For any μ and x, lim_{α↑1} ((1 − α) J_μ(x) − λ_μ(x)) = lim_{α↑1} (1 − α) h_μ(x) = 0. Combined with Theorem 4, this yields the following corollary.\n\nCorollary 1 For any L and Δ ≥ 0, there exist MDP parameters (S, U, g, p) and a partition such that lim_{α↑1} min_{r∈ℜ^K} ‖J* − Φr‖_∞ = Δ and, if Φr̃ = Π_π T Φr̃ with π = 1, lim inf_{α↑1} (λ_{μ_r̃}(x) − λ_μ*(x)) ≥ L lim_{α↑1} min_{r∈ℜ^K} ‖J* − Φr‖_∞, for all x ∈ S.\n\n3\n\nUsing the Invariant Distribution\n\nIn the previous section, we considered an approximation Φr̃ that solves Π_π T Φr̃ = Φr̃ for some arbitrary pre-selected weights π. We now turn to consider use of an invariant state distribution π_r̃ of P_{μ_r̃} as the weight vector.¹ This leads to a circular definition: the weights are used in defining r̃ and now we are defining the weights in terms of r̃. What we are really after here is a vector r̃ that satisfies Π_{π_r̃} T Φr̃ = Φr̃. The following theorem captures the associated benefits. (Due to space limitations, we omit the proof, which is provided in the full length version of this paper [22].) 
\n\nTheorem 5 For any MDP and partition, if Φr̃ = Π_{π_r̃} T Φr̃ and π_r̃ has support intersecting every partition, then (1 − α) π_r̃^T (J_{μ_r̃} − J*) ≤ 2α min_{r∈ℜ^K} ‖J* − Φr‖_∞.\n\nWhen α is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. The primary improvement is in the omission of a factor of 1 − α from the denominator. But for the bounds to be compared in a meaningful way, we must also relate the left-hand-side expressions. A relation can be based on the fact that for all μ, lim_{α↑1} ‖(1 − α) J_μ − λ_μ‖_∞ = 0, as explained in Section 2. In particular, based on this, we have lim_{α↑1} (1 − α) ‖J_μ − J*‖_∞ = |λ_μ − λ*| = λ_μ − λ* = lim_{α↑1} (1 − α) π^T (J_μ − J*), for all policies μ and probability distributions π. Hence, the left-hand-side expressions from the two performance bounds become directly comparable as α approaches 1. Another interesting comparison can be made by contrasting Corollary 1 against the following immediate consequence of Theorem 5.\n\nCorollary 2 For all MDP parameters (S, U, g, p) and partitions, if Φr̃ = Π_{π_r̃} T Φr̃ and lim inf_{α↑1} Σ_{x∈S_k} π_r̃(x) > 0 for all k, then lim sup_{α↑1} ‖λ_{μ_r̃} − λ_μ*‖_∞ ≤ 2 lim_{α↑1} min_{r∈ℜ^K} ‖J* − Φr‖_∞.\n\nThe comparison suggests that solving Φr̃ = Π_{π_r̃} T Φr̃ is strongly preferable to solving Φr̃ = Π_π T Φr̃ with π = 1.\n\n¹ By an invariant state distribution of a transition matrix P, we mean any probability distribution π such that π^T P = π^T. In the event that P_{μ_r̃} has multiple invariant distributions, π_r̃ denotes an arbitrary choice.\n\n4\n\nExploration\n\nIf a vector r̃ solves Φr̃ = Π_{π_r̃} T Φr̃ and the support of π_r̃ intersects every partition, Theorem 5 promises a desirable bound. However, there are two significant shortcomings to this solution concept, which we will address in this section. First, in some cases, the equation Π_{π_r̃} T Φr̃ = Φr̃ does not have a solution. 
It is easy to produce examples of this; though no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. Second, it would be nice to relax the requirement that the support of π_r̃ intersect every partition.\n\nTo address these shortcomings, we introduce stochastic policies. A stochastic policy μ maps state-action pairs to probabilities. For each x ∈ S and u ∈ U_x, μ(x, u) is the probability of taking action u when in state x. Hence, μ(x, u) ≥ 0 for all x ∈ S and u ∈ U_x, and Σ_{u∈U_x} μ(x, u) = 1 for all x ∈ S.\n\nGiven a scalar ε > 0 and a function J, the ε-greedy Boltzmann exploration policy with respect to J is defined by μ(x, u) = e^{−(T_u J)(x)(|U_x|−1)/(εe)} / Σ_{u'∈U_x} e^{−(T_{u'} J)(x)(|U_x|−1)/(εe)}, where (T_u J)(x) = g_u(x) + α Σ_{y∈S} p_xy(u) J(y). For any ε > 0 and r, let μ_r^ε denote the ε-greedy Boltzmann exploration policy with respect to Φr. Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration: (T^ε J)(x) = Σ_{u∈U_x} e^{−(T_u J)(x)(|U_x|−1)/(εe)} (T_u J)(x) / Σ_{u∈U_x} e^{−(T_u J)(x)(|U_x|−1)/(εe)}. As ε approaches 0, ε-greedy Boltzmann exploration policies become greedy and the modified dynamic programming operators become the dynamic programming operator. More precisely, for all r, x, and J, lim_{ε↓0} μ_r^ε(x, μ_r(x)) = 1 and lim_{ε↓0} T^ε J = T J. These are immediate consequences of the following result (see [4] for a proof).\n\nLemma 1 For any n, v ∈ ℜ^n, min_i v_i + ε ≥ Σ_i e^{−v_i(n−1)/(εe)} v_i / Σ_i e^{−v_i(n−1)/(εe)} ≥ min_i v_i.\n\nBecause we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each ε-greedy Boltzmann exploration policy μ_r^ε and the support of this distribution is S. Let π_r^ε denote this distribution. We consider a vector r̃ that solves Φr̃ = Π_{π_r̃^ε} T^ε Φr̃. For any ε > 0, there exists a solution to this equation (this is an immediate extension of Theorem 5.1 from [4]). 
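\n\nThe Boltzmann weighting and the sandwich inequality of Lemma 1 can be checked numerically. The sketch below is only an illustration under assumptions: the list action_values stands in for the numbers (T_u J)(x) at one fixed state, its entries are made up, and the function names boltzmann_weights and soft_min are hypothetical, not from the paper.

```python
import math

def boltzmann_weights(values, eps):
    # values[i] plays the role of (T_u J)(x) for the i-th admissible action.
    # Weight of action i is proportional to exp(-values[i] * (n - 1) / (eps * e)).
    n = len(values)
    scale = (n - 1) / (eps * math.e)
    v_min = min(values)
    # Subtracting the minimum leaves the ratios unchanged and avoids overflow
    unnorm = [math.exp(-(v - v_min) * scale) for v in values]
    total = sum(unnorm)
    return [w / total for w in unnorm]

def soft_min(values, eps):
    # Boltzmann-weighted average of the values, i.e. the middle term of Lemma 1
    probs = boltzmann_weights(values, eps)
    return sum(pr * v for pr, v in zip(probs, values))

action_values = [1.0, 1.3, 2.0, 5.0]  # made-up numbers for one state
for eps in (1.0, 0.1, 0.01):
    probs = boltzmann_weights(action_values, eps)
    s = soft_min(action_values, eps)
    # Lemma 1 predicts min <= s <= min + eps
    print('eps', eps, 'soft-min', s, 'prob of best action', probs[0])
```

As eps shrinks, the probability assigned to the minimizing action approaches 1 and the weighted average approaches the minimum, which is the sense in which the exploration policy becomes greedy.\n\n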
We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. (Again, we omit the proof, which is available in [22].)\n\nTheorem 6 For any MDP, partition, and ε > 0, if Φr̃ = Π_{π_r̃^ε} T^ε Φr̃ then (1 − α)(π_r̃^ε)^T (J_{μ_r̃^ε} − J*) ≤ 2α min_{r∈ℜ^K} ‖J* − Φr‖_∞ + ε.\n\n5\n\nComputation: TD(0)\n\nThough computation is not a focus of this paper, we offer a brief discussion here. First, we describe a simple algorithm from [16], which draws on ideas from temporal-difference learning [11, 12] and Q-learning [23, 24] to solve Φr̃ = Π_π T Φr̃. It requires an ability to sample a sequence of states x^(0), x^(1), x^(2), ..., each independent and identically distributed according to π. Also required is a way to efficiently compute (T Φr)(x) = min_{u∈U_x} (g_u(x) + α Σ_{y∈S} p_xy(u)(Φr)(y)), for any given x and r. This is typically possible when the action set U_x and the support of p_x·(u) (i.e., the set of states that can follow x if action u is selected) are not too large. The algorithm generates a sequence of vectors r^(ℓ) according to r^(ℓ+1) = r^(ℓ) + γ_ℓ φ(x^(ℓ)) ((T Φr^(ℓ))(x^(ℓ)) − (Φr^(ℓ))(x^(ℓ))), where γ_ℓ is a step size and φ(x) denotes the column vector made up of components from the xth row of Φ. In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, r^(ℓ) converges to a vector r̃ that solves Φr̃ = Π_π T Φr̃.\n\nThe equation Φr̃ = Π_{π_r̃} T Φr̃ may have no solution. Further, the requirement that states are sampled independently from the invariant distribution may be impractical. However, a natural extension of the above algorithm leads to an easily implementable version of TD(0) that aims at solving Φr̃ = Π_{π_r̃^ε} T^ε Φr̃. The algorithm requires simulation of a trajectory x_0, x_1, x_2, ... of the MDP, with each action u_t ∈ U_{x_t} generated by the ε-greedy Boltzmann exploration policy with respect to Φr^(t). 
The sequence of vectors r^(t) is generated according to r^(t+1) = r^(t) + γ_t φ(x_t) ((T^ε Φr^(t))(x_t) − (Φr^(t))(x_t)). Under suitable conditions on the step size sequence, if this algorithm converges, the limit satisfies Φr̃ = Π_{π_r̃^ε} T^ε Φr̃. Whether such an algorithm converges and whether there are other algorithms that can effectively solve Φr̃ = Π_{π_r̃^ε} T^ε Φr̃ for broad classes of relevant problems remain open issues.\n\n6\n\nExtensions and Open Issues\n\nOur results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound.\n\nThis is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (arbitrary matrix Φ), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. Earlier empirical work anticipated this [13, 14].\n\nThe temporal-difference learning algorithm presented in Section 5 is a version of TD(0). This is a special case of TD(λ), which is parameterized by λ ∈ [0, 1]. It is not known whether the results of this paper can be extended to the general case of λ ∈ [0, 1]. Prior research has suggested that larger values of λ lead to superior results. In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of 1/(1 − α) as λ is changed from 1 to 0. 
The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error.\n\nWhether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue. In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection. Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. Whether a factor of 1/(1 − α) can similarly be eliminated from these bounds is an open issue.\n\nOur results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. With Boltzmann exploration, the equation of interest becomes Φr̃ = Π_{π_r̃^ε} (T^ε Φr̃ − λ̃1). The variables include an estimate λ̃ ∈ ℜ of the minimal average cost λ* ∈ ℜ and an approximation Φr̃ of the optimal differential cost function h*. The discount factor α is set to 1 in computing an ε-greedy Boltzmann exploration policy as well as T^ε. There is an average-cost version of temporal-difference learning for which any limit of convergence (λ̃, r̃) satisfies this equation [19, 20, 21]. Generalization of Theorem 2 does not lead to a useful result because the right-hand side of the bound becomes infinite as α approaches 1. On the other hand, generalization of Theorem 6 yields the first performance loss bound for approximate value iteration with an average-cost objective:\n\nTheorem 7 For any communicating MDP with an average-cost objective, partition, and ε > 0, if Φr̃ = Π_{π_r̃^ε} (T^ε Φr̃ − λ̃1) then λ_{μ_r̃^ε} − λ* ≤ 2 min_{r∈ℜ^K} ‖h* − Φr‖_∞ + ε.\n\nHere, λ_{μ_r̃^ε} ∈ ℜ denotes the average cost under policy μ_r̃^ε, which is well-defined because the process is irreducible under an ε-greedy Boltzmann exploration policy. 
This theorem can be proved by taking limits on the left and right-hand sides of the bound of Theorem 6. It is easy to see that the limit of the left-hand side is λ_{μ_r̃^ε} − λ*. The limit of min_{r∈ℜ^K} ‖J* − Φr‖_∞ on the right-hand side is min_{r∈ℜ^K} ‖h* − Φr‖_∞. (This follows from the analysis of [3].)\n\nAcknowledgments\nThis material is based upon work supported by the National Science Foundation under Grant ECS-9985229 and by the Office of Naval Research under Grant MURI N00014-001-0637. The author's understanding of the topic benefited from collaborations with Dimitri Bertsekas, Daniela de Farias, and John Tsitsiklis. A full length version of this paper has been submitted to Mathematics of Operations Research and has benefited from a number of useful comments and suggestions made by reviewers.\n\nReferences\n[1] D. P. Bertsekas. A counterexample to temporal-difference learning. Neural Computation, 7:270–279, 1994.\n[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.\n[3] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726, 1962.\n[4] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000.\n[5] D. P. de Farias and B. Van Roy. Approximate dynamic programming via linear programming. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.\n[6] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.\n[7] G. J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995.\n[8] G. J. Gordon. Stable function approximation in dynamic programming. In Machine Learning: Proceedings of the Twelfth International Conference (ICML), San Francisco, CA, 1995.\n[9] T. Jaakkola, M. I. Jordan, and S. P. Singh. 
On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.\n[10] S. P. Singh and R. C. Yee. An upper-bound on the loss from approximate optimal-value functions. Machine Learning, 1994.\n[11] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.\n[12] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.\n[13] R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, 1995.\n[14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, Cambridge, MA, 1996. MIT Press.\n[15] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.\n[16] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996.\n[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.\n[18] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press.\n[19] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. In Proceedings of the IEEE Conference on Decision and Control, 1997.\n[20] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999.\n[21] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 49(2-3):179–191, 2002.\n[22] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. 
Under review with Mathematics of Operations Research, available at www.stanford.edu/ bvr/psfiles/aggregation.pdf, 2005.\n[23] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989.\n[24] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.\n", "award": [], "sourceid": 2877, "authors": [{"given_name": "Benjamin", "family_name": "Roy", "institution": null}]}