{"title": "A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 417, "page_last": 424, "abstract": null, "full_text": " A Cost-Shaping LP for\n Bellman Error Minimization with\n Performance Guarantees\n\n\n\n Daniela Pucci de Farias Benjamin Van Roy\n Mechanical Engineering Management Science and Engineering\n Massachusetts Institute of Technology and Electrical Engineering\n Stanford University\n\n\n\n Abstract\n\n We introduce a new algorithm based on linear programming that\n approximates the differential value function of an average-cost\n Markov decision process via a linear combination of pre-selected\n basis functions. The algorithm carries out a form of cost shaping\n and minimizes a version of Bellman error. We establish an error\n bound that scales gracefully with the number of states without\n imposing the (strong) Lyapunov condition required by its counter-\n part in [6]. We propose a path-following method that automates\n selection of important algorithm parameters which represent coun-\n terparts to the \"state-relevance weights\" studied in [6].\n\n\n1 Introduction\n\nOver the past few years, there has been a growing interest in linear programming\n(LP) approaches to approximate dynamic programming (DP). These approaches\noffer algorithms for computing weights to fit a linear combination of pre-selected\nbasis functions to a dynamic programming value function. A control policy that is\n\"greedy\" with respect to the resulting approximation is then used to make real-time\ndecisions.\n\nEmpirically, LP approaches appear to generate effective control policies for high-\ndimensional dynamic programs [1, 6, 11, 15, 16]. 
At the same time, the strength
and clarity of theoretical results about such algorithms have overtaken counterparts
available for alternatives such as approximate value iteration, approximate policy
iteration, and temporal-difference methods. As an example, a result in [6] implies
that, for a discrete-time finite-state Markov decision process (MDP), if the span of
the basis functions contains the constant function and comes within a distance of $\epsilon$
of the dynamic programming value function, then the approximation generated by a
certain LP will come within a distance of $O(\epsilon)$. Here, the coefficient of the $O(\epsilon)$ term
depends on the discount factor and the metric used for measuring distance, but not
on the choice of basis functions. On the other hand, the strongest results available
for approximate value iteration and approximate policy iteration only promise $O(\epsilon)$
error under additional requirements on iterates generated in the course of executing
the algorithms [3, 13]. In fact, it has been shown that, even when $\epsilon = 0$, approximate
value iteration can generate a diverging sequence of approximations [2, 5, 10, 14].

In this paper, we propose a new LP for approximating optimal policies. We work
with a formulation involving average-cost optimization of a possibly infinite-state
MDP. The fact that we work with this more sophisticated formulation is itself a
contribution to the literature on LP approaches to approximate DP, which have
been studied for the most part in finite-state discounted-cost settings. But we view
as our primary contributions the proposed algorithms and theoretical results, which
strengthen in important ways previous results on LP approaches and unify certain
ideas in the approximate DP literature. In particular, highlights of our contributions
include:

 1. Relaxed Lyapunov Function dependence. 
Results in [6] suggest that\n in order for the LP approach presented there to scale gracefully to large\n problems a certain linear combination of the basis functions must be\n a \"Lyapunov function,\" satisfying a certain strong Lyapunov condition.\n The method and results in our current paper eliminate this requirement.\n Further, the error bound is strengthened because it alleviates an undesirable\n dependence on the Lyapunov function that appears in [6] even when the\n Lyapunov condition is satisfied.\n\n 2. Restart Distribution Selection. Applying the LP studied in [6] requires\n manual selection of a set of parameters called state-relevance weights. That\n paper illustrated the importance of a good choice and provided intuition\n on how one might go about making the choice. The LP in the current\n paper does not explicitly make use of state-relevance weights, but rather,\n an analog which we call a restart distribution, and we propose an automated\n method for finding a desirable restart distribution.\n\n 3. Relation to Bellman-Error Minimization. An alternative approach\n for approximate DP aims at minimizing \"Bellman error\" (this idea was\n first suggested in [16]). Methods proposed for this (e.g., [4, 12]) involve\n stochastic steepest descent of a complex nonlinear function. There are no\n results indicating whether a global minimum will be reached or guaranteeing\n that a local minimum attained will exhibit desirable behavior. In this\n paper, we explain how the LP we propose can be thought of as a method\n for minimizing a version of Bellman error. The important differences here\n are that our method involves solving a linear rather than a nonlinear (and\n nonconvex) program and that there are performance guarantees that can\n be made for the outcome.\n\nThe next section introduces the problem formulation we will be working with. Sec-\ntion 3 presents the LP approximation algorithm and an error bound. 
In Section 4,
we propose a method for computing a desirable restart distribution. The LP
approximation algorithm works with a perturbed version of the MDP. Errors introduced
by this perturbation are studied in Section 5. A closing section discusses relations
to our prior work on LP approaches to approximate DP [6, 8].


2 Problem Formulation and Perturbation Via Restart

Consider an MDP with a countable state space $S$ and a finite set of actions $A$
available at each state. Under a control policy $u : S \to A$, the system dynamics
are defined by a transition probability matrix $P_u \in \Re^{|S| \times |S|}$, where for policies
$u$ and $u'$ and states $x$ and $y$, $(P_u)_{xy} = (P_{u'})_{xy}$ if $u(x) = u'(x)$. We will assume
that, under each policy $u$, the system has a unique invariant distribution, given by
$\pi_u(x) = \lim_{t \to \infty} (P_u^t)_{yx}$, for all $x, y \in S$.

A cost $g(x, a)$ is associated with each state-action pair $(x, a)$. For shorthand, given
any policy $u$, we let $g_u(x) = g(x, u(x))$. We consider the problem of computing a
policy that minimizes the average cost $\lambda_u = \pi_u^T g_u$. Let $\lambda^* = \min_u \lambda_u$ and define
the differential value function $h^*(x) = \min_u \lim_{T \to \infty} E_x^u[\sum_{t=0}^T (g_u(x_t) - \lambda^*)]$. Here,
the superscript $u$ of the expectation operator denotes the control policy and the
subscript $x$ denotes conditioning on $x_0 = x$. It is easy to show that there exists
a policy $u^*$ that simultaneously minimizes the expectation for every $x$. Further, a
policy $u$ is optimal if and only if $u(x) \in \arg\min_a (g(x, a) + \sum_y (P_a)_{xy} h^*(y))$ for
all $x \in S$, where $(P_a)_{xy}$ denotes the transition probability from $x$ to $y$ under action $a$.

While in principle $h^*$ can be computed exactly by dynamic programming algorithms,
this is often infeasible due to the curse of dimensionality. We consider approximating
$h^*$ using a linear combination $\sum_{k=1}^K r_k \phi_k$ of fixed basis functions $\phi_1, \ldots, \phi_K : S \to \Re$.
In this paper, we propose and analyze an algorithm for computing weights
$r \in \Re^K$ to approximate: $h^*(x) \approx \sum_{k=1}^K \phi_k(x) r_k$. 
It is useful to define a matrix
$\Phi \in \Re^{|S| \times K}$ whose $k$th column is $\phi_k$, so that our approximation to $h^*$ can be written as $\Phi r$.

The algorithm we will propose operates on a perturbed version of the MDP. The
nature of the perturbation is influenced by two parameters: a restart probability
$(1 - \alpha) \in [0, 1]$ and a restart distribution $c$ over the state space. We refer to the
new system as an $(\alpha, c)$-perturbed MDP. It evolves similarly to the original MDP,
except that at each time, the state process restarts with probability $1 - \alpha$; in this
event, the next state is sampled randomly according to $c$. Hence, the perturbed
MDP has the same state space, action space, and cost function as the original one,
but the transition matrix under each policy $u$ is given by $P_{\alpha,u} = \alpha P_u + (1 - \alpha) e c^T$,
where $e$ denotes the vector of ones.

We define some notation that will streamline our discussion and analysis of perturbed
MDPs. Let $\pi_{\alpha,u}(x) = \lim_{t \to \infty} (P_{\alpha,u}^t)_{yx}$, $\lambda_{\alpha,u} = \pi_{\alpha,u}^T g_u$, and $\lambda_\alpha^* = \min_u \lambda_{\alpha,u}$.
Let $h_\alpha^*$ be the differential value function for the $(\alpha, c)$-perturbed MDP, and let $u_\alpha^*$
be a policy satisfying $u_\alpha^*(x) \in \arg\min_a (g(x, a) + \sum_y (P_{\alpha,a})_{xy} h_\alpha^*(y))$ for all $x \in S$.
Finally, we will make use of dynamic programming operators $T_{\alpha,u} h = g_u + P_{\alpha,u} h$
and $T_\alpha h = \min_u T_{\alpha,u} h$.


3 The New LP

We now propose a new LP that approximates the differential value function of an
$(\alpha, c)$-perturbed MDP. This LP takes as input several pieces of problem data:

 1. MDP parameters: $g(x, a)$ and $(P_u)_{xy}$ for all $x, y \in S$, $a \in A$, $u : S \to A$.

 2. Perturbation parameters: $\alpha \in [0, 1]$ and $c : S \to [0, 1]$ with $\sum_x c(x) = 1$.

 3. Basis functions: $\Phi = [\phi_1 \cdots \phi_K] \in \Re^{|S| \times K}$.

 4. Slack function and penalty: $\psi : S \to [1, \infty)$ and $\eta > 0$.

We have defined all these terms except for the slack function and penalty, which we
will explain after defining the LP. The LP optimizes decision variables $r \in \Re^K$ and
$s_1, s_2 \in \Re$ according to

 minimize   $s_1 + \eta s_2$                                          (1)
 subject to $T_\alpha \Phi r - \Phi r + s_1 \mathbf{1} + s_2 \psi \geq 0$
            $s_2 \geq 0$.

It is easy to see that this LP is feasible. Further, if $\eta$ is sufficiently large, the
objective is bounded. 
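To make the structure of LP (1) concrete, the following sketch assembles the perturbed transition matrices and the decomposed constraint system for a tiny 2-state, 2-action MDP; all numbers are invented for illustration and are not taken from the paper.

```python
# A minimal sketch (all numbers invented for illustration) of the constraint
# system of LP (1) on a 2-state, 2-action MDP.  The min over actions in
# T_alpha Phi r - Phi r + s1 1 + s2 psi >= 0 decomposes into |A| linear
# constraints per state, giving |S||A| + 1 rows in total.

S, A = 2, 2
alpha = 0.9
c = [0.5, 0.5]                       # restart distribution
g = [[1.0, 4.0], [3.0, 2.0]]         # g[x][a]
P = [                                # P[a][x][y]: transition probs under action a
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.5, 0.5], [0.6, 0.4]],
]
phi = [1.0, 2.0]                     # a single basis function, phi[x]
psi = [1.0, 1.5]                     # slack function, psi(x) >= 1

# Perturbed transitions: P_{alpha,a} = alpha P_a + (1 - alpha) e c^T.
P_alpha = [[[alpha * P[a][x][y] + (1 - alpha) * c[y] for y in range(S)]
            for x in range(S)] for a in range(A)]

def feasible(r, s1, s2, tol=1e-9):
    # check g(x,a) + (P_{alpha,a} phi)(x) r - phi(x) r + s1 + s2 psi(x) >= 0
    # for every state x and action a
    for x in range(S):
        for a in range(A):
            p_phi = sum(P_alpha[a][x][y] * phi[y] for y in range(S))
            if g[x][a] + p_phi * r - phi[x] * r + s1 + s2 * psi[x] < -tol:
                return False
    return s2 >= -tol                # the one remaining explicit constraint

rows = S * A                         # inequality rows from the Bellman constraint
```

In practice one would hand these $|S||A| + 1$ linear rows to an LP solver; the point of the sketch is only that the system is linear in $(r, s_1, s_2)$ once the minimization over actions is decomposed.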
We assume that this is the case and denote an optimal
solution by $(\tilde{r}, \tilde{s}_1, \tilde{s}_2)$. Though the first $|S|$ constraints are nonlinear, each involves
a minimization over actions and therefore can be decomposed into $|A|$ linear constraints.
This results in a total of $|S| \cdot |A| + 1$ constraints, which is unmanageable if the
state space is large. We expect, however, that the solution to this LP can be
approximated closely and efficiently through use of constraint sampling techniques
along the lines discussed in [7].

We now offer an interpretation of the LP. The constraint $T_\alpha \Phi r - \Phi r - \lambda_\alpha^* \mathbf{1} \geq 0$ is
satisfied with equality if and only if $\Phi r = h_\alpha^* + \gamma \mathbf{1}$ for some $\gamma \in \Re$. The terms
$(s_1 + \lambda_\alpha^*) \mathbf{1}$ and $s_2 \psi$ can
be viewed as cost shaping. In particular, they effectively transform the costs $g(x, a)$
to $g(x, a) + s_1 + \lambda_\alpha^* + s_2 \psi(x)$, so that the constraint $T_\alpha \Phi r - \Phi r - \lambda_\alpha^* \mathbf{1} \geq 0$ can be
met.

The LP can alternatively be viewed as an efficient method for minimizing a form
of Bellman error, as we now explain. Suppose that $s_2 = 0$. Then, minimization
of $s_1$ corresponds to minimization of $\|\min(T_\alpha \Phi r - \Phi r - \lambda_\alpha^* \mathbf{1}, 0)\|_\infty$, which can be
viewed as a measure of (one-sided) Bellman error. Measuring error with respect
to the maximum norm is problematic, however, when the state space is large. In
the extreme case, when there is an infinite number of states and an unbounded cost
function, such errors are typically infinite and therefore do not provide a meaningful
objective for optimization. This shortcoming is addressed by the slack term $s_2 \psi$.
To understand its role, consider constraining $s_1$ to be $-\lambda_\alpha^*$ and minimizing $s_2$. This
corresponds to minimization of $\|\min(T_\alpha \Phi r - \Phi r - \lambda_\alpha^* \mathbf{1}, 0)\|_{\infty,1/\psi}$, where the norm
is defined by $\|h\|_{\infty,1/\psi} = \max_x |h(x)| / \psi(x)$. 
This term can be viewed as a measure
of Bellman error with respect to a weighted maximum norm, with weights $1/\psi(x)$.
One important factor that distinguishes our LP from other approaches to Bellman
error minimization [4, 12, 16] is a theoretical performance guarantee, which we now
develop.

For any $r$, let $u_{\alpha,r}(x) \in \arg\min_a (g(x, a) + (P_{\alpha,a} \Phi r)(x))$, let $\pi_{\alpha,r} = \pi_{\alpha,u_{\alpha,r}}$,
and let $\lambda_{\alpha,r} = \pi_{\alpha,r}^T g_{u_{\alpha,r}}$. The following theorem establishes that the difference between
the average cost $\lambda_{\alpha,\tilde{r}}$ associated with an optimal solution $(\tilde{r}, \tilde{s}_1, \tilde{s}_2)$ to
the LP and the optimal average cost $\lambda_\alpha^*$ is proportional to the minimal error
that can be attained given the choice of basis functions. A proof of this
theorem is provided in the appendix of a version of this paper available at
http://www.stanford.edu/~bvr/psfiles/LPnips04.pdf.

Theorem 3.1. If $\eta \geq (2 - \alpha) \pi_{\alpha,u_\alpha^*}^T \psi$ then

 $\lambda_{\alpha,\tilde{r}} - \lambda_\alpha^* \leq \frac{(1 + \beta)\, \eta \max(\zeta, 1)}{1 - \alpha} \min_{r \in \Re^K} \|h_\alpha^* - \Phi r\|_{\infty,1/\psi}$,

where

 $\beta = \max_u \|P_{\alpha,u}\|_{\infty,1/\psi} \equiv \max_u \max_h \frac{\|P_{\alpha,u} h\|_{\infty,1/\psi}}{\|h\|_{\infty,1/\psi}}$,

 $\zeta = \frac{\pi_{\alpha,\tilde{r}}^T (T_\alpha \Phi \tilde{r} - \Phi \tilde{r} + \tilde{s}_1 \mathbf{1} + \tilde{s}_2 \psi)}{c^T (T_\alpha \Phi \tilde{r} - \Phi \tilde{r} + \tilde{s}_1 \mathbf{1} + \tilde{s}_2 \psi)}$.

The bound suggests that the slack function $\psi$ should be chosen so that the basis
functions can offer a reasonably sized approximation error $\|h_\alpha^* - \Phi r\|_{\infty,1/\psi}$. At
the same time, this choice affects the sizes of $\beta$ and $\eta$. The theorem requires that
the penalty $\eta$ be at least $(2 - \alpha) \pi_{\alpha,u_\alpha^*}^T \psi$. The term $\pi_{\alpha,u_\alpha^*}^T \psi$ is the steady-state
expectation of the slack function under an optimal policy. Note that

 $\max_u \|P_{\alpha,u} \psi\|_{\infty,1/\psi} = \max_{u,x} \frac{(P_{\alpha,u} \psi)(x)}{\psi(x)}$,

which is the maximal factor by which the expectation of $\psi$ can increase over a single
time period. When dealing with specific classes of problems it is often possible to
select $\psi$ so that the norm $\|h_\alpha^* - \Phi r\|_{\infty,1/\psi}$ as well as the terms $\max_u \|P_{\alpha,u} \psi\|_{\infty,1/\psi}$
and $\pi_{\alpha,u_\alpha^*}^T \psi$ scale gracefully with the number of states and/or state variables. 
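The quantities in the bound are easy to evaluate on small examples. The following sketch, with invented numbers that are not from the paper, computes $\max_u \|P_{\alpha,u}\psi\|_{\infty,1/\psi}$ as a maximum over policies and states of the one-step expectation $(P_{\alpha,u}\psi)(x)$ divided by $\psi(x)$.

```python
# A sketch (illustrative numbers, not from the paper):
# max_u ||P_{alpha,u} psi||_{infty,1/psi} reduces to a maximum over policies u
# and states x of the one-step expectation (P_{alpha,u} psi)(x) / psi(x).

alpha = 0.9
c = [0.5, 0.5]                       # restart distribution
psi = [1.0, 2.0]                     # slack function, psi(x) >= 1
policies = [                         # one transition matrix per deterministic policy
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.5, 0.5], [0.6, 0.4]],
]

def perturb(Pu):
    # P_{alpha,u} = alpha P_u + (1 - alpha) e c^T
    n = len(Pu)
    return [[alpha * Pu[x][y] + (1 - alpha) * c[y] for y in range(n)]
            for x in range(n)]

def weighted_norm_factor(psi, policies):
    # max over policies and states of (P_{alpha,u} psi)(x) / psi(x)
    worst = 0.0
    for Pu in policies:
        Pa = perturb(Pu)
        for x in range(len(psi)):
            one_step = sum(Pa[x][y] * psi[y] for y in range(len(psi)))
            worst = max(worst, one_step / psi[x])
    return worst

beta = weighted_norm_factor(psi, policies)
```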
This
issue will be addressed further in a forthcoming full-length version of this paper.

It may sometimes be difficult to verify that any particular value of $\eta$ dominates
$(2 - \alpha) \pi_{\alpha,u_\alpha^*}^T \psi$. One approach to selecting $\eta$ is to perform a line search over possible
values of $\eta$, solving an LP in each case, and choosing the value of $\eta$ that results in
the best-performing control policy. A simple line search algorithm solves the LP
successively for $\eta = 1, 2, 4, 8, \ldots$, until the optimal solution is such that $\tilde{s}_2 = 0$. It
is easy to show that the LP is unbounded for all $\eta < 1$, and that there is a finite
$\bar{\eta} = \inf\{\eta \mid \tilde{s}_2 = 0\}$ such that for each $\eta \geq \bar{\eta}$, the solution is identical and $\tilde{s}_2 = 0$.
This search process delivers a policy that is at least as good as a policy generated
by the LP for some $\eta \in [(2 - \alpha) \pi_{\alpha,u_\alpha^*}^T \psi, \ 2(2 - \alpha) \pi_{\alpha,u_\alpha^*}^T \psi]$, and the upper bound of
Theorem 3.1 would hold with $\eta$ replaced by $2(2 - \alpha) \pi_{\alpha,u_\alpha^*}^T \psi$.

We have discussed all but two terms involved in the bound: $\zeta$ and $1/(1 - \alpha)$. Note
that if $c = \pi_{\alpha,\tilde{r}}$, then $\zeta = 1$. In the next section, we discuss an approach that aims
at choosing $c$ to be close enough to $\pi_{\alpha,\tilde{r}}$ so that $\zeta$ is approximately 1. In Section
5, we discuss how the restart probability $1 - \alpha$ should be chosen in order to ensure
that policies for the perturbed MDP offer similar performance when applied to the
original MDP. This choice determines the magnitude of $1/(1 - \alpha)$.


4 Fixed Points and Path Following

The coefficient $\zeta$ would be equal to 1 if $c$ were equal to $\pi_{\alpha,\tilde{r}}$. We cannot simply
choose $c$ to be equal to $\pi_{\alpha,\tilde{r}}$, since $\pi_{\alpha,\tilde{r}}$ depends on $\tilde{r}$, an outcome of the LP which
depends on $c$. Rather, arriving at a distribution $c$ such that $c = \pi_{\alpha,\tilde{r}}$ is a fixed-point
problem. In this section, we explore a path-following algorithm for approximating
such a fixed point [9], with the aim of arriving at a value of $\zeta$ that is close to one.

Consider solving a sequence indexed by $i = 1, \ldots$
, $M$ of $(\alpha_i, c_i)$-perturbed MDPs.
Let $\tilde{r}_i$ denote the weight vector associated with an optimal solution to the LP (1)
with perturbation parameters $(\alpha_i, c_i)$. Let $\alpha_1 = 0$ and $\alpha_{i+1} = \alpha_i + \delta$ for $i \geq 1$,
where $\delta$ is a small positive step size. For any initial choice of $c_1$, we have $c_1 = \pi_{\alpha_1,\tilde{r}_1}$,
since the system resets in every time period. For $i \geq 1$, let $c_{i+1} = \pi_{\alpha_i,\tilde{r}_i}$. One might
hope that the change in $c_i$ is gradual, and therefore, $c_i \approx \pi_{\alpha_i,\tilde{r}_i}$ for each $i$.

We cannot yet offer rigorous theoretical support for the proposed path-following
algorithm. However, we will present promising results from a simple computational
experiment. This experiment involves a problem with continuous state and action
spaces. Though our main result, Theorem 3.1, applies to problems with countable
state spaces and finite action spaces, there is no reason why the LP cannot be applied
to broader classes of problems such as the one we now describe. Consider a scalar
state process $x_{t+1} = x_t + a_t + w_t$, driven by scalar actions $a_t$ and a sequence $w_t$ of
i.i.d. zero-mean unit-variance normal random variables. Consider a cost function
$g(x, a) = (x - 2)^2 + a^2$. We aim at approximating the differential value function
using a single basis function $\phi(x) = x^2$. Hence, $(\Phi r)(x) = r x^2$, with $r \in \Re$. We will
use a slack function $\psi(x) = 1 + x^2$ and penalty $\eta = 5$. The special structure of this
problem allows for exact solution of the LP (1) as well as exact computation
of the parameter $\zeta$, though we will not explain here how this is done. Figure 1
plots $\zeta$ versus $\alpha$, as $\alpha$ is increased from 0 to 0.99, with $c$ initially set to a zero-mean
normal distribution with variance 4. The three curves represent results from using
three different step sizes $\delta \in \{0.01, 0.005, 0.0025\}$. Note that in all cases, $\zeta$ is very
close to 1. 
Smaller values of $\delta$ resulted in curves closer to 1: the lowest curve
corresponds to $\delta = 0.01$ and the highest curve corresponds to $\delta = 0.0025$.


 Figure 1: Evolution of $\zeta$ with $\delta \in \{0.01, 0.005, 0.0025\}$.


5 The Impact of Perturbation

Some simple algebra shows that for any policy $u$,

 $\lambda_{\alpha,u} - \lambda_u = (1 - \alpha) \sum_{t=0}^\infty \alpha^t \left( c^T P_u^t g_u - \pi_u^T g_u \right)$.

When the state space is finite, $|c^T P_u^t g_u - \pi_u^T g_u|$ decays at a geometric rate. This
is also true in many practical contexts involving infinite state spaces. One might
think of $m_u = \sum_{t=0}^\infty |c^T P_u^t g_u - \pi_u^T g_u|$ as the mixing time of the policy $u$ when the
initial state is drawn according to the restart distribution $c$. This mixing time is
finite if the differences $c^T P_u^t g_u - \pi_u^T g_u$ converge geometrically. Further, we have
$|\lambda_{\alpha,u} - \lambda_u| \leq m_u (1 - \alpha)$, and coming back to the LP, this implies that

 $\lambda_{u_{\alpha,\tilde{r}}} - \lambda^* \leq \lambda_{\alpha,\tilde{r}} - \lambda_\alpha^* + (1 - \alpha) \left( m_{u_{\alpha,\tilde{r}}} + \max(m_{u^*}, m_{u_\alpha^*}) \right)$.

Combined with the bound of Theorem 3.1, this offers a performance bound for the
policy $u_{\alpha,\tilde{r}}$ applied to the original MDP. Note that when $c = \pi_{\alpha,\tilde{r}}$, in the spirit
discussed in Section 4, we have $m_{u_{\alpha,\tilde{r}}} = 0$. For simplicity, we will assume in the
rest of this section that $m_{u_{\alpha,\tilde{r}}} = 0$ and $m_{u_\alpha^*} \leq m_{u^*}$, so that

 $\lambda_{u_{\alpha,\tilde{r}}} - \lambda^* \leq \lambda_{\alpha,\tilde{r}} - \lambda_\alpha^* + (1 - \alpha) m_{u^*}$.

Let us turn to discuss how $\alpha$ should be chosen. This choice must strike a balance
between two factors: the coefficient $1/(1 - \alpha)$ in the bound of Theorem 3.1 and
the loss of $(1 - \alpha) m_{u^*}$ associated with the perturbation. One approach is to fix some
$\epsilon > 0$ that we are willing to accept as an absolute performance loss, and then choose
$\alpha$ so that $(1 - \alpha) m_{u^*} \leq \epsilon$. Then, we would have $1/(1 - \alpha) \geq m_{u^*}/\epsilon$. Note that the
term $1/(1 - \alpha)$ multiplying the right-hand side of the bound can then be thought
of as a constant multiple of the mixing time of $u^*$. 
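This choice of $\alpha$ can be carried out numerically once the mixing time is estimated by truncating its defining sum. A sketch on a toy two-state chain follows; all numbers are invented for illustration and not from the paper.

```python
# Sketch (toy chain, invented numbers): estimate the mixing time
#   m_u = sum_t |c^T P_u^t g_u - pi_u^T g_u|
# by truncation, then pick alpha so the perturbation loss (1 - alpha) m_u
# equals a tolerance eps, as suggested in Section 5.

P_u = [[0.8, 0.2], [0.3, 0.7]]
g_u = [1.0, 3.0]
c = [1.0, 0.0]                       # restart distribution
pi_u = [0.6, 0.4]                    # invariant distribution of P_u

avg = sum(pi_u[x] * g_u[x] for x in range(2))     # pi_u^T g_u

def step(dist, P):
    # one application of the transition matrix to a distribution
    return [sum(dist[x] * P[x][y] for x in range(2)) for y in range(2)]

m_u, dist = 0.0, c[:]
for _ in range(200):                 # geometric decay makes truncation error tiny
    m_u += abs(sum(dist[x] * g_u[x] for x in range(2)) - avg)
    dist = step(dist, P_u)

eps = 0.1
alpha = 1.0 - eps / m_u              # makes (1 - alpha) m_u = eps
```

On this chain the deviation $c^T P_u^t g_u - \pi_u^T g_u$ shrinks by a factor of $0.5$ per step, so the truncated sum converges to $m_u = 1.6$ and the rule gives $\alpha = 0.9375$.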
An important open question is
whether it is possible to design an approximate DP algorithm and establish for that
algorithm an error bound that does not depend on the mixing time in this way.


6 Relation to Prior Work

In closing, it is worth discussing how our new algorithm and results relate to our
prior work on LP approaches to approximate DP [6, 8]. If we remove the slack
function by setting $s_2$ to zero and let $s_1 = -(1 - \alpha) c^T \Phi r$, our LP (1) becomes

 maximize   $c^T \Phi r$                                              (2)
 subject to $\min_u (g_u + \alpha P_u \Phi r) - \Phi r \geq 0$,

which is precisely the LP considered in [6] for approximating the optimal cost-to-go
function in a discounted MDP with discount factor $\alpha$. Let $\hat{r}$ be an optimal solution
to (2). For any function $V : S \to \Re_+$, let $\beta_V = \alpha \max_u \|P_u V\|_{\infty,1/V}$. We call $V$
a Lyapunov function if $\beta_V < 1$. The following result can be established using an
analysis entirely analogous to that carried out in [6]:

Theorem 6.1. If $\beta_{\Phi v} < 1$ and $\Phi \bar{v} = \mathbf{1}$ for some $v, \bar{v} \in \Re^K$, then

 $\lambda_{\alpha,\hat{r}} - \lambda_\alpha^* \leq \frac{2 c^T \Phi v}{1 - \beta_{\Phi v}} \min_{r \in \Re^K} \|h_\alpha^* - \Phi r\|_{\infty,1/\Phi v}$.

A comparison of Theorems 3.1 and 6.1 reveals benefits afforded by the slack function.
We consider the situation where $\psi = \Phi v$, which makes the bounds directly
comparable. An immediate observation is that, even though $\psi$ and $\Phi v$ play analogous
roles in the bounds, $\psi$ is not required to be a Lyapunov function. In this sense,
Theorem 3.1 is stronger than Theorem 6.1. Moreover, if $\eta = (2 - \alpha) \pi_{\alpha,u_\alpha^*}^T \psi$, we have

 $\frac{\eta}{1 - \alpha} = (2 - \alpha)\, c^T (I - \alpha P_{u_\alpha^*})^{-1} \psi$ and $\frac{c^T \Phi v}{1 - \beta_{\Phi v}} \geq \max_u c^T (I - \alpha P_u)^{-1} \Phi v$,

since $\pi_{\alpha,u}^T = (1 - \alpha) c^T (I - \alpha P_u)^{-1}$. Hence, the factor $c^T \Phi v / (1 - \beta_{\Phi v})$ in the bound of
Theorem 6.1 grows with the largest mixing time among all policies, whereas the
factor $\eta/(1 - \alpha)$ in the bound of Theorem 3.1 depends only on the mixing time of
an optimal policy.

As discussed in [6], appropriate choice of $c$, there referred to as the state-relevance
weights, can be important for the error bound of Theorem 6.1 to scale well with the
number of states. 
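The mixing-time comparison in this section uses a standard identity for restart chains, $\pi_{\alpha,u}^T = (1 - \alpha) c^T (I - \alpha P_u)^{-1}$, which is easy to verify numerically. A sketch on a toy two-state chain (illustrative numbers only):

```python
# Sketch verifying the resolvent identity for a restart chain:
# the invariant distribution of P_{alpha,u} = alpha P_u + (1 - alpha) e c^T
# satisfies pi_{alpha,u}^T = (1 - alpha) c^T (I - alpha P_u)^{-1}.

alpha = 0.9
c = [0.7, 0.3]
P = [[0.8, 0.2], [0.3, 0.7]]

# Left side: invariant distribution of the perturbed chain by power iteration.
Pa = [[alpha * P[x][y] + (1 - alpha) * c[y] for y in range(2)] for x in range(2)]
pi = [0.5, 0.5]
for _ in range(5000):
    pi = [sum(pi[x] * Pa[x][y] for x in range(2)) for y in range(2)]

# Right side: (1 - alpha) c^T (I - alpha P)^{-1} via the explicit 2x2 inverse.
a11, a12 = 1 - alpha * P[0][0], -alpha * P[0][1]
a21, a22 = -alpha * P[1][0], 1 - alpha * P[1][1]
det = a11 * a22 - a12 * a21
inv = [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]
rhs = [(1 - alpha) * sum(c[x] * inv[x][y] for x in range(2)) for y in range(2)]
```

The identity follows from $\pi^T P_{\alpha,u} = \pi^T$, which rearranges to $\pi^T (I - \alpha P_u) = (1 - \alpha) c^T$.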
In [8], it is argued that some form of weighting of states in terms
of a metric of relevance should continue to be important when considering average-cost
problems. An LP-based algorithm is also presented in [8], but the results are
far weaker than the ones we have presented in this paper, and we suspect that the
LP-based algorithm of [8] will not scale well to high-dimensional problems.

Some guidance is offered in [6] regarding how $c$ might be chosen. However, this
is ultimately left as a manual task. An important contribution of this paper is
the path-following algorithm proposed in Section 4, which aims at automating an
effective choice of $c$.


Acknowledgments

This research was supported in part by the NSF under CAREER Grant ECS-9985229
and by the ONR under grant MURI N00014-00-1-0637.


References

 [1] D. Adelman, \"A Price-Directed Approach to Stochastic Inventory/Routing,\"
 preprint, 2002, to appear in Operations Research.

 [2] L. C. Baird, \"Residual Algorithms: Reinforcement Learning with Function
 Approximation,\" ICML, 1995.

 [3] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena
 Scientific, Belmont, MA, 1996.

 [4] D. P. Bertsekas, Dynamic Programming and Optimal Control, second edition,
 Athena Scientific, Belmont, MA, 2001.

 [5] J. A. Boyan and A. W. Moore, \"Generalization in Reinforcement Learning:
 Safely Approximating the Value Function,\" NIPS, 1995.

 [6] D. P. de Farias and B. Van Roy, \"The Linear Programming Approach to
 Approximate Dynamic Programming,\" Operations Research, Vol. 51, No. 6,
 November-December 2003, pp. 850-865. Preliminary version appeared in NIPS, 2001.

 [7] D. P. de Farias and B. Van Roy, \"On Constraint Sampling in the Linear
 Programming Approach to Approximate Dynamic Programming,\" Mathematics
 of Operations Research, Vol. 29, No. 3, 2004, pp. 462-478.

 [8] D. P. de Farias and B. 
Van Roy, \"Approximate Linear Programming for\n Average-Cost Dynamic Programming,\" NIPS, 2003.\n [9] C. B. Garcia and W. I. Zangwill, Pathways to Solutions, Fixed Points, and\n Equilibria, Prentice-Hall, Englewood Cliffs, NJ, 1981.\n\n[10] G. J. Gordon, \"Stable Function Approximation in Dynamic Programming,\"\n ICML, 1995.\n[11] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman, \"Efficient Solution\n Algorithms for Factored MDPs,\" Journal of Artificial Intelligence Research,\n Volume 19, 2003, pp. 399-468. Preliminary version appeared in NIPS, 2001.\n\n[12] M. E. Harmon, L. C. Baird, and A. H. Klopf, \"Advantage Updating Applied\n to a Differential Game,\" NIPS 1995.\n[13] R. Munos, \"Error Bounds for Approximate Policy Iteration,\" ICML, 2003.\n[14] J. N. Tsitsiklis and B. Van Roy, \"Feature-Based Methods for Large Scale Dy-\n namic Programming,\" Machine Learning, Vol. 22, 1996, pp. 59-94.\n\n[15] D. Schuurmans and R. Patrascu, \"Direct Value Approximation for Factored\n MDPs,\" NIPS, 2001.\n\n[16] P. J. Schweitzer and A. Seidman, \"Generalized Polynomial Approximation in\n Markovian Decision Processes,\" Journal of Mathematical Analysis and Appli-\n cations, Vol. 110, `985, pp. 568-582.\n\n\f\n", "award": [], "sourceid": 2621, "authors": [{"given_name": "Daniela", "family_name": "Farias", "institution": null}, {"given_name": "Benjamin", "family_name": "Roy", "institution": null}]}