{"title": "Approximate Dynamic Programming via Linear Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 689, "page_last": 695, "abstract": null, "full_text": "Approximate Dynamic Programming \nvia Linear Programming \n\nDaniela P. de Farias \nDepartment of Management Science and Engineering \nStanford University \nStanford, CA 94305 \npucci@stanford.edu \n\nBenjamin Van Roy \nDepartment of Management Science and Engineering \nStanford University \nStanford, CA 94305 \nbvr@stanford.edu \n\nAbstract \n\nThe curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach \"fits\" a linear combination of pre-selected basis functions to the dynamic programming cost-to-go function. We develop bounds on the approximation error and present experimental results in the domain of queueing network control, providing empirical support for the methodology. \n\n1 Introduction \n\nDynamic programming offers a unified approach to solving problems of stochastic control. Central to the methodology is the cost-to-go function, which can be obtained by solving Bellman's equation. The domain of the cost-to-go function is the state space of the system to be controlled, and dynamic programming algorithms compute and store a table consisting of one cost-to-go value per state. Unfortunately, the size of a state space typically grows exponentially in the number of state variables. Known as the curse of dimensionality, this phenomenon renders dynamic programming intractable for problems of practical scale. 
\n\nOne approach to dealing with this difficulty is to generate an approximation within a parameterized class of functions, in a spirit similar to that of statistical regression. The focus of this paper is on linearly parameterized functions: one tries to approximate the cost-to-go function J* by a linear combination of prespecified basis functions. Note that this scheme depends on two important preconditions for the development of an effective approximation. First, we need to choose basis functions that can closely approximate the desired cost-to-go function. In this respect, a suitable choice requires some practical experience or theoretical analysis that provides rough information on the shape of the function to be approximated. \"Regularities\" associated with the function, for example, can guide the choice of representation. Second, we need an efficient algorithm that computes an appropriate linear combination. \n\nThe algorithm we study is based on a linear programming formulation, originally proposed by Schweitzer and Seidmann [5], that generalizes the linear programming approach to exact dynamic programming, originally introduced by Manne [4]. We present an error bound that characterizes the quality of approximations produced by the linear programming approach. The error is characterized in relative terms, compared against the \"best possible\" approximation of the optimal cost-to-go function given the selection of basis functions. This is the first such error bound for any algorithm that approximates cost-to-go functions of general stochastic control problems by computing weights for arbitrary collections of basis functions. \n\n2 Stochastic control and linear programming \n\nWe consider discrete-time stochastic control problems involving a finite state space S of cardinality |S| = N. For each state x ∈ S, there is a finite set of available actions A_x. 
Taking action a ∈ A_x when the current state is x incurs cost g_a(x). State transition probabilities P_a(x,y) represent, for each pair (x,y) of states and each action a ∈ A_x, the probability that the next state will be y given that the current state is x and the current action is a. \n\nA policy u is a mapping from states to actions. Given a policy u, the dynamics of the system follow a Markov chain with transition probabilities P_{u(x)}(x,y). For each policy u, we define a transition matrix P_u whose (x,y)th entry is P_{u(x)}(x,y). \n\nThe problem of stochastic control amounts to selection of a policy that optimizes a given criterion. In this paper, we will employ as an optimality criterion the infinite-horizon discounted cost \n\nJ_u(x) = E[ Σ_{t=0}^∞ α^t g_u(x_t) | x_0 = x ], \n\nwhere g_u(x) is used as shorthand for g_{u(x)}(x) and the discount factor α ∈ (0,1) reflects inter-temporal preferences. Optimality is attained by any policy that is greedy with respect to the optimal cost-to-go function J*(x) = min_u J_u(x) (a policy u is called greedy with respect to J if T_u J = T J). \n\nLet us define operators T_u and T by T_u J = g_u + α P_u J and T J = min_u (g_u + α P_u J). The optimal cost-to-go function is the unique solution of Bellman's equation J = T J. Dynamic programming offers a number of approaches to solving this equation; one of particular relevance to our paper makes use of linear programming, as we will now discuss. Consider the problem \n\nmax c'J \ns.t. T J ≥ J, (1) \n\nwhere c is a vector with positive components, which we will refer to as state-relevance weights. It can be shown that any feasible J satisfies J ≤ J*. It follows that, for any set of positive weights c, J* is the unique solution to (1). 
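\n\nProblem (1) is small enough to solve directly when the state space is small. The sketch below is our own illustration, not part of the paper: it builds the exact LP for a randomly generated toy MDP (the instance, variable names, and the use of scipy.optimize.linprog are all our choices) and cross-checks the solution against value iteration.

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP: N states, A actions; P[a] is an N x N transition matrix, g[a] a cost vector.
rng = np.random.default_rng(0)
N, A, alpha = 4, 2, 0.9
P = [rng.dirichlet(np.ones(N), size=N) for _ in range(A)]   # each row sums to 1
g = [rng.uniform(0.0, 1.0, size=N) for _ in range(A)]
c = np.ones(N) / N                                          # positive state-relevance weights

# Exact LP:  max c'J  s.t.  g_a(x) + alpha * sum_y P_a(x,y) J(y) >= J(x)  for all x, a.
# linprog minimizes, so we pass -c; each constraint block becomes (I - alpha*P_a) J <= g_a.
A_ub = np.vstack([np.eye(N) - alpha * P[a] for a in range(A)])
b_ub = np.concatenate(g)
res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * N)
J_lp = res.x

# Cross-check: value iteration on the same MDP converges to the same J*.
J = np.zeros(N)
for _ in range(2000):
    J = np.min([g[a] + alpha * P[a] @ J for a in range(A)], axis=0)
assert np.allclose(J_lp, J, atol=1e-4)
```

Because every feasible J satisfies J ≤ J* componentwise and c is positive, the maximizer of c'J over the feasible set is J* itself, which is what the cross-check verifies.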
\n\nNote that each constraint (T J)(x) ≥ J(x) is equivalent to a set of constraints g_a(x) + α Σ_{y∈S} P_a(x,y) J(y) ≥ J(x) for all a ∈ A_x, so that the optimization problem (1) can be represented as an LP, which we refer to as the exact LP. \n\nAs mentioned in the introduction, state spaces for practical problems are enormous due to the curse of dimensionality. Consequently, the linear program of interest involves prohibitively large numbers of variables and constraints. The approximation algorithm we study reduces dramatically the number of variables. \n\nLet us now introduce the linear programming approach to approximate dynamic programming. Given pre-selected basis functions φ_1, ..., φ_K, define a matrix Φ = [φ_1 ... φ_K]. With an aim of computing a weight vector r̃ ∈ ℝ^K such that Φr̃ is a close approximation to J*, one might pose the following optimization problem: \n\nmax c'Φr \ns.t. TΦr ≥ Φr. (2) \n\nGiven a solution r̃, one might then hope to generate near-optimal decisions by using a policy that is greedy with respect to Φr̃. \n\nAs with the case of exact dynamic programming, the optimization problem (2) can be recast as a linear program. We will refer to this problem as the approximate LP. Note that, though the number of variables is reduced to K, the number of constraints remains as large as in the exact LP. Fortunately, we expect that most of the constraints will become irrelevant, and solutions to the linear program can be approximated efficiently, as demonstrated in [3]. \n\n3 Error Bounds for the Approximate LP \n\nWhen the optimal cost-to-go function lies within the span of the basis functions, solution of the approximate LP yields the exact optimal cost-to-go function. Unfortunately, it is difficult in practice to select a set of basis functions that contains the optimal cost-to-go function within its span. 
Instead, basis functions must be based on heuristics and simplified analyses. One can only hope that the span comes close to the desired cost-to-go function. \n\nFor the approximate LP to be useful, it should deliver good approximations when the cost-to-go function is near the span of selected basis functions. In this section, we present a bound that ensures desirable results of this kind. \n\nTo set the stage for development of an error bound, let us establish some notation. First, we introduce the weighted norms \n\n||J||_{1,γ} = Σ_{x∈S} γ(x) |J(x)|,  ||J||_{∞,γ} = max_{x∈S} γ(x) |J(x)|, \n\nfor any γ : S → ℝ+. Note that both norms allow for uneven weighting of errors across the state space. \n\nWe also introduce an operator H, defined by \n\n(HV)(x) = max_{a∈A_x} Σ_y P_a(x,y) V(y), \n\nfor all V : S → ℝ. For any V, (HV)(x) represents the maximum expected value of V(y) if the current state is x and y is a random variable representing the next state. Based on this operator, we define a scalar \n\nk_V = max_x V(x) / (V(x) − α(HV)(x)), (3) \n\nfor each V : S → ℝ. \n\nWe interpret the argument V of H as a \"Lyapunov function,\" while we view k_V as a \"Lyapunov stability factor,\" in a sense that we will now explain. In the upcoming theorem, we will only be concerned with functions V that are positive and that make k_V nonnegative. Also, our error bound for the approximate LP will grow proportionately with k_V, and we therefore want k_V to be small. At a minimum, k_V should be finite, which translates to the condition \n\nα(HV)(x) < V(x) for all x ∈ S. (4) \n\nIf α were equal to 1, this would look like a Lyapunov stability condition: the maximum expected value (HV)(x) at the next time step must be less than the current value V(x). In general, α is less than 1, and this introduces some slack in the condition. 
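\n\nThe operator H and the factor k_V are cheap to evaluate on a small MDP. The sketch below is our own illustration (the toy transition matrices and function names are not from the paper); it implements definition (3) and checks it on a constant Lyapunov function, for which HV = V and hence k_V = 1/(1 − α).

```python
import numpy as np

# Toy transition structure: A actions, each with an N x N stochastic matrix.
rng = np.random.default_rng(1)
N, A, alpha = 5, 2, 0.95
P = [rng.dirichlet(np.ones(N), size=N) for _ in range(A)]

def H(V):
    # (HV)(x) = max_a sum_y P_a(x, y) V(y): the largest expected next-step value of V.
    return np.max([P[a] @ V for a in range(A)], axis=0)

def k_factor(V):
    # k_V = max_x V(x) / (V(x) - alpha * (HV)(x)); finite only when alpha*HV < V, condition (4).
    denom = V - alpha * H(V)
    assert np.all(denom > 0), "condition (4) fails: V is not a valid Lyapunov function here"
    return np.max(V / denom)

V = np.ones(N)   # constant V: rows of P sum to 1, so HV = V and k_V = 1/(1 - alpha)
assert np.isclose(k_factor(V), 1.0 / (1.0 - alpha))
```

A V that shrinks in expectation under every action (αHV well below V) yields a smaller k_V, matching the interpretation of k_V as a stability factor.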
Note also that k_V becomes smaller as the (HV)(x)'s become small relative to the V(x)'s. Hence, k_V conveys a degree of \"stability,\" with smaller values representing stronger stability. \n\nWe are now ready to state our main result. For any given function V mapping S to positive reals, we use 1/V as shorthand for the function x ↦ 1/V(x). \n\nTheorem 3.1 [2] Let r̃ be a solution of the approximate LP. Then, for any v ∈ ℝ^K such that (Φv)(x) > 0 for all x ∈ S and αHΦv < Φv, \n\n||J* − Φr̃||_{1,c} ≤ 2 k_{Φv} (c'Φv) min_r ||J* − Φr||_{∞,1/Φv}. (5) \n\nA proof of Theorem 3.1 can be found in the long version of this paper [2]. \n\nWe highlight some implications of Theorem 3.1. First, the error bound (5) tells us that the approximation error yielded by the approximate LP is proportional to the error of the best possible approximation relative to the norm ||·||_{∞,1/Φv}. Hence we expect that the approximate LP will have reasonable behavior: if the choice of basis functions is appropriate, the approximate LP should yield a relatively good approximation to the cost-to-go function, as long as the constants k_{Φv} and c'Φv remain small. \n\nNote that on the left-hand side of (5), we measure the approximation error with the weighted norm ||·||_{1,c}. Recall that the weight vector c appears in the objective function of the approximate LP (2) and must be chosen. In approximating the solution to a given stochastic control problem, it seems sensible to weight more heavily portions of the state space that are visited frequently, so that accuracy will be emphasized in such regions. As discussed in [2], it seems reasonable that the weight vector c should be chosen to reflect the relative importance of each state. \n\nFinally, note that the Lyapunov function Φv plays a central role in the bound of Theorem 3.1. Its choice influences three terms on the right-hand side of the bound: \n\n1. the error min_r ||J* − Φr||_{∞,1/Φv}; \n2. 
the Lyapunov stability factor k_{Φv}; \n3. the inner product c'Φv with the state-relevance weights. \n\nAn appropriately chosen Lyapunov function should make all three of these terms relatively small. Furthermore, for the bound to be useful in practical contexts, these terms should not grow much with problem size. We now illustrate with an application in queueing problems how a suitable Lyapunov function could be found, and show how these terms scale with problem size. \n\n3.1 Example: A Queueing Network \n\nConsider a single reentrant line with d queues and finite buffers of size B. We assume that exogenous arrivals occur at queue 1 with probability p < 1/2. The state x indicates the number of jobs in each queue. The cost per stage incurred at state x is g(x) = (1/d) e'x, the average number of jobs per queue. \n\nAs discussed in [2], under certain stability assumptions we expect that the optimal cost-to-go function should satisfy \n\n0 ≤ J*(x) ≤ (ρ2/d) x'x + (ρ1/d) e'x + ρ0, \n\nfor some positive scalars ρ0, ρ1 and ρ2 independent of d. We consider a Lyapunov function V(x) = (1/d) x'x + C for some constant C > 0, which implies \n\nmin_r ||J* − Φr||_{∞,1/V} ≤ ||J*||_{∞,1/V} ≤ max_{x≥0} (ρ2 x'x + ρ1 e'x + dρ0) / (x'x + dC) ≤ ρ2 + ρ1 + ρ0/C, \n\nand the above bound is independent of the number of queues in the system. \n\nNow let us study k_V. We have \n\nα(HV)(x) ≤ α [ p ((1/d)(x'x + 2x_1 + 1) + C) + (1 − p) ((1/d) x'x + C) ] \n≤ V(x) (α + αp (2x_1 + 1)/(x'x + dC)), \n\nand it is clear that, for C sufficiently large and independent of d, there is a β < 1 independent of d such that αHV ≤ βV, and therefore k_V ≤ 1/(1 − β). \n\nFinally, let us consider c'V. Discussion presented in [2] suggests that one might want to choose c so as to reflect the stationary state distribution. 
We expect that under some stability assumptions, the tail of the stationary state distribution will have an upper bound with geometric decay [1]. Therefore we let c(x) = ((1 − ρ)/(1 − ρ^{B+1}))^d ρ^{|x|}, for some 0 < ρ < 1. In this case, c is equivalent to the joint distribution of d independent and identically distributed geometric random variables conditioned on the event that they are all less than B + 1, and we have \n\nc'V = E[ (1/d) Σ_i X_i² + C | X_i < B + 1, i = 1, ..., d ] ≤ 2ρ²/(1 − ρ)² + ρ/(1 − ρ) + C, \n\nwhere X_i, i = 1, ..., d are identically distributed geometric random variables with parameter 1 − ρ. It follows that c'V is uniformly bounded over the number of queues. \n\nThis example shows that the terms involved in the error bound (5) are uniformly bounded both in the number of states in the system and in the number of state variables; hence the behavior of the approximate LP does not deteriorate as the problem size increases. \n\nWe finally present a numerical experiment to further illustrate the performance of the approximate LP. \n\n[Figure 1: System for Example 3.2 -- a queueing network with eight queues; the arrival probabilities λ_i and departure probabilities μ_i (expressed as fractions of 11.5) are indicated in the original figure.] \n\n[Table 1: Average number of jobs after 50,000,000 simulation steps, for each policy; the numeric entries are not recoverable from this copy.] \n\n3.2 An Eight-Dimensional Queueing Network \n\nWe consider a queueing network with eight queues. The system is depicted in Figure 1, with arrival (λ_i, i = 1, 2) and departure (μ_i, i = 1, ..., 8) probabilities indicated. The state x records the number of jobs in each queue. The cost per stage is g(x) = |x|, and the discount factor α is 0.995. Actions a ∈ {0, 1}^8 indicate which queues are being served; a_i = 1 iff a job from queue i is being processed. 
We consider only non-idling policies and, at each time step, a server processes jobs from one of its queues exclusively. \n\nWe choose c of the form c(x) = (1 − ρ)^8 ρ^{|x|}. The basis functions are chosen to span all polynomials in x of degree 2; therefore, the approximate LP has 47 variables. Constraints (TΦr)(x) ≥ (Φr)(x) for the approximate LP are generated by sampling 5000 states according to the distribution associated with c. Experiments were performed for ρ = 0.85, 0.9 and 0.95, and ρ = 0.9 yielded the policy with smallest average cost. \n\nWe compared the performance of the policy yielded by the approximate LP (ALP) with that of first-in-first-out (FIFO), last-buffer-first-serve (LBFS)¹ and a policy that serves the longest queue in each server (LONG). The average number of jobs in the system for each policy was estimated by simulation. Results are shown in Table 1. The policy generated by the approximate LP performs significantly better than each of the heuristics, yielding more than 10% improvement over LBFS, the second best policy. We expect that even better results could be obtained by refining the choice of basis functions and state-relevance weights. \n\n¹ LBFS serves the job that is closest to leaving the system; for example, if there are jobs in queue 2 and in queue 6, a job from queue 2 is processed, since it will leave the system after going through only one more queue, whereas the job from queue 6 will still have to go through two more queues. We also choose to assign higher priority to queue 8 than to queue 3, since queue 8 has a higher departure probability. \n\n4 Closing Remarks and Open Issues \n\nIn this paper we studied the linear programming approach to approximate dynamic programming for stochastic control problems as a means of alleviating the curse of dimensionality. We provided an error bound based on certain assumptions on the basis functions. 
The bounds were shown to be uniformly bounded in the number of states and state variables in certain queueing problems. \n\nSeveral questions remain open and are the object of future investigation: Can the state-relevance weights in the objective function be chosen in some adaptive way? Can we add robustness to the approximate LP algorithm to account for errors in the estimation of costs and transition probabilities, i.e., design an alternative LP with meaningful performance bounds when problem parameters are only known to lie in a certain range? How do our results extend to the average-cost case? How do our results extend to the infinite-state case? How does the quality of the approximate value function, measured by the weighted L1 norm, translate into actual performance of the associated greedy policy? \n\nAcknowledgements \n\nThis research was supported by NSF CAREER Grant ECS-9985229, by the ONR under Grant MURI N00014-00-1-0637, and by an IBM Research Fellowship. \n\nReferences \n\n[1] Bertsimas, D., Gamarnik, D. & Tsitsiklis, J., \"Performance of Multiclass Markovian Queueing Networks via Piecewise Linear Lyapunov Functions,\" submitted to Annals of Applied Probability, 2000. \n\n[2] de Farias, D.P. & Van Roy, B., \"The Linear Programming Approach to Approximate Dynamic Programming,\" submitted for publication, 2001. \n\n[3] de Farias, D.P. & Van Roy, B., \"On Constraint Sampling for Approximate Linear Programming,\" submitted for publication, 2001. \n\n[4] Manne, A.S., \"Linear Programming and Sequential Decisions,\" Management Science 6, No. 3, pp. 259-267, 1960. \n\n[5] Schweitzer, P. & Seidmann, A., \"Generalized Polynomial Approximations in Markovian Decision Processes,\" Journal of Mathematical Analysis and Applications 110, pp. 568-582, 1985. 
", "award": [], "sourceid": 2129, "authors": [{"given_name": "Daniela P.", "family_name": "de Farias", "institution": "Stanford University"}, {"given_name": "Benjamin", "family_name": "Van Roy", "institution": "Stanford University"}]}