{"title": "Approximate Linear Programming for Average-Cost Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1619, "page_last": 1626, "abstract": null, "full_text": "Approximate Linear Programming for\nAverage-Cost Dynamic Programming\n\nDaniela Pucci de Farias\n\nIBM Almaden Research Center\n\n650 Harry Road, San Jose, CA 95120\n\npucci@mit.edu\n\nDepartment of Management Science and Engineering\n\nBenjamin Van Roy\n\nStanford University\nStanford, CA 94305\n\nbvr@stanford.edu\n\nAbstract\n\nThis paper extends our earlier analysis on approximate linear program-\nming as an approach to approximating the cost-to-go function in a\ndiscounted-cost dynamic program [6].\nIn this paper, we consider the\naverage-cost criterion and a version of approximate linear programming\nthat generates approximations to the optimal average cost and differential\ncost function. We demonstrate that a naive version of approximate linear\nprogramming prioritizes approximation of the optimal average cost and\nthat this may not be well-aligned with the objective of deriving a policy\nwith low average cost. For that, the algorithm should aim at producing a\ngood approximation of the differential cost function. We propose a two-\nphase variant of approximate linear programming that allows for external\ncontrol of the relative accuracy of the approximation of the differential\ncost function over different portions of the state space via state-relevance\nweights. Performance bounds suggest that the new algorithm is compat-\nible with the objective of optimizing performance and provide guidance\non appropriate choices for state-relevance weights.\n\n1 Introduction\n\nThe curse of dimensionality prevents application of dynamic programming to most prob-\nlems of practical interest. Approximate linear programming (ALP) aims to alleviate the\ncurse of dimensionality by approximation of the dynamic programming solution. In [6], we\ndevelop a variant of approximate linear programming for the discounted-cost case which\nis shown to scale well with problem size. In this paper, we extend that analysis to the\naverage-cost criterion.\n\nOriginally introduced by Schweitzer and Seidmann [11], approximate linear programming\ncombines the linear programming approach to exact dynamic programming [9] to ap-\n\n\f\u0001\u0014\u0013\u0016\u0015\u0018\u0017\n\n\u0001\u001a\u0019\u001c\u001b\u0002\u001d\n\nproximation of the differential cost function (cost-to-go function, in the discounted-cost\ncase) by a linear architecture. More speci\ufb01cally, given a collection of basis functions\n, mapping states in the system to be controlled to real numbers, approxi-\nmate linear programming involves solution of a linear program for generating an approxi-\n\n\u0002\u0001\u0004\u0003\u0006\u0005\b\u0007\n\t\u000b\u0003\r\f\u000e\f\u000e\f\u000e\u0003\u0006\u000f\nmation to the differential cost function of the form \u0010\u0012\u0011\n\nExtension of approximate linear programming to the average-cost setting requires a differ-\nent algorithm and additional analytical ideas. Speci\ufb01cally, our contribution can be summa-\nrized as follows:\nAnalysis of the usual formulation of approximate linear programming for average-\ncost problems. 
We start with the observation that the most natural formulation of average-cost ALP, which follows immediately from taking limits in the discounted-cost formulation and can be found, for instance, in [1, 2, 4, 10], can be interpreted as an algorithm for approximating the optimal average cost. However, to obtain a good policy, one needs a good approximation to the differential cost function. We demonstrate through a counterexample that approximating the average cost and approximating the differential cost function well enough to obtain a good policy are not necessarily aligned objectives. Indeed, the algorithm may lead to arbitrarily bad policies, even if the approximate average cost is very close to optimal and the basis functions have the potential to produce an approximate differential cost function leading to a reasonable policy.

Proposal of a variant of average-cost ALP. A critical limitation of the average-cost ALP algorithm found in the literature is that it does not allow for external control of how the approximation to the differential cost function should be emphasized over different portions of the state space. In situations like the one described in the previous paragraph, when the algorithm produces a bad policy, there is little one can do to improve the approximation other than selecting new basis functions. To address this issue, we propose a two-phase variant of average-cost ALP: the first phase is simply the average-cost ALP algorithm already found in the literature, which is used for generating an approximation to the optimal average cost. This approximation is used in the second phase of the algorithm for generating an approximation to the differential cost function. We show that the second phase selects an approximate differential cost function minimizing a weighted sum, over states, of distances to the true differential cost function, where the weights (referred to as state-relevance weights) are algorithm parameters to be specified during implementation of the algorithm, and can be used to control which states should have more accurate approximations of the differential cost function.

Development of bounds linking the quality of approximate differential cost functions to the performance of the policies associated with them. The observation that the usual formulation of ALP may lead to arbitrarily bad policies raises the question of how to design an algorithm that directly optimizes performance of the policy being obtained. With this question in mind, we develop bounds that relate the quality of approximate differential cost functions (i.e., their proximity to the true differential cost function) to the expected increase in cost incurred by using a greedy policy associated with them. The bound suggests using a weighted sum of distances to the true differential cost function for comparing different approximate differential cost functions. Thus the objective of the second phase of our ALP algorithm is compatible with the objective of optimizing performance of the policy being obtained, and we also obtain some guidance on appropriate choices of state-relevance weights.

2 Stochastic Control Problems and the Curse of Dimensionality

We consider discrete-time stochastic control problems involving a finite state space $S$.
For each state $x \in S$, there is a finite set $\mathcal{A}_x$ of available actions. When the current state is $x$ and action $a \in \mathcal{A}_x$ is taken, a cost $g_a(x)$ is incurred. State transition probabilities $P_a(x, y)$ represent, for each pair $(x, y)$ of states and each action $a \in \mathcal{A}_x$, the probability that the next state will be $y$ given that the current state is $x$ and the current action is $a$.

In stochastic control problems, we want to select a policy optimizing a given criterion. A policy $u$ is a mapping from states to actions. Given a policy $u$, the dynamics of the system follow a Markov chain with transition probabilities $P_{u(x)}(x, y)$. For each policy $u$, we define a transition matrix $P_u$ whose $(x, y)$th entry is $P_{u(x)}(x, y)$, and a cost vector $g_u$ whose $x$th entry is $g_{u(x)}(x)$. We make the following assumption on the transition probabilities:

Assumption 1 (Irreducibility). For each pair of states $x$ and $y$ and each policy $u$, there is $t$ such that $P_u^t(x, y) > 0$.

In this paper, we will employ as an optimality criterion the average cost
\[
\lambda_u(x) = \lim_{T \to \infty} \frac{1}{T}\, \mathrm{E}\left[ \sum_{t=0}^{T-1} g_u(x_t) \,\Big|\, x_0 = x \right].
\]
Irreducibility implies that, for each policy $u$, the limit exists and $\lambda_u(x) = \lambda_u$ for all $x$; the average cost is independent of the initial state. We denote the minimal average cost by $\lambda^* = \min_u \lambda_u$. We also define the dynamic programming operator $T$ by
\[
(Th)(x) = \min_{a \in \mathcal{A}_x} \Big[ g_a(x) + \sum_{y \in S} P_a(x, y)\, h(y) \Big].
\]
Note that $T$ operates on vectors $h \in \Re^{|S|}$ corresponding to functions on the state space $S$. A policy $u$ is called greedy with respect to $h$ if it attains the minimum in the definition of $Th$.

An optimal policy minimizing the average cost can be derived from the solution of Bellman's equation
\[
\lambda e + h = Th,
\]
where $e$ is the vector of ones. We denote solutions to Bellman's equation by pairs $(\lambda^*, h^*)$. The scalar $\lambda^*$ is unique and equal to the optimal average cost. The vector $h^*$ is called a differential cost function. The differential cost function is unique up to a constant: if $h^*$ solves Bellman's equation, then $h^* + \gamma e$ is also a solution for all $\gamma$, and all other solutions can be shown to be of this form. We can ensure uniqueness by imposing $h(\bar x) = 0$ for an arbitrary state $\bar x$. Any policy that is greedy with respect to the differential cost function is optimal.

Solving Bellman's equation involves computing and storing the differential cost function for all states in the system. This is computationally infeasible in most problems of practical interest, due to the explosion in the number of states as the number of state variables grows. We try to combat the curse of dimensionality by settling for the more modest goal of finding an approximation to the differential cost function. The underlying assumption is that, in many problems of practical interest, the differential cost function will exhibit some regularity, or structure, allowing for reasonable approximations to be stored compactly.
We consider a linear approximation architecture: given a set of basis functions $\phi_k : S \to \Re$, $k = 1, \ldots, K$, we generate approximations of the form
\[
\tilde h_r(x) = \sum_{k=1}^{K} r_k \phi_k(x). \tag{1}
\]
We define a matrix $\Phi \in \Re^{|S| \times K}$ whose $k$th column is $\phi_k$, i.e., each of the basis functions is stored as a column of $\Phi$, and each row corresponds to the vector $(\phi_1(x), \ldots, \phi_K(x))$ of basis functions evaluated at a distinct state $x$. We represent $\tilde h_r$ in matrix notation as $\Phi r$. In the remainder of the paper, we assume that (a manageable number of) basis functions are prespecified, and address the problem of choosing a suitable parameter vector $r$. For simplicity, we choose an arbitrary state, henceforth called state "0", for which we set $h(0) = 0$; accordingly, we assume that the basis functions are such that $\phi_k(0) = 0$ for all $k$.

3 Approximate Linear Programming

Approximate linear programming [11, 6] is inspired by the traditional linear programming approach to dynamic programming, introduced by [9]. Bellman's equation can be solved by the average-cost exact LP (ELP):
\[
\begin{array}{ll}
\max_{\lambda, h} & \lambda \\
\mbox{s.t.} & \lambda e + h \le Th.
\end{array} \tag{2}
\]
Note that the constraints $\lambda e + h \le Th$ can be replaced by $\lambda e + h \le g_a + P_a h$ for all actions $a$; therefore we can think of problem (2) as an LP.

In approximate linear programming, we reduce the generally intractable dimensions of the average-cost ELP by constraining $h$ to be of the form $\Phi r$. This yields the first-phase approximate LP (ALP):
\[
\begin{array}{ll}
\max_{\lambda, r} & \lambda \\
\mbox{s.t.} & \lambda e + \Phi r \le T \Phi r.
\end{array} \tag{3}
\]
Problem (3) can be expressed as an LP by the same argument used for the exact LP. We denote its solution by $(\hat\lambda, \hat r)$. The following result is immediate.

Lemma 1. The solution $\hat\lambda$ of the first-phase ALP minimizes $|\lambda^* - \lambda|$ over the feasible region.

Proof: Maximizing $\lambda$ in (3) is equivalent to minimizing $\lambda^* - \lambda$. Since the first-phase ALP corresponds to the exact LP (2) with the extra constraint $h = \Phi r$, we have $\lambda \le \lambda^*$ for all feasible $\lambda$, so that $\lambda^* - \lambda = |\lambda^* - \lambda|$, and the claim follows.

Lemma 1 implies that the first-phase ALP can be seen as an algorithm for approximating the optimal average cost.
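The first-phase ALP (3) is an ordinary finite LP once the operator constraint is expanded action by action, as in the remark following (2). The sketch below solves it with scipy.optimize.linprog; it assumes arrays P and g shaped as in the previous sketch and a basis matrix Phi with a zero row for state 0, and is an illustration of the formulation rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def first_phase_alp(P, g, Phi):
    """First-phase ALP (3): maximize lambda s.t. lambda*e + Phi r <= T(Phi r).

    Expanding the constraint per action a:
        lambda + ((I - P_a) Phi r)(x) <= g_a(x)   for every state x.
    Decision variables are z = (lambda, r_1, ..., r_K).
    """
    n_actions, n = g.shape
    K = Phi.shape[1]
    rows, rhs = [], []
    for a in range(n_actions):
        M = (np.eye(n) - P[a]) @ Phi
        rows.append(np.hstack([np.ones((n, 1)), M]))
        rhs.append(g[a])
    res = linprog(c=np.concatenate(([-1.0], np.zeros(K))),  # maximize lambda
                  A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                  bounds=[(None, None)] * (1 + K))
    assert res.success, res.message
    return res.x[0], res.x[1:]

# Example on the toy MDP above, with a single basis function phi(x) = x:
# lam_hat, r_hat = first_phase_alp(P, g, np.array([[0.0], [1.0]]))
```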
Using this algorithm for generating a policy for the average-cost problem is based on the hope that approximation of the optimal average cost should also implicitly imply approximation of the differential cost function. Note that it is not unreasonable to expect that some approximation of the differential cost function should be involved in the minimization of $|\lambda^* - \lambda|$; for instance, we know that $\hat\lambda = \lambda^*$ if $h^* = \Phi r$ for some $r$.

The ALP has as many variables as the number of basis functions plus one, which will usually amount to a dramatically smaller number of variables than in the ELP. However, the ALP still has as many constraints as the number of state-action pairs. This problem is also found in the discounted-cost formulation, and there are several approaches in the literature for dealing with it, including constraint sampling [7] and exploitation of problem-specific structure for efficient elimination of redundant constraints [8, 10].

Our first step in the analysis of average-cost ALP is to demonstrate through a counterexample that it can produce arbitrarily bad policies, even if the approximation to the average cost is very accurate.

4 Performance of the first-phase ALP: a counterexample

We consider a Markov process with states $0, 1, \ldots, N$, each representing a possible number of jobs in a queue with buffer of size $N$. From state 0, transitions to states 1 and 0 occur with probabilities $p$ and $1 - p$, respectively. From each state $x = 1, \ldots, N$, transitions to states $x + 1$ (when $x < N$) and $x - 1$ occur with probabilities $p$ and $q(x)$, respectively, with the remaining probability corresponding to a self-transition. The arrival probability $p$ is the same for all states. The action to be chosen in each state $x$ is the departure probability or service rate $q(x)$, which takes values in a finite set of available rates. The cost incurred at state $x$ if action $q$ is taken is $g(x, q)$, which grows quadratically in the queue length $x$ and is increasing in the service rate $q$, so that faster service is costlier. We use basis functions $\phi_1(x) = x$ and $\phi_2(x) = x^2$.
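The following sketch instantiates a queue of this kind. The birth-death structure and the basis functions follow the description above, but the numeric parameters (the arrival probability, the set of service rates, and the exact cost coefficients) are not recoverable from the text and are stated here as assumptions.

```python
import numpy as np

N = 100                        # buffer size; states are 0, 1, ..., N
p = 0.2                        # arrival probability (assumed value)
rates = [0.2, 0.4, 0.6, 0.8]   # available service rates q (assumed set)

def queue_mdp():
    n = N + 1
    P = np.zeros((len(rates), n, n))
    g = np.zeros((len(rates), n))
    for a, q in enumerate(rates):
        for x in range(n):
            up = p if x < N else 0.0       # arrival
            down = q if x > 0 else 0.0     # departure
            P[a, x, min(x + 1, N)] += up
            P[a, x, max(x - 1, 0)] += down
            P[a, x, x] += 1.0 - up - down  # self-transition
            # Assumed cost: quadratic in x and increasing in q.
            g[a, x] = x**2 + 60.0 * q**3
    return P, g

P, g = queue_mdp()
x = np.arange(N + 1, dtype=float)
Phi = np.stack([x, x**2], axis=1)          # phi_1(x) = x, phi_2(x) = x^2
```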
For a buffer size of $N = 100$, the first-phase ALP yields an approximation $\hat\lambda$ within 2% of the true value $\lambda^*$. However, the average cost yielded by the greedy policy with respect to $\Phi \hat r$ is 9842.2, and it goes to infinity as we increase the buffer size. Figure 1 explains this behavior. Note that $\Phi \hat r$ is a very good approximation of $h^*$ over small states, and becomes progressively worse as $x$ increases. Small states account for virtually all of the stationary probability under the optimal policy, hence it is not surprising that the first-phase ALP yields a very accurate approximation of $\lambda^*$, as other states contribute very little to the optimal average cost. However, fitting the optimal average cost and the differential cost function over states visited often under the optimal policy is not sufficient for obtaining a good policy. Indeed, $\Phi \hat r$ severely underestimates costs in large states, and the greedy policy drives the system to those states, yielding a very large average cost and ultimately making the system unstable as the buffer size goes to infinity.

It is also troublesome to note that our choice of basis functions actually has the potential to lead to a reasonably good policy; indeed, for a suitable choice of $r$, the greedy policy associated with $\Phi r$ has an average cost only a small percentage larger than the optimal average cost, regardless of the buffer size. Hence even though the first-phase ALP is being given a relatively good set of basis functions, it is producing a bad approximate differential cost function, which cannot be improved unless different basis functions are selected.

5 Two-phase average-cost ALP

A striking difference between the first-phase average-cost ALP and discounted-cost ALP is the presence in the latter of state-relevance weights. These are algorithm parameters that can be used to control the accuracy of the approximation to the cost-to-go function (the discounted-cost counterpart of the differential cost function) over different portions of the state space, and they have been shown in [6] to have a first-order impact on the performance of the policy being generated. For instance, in the example described in the previous section, in the discounted-cost formulation one might be able to improve the policy yielded by ALP by choosing state-relevance weights that put more emphasis on large states. Inspired by this observation, we propose a two-phase algorithm with the characteristic that state-relevance weights are present and can be used to control the quality of the differential cost function approximation. The first phase is simply the first-phase ALP introduced in Section 3, and is used for generating an approximation $\hat\lambda$ to the optimal average cost. The second phase consists of solving the second-phase ALP, which uses an estimate $\bar\lambda$ of the optimal average cost (e.g., $\bar\lambda = \hat\lambda$), for finding approximations to the differential cost function:
\[
\begin{array}{ll}
\max_{r} & c' \Phi r \\
\mbox{s.t.} & \bar\lambda e + \Phi r \le T \Phi r.
\end{array} \tag{4}
\]
The state-relevance weights $c \ge 0$ are algorithm parameters to be specified by the user, and $c'$ denotes the transpose of $c$.
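The second-phase ALP (4) differs from (3) only in that $\bar\lambda$ is fixed and the objective is $c' \Phi r$. Below is a sketch, reusing first_phase_alp and the queue arrays P, g, Phi from the earlier sketches; the exponential weight profile used here anticipates the choice discussed in Example 1 below and is an assumption at this point.

```python
import numpy as np
from scipy.optimize import linprog

def second_phase_alp(P, g, Phi, lam_bar, c):
    """Second-phase ALP (4): maximize c' Phi r s.t. lam_bar*e + Phi r <= T(Phi r).

    Per action a, the constraint reads ((I - P_a) Phi r)(x) <= g_a(x) - lam_bar.
    Feasibility requires lam_bar <= lam_hat; the LP is infeasible otherwise.
    """
    n_actions, n = g.shape
    rows, rhs = [], []
    for a in range(n_actions):
        rows.append((np.eye(n) - P[a]) @ Phi)
        rhs.append(g[a] - lam_bar)
    res = linprog(c=-(c @ Phi),                 # maximize c' Phi r
                  A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                  bounds=[(None, None)] * Phi.shape[1])
    assert res.success, res.message
    return res.x

lam_hat, r_hat = first_phase_alp(P, g, Phi)     # phase one on the queue
rho = 0.9
c = (1 - rho) * rho ** np.arange(N + 1)         # weights c(x) proportional to rho^x
r_tilde = second_phase_alp(P, g, Phi, lam_hat, c)
```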
We denote the optimal solution of the second-phase ALP by $\tilde r$.

We now demonstrate how the state-relevance weights and $\bar\lambda$ can be used for controlling the quality of the approximation to the differential cost function. We first define, for any given $\bar\lambda$, the function $h_{\bar\lambda}$, given by the unique solution to [3]
\[
h = Th - \bar\lambda e, \qquad h(0) = 0. \tag{5}
\]
If $\bar\lambda$ is our estimate of the optimal average cost, then $h_{\bar\lambda}$ can be seen as an estimate of the differential cost function $h^*$. Our first result links the difference between $h_{\bar\lambda}$ and $h^*$ to the difference between $\lambda^*$ and $\bar\lambda$. For simplicity of notation, we implicitly drop from all vectors and matrices the rows and columns corresponding to state 0, so that, for instance, $h^*$ corresponds to the original vector $h^*$ without the entry for state 0, and $P_{u^*}$ corresponds to the original matrix $P_{u^*}$ without the row and column for state 0.

Lemma 2. For all $\bar\lambda \le \lambda^*$, we have
\[
0 \le h_{\bar\lambda} - h^* \le (\lambda^* - \bar\lambda)(I - P_{u^*})^{-1} e.
\]

Proof: Equation (5), satisfied by $h_{\bar\lambda}$, corresponds to Bellman's equation for the problem of finding the stochastic shortest path to state 0 when costs are given by $g_a(x) - \bar\lambda$ [3]. Hence $h_{\bar\lambda}$ corresponds to the vector of smallest expected total costs accumulated until state 0 is reached. Since $\bar\lambda \le \lambda^*$, each stage cost $g_a(x) - \bar\lambda$ is at least $g_a(x) - \lambda^*$, and minimizing over policies yields $h_{\bar\lambda} \ge h^*$. Moreover, following the optimal policy $u^*$ accumulates expected cost $h^*(x) + (\lambda^* - \bar\lambda)\,\mathrm{E}_{u^*}[\tau \mid x_0 = x]$, where $\tau$ is the first time state 0 is reached; the vector of expected hitting times equals $(I - P_{u^*})^{-1} e$, and minimality of $h_{\bar\lambda}$ gives the upper bound.

Note that if $\bar\lambda = \lambda^*$, we also have $h_{\bar\lambda} = h^*$.

In the following theorem, we show that the second-phase ALP minimizes $\|h_{\bar\lambda} - \Phi r\|_{1,c}$ over the feasible region. The weighted $\ell_1$-norm, which will be used in the remainder of the paper, is defined as $\|h\|_{1,c} = \sum_x c(x) |h(x)|$, for any vector $c \ge 0$.

Theorem 1. Let $\tilde r$ be the optimal solution to the second-phase ALP. Then it minimizes $\|h_{\bar\lambda} - \Phi r\|_{1,c}$ over the feasible region of the second-phase ALP.

Proof: It is a well-known result [3] that, for all $h$ satisfying $\bar\lambda e + h \le Th$, we have $h \le h_{\bar\lambda}$.
It follows that $\Phi r \le h_{\bar\lambda}$ for every $r$ that is feasible for the second-phase ALP, hence
\[
\|h_{\bar\lambda} - \Phi r\|_{1,c} = c'(h_{\bar\lambda} - \Phi r),
\]
and maximizing $c' \Phi r$ is equivalent to minimizing $\|h_{\bar\lambda} - \Phi r\|_{1,c}$ over the feasible region of the second-phase ALP.

For any fixed choice of $\bar\lambda \le \lambda^*$, Lemma 2 and the triangle inequality yield
\[
\|h^* - \Phi r\|_{1,c} \le \|h_{\bar\lambda} - \Phi r\|_{1,c} + (\lambda^* - \bar\lambda)\, c' (I - P_{u^*})^{-1} e, \tag{6}
\]
hence the second-phase ALP minimizes an upper bound on the weighted $\ell_1$-norm of the error in the differential cost function approximation. Note that the state-relevance weights $c$ determine how errors over different portions of the state space are weighted in the decision of which approximate differential cost function to select, and can be used for balancing the accuracy of the approximation over different states. In the next section, we will provide performance bounds that tie a certain weighted $\ell_1$-norm of the difference between $h^*$ and $\Phi r$ to the expected increase in cost incurred by using the greedy policy with respect to $\Phi r$. This demonstrates that the objective optimized by the second-phase ALP is compatible with the objective of optimizing performance of the policy being obtained, and it also provides some insight about appropriate choices of state-relevance weights.

We have not yet specified how to choose $\bar\lambda$. An obvious choice is $\bar\lambda = \hat\lambda$, since $\hat\lambda$ is the estimate of the optimal average cost yielded by the first-phase ALP and it satisfies $\hat\lambda \le \lambda^*$, so that bound (6) holds. In practice, it may be advantageous to perform a line search over $\bar\lambda$ to optimize performance of the ultimate policy being generated. An important issue is whether the second-phase ALP is feasible for a given choice of $\bar\lambda$; for $\bar\lambda \le \hat\lambda$, this will always be the case. It can also be shown that, under certain conditions on the basis functions, the second-phase ALP possesses multiple feasible solutions regardless of the choice of $\bar\lambda$.
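For small instances, the quantities in Lemma 2, Theorem 1, and bound (6) can be evaluated directly: $h_{\bar\lambda}$ is computable by value iteration on the equivalent stochastic shortest path problem, and the weighted $\ell_1$-norm is a weighted sum of absolute errors. A sketch, under the assumption that state 0 is reached under every policy (as guaranteed here by Assumption 1):

```python
import numpy as np

def solve_h_bar(P, g, lam_bar, iters=200000, tol=1e-9):
    """Compute h_{lam_bar}, the solution of h = Th - lam_bar*e with h(0) = 0,
    by value iteration on the stochastic shortest path problem with stage
    costs g_a(x) - lam_bar and termination at state 0 (see (5) and [3])."""
    h = np.zeros(g.shape[1])
    for _ in range(iters):
        h_new = np.min(g - lam_bar + P @ h, axis=0)
        h_new[0] = 0.0                      # state 0 is the terminal state
        if np.max(np.abs(h_new - h)) < tol:
            return h_new
        h = h_new
    return h

def weighted_l1(u, v, c):
    # || u - v ||_{1,c} = sum_x c(x) |u(x) - v(x)|
    return float(c @ np.abs(u - v))

# e.g. weighted_l1(solve_h_bar(P, g, lam_hat), Phi @ r_tilde, c) evaluates the
# objective that Theorem 1 shows the second-phase ALP to minimize.
```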
6 A performance bound

In this section, we present a bound on the performance of greedy policies associated with approximate differential cost functions. This bound provides some guidance on appropriate choices for state-relevance weights.

Theorem 2. Let Assumption 1 hold. For all $h$, let $\lambda_h$ and $\pi_h$ denote the average cost and the stationary state distribution of the greedy policy associated with $h$. Then, for all $h$ such that $h \le h^*$, we have
\[
\lambda_h \le \lambda^* + \|h^* - h\|_{1,\pi_h}.
\]

Proof: We have $\lambda_h = \pi_h' g_h$, where $g_h$ and $P_h$ denote the cost vector and transition matrix associated with the greedy policy with respect to $h$. Then
\[
\lambda_h = \pi_h' g_h = \pi_h'(g_h + P_h h - h) = \pi_h'(Th - h),
\]
where we have used $\pi_h' P_h = \pi_h'$ and, in the last equality, the fact that the policy is greedy with respect to $h$. Now if $h \le h^*$, monotonicity of $T$ implies $Th \le Th^* = \lambda^* e + h^*$, so that
\[
\lambda_h \le \pi_h'(\lambda^* e + h^* - h) = \lambda^* + \|h^* - h\|_{1,\pi_h}.
\]

Theorem 2 suggests that one approach to selecting state-relevance weights may be to run the second-phase ALP adaptively, using in each iteration weights corresponding to the stationary state distribution associated with the policy generated in the previous iteration (a sketch of such a scheme is given after Example 1 below). Alternatively, in some cases it may suffice to use rough guesses about the stationary state distribution of the MDP as choices for the state-relevance weights. We revisit the example from Section 4 to illustrate this idea.

Example 1. Consider applying the second-phase ALP to the controlled queue described in Section 4. We use weights of the form $c(x) \propto \rho^x$ for a parameter $\rho \in (0, 1)$. This is similar to what is done in [6] and is motivated by the fact that, if the system runs under a "stabilizing" policy, there are exponential lower and upper bounds on the stationary state distribution [5]. Hence $c(x) \propto \rho^x$ is a reasonable guess for the shape of the stationary distribution. We also let $\bar\lambda = \hat\lambda$. Figure 1 demonstrates the evolution of $\Phi \tilde r$ relative to $\Phi \hat r$: note that there is significant improvement in the shape of $\Phi \tilde r$ as we increase $\rho$. The best policy is obtained for $\rho = 0.9$, and incurs an average cost only a small percentage higher than the optimal average cost, regardless of the buffer size.

Figure 1: Controlled queue example: differential cost function approximations as a function of $\rho$. From top to bottom: the differential cost function $h^*$, the approximations $\Phi \tilde r$ obtained with $\rho = 0.9$, $0.8$, and $0.7$, and the first-phase approximation $\Phi \hat r$.
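The adaptive scheme referenced before Example 1 can be sketched as a simple loop: solve the second-phase ALP, extract the greedy policy, replace the weights by that policy's stationary distribution, and repeat. Whether this loop converges is an open question, as discussed in the conclusions; the helpers and arrays (second_phase_alp, P, g, Phi, lam_hat, c) are reused from the earlier sketches.

```python
import numpy as np

def stationary_distribution(P_u):
    # Left eigenvector of P_u for eigenvalue 1, normalized to sum to one.
    evals, evecs = np.linalg.eig(P_u.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    return pi / pi.sum()

def adaptive_second_phase(P, g, Phi, lam_bar, c0, n_rounds=5):
    """Iterate the second-phase ALP, resetting the state-relevance weights to
    the stationary distribution of the current greedy policy each round."""
    n = g.shape[1]
    c = c0
    for _ in range(n_rounds):
        r = second_phase_alp(P, g, Phi, lam_bar, c)
        u = np.argmin(g + P @ (Phi @ r), axis=0)   # greedy policy w.r.t. Phi r
        c = stationary_distribution(P[u, np.arange(n)])
    return r, c

# e.g. r_ad, c_ad = adaptive_second_phase(P, g, Phi, lam_hat, c)
```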
7 Conclusions

We have extended the analysis of ALP to the case of minimization of average cost. We have shown how the ALP version commonly found in the literature may lead to arbitrarily bad policies even if the choice of basis functions is relatively good; the main problem is that this version of the algorithm (the first-phase ALP) prioritizes approximation of the optimal average cost, but does not necessarily yield a good approximation of the differential cost function. We have proposed a variant of approximate linear programming, the two-phase approximate linear programming method, that explicitly approximates the differential cost function. The main attraction of the algorithm is the presence of state-relevance weights, which can be used for controlling the relative accuracy of the differential cost function approximation over different portions of the state space.

Many open issues must still be addressed. Perhaps most important of all is whether there is an automatic way of choosing state-relevance weights. The performance bound in Theorem 2 suggests an iterative scheme, in which the second-phase ALP is run multiple times and the state-relevance weights are updated in each iteration according to the stationary state distribution obtained with the policy generated by the algorithm in the previous iteration. It remains to be shown whether such a scheme converges. It is also important to note that, in principle, Theorem 2 holds only for $h \le h^*$. If $\bar\lambda < \lambda^*$, this condition cannot be verified for $\Phi \tilde r$, and the appropriateness of minimizing $\|h_{\bar\lambda} - \Phi \tilde r\|_{1,c}$ is then only speculative.

References

[1] D. Adelman. A price-directed approach to stochastic inventory/routing. Preprint, 2002.

[2] D. Adelman. Price-directed replenishment of subsets: Methodology and its application to inventory routing. Preprint, 2002.

[3] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.

[4] D. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[5] D. Bertsimas, D. Gamarnik, and J.N. Tsitsiklis. Performance of multiclass Markovian queueing networks via piecewise linear Lyapunov functions. Annals of Applied Probability, 11, 2001.

[6] D.P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. To appear in Operations Research, 2001.

[7] D.P. de Farias and B. Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Conditionally accepted to Mathematics of Operations Research, 2001.

[8] C. Guestrin, D. Koller, and R. Parr. Efficient solution algorithms for factored MDPs. Submitted to Journal of Artificial Intelligence Research, 2001.

[9] A.S. Manne. Linear programming and sequential decisions. Management Science, 6(3):259-267, 1960.

[10] J.R. Morrison and P.R. Kumar. New linear program performance bounds for queueing networks. Journal of Optimization Theory and Applications, 100(3):575-597, 1999.

[11] P. Schweitzer and A. Seidmann. Generalized polynomial approximations in Markovian decision processes.
Journal of Mathematical Analysis and Applications, 110:568-582, 1985.