{"title": "Efficient Resources Allocation for Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1571, "page_last": 1578, "abstract": "", "full_text": "Efficient Resources Allocation\nfor Markov Decision Processes\n\nCMAP, Ecole Polytechnique, 91128 Palaiseau, France\n\nhttp://www.cmap.polytechnique.fr/....munos\n\nRemi Munos\n\nremi.munos@polytechnique.fr\n\nAbstract\n\nIt is desirable that a complex decision-making problem in an uncer(cid:173)\ntain world be adequately modeled by a Markov Decision Process\n(MDP) whose structural representation is adaptively designed by a\nparsimonious resources allocation process. Resources include time\nand cost of exploration, amount of memory and computational time\nallowed for the policy or value function representation. Concerned\nabout making the best use of the available resources, we address\nthe problem of efficiently estimating where adding extra resources\nis highly needed in order to improve the expected performance of\nthe resulting policy. Possible application in reinforcement learning\n(RL) , when real-world exploration is highly costly, concerns the de(cid:173)\ntection of those areas of the state-space that need primarily to be\nexplored in order to improve the policy. Another application con(cid:173)\ncerns approximation of continuous state-space stochastic control\nproblems using adaptive discretization techniques for which highly\nefficient grid points allocation is mandatory to survive high dimen(cid:173)\nsionality. Maybe surprisingly these two problems can be formu(cid:173)\nlated under a common framework: for a given resource allocation,\nwhich defines a belief state over possible MDPs, find where adding\nnew resources (thus decreasing the uncertainty of some parame(cid:173)\nters -transition probabilities or rewards) will most likely increase\nthe expected performance of the new policy. 
To do so, we use sampling techniques for estimating the contribution of each parameter's probability distribution function (pdf) to the expected loss of using an approximate policy (such as the optimal policy of the most probable MDP) instead of the true (but unknown) optimal policy.\n\nIntroduction\n\nAssume that we model a complex decision-making problem under uncertainty by a finite MDP. Because of the limited resources used, the parameters of the MDP (transition probabilities and rewards) are uncertain: we assume that we only know a belief state over their possible values. If we select the most probable values of the parameters, we can build an MDP and solve it to deduce the corresponding optimal policy. However, because of the uncertainty over the true parameters, this policy may not be the one that maximizes the expected cumulative rewards of the true (but partially unknown) decision-making problem. We can nevertheless use sampling techniques to estimate the expected loss of using this policy. Furthermore, if we assume independence of the parameters (considered as random variables), we can derive the contribution of the uncertainty over each parameter to this expected loss. As a consequence, we can predict where adding new resources (thus decreasing the uncertainty over some parameters) will decrease this loss the most, thus improving the MDP model of the decision-making problem so as to optimize the expected future rewards.\n\nAs a possible application, in model-free RL we may wish to minimize the amount of real-world exploration (because each experiment is highly costly). Following [1], we can maintain a Dirichlet pdf over the transition probabilities of the corresponding MDP.
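As an illustration of such a belief state, the sketch below (with assumed counts, not taken from the paper) maintains Dirichlet parameters as transition counts for a single (state, action) pair and samples candidate transition vectors from the posterior; observing one more transition sharpens the belief:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical visit counts for transitions from one (state, action) pair
# to each of three successor states; these act as Dirichlet parameters.
counts = np.array([4.0, 1.0, 1.0])

# Posterior-mean transition probabilities of the Dirichlet belief.
p_mean = counts / counts.sum()

# Sample candidate transition vectors from the belief state.
samples = rng.dirichlet(counts, size=1000)

# One extra observed transition to state 0 yields a less uncertain
# posterior: the sampled probabilities spread less around the mean.
before = samples[:, 0].std()
after = rng.dirichlet(counts + np.array([1.0, 0.0, 0.0]), size=1000)[:, 0].std()
```

Each sampled vector is a valid transition distribution, so a set of such samples (one per uncertain parameter) defines a sampled MDP in the sense used below.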
Then, our algorithm is able to predict in which parts of the state space we should make new experiments, thus decreasing the uncertainty over some parameters (the posterior distribution being less uncertain than the prior) in order to optimize the expected payoff.\n\nAnother application concerns the approximation of continuous (or large discrete) state-space control problems using variable resolution grids, which requires an efficient resource allocation process in order to survive the \"curse of dimensionality\" in high dimensions. For a given grid, because of the interpolation process, the approximate back-up operator introduces a local interpolation error (see [4]) that may be considered as a random variable (for example in the random grids of [6]). The algorithm introduced in this paper allows us to estimate where we should add new grid points, thus decreasing the uncertainty over the local interpolation error, in order to increase the expected performance of the new grid representation. The main tool developed here is the calculation of the partial derivative of useful global measures (the value function or the loss of using a sub-optimal policy) with respect to each parameter (probabilities and rewards) of an MDP.\n\n1 Description of the problem\n\nWe consider an MDP with a finite state space X and action space A. A transition from a state x, action a, to a next state y occurs with probability p(y|x, a) and the corresponding (deterministic) reward is r(x, a). We introduce the back-up operator T_a defined, for any function W : X -> R, as\n\nT_a W(x) = \gamma \sum_y p(y|x, a) W(y) + r(x, a)    (1)\n\n(with some discount factor 0 < \gamma < 1). It is a contraction mapping, thus the dynamic programming (DP) equation V(x) = \max_{a \in A} T_a V(x) has a unique fixed point V, called the value function. Let
us define the Q-values Q(x, a) = T_a V(x). The optimal policy \pi^* is the mapping from any state x to the action \pi^*(x) that maximizes the Q-values: \pi^*(x) = \arg\max_{a \in A} Q(x, a).\n\nThe parameters of the MDP (the probability and the reward functions) are not perfectly known: all we know is a pdf over their possible values. This uncertainty comes from the limited amount of resources allocated for estimating those parameters.\n\nLet us choose a specific policy \hat{\pi} (for example the optimal policy of the MDP with the most probable parameters). We can estimate the expected loss of using \hat{\pi} instead of the true (but unknown) optimal policy \pi^*. Let us write \mu = {\mu_j} the set of all parameters (the p and r functions) of an MDP. We assume that we know a probability distribution function pdf(\mu_j) over their possible values. For an MDP M^\mu defined by its parameters \mu, we write p^\mu(y|x, a), r^\mu(x, a), V^\mu, Q^\mu, and \pi^\mu, respectively, its transition probabilities, rewards, value function, Q-values, and optimal policy.\n\n1.1 Direct gain optimization\n\nWe define the gain J^\mu(x; \pi) in the MDP M^\mu as the expected sum of discounted rewards obtained starting from state x and using policy \pi:\n\nJ^\mu(x; \pi) = E[\sum_k \gamma^k r^\mu(x_k, \pi(x_k)) | x_0 = x; \pi]    (2)\n\nwhere the expectation is taken over sequences of states x_k -> x_{k+1} occurring with probability p^\mu(x_{k+1}|x_k, \pi(x_k)). By definition, the optimal gain in M^\mu is V^\mu(x) = J^\mu(x; \pi^\mu), which is obtained for the optimal policy \pi^\mu. Let \hat{J}^\mu(x) = J^\mu(x; \hat{\pi}) be the approximate gain obtained for some approximate policy \hat{\pi} in the same MDP M^\mu. We define the loss L^\mu(x) incurred from x when one uses the approximate policy \hat{\pi} instead of the optimal one \pi^\mu in M^\mu:\n\nL^\mu(x) = V^\mu(x) - \hat{J}^\mu(x)    (3)\n\nAn example of approximate policy \hat{\pi}
would be the optimal policy of the most probable MDP, defined by the most probable parameters \bar{p}(y|x, a) and \bar{r}(x, a).\n\nWe also consider the problem of maximizing the global gain from a set of initial states chosen according to some probability distribution P(x). Accordingly, we define the global gain of a policy \pi: J^\mu(\pi) = \sum_x J^\mu(x; \pi) P(x), and the global loss L^\mu of using some approximate policy \hat{\pi} instead of the optimal one \pi^\mu:\n\nL^\mu = \sum_x [V^\mu(x) - \hat{J}^\mu(x)] P(x)    (4)\n\nThus, knowing the pdf over all parameters \mu, we can define the expected global loss L = E_\mu[L^\mu].\n\nNext, we would like to define the contribution of each parameter's uncertainty to this loss, so that we know where we should add new resources (thus reducing some parameters' uncertainty) in order to decrease the expected global loss. We would like to estimate, for each parameter \mu_j,\n\nE[\delta L | add \delta u units of resource for \mu_j]    (5)\n\n1.2 Partial derivative of the loss\n\nIn order to quantify (5) we need to be more explicit about the pdf over \mu. First, we assume the independence of the parameters \mu_j (considered as random variables). Suppose that pdf(\mu_j) = N(0, \sigma_j) (normal distribution of mean 0 and standard deviation \sigma_j). We would like to estimate the variation \delta L of the expected loss L when we make a small change to the uncertainty over \mu_j (a consequence of adding new resources), for example when changing the standard deviation by \delta\sigma_j in pdf(\mu_j). In the limit of an infinitesimal variation we obtain the partial derivative \partial L / \partial \sigma_j, which, when computed for all parameters \mu_j, provides the respective contributions of each parameter's uncertainty to the global loss.\n\nAnother example is when pdf(\mu_j) is a uniform distribution of support [-b_j, b_j]. Then the partial contribution of \mu_j's uncertainty to the global loss can be expressed as \partial L / \partial b_j.
More generally, we can define a finite number of characteristic scalar measurements of the pdf uncertainty (for example the entropy or the moments) and compute the partial derivative of the expected global loss with respect to these coefficients. Finally, knowing the actual resources needed to estimate a parameter \mu_j with some uncertainty defined by pdf(\mu_j), we are able to estimate (5).\n\n1.3 Unbiased estimator\n\nWe sample N sets of parameters {\mu^i}_{i=1..N} from pdf(\mu), which define N MDPs M^i. For convenience, we use the superscript i to refer to the i-th MDP sample and the subscript j for the j-th parameter of a variable. We solve each MDP using standard DP techniques (see [5]). This expensive computation can be sped up in two ways: first, by using the value function and policy computed for the first MDP as initial values for the other MDPs; second, since all MDPs have the same structure, by computing once and for all an efficient ordering of the states (using a topological sort, possibly with loops) that will be used for value iteration.\n\nFor each MDP, we compute the global loss L^i of using the policy \hat{\pi} and estimate the expected global loss: L ~ (1/N) \sum_{i=1}^N L^i. In order to estimate the contribution of a parameter's uncertainty to L, we derive the partial derivative of L with respect to the characteristic coefficients of pdf(\mu_j). In the case of a reward parameter \mu_j that follows a normal distribution N(0, \sigma_j), we can write \mu_j = \sigma_j \epsilon_j where \epsilon_j follows N(0, 1). The partial derivative of the expected loss L with respect to \sigma_j is\n\n\partial L / \partial \sigma_j = \partial/\partial \sigma_j E_{\mu ~ N(0,\sigma)}[L^\mu] = \partial/\partial \sigma_j E_{\epsilon ~ N(0,1)}[L^{\sigma\epsilon}] = E_{\epsilon ~ N(0,1)}[(\partial L^{\sigma\epsilon} / \partial \mu_j) \epsilon_j]    (6)\n\nfrom which we deduce the unbiased estimator\n\n\partial L / \partial \sigma_j ~ (1/N) \sum_{i=1}^N (\partial L^i / \partial \mu_j) (\mu_j^i / \sigma_j)    (7)\n\nwhere \partial L^i / \partial \mu_j is the partial derivative of the global loss L^i of MDP M^i with respect to the parameter \mu_j (considered as a variable).
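Before \partial L^i / \partial \mu_j is derived analytically in the next sections, the identity (6) and the estimator (7) can be sanity-checked on a toy loss with a closed-form answer. The sketch below is an illustration only (not the paper's MDP loss): it uses L(\mu) = \mu^2 with \mu ~ N(0, \sigma), for which E[L] = \sigma^2 and hence \partial E[L] / \partial \sigma = 2\sigma:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
N = 200_000

# Sample the parameter mu^i ~ N(0, sigma), as in section 1.3.
mu = rng.normal(0.0, sigma, size=N)

# Toy loss L(mu) = mu^2, chosen so that E[L] = sigma^2 and the true
# derivative dE[L]/dsigma = 2*sigma = 1.0 here.
dL_dmu = 2.0 * mu  # partial derivative of L^i with respect to mu

# Unbiased estimator (7): (1/N) * sum_i (dL^i/dmu) * (mu^i / sigma).
estimate = np.mean(dL_dmu * mu / sigma)
```

With this many samples the estimate concentrates tightly around the analytic value 2\sigma = 1.0, which is exactly the unbiasedness claimed for (7).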
For other distributions, we can derive similar results to (6) and deduce analogous estimators (for uniform distributions, we obtain the same estimator with b_j instead of \sigma_j).\n\nThe remainder of the paper is organized as follows. Section 2 introduces useful tools to derive the partial contribution of each parameter (transition probability and reward) to the value function in a Markov chain, Section 3 establishes the partial contribution of each parameter to the global loss, allowing us to calculate the estimator (7), and Section 4 provides an efficient algorithm. All proofs are given in the full-length paper [2].\n\n2 Non-local dependencies\n\n2.1 Influence of a Markov chain\n\nIn [3] we introduced the notion of influence of a Markov chain as a way to measure value function/reward correlations between states. Let us consider a set of values V satisfying a Bellman equation\n\nV(x) = \gamma \sum_y p(y|x) V(y) + r(x)    (8)\n\nWe define the discounted cumulative k-chained transition probabilities p_k(y|x):\n\np_0(y|x) = 1_{x=y}  (= 1 if x = y, 0 if x != y)\np_1(y|x) = \gamma p(y|x)\np_2(y|x) = \sum_w p_1(y|w) p_1(w|x)\n...\np_k(y|x) = \sum_w p_1(y|w) p_{k-1}(w|x)\n\nThe influence I(y|x) of a state y on another state x is defined as I(y|x) = \sum_{k=0}^\infty p_k(y|x).
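A minimal way to compute the influence from this definition is to accumulate the truncated series \sum_k p_k in matrix form. The sketch below uses an assumed 3-state chain (not from the paper); the column-sum check \sum_y I(y|x) = 1/(1-\gamma) holds because each p_k sums to \gamma^k over y:

```python
import numpy as np

gamma = 0.9
# Hypothetical Markov chain: P[x, y] = p(y|x).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.3, 0.0, 0.7]])

# I(y|x) = sum_k p_k(y|x), with p_0 the identity, p_1 = gamma * P
# (transposed so that influence[y, x] = I(y|x)), and p_k = p_1 @ p_{k-1}.
p1 = gamma * P.T
pk = np.eye(3)            # p_0
influence = np.eye(3)
for _ in range(2000):
    pk = p1 @ pk          # p_k from p_{k-1}
    influence += pk

# Expected discounted number of visits, summed over all states y,
# starting from any x: 1 / (1 - gamma).
col_sums = influence.sum(axis=0)
```

The loop converges geometrically (each term shrinks by \gamma), which is why the computation is as cheap as solving the Markov chain itself.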
Intuitively, I(y|x) measures the expected discounted number of visits of state y starting from x; it is also the partial derivative of the value function V(x) with respect to the reward r(y). Indeed, V(x) can be expressed as a linear combination of the rewards at y weighted by the influence I(y|x):\n\nV(x) = \sum_y I(y|x) r(y)    (9)\n\nWe can also define the influence of a state y on a function f: I(y|f(.)) = \sum_x I(y|x) f(x), and the influence of a function f on another function g: I(f(.)|g(.)) = \sum_y I(y|g(.)) f(y). In [3], we showed that the influence satisfies\n\nI(y|x) = \gamma \sum_w p(y|w) I(w|x) + 1_{x=y}    (10)\n\nwhich is a fixed-point equation of a contracting operator (in 1-norm), thus has a unique solution, the influence, that can be computed by successive iterations. Similarly, the influence I(y|f(.)) can be obtained as the limit of the iterations\n\nI(y|f(.)) <- \gamma \sum_w p(y|w) I(w|f(.)) + f(y)\n\nThus the computation of the influence I(y|f(.)) is cheap (equivalent to solving a Markov chain).\n\n2.2 Total derivative of V\n\nWe wish to express the contribution of all parameters (transition probabilities p and rewards r, considered as variables) to the value function V by defining the total derivative of V as a function of those parameters. We recall that the total derivative of a function f of several variables x_1, ..., x_n is df = (\partial f/\partial x_1) dx_1 + ... + (\partial f/\partial x_n) dx_n. We already know that the partial derivative of V(x) with respect to the reward r(z) is the influence I(z|x) = \partial V(x)/\partial r(z). Now, the dependency with respect to the transition probabilities has to be expressed more carefully because the probabilities p(w|z) for a given z are dependent (they sum to one). A way to express this is provided in the theorem that follows, whose proof is in [2].\n\nTheorem 1 For a given state z, let us alter the probabilities p(w|z), for all w, by some \delta p(w|z) value, such that \sum_w \delta p(w|z) = 0.
Then V(x) is altered by \delta V(x) = I(z|x) [\gamma \sum_w V(w) \delta p(w|z)]. We deduce the total derivative of V:\n\ndV(x) = \sum_z I(z|x) [\gamma \sum_w V(w) dp(w|z) + dr(z)]\n\nunder the constraint \sum_w dp(w|z) = 0 for all z.\n\n3 Total derivative of the loss\n\nFor a given MDP M with parameters \mu (for notational simplicity we do not write the \mu superscript in what follows), we want to estimate the loss of using an approximate policy \hat{\pi} instead of the optimal one \pi. First, we define the one-step loss l(x) at a state x as the difference between the gain obtained by choosing the best action \pi(x) then using the optimal policy \pi, and the gain obtained by choosing action \hat{\pi}(x) then following the same optimal policy \pi:\n\nl(x) = Q(x, \pi(x)) - Q(x, \hat{\pi}(x))    (11)\n\nNow we consider the loss L(x), defined by (3), for an initial state x when we use the approximate policy \hat{\pi}. We can prove that L(x) is the expected discounted cumulative one-step loss l(x_k) over reachable states x_k:\n\nL(x) = E[\sum_k \gamma^k l(x_k) | x_0 = x; \hat{\pi}]\n\nwith the expectation taken in the same sense as in (2).\n\n3.1 Decomposition of the one-step loss\n\nWe use (9) to decompose the Q-values\n\nQ(x, a) = \gamma \sum_w p(w|x, a) \sum_y I(y|w) r(y, \pi(y)) + r(x, a)\n        = r(x, a) + \sum_y q(y|x, a) r(y, \pi(y))\n\nusing the partial contributions q(y|x, a) = \gamma \sum_w p(w|x, a) I(y|w), where I(y|w) is the influence of y on w in the Markov chain derived from the MDP M by choosing policy \pi.
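This decomposition can be checked numerically: computed with the influence of the chain induced by \pi, the contributions q(y|x, a) must rebuild the Q-values exactly. Below is a sketch on a hypothetical 2-state, 2-action MDP (all dynamics assumed for illustration):

```python
import numpy as np

gamma = 0.8
# Hypothetical MDP: P[a, x, y] = p(y|x, a), R[x, a] = r(x, a).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.9, 0.1]]])
R = np.array([[1.0, 0.0], [0.0, 0.5]])

# Optimal policy pi by value iteration.
V = np.zeros(2)
for _ in range(3000):
    Q = R + gamma * np.einsum('axy,y->xa', P, V)
    V = Q.max(axis=1)
pi = Q.argmax(axis=1)

# Influence Infl[y, w] = I(y|w) in the chain induced by pi, via the
# closed form (Id - gamma * P_pi^T)^{-1} of the series sum_k p_k.
Ppi = P[pi, np.arange(2), :]                     # Ppi[w, y] = p(y|w, pi(w))
Infl = np.linalg.inv(np.eye(2) - gamma * Ppi.T)

# Partial contributions q(y|x, a) = gamma * sum_w p(w|x, a) * I(y|w).
q = gamma * np.einsum('axw,yw->axy', P, Infl)

# Rebuild Q(x, a) = r(x, a) + sum_y q(y|x, a) * r(y, pi(y)).
r_pi = R[np.arange(2), pi]
Q_rebuilt = R + np.einsum('axy,y->xa', q, r_pi)

# One-step loss (11) for the other action used as a stand-in pi_hat.
pi_hat = 1 - pi
l = Q[np.arange(2), pi] - Q[np.arange(2), pi_hat]
```

Since \pi(x) maximizes the Q-values, the one-step loss l(x) is nonnegative by construction, and Q_rebuilt matches Q up to the convergence tolerance of the value iteration.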
Similarly, we decompose the one-step loss\n\nl(x) = Q(x, \pi(x)) - Q(x, \hat{\pi}(x))\n     = r(x, \pi(x)) - r(x, \hat{\pi}(x)) + \sum_y [q(y|x, \pi(x)) - q(y|x, \hat{\pi}(x))] r(y, \pi(y))\n     = r(x, \pi(x)) - r(x, \hat{\pi}(x)) + \sum_y l(y|x) r(y, \pi(y))\n\nas a function of the partial contributions l(y|x) = q(y|x, \pi(x)) - q(y|x, \hat{\pi}(x)) (see figure 1).\n\nFigure 1: The reward r(y, \pi(y)) at state y contributes to the one-step loss l(x) = Q(x, \pi(x)) - Q(x, \hat{\pi}(x)) with the proportion l(y|x) = q(y|x, \pi(x)) - q(y|x, \hat{\pi}(x)).\n\n3.2 Total derivative of the one-step loss and global loss\n\nSimilarly to section 2.2, we wish to express the contribution of all parameters (transition probabilities p and rewards r, considered as variables) to the one-step loss function by defining the total derivative of l as a function of those parameters.\n\nTheorem 2 Let us introduce the (formal) differential back-up operator dT_a defined, for any function W : X -> R, as\n\ndT_a W(x) = \gamma \sum_y W(y) dp(y|x, a) + dr(x, a)\n\n(similar to the back-up operator (1) but using dp and dr instead of p and r). The total derivative of the one-step loss is\n\ndl(x) = \sum_z l(z|x) dT_{\pi(z)} V(z) + dT_{\pi(x)} V(x) - dT_{\hat{\pi}(x)} V(x)\n\nunder the constraint \sum_y dp(y|x, a) = 0 for all x and a.\n\nTheorem 3 Let us introduce the one-step-loss back-up operator S and its formal differential version dS defined, for any function W : X -> R, as\n\nS W(x) = \gamma \sum_y p(y|x, \hat{\pi}(x)) W(y) + l(x)\ndS W(x) = \gamma \sum_y dp(y|x, \hat{\pi}(x)) W(y) + dl(x)\n\nThen, the loss L(x) at x satisfies Bellman's equation L = SL.
The total derivative of the loss L(x) and of the global loss L are, respectively,\n\ndL(x) = \sum_z I(z|x) dSL(z)\ndL = \sum_z I(z|P(.)) dSL(z)\n\nfrom which (after regrouping the contributions to each parameter) we deduce the partial derivatives of the global loss with respect to the rewards and transition probabilities.\n\n4 A fast algorithm\n\nWe use the sampling technique introduced in section 1.3. In order to compute the estimator (7) we calculate the partial derivatives \partial L^i / \partial \mu_j based on the results of the previous section, with the following algorithm.\n\nGiven the pdf over the parameters \mu, select a policy \hat{\pi} (for example the optimal policy of the most probable MDP). For i = 1..N, solve each MDP M^i and deduce its value function V^i, Q-values Q^i, and optimal policy \pi^i. Deduce the one-step loss l^i(x) from (11). Compute the influence I(x|P(.)) (which depends on the transition probabilities p^i of M^i) and the influence I(l^i(x|.)|P(.)), from which we deduce \partial L^i / \partial r(x, a). Then calculate L^i(x) by solving Bellman's equation L^i = S L^i and deduce \partial L^i / \partial p(y|x, a). These partial derivatives enable us to compute the unbiased estimator (7).\n\nThe complexity of solving a discounted MDP with K states, each one connected to M next states, is O(KM), as is the complexity of computing the influences. Thus, the overall complexity of this algorithm is O(NKM).\n\nConclusion\n\nBeing able to compute the contribution of each parameter (transition probabilities and rewards) to the value function (theorem 1) and to the loss in expected rewards incurred if we use an approximate policy (theorem 3) enables us to use sampling techniques to estimate which parameters' uncertainty is most harmful to the expected gain. A relevant resource allocation process would then add new computational resources to reduce the uncertainty over the true value of those parameters.
In the examples given in the introduction, this would mean performing new experiments in model-free RL to define more precisely the transition probabilities of some relevant states. In discretization techniques for continuous control problems, this would mean adding new grid points in order to improve the quality of the interpolation in relevant areas of the state space, so as to maximize the expected gain of the new policy. Initial experiments on variable resolution discretization using random grids show improved performance compared to [3].\n\nAcknowledgments\n\nI am grateful to Andrew Moore, Drew Bagnell and the Auton Lab members for motivating discussions.\n\nReferences\n\n[1] Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. Proceedings of Uncertainty in Artificial Intelligence, 1999.\n\n[2] Remi Munos. Decision-making under uncertainty: Efficiently estimating where extra resources are needed. Technical report, Ecole Polytechnique, 2002.\n\n[3] Remi Munos and Andrew Moore. Influence and variance of a Markov chain: Application to adaptive discretizations in optimal control. Proceedings of the 38th IEEE Conference on Decision and Control, 1999.\n\n[4] Remi Munos and Andrew W. Moore. Rates of convergence for variable resolution schemes in optimal control. International Conference on Machine Learning, 2000.\n\n[5] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 1994.\n\n[6] John Rust. Using randomization to break the curse of dimensionality. Computational Economics, 1997.\n", "award": [], "sourceid": 2043, "authors": [{"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}