{"title": "Distributionally Robust Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2505, "page_last": 2513, "abstract": "We consider Markov decision processes where the values of the parameters are uncertain. This uncertainty is described by a sequence of nested sets (that is, each set contains the previous one), each of which corresponds to a probabilistic guarantee for a different confidence level so that a set of admissible probability distributions of the unknown parameters is specified. This formulation models the case where the decision maker is aware of and wants to exploit some (yet imprecise) a-priori information of the distribution of parameters, and arises naturally in practice where methods to estimate the confidence region of parameters abound. We propose a decision criterion based on *distributional robustness*: the optimal policy maximizes the expected total reward under the most adversarial probability distribution over realizations of the uncertain parameters that is admissible (i.e., it agrees with the a-priori information). We show that finding the optimal distributionally robust policy can be reduced to a standard robust MDP where the parameters belong to a single uncertainty set, hence it can be computed in polynomial time under mild technical conditions.", "full_text": "Distributionally Robust Markov Decision Processes

Huan Xu
ECE, University of Texas at Austin
huan.xu@mail.utexas.edu

Shie Mannor
Department of Electrical Engineering, Technion, Israel
shie@ee.technion.ac.il

Abstract

We consider Markov decision processes where the values of the parameters are uncertain. This uncertainty is described by a sequence of nested sets (that is, each set contains the previous one), each of which corresponds to a probabilistic guarantee for a different confidence level, so that a set of admissible probability distributions of the unknown parameters is specified.
This formulation models the case where the decision maker is aware of and wants to exploit some (yet imprecise) a-priori information of the distribution of parameters, and arises naturally in practice where methods to estimate the confidence region of parameters abound. We propose a decision criterion based on distributional robustness: the optimal policy maximizes the expected total reward under the most adversarial probability distribution over realizations of the uncertain parameters that is admissible (i.e., it agrees with the a-priori information). We show that finding the optimal distributionally robust policy can be reduced to a standard robust MDP where the parameters belong to a single uncertainty set, hence it can be computed in polynomial time under mild technical conditions.

1 Introduction

Sequential decision making in stochastic dynamic environments, also called the "planning problem," is often modeled using a Markov Decision Process (MDP, cf. [1, 2, 3]). In practice, parameter uncertainty – the deviation of the model parameters from the true ones (reward r and transition probability p) – often causes the performance of "optimal" policies to degrade significantly [4]. Many efforts have been made to reduce such performance variation under the robust MDP framework (e.g., [5, 6, 7, 8, 9, 10]). In this context, it is assumed that the parameters can be any member of a known set (termed the uncertainty set), and solutions are ranked based on their performance under the (respective) worst parameter realizations.

In this paper we extend the robust MDP framework to deal with probabilistic information on uncertain parameters. To motivate the problem, let us consider the following example. Suppose that an agent (car, plane, robot, etc.) wants to find a fastest path from the source location to the destination.
If the passing time through area A is uncertain and can be very large, then the solution to the robust MDP would tend to take a detour and avoid A. However, if it is further known that the passing time can be large only when some unusual event happens (whose chance is less than, say, 10%), such as a storm, and that otherwise the passing time is reasonable, then avoiding A may be overly pessimistic. The statement "the probability of the (uncertain) passing time being large is at most 10%" is important, and should be incorporated into the decision-making paradigm. Indeed, it has been observed that since the robust MDP framework ignores probabilistic information, it can produce conservative solutions [11, 12].

A different approach to embedding prior information is to adopt a Bayesian perspective on the parameters of the problem; see [11] and references therein. However, a complete Bayesian prior on the model parameters may be difficult to conjure, as the decision maker may not have a reliable generative model of the uncertainty. For example, in the path planning problem above, the decision maker may not know how to assign probabilities to the model dynamics when a storm occurs. Our approach offers a middle ground between the fully Bayesian approach and the robust approach: we want the decision maker to be able to use prior information, but we do not require a complete Bayesian interpretation.

We adapt the distributionally robust approach to MDPs under parameter uncertainty. The distributionally robust formulation has been extensively studied and broadly applied in single-stage optimization problems to effectively incorporate a-priori probabilistic information about the unknown parameters (e.g., [13, 14, 15, 16, 17, 18]). In this framework, the uncertain parameters are regarded as stochastic, with a distribution μ that is not precisely observed, yet assumed to belong to an a-priori known set C.
The objective is then formulated based on a worst-case analysis over the distributions in C. That is, given a utility function u(x, ξ), where x ∈ X is the optimizing variable and ξ is the unknown parameter, distributionally robust optimization solves

\max_{x \in X} \Big[ \inf_{\mu \in C} \mathbb{E}_{\xi \sim \mu}\, u(x, \xi) \Big].

Indeed, this approach has also been developed in the mathematical finance community, usually in the static setup [19, 20]. There the goal is to optimize a so-called coherent risk measure, which is shown to be equivalent to a distributionally robust formulation.

From a decision theory perspective, the distributionally robust approach coincides with the celebrated MaxMin Expected Utility framework [21, 22], which states that if a preference relationship among actions satisfies certain axioms, then the optimal action maximizes the minimal expected utility with respect to a class of distributions. This approach addresses the famous neglect-of-probability cognitive bias [23], i.e., the tendency to completely disregard probability when making a decision under uncertainty. Two extreme cases of such biases are the normalcy bias, which, roughly speaking, can be stated as "since a disaster has never occurred, it never will occur," and the zero-risk bias, which stands for the tendency of individuals to prefer small benefits that are certain to large ones that are uncertain, regardless of the size of the "certain" benefit and the expected magnitude of the uncertain one.
It is easy to see that the nominal approach and the robust approach suffer from the normalcy bias and the zero-risk bias, respectively.

We formulate and solve the distributionally robust MDP with respect to the nested uncertainty set. The nesting structure implies that there are n different levels of estimation, that is, C^1_s ⊆ C^2_s ⊆ ⋯ ⊆ C^n_s, representing the possible parameters of the problem. The probability that the parameters of state s belong to C^i_s is at least λ_i. We also require the parameters to be state-wise independent (i.e., the uncertainty set is a product set over states). Policies are then ranked based on their expected performance under the (respective) most adversarial distribution. The main contribution of this paper is showing that for both the finite-horizon case and the discounted-reward infinite-horizon case, such an optimal policy satisfies a Bellman-type equation, and can be solved via backward iteration.

Motivating example. The nested-set formulation is motivated by the "multi-scenario" setup, where in different scenarios the parameters are subject to different levels of uncertainty. For instance, in the path planning example, the uncertainty of the passing time through A can be modeled with two nested uncertainty sets: the parameters belong with probability at least 90% to a small uncertainty set corresponding to "no storm," and are guaranteed to belong to a large worst-case uncertainty set representing "storm," which occurs with probability at most 10%. In fact, the multi-layer formulation allows the decision maker to handle more than two scenarios. For example, a plane can encounter scenarios such as "normal," "storm," "big storm," and even "volcanic ash," each corresponding to a different level of parameter uncertainty.
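As a back-of-the-envelope illustration of how such nested confidence statements combine (the passing-time numbers below are hypothetical, not taken from the paper), the most adversarial admissible distribution of a scalar cost over nested intervals simply places mass λ_i − λ_{i−1} at the worst point of each level:

```python
# Worst-case expected passing time under a two-level nested uncertainty set.
# The intervals and confidence levels here are illustrative assumptions.
levels = [0.9, 1.0]           # lambda_1, lambda_2: P(time in sets[i]) >= levels[i]
sets = [(1.0, 2.0),           # "no storm": passing time in [1, 2]
        (1.0, 20.0)]          # "storm" (outermost set): passing time in [1, 20]

def worst_case_expected_time(levels, sets):
    """For nested cost intervals, the adversary puts exactly
    (lambda_i - lambda_{i-1}) probability mass at the worst (highest-cost)
    point of each level, which is optimal under the nesting constraints."""
    total, prev = 0.0, 0.0
    for lam, (lo, hi) in zip(levels, sets):
        total += (lam - prev) * hi
        prev = lam
    return total

print(worst_case_expected_time(levels, sets))  # 0.9*2 + 0.1*20 = 3.8
```

The robust approach would instead price the passage at the outermost worst case (20), while the nominal approach would ignore the storm entirely; the weighted sum sits in between, which is precisely the effect the nested-set formulation is after.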
One appealing advantage of the nested-set formulation is that it does not require a precise description of the uncertainty, which leads to considerable flexibility. For example, if the uncertainty set of a robust MDP is not precisely known, then one can instead solve a distributionally robust MDP with a 2-set formulation, where the inner and the outer sets represent, respectively, an "optimistic" estimate and a "conservative" estimate. Additionally, the nested-set formulation also results from estimating the distributions of parameters via sampling. Such estimation is often imprecise, especially when only a small number of samples is available. In contrast, estimating uncertainty sets with high confidence can be done more accurately, and one can easily sharpen the approximation by incorporating more layers of confidence sets (i.e., increasing n).

2 Preliminaries and Problem Setup

A (finite) MDP is defined as a 6-tuple ⟨T, γ, S, A_s, p, r⟩ where: T is the possibly infinite decision horizon; γ ∈ (0, 1] is the discount factor; S is the finite state set; A_s is the finite action set of state s; p is the transition probability; and r is the expected reward. That is, for s ∈ S and a ∈ A_s, r(s, a) is the expected reward and p(s′|s, a) is the probability of reaching state s′. Following Puterman [1], we denote the set of all history-dependent randomized policies by Π^HR, and the set of all Markovian randomized policies by Π^MR. We use the subscript s to denote the value associated with state s; e.g., r_s denotes the vector of rewards associated with state s, and π_s is the (randomized) action chosen at state s by policy π. The elements of the vector p_s are listed in the following way: the transition probabilities of the same action are arranged in the same block, and inside each block they are listed according to the order of the next state.
We use s′ to denote the (random) next state following s, and Δ(s) to denote the probability simplex on A_s. We use ⊗ to denote a Cartesian product over states, e.g., p = ⊗_{s∈S} p_s. For a policy π, we denote the expected (discounted) total reward under parameters p, r by u(π, p, r), that is,

u(\pi, p, r) \triangleq \mathbb{E}^{p}_{\pi}\Big\{\sum_{i=1}^{T} \gamma^{i-1} r(s_i, a_i)\Big\}.

In this paper we propose and solve the distributionally robust policy under parameter uncertainty, which incorporates a-priori information on how the parameters are distributed. Suppose it is known that p and r follow some unknown distribution μ that belongs to a set C_S. We evaluate each policy by its expected performance under the (respective) most adversarial distribution of the uncertain parameters, and a distributionally robust policy is an optimal policy according to this measure.

Definition 1. A policy π* ∈ Π^HR is distributionally robust with respect to C_S if for all π ∈ Π^HR,

\inf_{\mu \in C_S} \int u(\pi, p, r)\, d\mu(p, r) \;\leq\; \inf_{\mu' \in C_S} \int u(\pi^*, p, r)\, d\mu'(p, r).

Next we specify the set of admissible distributions C_S of the uncertain parameters investigated in this paper. Let 0 = λ_0 ≤ λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n = 1, and P^1_s ⊆ P^2_s ⊆ ⋯ ⊆ P^n_s for s ∈ S. We use the following set of distributions C_S for our model:

C_S \triangleq \Big\{\mu \,\Big|\, \mu = \bigotimes_{s \in S} \mu_s;\; \mu_s \in C_s,\; \forall s \in S\Big\}, \qquad (1)

where C_s \triangleq \{\mu_s \,|\, \mu_s(P^n_s) = 1;\; \mu_s(P^i_s) \geq \lambda_i,\; i = 1, \cdots, n-1\}.

We briefly explain this set of distributions.
For a state s, the condition μ_s(P^n_s) = 1 means that the unknown parameters (p_s, r_s) are restricted to the outermost uncertainty set, and the condition μ_s(P^i_s) ≥ λ_i means that with probability at least λ_i, (p_s, r_s) ∈ P^i_s. Thus, P^1_s, ⋯, P^n_s provide probabilistic guarantees on (p_s, r_s) for n different uncertainty sets (or, equivalently, confidence levels). Note that ⊗_{s∈S} μ_s stands for the product measure generated by the μ_s, which indicates that the parameters of different states are independent. Throughout this paper we make the standard assumption (cf. [5, 6, 8]) that each P^i_s is nonempty, convex and compact.

3 Distributionally robust MDPs: The finite-horizon case

In this section we show how to find distributionally robust policies for MDPs with finitely many decision stages. We assume that when a state is visited multiple times, each time it can take a different parameter realization (the non-stationary model). Equivalently, this means that multiple visits to a state can be treated as visits to different states, which leads to Assumption 1 without loss of generality (by adding dummy states). Thus, we can partition S according to the stage each state belongs to, and let S_t be the set of states belonging to the t-th stage. The non-stationary model was proposed in [5] because the stationary model is generally intractable, and the non-stationary model provides a lower bound on it.

Assumption 1. (i) Each state belongs to only one stage; (ii) the terminal reward equals zero; and (iii) the first stage contains only one state s_ini.

We next define sequentially robust policies through a backward induction, as policies that are robust at every step of a standard robust MDP. We will later show that sequentially robust policies are also distributionally robust when the uncertainty set of the robust MDP is chosen carefully.

Definition 2. Let T < ∞ and let P_s be the uncertainty set of state s.
De\ufb01ne the following:\n\n3\n\n\f1. For s \u2208 ST , the sequentially robust value \u02dcvT (s) , 0.\n\n2. For s \u2208 St where t < T , the sequentially robust value \u02dcvt(s) and sequentially robust action\n\n\u02dc\u03c0s are de\ufb01ned as\n\n\u02dcvt(s) , max\n\n(ps,rs)\u2208Ps\n\n\u03c0s\u2208\u2206(s)n min\n\u03c0s\u2208\u2206(s)n min\n\n(ps,rs)\u2208Ps\n\nEps\n\n\u03c0s [r(s, a) + \u03b3\u02dcvt+1(s)]o.\n\u03c0s [r(s, a) + \u03b3\u02dcvt+1(s)]o.\n\nEps\n\n\u02dc\u03c0s \u2208 arg max\n\n3. A policy \u02dc\u03c0\u2217 is a sequentially robust policy w.r.t. Ps if \u2200s \u2208 S, \u02dc\u03c0\u2217\n\ns is a sequentially robust\n\naction.\n\nA standard game theoretic argument implies that sequentially robust actions, and hence sequentially\nrobust policies, exist. Indeed, from the literature in robust MDP (cf [5, 7, 8]) it is easy to see that a\n\nsequentially robust policy is the solution to the robust MDP where the uncertainty set isNs Ps. The\n\nfollowing theorem, which is the main result of this paper, shows that any sequentially robust policy\n(w.r.t. a speci\ufb01c uncertainty set) \u03c0\u2217 is distributionally robust.\nTheorem 1. Let T < \u221e. Let Assumption 1 hold, and suppose that \u03c0\u2217 is a sequentially robust\npolicy w.r.t. Ns\n\n\u02c6Ps, where\n\nn\n\n\u02c6Ps = {\n\nXi=1\n\n(\u03bbi \u2212 \u03bbi\u22121)(rs(i), ps(i))|(ps(i), rs(i)) \u2208 P i\n\ns}.\n\nThen\n\n1. \u03c0\u2217 is a distributionally robust policy with respect to Cs; and\n\n2. there exists \u00b5\u2217 \u2208 Cs such that (\u03c0\u2217, \u00b5\u2217) is a saddle point. That is,\n\n\u03c0\u2208\u03a0HRZ u(\u03c0, p, r) d\u00b5\u2217(p, r) = Z u(\u03c0\u2217, p, r) d\u00b5\u2217(p, r) = inf\n\nsup\n\n\u00b5\u2208CS Z u(\u03c0\u2217, p, r) d\u00b5(p, r).\n\nTherefore, to \ufb01nd the sequentially robust policy, we need only to solve the sequentially robust action.\nTheorem 2. Denote \u03bb0 = 0. 
For s ∈ S_t with t < T, a sequentially robust action is given by

q^* = \arg\max_{q \in \Delta(s)} \Big\{ \sum_{i=1}^{n} (\lambda_i - \lambda_{i-1}) \min_{(p^i_s, r^i_s) \in P^i_s} \big[ (r^i_s)^\top q + (p^i_s)^\top \tilde{V}_s\, q \big] \Big\}, \qquad (2)

where m = |A_s|, ṽ_{t+1} is the vector of values ṽ_{t+1}(s′) for all s′ ∈ S_{t+1}, and

\tilde{V}_s \triangleq \begin{bmatrix} \tilde{v}_{t+1}\, e_1(m)^\top \\ \vdots \\ \tilde{v}_{t+1}\, e_m(m)^\top \end{bmatrix},

with e_j(m) denoting the j-th standard basis vector of R^m.

Theorem 2 implies that the computation of the sequentially robust action at a state s critically depends on the structure of the sets P^i_s. In fact, it can be shown that for "good" uncertainty sets, computing the sequentially robust action is tractable. This claim is made precise by the following corollary; we omit the proof, which is standard.

Corollary 1. The sequentially robust action for state s can be found in polynomial time if, for each i = 1, ⋯, n, P^i_s has a polynomial separation oracle. Here, a polynomial separation oracle of a convex set H ⊆ R^n is a subroutine that, given x ∈ R^n, reports in polynomial time whether x ∈ H, and if the answer is negative, finds a hyperplane that separates x and H.

3.1 Proof of Theorem 1

We prove Theorem 1 in this section. The outline of the proof is as follows. We first show that for a given policy, the expected performance under an admissible μ depends only on the expected value of the parameters. Then we show that the set of expected parameters is indeed ⊗_{s∈S} P̂_s. Thus the distributionally robust MDP reduces to the robust MDP with ⊗_{s∈S} P̂_s as the uncertainty set. Finally, we prove the theorem by applying results from robust MDPs. Some of the intermediate results are stated with proofs omitted due to space constraints.

Let h_t denote a history up to stage t, and s(h_t) the last state of history h_t.
We use π_{h_t}(a) to denote the probability of choosing action a at state s(h_t), following policy π and under history h_t. A (t+1)-stage history, with h_t followed by action a and state s′, is written as (h_t, a, s′). With an abuse of notation, we denote the expected reward-to-go under a history as

u(\pi, p, r, h_t) \triangleq \mathbb{E}^{p}_{\pi}\Big\{\sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i) \,\Big|\, (s_1, a_1, \cdots, s_t) = h_t\Big\}.

For π ∈ Π^HR and μ ∈ C_S, define w(π, μ, h_t) ≜ E_{(p,r)∼μ} u(π, p, r, h_t) = ∫ u(π, p, r, h_t) dμ(p, r). Thus, w(π, μ, (s_ini)) = ∫ u(π, p, r) dμ(p, r) is the minimax objective. One can show that the following recursion formula for w(·) holds, due to the fact that μ(p, r) = ⊗_{s∈S} μ_s(p_s, r_s).

Lemma 1. Fix π ∈ Π^HR, μ ∈ C_S and a history h_t with t < T; denote \bar{r} = E_μ(r) and \bar{p} = E_μ(p). Then we have

w(\pi, \mu, h_t) = \int \sum_{a \in A_{s(h_t)}} \pi_{h_t}(a) \Big( r\big(s(h_t), a\big) + \sum_{s' \in S} \gamma\, p\big(s'|s(h_t), a\big)\, w\big(\pi, \mu, (h_t, a, s')\big) \Big)\, d\mu_{s(h_t)}(p_{s(h_t)}, r_{s(h_t)})

= \sum_{a \in A_{s(h_t)}} \pi_{h_t}(a) \Big( \bar{r}\big(s(h_t), a\big) + \sum_{s' \in S} \gamma\, \bar{p}\big(s'|s(h_t), a\big)\, w\big(\pi, \mu, (h_t, a, s')\big) \Big).

From Lemma 1, by backward induction, one can show that the following lemma holds. It essentially means that for any policy, the expected performance under an admissible distribution μ depends only on the expected value of the parameters under μ. Thus, the distributionally robust MDP reduces to a robust MDP.

Lemma 2. Fix π ∈ Π^HR and μ ∈ C_S; denote \bar{p} = E_μ(p) and \bar{r} = E_μ(r). We have

w\big(\pi, \mu, (s_{ini})\big) = u(\pi, \bar{p}, \bar{r}).

Next we characterize the set of expected values of the parameters.

Lemma 3.
Fix s ∈ S; then {E_{μ_s}(p_s, r_s) | μ_s ∈ C_s} = P̂_s.

Note that Lemma 3 implies that {E_μ(p, r) | μ ∈ C_S} = ⊗_{s∈S} P̂_s. We complete the proof of Theorem 1 using the equivalence of the distributionally robust MDP and the robust MDP whose uncertainty set is ⊗_{s∈S} P̂_s. Recall that for each s ∈ S, P̂_s is convex and compact. It is well known that for robust MDPs, a saddle point of the minimax objective exists (cf. [5, 8]). More precisely, there exist π* ∈ Π^HR and (p*, r*) ∈ ⊗_{s∈S} P̂_s such that

\sup_{\pi \in \Pi^{HR}} u(\pi, p^*, r^*) = u(\pi^*, p^*, r^*) = \inf_{(p, r) \in \bigotimes_{s \in S} \hat{P}_s} u(\pi^*, p, r).

Moreover, π* and (p*, r*) can be constructed state-wise: π* = ⊗_{s∈S} π*_s, (p*, r*) = ⊗_{s∈S}(p*_s, r*_s), and for each s ∈ S_t, π*_s and (p*_s, r*_s) solve the zero-sum game

\max_{\pi_s} \min_{(p_s, r_s) \in \hat{P}_s} \mathbb{E}^{p_s}_{\pi_s}\big( r(s, a) + \gamma \tilde{v}_{t+1}(s') \big).

It follows that π*_s is any sequentially robust action, and hence π* can be any sequentially robust policy. From Lemma 3, there exists μ*_s ∈ C_s that satisfies E_{μ*_s}(p_s, r_s) = (p*_s, r*_s). Let μ* = ⊗_{s∈S} μ*_s.
By Lemma 2 we have

\sup_{\pi \in \Pi^{HR}} w\big(\pi, \mu^*, (s_{ini})\big) = \sup_{\pi \in \Pi^{HR}} u(\pi, p^*, r^*);

w\big(\pi^*, \mu^*, (s_{ini})\big) = u\big(\pi^*, p^*, r^*\big);

\inf_{\mu \in C_S} w\big(\pi^*, \mu, (s_{ini})\big) = \inf_{(p, r) \in \bigotimes_s \hat{P}_s} u(\pi^*, p, r).

This leads to \sup_{\pi \in \Pi^{HR}} w(\pi, \mu^*, (s_{ini})) = w(\pi^*, \mu^*, (s_{ini})) = \inf_{\mu \in C_S} w(\pi^*, \mu, (s_{ini})). Thus, part (ii) of Theorem 1 holds. Note that part (ii) immediately implies part (i) of Theorem 1.

Remark: Lemma 1 holds for a broader class of distribution sets than the one discussed here. Indeed, the only requirement on C_S for Lemma 1 to hold is state-wise decomposability. Therefore, the results presented in this paper may well extend to distributionally robust MDPs whose parameters belong to other interesting sets of distributions, such as a set of parametric distributions (Gaussian, exponential, binomial, etc.) whose distribution parameters are not precisely determined.

4 Distributionally robust MDPs: The discounted-reward infinite-horizon case

In this section we show how to compute a distributionally robust policy for infinite-horizon MDPs. Specifically, we generalize the notion of sequentially robust policies to discounted-reward infinite-horizon MDPs, and show that it is distributionally robust in an appropriate sense.

Definition 3. Let T = ∞ and γ < 1. Denote the uncertainty set by P̂ = ⊗_s P̂_s. We define the following:

1. The sequentially robust value ṽ_∞(s) w.r.t.
P̂_s is the unique solution to the following set of equations:

\tilde{v}_\infty(s) = \max_{\pi_s \in \Delta(s)} \Big\{ \min_{(p_s, r_s) \in \hat{P}_s} \mathbb{E}^{p_s}_{\pi_s}\big[r(s, a) + \gamma \tilde{v}_\infty(s')\big] \Big\}, \quad \forall s \in S.

2. The sequentially robust action π̃_s w.r.t. P̂_s is given by

\tilde{\pi}_s \in \arg\max_{\pi_s \in \Delta(s)} \Big\{ \min_{(p_s, r_s) \in \hat{P}_s} \mathbb{E}^{p_s}_{\pi_s}\big[r(s, a) + \gamma \tilde{v}_\infty(s')\big] \Big\}.

3. A policy π̃* is a sequentially robust policy w.r.t. P̂_s if, for all s ∈ S, π̃*_s is a sequentially robust action.

The sequentially robust policy is well defined, since the following operator L : R^{|S|} → R^{|S|} is a γ-contraction in the ‖·‖_∞ norm:

\{Lv\}(s) \triangleq \max_{q \in \Delta(s)} \min_{(p, r) \in \hat{P}_s} \Big[ \sum_{a \in A_s} q(a)\, r(s, a) + \gamma \sum_{a \in A_s} \sum_{s' \in S} q(a)\, p(s'|s, a)\, v(s') \Big].

Furthermore, given any v, applying L is equivalent to solving a minimax problem, which by Theorem 2 can be computed efficiently. Hence, by applying L repeatedly to any initial v^0 ∈ R^{|S|}, the resulting value vector converges to the sequentially robust value ṽ_∞ exponentially fast.

Note that in the infinite-horizon case, we cannot model the system as (1) having finitely many states, and (2) each visited at most once. Instead, we have to relax one of these two assumptions, leading to two different natural formulations. The first formulation, termed the non-stationary model, treats the system as having infinitely many states, each visited at most once. Therefore, we consider an equivalent MDP with an augmented state space, where each augmented state is defined as a pair (s, t) with s ∈ S and t, meaning state s at the t-th horizon.
Thus, each augmented state will be visited at most once, which leads to the following set of distributions:

\bar{C}^\infty_S \triangleq \Big\{\mu \,\Big|\, \mu = \bigotimes_{s \in S,\, t = 1, 2, \cdots} \mu_{s,t};\; \mu_{s,t} \in C_s,\; \forall s \in S,\; \forall t = 1, 2, \cdots \Big\}.

The second formulation, termed the stationary model, treats the system as having a finite number of states, while multiple visits to one state are allowed. That is, if a state s is visited multiple times, then each time the distribution (of the uncertain parameters) μ_s is the same. Mathematically, we can use the same augmented state space as in the non-stationary model and require that μ_{s,t} not depend on t. Thus, the set of admissible distributions is

\bar{C}_S \triangleq \Big\{\mu \,\Big|\, \mu = \bigotimes_{s \in S,\, t = 1, 2, \cdots} \mu_{s,t};\; \mu_{s,t} = \mu_s;\; \mu_s \in C_s,\; \forall s \in S,\; \forall t = 1, 2, \cdots \Big\}.

The next theorem is the main result of this section; it shows that a sequentially robust policy is distributionally robust in both the stationary and the non-stationary model.

Theorem 3. Given T = ∞ and γ < 1, any sequentially robust policy w.r.t. ⊗_s P̂_s, where P̂_s = {Σ_{i=1}^n (λ_i − λ_{i−1})(p_s(i), r_s(i)) | (p_s(i), r_s(i)) ∈ P^i_s}, is distributionally robust with respect to \bar{C}^∞_S, and with respect to \bar{C}_S.

Due to space constraints, we omit the proof details. The basic idea for proving the \bar{C}^∞_S case is to consider a T̂-truncated problem, i.e., a finite-horizon problem that stops at stage T̂ with termination reward ṽ_∞(·), and to show that the optimal strategy for this problem, which is a sequentially robust strategy, coincides with that of the infinite-horizon one.
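To make the iteration v ← Lv concrete, here is a minimal sketch under simplifying assumptions that are ours, not the paper's: the transition probabilities are known exactly, and only the rewards are uncertain, each r(s, a) carrying its own nested interval sets. Under reward-only rectangular uncertainty the inner minimum is linear in the randomized action, so maximizing over pure actions suffices; with transition uncertainty one must solve the full minimax problem of equation (2) instead.

```python
import numpy as np

def worst_mean_reward(intervals, levels):
    # Adversarial expected reward for nested intervals [lo, hi] with
    # confidence levels lambda_1 <= ... <= lambda_n = 1: place probability
    # mass (lambda_i - lambda_{i-1}) at the worst (lowest) point of level i.
    total, prev = 0.0, 0.0
    for lam, (lo, hi) in zip(levels, intervals):
        total += (lam - prev) * lo
        prev = lam
    return total

def robust_value_iteration(P, reward_sets, levels, gamma=0.9, iters=500):
    # P[s, a, s'] are the (assumed exactly known) transition probabilities;
    # reward_sets[s][a] is the list of nested intervals for r(s, a).
    S, A = P.shape[0], P.shape[1]
    v = np.zeros(S)
    for _ in range(iters):          # L is a gamma-contraction, so this converges
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                q[s, a] = worst_mean_reward(reward_sets[s][a], levels) \
                          + gamma * P[s, a] @ v
        v = q.max(axis=1)           # pure actions suffice here (see lead-in)
    return v
```

The repeated application of L converges geometrically at rate γ, mirroring the contraction argument above; the per-state backup is exactly the λ-weighted worst-case average that defines P̂_s.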
Indeed, given any sequentially robust strategy π*, one can construct a stationary distribution μ* such that (π*, μ*) is a saddle point for \sup_{\pi \in \Pi^{HR}} \inf_{\mu \in \bar{C}^\infty_S} \int u(\pi, p, r)\, d\mu(p, r). The proof for \bar{C}_S then follows from \bar{C}_S \subset \bar{C}^\infty_S and \mu^* \in \bar{C}_S.

We remark that the decision maker is allowed to use non-stationary strategies, although the distributionally robust solution is proven to be stationary.

Before concluding this section, we briefly compare the stationary model and the non-stationary model. The two formulations model different setups: if the system, more specifically the distribution of the uncertain parameters, evolves with time, then the non-stationary model is more appropriate; if the system is static, then the stationary model is preferable. For any given policy, the worst-case expected performance under the non-stationary model provides a lower bound on that of the stationary model, since \bar{C}_S \subseteq \bar{C}^\infty_S. Thus, one can use the non-stationary model to approximate the stationary model when the latter is intractable (e.g., in the finite-horizon case; see Nilim and El Ghaoui [5]). As the horizon approaches infinity, this approximation becomes exact: as we showed in this section, the optimal solutions of the two formulations coincide, and can be computed by iteratively solving a minimax problem.

5 Numerical simulations

In this section we illustrate with numerical examples that, by incorporating additional probabilistic information, the distributionally robust approach handles uncertainty in a more flexible way, which often leads to better performance than the nominal approach and the robust approach.

We consider a path planning problem: an agent wants to exit a 4 × 21 maze (shown in Figure 1) using the least possible time.
Starting from the upper-left corner, the agent can move up, down, left and right, but can only exit the grid at the lower-right corner. A white box stands for a normal place, where the agent needs one time unit to pass through; a shaded box represents a "shaky" place. To be more specific, we consider two setups. The first is the uncertain-cost case, where the true (yet unknown to the planning agent) time for the agent to pass through a "shaky" place equals x = 1 + ẽ(λ), with ẽ(λ) an exponentially distributed random variable with parameter λ. The three approaches are formulated as follows: the nominal approach takes the most likely value (i.e., 1) as the parameter; the robust approach takes [1, 1 + 3/λ] as the uncertainty set; and the distributionally robust approach takes into account the additional information that Pr(x ∈ [1, 1 + log 2/λ]) ≥ 0.5 and Pr(x ∈ [1, 1 + 2 log 2/λ]) ≥ 0.75. We vary 1/λ and test these approaches using 300 runs for each parameter set. The results are reported in Figure 2(a).

Figure 1: The maze for the path planning problem.

The second setup is the uncertain-transition case: if the agent reaches a "shaky" place, then the transition becomes unpredictable: in the next step, with probability 20%, it makes an (unknown) jump. The three approaches are set up as follows. The nominal approach neglects this random jump. The robust approach takes a worst-case view, i.e., it assumes that with probability 20% the agent jumps to the spot with the highest cost-to-go. The distributionally robust approach takes into account the additional information that, if a jump happens, the probability that it lands on a spot to the left of the current place is no more than γ.
Each policy is tested over 300 runs, with the true jump set as follows: with probability 0.2γ the agent returns to the starting point ("reboot"), and with probability 0.2(1 − γ) the agent stays in the current position for one time unit ("stuck"). The results are reported in Figure 2(b).

Figure 2: Simulation results of the path planning problem. (a) Uncertain cost: time to exit versus 1/λ. (b) Uncertain transition: time to exit versus γ. Each panel compares the nominal, robust, and distributionally robust ("dis. rob.") policies.

In both the uncertain-cost and the uncertain-transition setups, the distributionally robust approach outperforms the other two approaches over virtually the whole range of parameters. This is to be expected, since additional probabilistic information is available to, and incorporated by, the distributionally robust approach.

6 Concluding remarks

In this paper we proposed a distributionally robust approach that mitigates the conservatism of the robust MDP framework by incorporating additional a-priori probabilistic information about the unknown parameters. In particular, we considered nested-set structured parameter uncertainty to model a-priori probabilistic information about the parameters. We proposed to find a policy that achieves the maximum expected utility under the worst admissible distribution of the parameters. This formulation leads to a policy that is obtained through a Bellman-type backward induction, and can be computed in polynomial time under mild technical conditions.

A different perspective on our work is that we develop a principled approach to the problem of uncertainty set design in multi-stage decision problems.
It has been observed that shrinking the uncertainty set in single-stage problems leads to better performance. We provide a principled approach to the problem of uncertainty set selection: the distributionally robust policy is a robust policy w.r.t. a carefully designed single uncertainty set that depends on the a-priori knowledge.

A natural question is how we can take advantage of the distributionally robust approach and solve (exactly) a full-blown Bayesian generative model MDP. The problem with taking an increasingly refined nested uncertainty structure (i.e., increasing n) is that of representation: the equivalent robust MDP uncertainty set may become too complicated to represent efficiently. Nevertheless, if it is possible to offer upper and lower bounds on the probability of each nested set (based on the generative model), the corresponding distributionally robust policies provide performance bounds on the optimal policies in the, often intractable, Bayesian model.

Acknowledgements

We thank an anonymous reviewer for pointing out relevant references in mathematical finance. H. Xu would like to acknowledge the support from DTRA grant HDTRA1-08-0029. S. Mannor would like to acknowledge the support of the Israel Science Foundation under contract 890015.

References

[1] M. L. Puterman. Markov Decision Processes. John Wiley & Sons, New York, 1994.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[3] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[4] S. Mannor, D. Simester, P. Sun, and J. Tsitsiklis. Bias and variance in value function estimation. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[5] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, September 2005.
[6] A. Bagnell, A.
Ng, and J. Schneider. Solving uncertain Markov decision problems. Technical Report CMU-RI-TR-01-25, Carnegie Mellon University, August 2001.
[7] C. C. White III and H. K. El Deib. Markov decision processes with imprecise transition probabilities. Operations Research, 42(4):739–748, July 1992.
[8] G. N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
[9] L. G. Epstein and M. Schneider. Learning under ambiguity. Review of Economic Studies, 74(4):1275–1303, 2007.
[10] A. Nilim and L. El Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. In Advances in Neural Information Processing Systems 16, 2004.
[11] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, (1):203–213, 2010.
[12] H. Xu and S. Mannor. The robustness-performance tradeoff in Markov decision processes. In B. Schölkopf, J. C. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, pages 1537–1544. MIT Press, 2007.
[13] H. Scarf. A min-max solution of an inventory problem. In Studies in Mathematical Theory of Inventory and Production, pages 201–209. Stanford University Press, 1958.
[14] J. Dupačová. The minimax approach to stochastic programming and an illustrative application. Stochastics, 20:73–88, 1987.
[15] P. Kall. Stochastic programming with recourse: Upper bounds and moment problems, a review. In Advances in Mathematical Optimization. Akademie-Verlag, Berlin, 1988.
[16] A. Shapiro. Worst-case distribution analysis of stochastic programs. Mathematical Programming, 107(1):91–96, 2006.
[17] I. Popescu. Robust mean-covariance solutions for stochastic optimization. Operations Research, 55(1):98–112, 2007.
[18] E. Delage and Y. Ye.
Distributionally robust optimization under moment uncertainty with applications to data-driven problems. To appear in Operations Research, 2010.
[19] A. Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, Series B, 125:235–261, 2010.
[20] H. Föllmer and A. Schied. Stochastic Finance: An Introduction in Discrete Time. Berlin: Walter de Gruyter, 2002.
[21] I. Gilboa and D. Schmeidler. Maxmin expected utility with a non-unique prior. Journal of Mathematical Economics, 18:141–153, 1989.
[22] D. Kelsey. Maxmin expected utility and weight of evidence. Oxford Economic Papers, 46:425–444, 1994.
[23] J. Baron. Thinking and Deciding. Cambridge University Press, 2000.
", "award": [], "sourceid": 251, "authors": [{"given_name": "Huan", "family_name": "Xu", "institution": null}, {"given_name": "Shie", "family_name": "Mannor", "institution": null}]}