{"title": "The Robustness-Performance Tradeoff in Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1537, "page_last": 1544, "abstract": null, "full_text": "The Robustness-Performance Tradeoff in Markov Decision Processes\nHuan Xu, Shie Mannor\nDepartment of Electrical and Computer Engineering, McGill University, Montreal, Quebec, Canada, H3A2A7\nxuhuan@cim.mcgill.ca, shie@ece.mcgill.ca\n\nAbstract\nComputation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worst-case analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst-case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performance-robustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the \"optimal\" tradeoff might be non-Markovian and hence is in general not tractable.\n\n1\n\nIntroduction\n\nIn many decision problems the parameters of the problem are inherently uncertain. This uncertainty, termed parameter uncertainty, can be the result of estimating the parameters from a finite sample or a specification of the parameters that itself includes uncertainty. The standard approach in decision making to circumvent the adverse effect of the parameter uncertainty is to find a solution that performs best under the worst possible parameters. This approach, termed the \"robust\" approach, has been used in both single stage ([1]) and multi-stage decision problems (e.g., [2]). In robust optimization problems, it is usually assumed that the constraint parameters are uncertain.
By requiring the solution to be feasible for all possible parameters within the uncertainty set, Soyster ([1]) solved the column-wise independent uncertainty case, and Ben-Tal and Nemirovski ([3]) solved the row-wise independent case. In robust MDP problems, there may be two different types of parameter uncertainty, namely, reward uncertainty and transition probability uncertainty. Under the assumption that the uncertainty is state-wise independent (an assumption made by all papers to date, to the best of our knowledge), the optimality principle holds and the problem can be decomposed into a series of step-by-step mini-max problems solved by backward induction ([2, 4, 5]). The above cited results focus on worst-case analysis. This implies that the vector of nominal parameters (the parameters used as an approximation of the true ones regardless of the uncertainty) is not treated in a special way and is just an element of the set of feasible parameters. The objective of the worst-case analysis is to eliminate the possibility of disastrous performance. There are several disadvantages to this approach. First, worst-case analysis may lead to an overly conservative solution, i.e., a solution which provides mediocre performance under all possible parameters. Second, the desirability of the solution highly depends on the precise modeling of the uncertainty set, which is often based on some ad-hoc criterion. Third, it may happen that the nominal parameters are close to the real parameters, so that the performance of the solution under the nominal parameters may provide important information for predicting the performance under the true parameters. Finally, there is a certain tradeoff relationship between the worst-case performance and the nominal performance: if the decision maker insists on maximizing one criterion, the other criterion may decrease dramatically.
On the other hand, relaxing both criteria may lead to a well balanced solution with both satisfactory nominal performance and reasonable robustness to parameter uncertainty. In this paper we capture the Robustness-Performance (RP) tradeoff explicitly. We use the worst-case behavior of a solution as the function representing its robustness, and formulate the decision problem as an optimization of both the robustness criterion and the performance under the nominal parameters simultaneously. Here, \"simultaneously\" is achieved by optimizing the weighted sum of the performance criterion and the robustness criterion. To the best of our knowledge, this is the first attempt to address the over-conservativeness of worst-case analysis in robust MDPs. Instead of optimizing the weighted sum of robustness and performance for some specific weights, we show how to efficiently find the solutions for all possible weights. We prove that the set of these solutions is in fact equivalent to the set of all Pareto efficient solutions in the robustness-performance space. Therefore, we solve the tradeoff problem without choosing a specific tradeoff parameter, and leave the subjective decision of determining the exact tradeoff to the decision maker. Instead of arbitrarily claiming that a certain solution is a good tradeoff, our algorithm computes the whole tradeoff relationship so that the decision maker can choose the most desirable solution according to her preference, which is usually complicated and rarely available in explicit form. Our approach thus avoids the tuning of tradeoff parameters, for which generally no good a-priori method exists. This is opposed to certain relaxations of the worst-case robust optimization approach like [6] (for the single stage only), where explicit tradeoff parameters have to be chosen. Unlike risk sensitive learning approaches [7, 8, 9], which aim to tune a strategy online, our approach computes a robust strategy offline, without trial and error.
The paper is organized as follows. Section 2 is devoted to the RP tradeoff for Linear Programming. In Section 3 and Section 4 we discuss the RP tradeoff for MDPs with uncertain rewards and uncertain transition probabilities, respectively. In Section 5 we present a computational example. Some concluding remarks are offered in Section 6.\n\n2\n\nParametric linear programming and RP tradeoffs in optimization\n\nIn this section, we briefly recall Parametric Linear Programming (PLP) [10, 11, 12], and show how it can be used to find the whole set of Pareto efficient solutions for RP tradeoffs in Linear Programming. This serves as the basis for the discussion of RP tradeoffs in MDPs.\n\n2.1 Parametric Linear Programming\n\nA Parametric Linear Program (PLP) is the following set of infinitely many optimization problems: For all λ ∈ [0, 1]: Minimize over x: λ (c(1))^T x + (1 − λ) (c(2))^T x, Subject to: Ax = b, x ≥ 0. (1)\n\nWe call c(1) the first objective, and c(2) the second objective. We assume that the Linear Program (LP) is feasible and bounded for both objectives. Although there are uncountably many possible λ, Problem (1) can be solved by a simplex-like algorithm. Here, \"solve\" means that for each λ, we find at least one optimal solution. An outline of the PLP algorithm is given in Algorithm 1, which is essentially a tableau simplex algorithm in which the entering variable is determined in a specific way. See [10] for a precise description. Algorithm 1. 1. Find a basic feasible optimal solution for λ = 0. If multiple solutions exist, choose one among those with minimal (c(1))^T x. 2. Record the current basic feasible solution. Check the reduced costs (i.e., the zero row of the simplex tableau) of the first objective, denoted c_j^(1). If none of them is negative, end. 3. Among all columns with negative c_j^(1), choose the one with the largest ratio |c_j^(1)/c_j^(2)| as the entering variable. 4. Pivot the base, and go to 2.
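The structure that Algorithm 1 exploits can be illustrated with a naive sketch (illustrative data, not from the paper): since for every λ the optimum of the weighted objective in (1) is attained at a vertex of the feasible region, sweeping λ only ever switches between finitely many vertices, so the optimal value is continuous and piecewise linear in λ. Here we simply enumerate the vertices of a toy feasible set instead of pivoting.

```python
# Naive illustration of the parametric objective in (1): for each weight
# lam in [0, 1] the optimum is attained at a vertex, and the optimal value
# is piecewise linear in lam.  The vertices and objectives below are
# hypothetical example data, not taken from the paper.

vertices = [(0.0, 2.0), (2.0, 0.0), (1.0, 1.5)]
c1 = (3.0, 1.0)   # first objective  c(1)
c2 = (1.0, 4.0)   # second objective c(2)

def parametric_opt(lam):
    """Return (optimal value, optimal vertex) for weight lam in [0, 1]."""
    def obj(x):
        return lam * (c1[0] * x[0] + c1[1] * x[1]) \
             + (1 - lam) * (c2[0] * x[0] + c2[1] * x[1])
    best = min(vertices, key=obj)
    return obj(best), best

# Sweep lam: the optimal vertex changes only at finitely many breakpoints,
# hence the value function is continuous and piecewise linear in lam.
values = [parametric_opt(k / 10)[0] for k in range(11)]
```

Algorithm 1 obtains the same breakpoints far more efficiently by pivoting between neighboring basic feasible solutions instead of enumerating all vertices.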
This algorithm is based on the observation that for any λ, there exists an optimal basic feasible solution. Hence, by finding a suitable subset of all vertices of the feasible region, we can solve the PLP. Furthermore, we can find this subset by sequentially pivoting among neighboring extreme points, as the simplex algorithm does. The algorithm terminates after finitely many iterations. It is also known that the optimal value of a PLP is a continuous piecewise linear function of λ. The worst-case computational cost is exponential, although in practice the algorithm works well; this property is shared by all simplex-based algorithms. A detailed discussion of PLP can be found in [10, 11, 12].\n\n2.2 RP tradeoffs in Linear Programming\n\nConsider the following LP: NOMINAL PROBLEM: Minimize: c^T x, Subject to: Ax ≤ b. (2) Here A ∈ R^{n×m}, x ∈ R^m, b ∈ R^n, c ∈ R^m.\n\nSuppose that the constraint matrix A is only a guess of the unknown true parameter A^r, which is known to belong to a set A (we call A the uncertainty set). We assume that A is constraint-wise independent and polyhedral for each of the constraints. That is, A = A_1 × ⋯ × A_n, and for each i there exist a matrix T(i) and a vector v(i) such that A_i = {a(i) | T(i) a(i) ≤ v(i)}. To quantify how a solution x behaves with respect to the parameter uncertainty, we define the following criterion, to be minimized, as its robustness measure (more accurately, non-robustness measure): p(x) ≜ sup_{Ã∈A} ||[Ãx − b]^+||_1 = sup_{Ã∈A} Σ_{i=1}^n max( ã(i)^T x − b_i, 0 ) = Σ_{i=1}^n max( sup_{ã(i): T(i)ã(i) ≤ v(i)} ã(i)^T x − b_i, 0 ). (3) Here [·]^+ stands for the positive part of a vector, ã(i) is the ith row of the matrix Ã, and b_i is the ith element of b. In words, the function p(x) is the largest possible sum of constraint violations. Using the weighted sum of the performance and robustness objectives as the objective to be minimized, we formulate the explicit tradeoff between robustness and performance as: GENERAL PROBLEM: For λ ∈ [0, 1]: Minimize: λ c^T x + (1 − λ) p(x), Subject to: Ax ≤ b. (4)\n\nBy the duality theorem, for a given x, sup_{ã(i): T(i)ã(i) ≤ v(i)} ã(i)^T x equals the optimal value of the following LP over y(i): Minimize: v(i)^T y(i), Subject to: T(i)^T y(i) = x, y(i) ≥ 0.\n\nThus, by adding slack variables, we rewrite GENERAL PROBLEM as the following PLP and solve it using Algorithm 1: GENERAL PROBLEM (PLP): For λ ∈ [0, 1]: Minimize: λ c^T x + (1 − λ) 1^T z, Subject to: Ax ≤ b; T(i)^T y(i) = x, v(i)^T y(i) − b_i ≤ z_i, z_i ≥ 0, y(i) ≥ 0; i = 1, 2, …, n. (5)\n\nHere, 1 stands for a vector of ones of length n, z_i is the ith element of z, and x, y(i), z are the optimization variables.\n\n3\n\nThe robustness-performance tradeoff for MDPs with uncertain rewards\n\nA (finite) MDP is defined as a 5-tuple ⟨T, S, A_s, p(·|s, a), r(s, a)⟩ where: T is the (possibly infinite) set of decision stages; S is the state set; A_s is the action set of state s; p(·|s, a) are the transition probabilities; and r(s, a) is the expected reward in state s under action a ∈ A_s. We use r to denote the vector combining the rewards of all state-action pairs and r_s to denote the vector combining all rewards of state s; thus, r(s, a) = r_s(a). Both S and A_s are assumed finite. Both p and r are time invariant. In this section, we consider the case where r is not known exactly. More specifically, we have a nominal parameter r̄(s, a) which is believed to be a reasonably good guess of the true reward. The reward r is known to belong to a bounded set R. We further assume that the uncertainty set R is state-wise independent and a polytope for each state. That is, R = ∏_{s∈S} R_s, and for each s ∈ S there exist a matrix C_s and a vector d_s such that R_s = {r_s | C_s r_s ≥ d_s}. We assume that for different visits of one state, the realization of the reward need not be identical and may take different values within the uncertainty set.
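For intuition, a box (interval) uncertainty set, r_s ∈ [r̄_s − δ_s, r̄_s + δ_s] coordinate-wise, is a special case of the polytopal form above, and for it the worst-case expected one-step reward of a randomized action choice has a closed form: since the minimization decouples across actions and the weights are nonnegative, each reward is driven to its lower endpoint. The sketch below uses hypothetical numbers (r_bar, delta, q are illustration data, not from the paper).

```python
# Box-uncertainty special case of the state-wise reward uncertainty set:
# r_s lies coordinate-wise in [r_bar - delta, r_bar + delta].  For a
# randomized action choice q (q >= 0, sums to 1), the worst-case expected
# one-step reward  min_{r_s} sum_a q(a) r_s(a)  equals
# sum_a q(a) * (r_bar(a) - delta(a)).  All numbers are illustration data.

r_bar = [4.0, 1.0, 2.5]     # nominal rewards per action
delta = [1.0, 0.5, 2.0]     # interval half-widths per action
q     = [0.2, 0.5, 0.3]     # randomized policy at this state

def nominal_reward(q, r_bar):
    return sum(qa * rb for qa, rb in zip(q, r_bar))

def worst_case_reward(q, r_bar, delta):
    # Each coordinate can be pushed to its lower end independently,
    # so the inner minimization decouples across actions.
    return sum(qa * (rb - d) for qa, rb, d in zip(q, r_bar, delta))

def weighted(q, lam):
    # One-state analogue of the RP tradeoff objective with weight lam.
    return lam * nominal_reward(q, r_bar) \
         + (1 - lam) * worst_case_reward(q, r_bar, delta)
```

For general polytopal sets R_s the inner minimum has no such closed form, which is exactly why the paper dualizes it into the LP variables y_s.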
The set of admissible control policies for the decision maker is the set of randomized history-dependent policies, which we denote by Π^HR. In the following three subsections we discuss different standard reward criteria: cumulative reward with a finite horizon, discounted reward with infinite horizon, and limiting average reward with infinite horizon under a unichain assumption.\n\n3.1 Finite horizon case\n\nIn the finite horizon case (T = {1, …, N}), we assume without loss of generality that each state belongs to only one stage, which is equivalent to the assumption of non-stationary reward realization, and use S_i to denote the set of states at the ith stage. We also assume that the first stage consists of only one state s_1, and that there are no terminal rewards. We define the following two functions as the performance measure and the robustness measure of a policy π ∈ Π^HR: P(π) ≜ E^π{ Σ_{i=1}^{N−1} r̄(s_i, a_i) }, (6) R(π) ≜ min_{r∈R} E^π{ Σ_{i=1}^{N−1} r(s_i, a_i) }. The minimum is attainable, since R is compact and the total expected reward is a continuous function of r. We say that a strategy is Pareto efficient if it attains the maximal P(π) among all strategies with the same value of R(π). The following result is straightforward; the proof can be found in the full version of the paper. Proposition 1. 1. If π is a Pareto efficient strategy, then there exists a λ ∈ [0, 1] such that π ∈ arg max_{π'∈Π^HR} {λP(π') + (1 − λ)R(π')}. 2. If π ∈ arg max_{π'∈Π^HR} {λP(π') + (1 − λ)R(π')} for some λ ∈ (0, 1), then π is a Pareto efficient strategy. For 0 ≤ t ≤ N, s ∈ S_t, and λ ∈ [0, 1] define: P_t(π, s) ≜ E^π{ Σ_{i=t}^{N−1} r̄(s_i, a_i) | s_t = s }, R_t(π, s) ≜ min_{r∈R} E^π{ Σ_{i=t}^{N−1} r(s_i, a_i) | s_t = s }, (7) c_t^λ(s) ≜ max_{π∈Π^HR} { λP_t(π, s) + (1 − λ)R_t(π, s) }. We now consider the maximin problem in each state and show how to find the solutions for all λ in one pass. We also prove that c_t^λ(s) is piecewise linear in λ. Let S_{t+1} = {s_1, …, s_k}.
Assume that for all j ∈ {1, …, k}, c_{t+1}^λ(s_j) is a continuous piecewise linear function of λ. Thus, we can divide [0, 1] into finitely many (say n) intervals [0, λ_1], …, [λ_{n−1}, 1] such that in each interval all the functions c_{t+1}^λ are linear. That is, there exist constants l_i^j and m_i^j such that c_{t+1}^λ(s_j) = l_i^j + λ m_i^j for λ ∈ [λ_{i−1}, λ_i]. By the duality theorem, c_t^λ(s) equals the optimal value of the following LP over y and q. Maximize: (1 − λ) d_s^T y + λ r̄_s^T q + Σ_{j=1}^k Σ_{a∈A_s} p(s_j|s, a) q(a) c_{t+1}^λ(s_j), Subject to: C_s^T y = q, 1^T q = 1, q, y ≥ 0. Observe that the feasible set is the same for all λ. Substituting c_{t+1}^λ(s_j) and rearranging, it follows that for λ ∈ [λ_{i−1}, λ_i] the objective function equals λ Σ_{a∈A_s} [ r̄(s, a) + Σ_{j=1}^k p(s_j|s, a)(l_i^j + m_i^j) ] q(a) + (1 − λ) [ d_s^T y + Σ_{a∈A_s} Σ_{j=1}^k p(s_j|s, a) l_i^j q(a) ]. We set P_N ≜ R_N ≜ c_N^λ ≜ 0, and note that c_1^λ(s_1) is the optimal RP tradeoff with weight λ. The following theorem shows that the principle of optimality holds for c^λ. The proof is omitted since it follows similarly to standard backward induction in finite horizon robust decision problems. Theorem 1. For s ∈ S_t, t < N, let Δ_s be the probability simplex on A_s; then c_t^λ(s) = max_{q∈Δ_s} { λ Σ_{a∈A_s} r̄(s, a) q(a) + (1 − λ) min_{r_s∈R_s} Σ_{a∈A_s} r_s(a) q(a) + Σ_{s'∈S_{t+1}} Σ_{a∈A_s} p(s'|s, a) q(a) c_{t+1}^λ(s') }. (8) Thus, for λ ∈ [λ_{i−1}, λ_i], starting from the optimal solution for λ_{i−1}, we can solve for all λ using Algorithm 1. Furthermore, we need not re-initialize for each interval, since the optimal solution at the end of the ith interval is also the optimal solution at the beginning of the next interval. It is obvious that the resulting c_t^λ(s) is also continuous and piecewise linear in λ. Thus, since c_N^λ ≜ 0, the assumption of continuous and piecewise linear value functions holds by backward induction.\n\n3.2 Discounted reward infinite horizon case\n\nIn this section we address the RP tradeoff for infinite horizon MDPs with a discounted reward criterion.
For a fixed λ, the problem is equivalent to a zero-sum game, with the decision maker trying to maximize the weighted sum and Nature trying to minimize it by selecting an adversarial reward realization. A well known result in discounted zero-sum stochastic games states that, even if non-stationary policies are admissible, a Nash equilibrium in which both players choose a stationary policy exists; see Proposition 7.3 in [13]. Given an initial state distribution α(s) and a discount factor γ ∈ (0, 1), it is also a known result [14] that there exists a one-to-one correspondence between the state-action frequencies Σ_{i=1}^∞ γ^{i−1} E^π(1_{s_i=s, a_i=a}) of stationary strategies and the vectors belonging to the following polytope X: Σ_{a∈A_{s'}} x(s', a) − γ Σ_{s∈S} Σ_{a∈A_s} p(s'|s, a) x(s, a) = α(s'), ∀s'; x(s, a) ≥ 0, ∀s, a ∈ A_s. (9) Since it suffices to consider a stationary policy for Nature, the tradeoff problem becomes: Maximize: λ Σ_{s∈S} Σ_{a∈A_s} r̄(s, a) x(s, a) + (1 − λ) inf_{r∈R} Σ_{s∈S} Σ_{a∈A_s} r(s, a) x(s, a), (10) Subject to: x ∈ X. By LP duality, Problem (10) can be rewritten as the following PLP and solved by Algorithm 1: Maximize: λ Σ_{s∈S} Σ_{a∈A_s} r̄(s, a) x(s, a) + (1 − λ) Σ_{s∈S} d_s^T y_s, Subject to: Σ_{a∈A_{s'}} x(s', a) − γ Σ_{s∈S} Σ_{a∈A_s} p(s'|s, a) x(s, a) = α(s'), ∀s'; C_s^T y_s = x_s, ∀s; y_s ≥ 0, ∀s; x(s, a) ≥ 0, ∀s, a. (11)\n\n3.3 Limiting average reward case (unichain)\n\nIn the unichain case, the set of limiting average state-action frequency vectors (that is, all limit points of sequences (1/T) Σ_{n=1}^T E^π[1_{s_n=s, a_n=a}] for π ∈ Π^HR) is the following polytope X: Σ_{a∈A_{s'}} x(s', a) − Σ_{s∈S} Σ_{a∈A_s} p(s'|s, a) x(s, a) = 0, ∀s' ∈ S; Σ_{s∈S} Σ_{a∈A_s} x(s, a) = 1; x(s, a) ≥ 0, ∀s, a ∈ A_s. (12) As before, there exists an optimal maximin stationary policy.
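The flow-balance constraints of the limiting-average polytope in (12) can be sanity-checked on a toy chain: for a fixed stationary policy, the stationary state-action frequencies satisfy balance and normalization exactly. The 2-state transition matrix below is hypothetical illustration data, not from the paper.

```python
# Sanity check of the limiting-average polytope (12) on a toy 2-state
# unichain with a single action per state (hypothetical numbers): the
# stationary state-action frequencies satisfy the flow-balance and
# normalization constraints exactly.

P = [[0.3, 0.7],
     [0.6, 0.4]]           # P[s][s'] = p(s' | s, a) for the single action a

# Stationary distribution mu of this 2-state chain, in closed form:
# mu[0] * 0.7 = mu[1] * 0.6  together with  mu[0] + mu[1] = 1.
mu = [0.6 / 1.3, 0.7 / 1.3]

x = mu[:]                  # x(s, a) = mu(s), since each state has one action

# Balance: sum_a x(s',a) - sum_{s,a} p(s'|s,a) x(s,a) = 0 for every s'.
flow = [x[sp] - sum(P[s][sp] * x[s] for s in range(2)) for sp in range(2)]

# Normalization: total frequency mass equals one.
total = sum(x)
```

With more actions per state, every point of X similarly decomposes as a stationary distribution times a randomized stationary policy, which is what makes the LP formulation over x(s, a) equivalent to optimizing over policies.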
By a similar argument as for the discounted case, the tradeoff problem can be converted to the following PLP: Maximize: λ Σ_{s∈S} Σ_{a∈A_s} r̄(s, a) x(s, a) + (1 − λ) Σ_{s∈S} d_s^T y_s, Subject to: Σ_{a∈A_{s'}} x(s', a) − Σ_{s∈S} Σ_{a∈A_s} p(s'|s, a) x(s, a) = 0, ∀s'; Σ_{s∈S} Σ_{a∈A_s} x(s, a) = 1; C_s^T y_s = x_s, ∀s; y_s ≥ 0, ∀s; x(s, a) ≥ 0, ∀s, a. (13)\n\n4\n\nThe RP tradeoff in MDPs with uncertain transition probabilities\n\nIn this section we provide a counterexample demonstrating that the weighted sum criterion in the most general case, i.e., with uncertain transition probabilities, may lead to non-Markovian optimal policies. In the finite horizon MDP shown in Figure 1, S = {s1, s2, s3, s4, s5, t1, t2, t3, t4}; A_{s1} = {a(1, 1)}; A_{s2} = {a(2, 1)}; A_{s3} = {a(3, 1)}; A_{s4} = {a(4, 1)}; and A_{s5} = {a(5, 1), a(5, 2)}. Rewards are only available at the final stage, and are perfectly known. The nominal transition probabilities are p̄(s2|s1, a(1, 1)) = 0.5, p̄(s4|s2, a(2, 1)) = 1, and p̄(t3|s5, a(5, 2)) = 1. The sets of possible realizations are p(s2|s1, a(1, 1)) ∈ {0.5}, p(s4|s2, a(2, 1)) ∈ [0, 1], and p(t3|s5, a(5, 2)) ∈ [0, 1]. Observe that the worst parameter realization is p(s4|s2, a(2, 1)) = p(t3|s5, a(5, 2)) = 0. We look for the strategy that maximizes the sum of the nominal reward and the worst-case reward (i.e., λ = 0.5). Since multiple actions exist only in state s5, a strategy is determined by the action chosen at s5. Let the probabilities of choosing actions a(5, 1) and a(5, 2) be p and 1 − p, respectively. Consider the history \"s1 s2\". In this case, under the nominal transition probabilities this trajectory reaches t1 with a reward of 10, regardless of the choice of p. The worst transition is that action a(2, 1) leads to s5 and action a(5, 2) leads to t4; hence the worst-case expected reward is 5p + 4(1 − p).
Therefore the optimal p equals 1, i.e., the optimal action is to choose a(5, 1) deterministically.\n\nFigure 1: Example of a non-Markovian best strategy. (Edge labels show the nominal transition probabilities, with the worst-case realizations in parentheses; terminal rewards are r(t1) = 10, r(t2) = 5, r(t3) = 8, r(t4) = 4.)\n\nConsider the history \"s1 s3\". In this case, the nominal reward is 5p + 8(1 − p), and the worst-case reward is 5p + 4(1 − p). Thus p = 0 optimizes the weighted sum, i.e., the optimal strategy is to choose a(5, 2). The unique optimal strategy for this example is thus non-Markovian. This non-Markovian property implies that past actions may affect the choice of future actions, and hence could render the problem intractable. The optimal strategy is non-Markovian because we are taking expectations over two different probability measures, hence the smoothing property of conditional expectation cannot be used in finding the optimal strategy.\n\n5\n\nA computational example\n\nWe apply our algorithm to a T-stage machine maintenance problem. Let S ≜ {1, …, n} denote the state space of each stage. In state h, the decision maker can choose either to replace the machine, which leads to state 1 deterministically, or to continue running, which with probability p leads to state h + 1. If the machine is in state n, then the decision maker has to replace it. The replacement cost is perfectly known to be c_r, and the nominal running cost in state h is c̄_h. We assume that the realization of the running cost lies in the interval [c̄_h − δ_h, c̄_h + δ_h]. We set c̄_h = h − 1 and δ_h = 2h/n. The objective is to minimize the total cost, with a risk-averse attitude. Figure 2(a) shows the tradeoff curve of this MDP. For each solution found, we sample the reward 300 times according to a uniform distribution. We normalize the cost of each simulation, i.e., we divide the cost by the smallest expected nominal cost.
Denoting the normalized cost of the ith simulation under strategy j by s_i(j), we use the following function to compare the solutions: v_j(β) = ( (1/300) Σ_{i=1}^{300} |s_i(j)|^β )^{1/β}. Note that v_j(1) is the mean of the simulation cost, whereas a larger β puts a higher penalty on deviations, representing a risk-averse decision maker. Figure 2(b) shows that the solutions that focus on the nominal parameters (i.e., λ close to 1) achieve good performance for small β, but worse performance for large β. That is, if the decision maker is risk neutral, then the solutions based on the nominal parameters are good. However, these solutions are not robust and are not good choices for risk-averse decision makers. Note that, in this example, the nominal cost is the expected cost of each stage, i.e., the parameters are exactly formulated. Even in this case, we see that risk-averse decision makers can benefit from considering the RP tradeoff.\n\nFigure 2: The machine maintenance problem: (a) the RP tradeoff; (b) normalized modified mean of the simulation for β = 1, 10, 100, 1000.\n\n6\n\nConcluding remarks\n\nIn this paper we proposed a method that directly addresses the robustness versus performance tradeoff by treating robustness as an optimization objective. Based on PLP, for MDPs where only rewards are uncertain, we presented an efficient algorithm that computes the whole set of optimal RP tradeoffs for MDPs with finite horizon, infinite horizon discounted reward, and limiting average reward (unichain). For MDPs with uncertain transition probabilities, we showed an example where the optimal solution may be non-Markovian and hence the problem may in general be intractable.
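The arithmetic behind the Section 4 counterexample (with λ = 0.5) can be verified directly; the sketch below evaluates the weighted objective for both histories over a grid of randomization probabilities p and confirms that the two histories demand different actions at s5.

```python
# Numerical check of the Section 4 counterexample with lam = 0.5: after
# history "s1 s2" the weighted objective is maximized by p = 1 (play
# a(5,1)), while after "s1 s3" it is maximized by p = 0 (play a(5,2)),
# so no single Markovian choice at s5 is optimal for both histories.

def after_s2(p):
    nominal = 10.0                       # nominal dynamics reach t1 (r = 10)
    worst = 5.0 * p + 4.0 * (1 - p)      # worst case routes through s5 to t2/t4
    return 0.5 * nominal + 0.5 * worst   # equals 7 + 0.5 p, increasing in p

def after_s3(p):
    nominal = 5.0 * p + 8.0 * (1 - p)    # a(5,2) nominally reaches t3 (r = 8)
    worst = 5.0 * p + 4.0 * (1 - p)      # worst case: a(5,2) leads to t4
    return 0.5 * nominal + 0.5 * worst   # equals 6 - p, decreasing in p

grid = [k / 100 for k in range(101)]
best_p_after_s2 = max(grid, key=after_s2)   # -> 1.0
best_p_after_s3 = max(grid, key=after_s3)   # -> 0.0
```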
The main advantage of the presented approach is that it addresses robustness directly. This frees the decision maker from the need to make probabilistic assumptions on the problem's parameters. It also allows the decision maker to determine the desired robustness-performance tradeoff based on observing the whole curve of possible tradeoffs rather than guessing a single value.\n\nReferences\n[1] A. L. Soyster. Convex programming with set-inclusive constraints and applications to inexact linear programming. Oper. Res., 1973.\n[2] A. Bagnell, A. Ng, and J. Schneider. Solving uncertain Markov decision processes. Technical Report CMU-RI-TR-01-25, Carnegie Mellon University, August 2001.\n[3] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Oper. Res. Lett., 25(1):1–13, August 1999.\n[4] C. C. White III and H. K. El-Deib. Markov decision processes with imprecise transition probabilities. Oper. Res., 42(4):739–748, July 1992.\n[5] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Oper. Res., 53(5):780–798, September 2005.\n[6] D. Bertsimas and M. Sim. The price of robustness. Oper. Res., 52(1):35–53, January 2004.\n[7] M. Heger. Consideration of risk in reinforcement learning. In Proc. 11th International Conference on Machine Learning, pages 105–111. Morgan Kaufmann, 1994.\n[8] R. Neuneier and O. Mihatsch. Risk sensitive reinforcement learning. In Advances in Neural Information Processing Systems 11, pages 1031–1037, Cambridge, MA, USA, 1999. MIT Press.\n[9] P. Geibel. Reinforcement learning with bounded risk. In Proc. 18th International Conf. on Machine Learning, pages 162–169. Morgan Kaufmann, San Francisco, CA, 2001.\n[10] D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.\n[11] M. Ehrgott. Multicriteria Optimization. Springer-Verlag, Berlin Heidelberg, 2000.\n[12] K. G. Murty. Linear Programming. John Wiley & Sons, 1983.\n[13] D. P. Bertsekas and J. N.
Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.\n[14] M. L. Puterman. Markov Decision Processes. John Wiley & Sons, Inc., 1994.", "award": [], "sourceid": 3053, "authors": [{"given_name": "Huan", "family_name": "Xu", "institution": null}, {"given_name": "Shie", "family_name": "Mannor", "institution": null}]}