{"title": "RAAM: The Benefits of Robustness in Approximating Aggregated MDPs in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1979, "page_last": 1987, "abstract": "We describe how to use robust Markov decision processes for value function approximation with state aggregation. The robustness serves to reduce the sensitivity to the approximation error of sub-optimal policies in comparison to classical methods such as fitted value iteration. This results in reducing the bounds on the gamma-discounted infinite horizon performance loss by a factor of 1/(1-gamma) while preserving polynomial-time computational complexity. Our experimental results show that using the robust representation can significantly improve the solution quality with minimal additional computational cost.", "full_text": "RAAM: The Bene\ufb01ts of Robustness in Approximating\n\nAggregated MDPs in Reinforcement Learning\n\nMarek Petrik\n\nIBM T. J. Watson Research Center\n\nYorktown Heights, NY 10598\nmpetrik@us.ibm.com\n\nDharmashankar Subramanian\nIBM T. J. Watson Research Center\n\nYorktown Heights, NY 10598\ndharmash@us.ibm.com\n\nAbstract\n\nWe describe how to use robust Markov decision processes for value function ap-\nproximation with state aggregation. The robustness serves to reduce the sensitiv-\nity to the approximation error of sub-optimal policies in comparison to classical\nmethods such as \ufb01tted value iteration. This results in reducing the bounds on the\n\u03b3-discounted in\ufb01nite horizon performance loss by a factor of 1/(1 \u2212 \u03b3) while\npreserving polynomial-time computational complexity. 
Our experimental results show that using the robust representation can significantly improve the solution quality with minimal additional computational cost.

1 Introduction

State aggregation is one of the simplest approximate methods for reinforcement learning with very large state spaces; it is a special case of linear value function approximation with binary features. The main advantages of using aggregation in comparison with other value function approximation methods are its simplicity, flexibility, and the ease of interpretability (Bean et al., 1987; Bertsekas and Castanon, 1989; Van Roy, 2005).

Informally, value function approximation methods compute an approximately-optimal policy π̃ by computing an approximate value function ṽ as an intermediate step. The quality of the solution can be measured by its performance loss: ρ(π⋆) − ρ(π̃), where π⋆ is the optimal policy and ρ(·) is the γ-discounted infinite-horizon return of the policy, averaged over (any) given initial state distribution. The tight upper-bound guarantees on the performance loss—tighter for state aggregation than for general linear value function approximation—are (Van Roy, 2005):

ρ(π⋆) − ρ(π̃) ≤ 4 γ ε(v⋆) / (1 − γ)²   (1.1)

where ε(v⋆)—defined formally in Section 4—is the smallest approximation error for the optimal value function v⋆. It is important that the error is with respect to the optimal value function, which can have special structural properties, such as convexity in inventory management problems (Porteus, 2002).

Because the bound in (1.1) is tight, the performance loss may grow with the discount factor as fast as γ/(1 − γ)², while the total return of any policy grows only as 1/(1 − γ). 
Therefore, for γ sufficiently close to 1, the policy π̃ computed through state aggregation may be no better than a random policy even when the approximation error of the optimal policy is small. This large performance loss is caused by large errors in approximating sub-optimal value functions (Van Roy, 2005).

In this paper, we show that it is possible to guarantee a much smaller performance loss by using a robust model of the approximation errors through a new algorithm we call RAAM (robust approximation for aggregated MDPs). Informally, we use robustness to reduce the approximated return of policies with large approximation errors, making it less likely that such policies will be selected. The performance loss of RAAM can be bounded as:

ρ(π⋆) − ρ(π̃) ≤ 2 ε(v⋆) / (1 − γ) .   (1.2)

As the main contribution of the paper—described in Section 3—we incorporate the desired robustness into the aggregation model by assuming bounded worst-case state importance weights. The state importance weights determine the relative importance of the approximation errors among the states. RAAM formulates the robust optimization over the importance weights as a robust Markov decision process (RMDP).

RMDPs extend MDPs to allow uncertain transition probabilities and rewards while preserving most of the favorable MDP properties (Iyengar, 2005; Nilim and Ghaoui, 2005; Le Tallec, 2007; Wiesemann et al., 2013). RMDPs can be solved in polynomial time and the solution methods are practical (Kaufman and Schaefer, 2013; Hansen et al., 2013). 
To minimize the overhead of RAAM in comparison with standard aggregation, we describe in Section 3.1 a new linear-time algorithm for the Bellman update of RMDPs with robust sets constrained by the L1 norm.

Another contribution of this paper—described in Section 4—is the analysis of the RAAM performance loss and the impact of the choice of robust uncertainty sets. We focus on constructing aggregate RMDPs with rectangular uncertainty sets (Iyengar, 2005; Wiesemann et al., 2013) and show that it is possible to use MDP structural properties to reduce RAAM performance-loss guarantees compared to (1.2).

The experimental results in Section 5 empirically illustrate settings in which RAAM outperforms standard state-aggregation methods. In particular, RAAM is more robust to sub-optimal policies with a large approximation error. The results also show that the computational overhead of using the robust formulation is very small.

2 Preliminaries

In this section, we briefly overview robust Markov decision processes (RMDPs). RMDPs generalize MDPs to allow for uncertain transition probabilities and rewards. Our definition of RMDPs is inspired by stochastic zero-sum games in order to generalize previous results to allow for uncertainty in both the rewards and the transition probabilities (Filar and Vrieze, 1997; Iyengar, 2005).

Formally, an RMDP is a tuple (S, A, B, P, r, α), where S is a finite set of states, α ∈ Δ^S is the initial distribution, A_s is the set of actions that can be taken in state s ∈ S, and B_s is a set of robust outcomes for s ∈ S that represent the uncertainty in transitions and rewards. From a game-theoretic perspective, B_s can be seen as the actions of the adversary. For any a ∈ A_s, b ∈ B_s, the transition probabilities are P_{a,b} : S → Δ^S and the reward is r_{a,b} : S → R. 
The rewards depend only on the starting state and are independent of the target state.¹

The basic solution concepts of RMDPs are very similar to those of regular MDPs, with the exception that the solution also includes the policy of the adversary. We consider the set of randomized stationary policies Π_R = {π_s ∈ Δ^{A_s}}_{s∈S} as candidate solutions and use Π_D for deterministic policies.

Two main practical models of the uncertainty in B_s have been considered: s-rectangular and s,a-rectangular sets (Le Tallec, 2007; Wiesemann et al., 2013). In s-rectangular uncertainty models, the realization of the uncertainty depends only on the state and is independent of the action; the corresponding set of nature's policies is Ξ_S = {ξ_s ∈ Δ^{B_s}}_{s∈S}. In s,a-rectangular models, the realization of the uncertainty can also depend on the action: Ξ_SA = {ξ_{s,a} ∈ Δ^{B_s}}_{s∈S, a∈A_s}. We will also consider restricted sets of the adversary's policies: Ξ^Q_S = {ξ_s ∈ Q_s}_{s∈S} and Ξ^Q_SA = {ξ_{s,a} ∈ Q_s}_{s,a∈S×A_s}, for some Q_s ⊂ Δ^{B_s}.

Next, we briefly overview the basic properties of robust MDPs; please refer to (Iyengar, 2005; Nilim and Ghaoui, 2005; Le Tallec, 2007; Wiesemann et al., 2013) for more details. The transitions and rewards for any stationary policies π and ξ are defined as:

P_{π,ξ}(s, s′) = Σ_{a,b ∈ A_s×B_s} P_{a,b}(s, s′) π_{s,a} ξ_{s,b} ,    r_{π,ξ}(s) = Σ_{a,b ∈ A_s×B_s} r_{a,b}(s) π_{s,a} ξ_{s,b} .

¹ Rewards that depend on the target state can be readily reduced to independent ones by taking the appropriate expectation.

It will be convenient to use P_{π,ξ} to denote the transition matrix and r_{π,ξ} and α as vectors over states. 
We will also use I to denote the identity matrix and 1, 0 to denote vectors of ones and zeros, respectively, of appropriate dimensions. Using this notation, with an s,a-rectangular model, the objective of the RMDP is to maximize the γ-discounted infinite-horizon robust return ρ:

ρ⁻ = sup_{π∈Π_R} ρ⁻(π) = sup_{π∈Π_R} inf_{ξ∈Ξ_SA} ρ(π, ξ) = sup_{π∈Π_R} inf_{ξ∈Ξ_SA} Σ_{t=0}^∞ αᵀ (γ P_{π,ξ})ᵗ r_{π,ξ} .   (RBST)

The negative superscript denotes the fact that this is the robust return. The value function for a policy pair π and ξ is denoted by v⁻_{π,ξ} and the optimal robust value function is v⁻⋆. Similarly to regular MDPs, the optimal robust value function must satisfy the robust Bellman optimality equation:

v⁻⋆(s) = max_{π∈Π_R} min_{ξ∈Ξ^Q_SA} Σ_{a,b∈A_s×B_s} π_{s,a} ξ_{s,a,b} ( r_{a,b}(s) + γ Σ_{s′∈S} P_{a,b}(s, s′) v⁻⋆(s′) ) .   (2.1)

3 RAAM: Robust Approximation for Aggregated MDPs

This section describes how RAAM uses transition samples to compute an approximately optimal policy. 
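To make the robust Bellman update (2.1) concrete, the following Python sketch applies one backup for the fully robust case ω = 2, in which the inner minimization over the simplex reduces to picking the worst outcome; the data layout (dictionaries keyed by state, action, and outcome) is our own and not from the paper.

```python
def robust_bellman_backup(v, actions, outcomes, P, r, gamma):
    """One application of the robust Bellman operator (2.1) with omega = 2:
    for each state, maximize over actions the worst case over robust outcomes."""
    new_v = {}
    for s in v:
        new_v[s] = max(
            min(
                # reward plus discounted expected value under outcome b
                r[(s, a, b)] + gamma * sum(p * v[sp] for sp, p in P[(s, a, b)].items())
                for b in outcomes[s]
            )
            for a in actions[s]
        )
    return new_v
```

With a single aggregate state, two actions, and two outcomes, the backup returns the max-min value over the reward table.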
We also describe a linear-time algorithm for computing value function updates for the robust MDPs constructed by RAAM.

Algorithm 1: RAAM: Robust Approximation for Aggregated MDPs

// Σ - samples, w - weights, θ - aggregation, ω - robustness
Input: Σ, w, θ, ω
Output: π̄ – approximately optimal policy
// Compute RMDP parameters
1  S ← {θ(s̄) : (s̄, s̄′, ā, r̄) ∈ Σ} ∪ {θ(s̄′) : (s̄, s̄′, ā, r̄) ∈ Σ} ;   // States
2  forall the s ∈ S do
3    A_s ← {ā : (s̄, s̄′, ā, r̄) ∈ Σ, s = θ(s̄)} ;   // Actions
4    B_s ← {s̄ : (s̄, s̄′, ā, r̄) ∈ Σ, s = θ(s̄)} ;   // Outcomes
5  end
// Compute RMDP transition probabilities and rewards
6  forall the s, s′ ∈ S × S do
7    forall the a, b ∈ A_s × B_s do
8      Σ′ ← {(s̄′, r̄) : (s̄, s̄′, ā, r̄) ∈ Σ, θ(s̄) = s, a = ā, b = s̄} ;
9      P_{a,b}(s, s′) ← (1/|Σ′|) Σ_{(s̄′,·)∈Σ′} 1_{s′=θ(s̄′)} ;
10     r_{a,b}(s) ← Σ_{(·,r̄)∈Σ′} r̄ / |Σ′| ;
11   end
12 end
// Construct robust sets based on state weights and L1 bounds
13 Q_s ← {ξ ∈ Δ^{B_s} : ‖ξ − w|_{B_s} / (1ᵀ w|_{B_s})‖₁ ≤ ω} ;
14 Ξ^Q_SA ← {ξ_{s,a} ∈ Q_s}_{s,a∈S×A_s} ;
// Solve RMDP
15 Solve (2.1) to get π⋆—the optimal RMDP policy—and let π̄_{s̄,a} = π⋆_{θ(s̄),a} ;
16 return π̄ ;

Algorithm 1 depicts a simplified implementation of RAAM. In general, we use s̄ to distinguish the un-aggregated MDP states from the states in the aggregated RMDP.
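For concreteness, Lines 1–12 of Algorithm 1 can be sketched in Python as follows; the function name and data layout are our own (the paper's implementation is in C++), and samples are assumed to be (s̄, s̄′, ā, r̄) tuples.

```python
from collections import defaultdict

def aggregate_rmdp(samples, theta):
    """Build the aggregated RMDP parameters of Algorithm 1, Lines 1-12.

    samples: list of (sbar, sbar_next, a, r) transition tuples.
    theta:   aggregation function mapping an MDP state to an RMDP state.
    Returns (S, A, B, P, R): aggregate states, actions A[s], outcomes B[s],
    transition probabilities P[(s, a, b)][s'] and rewards R[(s, a, b)].
    """
    S = {theta(sb) for sb, sn, a, r in samples} | \
        {theta(sn) for sb, sn, a, r in samples}
    A, B = defaultdict(set), defaultdict(set)
    groups = defaultdict(list)          # (s, a, b) -> sampled (next state, reward)
    for sb, sn, a, r in samples:
        s = theta(sb)
        A[s].add(a)
        B[s].add(sb)                    # robust outcome = original (pre-aggregation) state
        groups[(s, a, sb)].append((sn, r))
    P, R = {}, {}
    for (s, a, b), trans in groups.items():
        n = len(trans)
        R[(s, a, b)] = sum(r for _, r in trans) / n          # empirical mean reward
        p = defaultdict(float)
        for sn, _ in trans:
            p[theta(sn)] += 1.0 / n                          # empirical transition frequency
        P[(s, a, b)] = dict(p)
    return S, dict(A), dict(B), P, R
```

The returned P and R feed directly into a robust Bellman update over the outcome sets B_s.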
The main input to the algorithm consists of transition samples Σ = {(s̄_i, s̄′_i, ā_i, r̄_i)}_{i∈I}, which represent transitions from a state s̄_i to a state s̄′_i, receiving the reward r̄_i, after taking the action ā_i; the transitions need to be sampled according to the transition probabilities conditioned on the state and the action. The aggregation function θ : S̄ → S, which maps every MDP state from S̄ to an aggregate RMDP state, is also assumed to be given. Finally, the state weights w ∈ Δ^{S̄} and the robustness ω are tunable parameters.

We use the L1 norm to bound the uncertainty. The representation uses ω to continuously trade off between fixed importance weights for ω = 0 and complete robustness for ω = 2. We analyze the effect of this parameter in Section 4. However, simply setting w to be uniform and ω = 2 already provides sufficiently strong theoretical guarantees and generally works well in practice. Finally, we assume s,a-rectangular uncertainty sets for the sake of reducing the computational complexity; better approximations could be obtained by using s-rectangular sets, but this makes no difference for deterministic policies.

Next, we show an example that demonstrates how the robust MDP is constructed from the aggregation. We will also use this example to show the tightness of our bounds on the performance loss.

Figure 1: An example MDP.    Figure 2: Aggregated RMDP.

Example 3.1. 
The original MDP problem is shown in Fig. 1. The round white nodes represent the states, while the black nodes represent state-action pairs. All transitions are deterministic, with the number next to each transition representing the corresponding reward. The two shaded regions marked s1 and s2 denote the aggregate states. Fig. 2 depicts the corresponding aggregated robust MDP constructed by RAAM. The rectangular nodes in this picture represent the robust outcomes.

3.1 Reducing Computational Complexity

Solving an RMDP is in general more difficult than solving a regular MDP. Most RMDP algorithms are based on value or policy iteration, but in general involve repeated solutions of linear or convex programs (Kaufman and Schaefer, 2013). Even though the worst-case time complexity of these algorithms is polynomial, they may be impractical because they require repeatedly solving (2.1) for every state, action, and iteration. Each of these computations may require solving a linear program.

The optimization over Ξ_SA when computing the value function update in Line 15 of Algorithm 1 requires solving the following linear program for each s and a:

min_{ξ_{s,a}∈Δ^{B_s}}  ξ_{s,a}ᵀ z_s = Σ_{b∈B_s} ξ_{s,a,b} ( r_{a,b}(s) + γ Σ_{s′∈S} P_{a,b}(s, s′) v(s′) )
s.t.  ‖ξ_{s,a} − q_s‖₁ ≤ ω .   (3.1)

Here q_s = w|_{B_s} / (1ᵀ w|_{B_s}). While this problem can be solved directly using a linear program solver, we describe a significantly more efficient method in Algorithm 2.

Theorem 3.2. Algorithm 2 correctly solves (3.1) in O(|B_s|) time when the full sort is replaced by a quickselect quantile-selection algorithm in Line 4.

The proof is technical and is deferred to Appendix B.1. 
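A runnable Python sketch of Algorithm 2 follows; it uses an explicit O(n log n) sort rather than the quickselect step that yields the O(|B_s|) bound, and the function and variable names are ours rather than the paper's.

```python
def worst_case_l1(z, q, omega):
    """Greedy solution of (3.1): minimize xi^T z over the probability simplex
    subject to ||xi - q||_1 <= omega.

    Shifts up to omega/2 probability mass from the outcomes with the largest
    z values onto the outcome with the smallest z value.
    """
    idx = sorted(range(len(z)), key=lambda i: z[i])   # indices, ascending in z
    o = list(q)
    eps = min(1.0 - q[idx[0]], omega / 2.0)           # mass to add to the minimizer
    o[idx[0]] += eps
    for i in reversed(idx[1:]):                       # remove mass, largest z first
        if eps <= 0.0:
            break
        delta = min(eps, o[i])
        o[i] -= delta
        eps -= delta
    return o
```

For example, with z = (1, 2, 3, 4), a uniform q, and ω = 1, half a unit of mass moves from the two largest entries to the smallest, giving ξ = (0.75, 0.25, 0, 0).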
The main idea is to dualize the norm constraint and examine the structure of the optimal solution as a function of the dual variable.

Algorithm 2: Solve (3.1) in Line 15 of Algorithm 1

Input: z_s, q_s – sorted such that z_s is non-decreasing, indexed 1 . . . n
Output: ξ⋆_{s,a} – optimal solution of (3.1)
1  o ← copy(q_s) ; i ← n ;
2  ε ← min{1 − q_1, ω/2} ;   // determine the threshold
3  o_1 ← q_1 + ε ;
4  while ε > 0 do
5    δ ← min{ε, o_i} ;
6    o_i ← o_i − δ ;
7    ε ← ε − δ ;
8    i ← i − 1 ;
9  end
10 return o ;

4 Performance Loss Bounds

This section describes new bounds on the performance loss, which is the difference between the returns of the optimal and the approximate policy. The performance loss is a more reliable measure of the error than the error in the value function (Van Roy, 2005). We also analyze the effect of the state weights w and the robustness parameter ω on the performance loss.

It will be convenient, for the purpose of deriving the error bounds, to treat aggregation as linear value function approximation (Van Roy, 2005). For that purpose, define a matrix Φ(s̄, s) = 1_{s=θ(s̄)}, where s ∈ S, s̄ ∈ S̄, and 1 represents the indicator function. That is, each column corresponds to a single aggregate state, with each row entry being either 1 or 0 depending on whether the original state belongs to the aggregate state.

In order to simplify the derivation of the bounds, we start by assuming that the RMDP in RAAM is constructed from the full sample of the original MDP; we discuss finite-sample bounds later. Therefore, assume that the full regular MDP is M = (S̄, Ā, P̄, r̄, ᾱ); we use bars in general to denote MDP values, but assume that A = Ā. We also use ρ̄ to denote the return of a policy in the MDP. 
The robust outcomes correspond to the original states that compose any s: B_s = θ⁻¹(s). The RMDP transitions and rewards for some π and ξ are computed as:

P_{π,ξ} = Φᵀ diag(ξ̄) P̄_π Φ ,    r_{π,ξ} = Φᵀ diag(ξ̄) r̄_π ,    αᵀ = ᾱᵀ Φ .   (4.1)

Here, ξ̄_{s̄} = Σ_{a∈A_s} π_{s,a} ξ_{s,a,s̄} with s = θ(s̄) are the state weights induced by ξ.

There are two types of optimal policies: π̄⋆ and π⋆; π̄⋆ is the truly optimal policy, while π⋆ is the optimal policy under the aggregation constraint requiring the same action in all aggregated states. For any computed policy π̃, we focus primarily on the performance loss ρ̄(π⋆) − ρ̄(π̃). The total loss can be easily decomposed as ρ̄(π̄⋆) − ρ̄(π̃) = [ρ̄(π̄⋆) − ρ̄(π⋆)] + [ρ̄(π⋆) − ρ̄(π̃)]. The error ρ̄(π̄⋆) − ρ̄(π⋆) is independent of how the value of the aggregation is computed.

The following theorem states the main result of the paper. A part of the results uses the concentration coefficient C for a given distribution μ of the MDP (Munos, 2005), defined as P̄_a(s, s′) ≤ C μ(s′) for all s, s′ ∈ S̄, a ∈ Ā.

Theorem 4.1. Let π̃ be the solution of Algorithm 1 based on the full sample for ω = 2. 
Then:

ρ̄(π⋆) − ρ̄(π̃) ≤ 2 ε(v⋆) / (1 − γ) ,

where ε(v⋆) = min_{v∈R^S} ‖v⋆ − Φv‖_∞, and this bound is tight. In addition, when the concentration coefficient of the original MDP is C with distribution μ, then ε(v⋆) = min_{v∈R^S} ‖e(v)‖_{1,σ} where σ = Φᵀ(γ α + (1 − γ) μ) and e(v)_s = max_{s̄∈θ⁻¹(s)} |(I − γ P̄_{π⋆})(v̄⋆ − Φ v)|_{s̄}.

Before proving Theorem 4.1, it is instructive to compare it with the performance loss of related reinforcement learning algorithms. When the aggregation is constructed using constant and uniform aggregation weights (as when Algorithm 1 is used with ω = 0), the performance loss of the computed policy π̃ is bounded as (Tsitsiklis and Van Roy, 1996; Gordon, 1995):

ρ̄(π⋆) − ρ̄(π̃) ≤ 4 γ ε(v⋆) / (1 − γ)² .

This bound holds specifically for aggregation (and approximators that are averagers) and is tight; the performance loss of more general algorithms can be even larger. Note that the difference in the 1/(1 − γ) factor is very significant when γ → 1. Van Roy (2005) shows bounds similar to RAAM's, but they are weaker and require the invariant distribution ψ. In addition, performance-loss bounds similar to Theorem 4.1 can be guaranteed by DRADP, but that approach leads in general to NP-hard computational problems (Petrik, 2012). In fact, robust aggregation can be seen as a special case of DRADP with rectangular uncertainty sets (Iyengar, 2005).

To prove Theorem 4.1, we need the following result, showing that for properly chosen robust uncertainty sets the robust return is a lower bound on the true return. 
We will use d̄_π to represent the normalized occupancy frequency for the MDP M and policy π.

Lemma 4.2. Assume the uncertainty set to be Ξ^Q_S or Ξ^Q_SA as constructed in (4.1). Then ρ⁻(π) ≤ ρ̄(π) as long as for each π ∈ Π we have d̄_π|_{B_s} ∈ ψ_s · Q_s for each s ∈ S and some ψ_s.

When ω = 2, the inequality in the lemma also holds for value functions, as Proposition B.1 in the appendix shows.

Proof. We prove the result for s-rectangular uncertainty sets; the proof for s,a-rectangular sets is analogous. When the policy π is fixed, solving for the nature's policy represents a minimization MDP with continuous action constraints that has the following dual linear program formulation (Marecki et al., 2013):

ρ⁻(π) = min_{d∈{R^{B_s}}_{s∈S}}  dᵀ r̄_π / (1 − γ)
        s.t.  Φᵀ(I − γ P̄_πᵀ) d = (1 − γ) Φᵀ ᾱ
              d_{s,b} / Σ_{b′∈B_s} d_{s,b′} ∈ Q_s ,   ∀s ∈ S, ∀b ∈ B_s .   (4.2)

Note that the left-hand side of the last constraint corresponds to ξ_{s,b}. Now, setting d = d̄_π shows the desired inequality for π; this value is feasible in (4.2) by (B.3), and the objective value is correct by (B.4). The normalization constant is ψ_s = Σ_{b′∈B_s} d_{s,b′}.

Proof of Theorem 4.1. 
Using Lemma 4.2, the performance loss for ω = 2 can be bounded as:

0 ≤ ρ̄(π⋆) − ρ̄(π̃) ≤ ρ̄(π⋆) − ρ⁻(π̃) = min_{π∈Π} (ρ̄(π⋆) − ρ⁻(π)) ≤ ρ̄(π⋆) − ρ⁻(π⋆) .

For a policy π, solving ρ⁻(π) corresponds to an MDP with the following LP formulation:

ρ̄(π⋆) − ρ⁻(π⋆) ≤ min_v { αᵀ(v⋆ − Φv) : Φv ≤ γ P̄_{π⋆} Φv + r_{π⋆} } .   (4.3)

Now, let the minimum ε = min_v ‖v⋆ − Φv‖_∞ be attained at v₀. Then, to show that v₁ = v₀ − ((1 + γ)/(1 − γ)) ε 1 is feasible in (4.3), observe that for any k:

−ε 1 ≤ v⋆ − Φv₀ ≤ ε 1
(k − 1) ε 1 ≤ v⋆ − Φv₀ + k ε 1 ≤ (1 + k) ε 1   (4.4)
(k − 1) γ ε 1 ≤ γ P̄_{π⋆} (v⋆ − Φv₀ + k ε 1) ≤ (1 + k) γ ε 1   (4.5)

The derivation above uses the monotonicity of P̄_{π⋆} in (4.5). Then, after multiplying by (I − γ P̄_{π⋆}), which is monotone, and rearranging the terms:

(I − γ P̄_{π⋆}) Φ (v₀ − k ε 1) ≤ (1 + γ − (1 − γ) k) ε 1 + r_{π⋆} ,

where (I − γ P̄_{π⋆}) v⋆ = r_{π⋆}. Letting k = (1 + γ)/(1 − γ) proves the needed feasibility, and (4.4) establishes the bound. The tightness of the bound follows from Example 3.1 with ε → 0.

The bound in the second inequality follows from bounding the duality gap between the primal feasible solution v₁ and an upper bound on a dual optimal solution. 
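For completeness, the algebra behind the choice of k above can be spelled out; it uses only r_{π⋆} = (I − γ P̄_{π⋆}) v⋆, the bound ‖v⋆ − Φv₀‖_∞ ≤ ε, and the monotonicity of P̄_{π⋆}:

```latex
\Phi v_1 = \Phi v_0 - k\varepsilon\mathbf{1} \le v^\star + (1-k)\varepsilon\mathbf{1},
\qquad
\gamma \bar{P}_{\pi^\star}\Phi v_1 + r_{\pi^\star}
  \ge \gamma \bar{P}_{\pi^\star}\bigl(v^\star - (1+k)\varepsilon\mathbf{1}\bigr) + r_{\pi^\star}
  = v^\star - (1+k)\gamma\varepsilon\mathbf{1},
```

so the feasibility condition Φv₁ ≤ γ P̄_{π⋆} Φv₁ + r_{π⋆} holds whenever (1 − k) ε ≤ −(1 + k) γ ε, that is, whenever 1 + γ ≤ k (1 − γ), i.e., k ≥ (1 + γ)/(1 − γ).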
To upper-bound the dual solution, define a concentration coefficient for an RMDP similarly to an MDP: P̄_{a,b}(s, s′) ≤ C μ(s′) for all s, s′ ∈ S, a ∈ A_s, b ∈ B_s. By algebraic manipulation, if the original MDP has a concentration coefficient C with a distribution μ, then the aggregated RMDP has the same concentration coefficient with the distribution Φᵀμ. Then, using Lemma 4.3 in (Petrik, 2012), the occupancy frequency (and therefore the dual value) of the RMDP for any policy is bounded as u ≤ (C/(1 − γ)) Φ ((1 − γ) Φᵀα + γ Φᵀμ).

The linear program (4.3) can be formulated as the following penalized optimization problem:

max_u min_v  αᵀ(v⋆ − Φv) + uᵀ [(I − γ P̄_{π⋆}) Φv − r_{π⋆}]₊ .

Note that:

αᵀ(v⋆ − Φv) = αᵀ (I − γ P̄_{π⋆})⁻¹ (I − γ P̄_{π⋆}) (v⋆ − Φv) = d̄ᵀ_{π⋆} (I − γ P̄_{π⋆}) (v⋆ − Φv) .

The penalized optimization problem can be rewritten, using the fact that r_{π⋆} = (I − γ P̄_{π⋆}) v⋆ and the feasibility of v₁, as:

max_u  (2/(1 − γ)) uᵀ |(I − γ P̄_{π⋆}) (Φ v₁ − v⋆)|
s.t.  u ≤ (C/(1 − γ)) Φ ((1 − γ) Φᵀα + γ Φᵀμ) .

The theorem then follows by simple algebraic manipulation from the upper bound on u.

4.1 State Importance Weights

In this section, we discuss how to select the state importance weights w and the robustness parameter ω. 
Note that Lemma 4.2 shows that any choice of w and ω such that the normalized occupancy frequency is within ω of w in terms of the L1 norm provides the theoretical guarantees of Theorem 4.1. Smaller uncertainty sets under this condition only improve the guarantees. In practice, the values w and ω can be treated as regularization parameters. We show sufficient conditions under which the right choice of w and ω can significantly reduce the performance loss, but these conditions have a more explanatory than predictive character.

As can be seen easily from the proof of Lemma 4.2 and Appendix B.2, the optimal choice of the RAAM weights w for approximating the return of a policy π is its state occupancy frequency. While the occupancy frequency is rarely known, there exist structural properties, such as the concentration coefficient (Munos, 2005), that can lead to upper bounds on the possible occupancy frequencies. However, the following example shows that simply using an upper bound on the occupancy frequency is not sufficient to reduce the performance loss.

Example 4.3. Consider an MDP with 4 states s1, . . . , s4 and the aggregation with two states that correspond to {s1, s2} and {s3, s4}. Let the set of admissible occupancy frequencies be Q = {d ∈ Δ⁴ : 1/4 ≤ d(s1) + d(s4) ≤ 1/2, d ≥ 1/8}. The set of uncertainties for this bounded set is, for i = 1, 3 and j = 2, 4: Ξ^Q_S = {d ∈ R⁴₊ : 1/6 ≤ d(s_i) ≤ 4/5, 1/5 ≤ d(s_j) ≤ 5/6, d(s_i) + d(s_j) = 1}, which is smaller than Ξ_S. However, Q without the constraint d ≥ 1/8 results in Ξ^Q_S = Ξ_S.

As Example 4.3 demonstrates, the concentration coefficient alone does not guarantee an improvement in the policy loss. One possible additional structural assumption is that the occupancy frequencies of the individual states in each aggregate state are "correlated" across policies. 
More formally, the aggregation correlation coefficient D ∈ R₊ must satisfy:

λ σ(s̄) ≤ d_π(s̄) ≤ λ D σ(s̄) ,   (4.6)

for some λ ≥ 0, each s̄ ∈ S̄, and σ as defined in Theorem 4.1. Under this assumption, consider the uncertainty set Q_s = {q : q ≤ C (σ|_{B_s}) / (1ᵀ σ|_{B_s})}; then we can show the following theorem.

Theorem 4.4. Given an MDP with a concentration coefficient C for μ and a correlation coefficient D, for the uncertainty set Ξ^Q_S and for σ = Φᵀ(γ α + (1 − γ) μ) we have:

ρ̄(π⋆) − ρ̄(π̃) ≤ (2 C D / (1 − γ)) min_{v∈R^S} ‖(I − γ P̄_{π⋆}) (v̄⋆ − Φ v)‖_{1,σ} .

The proof is based on a minor modification of Theorem 4.1 and is deferred to the appendix. Theorem 4.4 improves on Theorem 4.1 by entirely replacing the L∞ norm with a weighted L1 norm. While the correlation coefficient may not be easy to determine in practice, it may be a useful property to analyze in order to explain a failure of the method.

Finite-sample bounds are beyond the scope of this paper. However, the sampling error is additive and can be bounded, for example, based on ε-coverage assumptions made for approximate linear programs. In particular, (4.2) represents an approximate linear program and can be bounded as such, as done for example by Petrik et al. (2010).

5 Experimental Results

In this section, we experimentally validate the approximation properties of RAAM with respect to the quality of the solutions and the computational time required. 
Figure 3: Sensitivity to the reward perturbation for regular aggregation and RAAM.    Figure 4: Time to compute (3.1) for Algorithm 2 versus a CPLEX LP solver.

For the purpose of the empirical evaluation, we use a modified inverted pendulum problem with a discount factor of 0.99, as described for example in (Lagoudakis and Parr, 2003). For the aggregation, we use a uniform grid of dimension 40 × 40 and uniform sampling of dimensions 120 × 120. The ordinary setting is solved easily and reliably by both the standard aggregation and RAAM. To study the robustness with respect to the approximation error of suboptimal policies, we add an additional reward ra for the pendulum under a tilted angle (π/2 − 0.12 ≤ θ ≤ π/2 and θ̈ ≥ 0, where θ is the angle and θ̈ is the action). This reward can only be achieved by a suboptimal policy. Fig. 3 shows the return of the approximate policy as a function of the magnitude of the additional reward for the standard aggregation and for RAAM with various values of ω. We omit the confidence ranges, which are small, to enhance image clarity. Note that we assume that once the pendulum goes over π/2, the reward -1 is accrued until the end of the horizon. This result clearly demonstrates the greater stability and robustness of RAAM compared with standard aggregation. The results also illustrate the lack of stability of ALP, which can be seen as an optimistic version of RAAM. We observed the same behavior for other parameter choices.

The main cost of using RAAM compared to ordinary aggregation is the increased computational complexity. Our results show, however, that the computational overhead of RAAM is minimal. Fig. 4 shows that Algorithm 2 is several orders of magnitude faster than CPLEX 12.3. 
The value function update for the aggregated inverted pendulum with 1600 states, 3 actions, and about 9 robust outcomes takes 8.7 ms for ordinary aggregation, 8.8 ms for RAAM with ω = 2, and 9.7 ms for RAAM with ω = 1. The guarantees on the improvement for one iteration are the same for both algorithms, and all implementations are in C++.

6 Conclusion

RAAM is a novel approach to state aggregation which leverages RMDPs. RAAM significantly reduces performance-loss guarantees in comparison with standard aggregation while introducing negligible computational overhead. The robust approach has some distinct advantages in comparison with previous methods with improved performance-loss guarantees. Our experimental results are encouraging and show that adding robustness can significantly improve the solution quality. Clearly, not all problems will benefit from this approach. However, given the small computational overhead, there is little reason not to try it. While we do provide some theoretical justification for choosing w and ω, in practice these are most likely best treated as regularization parameters.

Many improvements on the basic RAAM algorithm are possible. Most notably, the RMDP action set could be based on "meta-actions" or "options". The L1 norm may be replaced by other polynomial norms or by the KL divergence. RAAM could also be extended to adaptively choose the most appropriate aggregation for the given samples (Bernstein and Shimkin, 2008).
Finally, using s-rectangular uncertainty sets may lead to better results.

Acknowledgments

We thank Ban Kawas for extensive discussions on this topic and the anonymous reviewers for their comments that helped to significantly improve the paper.

[Figure 3 series: Mean Aggregation/LSPI; Robust Aggregation, ‖·‖₁ ≤ 0.5; Robust Aggregation, ‖·‖₁ ≤ 1.5; Approximate Linear Programming. Figure 4 series: CPLEX Total; CPLEX Solver; Custom Python; Custom C++.]

References

Bean, J. C., Birge, J. R., and Smith, R. L. (1987). Aggregation in dynamic programming. Operations Research, 35(2), 215–220.

Bernstein, A. and Shimkin, N. (2008). Adaptive aggregation for reinforcement learning with efficient exploration: Deterministic domains. In Conference on Learning Theory (COLT).

Bertsekas, D. P. and Castanon, D. A. (1989). Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34, 589–598.

de Farias, D. P. and Van Roy, B. (2003). The linear programming approach to approximate dynamic programming. Operations Research, 51(6), 850–865.

Desai, V. V., Farias, V. F., and Moallemi, C. C. (2012). Approximate dynamic programming via a smoothed linear program. Operations Research, 60(3), 655–674.

Filar, J. and Vrieze, K. (1997). Competitive Markov Decision Processes. Springer.

Gordon, G. J. (1995). Stable function approximation in dynamic programming. In International Conference on Machine Learning, pages 261–268.

Hansen, T., Miltersen, P., and Zwick, U. (2013). Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM, 60(1), 1–16.

Iyengar, G. N. (2005). Robust dynamic programming.
Mathematics of Operations Research, 30(2), 257–280.

Kaufman, D. L. and Schaefer, A. J. (2013). Robust modified policy iteration. INFORMS Journal on Computing, 25(3), 396–410.

Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.

Le Tallec, Y. (2007). Robust, Risk-Sensitive, and Data-driven Control of Markov Decision Processes. Ph.D. thesis, MIT.

Mannor, S., Mebel, O., and Xu, H. (2012). Lightning does not strike twice: Robust MDPs with coupled uncertainty. In International Conference on Machine Learning.

Marecki, J., Petrik, M., and Subramanian, D. (2013). Solution methods for constrained Markov decision process with continuous probability modulation. In Uncertainty in Artificial Intelligence (UAI).

Munos, R. (2005). Performance bounds in Lp norm for approximate value iteration. In National Conference on Artificial Intelligence (AAAI).

Nilim, A. and Ghaoui, L. E. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5), 780–798.

Petrik, M. (2012). Approximate dynamic programming by minimizing distributionally robust bounds. In International Conference on Machine Learning.

Petrik, M. and Zilberstein, S. (2009). Constraint relaxation in approximate linear programs. In International Conference on Machine Learning. ACM Press.

Petrik, M., Taylor, G., Parr, R., and Zilberstein, S. (2010). Feature selection using regularization in approximate linear programs for Markov decision processes. In International Conference on Machine Learning.

Porteus, E. L. (2002). Foundations of Stochastic Inventory Theory. Stanford Business Books.

Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Tsitsiklis, J. N. and Van Roy, B. (1996).
An analysis of temporal-difference learning with function approximation.

Van Roy, B. (2005). Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2), 234–244.

Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust Markov decision processes. Mathematics of Operations Research, 38(1), 153–183.