{"title": "Policy-Conditioned Uncertainty Sets for Robust Markov Decision Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 8939, "page_last": 8949, "abstract": "What policy should be employed in a Markov decision process with uncertain parameters? Robust optimization answer to this question is to use rectangular uncertainty sets, which independently reflect available knowledge about each state, and then obtains a decision policy that maximizes expected reward for the worst-case decision process parameters from these uncertainty sets. While this rectangularity is convenient computationally and leads to tractable solutions, it often produces policies that are too conservative in practice, and does not facilitate knowledge transfer between portions of the state space or across related decision processes. In this work, we propose non-rectangular uncertainty sets that bound marginal moments of state-action features defined over entire trajectories through a decision process. This enables generalization to different portions of the state space while retaining appropriate uncertainty of the decision process. We develop algorithms for solving the resulting robust decision problems, which reduce to finding an optimal policy for a mixture of decision processes, and demonstrate the benefits of our approach experimentally.", "full_text": "Policy-Conditioned Uncertainty Sets for\n\nRobust Markov Decision Processes\n\nAndrea Tirinzoni\nPolitecnico di Milano\n\nXiangli Chen\n\nAmazon Robotics\n\nandrea.tirinzoni@polimi.it\n\ncxiangli@amazon.com\n\nMarek Petrik\n\nUniversity of New Hampshire\n\nmpetrik@cs.unh.edu\n\nBrian D. Ziebart\n\nUniversity of Illinois at Chicago\n\nbziebart@uic.edu\n\nAbstract\n\nWhat policy should be employed in a Markov decision process with uncertain\nparameters? 
Robust optimization's answer to this question is to use rectangular\nuncertainty sets, which independently reflect available knowledge about each state,\nand then to obtain a decision policy that maximizes the expected reward for the\nworst-case decision process parameters from these uncertainty sets. While this\nrectangularity is convenient computationally and leads to tractable solutions, it\noften produces policies that are too conservative in practice, and does not facilitate\nknowledge transfer between portions of the state space or across related decision\nprocesses. In this work, we propose non-rectangular uncertainty sets that bound\nmarginal moments of state-action features defined over entire trajectories through\na decision process. This enables generalization to different portions of the state\nspace while retaining appropriate uncertainty of the decision process. We develop\nalgorithms for solving the resulting robust decision problems, which reduce to\nfinding an optimal policy for a mixture of decision processes, and demonstrate the\nbenefits of our approach experimentally.\n\n1\n\nIntroduction\n\nPolicies with high expected reward are often desired for uncertain decision processes with which little\nexperience exists. Specifically, we consider the setting in which only a limited number of trajectories\nfrom a sub-optimal control policy through a decision process are available. Robust control approaches\nfor this task [1, 2, 3, 4] define uncertainty sets for the decision process based on the limited outcome\nsamples and seek the policy that maximizes the expected reward for the worst possible choice of\ndecision process parameters in these sets.\nWhen the uncertainty sets relating to different decision process states are jointly constrained in\nseemingly natural ways, the robust control problem becomes NP-hard (e.g., [5, 6]). 
To avoid these\ncomputationally intractable robust control problems, uncertainty sets have often been independently\nconstructed for parameters associated with a particular state-action pair or particular state—s,a-rectangularity or s-rectangularity [7, 8, 4], respectively. Unfortunately, independently assuming the\nworst-case in every encountered state is often too conservative in practice to be useful [9].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n\fLeveraging ideas from distributionally robust optimization [10, 11, 12], we construct policy-conditioned marginal uncertainty sets for robustly learning a decision policy that optimizes the\nreward given trajectory samples produced by a sub-optimal policy. State transition dynamics under\nour formulation are estimated based on two competing objectives. First, the estimated dynamics must\n(approximately) match measured properties observed under the sub-optimal reference policy. Second,\nthe estimated dynamics must be the worst case for the simultaneously-hypothesized optimal policy.\nThis formulation has three main benefits: (1) Non-rectangularity: Our uncertainty sets are defined\nby feature-based statistics of distributions over entire trajectories, enabling generalization across\nstates; (2) Off-policy robustness: We define our performance objective using the desired control\npolicy and the uncertainty set using the sub-optimal data generation policy; and (3) Convex parameter\noptimization: We avoid the nonconvex parameter optimization pitfalls of other nonrectangular\nformulations by shifting the main computational difficulties to parameterized prediction/control\nproblems (which can be efficiently approximated). 
Together, these properties aid in addressing a\nnumber of existing concerns for robust control, including settings in which the state definition violates\nthe Markov assumption [13] or the transition probabilities are derived from limited data sets [3, 9].\nIn the remainder of this paper, we review existing robust control methods and directed information\ntheory concepts in Section 2. Using these concepts, we formulate the robust control task using\nfeature-based marginal constraints in Section 3. We reformulate this problem and present algorithms\nfor solving it using a combination of convex optimization and dynamic programming to optimize a\nnon-Markovian mixed decision process optimal control problem that arises from the formulation. We\nevaluate our approach in Section 4 to demonstrate its comparative benefits over rectangular robust\ncontrol methods. Lastly, we provide concluding thoughts and discuss future work in Section 5.\n\n2 Background and Related Work\n\n2.1 Robust control\nThe Markov Decision Process (MDP) with state set S and action set A provides a common formulation\nof discrete control problems. In the MDP, the transition probabilities are given by τ(s_{t+1} | s_t, a_t) and\nthe reward is R(s_t, a_t, s_{t+1}). Though consideration is often restricted to deterministic Markovian\npolicies, π : S → A, the generalization to randomized Markovian policies Π_M = {π : S → Δ_A}\nprovides stochastic mappings from the current state to actions. Even more generally, we will consider\nnon-Markovian, history-dependent, randomized policies Π_H = {π : S^t × A^t → Δ_A} in this work.\nThe expected sum of rewards or return ρ of a policy π applied to an MDP with dynamics τ and\nreward function R is: ρ_R(π, τ) = E_{τ,π}[ Σ_{t=1}^{T−1} R(S_t, A_t, S_{t+1}) ]. 
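In the tabular finite-horizon case, the return ρ_R(π, τ) just defined can be computed exactly by backward dynamic programming. A minimal sketch (array shapes and names are ours, for illustration only):

```python
import numpy as np

def expected_return(tau, R, pi, p1, T):
    """rho_R(pi, tau) = E_{tau,pi}[sum_{t=1}^{T-1} R(S_t, A_t, S_{t+1})],
    computed by backward recursion for a tabular finite-horizon MDP.
    tau: (S, A, S) transitions; R: (S, A, S) rewards; pi: (S, A) stochastic
    Markovian policy; p1: (S,) initial state distribution; T: horizon."""
    v = np.zeros(tau.shape[0])            # value-to-go after the final state
    for _ in range(T - 1):                # a length-T episode has T-1 transitions
        # q[s, a] = sum_{s'} tau[s, a, s'] * (R[s, a, s'] + v[s'])
        q = np.einsum('ijk,ijk->ij', tau, R + v[None, None, :])
        v = np.einsum('ij,ij->i', pi, q)  # average over the policy's actions
    return float(p1 @ v)
```

For instance, on a deterministic two-state chain with unit rewards and horizon T = 3, the return is 2 (one unit per transition).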
For decision problems, the standard\nobjective is to choose a policy that maximizes the expected sum of rewards: max_π ρ_R(π, τ). Since\na Markovian and deterministic policy always exists that maximizes this quantity, one with those\ncharacteristics is typically sought when solving this optimization problem by many well-known\nalgorithms, such as value iteration or policy iteration [14].\nUnfortunately, in many settings the dynamics τ are not entirely known. Control policies are needed\nthat can perform well despite this uncertainty about the decision process. One option is to formally\ndefine the uncertainty as a set of possibilities and assume the worst case (Definition 1).\nDefinition 1. The robust control problem is to find a control policy π ∈ Π that performs best for the\nworst-case choice of state transition dynamics, τ ∈ Ξ:\n\nmax_{π∈Π} min_{τ∈Ξ} ρ(π, τ) = max_{π∈Π} min_{τ∈Ξ} E_{τ,π}[ Σ_{t=1}^{T−1} R(S_t, A_t, S_{t+1}) ].   (1)\n\nThe specification of the uncertainty set(s), Ξ, has significant implications for the tractability of this\nproblem. Robust MDPs [7] are typically used to represent uncertainty in transition probabilities\nand rewards in regular MDPs. When the state-transition probabilities for different states are jointly\nconstrained in arbitrary ways, the robust control problem becomes NP-hard [5, 6]. Two common\nforms of constraints that enable efficient solutions are s,a-rectangular and s-rectangular [4] constraint\nsets. This form arises when transition probabilities are not known precisely, but are known to be\nbounded in terms of an L1 norm. 
A corresponding robust MDP has uncertain transition probabilities:\n\nΞ = {τ : ∀(s, a) ∈ S × A, ‖τ(· | s, a) − p(· | s, a)‖_1 ≤ c}.\n\nThis is an s,a-rectangular set. It employs independent constraints for each state-action pair or state\n(s-rectangular set). A convenient way to model a robust MDP is to introduce a set of outcomes B\nto represent the uncertainty in transitions and rewards. The transition probabilities are then defined\nas p(s_{t+1} | s_t, a_t, b_t) and rewards become r(s_t, a_t, b_t, s_{t+1}), while ξ(b_t | s_t, a_t) denotes the nature's\npolicy, i.e., a distribution over outcomes.\nThe optimal value function v* in a robust MDP with s-rectangular and s,a-rectangular uncertainty\nsets (and discount factor γ) satisfies the Bellman optimality equation for each s ∈ S as follows:\n\nv*(s) = max_{π∈Π} min_{ξ∈Ξ} Σ_{a∈A} Σ_{b∈B} π(a|s) ξ(b|s, a) Σ_{s′∈S} p(s′ | s, a, b) ( r(s, a, b, s′) + γ v*(s′) ).   (2)\n\nIn our formulation, we consider state-action feature-based constraints over the marginals of state-action sequences to define our uncertainty sets. When the sum of rewards and the constraints are\ndefined in terms of different policies, this naturally induces a “belief state” that is similar to the\naugmenting set of outcomes B previously described. 
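For intuition about the rectangular baseline, the inner minimization over an L1 ball around the nominal distribution p(· | s, a) has a simple greedy closed form: shift up to c/2 probability mass from the most valuable next states onto the least valuable one. A sketch of this standard subroutine (not the paper's own method, which avoids rectangularity):

```python
import numpy as np

def worst_case_l1(p, v, c):
    """Minimize q @ v over distributions q with ||q - p||_1 <= c.
    Greedy solution: add mass to the worst next state, remove the same
    total amount from the highest-value next states."""
    q = p.astype(float).copy()
    worst = int(np.argmin(v))
    eps = min(c / 2.0, 1.0 - q[worst])   # mass we are allowed to move
    q[worst] += eps
    for i in np.argsort(v)[::-1]:        # drain highest-value states first
        if i == worst:
            continue
        take = min(eps, q[i])
        q[i] -= take
        eps -= take
        if eps <= 1e-12:
            break
    return q
```

Applying this independently in every state-action pair is exactly what makes s,a-rectangular solutions tractable, and also what makes them conservative.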
In our case, this augmenting information tracks\nthe relative significance of the policies for providing robustness based on the sum of rewards versus\nmatching feature-based measurements from training trajectories.\n\n2.2 Directed information theory for processes\n\nWe make extensive use of ideas and notation from directed information theory [15, 16, 17, 18, 19].\nUnder this theory, processes—the products of T conditional probabilities over a sequence of T\nvariables—are treated as first-order objects. The causally conditioned probability distribution [20],\np(y_{1:T} || x_{1:T}) ≜ ∏_{t=1}^{T} p(y_t | y_{1:t−1}, x_{1:t}), illustrates the notation for this process of generating the\nsequence of y_{1:T} variables given the sequence of x_{1:T} variables. It differs from the conditional probability distribution, p(y_{1:T} | x_{1:T}) = ∏_{t=1}^{T} p(y_t | x_{1:T}, y_{1:t−1}), in the limited history of x variables\neach y_t variable is conditioned upon.\nBoth (stochastic) control policies, π(a_{1:T} || s_{1:T}) ≜ ∏_{t=1}^{T} π(a_t | a_{1:t−1}, s_{1:t}), and (stochastic) state\ntransition dynamics, τ(s_{1:T} || a_{1:T−1}) ≜ ∏_{t=1}^{T} τ(s_t | s_{1:t−1}, a_{1:t−1}), can be expressed using this\nnotation. 
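As a concrete reading of the notation, for a Markovian stochastic policy the causally conditioned probability reduces to a running product of per-step conditionals. A tiny illustrative sketch (names are ours):

```python
def causally_conditioned_prob(pi, states, actions):
    """p(a_{1:T} || s_{1:T}) = prod_t pi(a_t | s_t) for a Markovian policy pi,
    given as a nested table pi[s][a]; a history-dependent policy would
    condition each factor on the whole state-action prefix instead."""
    prob = 1.0
    for s, a in zip(states, actions):
        prob *= pi[s][a]
    return prob
```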
The joint probability distribution over states and actions is then p(a_{1:T}, s_{1:T}) =\nπ(a_{1:T} || s_{1:T}) τ(s_{1:T} || a_{1:T−1}), and the expected reward can be expressed as an affine combination of\nbilinear functions of these processes:\n\nρ_R(π, τ) = Σ_{a_{1:T}} Σ_{s_{1:T}} π(a_{1:T} || s_{1:T}) τ(s_{1:T} || a_{1:T−1}) Σ_{t=1}^{T−1} R(s_t, a_t, s_{t+1}).   (3)\n\nAdditionally, the uncertainty of state sequence outcomes can be quantified using the causally conditioned entropy:\n\nH_{τ,π}(S_{1:T} || A_{1:T−1}) = − Σ_{a_{1:T}, s_{1:T}} π(a_{1:T} || s_{1:T}) τ(s_{1:T} || a_{1:T−1}) log τ(s_{1:T} || a_{1:T−1}).   (4)\n\nOf crucial importance for optimization purposes, the set of causally conditioned probability distributions is convex and the causal entropy is a concave function of those probabilities [21].\n\n3 Marginally-Constrained Robust Control Processes\n\nWe define constraints on uncertainties about a decision process based on its interactions with a\nreference policy. In other words, state-action trajectories through the decision process are available\nthat were produced from a policy that may be quite different from the optimal one. Similarly to\nprevious works [5, 22], we propose practical algorithms for this problem by augmenting the state\nspace.\n\n3.1 Defining Uncertainty Sets with Marginal Features\nWe consider a feature function φ : S × A × S → R^d characterizing the relationships between states\nand actions to restrict the set of possible realizations of uncertain MDP parameters. 
We denote the\nfirst moment of the occupancy frequencies with respect to φ (also known as feature expectations in\nthe inverse reinforcement learning literature [23, 24]) as κ_φ(π, τ) := E_{τ,π}[ Σ_{t=1}^{T−1} φ(S_t, A_t, S_{t+1}) ],\nwhile we denote the empirical sample statistics, which are measured from N sample trajectories,\nas κ̂ = (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T−1} φ(s_t^{(i)}, a_t^{(i)}, s_{t+1}^{(i)}). Based on these quantities, we can now define the robust\ncontrol problem with constraints using marginal statistics of the state-action sequence to define the\nuncertainty set Ξ.\nDefinition 2. The marginally-constrained robust control problem given reference policy π̃ is:\n\nmax_{π∈Π} min_{τ∈Ξ} ρ(π, τ) − (1/λ) H_{τ,π̄}(S_{1:T} || A_{1:T−1}),   (5)\n\nwhere Ξ is the set of all transition probabilities whose feature expectations match the empirical sample\nstatistics, i.e., Ξ = {τ | κ_φ(π̃, τ) = κ̂}. In general, and of practical significance, slack can also be\nadded to the constraints, leading to a relaxed uncertainty set Ξ̃ = {τ | ‖κ_φ(π̃, τ) − κ̂‖ ≤ β}¹. We\ninclude an optional causal entropy (Equation 4) regularization penalty term, (1/λ) H_{τ,π̄}(S_{1:T} || A_{1:T−1}),\nwhere λ ∈ (0, ∞) is a provided parameter and π̄(a_{1:T} || s_{1:T}) is an arbitrary distribution.\nIntuitively, our formulation allows constraints for whole trajectories rather than single state-action pairs, as\nwith rectangular constraints. 
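The empirical statistics κ̂ are a per-trajectory sum of features averaged over the N sampled trajectories. A minimal sketch (the feature map `phi` is user-supplied):

```python
import numpy as np

def empirical_feature_expectations(trajectories, phi):
    """kappa_hat = (1/N) * sum_i sum_t phi(s_t, a_t, s_{t+1}).
    `trajectories` is a list of N lists of (s, a, s_next) transitions and
    `phi` maps one transition to a d-dimensional numpy vector."""
    total = None
    for traj in trajectories:
        for (s, a, s_next) in traj:
            f = np.asarray(phi(s, a, s_next), dtype=float)
            total = f.copy() if total is None else total + f
    return total / len(trajectories)
```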
Furthermore, features \u03c6 allow us to specify properties of the unknown\ntransition dynamics that generalize globally across the state-action space, which is not possible using\nlocal constraints, such as rectangular ones. When limited data is available and generalization is\ntherefore required to achieve good performance, this constitutes a signi\ufb01cant advantage. Finally, our\noptional entropy regularization term leads to smoother solutions, where the smoothness is controlled\nby parameter \u03bb. Many previous works have shown the bene\ufb01ts of having entropy-based smoothing\n[2, 25].\nIn practice, the design of the feature function \u03c6 is fundamental for properly constraining the estimated\ntransition probabilities. Although a speci\ufb01c choice is highly application dependent, the features\nshould in general encode known properties of the underlying MDP. Since our solution reduces\nto \ufb01nding dynamics that induce a behavior on the reference policy, speci\ufb01ed through \u03ba\u03c6, that\napproximately matches the one observed from the given trajectories, many analogies exist with\nfeature design in the IRL literature (see, e.g., chapter 6 of [26]). Common choices thus include\nindicator functions over important properties/events, such as reaching certain goal states, entering\ndangerous zones, taking very likely (or unlikely) transitions, and so on. The key consequence of\nadding these kinds of features is that the probability of these events occurring under the estimated\ndynamics will be (approximately) the same as the one observed in the given trajectories. 
Consider, for\ninstance, an MDP where s, a, s′ triples with some known property P(s, a, s′) have zero probability\n(e.g., in a gridworld or a chain-walk domain, a transition is impossible if s and s′ are not adjacent).\nThen, using a feature φ(s, a, s′) = 1[P(s, a, s′)], i.e., an indicator function over P, will constrain\nthe estimated transition probabilities to be zero for all triples where such property holds. In fact,\nκ_φ(π̃, τ) = 0 and κ̂ = 0 for any reference policy. More generally, most MDPs of practical interest\nhave properties that couple the transition probabilities of several state-action pairs. Capturing these\nglobal properties using moment-based constraints is typically much better than focusing on single\nstates or state-action pairs, which is more prone to overfitting the given trajectories. In the limiting\ncase, one could consider a separate feature (e.g., an indicator) over each s, a, s′ triple. However,\nsimilarly to rectangular solutions, having separate constraints for different state-action pairs is likely\nto lead to very conservative solutions in the presence of limited data. Finally, notice that using an\nindicator function over each s, a, s′ triple is equivalent to matching the (empirical) joint distribution\np(S_t, A_t, S_{t+1}) induced by the reference policy and the true dynamics. Thus, even when we consider\na different constraint for each triple, our solution implicitly couples the transition probabilities of\ndifferent state-action pairs and differs from a rectangular formulation which focuses on matching the\nconditional distribution p(S_{t+1}|S_t, A_t).\nA key characteristic of this formulation is the difference in control policies: the expected reward\nis defined in terms of π, while the constraints are defined in terms of π̃. 
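The chain-walk example above can be made concrete with a single indicator feature; a sketch, where the adjacency property P is the illustrative one from the text:

```python
def adjacency_violation_feature(s, a, s_next):
    """phi(s, a, s') = 1[P(s, a, s')], where P flags transitions between
    non-adjacent states of a chain-walk. Such transitions are impossible
    under the true dynamics, so the empirical statistic is zero for any
    reference policy; constraining the model's expectation to match it
    forces the estimated probability of every such triple to zero."""
    return 1.0 if abs(s_next - s) > 1 else 0.0
```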
Unfortunately, treating\nthe marginally-constrained robust control problem (Definition 2) as an optimization problem over\nthe individual state transition probabilities, τ(s_{t+1}|s_t, a_t), appears daunting. This is because the\nconstraints in Equation (5) are not convex functions of those transition probabilities. We instead consider optimizing the control policy and state transition dynamics as causally conditioned probability\ndistributions in the following section. Though the solution for this formulation does not naturally\nhave a Markovian property, our process estimation leads to an augmented-Markovian representation\nin Section 3.3.\n\n¹Notice that τ must also belong to the set of valid probability distributions. We omit the corresponding\nconstraints for the sake of clarity.\n\n3.2 Reformulation as Process Estimation\n\nWe re-express the optimization problem of Definition 2 using processes—the causally conditioned\nprobabilities of Section 2.2—for the control policy π(a_{1:T−1} || s_{1:T−1}) and state transition dynamics\nτ(s_{1:T} || a_{1:T−1}), which conveniently combine the individual conditional probabilities over the state-action sequence. Notice that we consider stochastic processes ending with a state at time T and an\naction at time T − 1. Using this new notation, we now reformulate our main optimization problem in\na more convenient manner.\nTheorem 1. 
The marginally-constrained robust control problem of Definition 2 can be solved by\nposing it as an unconstrained zero-sum game parameterized by a vector of Lagrange multipliers, ω:\n\nmax_{ω∈R^d} max_{π∈Π} softmin_{τ∈Ξ} ( E_{τ,π}[ Σ_{t=1}^{T−1} R(S_t, A_t, S_{t+1}) ] + E_{τ,π̃}[ Σ_{t=1}^{T−1} ω · φ(S_t, A_t, S_{t+1}) ] ) − ω · κ̂,   (6)\n\nwhere softmin_{x∈X} f(x) = −(1/λ) log Σ_{x∈X} e^{−λ f(x)} and · denotes the dot product.\nThe proof is given in Appendix A. Notice that Theorem 1 holds for the slack-free uncertainty set Ξ\nof Definition 2. Using the slack-based version leads to regularization of the dual parameters ω. As\nshown by [27], adding l1 regularization −β‖ω‖_1 to the dual objective is equivalent to a constraint\n‖κ_φ(π̃, τ) − κ̂‖_1 ≤ β in the primal, while adding l2² regularization −(α/2)‖ω‖_2² is equivalent to an\nl2² potential on the constraint values in the primal. In practice, it is important to add l1 and/or l2²\nregularization to ensure proper convergence of the algorithm. Both types of regularization enjoy\nsimilar theoretical guarantees [28].\nWe now address the inner minimax game for choosing τ and π in Section 3.3 and the outer optimization of ω from Equation (6) in Section 3.4.\n\n3.3 Mixed Objective Minimax Optimal Control\n\nChoosing state transition dynamics to optimize a mixture of expected returns under different control\npolicies, π and π̃ (Definition 3)², is an important subproblem arising from our formulation of robust\ncontrol as a process estimation task with robustness properties and uncertainty sets defined by different\ncontrol policies. 
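The softmin operator used in Theorem 1 can be computed in a numerically stable way by factoring the minimum out of the exponentials before summing; a sketch:

```python
import numpy as np

def softmin(values, lam):
    """softmin_x f(x) = -(1/lam) * log sum_x exp(-lam * f(x)).
    Recovers the hard minimum as lam -> infinity; shifting by the minimum
    value first keeps the exponentials finite for large lam."""
    v = np.asarray(values, dtype=float)
    m = v.min()
    return float(m - np.log(np.exp(-lam * (v - m)).sum()) / lam)
```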
To the best of our knowledge, this problem has not been previously investigated in\nthe literature.\nDefinition 3. Given two control policies π and π̃, and two reward functions R and R̃, the mixed\nobjective optimization problem seeks state transition dynamics τ that minimize a mixture of these\nweighted by θ ≥ 0: min_τ {θ ρ_R(π, τ) + (1 − θ) ρ_R̃(π̃, τ)}.\nNotice that the inner minimization of Equation (6) is an entropy-regularized instance of this problem.\nIn fact, we can set R̃(s_t, a_t, s_{t+1}) ← ω · φ(s_t, a_t, s_{t+1}) and θ = 1/2 (provided that rewards are\nproperly rescaled). As we already know from Theorem 1, the entropy leads to a softmin solution\nand does not pose any additional complication in solving the optimization problem of Definition 3.\nFurthermore, in the inner zero-sum game of Equation (6), π is chosen as the maximizer of ρ(π, τ).\nThus, we can see Definition 3 as a special case where π is fixed rather than chosen dynamically.\nAn important observation for this problem is that the optimal transition dynamics are not Markovian.\nIndeed, the influence of ρ_R and ρ_R̃ on choosing the next-state distribution at some decision point\ndepends on how probable it is for that decision point to be realized under π and under π̃. This, in turn,\ndepends on the entire history of states and actions leading to the current decision point. 
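For a fixed candidate τ, the mixed objective of Definition 3 is just a θ-weighted sum of two ordinary expected returns; a tabular sketch (evaluation only; minimizing over τ is the harder part, which Theorem 2's dynamic program addresses):

```python
import numpy as np

def _ret(tau, R, pi, p1, T):
    # finite-horizon expected return by backward recursion (tabular)
    v = np.zeros(tau.shape[0])
    for _ in range(T - 1):
        q = np.einsum('ijk,ijk->ij', tau, R + v[None, None, :])
        v = np.einsum('ij,ij->i', pi, q)
    return float(p1 @ v)

def mixed_objective(tau, theta, pi, R, pi_tilde, R_tilde, p1, T):
    """theta * rho_R(pi, tau) + (1 - theta) * rho_{R_tilde}(pi_tilde, tau):
    the adversary's criterion from Definition 3, evaluated for one tau."""
    return (theta * _ret(tau, R, pi, p1, T)
            + (1.0 - theta) * _ret(tau, R_tilde, pi_tilde, p1, T))
```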
However,\nwe establish that this non-Markovian problem can be Markovianized by augmenting the current\nstate-action pair with a continuous “belief state” as follows:\n\nb(a_{1:t} || s_{1:t}) ≜ π(a_{1:t} || s_{1:t}) / ( π(a_{1:t} || s_{1:t}) + π̃(a_{1:t} || s_{1:t}) ) = ∏_{i=1}^{t} π(a_i | a_{1:i−1}, s_{1:i}) / ( ∏_{i=1}^{t} π(a_i | a_{1:i−1}, s_{1:i}) + ∏_{i=1}^{t} π̃(a_i | a_{1:i−1}, s_{1:i}) ).   (7)\n\nThe belief state tracks the relative probability of the decision point under π and π̃. Defining it in this\nmanner is convenient because it limits the domain for b to [0, 1]. It can also be updated to incorporate\na new action a_{t+1} in state s_{t+1} as:\n\nb(a_{1:t+1} || s_{1:t+1}) = b(a_{1:t} || s_{1:t}) π(a_{t+1} | a_{1:t}, s_{1:t+1}) / ( b(a_{1:t} || s_{1:t}) π(a_{t+1} | a_{1:t}, s_{1:t+1}) + (1 − b(a_{1:t} || s_{1:t})) π̃(a_{t+1} | a_{1:t}, s_{1:t+1}) ).   (8)\n\n²Without any loss of generality, this problem could be equivalently posed as finding the control policy π\nthat maximizes a mixture of rewards θ ρ_R(π, τ) + (1 − θ) ρ_R̃(π, τ̃) for two different decision processes with\ndynamics/reward (τ, R) and (τ̃, R̃).\n\nAugmenting with the belief state of Equation (7), we prove that it is possible to compute a Markovian\nsolution to the inner zero-sum game of Equation (6) and, thus, to the optimization problem of\nDefinition 3.\nTheorem 2. Let π̃ be a given randomized Markovian policy and Z(s_t, a_t, b_{t−1}) = b_{t−1} + (1 −\nb_{t−1}) π̃(a_t|s_t), where b_t is the belief state defined in Equation (7). 
Then, a solution (π*, τ*) to the\ninner zero-sum game of Equation (6) is:\n\nτ*(s_{t+1}|s_t, a_t, b_t) = e^{−λ Q(s_t, a_t, b_t, s_{t+1})} / Σ_{s′} e^{−λ Q(s_t, a_t, b_t, s′)};   π*(s_t, b_{t−1}) = argmax_{a_t} QR(s_t, a_t, b_{t−1} / Z(s_t, a_t, b_{t−1})),   (9)\n\nwith Q as the value of a transition to state s_{t+1}, V as the value of state s_t and belief state b_{t−1}, and\nQR as the expected return from R obtained by taking action a_t in state s_t and belief state b_t:\n\nQ(s_t, a_t, b_t, s_{t+1}) = b_t R(s_t, a_t, s_{t+1}) + (1 − b_t) R̃(s_t, a_t, s_{t+1}) + V(s_{t+1}, b_t),   (10)\n\nV(s_t, b_{t−1}) = Z′(s_t, b_{t−1}) softmin_{s_{t+1}} Q(s_t, π*(s_t, b_{t−1}), b_{t−1} / Z′(s_t, b_{t−1}), s_{t+1}),   (11)\n\nQR(s_t, a_t, b_t) = Σ_{s_{t+1}} τ*(s_{t+1}|s_t, a_t, b_t) ( R(s_t, a_t, s_{t+1}) + QR(s_{t+1}, π*(s_{t+1}, b_t), b_t / Z′(s_{t+1}, b_t)) ),   (12)\n\nwhere Z′(s_t, b_{t−1}) = Z(s_t, π*(s_t, b_{t−1}), b_{t−1}).\nThe proof is given in Appendix A.\nSince we have a maximum causal entropy estimation problem, τ* (Equation 9) takes the form of a\nBoltzmann distribution with temperature λ^{−1} and energy given by Q(s_t, a_t, b_t, s_{t+1}). Function Q\n(Equation 10) specifies the value of a transition from s_t, a_t, b_t to state s_{t+1}. Intuitively, it is a sum\nof (i) the immediate return, which in turn is a mixture of rewards from R and R̃ weighted by the\ncurrent belief state, and (ii) the value of the next state s_{t+1} given that the current belief is b_t. We have\nthe additional complication that π is chosen dynamically as the maximizer of ρ_R(π, τ) rather than\nstatically. Given τ*, the optimal policy π* (Equation 9) aims at maximizing the expected future return\nfrom R defined in (12). 
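The belief-state update of Equation (8) is a two-line computation; a sketch (here `pi_prob` and `pi_tilde_prob` are the probabilities the two policies assign to the newly taken action in the new state):

```python
def belief_update(b, pi_prob, pi_tilde_prob):
    """Equation (8): the relative probability that the history was generated
    under pi rather than the reference policy. When pi is deterministic and
    selects the taken action (pi_prob = 1), this reduces to dividing b by the
    normalizer Z, as noted in the text."""
    num = b * pi_prob
    return num / (num + (1.0 - b) * pi_tilde_prob)
```

Starting from b = 0.5 (the empty-history value of Equation (7)), repeated calls track how strongly the trajectory so far favors one policy over the other.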
Notice that since the optimal policy π* is deterministic and π̃ is Markovian,\nthe belief state update rule of (8) can be written in the more concise form: b_{t+1} = b_t / Z′(s_{t+1}, b_t). Finally,\ngiven τ* and π*, we can compute the optimal value V obtained from state s_t and belief state b_{t−1} as\ndefined in (11). Algorithm 1 summarizes our Markovian dynamic program.\nIn contrast to typical value iteration in discrete MDPs, the belief states are continuous variables in\nAlgorithm 1. In practice, we discretize them by considering a set B of values in the range [0, 1] and then\ninterpolate between these points. Notice that since π* is deterministic, values in (0, 0.5) are not possible\nand can be safely neglected. This discretization allows for a compact tabular representation of all\nfunctions defined in Theorem 2. The asymptotic complexity of this procedure (Algorithm 1) is then\nO(|S|² |A| |B| T).\nThe robust policy π* returned by Algorithm 1 is, for each time-step t, a function π*_t : S × B → A\nmapping state-belief state couples to actions. For the sake of completeness, we show how such a policy can be used in a\nregular MDP with dynamics τ. Notice that, since belief states are updated according to Equation\n(8), we need to keep track of the reference policy π̃. At the first time-step, state s_1 is drawn from\nthe MDP's initial state distribution, while the initial belief state b_0 is set to 0.5, as can be seen\nfrom Equation (7). Then, action a_1 = π*_1(s_1, b_0) is taken, and the system transitions to the next\nstate s_2 ∼ τ(· | s_1, a_1). Finally, the belief state is updated to account for the choice of action a_1:\nb_1 = b_0 / Z′(s_1, b_0). 
Then, this process is repeated until the maximum time-step is reached.\n\nAlgorithm 1 Min-max Dynamic Programming\nRequire: Reference policy π̃, reward function R(s_t, a_t, s_{t+1}), feature function φ(s_t, a_t, s_{t+1}), Lagrange multiplier ω, entropy regularization weight λ\nEnsure: Robust dynamics τ*, optimal policy π*\nV(s_T, b_{T−1}) ← 0; R̃(s_t, a_t, s_{t+1}) ← ω · φ(s_t, a_t, s_{t+1})\nfor t = T − 1 to 1 do\n  Set Q(s_t, a_t, b_t, s_{t+1}) from V using (10)\n  Set τ*(·|s_t, a_t, b_t) ∝ e^{−λ Q(s_t, a_t, b_t, ·)}\n  Set QR(s_t, a_t, b_t) from τ* and QR using (12)\n  Set π*(s_t, b_{t−1}) = argmax_{a_t} QR(s_t, a_t, b_t)\n  Set V(s_t, b_{t−1}) from Q and π* using (11)\nend for\n\n3.4 Parameter Optimization\n\nStandard gradient-based methods can be used to optimize the choice of model parameters ω, since\nthe unconstrained dual objective function is a concave function of ω. Any such method is required\nto repeatedly solve the inner minimax problem of Equation (6) as specified in the previous section,\nobtaining (π*, τ*), compute the feature expectations of the reference policy π̃ under τ*, and use these\nto update ω. Conceptually, model parameters ω are chosen to motivate the adversary's dynamics to\nsatisfy the constraints from the reference policy—(approximately) matching the state-action feature\nstatistics of the training trajectories. 
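This outer loop can be sketched as plain dual (super)gradient ascent. Here `solve_inner` is a hypothetical user-supplied routine (not from the paper) that runs the inner dynamic program for a given ω and returns the feature expectations of the reference policy under the resulting adversarial dynamics:

```python
import numpy as np

def optimize_omega(solve_inner, kappa_hat, eta=0.1, iters=500, tol=1e-6):
    """Gradient ascent on the concave dual:
    omega <- omega + eta * (kappa - kappa_hat),
    stopping once the adversary's feature expectations (approximately)
    match the empirical ones."""
    omega = np.zeros_like(kappa_hat)
    for _ in range(iters):
        kappa = solve_inner(omega)
        grad = kappa - kappa_hat     # dual (super)gradient
        if np.linalg.norm(grad) < tol:
            break                    # statistics matched
        omega = omega + eta * grad
    return omega
```

With l1 or l2 regularization of ω added, the same loop stops near, rather than exactly at, feature matching.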
Hence, under the assumption that matching features is feasible,\ngradient ascent following the update rule ω_{i+1} ← ω_i + η_i (κ_φ(π̃, τ*) − κ̂) converges when the statistics\nmatch, i.e., when κ_φ(π̃, τ*) = κ̂³.\nComputing the expected features under the adversary's non-Markovian dynamics, τ*, requires an\nextension of the dynamic programming algorithm used to obtain τ* itself. The next result follows\nalmost straightforwardly from Theorem 2. For the sake of completeness, we include a proof in\nAppendix A.\nCorollary 1. Let (π*, τ*) be the belief-augmented solution of Theorem 2, p(s_1) be the initial state\ndistribution of the given MDP, and π̃ be a randomized Markovian policy. Then:\n\nκ_φ(π̃, τ*) = Σ_{s_1} p(s_1) Ψ(s_1, b_0),   (13)\n\nwhere Ψ is defined recursively for t = 1, . . . , T − 1 as:\n\nΨ(s_t, b_{t−1}) = Σ_{a_t} π̃(a_t|s_t) Σ_{s_{t+1}} τ*(s_{t+1}|s_t, a_t, b_t) [ φ(s_t, a_t, s_{t+1}) + Ψ(s_{t+1}, b_t) ],   (14)\n\nwith Ψ(s_T, b_{T−1}) = 0 and b_t = (b_{t−1} / Z(s_t, a_t, b_{t−1})) 1[a_t = π*(s_t, b_{t−1})].\nNotice that the computation of κ_φ(π̃, τ*), as given by Corollary 1, can be efficiently included in the\ndynamic program of Algorithm 1 by updating Ψ as the last step of each iteration according to (14).\n\n4 Experiments\n\nIn this section, we empirically evaluate our robust approach for control using uncertainty sets defined\nby marginal state-action statistics. We consider two different experiments. 
The first one is a classic grid navigation problem and the second one is a more challenging domain in which the goal is to control the population change of an invasive species. In all experiments, we compare our marginally-constrained approach (MC) to three other methods for estimating the state-transition dynamics: (1) a supervised approach using logistic regression (LR); (2) a robust MDP with s,a-rectangular uncertainty sets (RECT); and (3) a simple maximum likelihood estimation (MLE) of the conditional transition probabilities for all state-action pairs. Furthermore, due to the similarity between our setting and batch reinforcement learning, we also compare to fitted Q-iteration (FQI) [29].

4.1 Gridworld

We consider an agent navigating through an N × N grid in order to reach a goal position. The agent's location is described by its horizontal and vertical coordinates (x, y). At each time-step, the agent can attempt to move in each of the four cardinal directions. With probability p = 0.3, the action fails and the agent moves in a random direction instead. Attempts to move off the grid have no effect. The agent's initial position is (1, 1), while the goal is to reach state (N, N). The horizon is set to T = 2N, while the reward function is the negative ℓ1 distance between the next state and the goal.

In this experiment, we evaluate the generalization capabilities of our approach. We consider a sequence of gridworlds with increasing size. For each of them, we collect 50 trajectories under a uniform reference policy and run all algorithms on these data.
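The gridworld dynamics described above can be sketched in a few lines. This is a hedged illustration with our own (hypothetical) function name and action encoding, not code from the paper:

```python
import random

def gridworld_step(state, action, N, p_fail=0.3, rng=random):
    """One transition of the N x N gridworld (minimal sketch).
    States are 1-indexed coordinates (x, y); actions 0-3 index the four
    cardinal directions. With probability p_fail the action fails and the
    agent moves in a uniformly random direction instead."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down
    if rng.random() < p_fail:
        action = rng.randrange(4)
    dx, dy = moves[action]
    # attempts to move off the grid have no effect
    x = min(max(state[0] + dx, 1), N)
    y = min(max(state[1] + dy, 1), N)
    # reward: negative l1 distance between the next state and the goal (N, N)
    reward = -(abs(N - x) + abs(N - y))
    return (x, y), reward
```

Setting `p_fail=0.0` recovers the deterministic part of the dynamics, which is convenient for checking the boundary and reward behavior.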
Intuitively, for small grids, such trajectories provide enough exploration to allow all methods to accurately approximate the state-transition dynamics. However, as the grid grows larger, only a small portion of the state space is observed in the training data. Thus, generalization is required to achieve good performance. Additional details on the adopted parameters are given in Appendix C.1.

³When ℓ1 or ℓ2² regularization of ω is used, this procedure converges when the feature expectations are close to the sample statistics, where the closeness depends on the amount of regularization used (see Section 3.2).

Figure 1: Results of the gridworld experiments, each with 95% confidence intervals. (a) Expected return under the true dynamics as a function of the grid size. (b) Expected return under the estimated (robust) dynamics as a function of the grid size. (c) Approximation error incurred by our algorithm due to the discretization of the belief space.

Figure 1a shows the expected return achieved by all algorithms as a function of the grid size N. Results are averaged over 20 runs. As expected, for small grids (e.g., N ≤ 7), all approaches obtain nearly-optimal performance. However, as the grid size increases, only our method is able to estimate dynamics that generalize across unseen regions of the state space, thus maintaining nearly-optimal performance. FQI is also able to generalize and achieves a significant improvement over the other alternatives, but it cannot compete with our method due to the small number of trajectories available. LR tends to estimate very optimistic dynamics, thus leading to worse performance. Finally, RECT obtains results comparable to LR even without generalizing. However, rectangular uncertainty sets are too conservative to compete with our method.
To better demonstrate this fact, Figure 1b shows the performance achieved by the optimal policy computed by each algorithm under its own estimated dynamics (except for FQI, which is model-free). We clearly notice that the worst-case expected return obtained by the rectangular solution is, as claimed, very conservative. Our approach, on the other hand, achieves a worst-case performance comparable to the true performance of the other methods. Due to their optimistic estimates, both LR and MLE obtain an expected return even larger than the optimal one.

Finally, we analyze the approximation error incurred from discretizing the belief states in our approach. We consider a 5 × 5 gridworld with the same parameters as before and run the dynamic program of Algorithm 1 for 50 random values of ω using two different reference policies: the uniform one and a random one. Figure 1c shows the average absolute deviation of the objective function from its true value as a function of the number of discrete belief states Nb. Since, as we can observe from (7), the total number of belief states that are reachable in a finite horizon depends on the number of different probability values assigned by π̃, the uniform reference policy achieves a very small approximation error even with few belief states. Interestingly, the approximation error for a random reference policy, which can be regarded as a "worst-case" scenario, can also be reduced using a relatively small number of belief states.

4.2 Invasive Species

We next consider modeling the population change of an invasive species in an ecosystem with a single action available for mitigating its spread (e.g., introducing a predator). Our starting point is a state-space model with exponential dynamics adapted from Chapter 5 of [30]. Each state captures the current abundance of the invasive species, which we denote as Nt at time t.
The population evolves according to exponential dynamics, so that Nt+1 = min{νt Nt, K}, where K is the maximum carrying capacity. The growth rate νt depends on (i) whether the control action at has been applied, (ii) the current population level Nt, and (iii) random noise. When the control action is not applied (at = 0), the growth rate is νt = max{0, ν̄ + N(0, σ²_ν)}, where ν̄ is the mean growth rate. In this case, the growth rate is independent of the current population level. When the control action is applied (at = 1), the growth rate is νt = ν̄ − β1 Nt − β2 max{0, Nt − N̂}² + N(0, σ²_ν), where

Table 1: Negative expected return for different numbers of trajectories M and reference policy's control probabilities p in the invasive species experiment. Each value is the average of 20 independent runs. 95% confidence intervals are shown.
The best algorithms are highlighted in bold.

Alg.    M     p = 0.1         p = 0.2         p = 0.3         p = 0.4         p = 0.5
MLE     50    121.74 ± 0.82   128.34 ± 2.06   140.36 ± 1.28   147.19 ± 1.78   149.82 ± 2.12
LR      50    152.95 ± 13.5   106.77 ± 2.21   117.43 ± 5.09   122.76 ± 5.94   123.28 ± 4.82
MC      50    99.37 ± 0.96    102.38 ± 1.82   98.36 ± 0.78    107.39 ± 3.44   124.47 ± 1.81
RECT    50    111.91 ± 5.33   107.71 ± 4.13   117.15 ± 6.76   123.55 ± 7.95   142.26 ± 8.28
FQI     50    140.85 ± 6.11   133.08 ± 5.36   133.77 ± 4.70   134.05 ± 6.22   140.25 ± 5.04
MLE     100   120.91 ± 0.63   125.21 ± 1.25   134.23 ± 1.33   140.96 ± 1.76   145.42 ± 1.72
LR      100   169.27 ± 8.72   104.70 ± 3.43   110.09 ± 2.57   114.23 ± 2.49   124.53 ± 4.98
MC      100   98.25 ± 0.88    103.66 ± 1.05   96.20 ± 0.95    105.17 ± 1.95   115.04 ± 6.18
RECT    100   100.98 ± 3.33   103.80 ± 3.22   108.69 ± 4.95   106.18 ± 4.02   136.24 ± 8.41
FQI     100   126.66 ± 5.84   121.93 ± 6.27   119.85 ± 4.30   125.65 ± 5.08   131.51 ± 4.92

β1 and β2 are the coefficients of effectiveness and N̂ is the population at which the effectiveness peaks. That is, the effectiveness of the control method may increase or decrease depending on the population of the invasive species. This dependence is modeled using a simplified quadratic spline. The precise population Nt of the species cannot be directly observed. Instead, one can observe a noisy estimate yt = Nt + N(0, σ²_y).
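As a hedged sketch (not the paper's code), one step of these population dynamics can be simulated as follows; `species_step` and the constant names are our own, with values taken from the parameters reported for this experiment:

```python
import random

# Parameter values as reported for the invasive-species experiment; N_HAT is
# the population at which the control's effectiveness peaks.
K, NU_BAR = 500, 1.02
BETA1, BETA2 = 0.001, -0.0000021
N_HAT, SIGMA2_NU, SIGMA2_Y = 300, 0.02, 20

def species_step(N_t, a_t, rng=random):
    """Return (N_{t+1}, y_{t+1}): next population and its noisy observation."""
    noise = rng.gauss(0.0, SIGMA2_NU ** 0.5)
    if a_t == 0:  # no control: growth rate independent of population level
        nu_t = max(0.0, NU_BAR + noise)
    else:  # control applied: effectiveness depends on the current population
        nu_t = NU_BAR - BETA1 * N_t - BETA2 * max(0.0, N_t - N_HAT) ** 2 + noise
    N_next = min(nu_t * N_t, K)  # exponential growth, capped at capacity K
    y_next = N_next + rng.gauss(0.0, SIGMA2_Y ** 0.5)  # noisy observation
    return N_next, y_next
```

Passing a stub `rng` whose `gauss` returns 0 recovers the noise-free dynamics, which is useful for sanity-checking the growth-rate formulas.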
The exact values of the parameters used in this experiment are K = 500, T = 100, N̂ = 300, ν̄ = 1.02, β1 = 0.001, β2 = −0.0000021, σ²_ν = 0.02, σ²_y = 20. Notice that, due to its highly unstable dynamics and noisy observations, this domain represents a very challenging control problem.

In this experiment, we analyze the behavior of all algorithms when given different numbers of trajectories collected under different reference policies. In particular, we consider five reference policies, where each chooses to apply the control action with a fixed probability p ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. For each reference policy, we generate two datasets of M1 = 50 and M2 = 100 trajectories, respectively. Additional details are given in Appendix C.2.

Results of our experiments in these settings are reported in Table 1. Each datapoint is obtained as the average over 20 runs. We notice that MC outperforms all alternatives when p < 0.5 and M = 50. As before, this is due to its generalization capabilities. When considering M = 100 trajectories, all other approaches significantly improve their performance. However, MC is still able to achieve better results for most values of p. The rectangular solution (RECT) also achieves good performance, but shows a much higher variability. Finally, we note that all algorithms suffer from the very limited exploration provided by a reference policy with p = 0.5. In such cases, the performance of the feature-based approaches is superior.

5 Conclusion & Future Work

In this paper, we have proposed a new approach to robust control based on causally conditioned probability distribution estimation that defines uncertainty sets using features of interactions with the decision process under a different policy.
Though the solution to the corresponding robust control problem is non-Markovian, we show that it can be closely approximated by augmenting the typical Markovian robust MDP formulation [31, 5] with a continuous-valued "belief state" that can then be discretized. We have empirically tested our approach on a synthetic experiment and a real-world control problem, highlighting its advantages over methods that form rectangular uncertainty sets.

We plan to extend our formulation to incorporate constraints that are obtained from multiple separate reference control policies. This could also enable episodic reinforcement learning [32], where the robust optimal control policy is employed and then updated based on the trajectories observed from its application. Incorporating more sophisticated ideas for solving POMDPs using belief state compression will likely be required, since discretizing the belief space scales poorly with the number of different reference policies.

Acknowledgments

We thank the anonymous reviewers whose comments helped to improve the paper significantly. This work was supported, in part, by the National Science Foundation under Grant No. 1652530 and Grant No. 1717368, and by the Future of Life Institute (futureoflife.org) FLI-RFP-AI1 program.

References

[1] J. Andrew (Drew) Bagnell. Learning Decisions: Robustness, Uncertainty, and Approximation. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, August 2004.

[2] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

[3] Grani Adiwena Hanasusanto and Daniel Kuhn. Robust data-driven dynamic programming. In Advances in Neural Information Processing Systems, pages 827–835, 2013.

[4] Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem.
Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.

[5] Shie Mannor, Ofir Mebel, and Huan Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, Scotland, UK, 2012.

[6] J. Andrew (Drew) Bagnell, Andrew Y. Ng, and Jeff Schneider. Solving uncertain Markov decision problems. Technical Report CMU-RI-TR-01-25, Carnegie Mellon University, Pittsburgh, PA, August 2001.

[7] Garud N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

[8] Yann Le Tallec. Robust, Risk-Sensitive, and Data-driven Control of Markov Decision Processes. PhD thesis, MIT, 2007.

[9] Marek Petrik, Mohammad Ghavamzadeh, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, 2016.

[10] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.

[11] Huan Xu and Shie Mannor. Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 2505–2513, 2010.

[12] Wolfram Wiesemann, Daniel Kuhn, and Melvyn Sim. Distributionally robust convex optimization. Operations Research, 62(6):1358–1376, 2014.

[13] Marek Petrik and Dharmashankar Subramanian. RAAM: The benefits of robustness in approximating aggregated MDPs in reinforcement learning. In Neural Information Processing Systems (NIPS), 2014.

[14] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 2005.

[15] G. Kramer. Directed Information for Channels with Feedback.
PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, 1998.

[16] Hans Marko. The bidirectional communication theory – a generalization of information theory. IEEE Transactions on Communications, pages 1345–1351, 1973.

[17] James L. Massey. Causality, feedback and directed information. In Proc. IEEE International Symposium on Information Theory and Its Applications, pages 27–30, 1990.

[18] Haim H. Permuter, Young-Han Kim, and Tsachy Weissman. On directed information and gambling. In Proc. IEEE International Symposium on Information Theory, pages 1403–1407, 2008.

[19] S. Tatikonda. Control under Communication Constraints. PhD thesis, Massachusetts Institute of Technology, 2000.

[20] G. Kramer. Capacity results for the discrete memoryless network. IEEE Transactions on Information Theory, 49(1):4–21, January 2003.

[21] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Modeling interaction via the principle of maximum causal entropy. In Proc. International Conference on Machine Learning, pages 1255–1262, 2010.

[22] Bita Analui and Georg Ch. Pflug. On distributionally robust multiperiod stochastic optimization. Computational Management Science, 11(3):197–220, 2014.

[23] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. International Conference on Machine Learning, pages 1–8, 2004.

[24] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proc. AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.

[25] Zhaolin Hu and L. Jeff Hong. Kullback-Leibler divergence constrained distributionally robust optimization. 2013.

[26] B. D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy.
PhD thesis, Carnegie Mellon University, 2010.

[27] Jun'ichi Kazama and Jun'ichi Tsujii. Evaluation and extension of maximum entropy models with inequality constraints. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 137–144. Association for Computational Linguistics, 2003.

[28] Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8:1217–1260, 2007.

[29] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(1):503–556, 2005.

[30] M. Kery and M. Schaub. Bayesian Population Analysis Using WinBUGS: A Hierarchical Perspective. Elsevier Science, 2011.

[31] A. B. Philpott and V. L. de Matos. Dynamic sampling algorithms for multi-stage stochastic programs with risk aversion. European Journal of Operational Research, 218(2):470–483, 2012.

[32] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

[33] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.