{"title": "Tractable Objectives for Robust Policy Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2069, "page_last": 2077, "abstract": "Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP.", "full_text": "Tractable Objectives for Robust Policy Optimization\n\nKatherine Chen\n\nUniversity of Alberta\nkchen4@ualberta.ca\n\nMichael Bowling\nUniversity of Alberta\n\nbowling@cs.ualberta.ca\n\nAbstract\n\nRobust policy optimization acknowledges that risk-aversion plays a vital role in\nreal-world decision-making. 
When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP.

1 Introduction

Reinforcement learning is focused on learning optimal policies from trajectories of data. One common approach is to build a Markov decision process (MDP) with parameters (i.e., rewards and transition probabilities) learned from data, and then find an optimal policy: a sequence of actions that would maximize expected cumulative reward in that MDP. However, optimal policies are sensitive to the estimated reward and transition parameters. The optimal performance on the estimated MDP is unlikely to be actually attained under the true, but unknown, parameter values. Furthermore, optimizing for the estimated parameter realization may risk unacceptable performance under other, less likely, parameter realizations.
For example, consider a data-driven medical decision support setting: given one-step trajectory data from a controlled trial, the goal is to identify an effective treatment policy. The policy that maximizes expected utility under a single estimated model, or even averaged over a distribution of models, may still result in poor outcomes for a substantial minority of patients. What is called for is a policy that is more robust to the uncertainties of individual patients.

There are two main approaches for finding robust policies in MDPs with parameter uncertainty. The first assumes rewards and transitions belong to a known and compact uncertainty set, which also includes a single nominal parameter setting that is thought most likely to occur [19]. Robustness, in this context, is a policy's performance under worst-case parameter realizations from the set, and is something one must trade off against how well the policy performs under the nominal parameters. In many cases, the robust policies found are overly conservative because they do not take into account how likely it is for an agent to encounter worst-case parameters. The second approach takes a Bayesian perspective on parameter uncertainty, where a prior distribution over the parameter values is assumed to be given, with the goal of optimizing performance for a particular percentile [4]. Unfortunately, the approach assumes specific distributions of parameter uncertainty in order to be tractable, e.g., rewards from Gaussians and transition probabilities from independent Dirichlets. In fact, percentile optimization with general parameter uncertainty is NP-hard [3].

In this paper we focus on the Bayesian setting where a distribution over the parameters of the MDP is given. Rather than restricting the form of the distribution in order to achieve tractable algorithms, we consider general parameter uncertainty and instead explore the space of possible objectives.
We introduce a generalization of percentile optimization with objectives defined by a measure over percentiles instead of a single percentile. This family of objectives subsumes tractable objectives such as optimizing for expected value, worst case, or Conditional Value-at-Risk, as well as intractable objectives such as optimizing for a single specific percentile (percentile optimization or Value-at-Risk). We then introduce a particular family of percentile measures which can be efficiently approximated. We show this by framing the problem as a two-player, zero-sum, extensive-form game, and then employing a form of counterfactual regret minimization to find near-optimal policies in time polynomial in the number of states and actions in the MDP. We give a further generalization of this family by proving a general, but sufficient, condition under which percentile measures admit efficient optimization. Finally, we empirically demonstrate our algorithm on a synthetic uncertain MDP setting inspired by finding robust policies for diabetes management.

2 Background

We begin with an overview of Markov decision processes and existing techniques for dealing with uncertainty in the parameters of the underlying MDP. In section 3, we show that many of the objectives described here are special cases of percentile measures.

2.1 Markov Decision Processes

A finite-horizon Markov decision process is a tuple M = ⟨S, A, R, P, H⟩. S is a finite set of states, A is a finite set of actions, and H is the horizon. The decision agent starts in an initial state s₀, drawn from an initial state distribution P(s₀). System dynamics are defined by P(s, a, s′) = P(s′ | s, a), which gives the probability of transitioning from state s ∈ S to state s′ ∈ S after taking action a ∈ A.
The immediate reward for being in a state and taking an action is defined by the reward function R : S × A → ℝ. We will assume the rewards are bounded so that |R(s, a)| ≤ Δ/2. We denote Π_HR as the set of all history-dependent randomized policies, i.e., those that map sequences of state-action pairs and the current state to a probability distribution over actions. We denote Π_MR as the set of all Markov randomized policies, i.e., those that map only the current state and timestep to a probability distribution over actions. For a fixed MDP M, the objective is to compute a policy π that maximizes expected cumulative reward,

    V^π_M = E[ Σ_{t=0}^{H} R(s_t, a_t) | M, s₀ ∼ P(s₀), π ].    (1)

For a fixed MDP, the set of Markov randomized policies (in fact, Markov deterministic policies) contains a maximizing policy. This is called the optimal policy for the fixed MDP: π* = argmax_{π ∈ Π_MR} V^π_M. However, for MDPs with parameter uncertainty, Markov randomized policies may not be a sufficient class. We will return to this issue again when discussing our own work.

2.2 MDPs with Parameter Uncertainty

In this paper, we are interested in the situation where the MDP parameters, R and P, are not known. In general, we call this an uncertain MDP. The form of this uncertainty and associated optimization objectives has been the topic of a number of papers.

Uncertainty Set Approach. One formulation for parameter uncertainty assumes that the parameters are taken from uncertainty sets, R ∈ ℛ and P ∈ 𝒫 [12].
In the robust MDP approach, the desired policy maximizes performance under the worst-case parameters from the uncertainty sets:

    π* = argmax_π min_{R ∈ ℛ, P ∈ 𝒫} V^π_M.    (2)

The robust MDP objective has been criticized for being overly conservative, as it focuses entirely on the worst case [19]. A further refinement is to assume that a nominal fixed MDP model is also given, which is thought to be a good guess for the true model. A mixed optimization objective is then proposed that trades off between the nominal performance and the robust (worst-case) performance [19]. However, neither the robust MDP objective nor the mixed objective considers a policy's performance under parameter realizations other than the nominal and worst cases, and neither considers the relative likelihood of encountering these realizations.

Xu and Mannor [20] propose a further alternative by placing parameter realizations into nested uncertainty sets, each associated with a probability of drawing a parameter realization from the set. They then propose a distributional robustness approach, which maximizes the expected performance over the worst-case distribution of parameters that satisfies the probability bounds on the uncertainty sets. This approach is a step between the specification of uncertainty sets and a Bayesian approach with a fully specified MDP parameter distribution.

Bayesian Uncertainty Approach. The alternative formulation to uncertainty sets is to assume that the true parameters of the MDP, R* and P*, are distributed according to a known distribution P(R, P). A worst-case analysis in such a formulation is nonsensical, except for distributions with bounded support (e.g., uniform distributions), in which case it offers nothing over uncertainty sets. A natural alternative is to look at percentile optimization [4].
For a fixed η, the objective is to seek a policy that maximizes the performance guaranteed on an η fraction of parameter realizations. Formally, this results in the following optimization:

    π* = argmax_π max_{y ∈ ℝ} y
    subject to P_M[V^π_M ≥ y] ≥ η.    (3)

The optimal policy π* guarantees that the optimal value y* is achieved with probability η under the distribution over parameters P(R, P). Delage and Mannor showed that for general reward and/or transition uncertainty, percentile optimization is NP-hard (even for a small fixed horizon) [3]. They did show that for Gaussian reward uncertainty the optimization can be efficiently solved as a second-order cone program. They also showed that for transitions with independent Dirichlet distributions that are sufficiently peaked (e.g., given enough observations), optimizing an approximation of the expected performance over the parameters approximately optimizes percentile performance [4].

Objectives from Financial Economics. Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR) are optimization objectives used to assess the risk of financial portfolios. Value-at-Risk is equivalent to percentile optimization and is intractable for general forms of parameter uncertainty. Additionally, it is not a coherent risk measure in that it does not satisfy subadditivity, a key coherence property which states that the risk of a combined portfolio must be no larger than the sum of the risks of its components. In contrast, Conditional Value-at-Risk at the η% level is defined as the "average of the η · 100 worst losses" [1]. It is both a coherent and a tractable objective [13]. In section 3 we show that CVaR is also encompassed by percentile measures.

Restrictions on Parameter Uncertainty.
One commonality among previous approaches is that they all place heavy restrictions on the form of parameter uncertainty in order to obtain efficient algorithms. A common requirement, for example, is that the uncertainty between states is uncoupled or independent, or that reward and transition uncertainty are themselves uncoupled or independent. A very recent paper relaxes this coupling in the context of uncertainty sets; however, the relaxation still takes a very specific form, allowing for a finite number of deviations [9]. Another common assumption is that the uncertainty is non-stationary, i.e., a state's parameter realization can vary independently with each visit. The Delage and Mannor work on percentile optimization [4] makes the more natural assumption that the uncertain parameters are stationary, but in turn requires very specific choices for the uncertainty distributions themselves. In this work, we avoid making assumptions on the form of parameter uncertainty beyond the ability to sample from the distribution. Instead, we focus on identifying the possible optimality criteria which admit efficient algorithms.

3 Percentile Measures

[Figure 1: Examples of percentile measures. (a) Percentile optimization: Dirac deltas at the 10th, 25th, and 40th percentiles. (b) 1-of-N measures for N = 1, 2, 5, 10. (c) k-of-N measures for N = 1000 and k = 100, 250, 400, 1000.]

We take the Bayesian approach to uncertainty where the true MDP parameters are assumed to be distributed according to a known distribution, i.e., the true MDP M* is distributed according to an arbitrary distribution P(M).
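Since the only assumption placed on P(M) is the ability to sample from it, the percentile-style objectives above can be estimated empirically from samples of a fixed policy's value. A minimal sketch, assuming per-MDP values V^π_M have already been computed (the numbers below are hypothetical stand-ins, not results from the paper):

```python
# Illustrative sketch: empirical percentile-style objectives of a fixed
# policy, computed from sampled values V^pi_M, one per MDP drawn from P(M).

def empirical_quantile(values, eta):
    """The eta-quantile (VaR) of the samples: an estimate of the largest y
    such that P[V >= y] >= 1 - eta. eta = 0 recovers the worst sample."""
    ranked = sorted(values)
    idx = min(int(eta * len(ranked)), len(ranked) - 1)
    return ranked[idx]

def empirical_cvar(values, eta):
    """CVaR at level eta: the average of the worst eta-fraction of samples.
    eta = 1 recovers the sample mean."""
    ranked = sorted(values)
    k = max(1, int(eta * len(ranked)))
    return sum(ranked[:k]) / k

# Hypothetical sampled values of one policy across ten sampled MDPs.
values = [10.0, 4.0, 7.0, 1.0, 8.0, 6.0, 9.0, 3.0, 5.0, 2.0]
print(empirical_quantile(values, 0.2))  # -> 3.0
print(empirical_cvar(values, 0.2))      # -> 1.5
```

Percentile measures generalize exactly these estimates: a Dirac measure at η corresponds to the quantile, and uniform weight on the bottom η percentiles corresponds to CVaR.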
We begin by delineating a family of objectives for robust policy optimization which generalizes the concept of percentile optimization. While percentile optimization is already known to be NP-hard, in section 4 we will restrict our focus to a subclass of our family that does admit efficient algorithms. Rather than seeking to maximize one specific percentile of MDPs, our family of objectives maximizes an integral of a policy's performance over all percentiles η ∈ [0, 1] of MDPs M, weighted by a percentile measure μ. Formally, given a measure μ over the interval [0, 1], a μ-robust policy is the solution to the following optimization:

    π* = argmax_{π ∈ Π} sup_{y ∈ F} ∫ y(η) dμ(η)
    subject to P_M[V^π_M ≥ y(η)] ≥ 1 − η, ∀η ∈ [0, 1],    (4)

where F is the class of real-valued, bounded, μ-integrable functions on the interval [0, 1].

There are many possible ways to choose the measure μ, each of which corresponds to a different robustness interpretation and degree. In fact, our percentile measures framework encompasses optimization objectives for the expected, robust, and percentile MDP problems, as well as for VaR and CVaR. In particular, if μ is the Lebesgue measure (i.e., a uniform density over the unit interval), all percentiles are equally weighted and the μ-robust policy will optimize the expected cumulative reward over the distribution P(M). In other words, it maximizes E_M[V^π_M]. This objective was explored by Mannor et al. [10], who concluded that the common approach of computing an optimal policy for the expected MDP, i.e., maximizing V^π_{E[M]}, results in a biased optimization of the desired value expectation under general transition uncertainty.
Alternatively, when μ = δ_{0.1}, where δ_η is the Dirac delta at η, the optimization problem becomes identical to the VaR and percentile optimization problems at the 10th percentile. The measures for the 10th, 25th, and 40th percentiles are shown in figure 1a. When μ = δ₀, the optimization problem becomes the worst-case robust MDP problem over the support of the distribution P(M). Finally, if μ is a decreasing step function at η, this corresponds to the CVaR objective at the η% level, with equal weight on the bottom η percentiles and zero weight elsewhere.

4 k-of-N Measures

There is little reason to restrict ourselves to percentile measures that put uniform weight on all percentiles, or Dirac deltas on the worst case or on specific percentiles. One can imagine creating other density functions over percentiles, and not all of these percentile measures will necessarily be intractable like percentile optimization. In this section we introduce a subclass of percentile measures, called k-of-N measures, and go on to show that we can efficiently approximate μ-robust policies for this entire subclass.

We start by imagining a sampling scheme for evaluating the robustness of a fixed policy π. Consider sampling N = 1000 MDPs from the distribution P(M). For each MDP we can evaluate the policy π and then rank the MDPs by how much expected cumulative reward π attains on each. If we choose to evaluate our policy based on the very worst of these MDPs, that is, the k = 1 of the N = 1000 MDPs, then we get a loose estimate of the percentile value of π in the neighborhood of the 1/1000th percentile of the distribution P(M). If we sample just N = 1 MDP, then we get an estimate of π's expected return over the distribution.
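The sampling scheme just described is easy to simulate. In the sketch below, drawing an MDP and evaluating the fixed policy on it is abstracted into a single hypothetical sample_value function returning V^π_M for M ∼ P(M); the rest follows the scheme directly: sample N values, keep the k least favorable, and average.

```python
import random

# Sketch of the k-of-N evaluation scheme: sample N MDPs, rank them by the
# policy's expected cumulative reward, and score the policy by the average
# value of the k least-favorable MDPs. sample_value is a hypothetical
# stand-in for "draw M ~ P(M), then evaluate V^pi_M".

def k_of_n_value(sample_value, k, n, trials=2000, rng=None):
    """Monte Carlo estimate of a policy's value under the k-of-N measure."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(trials):
        draws = sorted(sample_value(rng) for _ in range(n))
        total += sum(draws[:k]) / k   # mean over the k worst of the N draws
    return total / trials

# Hypothetical value distribution: V^pi_M ~ Uniform(0, 1).
value_dist = lambda rng: rng.uniform(0.0, 1.0)

mean_est = k_of_n_value(value_dist, k=1, n=1)     # ~ expected value over P(M)
robust_est = k_of_n_value(value_dist, k=1, n=10)  # weight shifts to low percentiles
print(mean_est, robust_est)
```

With k = N the estimate recovers the expected value over P(M), while k = 1 with large N approaches worst-case performance, matching the spectrum of densities in Figure 1.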
Each choice of N results in a different density, and corresponding measure, over the percentiles on the interval [0, 1]. Figure 1b depicts the shape of the density when we hold k = 1 while increasing the number of sampled MDPs, N. As N increases, the objective puts more weight on the lower percentiles of MDPs. Thus we can smoothly transition from finding policies that perform well in expectation (no robustness) to policies that care almost exclusively about worst-case performance (overly conservative robustness). Alternatively, after sampling N MDPs we could instead choose the expected cumulative reward of a random MDP from the k ≥ 1 least-favorable MDPs for π. For every choice of k and N, this gives a different density function and associated measure. Figure 1c shows the density function for N = 1000 while increasing k. The densities themselves act as approximate step functions whose weight falls off in the neighborhood of the percentile η = k/N. Furthermore, as N increases, the shape of the density more closely approximates a step function, and thus more closely approximates the CVaR objective. For a particular N and k, we call this measure the k-of-N measure, or μ_{k-of-N}.

Proposition 1. For any 1 ≤ k ≤ N, the density g of the measure μ_{k-of-N} is g(η) ∝ 1 − I_η(k, N − k), where I_x(α, β) = B(x; α, β)/B(α, β) is the regularized incomplete beta function.

The proof can be found in the supplemental material.

4.1 k-of-N Game

Our sampling description of the k-of-N measure can be reframed as a two-player, zero-sum, extensive-form game with imperfect information, as shown in Figure 2.
Each node in the tree represents a game state, or history, labeled with the player whose turn it is to act, with each branch being a possible action.

[Figure 2: k-of-N game tree.]

In our game formulation, chance, denoted as player c, first selects N MDPs according to P(M). The adversary, denoted as player 2, has only one decision in the game, which is to select a subset of k MDPs out of the N, from which chance selects one MDP M uniformly at random. At this point, the decision maker, denoted as player 1, has no knowledge of the sampled MDPs, the choice made by the adversary, or the final selected MDP. Hence, player 1 might be in any one of the circled nodes and cannot distinguish one from another. Such histories are partitioned into one set, termed an information set, and the player's policy must be identical for all histories in an information set. The decision maker now alternates turns with chance, observing states sampled by chance according to the chosen MDP's transition function, but never observing the chosen MDP itself; i.e., histories with the same sequence of sampled states and chosen actions belong to the same information set for player 1. After the horizon has been reached, the utility of the leaf node is simply the sum of the immediate rewards of the decision maker's actions according to the chosen MDP's reward function. The decision maker's behavioral strategy in the game maps information sets of the game to a distribution over actions.
Since the only information is the observed state-action sequence, the strategy can be viewed as a policy in Π_HR (or possibly Π_MR, as we will discuss below).

Because the k-of-N game is zero-sum, a Nash equilibrium policy in the game is one that maximizes its expected utility against its best-response adversary. The best-response adversary for any policy is the one that chooses the k least-favorable MDPs for that policy. Thus a policy's value against its best response is, in fact, its value under the measure μ_{k-of-N}. Hence, a Nash equilibrium policy for the k-of-N game is a μ_{k-of-N}-robust policy. Furthermore, an ε-Nash equilibrium policy is a 2ε-approximation of a μ_{k-of-N}-robust policy.

4.2 Solving k-of-N Games

In the past five years there have been dramatic advances in solving large zero-sum extensive-form games with imperfect information [21, 5, 8]. These algorithmic advancements have made it possible to solve games five orders of magnitude larger than previously possible. Counterfactual regret minimization (CFR) is one such approach [21]. CFR is an efficient form of regret minimization for extensive-form games. Its use in solving extensive-form games is based on the principle that two no-regret learning algorithms in self-play will have their average strategies converge to a Nash equilibrium. However, the k-of-N game presents a difficulty due to the imbalance in the size of the two players' strategies. While player 1's strategy is tractable (the size of a policy in the underlying MDP), player 2's strategy involves decisions at infinitely many information sets (one for each sampled set of N MDPs).

A recent variant of CFR, called CFR-BR, specifically addresses the challenge of an adversary having an intractably large strategy space [6]. It combines two ideas.
First, it avoids representing the entirety of the second player's strategy space by having that player always play a best response to the first player's strategy. The repeated games then involve a CFR algorithm playing against its own best response. Note that a best response is also a regret-minimizing strategy, and so such repeated play still converges to a Nash equilibrium. Second, it avoids having to compute or store a complete best response by employing sampling over chance outcomes to focus the best-response and regret updates on a small subtree of the game on each iteration. The approach removes all dependence on the size of the adversary's strategy space in either computation time or memory. Furthermore, it can be shown that the player's current strategy almost always approaches a Nash equilibrium strategy, and so there is no need to worry about strategy averaging. CFR-BR has the following convergence guarantee.

Theorem 1 (Theorems 4 and 6 [6]). For any p ∈ (0, 1], after T* iterations of chance-sampled CFR-BR, where T* is chosen uniformly at random from {1, …, T}, with probability (1 − p), player 1's strategy on iteration T* is part of an ε-Nash equilibrium with

    ε = (1 + 2/√p) · 2 H Δ |I₁| √|A₁| / (p √T),

where HΔ is the maximum difference in total reward over H steps, and |I₁| is the number of information sets for player 1.

The key property of this theorem is that the bound decreases with the number of iterations T and has no dependence on the size of the adversary's strategy space. The random stopping time of the algorithm is unusual and is needed for the high-probability guarantee.
Johanson and colleagues note, "In practice, our stopping time is dictated by convenience and availability of computational resources, and so is expected to be sufficiently random." [6]; we follow this practice.

The application of chance-sampled CFR-BR to k-of-N games is straightforward. The algorithm is iterative. On each iteration, N MDPs are sampled from the uncertainty distribution. The best response for this subtree of the game involves simply evaluating the player's current MDP policy on the N MDPs and choosing the least-favorable k. Chance samples again by choosing a single MDP from the least-favorable k. The player's regrets are then updated using the transitions and rewards of the selected MDP, resulting in a new policy for the next iteration. See the supplemental material for complete details.

Markovian Policies and Imperfect Recall. One important detail remains that we have not discussed: the nature and size of player 1's strategy space. In finite-horizon MDPs with no parameter uncertainty, an optimal policy exists in the space of Markovian policies (Π_MR) — policies that depend only on the number of timesteps remaining and the current state, but not on the history of past states and actions. Under transition uncertainty, this is no longer true. The sequence of past states and actions provides information about the uncertain transition parameters, which is informative for future transitions. In this case, optimal policies are not in general Markovian, as they will depend upon the entire history of states and actions (Π_HR). As a result, the number of information sets (i.e., decision points) in an optimal policy is |I₁| = |S|((|S||A|)^H − 1)/(|S||A| − 1), which is polynomial in the number of states and actions for any fixed horizon, but exponential in the horizon itself.
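To make the iteration concrete, the following deliberately tiny sketch runs a regret-minimizer-versus-best-response loop on a one-step game with a single information set, where each "MDP" is abstracted to a vector of action rewards drawn from a hypothetical uncertainty distribution. It illustrates the update pattern only and omits the full tree-walking machinery of CFR-BR:

```python
import random

# Minimal sketch of the CFR-BR-style loop for a one-step k-of-N game.
# One information set, |A| actions; an "MDP" is just a reward-per-action
# vector sampled from a hypothetical distribution. Not the paper's full
# algorithm.

def regret_matching(regrets):
    """Policy proportional to positive regrets (uniform if none positive)."""
    pos = [max(r, 0.0) for r in regrets]
    s = sum(pos)
    return [p / s for p in pos] if s > 0 else [1.0 / len(regrets)] * len(regrets)

def solve_k_of_n_bandit(sample_rewards, n_actions, k, n, iters=3000, seed=0):
    rng = random.Random(seed)
    regrets = [0.0] * n_actions
    for _ in range(iters):
        policy = regret_matching(regrets)
        mdps = [sample_rewards(rng) for _ in range(n)]   # chance samples N MDPs
        value = lambda r: sum(p * ri for p, ri in zip(policy, r))
        mdps.sort(key=value)                             # adversary best-responds:
        chosen = mdps[rng.randrange(k)]                  # worst k, then one at random
        v = value(chosen)
        for a in range(n_actions):                       # regret update (one info set)
            regrets[a] += chosen[a] - v
    return regret_matching(regrets)

# Hypothetical uncertainty: action 0 always pays 1; action 1 pays 2 or -2.
dist = lambda rng: [1.0, 2.0] if rng.random() < 0.5 else [1.0, -2.0]
print(solve_k_of_n_bandit(dist, n_actions=2, k=1, n=5))  # concentrates on the safe action
```

The adversary's best response is just the sort-and-select step, so nothing about its intractably large strategy space is ever stored, which is the point of CFR-BR.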
While being exponential in the horizon may seem like a problem, there are many interesting real-world problems with short time horizons. One such class of problems is adaptive treatment strategies (ATS) for sequential medical treatment decisions [11, 15]. Many ATS problems have time horizons of H ≤ 3, e.g., CATIE (H = 2) [16, 17] and STAR*D (H = 3) [14].

Under reward uncertainty (where rewards are not observed by the agent while acting), the sequence of past states and actions is not informative, and so Markovian policies again suffice.¹ In this case, the number of information sets is |I₁| = |S|H, which is polynomial in both the number of states and the horizon. However, such an information-set structure for the player results in a game with imperfect recall, where the player forgets information (past states and actions) it previously knew. Perfect recall is a fundamental requirement for extensive-form game solvers. However, a recent result has presented sufficient conditions under which the perfect-recall assumption can be relaxed and CFR will still minimize overall regret [7]. These conditions are exactly satisfied in the case of reward uncertainty: the forgotten information (i) does not influence future rewards, (ii) does not influence future transition probabilities, (iii) is never known by the opponent, and (iv) is not remembered later by the player. Therefore, we can construct the extensive-form game with the player restricted to Markovian policies and still solve it with CFR-BR.

¹Markovian policies are also sufficient under a non-stationary uncertainty model, where the transition parameters are resampled independently on repeated visits to states (see the end of Section 2.2).

CFR-BR for k-of-N Games. We can now analyze the use of CFR-BR for computing approximate μ_{k-of-N}-robust policies.

Theorem 2.
For any ε > 0 and p ∈ (0, 1], let

    T = (1 + 2/√p)² · 16 H² Δ² |I₁|² |A| / (p² ε²).

With probability 1 − p, when applying CFR-BR to the k-of-N game, its current strategy at iteration T*, chosen uniformly at random from {1, …, T}, is an ε-approximation to a μ_{k-of-N}-robust policy. The total time complexity is O((HΔ/ε)² |I₁|³ |A|³ N log N / p³), where |I₁| ∈ O(|S|H) for arbitrary reward uncertainty and |I₁| ∈ O(|S|^{H+1}|A|^H) for arbitrary transition and reward uncertainty.

Proof. The proof follows almost directly from Theorem 1 and our connection between k-of-N games and the μ_{k-of-N} measure. The choice of T by Theorem 1 guarantees the policy is an ε/2-Nash approximation, which in turn guarantees the policy is within ε of optimal in the worst case, and so is an ε-approximation to a μ_{k-of-N}-robust policy. Each iteration requires N policy evaluations, each requiring O(|I₁||A|) time; these are then sorted in O(N log N) time; and finally the regret update takes O(|I₁||A|) time. This gives us our overall time bound.

5 Non-Increasing Measures

We have defined a family of percentile measures, μ_{k-of-N}, that represent optimization objectives which differ in how much weight they place on different percentiles and that can be solved efficiently. In this section, we go beyond this family and provide a very broad but still sufficient condition under which a measure can be solved efficiently. We conjecture that a form of this condition is also necessary, but leave that for future work.

Theorem 3. Let μ be an absolutely continuous measure with density function g_μ, such that g_μ is non-increasing and piecewise Lipschitz continuous with m pieces and Lipschitz constant L.
A μ-robust policy can be approximated with high probability in time polynomial in {|A|, |S|, Δ, L, m, 1/ε, 1/p} for (i) arbitrary reward uncertainty, with time also polynomial in the horizon, or (ii) arbitrary transition and reward uncertainty with a fixed horizon.

The proof is in the supplemental material. Note that the previously known measures with efficient solutions (i.e., worst-case, expected-value maximization, and CVaR) satisfy the property that the weight placed on a particular percentile is never smaller than the weight placed on any larger percentile. Our k-of-N measures also have this property. Percentile measures δ_η with η > 0, though, do not: they place infinitely more weight on the η percentile than on any of the percentiles less than η. At the very least, we have captured the condition that separates the currently known-to-be-easy measures from the currently known-to-be-hard ones.

6 Experiments

We now explore our k-of-N approach in a simplified version of a diabetes management task.
Our results aim to demonstrate two things: first, that CFR-BR can find k-of-N policies for MDP problems with general uncertainty in rewards and transitions; and second, that optimizing for different percentile measures creates policies that differ accordingly.

Figure 3: Evaluation of k-of-N percentile measures on the diabetes management task. (Left: percentile measure densities; center: inverse CDF (quantile) plot; right: comparison to the 1-of-1 policy.)

Our simplified diabetes management task distills the daily life of a diabetic patient into a small MDP with |S| = 9 states, |A| = 3 actions, and a time horizon of H = 3. States are a combination of blood glucose level and meal size. Three times daily, corresponding to meal times, the patient injects themselves with a dose of insulin to bring down the rise in blood glucose that comes with consuming carbohydrates at each meal. A good treatment policy keeps blood glucose in the moderate range all day. The uncertain reward function is sampled from an independent multivariate Normal distribution and transition probabilities are sampled from Dirichlet distributions, but both could have been drawn from other distributions.
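To make the sampling model concrete, here is a minimal sketch (our illustration; the Normal moments and per-state Dirichlet parameters below are placeholders, not the paper's values) of drawing one MDP realization, including the MDP-wide factor q described next:

```python
import numpy as np

S, A = 9, 3  # states and actions, as in the task (horizon H = 3)

def sample_mdp(rng, reward_mean, reward_std, dirichlet_base):
    """Draw one (rewards, transitions) realization: rewards from independent
    Normals; each transition row from a Dirichlet whose concentration is a
    fixed per-state base scaled by one MDP-wide factor q ~ Unif[1, 5], so
    the rows are not sampled independently of one another."""
    rewards = rng.normal(reward_mean, reward_std)      # shape (S, A)
    q = rng.uniform(1.0, 5.0)                          # shared sensitivity factor
    transitions = np.stack([
        [rng.dirichlet(q * dirichlet_base[s, a]) for a in range(A)]
        for s in range(S)
    ])                                                 # shape (S, A, S)
    return rewards, transitions

rng = np.random.default_rng(0)
# Placeholder uncertainty parameters (the real ones are in the supplement).
R, T = sample_mdp(rng, np.zeros((S, A)), np.ones((S, A)), np.ones((S, A, S)))
```

Each call yields one opponent-selectable MDP sample; the k-of-N game draws N such samples per iteration.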
The Dirichlet parameter vector is the product of a fixed set of per-state parameters with an MDP-wide multiplicative factor q ∼ Unif[1, 5] to simulate variation in patient sensitivity to insulin, and results in transition uncertainty between states that is not independent. For full details on the problem setup, see the supplemental material.

We used CFR-BR to find optimal policies for the 1-of-1, 1-of-5, and 10-of-50 percentile measures. The densities for these measures are shown in Figure 3(left). We also computed the policy that optimizes V^π_{E(M)}, that is, the optimal policy for the mean MDP. We evaluated the performance of all of these policies empirically on over 10,000 sampled MDPs and show the empirical quantile function (inverse CDF) in Figure 3(center). To highlight the differences between these policies, we show the performance of the policies relative to the 1-of-1-robust policy over the full range of percentiles in Figure 3(right). From the difference plot, we see that the optimal policy for the mean MDP, although optimal for the mean MDP's specific parameters, does not perform well over the uncertainty distribution (as noted in [10]). All of the k-of-N policies are more robust, performing better on the lower percentiles, while 1-of-1 is almost a uniform improvement. We also see that the 1-of-5 and 10-of-50 policies perform quite differently despite having the same k/N ratio. Because the 10-of-50 measure's density has a sharper drop-off at the 20th percentile than the 1-of-5 measure's, the 10-of-50 policy gives up more performance in higher-percentile MDPs for a bit more performance in the lowest 20 percentiles of MDPs.

7 Conclusion

This is the first work we are aware of to do robust policy optimization with general parameter uncertainty.
We describe a broad family of robustness objectives that can be efficiently optimized, and present an algorithm based on techniques for Nash approximation in imperfect information extensive-form games. We believe this approach will be useful for adaptive treatment strategy optimization, where small sample sizes cause real parameter uncertainty and the short time horizons make even transition uncertainty tractable. The next step in this direction is to extend these robustness techniques to large, or continuous, state-action spaces. Abstraction has proven useful for finding good policies in other large extensive-form games [2, 18], and so will likely prove effective here.

8 Acknowledgements

We would like to thank Kevin Waugh, Anna Koop, the Computer Poker Research Group at the University of Alberta, and the anonymous reviewers for their helpful discussions. This research was supported by NSERC, Alberta Innovates Technology Futures, and the use of computing resources provided by WestGrid and Compute/Calcul Canada.

References

[1] Carlo Acerbi. Spectral measures of risk: a coherent representation of subjective risk aversion. Journal of Banking and Finance, 2002.

[2] D. Billings, N. Burch, A. Davidson, R. Holte, J. Schaeffer, T. Schauenberg, and D. Szafron. Approximating game-theoretic optimal strategies for full-scale poker. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), 2003.

[3] Erick Delage. Distributionally Robust Optimization in Context of Data-driven Problems. PhD thesis, Stanford University, 2009.

[4] Erick Delage and Shie Mannor. Percentile optimization in uncertain Markov decision processes with application to efficient exploration. Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.

[5] Samid Hoda, Andrew Gilpin, Javier Peña, and Tuomas Sandholm.
Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research, 35(2):494–512, 2010.

[6] Michael Johanson, Nolan Bard, Neil Burch, and Michael Bowling. Finding optimal abstract strategies in extensive-form games. Proceedings of the 26th Conference on Artificial Intelligence (AAAI), 2012.

[7] Marc Lanctot, Richard Gibson, Neil Burch, Martin Zinkevich, and Michael Bowling. No-regret learning in extensive-form games with imperfect recall. Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[8] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. Advances in Neural Information Processing Systems (NIPS), 2009.

[9] Shie Mannor, Ofir Mebel, and Huan Xu. Lightning does not strike twice: robust MDPs with coupled uncertainty. Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[10] Shie Mannor, Duncan Simester, Peng Sun, and John N. Tsitsiklis. Bias and variance in value function estimation. Management Science, 53(2):308–322, February 2007.

[11] Susan A. Murphy and James R. McKay. Adaptive treatment strategies: an emerging approach for improving treatment effectiveness. Newsletter of the American Psychological Association Division 12, Section III: The Society for the Science of Clinical Psychology, 2003.

[12] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, October 2005.

[13] R. Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking and Finance, 26:1443–1471, 2002.

[14] A. J. Rush, M. Fava, S. R. Wisniewski, P. W. Lavori, M. H. Trivedi, H. A. Sackeim, M. E. Thase, A. A. Nierenberg, F. M. Quitkin, T. M. Kashner, D. J. Kupfer, J. F. Rosenbaum, J.
Alpert, J. W. Stewart, P. J. McGrath, M. M. Biggs, K. Shores-Wilson, B. D. Lebowitz, L. Ritz, and G. Niederehe. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled Clinical Trials, 25(1):119–142, 2004.

[15] Susan A. Shortreed, Eric Laber, Daniel J. Lizotte, T. Scott Stroup, Joelle Pineau, and Susan A. Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine Learning, 84(1-2):109–136, July 2011.

[16] T. Scott Stroup, J. P. McEvoy, M. S. Swartz, M. J. Byerly, I. D. Glick, J. M. Canive, M. McGee, G. M. Simpson, M. D. Stevens, and J. A. Lieberman. The National Institute of Mental Health clinical antipsychotic trials of intervention effectiveness (CATIE) project: schizophrenia trial design and protocol development. Schizophrenia Bulletin, 29(1):15–31, 2003.

[17] M. S. Swartz, D. O. Perkins, T. S. Stroup, J. P. McEvoy, J. M. Nieri, and D. D. Haal. Assessing clinical and functional outcomes in the clinical antipsychotic trials of intervention effectiveness (CATIE) schizophrenia trial. Schizophrenia Bulletin, 29(1):33–43, 2003.

[18] Kevin Waugh, Martin Zinkevich, Michael Johanson, Morgan Kan, David Schnizlein, and Michael Bowling. A practical use of imperfect recall. Proceedings of the Eighth Symposium on Abstraction, Reformulation and Approximation (SARA), 2009.

[19] Huan Xu and Shie Mannor. On robustness/performance tradeoffs in linear programming and Markov decision processes. Operations Research, 2005.

[20] Huan Xu and Shie Mannor. Distributionally robust Markov decision processes. Advances in Neural Information Processing Systems (NIPS), 2010.

[21] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information.
Advances in Neural Information Processing Systems (NIPS), 2008.