{"title": "Representation Balancing MDPs for Off-policy Policy Evaluation", "book": "Advances in Neural Information Processing Systems", "page_first": 2644, "page_last": 2653, "abstract": "We study the problem of off-policy policy evaluation (OPPE) in RL. In contrast to prior work, we consider how to estimate both the individual policy value and average policy value accurately. We draw inspiration from recent work in causal reasoning, and propose a new finite sample generalization error bound for value estimates from MDP models. Using this upper bound as an objective, we develop a learning algorithm of an MDP model with a balanced representation, and show that our approach can yield substantially lower MSE in common synthetic benchmarks and a HIV treatment simulation domain.", "full_text": "Representation Balancing MDPs\nfor Off-Policy Policy Evaluation\n\nYao Liu\n\nStanford University\n\nyaoliu@stanford.edu\n\nAniruddh Raghu\n\nCambridge University\n\naniruddhraghu@gmail.com\n\nOmer Gottesman\nHarvard University\n\ngottesman@fas.harvard.edu\n\nMatthieu Komorowski\nImperial College London\n\nmatthieu.komorowski@gmail.com\n\nAldo Faisal\n\nImperial College London\n\na.faisal@imperial.ac.uk\n\nFinale Doshi-Velez\nHarvard University\n\nfinale@seas.harvard.edu\n\nEmma Brunskill\nStanford University\n\nebrun@cs.stanford.edu\n\nAbstract\n\nWe study the problem of off-policy policy evaluation (OPPE) in RL. In contrast\nto prior work, we consider how to estimate both the individual policy value and\naverage policy value accurately. We draw inspiration from recent work in causal\nreasoning, and propose a new \ufb01nite sample generalization error bound for value\nestimates from MDP models. 
Using this upper bound as an objective, we develop a learning algorithm of an MDP model with a balanced representation, and show that our approach can yield substantially lower MSE in common synthetic benchmarks and a HIV treatment simulation domain.

1 Introduction

In reinforcement learning, off-policy (batch) policy evaluation is the task of estimating the performance of some evaluation policy given data gathered under a different behavior policy. Off-policy policy evaluation (OPPE) is essential when deploying a new policy might be costly or risky, such as in consumer marketing, healthcare, and education. Technically, off-policy evaluation relates to other fields that study counterfactual reasoning, including causal reasoning, statistics, and economics.

Off-policy batch policy evaluation is challenging because the distribution of the data under the behavior policy will in general differ from the distribution under the desired evaluation policy. This difference in distributions comes from two sources. First, at a given state, the behavior policy may select a different action than the one preferred by the evaluation policy: for example, a clinician may choose to amputate a limb, whereas we may be interested in what might have happened if the clinician had not. We never see the counterfactual outcome. Second, the distribution of future states, not just the immediate outcomes, is also determined by the behavior policy. This challenge is unique to sequential decision processes and is not covered by most causal reasoning work: for example, the series of health states observed after amputating a patient's limb is likely to be significantly different than if the limb had not been amputated.

Approaches for OPPE must make a choice about whether and how to address this data distribution mismatch.
Importance sampling (IS) based approaches [16, 23, 8, 5, 10, 22] are typically unbiased and strongly consistent, but despite recent progress tend to have high variance, especially if the evaluation policy is deterministic, as evaluating deterministic policies requires finding in the data sequences where the actions exactly match the evaluation policy. However, in most real-world applications deterministic evaluation policies are more common: policies typically either amputate or not, rather than flip a biased coin to decide whether to amputate. IS approaches also often rely on explicit knowledge of the behavior policy, which may not be feasible in situations such as medicine where the behavior results from human actions. In contrast, some model-based approaches ignore the data distribution mismatch, such as by fitting a maximum-likelihood model of the rewards and dynamics from the behavioral data, and then using that model to evaluate the desired evaluation policy. These methods may not converge to the true estimate of the evaluation policy's value, even in the limit of infinite data [15]. However, such model-based approaches often achieve better empirical performance than the IS-based estimators [10].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this work, we address the question of building model-based estimators for OPPE that both have theoretical guarantees and yield better empirical performance than model-based approaches that ignore the data distribution mismatch. Typically we evaluate the quality of an OPPE estimate $\hat{V}^{\pi_e}(s_0)$, where $s_0$ is an initial state, by evaluating its mean squared error (MSE). Most previous research (e.g. [10, 22]) evaluates methods using the MSE of the average policy value (APV), $[\mathbb{E}_{s_0}\hat{V}^{\pi_e}(s_0) - \mathbb{E}_{s_0}V^{\pi_e}(s_0)]^2$, rather than the MSE of the individual policy values (IPV), $\mathbb{E}_{s_0}[\hat{V}^{\pi_e}(s_0) - V^{\pi_e}(s_0)]^2$. This difference is crucial for applications such as personalized healthcare, since ultimately we may want to assess the performance of a policy for a specific individual (patient) state.

Instead, in this paper we develop an upper bound on the MSE of individual policy value estimates. Note that this bound is automatically an upper bound on the average treatment effect. Our work is inspired by recent advances [19, 11, 12] in estimating conditional average treatment effects (CATE), also known as heterogeneous treatment effects (HTE), in the contextual bandit setting with a single (typically binary) action choice. CATE research aims to obtain precise estimates of the difference in outcomes between giving the treatment vs. the control intervention for an individual (state).

Recent work [11, 19] on CATE¹ has obtained very promising results by learning a model to predict individual outcomes using a (model fitting) loss function that explicitly accounts for the data distribution shift between the treatment and control policies. We build on this work to introduce a new bound on the MSE for individual policy values, and a new loss function for fitting a model-based OPPE estimator. In contrast to most other OPPE theoretical analyses (e.g. [10, 5, 22]), we provide a finite sample generalization error bound instead of asymptotic consistency.
In contrast to previous model value generalization bounds such as the Simulation Lemma [13], our bound accounts for the underlying data distribution shift when the data used to estimate the value of an evaluation policy were collected by following an alternate policy.

We use this bound to derive a loss function that we can use to fit a model for OPPE for deterministic evaluation policies. Conceptually, this process gives us a model that prioritizes fitting the trajectories in the batch data that match the evaluation policy. Our current estimation procedure works for deterministic evaluation policies, which cover a wide range of scenarios in real-world applications that are particularly hard for previous methods. Like recently proposed IS-based estimators [22, 10, 7], and unlike the MLE model-based estimator that ignores the distribution shift [15], we prove that our model-based estimator is asymptotically consistent, as long as the true MDP model is realizable within our chosen model class; we use neural models to give our model class high expressivity.

We demonstrate that our resulting models can yield substantially lower mean squared error estimates than prior model-based and IS-based estimators on a classic benchmark RL task (even when the IS-based estimators are given access to the true behavior policy). We also demonstrate that our approach can yield improved results on a HIV treatment simulator [6].

2 Related Work

Most prior work on OPPE in reinforcement learning falls into one of three approaches. The first, importance sampling (IS), reweights the trajectories to account for the data distribution shift. Under mild assumptions importance sampling estimators are guaranteed to be both unbiased and strongly consistent, and were first introduced to reinforcement learning OPPE by Precup et al. [16]. Despite recent progress (e.g. [23, 8]), IS-only estimators still often yield very high variance estimates, particularly when the decision horizon is large, and/or when the evaluation policy is deterministic. IS estimators also typically result in extremely noisy estimates for policy values of individual states.

¹Shalit et al. [19] use the term individual treatment effect (ITE) to refer to a criterion which is actually defined as CATE in most of the causal inference literature. We discuss the confusion about the two terms in Appendix B.

A second common approach is to estimate a dynamics and reward model, which can substantially reduce variance, but can be biased and inconsistent (as noted by [15]). The third approach, doubly robust estimators, originates from the statistics community [17]. Recently proposed doubly robust estimators for OPPE from the machine and reinforcement learning communities [5, 10, 22] have sometimes yielded orders of magnitude tighter estimates. However, most prior work that leverages an approximate model has largely ignored the choice of how to select and fit the model parameters. Recently, Farajtabar et al. [7] introduced more robust doubly robust (MRDR), which involves fitting a Q function for the model-value part of the doubly robust estimator by fitting a weighted return to minimize the variance of the doubly robust estimate. In contrast, our work learns a dynamics and reward model using a novel loss function, to estimate a model that yields accurate individual policy value estimates. While our method can be combined with doubly robust estimators, we will also see in our experimental results that directly estimating the performance of the model estimator can yield substantial benefits over estimating a Q function for use in doubly robust.

OPPE in contextual bandits and RL also has strong similarities with the treatment effect estimation problem common in causal inference and statistics.
Recently, different kinds of machine learning models such as Gaussian processes [1], random forests [24], and GANs [25] have been used to estimate heterogeneous treatment effects (HTE) in non-sequential settings. Schulam and Saria [18] study using Gaussian process models for treatment effect estimation in continuous-time settings. Their setting differs from MDPs by not having sequential states. Most theoretical analysis of treatment effects focuses on asymptotic consistency rather than generalization error.

Our work is inspired by recent research that learns complicated outcome models (reward models in RL) to estimate HTE using new loss functions that account for covariate shift [11, 19, 2, 12]. In contrast to this prior work, we consider the sequential state-action setting. In particular, Shalit et al. [19] provided an algorithm with a more general model class, and a corresponding generalization bound. We extend this idea from the binary treatment setting to sequential and multiple action settings.

3 Preliminaries: Notation and Setting

We consider undiscounted finite horizon MDPs, with finite horizon $H < \infty$, bounded state space $\mathcal{S} \subset \mathbb{R}^d$, and finite action space $\mathcal{A}$. Let $p_0(s)$ be the initial state distribution, and $T(s'|s,a)$ the transition probability. Given a state-action pair, the expected reward is $\mathbb{E}[r|s,a] = \bar{r}(s,a)$. Given $n$ trajectories collected from a stochastic behavior policy $\mu$, our goal is to evaluate the value of a policy $\pi(s)$, which we assume to be deterministic. We will learn a model of both reward and transition dynamics, $\hat{M} = \langle \hat{r}(s,a), \hat{T}(s'|s,a) \rangle$, based on a learned representation. The representation function $\phi : \mathcal{S} \to \mathcal{Z}$ is a reversible and twice-differentiable function, where $\mathcal{Z}$ is the representation space, and $\psi$ is the inverse representation such that $\psi(\phi(s)) = s$.
The specific form of our MDP model is: $\hat{M} = \langle \hat{r}(s,a), \hat{T}(s'|s,a)\rangle = \langle h_r(\phi(s),a), h_T(\phi(s'),\phi(s),a)\rangle$, where $h_r$ and $h_T$ are functions over the space $\mathcal{Z}$. We will use the notation $\hat{M}$ instead of $\hat{M}_\phi$ later for simplicity.

Let $\tau = (s_0, a_0, \ldots, s_H)$ be a trajectory of $H+1$ states and actions, sampled from the joint distribution of MDP $M$ and a policy $\mu$. The joint distribution of $\tau$ is: $p_{M,\mu}(\tau) = p_0(s_0)\prod_{t=0}^{H-1}[T(s_{t+1}|s_t,a_t)\,\mu(a_t|s_t)]$. Given the joint distribution, we denote the associated marginal and conditional distributions as $p_{M,\mu}(s_0)$, $p_{M,\mu}(s_0,a_0)$, $p_{M,\mu}(s_0|a_0)$, etc. We also have the joint, marginal, and conditional distributions $p^\phi_{M,\mu}(\cdot)$ over the representation space $\mathcal{Z}$. We focus on the undiscounted finite horizon case, using $V^\pi_{M,t}(s)$ to denote the $t$-step value function of policy $\pi$.

4 Generalization Error Bound for MDP based OPPE estimator

Our goal is to learn an MDP model $\hat{M}$ that directly minimizes a good upper bound of the MSE of the individual evaluation policy $\pi$ values: $\mathbb{E}_{s_0}[V^\pi_{\hat{M}}(s_0) - V^\pi_M(s_0)]^2$. This model can provide value function estimates of the policy $\pi$ and be used as part of doubly robust methods.

In the on-policy case, the Simulation Lemma ([13], repeated for completeness in Lemma 1) shows that the MSE of a policy value estimate can be upper bounded by a function of the reward and transition prediction losses. Before we state this result, we first define some useful notation.

Definition 1.
The square error loss functions of the value function, reward, and transition are:

$\bar{\ell}_V(s, \hat{M}, H-t) = \big(V^\pi_{\hat{M},H-t}(s) - V^\pi_{M,H-t}(s)\big)^2$

$\bar{\ell}_r(s_t, a_t, \hat{M}) = (\hat{r}(s_t,a_t) - \bar{r}(s_t,a_t))^2$

$\bar{\ell}_T(s_t, a_t, \hat{M}) = \Big(\int_{\mathcal{S}} \big(\hat{T}(s'|s_t,a_t) - T(s'|s_t,a_t)\big)\, V^\pi_{\hat{M},H-t-1}(s')\, ds'\Big)^2$   (1)

Then the Simulation Lemma ensures that

$\mathbb{E}_{s_0}\big[V^\pi_{\hat{M}}(s_0) - V^\pi_M(s_0)\big]^2 \le 2H \sum_{t=0}^{H-1} \mathbb{E}_{s_t,a_t \sim p_{M,\pi}}\big[\bar{\ell}_r(s_t,a_t,\hat{M}) + \bar{\ell}_T(s_t,a_t,\hat{M})\big],$   (2)

The right hand side can be used to formulate an objective to fit a model for policy evaluation. In the off-policy case our data is from a different policy $\mu$, and one can get an unbiased estimate of the RHS of Equation 2 by importance sampling. However, this will provide an objective function with high variance, especially for a long horizon MDP or a deterministic evaluation policy, due to the product of IS weights. An alternative is to learn an MDP model by directly optimizing the prediction loss over our observational data, ignoring the covariate shift. From the Simulation Lemma this minimizes an upper bound of the MSE of the behavior policy value, but the resulting model may not be a good one for estimating the evaluation policy value. In this paper we propose a new upper bound on the MSE of the individual evaluation policy values inspired by recent work in treatment effect estimation, and use this as a loss function for fitting models.

Before proceeding we first state our assumptions, which are common to most OPPE algorithms:

1. Support of the behavior policy covers the evaluation policy: for any state s and action a, $\mu(a|s) = 0$ only if $\pi(a|s) = 0$.

2. Strong ignorability: there are no hidden confounders that influence the choice of actions other than the current observed state.

Denote a factual sequence to be a trajectory that matches the evaluation policy, $a_0 = \pi(s_0), \ldots$
, $a_{t-1} = \pi(s_{t-1})$, written $a_{0:t-1} = \pi$. Let a counterfactual action sequence $a_{0:t-1} \neq \pi$ be an action sequence with at least one action that does not match $\pi(s)$. $p_{M,\mu}(\cdot)$ is the distribution over trajectories under $M$ and policy $\mu$. We define the $H-t$ step value error with respect to the state distribution given the factual action sequence.

Definition 2. The $H-t$ step value error is: $\epsilon_V(\hat{M}, H-t) = \int_{\mathcal{S}} \bar{\ell}_V(s_t, \hat{M}, H-t)\, p_{M,\mu}(s_t|a_{0:t-1} = \pi)\, ds_t$

We use the idea of bounding the distance between representations given factual and counterfactual action sequences to adjust for the distribution mismatch. Here the distance between representation distributions is formalized by the Integral Probability Metric (IPM).

Definition 3. Let $p, q$ be two distributions and let $G$ be a family of real-valued functions defined over the same space. The integral probability metric is: $\mathrm{IPM}_G(p,q) = \sup_{g \in G} \int g(x)\,(p(x) - q(x))\, dx$

Some important instances of the IPM include the Wasserstein metric, where $G$ is the class of 1-Lipschitz continuous functions, and Maximum Mean Discrepancy, where $G$ is the class of norm-1 functions in an RKHS.

Let $p^{\phi,F}_{M,\mu}(z_t) = p^\phi_{M,\mu}(z_t|a_{0:t} = \pi)$ and $p^{\phi,CF}_{M,\mu}(z_t) = p^\phi_{M,\mu}(z_t|a_t \neq \pi, a_{0:t-1} = \pi)$, where F and CF denote factual and counterfactual. We first give an upper bound on the MSE in terms of an expected loss term, and then develop a finite sample bound which can be used as a learning objective.

Theorem 1. For any MDP $M$, approximate MDP model $\hat{M}$, behavior policy $\mu$, and deterministic evaluation policy $\pi$, let $B_{\phi,t}$ and $G_t$ be a real number and function family that satisfy the condition in Lemma 4. Then:

$\mathbb{E}_{s_0}\big[V^\pi_{\hat{M}}(s_0) - V^\pi_M(s_0)\big]^2 \le 2H \sum_{t=0}^{H-1} \Big[ B_{\phi,t}\, \mathrm{IPM}_{G_t}\big(p^{\phi,F}_{M,\mu}(z_t), p^{\phi,CF}_{M,\mu}(z_t)\big) + \frac{1}{p_{M,\mu}(a_{0:t} = \pi)} \int_{\mathcal{S}} \big(\bar{\ell}_r(s_t, \pi(s_t), \hat{M}) + \bar{\ell}_T(s_t, \pi(s_t), \hat{M})\big)\, p_{M,\mu}(s_t, a_{0:t} = \pi)\, ds_t \Big]$   (3)

(Proof Sketch) The key idea is to use Equation 20 in Lemma 1 to view each step as a contextual bandit problem, and bound $\epsilon_V(\hat{M}, H)$ recursively. We decompose the value function error into a one step reward loss, a transition loss, and a next step value loss, with respect to the on-policy distribution. We can treat this as a contextual bandit problem, and we build on the method in Shalit et al.'s work [19] on binary action bandits to bound the distribution mismatch by a representation distance penalty term; however, additional care is required due to the sequential setting, since the next states are also influenced by the policy. By adjusting the distribution for the next step value loss, we reduce it to $\epsilon_V(\hat{M}, H-t-1)$, allowing us to recursively repeat this process for $H$ steps.

This theorem bounds the MSE of the individual evaluation policy value by a loss on the distribution of the behavior policy, at the cost of an additional representation distribution metric. The first IPM term measures how different the state representations are conditional on factual and counterfactual action history. Intuitively, a balanced representation can generalize better from the observational data distribution to the data distribution under the evaluation policy, but we also need to consider the prediction ability of the representation on the observational data distribution. This bound quantitatively describes these two effects on the MSE through the IPM term and the loss terms.
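The MMD instance of this IPM can be estimated directly from finite samples of the two representation distributions. Below is a minimal sketch; the RBF kernel choice and the synthetic Gaussian "representations" are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), evaluated pairwise.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(z_f, z_cf, sigma=1.0):
    """Biased empirical MMD^2 between factual and counterfactual representations."""
    kff = rbf_kernel(z_f, z_f, sigma).mean()
    kcc = rbf_kernel(z_cf, z_cf, sigma).mean()
    kfc = rbf_kernel(z_f, z_cf, sigma).mean()
    return kff + kcc - 2 * kfc

rng = np.random.default_rng(0)
z_f = rng.normal(0.0, 1.0, size=(200, 2))    # representations given factual actions
z_cf = rng.normal(1.0, 1.0, size=(200, 2))   # representations given counterfactual actions
print(mmd2(z_f, z_cf))   # positive when the two distributions differ
print(mmd2(z_f, z_f))    # exactly zero for identical samples
```

Minimizing such a term over a learned representation pushes the factual and counterfactual representation distributions together, which is what "balancing" means here.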
The re-weighted expected loss terms over the observational data distribution are weighted by the marginal action probability ratio instead of the conditional action probability ratio that is used in importance sampling. The marginal probability ratio has lower variance than the importance sampling weights (see Appendix C.3).

One natural approach might be to use the right hand side of Equation 3 as a loss, and try to directly optimize a representation and model that minimizes this upper bound on the mean squared error of the individual value estimates. Unfortunately, doing so can suffer from two important issues. (1) The subset of the data that matches the evaluation policy can be very sparse for large $t$, and though the above bound re-weights data, fitting a model to it can be challenging due to the limited data size. (2) This approach ignores all the other data present that do not match the evaluation policy. If we are also learning a representation of the domain in order to scale up to very large problems, we suspect that we may benefit from framing the problem as related to transfer or multitask learning.

Motivated by viewing off-policy policy evaluation as a transfer learning task, we can view the source task as evaluating the behavior policy, for which we have on-policy data, and the target task as evaluating the evaluation policy, for which we have the high-variance re-weighted data from importance sampling. This is similar to transfer learning where we only have a few, potentially noisy, data points for the target task. Thus we can take the idea of co-learning a source task and a target task at the same time as a sort of regularization given limited data.
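The variance advantage of the marginal action probability ratio over the conditional (importance sampling) ratio can be seen in a small simulation. This is an illustrative toy model, not the paper's estimator: the horizon, the state-dependent match probabilities (0.95 and 0.2), and the random "states" are all assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
H, n = 5, 200_000
# Hypothetical behavior policy: the probability that it matches the
# deterministic evaluation policy depends on a random binary "state".
p_match = np.where(rng.random((n, H)) < 0.5, 0.95, 0.2)
matched = rng.random((n, H)) < p_match
all_matched = matched.all(axis=1)

# Conditional ratio (IS weight): prod_t 1(a_t = pi(s_t)) / mu(pi(s_t)|s_t).
w_is = np.where(all_matched, (1.0 / p_match).prod(axis=1), 0.0)
# Marginal ratio: indicator divided by the empirical marginal match probability.
w_marginal = all_matched / all_matched.mean()

print(w_is.mean(), w_marginal.mean())   # both close to 1
print(w_is.var(), w_marginal.var())     # the marginal ratio has far lower variance
```

Both weightings are mean-one re-weightings of the matching trajectories; the IS weights additionally vary within the matched set, which is where the extra variance comes from.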
More precisely, we now bound the OPPE error by an upper bound of the sum of two terms:

$\underbrace{\mathbb{E}_{s_0}\big[V^\pi_{\hat{M}}(s_0) - V^\pi_M(s_0)\big]^2}_{\mathrm{MSE}_\pi} + \underbrace{\mathbb{E}_{s_0}\big[V^\mu_{\hat{M}}(s_0) - V^\mu_M(s_0)\big]^2}_{\mathrm{MSE}_\mu},$   (4)

where we bound the former part using Theorem 1. Our upper bound of this objective addresses the issues with separately using $\mathrm{MSE}_\pi$ or $\mathrm{MSE}_\mu$ as the objective: compared with IS estimation of $\mathrm{MSE}_\pi$, the "marginal" action probability ratio has lower variance. The representation distribution distance term regularizes the representation layer such that the learned representation does not vary significantly between the state distribution under the evaluation policy and the state distribution under the behavior policy. That reduces the concern that using $\mathrm{MSE}_\mu$ as an objective will force our model to evaluate the behavior policy, rather than the evaluation policy, more effectively.

Our work is also inspired by treatment effect estimation in the causal inference literature, where we estimate the difference between the treated and control groups. An analogue in RL would be estimating the difference between the target policy value and the behavior policy value, by minimizing the MSE of the policy difference estimate. The objective above is an upper bound of the MSE of the policy difference estimator:

$\tfrac{1}{2}\,\mathbb{E}_{s_0}\Big[\big(V^\pi_{\hat{M}}(s_0) - V^\mu_{\hat{M}}(s_0)\big) - \big(V^\pi_M(s_0) - V^\mu_M(s_0)\big)\Big]^2 \le \mathrm{MSE}_\pi + \mathrm{MSE}_\mu$

We now bound Equation 4 further by finite sample terms. For the finite sample generalization bound, we first introduce a minor variant of the loss functions, with respect to the sample set.

Definition 4. Let $r_t$ and $s'_t$ be an observation of the reward and next state given the state-action pair $s_t, a_t$. Define the loss functions as:

$\ell_r(s_t, a_t, r_t, \hat{M}) = (\hat{r}(s_t, a_t) - r_t)^2$   (5)

$\ell_T(s_t, a_t, s'_t, \hat{M}) = \Big(\int_{\mathcal{S}} \hat{T}(s'|s_t, a_t)\, V^\pi_{\hat{M}, H-t-1}(s')\, ds' - V^\pi_{\hat{M}, H-t-1}(s'_t)\Big)^2$   (6)

Definition 5. Define the empirical risks over the behavior distribution and the weighted distribution as:

$\hat{R}_\mu(\hat{M}) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{H-1}\Big[\ell_r(s^{(i)}_t, a^{(i)}_t, r^{(i)}_t, \hat{M}) + \ell_T(s^{(i)}_t, a^{(i)}_t, s'^{(i)}_t, \hat{M})\Big]$   (7)

$\hat{R}_{\pi,u}(\hat{M}) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=0}^{H-1}\frac{1(a^{(i)}_{0:t} = \pi)}{\hat{u}_{0:t}}\Big[\ell_r(s^{(i)}_t, a^{(i)}_t, r^{(i)}_t, \hat{M}) + \ell_T(s^{(i)}_t, a^{(i)}_t, s'^{(i)}_t, \hat{M})\Big],$   (8)

where $n$ is the dataset size, $s^{(i)}_t$ is the state at the $t$th step of the $i$th trajectory, and $\hat{u}_{0:t} = \frac{1}{n}\sum_{i=1}^{n} 1(a^{(i)}_{0:t} = \pi)$.

Theorem 2. Suppose $\mathcal{M}_\phi$ is a model class of MDP models based on the representation $\phi$. For $n$ trajectories sampled by $\mu$, let $\ell_t(s_t, a_t, \hat{M}) = \ell_r(s_t, a_t, r_t, \hat{M}) + \ell_T(s_t, a_t, s'_t, \hat{M})$, and let $d_t$ be the pseudo-dimension of the function class $\{\ell_t(s_t, a_t, \hat{M}) : \hat{M} \in \mathcal{M}_\phi\}$. Suppose $\mathcal{H}$ is the reproducing kernel Hilbert space induced by a kernel $k$, and $F$ is the unit ball in it. Assume there exists a constant $B_{\phi,t}$ such that $\frac{1}{B_{\phi,t}}\ell_t(\psi(z), \pi(\psi(z)), \hat{M}) \in F$. With probability $1 - 3\delta$, for any $\hat{M} \in \mathcal{M}_\phi$:

$\mathbb{E}_{s_0}\big[V^\pi_{\hat{M}}(s_0) - V^\pi_M(s_0)\big]^2 \le \mathrm{MSE}_\pi + \mathrm{MSE}_\mu \le 2H\hat{R}_\mu(\hat{M}) + 2H\hat{R}_{\pi,u}(\hat{M}) + 2H\sum_{t=0}^{H-1} B_{\phi,t}\Big(\mathrm{IPM}_F\big(\hat{p}^{\phi,F}_{M,\mu}(z_t), \hat{p}^{\phi,CF}_{M,\mu}(z_t)\big) + \min\Big\{D^\delta_F\Big(\frac{1}{\sqrt{m_{t,1}}} + \frac{1}{\sqrt{m_{t,2}}}\Big),\, 2\nu\Big\}\Big) + 2H\sum_{t=0}^{H-1}\frac{C^\delta_{\mathcal{M}_n,t}}{n^{3/8}}\Big(\mathcal{V}\Big[\frac{1(a_{0:t}=\pi)}{\hat{u}_{0:t}}, \ell_t\Big] + \mathcal{V}[1, \ell_t] + \ell_{t,\max}\,\mathcal{V}\Big[\frac{1(a_{0:t}=\pi)}{u_{0:t}}, 1\Big]\Big)$   (9)

$m_{t,1}$ and $m_{t,2}$ are the numbers of samples used to estimate $\hat{p}^{\phi,F}_{M,\mu}(z_t)$ and $\hat{p}^{\phi,CF}_{M,\mu}(z_t)$ respectively. $D^\delta_F$ is a function of the kernel $k$. $C^\delta_{\mathcal{M}_n,t}$ is a function of $d_t$. $\mathcal{V}[w, \ell_t] = \max\big\{\sqrt{\mathbb{E}_{p_{M,\mu}}[w^2\ell_t^2]}, \sqrt{\mathbb{E}_{\hat{p}_{M,\mu}}[w^2\ell_t^2]}\big\}$. $\ell_{t,\max} = \max_{s_t,a_t}|\ell_t(s_t,a_t)|$.

The first term is the empirical loss over the observational data distribution. The second term is a re-weighted empirical loss, an empirical version of the first term in Theorem 1. As noted previously, this re-weighting has less variance than importance sampling in practice, especially when the sample size is limited. Theorem 3 in Appendix C.3 shows that the variance of this ratio is also no greater than the variance of the IS weights. Our bound is based on the empirical estimate of the marginal probability $u_{0:t}$, and we are not required to know the behavior policy. Our method's independence of the behavior policy is a significant advantage over IS methods, which are very susceptible to errors in its estimation, as we discuss in Appendix A. In practice, this marginal probability $u_{0:t}$ is easier to estimate than $\mu$ when $\mu$ is unknown. The third term is an empirical estimate of the IPM, which we described in Theorem 1. We use norm-1 RKHS functions and the MMD distance in this theorem and in our algorithm; there are similar but worse results for the Wasserstein distance and total variation distance [20]. $D^\delta_F$ measures how complex $F$ is; it is obtained from concentration results for empirical IPM estimators [20]. The constant $C^\delta_{\mathcal{M}_n,t}$ measures how complex the model class is and is derived from traditional learning theory results [4].

We compare our bound with the upper bound of model error for OPPE in [9]. In the corrected version of Corollary 2 in [9], the upper bound of the absolute error has a linear dependency on $\sqrt{\bar{\rho}_{1:H}}$, where $\bar{\rho}_{1:H}$ is an upper bound of the importance ratio, which is usually a dominant term in long horizon cases.
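The empirical risks of Definition 5 and the IPM penalty combine into the trainable objective developed in Section 5. A minimal single-time-step sketch in numpy follows; the per-sample losses, match indicators, representations, and the linear-kernel MMD are all illustrative assumptions, and the actual algorithm backpropagates this loss through neural networks rather than treating the losses as fixed inputs:

```python
import numpy as np

def mmd2_lin(z_a, z_b):
    # Linear-kernel MMD^2 between two representation samples (illustrative choice).
    d = z_a.mean(axis=0) - z_b.mean(axis=0)
    return float(d @ d)

def repbm_objective(loss, match, z, alpha=0.01):
    """Sketch of the per-step objective structure for a single step t.

    loss:  per-sample reward + transition loss l_t   (shape [n])
    match: 1 if the sample's action history matches pi (shape [n], bool)
    z:     learned representations phi(s)            (shape [n, d])
    """
    risk_mu = loss.mean()                          # empirical risk over behavior data
    u_hat = match.mean()                           # empirical marginal u_{0:t}
    risk_pi = (match * loss).mean() / u_hat        # re-weighted empirical risk
    ipm = mmd2_lin(z[match], z[~match])            # factual vs counterfactual reps
    return risk_mu + risk_pi + alpha * ipm

rng = np.random.default_rng(2)
n = 256
loss = rng.random(n)                 # hypothetical per-sample losses
match = rng.random(n) < 0.4          # hypothetical match indicators
z = rng.normal(size=(n, 3))          # hypothetical representations
print(repbm_objective(loss, match, z))
```

The sketch omits the sum over time steps, the per-step weights, and the model complexity regularizer that appear in the full objective.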
As noted above, the re-weighting weights in our bound, which are marginal action probability ratios, enjoy lower variance than IS weights (see Appendix C.3).

5 Algorithm for Representation Balancing MDPs

Based on our generalization bound above, we propose an algorithm that learns an MDP model for OPPE by minimizing the following objective function:

$L(\hat{M}_\phi; \alpha_t) = \hat{R}_\mu(\hat{M}_\phi) + \hat{R}_{\pi,u}(\hat{M}_\phi) + \sum_{t=0}^{H-1} \alpha_t\, \mathrm{IPM}_F\big(\hat{p}^{\phi,F}_{M,\mu}(z_t), \hat{p}^{\phi,CF}_{M,\mu}(z_t)\big) + \frac{\mathcal{R}(\hat{M}_\phi)}{n^{3/8}}$   (10)

This objective is based on Equation 9 in Theorem 2: we minimize the terms in that upper bound that are related to the model $\hat{M}_\phi$. Note that since $B_{\phi,t}$ depends on the loss function, we cannot know $B_{\phi,t}$ in practice; we therefore use a tunable factor $\alpha$ in our algorithm. $\mathcal{R}(\hat{M}_\phi)$ is a bounded regularization term of the model that one can choose, corresponding to the model class complexity term in Equation 9. This objective function matches our intuition about using lower-variance weights for the re-weighting component and using the IPM of the representation to avoid fitting only the behavior data distribution.

In this work, $\phi(s)$ and $\hat{M}_\phi$ are parameterized by neural networks, due to their strong ability to learn representations. We use an estimator of the IPM term from Sriperumbudur et al. [21]. All terms in the objective function are differentiable, allowing us to train them jointly by minimizing the objective with a gradient based optimization algorithm.

After we learn an MDP by minimizing the objective above, we use Monte-Carlo estimates or value iteration to get the value for any initial state $s_0$ as an estimator of the policy value for that state. We show that if there exists an MDP and representation model in our model class that achieves

$\min_{\hat{M}_\phi}\Big(R_\mu(\hat{M}_\phi) + R_{\pi,u}(\hat{M}_\phi) + \sum_{t=0}^{H-1}\alpha_t\, \mathrm{IPM}_F\big(p^{\phi,F}_{M,\mu}(z_t), p^{\phi,CF}_{M,\mu}(z_t)\big)\Big) = 0,$

then $\mathbb{E}_{s_0}[V^\pi_{\hat{M}^{**}}(s_0) - V^\pi_M(s_0)]^2 \to 0$ as $n \to \infty$, and the estimator $V^\pi_{\hat{M}^{**}}(s_0)$ is a consistent estimator for any $s_0$. See Corollary 2 in the Appendix for details.

We can use our model in any OPPE estimator that leverages model-based estimators, such as doubly robust [10] and MAGIC [22], though our generalization MSE bound applies only to the model value.

6 Experiments

6.1 Synthetic control domains: Cart Pole and Mountain Car

We test our algorithm on two continuous-state benchmark domains. We use a greedy policy from a learned Q function as the evaluation policy, and an $\epsilon$-greedy policy with $\epsilon = 0.2$ as the behavior policy. We collect 1024 trajectories for OPPE. In the Cart Pole domain the average length of trajectories is around 190 (long horizon variant), or around 23 (short horizon variant). In Mountain Car the average length of trajectories is around 150. The long horizon setting ($H > 100$) is challenging for IS-based OPPE estimators due to the deterministic evaluation policy and long horizon, which give the IS weights high variance. Deterministic dynamics and long horizons are common in real-world domains, and most off-policy policy evaluation algorithms struggle in such scenarios.

We compare our method, RepBM, with two baseline approximate models (AM and AM($\pi$)), doubly robust (DR), more robust doubly robust (MRDR), and importance sampling (IS). The baseline approximate model (AM) is an MDP model-based estimator trained by minimizing the empirical risk, using the same model class as RepBM. AM($\pi$) is an MDP model trained with the same objective as our method but without the $\mathrm{MSE}_\mu$ term. DR is a doubly robust estimator using our model, and DR(AM) is a doubly robust estimator using the baseline model. MRDR [7] is a recent method that trains a Q function as the model-based part in DR to minimize the resulting variance.
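The experiments below report both a mean (average policy value) and an individual (per-initial-state) error, and the distinction matters: errors of opposite sign can cancel in the mean metric but not in the individual one. A minimal sketch with hypothetical value estimates:

```python
import numpy as np

# Hypothetical estimated and true values for a set of initial states.
v_hat = np.array([1.0, 2.0, 3.0])
v_true = np.array([1.5, 1.5, 3.5])

mse_mean = (v_hat.mean() - v_true.mean()) ** 2      # average policy value (APV) MSE
mse_individual = ((v_hat - v_true) ** 2).mean()     # individual policy value (IPV) MSE

print(np.sqrt(mse_mean), np.sqrt(mse_individual))   # root MSEs; IPV >= APV
```

By Jensen's inequality the individual MSE always upper bounds the mean MSE, so a small APV error alone does not imply accurate per-state estimates.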
We include their Q function estimator (MRDR Q) and the doubly robust estimator that combines this Q function with IS (MRDR).

The reported results are the square root of the average MSE over 100 runs. $\alpha$ is set to 0.01 for RepBM. We report mean and individual MSEs, corresponding to the MSEs of the average policy value and of individual policy values, $[\mathbb{E}_{s_0}\hat{V}(s_0) - \mathbb{E}_{s_0}V(s_0)]^2$ and $\mathbb{E}_{s_0}[\hat{V}(s_0) - V(s_0)]^2$ respectively. IS and DR methods re-weight samples, so their estimates for single initial states are not applicable, especially in continuous state space. A comparison across more methods is included in the appendix.

Table 1: Root MSE for Cart Pole

Long Horizon | RepBM | DR | AM | DR(AM) | AM(π) | MRDR Q | MRDR | IS
Mean | 0.4121 | 1.359 | 0.7535 | 1.786 | 41.80 | 151.1 | 202 | 194.5
Individual | 1.033 | - | 1.313 | - | 47.63 | 151.9 | - | -

Short Horizon | RepBM | DR | AM | DR(AM) | AM(π) | MRDR Q | MRDR | IS
Mean | 0.07836 | 0.02081 | 0.1254 | 0.0235 | 0.1233 | 3.013 | 0.258 | 2.86
Individual | 0.4811 | - | 0.5506 | - | 0.5974 | 3.823 | - | -

Table 2: Root MSE for Mountain Car

 | RepBM | DR | AM | DR(AM) | AM(π) | MRDR Q | MRDR | IS
Mean | 12.31 | 135.8 | 17.15 | 172.7 | 135.4 | 72.61 | 141.6 | 149.7
Individual | 31.38 | - | 36.36 | - | 138.1 | 79.46 | - | -

Representation Balancing MDPs outperform baselines for long time horizons. We observe that MRDR variants and IS methods have high MSE in the long horizon setting. The reason is that the IS weights for 200-step trajectories are extremely high-variance, and MRDR, whose objective depends on the square of the IS weights, also fails. Compared with the baseline model, we can see that our method is better than AM both in the pure model case and when used in doubly robust. We also observe that the IS part in doubly robust actually hurts the estimates, for both RepBM and AM.

Representation Balancing MDPs outperform baselines in deterministic settings.
To observe the benefit of our method beyond long horizon cases, we also include results on Cart Pole with a shorter horizon, obtained by using weaker evaluation and behavior policies. The average length of trajectories is about 23 in this setting. Here, we observe that RepBM is still better than the other model-based estimators, and the doubly robust estimator that uses RepBM is still better than the other doubly robust methods. Though MRDR produces substantially lower MSE than IS, matching the results reported in Farajtabar et al. [7], it still has higher MSE than RepBM and AM, due to the high variance of its learning objective when the evaluation policy is deterministic.
Representation Balancing MDPs produce accurate estimates even when the behavior policy is unknown. For both horizon settings, we observe that RepBM learned with no knowledge of the behavior policy is better than methods such as MRDR and IS that use the true behavior policy.

6.2 HIV simulator

We demonstrate our method on an HIV treatment simulation domain. The simulator is described in Ernst et al. [6], and consists of 6 parameters describing the state of the patient and 4 possible actions. The HIV simulator has richer dynamics than the two simple control domains above. We learn an evaluation policy by fitted Q iteration and use the ε-greedy policy of the optimal Q function as the behavior policy.
We collect 50 trajectories from the behavior policy and learn our model along with the baseline approximate model (AM). We compare the root average MSE of our model with the baseline approximate MDP model, importance sampling (IS), per-step importance sampling (PSIS) and weighted per-step importance sampling (WPSIS). The root average MSEs reported are averaged over 80 runs. 
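As a concrete illustration of the two error metrics used throughout this section, the mean and individual root MSEs can be computed from per-initial-state value estimates as follows. This is a minimal sketch, not the authors' code; the function and array names are our own, and `v_hat`/`v_true` stand for Monte Carlo estimates of V̂(s0) and V(s0) over a set of sampled initial states:

```python
import numpy as np

def root_mse(v_hat, v_true):
    """Root MSE of the average and individual policy value estimates.

    v_hat:  estimated values V̂(s0) for a sample of initial states
    v_true: ground-truth values V(s0) for the same initial states
    """
    v_hat = np.asarray(v_hat, dtype=float)
    v_true = np.asarray(v_true, dtype=float)
    # Mean metric: error of the *average* policy value,
    # sqrt([E_{s0} V̂(s0) - E_{s0} V(s0)]^2) = |E V̂ - E V|
    mean_rmse = abs(v_hat.mean() - v_true.mean())
    # Individual metric: per-initial-state error,
    # sqrt(E_{s0}[(V̂(s0) - V(s0))^2])
    individual_rmse = np.sqrt(np.mean((v_hat - v_true) ** 2))
    return mean_rmse, individual_rmse
```

By Jensen's inequality the individual root MSE always upper-bounds the mean root MSE, consistent with the Individual rows never falling below the Mean rows in the tables above.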
We observe that RepBM has the lowest root MSE in estimating the value of the evaluation policy.

Table 3: Relative Root MSE for HIV

       RepBM   AM      IS     PSIS    WPSIS
Mean   0.062   0.067   0.95   0.273   0.146

7 Discussion and Conclusion

One interesting question for our method is the effect of the hyper-parameter α on the quality of the estimator. In the appendix, we include the results of RepBM across different values of α. We find that our method outperforms prior work for a large range of α values, in both domains. In both domains we observe that the effect of the IPM adjustment (non-zero α) is smaller than the effect of "marginal" IS re-weighting, which matches the results in Shalit et al.'s work in the binary-action bandit case [19].
To conclude, in this work we give an MDP model learning method for the individual OPPE problem in RL, based on a new finite-sample generalization bound on the MSE of the model value estimator. We show that our method yields substantially smaller MSE than state-of-the-art baselines in common benchmark control tasks and on a more challenging HIV simulator.

Acknowledgments
This work was supported in part by the Harvard Data Science Initiative, Siemens, and an NSF CAREER grant.

References
[1] A. M. Alaa and M. van der Schaar. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Advances in Neural Information Processing Systems, pages 3424–3432, 2017.

[2] O. Atan, W. R. Zame, and M. van der Schaar. Learning optimal policies from observational data. arXiv preprint arXiv:1802.08679, 2018.

[3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[4] C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pages 442–450, 2010.

[5] M. 
Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104. Omnipress, 2011.

[6] D. Ernst, G.-B. Stan, J. Goncalves, and L. Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 667–672. IEEE, 2006.

[7] M. Farajtabar, Y. Chow, and M. Ghavamzadeh. More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, pages 1447–1456, 2018.

[8] Z. Guo, P. S. Thomas, and E. Brunskill. Using options and covariance testing for long horizon off-policy policy evaluation. In Advances in Neural Information Processing Systems, pages 2492–2501, 2017.

[9] J. P. Hanna, P. Stone, and S. Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 538–546. International Foundation for Autonomous Agents and Multiagent Systems, 2017.

[10] N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 652–661. JMLR.org, 2016.

[11] F. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, pages 3020–3029, 2016.

[12] F. D. Johansson, N. Kallus, U. Shalit, and D. Sontag. Learning weighted representations for generalization across designs. arXiv preprint arXiv:1802.08598, 2018.

[13] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.

[14] S. Künzel, J. Sekhon, P. Bickel, and B. Yu. 
Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461, 2017.

[15] T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill, and Z. Popovic. Offline policy evaluation across representations with applications to educational games. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pages 1077–1084. International Foundation for Autonomous Agents and Multiagent Systems, 2014.

[16] D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation. In ICML, pages 759–766. Citeseer, 2000.

[17] J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

[18] P. Schulam and S. Saria. Reliable decision support using counterfactual models. In Advances in Neural Information Processing Systems, pages 1697–1708, 2017.

[19] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, pages 3076–3085, 2017.

[20] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698, 2009.

[21] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, G. R. Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.

[22] P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.

[23] P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High-confidence off-policy evaluation. 
In AAAI, 2015.

[24] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, just-accepted, 2017.

[25] J. Yoon, J. Jordon, and M. van der Schaar. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In ICLR, 2018.