{"title": "Reliable Decision Support using Counterfactual Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1697, "page_last": 1708, "abstract": "Decision-makers are faced with the challenge of estimating what is likely to happen when they take an action. For instance, if I choose not to treat this patient, are they likely to die? Practitioners commonly use supervised learning algorithms to fit predictive models that help decision-makers reason about likely future outcomes, but we show that this approach is unreliable, and sometimes even dangerous. The key issue is that supervised learning algorithms are highly sensitive to the policy used to choose actions in the training data, which causes the model to capture relationships that do not generalize. We propose using a different learning objective that predicts counterfactuals instead of predicting outcomes under an existing action policy as in supervised learning. To support decision-making in temporal settings, we introduce the Counterfactual Gaussian Process (CGP) to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions. We demonstrate the benefits of the CGP on two important decision-support tasks: risk prediction and \u201cwhat if?\u201d reasoning for individualized treatment planning.", "full_text": "Reliable Decision Support using\n\nCounterfactual Models\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nPeter Schulam\n\nJohns Hopkins University\n\nBaltimore, MD 21211\npschulam@cs.jhu.edu\n\nSuchi Saria\n\nJohns Hopkins University\n\nBaltimore, MD 21211\nssaria@cs.jhu.edu\n\nAbstract\n\nDecision-makers are faced with the challenge of estimating what is likely to happen\nwhen they take an action. For instance, if I choose not to treat this patient, are they\nlikely to die? 
Practitioners commonly use supervised learning algorithms to \ufb01t\npredictive models that help decision-makers reason about likely future outcomes,\nbut we show that this approach is unreliable, and sometimes even dangerous. The\nkey issue is that supervised learning algorithms are highly sensitive to the policy\nused to choose actions in the training data, which causes the model to capture\nrelationships that do not generalize. We propose using a different learning objective\nthat predicts counterfactuals instead of predicting outcomes under an existing\naction policy as in supervised learning. To support decision-making in temporal\nsettings, we introduce the Counterfactual Gaussian Process (CGP) to predict the\ncounterfactual future progression of continuous-time trajectories under sequences\nof future actions. We demonstrate the bene\ufb01ts of the CGP on two important\ndecision-support tasks: risk prediction and \u201cwhat if?\u201d reasoning for individualized\ntreatment planning.\n\n1\n\nIntroduction\n\nDecision-makers are faced with the challenge of estimating what is likely to happen when they take\nan action. One use of such an estimate is to evaluate risk; e.g. is this patient likely to die if I do not\nintervene? Another use is to perform \u201cwhat if?\u201d reasoning by comparing outcomes under alternative\nactions; e.g. would changing the color or text of an ad lead to more click-throughs? Practitioners\ncommonly use supervised learning algorithms to help decision-makers answer such questions, but\nthese decision-support tools are unreliable, and can even be dangerous.\nConsider, for instance, the \ufb01nding discussed by Caruana et al. [2015] regarding risk of death among\nthose who develop pneumonia. Their goal was to build a model that predicts risk of death for a\nhospitalized individual with pneumonia so that those at high-risk could be treated and those at low-risk\ncould be safely sent home. 
Their model counterintuitively learned that asthmatics are less likely to\ndie from pneumonia. They traced the result back to an existing policy that asthmatics with pneumonia\nshould be directly admitted to the intensive care unit (ICU), therefore receiving more aggressive\ntreatment. Had this model been deployed to assess risk, then asthmatics might have received less\ncare, putting them at greater risk. Caruana et al. [2015] show how these counterintuitive relationships\ncan be problematic and ought to be addressed by \u201crepairing\u201d the model. We note, however, that these\nissues stem from a deeper limitation: when training data is affected by actions, supervised learning\nalgorithms capture relationships caused by action policies, and these relationships do not generalize\nwhen the policy changes.\nTo build reliable models for decision support, we propose using learning objectives that predict\ncounterfactuals, which are collections of random variables {Y [a] : a \u2208 C} used in the potential\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Best viewed in color. An illustration of the counterfactual GP applied to health care. The red box in\n(a) shows previous lung capacity measurements (black dots) and treatments (the history). Panels (a)-(c) show the\ntype of predictions we would like to make. We use Y [a] to represent the potential outcome under action a.\noutcomes framework [Neyman, 1923, 1990, Rubin, 1978]. Counterfactuals model the outcome Y\nafter an action a is taken from a set of choices C. Counterfactual predictions are broadly applicable to\na number of decision-support tasks. In medicine, for instance, when evaluating a patient\u2019s risk of\ndeath Y to determine whether they should be treated aggressively, we want an estimate of how they\nwill fare without treatment. This can be done by predicting the counterfactual Y [\u2205], where \u2205 stands\nfor \u201cdo nothing\u201d. 
In online marketing, to decide whether we should display ad a1 or a2, we may want an estimate of click-through Y under each, which amounts to predicting Y[a1] and Y[a2].
To support decision-making in temporal settings, we develop the Counterfactual Gaussian Process (CGP) to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions. The CGP can be learned from and applied to time series data where actions are taken and outcomes are measured at irregular time points; a generalization of discrete time series. Figure 1 illustrates an application of the CGP. We show an individual with a lung disease, and would like to predict her future lung capacity (y-axis). Panel (a) shows the history in the red box, which includes previous lung capacity measurements (black dots) and previous treatments (green and blue bars). The blue counterfactual trajectory shows what might occur under no action, which can be used to evaluate this individual's risk. In panel (b), we show the counterfactual trajectory under a single future green treatment. Panel (c) illustrates "what if?" reasoning by overlaying counterfactual trajectories under two different action sequences; in this case it seems that two future doses of the blue drug may lead to a better outcome than a single dose of green.
Contributions. Our key methodological contribution is the Counterfactual Gaussian Process (CGP), a model that predicts how a continuous-time trajectory will progress under sequences of actions. We derive an adjusted maximum likelihood objective that learns the CGP from observational traces; irregularly sampled sequences of actions and outcomes denoted using D = {{(y_ij, a_ij, t_ij)}_{j=1}^{n_i}}_{i=1}^{m}, where y_ij ∈ R ∪ {∅}, a_ij ∈ C ∪ {∅}, and t_ij ∈ [0, τ].¹ Our objective accounts for and removes the effects of the policy used to choose actions in the observational traces.
We derive the objective by jointly modeling observed actions and outcomes using a marked point process (MPP; see e.g., Daley and Vere-Jones 2007), and show how it correctly learns the CGP under a set of assumptions analogous to those required to learn counterfactual models in other settings.
We demonstrate the CGP on two decision-support tasks. First, we show how the CGP can make reliable risk predictions that do not depend on the action policy in the training data. On the other hand, we show that predictions made by models trained using classical supervised learning objectives are sensitive to the policies. In our second experiment, we use data from a real intensive care unit (ICU) to learn the CGP, and qualitatively demonstrate how the CGP can be used to compare counterfactuals and answer "what if?" questions, which could offer medical decision-makers a powerful new tool for individualized treatment planning.

1.1 Related Work

Decision support is a rich field; because our main methodological contribution is a counterfactual model for time series data, we limit the scope of our discussion of related work to this area.
Causal inference. Counterfactual models stem from causal inference. In that literature, the difference between the counterfactual outcomes if an action had been taken and if it had not been taken
¹ y_ij and a_ij may be the null variable ∅ to allow for the possibility that an action is taken but no outcome is observed and vice versa.
[0, τ] denotes a fixed period of time over which the trajectories are observed.
is defined as the causal effect of the action (see e.g., Pearl 2009 or Morgan and Winship 2014). Potential outcomes are commonly used to formalize counterfactuals and obtain causal effect estimates [Neyman, 1923, 1990, Rubin, 1978]. Potential outcomes are often applied to cross-sectional data; see, for instance, the examples in Morgan and Winship 2014. Recent examples from the machine learning literature are Bottou et al. [2013] and Johansson et al. [2016].
Potential outcomes in discrete time. Potential outcomes have also been used to estimate the causal effect of a sequence of actions in discrete time on a final outcome (e.g. Robins 1986, Robins and Hernán 2009, Taubman et al. 2009). The key challenge in the sequential setting is to account for feedback between intermediate outcomes that determine future treatment. Conversely, Brodersen et al. [2015] estimate the effect that a single discrete intervention has on a discrete time series. Recent work on optimal dynamic treatment regimes uses the sequential potential outcomes framework proposed by Robins [1986] to learn lists of discrete-time treatment rules that optimize a scalar outcome. Algorithms for learning these rules often use action-value functions (Q-learning; e.g., Nahum-Shani et al. 2012).
Alternatively, A-learning is a semiparametric approach that directly learns the relative\ndifference in value between alternative actions [Murphy, 2003].\nPotential outcomes in continuous time. Others have extended the potential outcomes framework\nin Robins [1986] to learn causal effects of actions taken in continuous-time on a single \ufb01nal outcome\nusing observational data. Lok [2008] proposes an estimator based on structural nested models\n[Robins, 1992] that learns the instantaneous effect of administering a single type of treatment. Arjas\nand Parner [2004] develop an alternative framework for causal inference using Bayesian posterior\npredictive distributions to estimate the effects of actions in continuous time on a \ufb01nal outcome. Both\nLok [2008] and Arjas and Parner [2004] use marked point processes to formalize assumptions that\nmake it possible to learn causal effects from continuous-time observational data. We build on these\nideas to learn causal effects of actions on continuous-time trajectories instead of a single outcome.\nThere has also been recent work on building expressive models of treatment effects in continuous\ntime. Xu et al. [2016] propose a Bayesian nonparametric approach to estimating individual-speci\ufb01c\ntreatment effects of discrete but irregularly spaced actions, and Soleimani et al. [2017] model the\neffects of continuous-time, continuous-valued actions. Causal effects in continuous-time have also\nbeen studied using differential equations. Mooij et al. [2013] formalize an analog of Pearl\u2019s \u201cdo\u201d\noperation for deterministic ordinary differential equations. Sokol and Hansen [2014] make similar\ncontributions for stochastic differential equations by studying limits of discrete-time non-parametric\nstructural equation models [Pearl, 2009]. Cunningham et al. 
[2012] introduce the Causal Gaussian\nProcess, but their use of the term \u201ccausal\u201d is different from ours, and refers to a constraint that holds\nfor sample paths of the GP.\nReinforcement learning. Reinforcement learning (RL) algorithms learn from data where actions\nand observations are interleaved in discrete time (see e.g., Sutton and Barto 1998). In RL, however, the\nfocus is on learning a policy (a map from states to actions) that optimizes the expected reward, rather\nthan a model that predicts the effects of the agent\u2019s actions on future observations. In model-based RL,\na model of an action\u2019s effect on the subsequent state is produced as a by-product either of\ufb02ine before\noptimizing the policy (e.g., Ng et al. 2006) or incrementally as the agent interacts with its environment.\nIn most RL problems, however, learning algorithms rely on active experimentation to collect samples.\nThis is not always possible; for example, in healthcare we cannot actively experiment on patients, and\nso we must rely on retrospective observational data. In RL, a related problem known as off-policy\nevaluation also uses retrospective observational data (see e.g., Dud\u00edk et al. 2011, Swaminathan and\nJoachims 2015, Jiang and Li 2016, P\u02d8aduraru et al. 2012, Doroudi et al. 2017). The goal is to use\nstate-action-reward sequences generated by an agent operating under an unknown policy to estimate\nthe expected reward of a target policy. Off-policy algorithms typically use action-value function\napproximation, importance reweighting, or doubly robust combinations of the two to estimate the\nexpected reward.\n2 Counterfactual Models from Observational Traces\nCounterfactual GPs build on ideas from potential outcomes [Neyman, 1923, 1990, Rubin, 1978],\nGaussian processes [Rasmussen and Williams, 2006], and marked point processes [Daley and Vere-\nJones, 2007]. 
In the interest of space, we review potential outcomes and marked point processes, but refer the reader to Rasmussen and Williams [2006] for background on GPs.
Background: Potential Outcomes. To formalize counterfactuals, we adopt the potential outcomes framework [Neyman, 1923, 1990, Rubin, 1978], which uses a collection of random variables {Y[a] : a ∈ C} to model the outcome after each action a from a set of choices C. To make counterfactual predictions, we must learn the distribution P(Y[a] | X) for each action a ∈ C given features X. If we can freely experiment by repeatedly taking actions and recording the effects, then it is straightforward to fit a predictive model. Conducting experiments, however, may not be possible. Alternatively, we can use observational data, where we have example actions A, outcomes Y, and features X, but do not know how actions were chosen. Note the difference between the action a and the random variable A that models the observed actions in our data; the notation Y[a] serves to distinguish between the observed distribution P(Y | A, X) and the target distribution P(Y[a] | X).
In general, we can only use observational data to estimate P(Y | A, X). Under two assumptions, however, we can show that this conditional distribution is equivalent to the counterfactual model P(Y[a] | X). The first is known as the Consistency Assumption.
Assumption 1 (Consistency). Let Y be the observed outcome, A ∈ C be the observed action, and Y[a] be the potential outcome for action a ∈ C, then: (Y ≜ Y[a]) | A = a.
Under consistency, we have that P(Y | A = a) = P(Y[a] | A = a). Now, the potential outcome Y[a] may depend on the action A, so in general P(Y[a] | A = a) ≠ P(Y[a]).
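The gap between the observational distribution P(Y | A = a) and the counterfactual P(Y[a]) is easy to reproduce numerically. The sketch below is a toy simulation with made-up numbers, not the paper's model: a single binary confounder X drives both the action and the outcome, so the naive observational contrast is biased, while averaging the X-specific contrasts over P(X) recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data-generating process (illustrative only):
# X confounds both the action A and the outcome Y.
x = rng.binomial(1, 0.5, size=n)                  # confounder (e.g., asthma)
a = rng.binomial(1, np.where(x == 1, 0.9, 0.1))   # policy: treat mostly when x = 1
y0 = 1.0 + 2.0 * x + rng.normal(0, 0.1, n)        # potential outcome Y[0]
y1 = y0 + 1.0                                     # potential outcome Y[1]; true effect = 1
y = np.where(a == 1, y1, y0)                      # consistency: Y = Y[a] when A = a

# Naive observational contrast is biased by the action policy.
naive = y[a == 1].mean() - y[a == 0].mean()

# Adjusting for X recovers E[Y[1]] - E[Y[0]].
adjusted = sum(
    (y[(a == 1) & (x == v)].mean() - y[(a == 0) & (x == v)].mean()) * (x == v).mean()
    for v in (0, 1)
)

print(round(naive, 2))     # far from 1 (confounding bias)
print(round(adjusted, 2))  # close to the true effect, 1.0
```

The adjustment succeeds here only because x is observed, which is exactly what the no-unmeasured-confounders assumption below requires.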
The next assumption posits that the features X include all possible confounders [Morgan and Winship, 2014], which are sufficient to d-separate Y[a] and A.
Assumption 2 (No Unmeasured Confounders (NUC)). Let Y be the observed outcome, A ∈ C be the observed action, X be a vector containing all potential confounders, and Y[a] be the potential outcome under action a ∈ C, then: (Y[a] ⊥ A) | X.
Under Assumptions 1 and 2, P(Y | A, X) = P(Y[a] | X). An extension of Assumption 2 introduced by Robins [1997] known as sequential NUC allows us to estimate the effect of a sequence of actions in discrete time on a single outcome. In continuous-time settings, where both the type and timing of actions may be statistically dependent on the potential outcomes, Assumption 2 (and sequential NUC) cannot be applied as-is. We will describe an alternative that serves a similar role for CGPs.
Background: Marked Point Processes. Point processes are distributions over sequences of timestamps {T_i}_{i=1}^N, which we call points, and a marked point process (MPP) is a point process where each point is annotated with an additional random variable X_i, called its mark. For example, a point T might represent the arrival time of a customer, and X the amount that she spent at the store. We emphasize that both the annotated points (T_i, X_i) and the number of points N are random variables. A point process can be characterized as a counting process {N_t : t ≥ 0} that counts the number of points that occurred up to and including time t: N_t = Σ_{i=1}^{N} I(T_i ≤ t). By definition, this process can only take integer values, and N_t ≥ N_s if t ≥ s. In addition, it is commonly assumed that N_0 = 0 and that ΔN_t = lim_{δ→0+} N_t − N_{t−δ} ∈ {0, 1}. We can parameterize a point process using a probabilistic model of ΔN_t given the history of the process H_{t−} up to but not including time t (we use t− to denote the left limit of t). Using the Doob-Meyer decomposition [Daley and Vere-Jones, 2007], we can write ΔN_t = ΔM_t + ΔΛ_t, where M_t is a martingale, Λ_t is a cumulative intensity function, and
P(ΔN_t = 1 | H_{t−}) = E[ΔN_t | H_{t−}] = E[ΔM_t | H_{t−}] + ΔΛ_t(H_{t−}) = 0 + ΔΛ_t(H_{t−}),
which shows that we can parameterize the point process using the conditional intensity function λ*(t) dt ≜ ΔΛ_t(H_{t−}). The star superscript on the intensity function serves as a reminder that it depends on the history H_{t−}. For example, in non-homogeneous Poisson processes λ*(t) is a function of time that does not depend on the history. On the other hand, a Hawkes process is an example of a point process where λ*(t) does depend on the history [Hawkes, 1971]. MPPs are defined by an intensity that is a function of both the time t and the mark x: λ*(t, x) = λ*(t) p*(x | t). We have written the joint intensity in a factored form, where λ*(t) is the intensity of any point occurring (that is, the mark is unspecified), and p*(x | t) is the pdf of the observed mark given the point's time. For an MPP, the history H_t contains each prior point's time and mark.

2.1 Counterfactual Gaussian Processes

Let {Y_t : t ∈ [0, τ]} denote a continuous-time stochastic process, where Y_t ∈ R, and [0, τ] defines the interval over which the process is defined. We will assume that the process is observed at a discrete set of irregular and random times {(y_j, t_j)}_{j=1}^n.
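To make the conditional intensity concrete, the following sketch simulates event times from a Hawkes process, whose λ*(t) depends on the history, using Ogata-style thinning. The parameter values are arbitrary choices for illustration, not values used in the paper.

```python
import numpy as np

def sample_hawkes(mu, alpha, beta, tau, rng):
    """Sample event times on [0, tau] from a Hawkes process with conditional
    intensity lambda*(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i)),
    using Ogata's thinning algorithm."""
    t, points = 0.0, []
    while True:
        # The intensity only decays between events, so its value at the
        # current time upper-bounds it until the next accepted point.
        lam_bar = mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in points)
        t += rng.exponential(1.0 / lam_bar)        # candidate inter-arrival time
        if t > tau:
            return np.array(points)
        lam_t = mu + alpha * sum(np.exp(-beta * (t - ti)) for ti in points)
        if rng.uniform() < lam_t / lam_bar:        # accept w.p. lambda*(t) / bound
            points.append(t)

rng = np.random.default_rng(1)
times = sample_hawkes(mu=0.5, alpha=0.8, beta=1.0, tau=24.0, rng=rng)
print(len(times), "irregular event times on [0, 24]")
```

With alpha/beta < 1 the process is stable; the history dependence shows up as temporal clustering of the sampled points.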
We use C to denote the set of possible action types, a ∈ C to denote the elements of the set, and define an action to be a 2-tuple (a, t) specifying an action type a ∈ C and a time t ∈ [0, τ] at which it is taken. To refer to multiple actions, we use a = [(a_1, t_1), ..., (a_n, t_n)]. Finally, we define the history H_t at a time t ∈ [0, τ] to be a list of all previous observations of the process and all previous actions. Our goal is to model the counterfactual:
P({Y_s[a] : s > t} | H_t), where a = {(a_j, t_j) : t_j > t}_{j=1}^m.   (1)
To learn the counterfactual model, we will use traces D ≜ {h_i = {(t_ij, y_ij, a_ij)}_{j=1}^{n_i}}_{i=1}^m, where y_ij ∈ R ∪ {∅}, a_ij ∈ C ∪ {∅}, and t_ij ∈ [0, τ]. Our approach is to model D using a marked point process (MPP), which we learn using the traces. Using Assumption 1 and two additional assumptions defined below, the estimated MPP recovers the counterfactual model in Equation 1.
We define the MPP mark space as the Cartesian product of the outcome space R and the set of action types C. To allow either the outcome or the action (but not both) to be the null variable ∅, we introduce binary random variables z_y ∈ {0, 1} and z_a ∈ {0, 1} to indicate when the outcome y and action a are not ∅. Formally, the mark space is X = (R ∪ {∅}) × (C ∪ {∅}) × {0, 1} × {0, 1}. We can then write the MPP intensity as
λ*(t, y, a, z_y, z_a) = λ*(t) p*(z_y, z_a | t) · p*(y | t, z_y) · p*(a | y, t, z_a),   (2)
where the first two factors [A] form the event model, the third factor [B] is the outcome model (GP), and the fourth factor [C] is the action model, and where we have again used the * superscript as a reminder that the hazard function and densities above are implicitly conditioned on the history H_{t−}. The parameterization of the event and action models can be chosen to reflect domain knowledge about how the timing of events and choice of action depend on the history. The outcome model is parameterized using a GP (or any elaboration such as a hierarchical GP or mixture of GPs), and can be treated as a standard regression model that predicts how the future trajectory will progress given the previous actions and outcome observations.
Learning. To learn the CGP, we maximize the likelihood of observational traces over a fixed interval [0, τ]. Let θ denote the model parameters, then the likelihood for a single trace is
ℓ(θ) = Σ_{j=1}^{n} log p*_θ(y_j | t_j, z_{y_j}) + Σ_{j=1}^{n} log λ*_θ(t_j) p*_θ(a_j, z_{y_j}, z_{a_j} | t_j, y_j) − ∫_0^τ λ*_θ(s) ds.   (3)
We assume that traces are independent, and so can learn from multiple traces by maximizing the sum of the individual-trace log likelihoods with respect to θ. We refer to Equation 3 as the adjusted maximum likelihood objective. We see that the first term fits the GP to the outcome data, and the second term acts as an adjustment to account for dependencies between future outcomes and the timing and types of actions that were observed in the training data.
Connection to target counterfactual. By maximizing Equation 3, we obtain a statistical model of the observational traces D.
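The structure of Equation 3 can be sketched in code. The example below is a simplified, illustrative instance with choices of our own, not the paper's actual parameterization: a homogeneous event intensity and a history-independent action probability stand in for the event and action models, and an RBF-kernel GP stands in for the outcome model.

```python
import numpy as np

def gp_log_marginal(t, y, lengthscale=2.0, variance=1.0, noise=0.1):
    """Log marginal likelihood of outcomes y at times t under a zero-mean GP
    with an RBF kernel (a stand-in for the outcome model, factor [B])."""
    d = np.subtract.outer(t, t)
    K = variance * np.exp(-0.5 * (d / lengthscale) ** 2) + noise**2 * np.eye(len(t))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(t) * np.log(2 * np.pi)

def adjusted_log_likelihood(trace, tau, lam, p_action):
    """Equation 3 for one trace, with a homogeneous event intensity lam and a
    history-independent action probability p_action (illustrative choices).

    trace: list of (t_j, y_j, a_j), where y_j or a_j may be None (the null mark)."""
    t_out = np.array([t for t, y, _ in trace if y is not None])
    y_out = np.array([y for _, y, _ in trace if y is not None])
    term_outcome = gp_log_marginal(t_out, y_out)          # fits the GP to outcomes
    term_events = len(trace) * np.log(lam)                # sum_j log lambda*(t_j)
    term_actions = sum(                                   # log p*(a_j, z | t_j, y_j)
        np.log(p_action if a is not None else 1.0 - p_action) for _, _, a in trace
    )
    compensator = lam * tau                               # integral of lambda* over [0, tau]
    return term_outcome + term_events + term_actions - compensator

trace = [(1.0, 0.3, None), (4.0, 0.1, "drug"), (9.0, -0.2, None)]
print(adjusted_log_likelihood(trace, tau=24.0, lam=0.2, p_action=0.3))
```

The first term is the GP fit; the remaining terms are the point-process adjustment that accounts for when and which actions were taken.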
In general, the statistical model may not recover the target counterfactual\nmodel (Equation 1). To connect the CGP to Equation 1, we describe two additional assumptions. The\n\ufb01rst assumption is an alternative to Assumption 2.\nAssumption 3 (Continuous-Time NUC). For all times t and all histories Ht\u2212, the densities \u03bb\u2217(t),\np\u2217(zy, za | t), and p\u2217(a | y, t, za) do not depend on Ys[a] for all times s > t and all actions a.\nThe key implication of this assumption is that the policy used to choose actions in the observational\ndata did not depend on any unobserved information that is predictive of the future potential outcomes.\nAssumption 4 (Non-Informative Measurement Times). For all times t and any history Ht\u2212, the\nfollowing holds: p\u2217(y | t, zy = 1) dy = P (Yt \u2208 dy | Ht\u2212 ).\nUnder Assumptions 1, 3, and 4, we can show that Equation 1 is equivalent to the GP used to model\np\u2217(y | t, zy = 1). In the interest of space, the argument for this equivalence is in Section A of the\nsupplement. Note that these assumptions are not statistically testable (see e.g., Pearl 2009).\n3 Experiments\nWe demonstrate the CGP on two decision-support tasks. First, we show that the CGP can make\nreliable risk predictions that are insensitive to the action policy in the training data. Classical\nsupervised learning algorithms, however, are dependent on the action policy and this can make\nthem unreliable decision-support tools. 
Second, we show how the CGP can be used to compare counterfactuals and ask "what if?" questions for individualized treatment planning by learning the effects of dialysis on creatinine levels using real data from an intensive care unit (ICU).

Table 1: Results measuring reliability for simulated data experiments. See Section 3.1 for details.

                     |    Regime A      |    Regime B      |    Regime C
                     | Baseline GP  CGP | Baseline GP  CGP | Baseline GP  CGP
Risk Score Δ from A  |   0.000    0.000 |   0.083    0.001 |   0.162    0.128
Kendall's τ from A   |   1.000    1.000 |   0.857    0.998 |   0.640    0.562
AUC                  |   0.853    0.872 |   0.832    0.872 |   0.806    0.829

3.1 Reliable Risk Prediction with CGPs

We first show how the CGP can be used for reliable risk prediction, where the objective is to predict the likelihood of an adverse event so that we can intervene to prevent it from happening. In this section, we use simulated data so that we can evaluate using the true risk on test data. For concreteness, we frame our experiment within a healthcare setting, but the ideas can be more broadly applied.
Suppose that a clinician records a real-valued measurement over time that reflects an individual's health, which we call a severity marker. We consider the individual to not be at risk if the severity marker is unlikely to fall below a particular threshold in the future without intervention. As discussed by Caruana et al. [2015], modeling this notion of risk can help doctors decide when an individual can safely be sent home without aggressive treatment.
We simulate the value of a severity marker recorded over a period of 24 hours in the hospital; high values indicate that the patient is healthy. A natural approach to predicting risk at time t is to model the conditional distribution of the severity marker's future trajectory given the history up until time t; i.e. P({Y_s : s > t} | H_t). We use this as our baseline.
As an alternative, we use the CGP to explicitly model the counterfactual "What if we do not treat this patient?"; i.e. P({Y_s[∅] : s > t} | H_t). For all experiments, we consider a single decision time t = 12hrs. To quantify risk, we use the negative of each model's predicted value at the end of 24 hours, normalized to lie in [0, 1].
Data. We simulate training and test data from three regimes. In regimes A and B, we simulate severity marker trajectories that are treated by policies π_A and π_B respectively, which are both unknown to the baseline model and CGP at train time. Both π_A and π_B are designed to satisfy Assumptions 1, 3, and 4. In regime C, we use a policy that does not satisfy these assumptions. This regime will demonstrate the importance of verifying whether the assumptions hold when applying the CGP. We train both the baseline model and CGP on data simulated from all three regimes. We test all models on a common set of trajectories treated up until t = 12hrs with policy π_A and report how risk predictions vary as a function of action policy in the training data.
Simulator. For each patient, we randomly sample outcome measurement times from a homogeneous Poisson process with constant intensity λ over the 24 hour period. Given the measurement times, outcomes are sampled from a mixture of three GPs. The covariance function is shared between all classes, and is defined using a Matérn 3/2 kernel (variance 0.22, lengthscale 8.0) and independent Gaussian noise (scale 0.1) added to each observation. Each class has a distinct mean function parameterized using a 5-dimensional, order-3 B-spline. The first class has a declining mean trajectory, the second has a trajectory that declines then stabilizes, and the third has a stable trajectory.² All classes are equally likely a priori.
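The outcome portion of this simulator can be sketched as follows. The Matérn 3/2 kernel parameters match those stated above, but the Poisson intensity and the mean functions are simple stand-ins of our own (the paper's B-spline coefficients are only in its supplement).

```python
import numpy as np

def matern32(t1, t2, variance=0.22, lengthscale=8.0):
    """Matern 3/2 kernel, with the variance and lengthscale used above."""
    r = np.abs(np.subtract.outer(t1, t2)) / lengthscale
    return variance * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def sample_untreated_trajectory(rng, lam=1.0, tau=24.0, noise=0.1):
    """Draw irregular measurement times from a homogeneous Poisson process and
    sample outcomes from one of three GP classes with a shared covariance.
    The mean functions below are illustrative stand-ins for the B-spline means."""
    n = rng.poisson(lam * tau)
    t = np.sort(rng.uniform(0.0, tau, size=n))   # Poisson points are uniform given n
    means = [
        lambda s: 1.0 - 0.08 * s,                  # declining
        lambda s: 1.0 - 0.08 * np.minimum(s, 8.0), # declines then stabilizes
        lambda s: np.ones_like(s),                 # stable
    ]
    mu = means[rng.integers(3)](t)                 # classes equally likely a priori
    cov = matern32(t, t) + noise**2 * np.eye(n)
    y = rng.multivariate_normal(mu, cov)
    return t, y

rng = np.random.default_rng(0)
t, y = sample_untreated_trajectory(rng)
print(len(t), "measurements")
```

A treatment policy would then add a constant bump to y for two hours after each administered action, as described in the Simulator paragraph.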
At each measurement time, the treatment policy π determines a probability p of treatment administration (we use only a single treatment type). The treatments increase the severity marker by a constant amount for 2 hours. If two or more actions occur within 2 hours of one another, the effects do not add up (i.e. it is as though only one treatment is active). Additional details about the simulator and policies can be found in the supplement.
Model. For both the baseline GP and CGP, we use a mixture of three GPs (as was used to simulate the data). We assume that the mean function coefficients, the covariance parameters, and the treatment effect size are unknown and must be learned. We emphasize that both the baseline GP and CGP have identical forms, but are trained using different objectives; the baseline marginalizes over future actions, inducing a dependence on the treatment policy in the training data, while the CGP explicitly controls for them while learning. For both the baseline model and CGP, we analytically sum over the mixture component likelihoods to obtain a closed form expression for the likelihood, which we optimize using BFGS [Nocedal and Wright, 2006].³ Predictions for both models are made using the posterior predictive mean given data and interventions up until 12 hours.
² The exact B-spline coefficients can be found in the simulation code included in the supplement.
³ Additional details can be found in the supplement.

Figure 2: Example factual (grey) and counterfactual (blue) predictions on real ICU data using the CGP.

Results. We find that the baseline GP's risk scores fluctuate across regimes A, B, and C. The CGP is stable across regimes A and B, but unstable in regime C, where our assumptions are violated.
In Table 1, the first row shows the average difference in risk scores (which take values in [0, 1]) produced by the models trained in each regime and produced by the models trained in regime A. In row 1, column B we see that the baseline GP's risk scores differ for the same person on average by around eight points (Δ = 0.083). From the perspective of a decision-maker, this behavior could make the system appear less reliable. Intuitively, the risk for a given patient should not depend on the policy used to determine treatments in retrospective data. On the other hand, the CGP's scores change very little when trained on different regimes (Δ = 0.001), as long as Assumptions 1, 3, and 4 are satisfied.
A cynical reader might ask: even if the risk scores are unstable, perhaps this has no consequences for the downstream decision-making task? In the second row of Table 1, we report Kendall's τ computed between each regime and regime A using the risk scores to rank the patients in the test data according to severity (i.e. scores closer to 1 are more severe). In the third row, we report the AUC for both models trained in each regime on the common test set. We label a patient as "at risk" if the last marker value in the untreated trajectory is below zero, and "not at risk" otherwise. In row 2, column B we see that the CGP has a high rank correlation (τ = 0.998) between the two regimes where the policies satisfy our key assumptions. The baseline GP model trained on regime B, however, has a lower rank correlation of τ = 0.857 with the risk scores produced by the same model trained on regime A. Similarly, in row three, columns A and B, we see that the CGP's AUC is unchanged (AUC = 0.872). The baseline GP, however, is unstable and creates a risk score with poorer discrimination in regime B (AUC = 0.832) than in regime A (AUC = 0.853).
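Stability metrics of this kind take only a few lines to compute. The sketch below uses synthetic risk scores as stand-ins for the experiment's outputs, with simple from-scratch implementations of Kendall's τ and the Mann-Whitney form of AUC.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall rank correlation between two score vectors (simple O(n^2) form)."""
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return s / (n * (n - 1) / 2)

def auc(scores, labels):
    """Probability that a positive outranks a negative (Mann-Whitney AUC)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Hypothetical risk scores from models trained in two regimes, same test set.
rng = np.random.default_rng(0)
risk_a = rng.uniform(size=100)
risk_b = np.clip(risk_a + rng.normal(0, 0.05, size=100), 0, 1)  # mild perturbation
labels = (risk_a > 0.7).astype(int)

print(round(kendall_tau(risk_a, risk_b), 3))  # near 1: rankings are stable
print(round(auc(risk_a, labels), 3))
```

A model whose ranking of patients shifts with the training-time action policy would show a visibly lower τ between regimes, which is exactly the instability Table 1 quantifies.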
Although we illustrate the stability of the CGP relative to the baseline GP using two regimes, this property is not specific to the particular choice of policies used in regimes A and B; the issue persists as we generate different training data by varying the distribution over the action choices.

Finally, the results in column C highlight the importance of Assumptions 1, 3, and 4. The policy πC does not satisfy these assumptions, and we see that the risk scores for the CGP are different when fit in regime C than when fit in regime A (Δ = 0.128). Similarly, in row 2 the CGP's rank correlation degrades (τ = 0.562), and in row 3 the AUC decreases to 0.829. Note that the baseline GP continues to be unstable when fit in regime C.

Conclusions. These results have important implications for the practice of building predictive models for decision support. Classical supervised learning algorithms can be unreliable due to an implicit dependence on the action policy in the training data, which is usually different from the assumed action policy at test time (e.g. what will happen if we do not treat?). Note that this issue is not resolved by training only on individuals who are not treated, because selection bias creates a mismatch between our train and test distributions. From a broader perspective, supervised learning can be unreliable because it captures features of the training distribution that may change (e.g. relationships caused by the action policy). Although we have used a counterfactual model to account for and remove these unstable relationships, there may be other approaches that achieve the same effect (e.g., Dyagilev and Saria 2016). Recent related work by Gong et al. [2016] on covariate shift aims to learn only the components of the source distribution that will generalize to the target distribution. As predictive models become more widely used in domains like healthcare where safety is critical (e.g.
Li-wei et al. 2015, Schulam and Saria 2015, Alaa et al. 2016, Wiens et al. 2016, Cheng et al. 2017), the framework proposed here is increasingly pertinent.

3.2 "What if?" Reasoning for Individualized Treatment Planning

To demonstrate how the CGP can be used for individualized treatment planning, we extract observational creatinine traces from the publicly available MIMIC-II database [Saeed et al., 2011]. Creatinine is a compound produced as a by-product of the chemical reaction in the body that breaks down creatine to fuel muscles. Healthy kidneys normally filter creatinine out of the body, which can otherwise be toxic in large concentrations. During kidney failure, however, creatinine levels rise and the compound must be removed using a medical procedure called dialysis.

We extract patients in the database who tested positive for abnormal creatinine levels, which is a sign of kidney failure. We also extract the times at which three different types of dialysis were given to each individual: intermittent hemodialysis (IHD), continuous veno-venous hemofiltration (CVVH), and continuous veno-venous hemodialysis (CVVHD). The data set includes a total of 428 individuals, with an average of 34 (±12) creatinine observations each. We shuffle the data and use 300 traces for training, 50 for validation and model selection, and 78 for testing.

Model. We parameterize the outcome model of the CGP using a mixture of GPs. We always condition on the initial creatinine measurement and model the deviation from that initial value. The mean for each class is zero (i.e. we assume there is no deviation from the initial value on average). We parameterize the covariance function using the sum of two non-stationary kernel functions.
Let \u03c6 : t \u2192 [1, t, t2](cid:62) \u2208 R3 denote the quadratic polynomial basis, then the \ufb01rst kernel is\nk1(t1, t2) = \u03c6(cid:62)(t1)\u03a3\u03c6(t2), where \u03a3 \u2208 R3\u00d73 is a positive-de\ufb01nite symmetric matrix parameterizing\nthe kernel. The second kernel is the covariance function of the integrated Ornstein-Uhlenbeck (IOU)\nprocess (see e.g., Taylor et al. 1994), which is parameterized by two scalars \u03b1 and \u03bd and de\ufb01ned as\n\n(cid:0)2\u03b1min(t1, t2) + e\u2212\u03b1t1 + e\u2212\u03b1t2 \u2212 1 \u2212 e\u2212\u03b1|t1\u2212t2|(cid:1) .\n\nkIOU(t1, t2) = \u03bd2\n2\u03b13\n\n(cid:0)e\u2212b\u00b7t \u2212 e\u2212a\u00b7t(cid:1), and g(cid:96)(\u03b4 : h2, r) = h2 \u00b7 (1.0 \u2212 e\u2212r\u00b7t). The two response\n\nThe IOU covariance corresponds to the random trajectory of a particle whose velocity drifts according\nto an OU process. We assume that each creatinine measurement is observed with independent\nGaussian noise with scale \u03c3. Each class in the mixture has a unique set of covariance parameters.\nTo model the treatment effects in the outcome model, we de\ufb01ne a short-term function and long-\nterm response function. If an action is taken at time t0, the outcome \u03b4 = t \u2212 t0 hours later will\nbe additively affected by the response function g(\u03b4; h1, a, b, h2, r) = gs(\u03b4; h1, a, b) + g(cid:96)(\u03b4; h2, r),\nwhere h1, h2 \u2208 R and a, b, r \u2208 R+. The short-term and long-term response functions are de\ufb01ned\nas gs(\u03b4; h1, a, b) = h1a\na\u2212b\nfunctions are included in the mean function of the GP, and each class in the mixture has a unique set\nof response function parameters. We assume that Assumptions 1, 3, and 4 hold, and that the event\nand action models have separate parameters, so can remain unspeci\ufb01ed when estimating the outcome\nmodel. 
We \ufb01t the CGP outcome model using Equation 3, and select the number of classes in the\nmixture using \ufb01t on the validation data (we choose three components).\nResults. Figure 2 demonstrates how the CGP can be used to do \u201cwhat if?\u201d reasoning for treatment\nplanning. Each panel in the \ufb01gure shows data for an individual drawn from the test set. The green\npoints show measurements on which we condition to obtain a posterior distribution over mixture class\nmembership and the individual\u2019s latent trajectory under each class. The red points are unobserved,\nfuture measurements. In grey, we show predictions under the factual sequence of actions extracted\nfrom the MIMIC-II database. Treatment times are shown using vertical bars marked with an \u201cx\u201d\n(color indicates which type of treatment was given). In blue, we show the CGP\u2019s counterfactual\npredictions under an alternative sequence of actions. The posterior predictive trajectory is shown for\nthe MAP mixture class (mean is shown by a solid grey/blue line, 95% credible intervals are shaded).\nWe qualitatively discuss the CGP\u2019s counterfactual predictions, but cannot quantitatively evaluate them\nwithout prospective experimental data from the ICU. We can, however, measure \ufb01t on the factual\ndata and compare to baselines to evaluate our modeling decisions. Our CGP\u2019s outcome model allows\nfor heterogeneity in the covariance parameters and the response functions. We compare this choice\nto two alternatives. The \ufb01rst is a mixture of three GPs that does not model treatment effects. The\nsecond is a single GP that does model treatment effects. Over a 24-hour horizon, the CGP\u2019s mean\nabsolute error (MAE) is 0.39 (95% CI: 0.38-0.40),4, and for predictions between 24 and 48 hours\nin the future the MAE is 0.62 (95% CI: 0.60-0.64). 
The pairwise mean difference between the first baseline's absolute errors and the CGP's is 0.07 (0.06, 0.08) for 24 hours, and 0.09 (0.08, 0.10) for 24-48 hours. The mean difference between the second baseline's absolute errors and the CGP's is 0.04 (0.04, 0.05) for 24 hours and 0.03 (0.02, 0.04) for 24-48 hours. The improvements over the baselines suggest that modeling treatments and heterogeneity with a mixture of GPs for the outcome model is useful for this problem.

4 95% confidence intervals, computed using the pivotal bootstrap, are shown in parentheses.

Figure 2 shows factual and counterfactual predictions made by the CGP. In the first (left-most) panel, the patient is factually administered IHD about once a day, and is responsive to the treatment (creatinine steadily improves). We query the CGP to estimate how the individual would have responded had the IHD treatment been stopped early. The model reasonably predicts that we would have seen no further improvement in creatinine. The second panel shows a similar case. In the third panel, an individual with erratic creatinine levels receives CVVHD for the last 100 hours and is responsive to the treatment. As before, the CGP counterfactually predicts that she would not have improved had CVVHD not been given. Interestingly, panel four shows the opposite situation: the individual did not receive treatment and did not improve for the last 100 hours, but the CGP counterfactually predicts an improvement in creatinine, as in panel 3, under daily CVVHD.

4 Discussion

Classical supervised learning algorithms can lead to unreliable and, in some cases, dangerous decision-support tools. As a safer alternative, this paper advocates for using potential outcomes [Neyman, 1923, 1990, Rubin, 1978] and counterfactual learning objectives (like the one in Equation 3).
We introduced the Counterfactual Gaussian Process (CGP) as a decision-support tool for scenarios where outcomes are measured and actions are taken at irregular, discrete points in continuous time. The CGP builds on previous ideas in continuous-time causal inference (e.g. Robins 1997, Arjas and Parner 2004, Lok 2008), but is unique in that it can predict the full counterfactual trajectory of a time-dependent outcome. We designed an adjusted maximum likelihood algorithm for learning the CGP from observational traces by modeling them using a marked point process (MPP), and described three structural assumptions that are sufficient to show that the algorithm correctly recovers the CGP.

We empirically demonstrated the CGP on two decision-support tasks. First, we showed that the CGP can be used to make reliable risk predictions that are insensitive to the action policies used in the training data. This is critical because an action policy can cause a predictive model fit using classical supervised learning to capture relationships between the features and the outcome (risk) that lead to poor downstream decisions and that are difficult to diagnose. In the second set of experiments, we showed how the CGP can be used to compare counterfactuals and answer "what if?" questions, which could offer decision-makers a powerful new tool for individualized treatment planning. We demonstrated this capability by learning the effects of dialysis on creatinine trajectories using real ICU data and predicting counterfactual progressions under alternative dialysis treatment plans.

These results suggest a number of new questions and directions for future work. First, the validity of the CGP is conditioned upon a set of assumptions (this is true for all counterfactual models). In general, these assumptions are not testable.
The reliability of approaches using counterfactual models therefore critically depends on the plausibility of those assumptions in light of domain knowledge. Formal procedures, such as sensitivity analyses (e.g., Robins et al. 2000, Scharfstein et al. 2014), that can identify when causal assumptions conflict with a data set will help to make these methods more easily applied in practice. In addition, there may be other sets of structural assumptions beyond those presented that allow us to learn counterfactual GPs from non-experimental data. For instance, the back-door and front-door criteria are two separate sets of structural assumptions discussed by Pearl [2009] in the context of estimating parameters of causal Bayesian networks from observational data.

More broadly, this work has implications for recent pushes to introduce safety, accountability, and transparency into machine learning systems. We have shown that learning algorithms sensitive to certain factors in the training data (the action policy, in this case) can make a system less reliable. In this paper, we used the potential outcomes framework and counterfactuals to characterize and account for such factors, but there may be other ways to do this that depend on fewer or more realistic assumptions (e.g., Dyagilev and Saria 2016). Moreover, removing these nuisance factors is complementary to other system design goals such as interpretability (e.g., Ribeiro et al. 2016).

Acknowledgements

We thank the anonymous reviewers for their insightful feedback. This work was supported by generous funding from DARPA YFA #D17AP00014 and NSF SCH #1418590. PS was also supported by an NSF Graduate Research Fellowship. We thank Katie Henry and Andong Zhan for help with the ICU data set. We also thank Miguel Hernán for pointing us to earlier work by James Robins on treatment-confounder feedback.

References

A.M. Alaa, J. Yoon, S. Hu, and M. van der Schaar.
Personalized risk scoring for critical care patients using mixtures of Gaussian process experts. In ICML Workshop on Computational Frameworks for Personalization, 2016.

E. Arjas and J. Parner. Causal reasoning from longitudinal data. Scandinavian Journal of Statistics, 31(2):171-187, 2004.

L. Bottou, J. Peters, J.Q. Candela, D.X. Charles, M. Chickering, E. Portugaly, D. Ray, P.Y. Simard, and E. Snelson. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research (JMLR), 14(1):3207-3260, 2013.

K.H. Brodersen, F. Gallusser, J. Koehler, N. Remy, and S.L. Scott. Inferring causal impact using Bayesian structural time-series models. The Annals of Applied Statistics, 9(1):247-274, 2015.

R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 1721-1730. ACM, 2015.

L.F. Cheng, G. Darnell, C. Chivers, M.E. Draugelis, K. Li, and B.E. Engelhardt. Sparse multi-output Gaussian processes for medical time series prediction. arXiv preprint arXiv:1703.09112, 2017.

J. Cunningham, Z. Ghahramani, and C.E. Rasmussen. Gaussian processes for time-marked time-series data. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 255-263, 2012.

D.J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes. Springer Science & Business Media, 2007.

S. Doroudi, P.S. Thomas, and E. Brunskill. Importance sampling for fair policy selection. In Uncertainty in Artificial Intelligence (UAI), 2017.

M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In International Conference on Machine Learning (ICML), 2011.

K. Dyagilev and S. Saria.
Learning (predictive) risk scores in the presence of censoring due to interventions. Machine Learning, 102(3):323-348, 2016.

M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Schölkopf. Domain adaptation with conditional transferable components. In International Conference on Machine Learning (ICML), 2016.

A.G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83-90, 1971.

N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning (ICML), pages 652-661, 2016.

F.D. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning (ICML), 2016.

H.L. Li-wei, R.P. Adams, L. Mayaud, G.B. Moody, A. Malhotra, R.G. Mark, and S. Nemati. A physiological time series dynamics-based approach to patient monitoring and outcome prediction. IEEE Journal of Biomedical and Health Informatics, 19(3):1068-1076, 2015.

J.J. Lok. Statistical modeling of causal effects in continuous time. The Annals of Statistics, pages 1464-1507, 2008.

J.M. Mooij, D. Janzing, and B. Schölkopf. From ordinary differential equations to structural causal models: the deterministic case. In Uncertainty in Artificial Intelligence (UAI), 2013.

S.L. Morgan and C. Winship. Counterfactuals and Causal Inference. Cambridge University Press, 2014.

S.A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331-355, 2003.

I. Nahum-Shani, M. Qian, D. Almirall, W.E. Pelham, B. Gnagy, G.A. Fabiano, J.G. Waxmonsky, J. Yu, and S.A. Murphy. Q-learning: A data analysis method for constructing adaptive interventions. Psychological Methods, 17(4):478, 2012.

J. Neyman. Sur les applications de la théorie des probabilités aux experiences agricoles: Essai des principes.
Roczniki Nauk Rolniczych, 10:1-51, 1923.

J. Neyman. On the application of probability theory to agricultural experiments. Statistical Science, 5(4):465-472, 1990.

A.Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Autonomous inverted helicopter flight via reinforcement learning. In Experimental Robotics IX, pages 363-372. Springer, 2006.

J. Nocedal and S.J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.

C. Păduraru, D. Precup, J. Pineau, and G. Comănici. An empirical analysis of off-policy learning in discrete MDPs. In Workshop on Reinforcement Learning, page 89, 2012.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.

C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

M.T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 1135-1144. ACM, 2016.

J.M. Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393-1512, 1986.

J.M. Robins. Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika, 79(2):321-334, 1992.

J.M. Robins. Causal inference from complex longitudinal data. In Latent Variable Modeling and Applications to Causality, pages 69-117. Springer, 1997.

J.M. Robins and M.A. Hernán. Estimation of the causal effects of time-varying exposures. Longitudinal Data Analysis, pages 553-599, 2009.

J.M. Robins, A. Rotnitzky, and D.O. Scharfstein. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models.
In Statistical Models in Epidemiology, the Environment, and Clinical Trials, pages 1-94. Springer, 2000.

D.B. Rubin. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, pages 34-58, 1978.

M. Saeed, M. Villarroel, A.T. Reisner, G. Clifford, L.W. Lehman, G. Moody, T. Heldt, T.H. Kyaw, B. Moody, and R.G. Mark. Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public-access intensive care unit database. Critical Care Medicine, 39(5):952, 2011.

D. Scharfstein, A. McDermott, W. Olson, and F. Wiegand. Global sensitivity analysis for repeated measures studies with informative dropout: A fully parametric approach. Statistics in Biopharmaceutical Research, 6(4):338-348, 2014.

P. Schulam and S. Saria. A framework for individualizing predictions of disease trajectories by exploiting multi-resolution structure. In Advances in Neural Information Processing Systems (NIPS), pages 748-756, 2015.

A. Sokol and N.R. Hansen. Causal interpretation of stochastic differential equations. Electronic Journal of Probability, 19(100):1-24, 2014.

H. Soleimani, A. Subbaswamy, and S. Saria. Treatment-response models for counterfactual reasoning with continuous-time, continuous-valued interventions. In Uncertainty in Artificial Intelligence (UAI), 2017.

R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

A. Swaminathan and T. Joachims. Counterfactual risk minimization. In International Conference on Machine Learning (ICML), 2015.

S.L. Taubman, J.M. Robins, M.A. Mittleman, and M.A. Hernán. Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. International Journal of Epidemiology, 38(6):1599-1611, 2009.

J. Taylor, W. Cumberland, and J. Sy.
A stochastic model for analysis of longitudinal AIDS data. Journal of the American Statistical Association, 89(427):727-736, 1994.

J. Wiens, J. Guttag, and E. Horvitz. Patient risk stratification with time-varying parameters: a multitask learning approach. Journal of Machine Learning Research (JMLR), 17(209):1-23, 2016.

Y. Xu, Y. Xu, and S. Saria. A Bayesian nonparametric approach for estimating individualized treatment-response curves. In Machine Learning for Healthcare Conference (MLHC), pages 282-300, 2016.