Review for NeurIPS paper: Counterfactual Predictions under Runtime Confounding

NeurIPS 2020

Review 1

Summary and Contributions: This paper addresses the problem of decision-making in the presence of runtime confounding, confounding due to current information not yet being available on some relevant variables, via counterfactual prediction. In particular, the notion of runtime confounding is formalized; a doubly-robust prediction method is proposed, as is a doubly-robust estimate of MSE that allows construction of confidence intervals from observed data. The motivating example comes from criminal justice where interest lies in predicting an offender's success on parole if placed on minimal supervision, while experiments rely on synthetic data. Of note is that the main goal is prediction rather than treatment effect estimates.

Strengths: Runtime confounding is an interesting problem that lends itself well to counterfactual reasoning based on a subset of confounders. The formulation of the problem and the theoretical results are sound and relevant for algorithm-assisted decision-making.

Weaknesses: In general I quite liked this paper, but not the motivating example. In the broader impact section the authors highlight that conditions 2.1 must hold and acknowledge that the training data needs to be unconfounded. However, I suspect this is unlikely in the case of criminal justice (for example, certain groups of offenders on parole may not be brought in for a small infraction and only given a warning, while others in a different demographic group may not be afforded such leniency). This motivating example needs to be changed (see potential ethical concerns below) and is my main concern with the paper. The motivating example should not be one that suffers from the problems discussed in broader impact. Otherwise there is a false impression that the method is suitable for this example. The main result of the paper is Theorem 3.1, which is an estimate of the MSE that permits estimates with confidence intervals. However, the simulation study does not seem to evaluate this estimator and does not report coverage for confidence intervals based on this doubly-robust estimator.
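
To make the coverage request concrete, the kind of check meant here could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses a plain sample-mean MSE estimate with a Wald interval as a stand-in for the paper's doubly-robust MSE estimator, and the data-generating process, sample size, and variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, z_crit = 500, 1000, 1.96

# Hypothetical setup: the predictor always outputs 0 and the outcome is N(0, 1),
# so the true MSE of the predictor is exactly 1.
true_mse = 1.0

covered = 0
for _ in range(reps):
    y = rng.normal(size=n)
    sq_err = (y - 0.0) ** 2                   # squared errors of the fixed predictor
    est = sq_err.mean()                       # stand-in MSE estimate (NOT the DR estimator)
    se = sq_err.std(ddof=1) / np.sqrt(n)      # standard error of the mean squared error
    lower, upper = est - z_crit * se, est + z_crit * se
    covered += int(lower <= true_mse <= upper)

coverage = covered / reps
print(f"empirical coverage of nominal 95% interval: {coverage:.3f}")
```

Replacing the stand-in estimate and standard error with the doubly-robust versions, and reporting the resulting empirical coverage next to the nominal level, would address this point.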

Correctness: Yes.

Clarity: The data generation for the simulation study is outlined without any intuition. The appendix adds these details, but more details are needed in the main paper, particularly in defining k_v and k_z more precisely. What is h in Algorithm 5?

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: line 154: "uses this to construct" to "uses this \mu to construct"; line 175: "next result specialized" to "next result is specialized"; line 217: "We expect the TCR..." to "Accordingly, we expect the TCR...". In this work, the algorithm is agnostic to which variables contribute to the runtime confounding. In practical situations, is there typically a core of variables that must be available before making predictions? Does this depend on how big the confounding effect of each runtime confounder is?

Post-rebuttal: First of all, I feel that the authors have identified an interesting practical problem and have come up with a logical approach to tackling it. However, the formal definition of runtime confounding still needs clarification. Even more important is that the whole problem is embedded in contexts in which historical decisions likely suffer from systemic bias and in which training ignorability is likely violated (the child welfare setting likely suffers from the same shortcomings as the parole board setting). If the problem were cast within a different framework or set of applications, this could be a useful tool for practitioners. But since it isn't, and the final discussion of the broader implications cannot be seen before going to press, I have not changed my overall score.

Review 2

Summary and Contributions: EDIT: Thanks for your rebuttal. My score for this paper is unchanged, although I agree with other reviewers that 1) an additional experiment with real or semi-synthetic data would strengthen the paper quite a bit, and 2) the language around high-stakes decision-making, parole in particular, should be more careful.

- Gives estimators for runtime confounding, where all confounders are observed at train time but not at test time
- Gives bounds on error and a doubly robust estimator
- Runs experiments on synthetic data and examines results of varying data generation hyperparameters

Strengths:
- Claims seem theoretically sound, with proofs of most but not all claims in the appendix
- Evaluation is thorough in synthetic data, varies parameters of the data generation process, and demonstrates how they connect to theoretical results
- Contribution seems somewhat significant: it’s not an application I’ve thought of or seen described before as an open problem, but I can imagine it would be useful in some cases, and the authors discuss this at the beginning of the paper
- Relevant to the NeurIPS community’s interest in causal inference + ML, and prediction under confounding

Weaknesses:
- A couple of places where the work could be spelled out a little more thoroughly
- It’s not 100% clear to me that this use case is plausible, but it is hard to say

Correctness: -- Yes this paper seems correct

Clarity: -- Paper is clear

Relation to Prior Work: -- Yes

Reproducibility: Yes

Additional Feedback:
- Reproducibility: I think yes. The code is provided, but it is in R, which I barely know; it seems like it could be reproduced
- Line 50: “it is common…” Is there a citation for this practice? I understand it’s common sense, but given that it’s a central motivation for the paper, it could use some more grounding
- Line 125: I think the positivity condition needs a “for all a” or something
- Line 128: Would like to see one more line of explanation to unpack these stacked expectations
- Line 132: are L and R functions? Clarify this condition.
- Algorithm text throughout is extremely small and hard to read
- Line 149: describing this decomposition in this way is nice
- Line 173: is this the definition of oracle-efficient? Clarify the connection here
- Line 176: would like to see this corollary 3.1 worked through in an appendix
- Line 177: need to define what a k-sparse model is
- Line 182: I don’t see why DR will outperform PL if d_v << d: the first term in the bound is the same and the second doesn’t include d_v?
- Line 224: why do we see this effect with random forests? Is it due to lower model bias? Higher accuracy?
- Line 229: a negative correlation exacerbating confounding doesn’t seem like a general result; clarify if this is dependent on the form of your data generating process
- Line 236: unclear on why this inflection point happens with V_i and Z_i; need to explain more clearly how the data is generated in this way
- Line 244: I like the motivation to test an interpretable second stage

Review 3

Summary and Contributions: The paper studies the problem of predicting the outcomes of interventions after learning from observational data. In this particular setting, unconfoundedness is satisfied at training time but not at runtime: all confounders are observed in the data, but these may not be available when the predictions are made. The paper refers to this setting as "runtime confounding." The authors propose several algorithms for learning prediction models in this setting, based on approximating a consistent estimator of the causal effect using variables that are available at runtime. The algorithms are evaluated on synthetic data in settings with and without model misspecification. After rebuttal: See comments under "weaknesses".

Strengths:
- The problem is very relevant in settings where ML is deployed in practice. In some sense, it is a more general problem than the authors make it out to be; it is relevant also to more standard prediction problems, "without" causal components.
- The proposed algorithm is simple but provably correct in the sense that it leads to consistent estimates of the sought-after quantity when sufficient data is available.
- The paper is well organized and focused.

Weaknesses:
- Despite giving plenty of examples in the introduction of applications where "runtime confounding" is an issue, none of them are addressed in the empirical evaluation. It is true that unbiased evaluation of causal effects is difficult in the general setting, but it is common practice to synthesize either treatment assignments or outcomes based on real-world covariates, as in the commonly used IHDP benchmark. UPDATE: In the rebuttal, the authors have addressed this point and provided additional results which will strengthen the paper.
- The problem is an instance of a more general problem which is not discussed (see Relation to prior work). UPDATE: In the rebuttal, the authors have addressed this point and provided additional discussion which will strengthen the paper.
- I take issue with the idea that standard practice for this setting is "treatment-conditional regression", as defined by the authors. To me, this is a straw-man baseline. For the case where important confounders are known at training time, I have never heard of them being removed from analysis due to not being available at test time. I would expect that runtime imputation would be closer to standard practice, but this approach is never discussed in the paper, nor used as a baseline.
- As acknowledged by the authors, the plugin (and doubly-robust) approach(es) are "simple" but reasonable solutions to the problem at hand. (In fact, I would be surprised if the former is not already in widespread use.) As such, I wish that the authors would include a discussion on the potential optimality of these solutions. Is a two-stage estimator optimally sample efficient, or should we hope to find a better algorithm? Theorem 3.1 shines some light on this issue but is not compared to an alternative approach which could induce a different tradeoff.
- The authors state that the biased baseline (TCR) is expected to perform best at low levels of confounding due to the (potentially) compounding error of the two-stage estimators. However, this error should be reduced when sample size increases, while the bias of TCR should remain. Regrettably, this is not confirmed by experiments. In light of this, the results in Figure 1 c) are a little odd as they show that all methods perform about as well. The prediction that "we expect this increase [in error] to be significantly larger for the TCR method that has confounding bias" appears to not be true. Is this an effect due to sample size or something else?

Correctness: - The results in the paper appear correct.

Clarity: - The paper is well presented and easy to read.

Relation to Prior Work: - The general problem of learning with access to data that is not available at runtime was formalized by Vapnik & Vashist (2009) as "Learning using privileged information" (LuPI). It seems to me that the current problem is an instance of this paradigm and a discussion of its relation is warranted.

Reproducibility: Yes

Additional Feedback:
- The problem studied here is posed as a causal problem, but it is clearly an instance of a more general "coarsening" or "approximation" of prediction functions from a larger variable set to a smaller one. This is clear also from the algorithms given in the paper, which first fit the function of interest using regression or doubly robust estimation, effectively handling the issue of causality. This function is then approximated as a function of the runtime variable set. I worry that the framing as a problem of causality may separate the literature unnecessarily. For example, is this not simply an example of predicting with missing variables at runtime?
- The introduction gives a lot of examples of what the authors call "runtime confounding" but doesn't quite explain why they are examples of confounding. Certainly, the examples are relevant and important, but I'm not sure they are problems of confounding. The quantities necessary for prediction at runtime (i.e., the expected potential outcomes) are fully identified by the training data. Further, by definition, the outcome of the runtime instance is not observed, let alone used in the algorithm. So, what quantity is confounded at runtime? Based on this, I'm not sure that "runtime confounding" is an appropriate name.
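
To illustrate the "fit on the full variable set, then approximate on the runtime set" reading of the algorithms, here is a minimal sketch of my understanding of the plug-in idea next to treatment-conditional regression (TCR). Everything in it is an assumption made for illustration: the linear data-generating process, the variable names, and the use of ordinary least squares in both stages are mine, not the paper's simulation or code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process (not the paper's simulation).
V = rng.normal(size=n)                        # available at runtime
Z = rng.normal(size=n)                        # recorded in training data only
p_treat = 1.0 / (1.0 + np.exp(-2.0 * Z))      # treatment assignment depends on Z
A = rng.binomial(1, p_treat)
Y0 = V + Z + rng.normal(scale=0.5, size=n)    # potential outcome under A = 0
Y = np.where(A == 0, Y0, np.nan)              # Y0 is observed only when A = 0

def ols(X, y):
    """Least-squares coefficients for y ~ [1, X]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

ctrl = A == 0
target = V  # true E[Y0 | V] here, because Z is independent of V with mean 0

# TCR: regress Y on V among units with A = 0, ignoring Z entirely.
tcr = predict(ols(V[ctrl], Y[ctrl]), V)

# Plug-in: stage 1 fits E[Y | V, Z, A = 0]; stage 2 regresses those
# fitted values on V alone, using all units.
VZ = np.column_stack([V, Z])
mu_hat = predict(ols(VZ[ctrl], Y[ctrl]), VZ)
plug_in = predict(ols(V, mu_hat), V)

mse_tcr = float(np.mean((tcr - target) ** 2))
mse_plug_in = float(np.mean((plug_in - target) ** 2))
print(f"TCR MSE: {mse_tcr:.4f}  plug-in MSE: {mse_plug_in:.4f}")
```

In this toy setup the second stage is an ordinary regression of one fitted function onto a smaller variable set, which is the sense in which the problem looks like general coarsening of a prediction function. The causal content sits entirely in the first stage.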

Review 4

Summary and Contributions: This paper considers a specific kind of prediction task feeding into a human decision, where outcomes are a function of a (fully observed) set of covariates, but where some of these covariates, while observed, are off limits. This can be for a variety of reasons including timing, concern for bias, explainability, etc. Thus it is different both from straightforward tasks with no unobserved covariates and from tasks where there are fully unobserved covariates. It then sets up a series of algorithms for accurate and efficient estimation of counterfactual predictions (eg, whether the decision is taken or not). The concrete example I had in my head as I was reading is the parole hearing setting the authors set up: prior data V can be used to make a prediction (eg on recidivism), and spoken testimony Z is collected at the time of the hearing. Z has signal both for predicting the decision (release/hold) and the outcome (recidivism, only observed if release). So omitting Z is a problem.

Strengths: I don’t know this literature very well, but the paper seems technically well done. The problem is an interesting and potentially very important one, and the authors’ approach is rigorous and comprehensive. They note and deal with a number of concerns that are often unaddressed or unacknowledged: not just the presence/absence of unobserved covariates, but also the potential for the decision to affect the observed outcome (eg, medical treatments reduce the risk of outcomes, making the observed predictions too low relative to the counterfactual of interest, ie no treatment). Again, I am not the best person to evaluate the methods, but they seem convincing if you accept the basic “fully observed, but partially off limits” premise.

Weaknesses: I wonder if the paper would benefit from setting up a more concrete—even if hypothetical—example of when this scenario would occur. This seems important not just as a way to convince the reader of the need for this enterprise, but also to set up precise empirical tests showing that this approach works better than other, more naive methods with respect to decisions rather than prediction errors. I wasn’t sure what to make of the generality of the examples described in the introduction. The authors assert that the parole board wants the prediction before hearing the testimony, but they are going to hear the spoken testimony anyway. (i) One use of the prediction based on prior data is to determine who gets a hearing and thus who has Z collected. (ii) More generally, the prior prediction could influence not only the presence of Z, but how the decision maker views Z (eg, anchoring on a prior probability and not updating; or updating too much).

Correctness: Yes, as above.

Clarity: Yes. I would have liked the concrete examples to be continued to build intuitions, and perhaps more on the generality of the problem, but otherwise I thought it was easy to follow and clear.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: