NeurIPS 2020

Minimax Optimal Nonparametric Estimation of Heterogeneous Treatment Effects

Review 1

Summary and Contributions: The paper theoretically studies nonparametric minimax-optimal rates for heterogeneous treatment effect estimation. The main assumption is that the conditional average treatment effect function is smoother than the treatment-specific baseline responses. ---- I have read the author feedback. Thanks for the clarifications. The additional experiments are particularly illuminating.

Strengths: Soundness of the claims: theoretically grounded lower bounds for nonparametric estimation. Relevance: heterogeneous treatment effects are well studied.

Weaknesses: Relevance: The minimax bound in this tau-smoother-than-baseline setting is relevant (but perhaps less so because it does not take into account additional structure on the propensity scores, which algorithmic approaches for causal inference typically do). The proposed algorithm seems relevant mostly because it is a constructive procedure that achieves the bound, rather than because it provides additional insight into how this theoretical understanding should affect algorithm design for heterogeneous treatment effects.

Correctness: The claims are correct to the best of my knowledge.

Clarity: The paper is reasonably well written. (It is useful to illustrate the simplification for the fixed design.)

Relation to Prior Work: A comparison with the rates achieved by previous work, and with the different assumptions made there, could greatly improve relevance. For example, the following paper studied minimax bounds for heterogeneous treatment effect estimation: Alaa, Ahmed, and Mihaela van der Schaar. "Limits of estimating heterogeneous treatment effects: Guidelines for practical algorithm design." International Conference on Machine Learning. 2018. Though they may not have made different structural assumptions on \tau versus \mu, they focus somewhat on the algorithmic implications of their work.

Reproducibility: Yes

Additional Feedback: Some questions: Re: lines 295-296: Could you comment on why this is the case? Is this purely a consequence of the random design focus? Intuitively, the order of generating P(X) P(T|X) vs. P(T) P(X|T) shouldn't matter, since these are equivalent factorizations of the joint distribution. Assumption 1: When is the inequality expected to hold strictly? Clearly one example where this is the case is when \tau is constant. Are there other examples where it holds strictly? Intermediate/realistic settings seem like they might be better modeled by assuming that "\tau is well approximated by a function with a strictly smaller smoothness parameter".
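The factorization point in the question above can be checked on a toy example. The sketch below is only an illustration: the joint distribution over a binary X and a binary T is hypothetical, and it says nothing about the specific claim at lines 295-296 of the paper. It shows only that P(X) P(T|X) and P(T) P(X|T) reproduce the same joint.

```python
# Hypothetical toy joint distribution over binary X and binary T
# (keys are (x, t) pairs; the probabilities are illustrative only).
joint = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.40}

# Marginals P(X) and P(T).
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
p_t = {t: sum(p for (_, ti), p in joint.items() if ti == t) for t in (0, 1)}

# Conditionals P(T | X) and P(X | T), both indexed by (x, t).
p_t_given_x = {(x, t): joint[(x, t)] / p_x[x] for (x, t) in joint}
p_x_given_t = {(x, t): joint[(x, t)] / p_t[t] for (x, t) in joint}

# Rebuild the joint in both orders.
via_x_first = {k: p_x[k[0]] * p_t_given_x[k] for k in joint}
via_t_first = {k: p_t[k[1]] * p_x_given_t[k] for k in joint}

# Largest disagreement between the two factorizations (floating-point noise only).
max_gap = max(abs(via_x_first[k] - via_t_first[k]) for k in joint)
```

Any difference between the two sampling orders in the paper would therefore have to come from what is held fixed or assumed about each factor, not from the joint distribution itself.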

Review 2

Summary and Contributions: - This paper studies the estimation of heterogeneous treatment effects (HTE) using nonparametric methods. In particular, the authors give results on the minimax rate at which HTE can be learned, under assumptions on the smoothness of the true effect functions, and propose algorithms for doing so. Two settings are considered. In the first, a fixed design, subjects in different treatment groups are essentially matched on covariates. In the second, a random design, covariates are sampled from a distribution. Finally, the proposed algorithms for each setting are shown to achieve the minimax optimal rates. After author response: I maintain my position that this is a good paper that would be a nice contribution to the conference.

Strengths: - The paper studies a problem, causal effect estimation, that is highly relevant for many applications of machine learning. It provides a valuable addition to the theoretical understanding of this problem. In particular, while many works are agnostic to the distribution of treatments and covariates, this paper studies two distinct and concrete settings which are relevant in practice. - This choice to study concrete settings makes the interpretation of the results more intuitive and the takeaways clearer. This is illustrated in the discussion of the nearest-neighbor-based estimator, which discards observations that are poorly matched. In addition, the authors provide a useful discussion of the limitations of the proposed approaches at the end of the paper. - A comprehensive survey of related work is given, and comparisons are made with theoretical results from that work.

Weaknesses: - The two settings considered, the fixed design and the low-smoothness setting, are both fairly restricted. In particular, requiring that the smoothness parameter beta < 1 is rather strong, as indicated by the example/discussion given in Section 4. - The machinery used for the analysis, e.g., kernel methods and differencing, is well known and often used in nonparametric estimation. Nevertheless, the application yields interesting results here. - There are no empirical results included in the paper. These could have been used to study the conjectured phase transition from beta < 1 to beta > 1. Given the proposed algorithms, this seems like a missed opportunity.

Correctness: - The results of the paper appear correct.

Clarity: - The goals of the paper are clearly stated and followed up on, see e.g. ln 78: "The main aim of this paper is to characterize the tight minimax rates for the above quantities". The paper is well written and assumptions are clearly stated.

Relation to Prior Work: - Good overview of related work

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The authors derive minimax rates for estimating CATE functions in the setting where the CATE is smoother than the baseline function. They consider two designs and, for each, find the minimax rate and a method that nearly achieves it.

Strengths: This is quite a strong paper. First, it searches for the fundamental limits (minimax rates) of CATE estimation. I do not think the rate result is obvious, so it is technically challenging. At the same time, the minimax rates have a reasonable interpretation that is well explained in the paper. Second, the method that (nearly) achieves them is also novel, at least in the causal inference community.

Weaknesses: I do not think there is a big weakness. One thing the reader would want to know is the practical performance (experiments) of the proposed methods. I recommend that the authors add experiments so that people can see the implications of the theorems.

Correctness: I am familiar with the causal inference literature, but I know only the basics of minimax theory. In this sense, I am not sure these theorems are really correct. Based on my educated guess, their intuitive explanation, and my brief checking of the proofs, they look correct.

Clarity: Yes. It is clearly written. The authors try to convey difficult theorems to the reader in an accessible way.

Relation to Prior Work: Yes. It is clearly discussed.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The purpose of this paper is to provide new theoretical tools and bounds for heterogeneous treatment effect (HTE) estimation in causal inference. This work is in line with a fairly current theme: HTE estimation is attracting growing interest in applications, particularly in the field of personalized medicine. To avoid strong assumptions and to benefit from a broader scope of application, the authors focus on nonparametric estimation. As the authors point out, much effort has been devoted to proposing practical methods, but not so much to the statistical study of nonparametric HTE estimation. This paper establishes minimax rates that depend on both the geometry of the covariates and parameters related to propensity scores and noise levels. The authors consider two designs: a fixed one and a random one. In the fixed design, the covariates of the two groups are generated from the same regular grid, translated by a vector that quantifies the matching distance between the control and treatment groups. The random design is more realistic, with no matching parameters and using the propensity score. Section 2 is devoted to the study of the fixed design. In this design, the estimation relies on the regular structure of the grid and a kernel estimator. Characterizing the minimax L2 risks for the random design (Section 3) is a trickier problem. Here, the authors propose a two-stage nearest-neighbor estimator and show that the minimax estimation error exhibits three different behaviors (Theorem 2).
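To make the fixed design described above concrete, here is a minimal one-dimensional sketch. It is not the authors' estimator: the function names, the box kernel, the placeholder baseline, and the constant effect `TAU` are all assumptions made for illustration. It only shows the general idea of differencing matched pairs on a shifted grid and then smoothing the differences.

```python
import math

def fixed_design(n, delta):
    """Control covariates on a regular grid in [0, 1); treated covariates
    are the same grid shifted by delta (the matching distance)."""
    control_x = [i / n for i in range(n)]
    treated_x = [x + delta for x in control_x]
    return control_x, treated_x

def differenced_kernel_estimate(x0, control_x, y_control, y_treated, bandwidth):
    """Box-kernel average of the paired differences y_treated[i] - y_control[i]
    over grid points whose control covariate lies within `bandwidth` of x0."""
    diffs = [yt - yc
             for xc, yc, yt in zip(control_x, y_control, y_treated)
             if abs(xc - x0) <= bandwidth]
    if not diffs:
        raise ValueError("no grid points within the bandwidth of x0")
    return sum(diffs) / len(diffs)

# Hypothetical noiseless example: placeholder baseline, constant effect.
def baseline(x):
    return abs(math.sin(20.0 * x))  # assumed baseline response, illustration only

TAU = 2.0  # assumed constant treatment effect, illustration only

control_x, treated_x = fixed_design(n=100, delta=0.0)
y_control = [baseline(x) for x in control_x]
y_treated = [baseline(x) + TAU for x in treated_x]
estimate = differenced_kernel_estimate(0.5, control_x, y_control, y_treated, 0.05)
```

With `delta=0.0` the pairs are perfectly matched, so the baseline cancels in each difference and the estimate recovers `TAU` up to floating-point error. With a nonzero `delta`, each difference also carries the term `baseline(x + delta) - baseline(x)`, which is how the matching distance enters the error.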

Strengths: The article is well-written, the explanations are clear and detailed. In particular, the different contributions regarding the existing literature are easily identifiable. All the theorems are put into context, and the different bounds obtained are explained, effectively highlighting the key ideas.

Weaknesses: Detailed algorithms are provided, but I would have liked to see them "in action". However, I am aware that the page limit does not necessarily leave room for numerical studies, and I find the length of the article well managed as it is. A really minor concern: I agree that Theorems 1 and 2 are direct consequences of Theorems 3 and 4, and Theorems 5 and 6, respectively, as clearly stated in the appendices. I just find it a little unfortunate that we have to refer to the appendices to read this; I would have appreciated a mention in the main document.

Correctness: As far as I checked, the proofs are correct, detailed and easy to follow. The decomposition into lemmata is judicious.

Clarity: Good readability and organization. The appendices are also well written. The notation could be slightly improved, as it is sometimes difficult to notice the dot above the “=” sign (introduced on line 138).

Relation to Prior Work: I find the paper well put in context.

Reproducibility: Yes

Additional Feedback: Post-rebuttal: I thank the authors for their detailed response, which confirms my positive opinion on this work.