Review for NeurIPS paper: Estimating the Effects of Continuous-valued Interventions using Generative Adversarial Networks

NeurIPS 2020

Estimating the Effects of Continuous-valued Interventions using Generative Adversarial Networks

Review 1

Summary and Contributions: The paper proposes a GAN-based framework that learns the distribution of the unobserved counterfactuals in order to estimate the effect of continuous interventions. The presented generator-discriminator framework deals effectively with estimating the outcomes of continuous interventions. The paper also introduces a new semi-synthetic data simulation for the continuous intervention setting and validates the model on this dataset against pre-existing benchmarks.

Strengths: The paper is well written, well presented, and conveys its main idea clearly. I love the idea of training the counterfactual generator and discriminator adversarially, such that the generator competes with the discriminator by generating counterfactuals. The authors also justify the presented solution theoretically.

Weaknesses: A more detailed description of the semi-synthetic dataset (e.g., information about the features, interventions, and treatments) should be included in the appendix for clinical clarity.

Correctness: The technical solutions presented in the paper look reasonable.

Clarity: The paper is well-written.

Relation to Prior Work: The paper clearly discusses its connections to pre-existing work, addressing the limitations of that work and how the proposed approach tackles them.

Reproducibility: Yes

Additional Feedback: I really enjoyed reading this paper. I find the idea of simultaneously estimating counterfactual outcomes for continuous interventions very interesting, and I believe the proposed methodology can have a significant impact in healthcare: estimating an individual-level response to dosage can be very practical and useful in clinical decision-making. [Additional comment] The authors addressed most of the concerns I raised. I would like to have this paper accepted.

Review 2

Summary and Contributions: The paper addresses the important problem of estimating the effect of continuous treatment variables, provides theoretical guarantees for the proposed solution, and significantly improves over state-of-the-art baselines. The writing is clear and the experiment section is exhaustive.

Strengths: [S.N] Novelty: existing work on estimating treatment effects is mostly concerned with binary/categorical treatments, whereas the proposed method can handle continuous as well as discrete interventions. [S.E] Empirical evaluation: thorough. [S.R] Relevance: important for the adoption of machine/deep learning in sensitive domains such as medicine.

Weaknesses: [W.N] Novelty: it seems the paper draws a lot of inspiration from related works ([2, 6]); however, the contributions of SCIGAN are clear to me with respect to the empirical and theoretical results. [W.E.1] Empirical: in the conclusion (line 308) the authors mention, “SCIGAN needs a few thousands of training samples”. Is there a sample efficiency experiment supporting this result? [W.E.2] Appendix K: Was PCA used for all baselines or just for GPS, and is there a difference in the results with/without PCA? [W.E.3] Will the code, results, simulated data, and generators be made available?

Correctness: To the best of my knowledge, yes.

Clarity: Yes, only a few minor comments, see "Additional Feedback".

Relation to Prior Work: Yes, this has been made clear throughout the text and also with additional results in the Appendix.

Reproducibility: Yes

Additional Feedback: I wonder whether the authors considered the issue of calibration with respect to the data distribution. Can this have an impact on the results for estimating ITE? It has been shown that the generator distribution does not match the true data distribution (for example [1], [2]), which can be accounted for with customized models.
[1] Dai, Zihang, et al. "Calibrating energy-based generative adversarial networks." arXiv preprint arXiv:1702.01691 (2017).
[2] Hitaj, Briland, et al. "PassGAN: A deep learning approach for password guessing." International Conference on Applied Cryptography and Network Security. Springer, Cham, 2019.
Minor:
- Maybe replace "Intervention" with "Treatment" in Fig 1.
- There is no bold annotation in Table 10.
- I'm not sure I understand why the term "hierarchical" is used rather than "ensemble" of discriminator networks.
- Maybe rename Theorem 2 in the Appendix (currently enumerated the same as in the main text).
- Could you elaborate more on your motivation for including permutation invariance and equivariance in the discriminator?

Review 3

Summary and Contributions: This work proposes SCIGAN to learn the distribution of the unobserved counterfactuals. While previous works focus on estimating the effect of discrete interventions, this work focuses on the continuous-valued intervention setting.

Strengths: 1. This work leverages generative adversarial networks to estimate the effect of continuous-valued interventions, and can also be applied to the discrete case. 2. A hierarchical discriminator is proposed to handle the complexity of the multiple-treatment setting and the structure of the continuous intervention setting. 3. It has been shown theoretically that the learned counterfactuals agree with the true data in marginal distribution. 4. Experimental results show superior performance compared to other methods.

Weaknesses: 1. One issue with the current presentation of the paper is that the notation is very messy and hard to follow. A simplified notation would have been better. 2. Many of the details are omitted from the paper and left to the appendices. 3. The stopping criterion for GAN training (e.g., Inception Score, FID) is not discussed. 4. Did the authors face any of the well-known issues with training GANs, such as vanishing gradients or mode collapse? If so, how did they avoid these issues to achieve stable training? 5. In the first paragraph of Section 6, the authors mention that meaningful evaluation on real-world datasets is not feasible. If that is the case, then what is the use of the current work in real applications? And does that not contradict the authors' claim in the Introduction about the applicability of the work in many domains, such as medicine?

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: