__ Summary and Contributions__: The paper provides a subgrouping algorithm that estimates confidence intervals on the mean outcome and then uses the interval bounds to produce improved subgroups.
The idea is that confidence intervals will capture uncertainty in effect estimates which can then provide a way to control heterogeneity within a subgroup.

__ Strengths__: The paper provides a new method for using confidence interval estimates to improve subgrouping.
The idea is intuitive and seems to work well in the experiments.
The theoretical claims are reasonable but could be better.
The experiments are broad, which adds to the merit of the method and the paper.

__ Weaknesses__: The paper needs to do a much better job of explaining what happens when the underlying mean outcome estimate is bad.
For example, what if I predict random normal noise as mean outcome for each sample in the dataset? How does this affect the estimated subgroups?
How small must the ITE error be for the subgroups to be meaningful? Such error bounds would significantly improve the usability of R2P.
Finally, using ITE estimates based on separate mean outcome estimates induces error due to covariate imbalance. How does this affect R2P-HTE?
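To make the last concern concrete, here is a minimal toy sketch, not the authors' method. It assumes a single covariate, noise-free outcomes, a constant true effect, and deliberately misspecified linear outcome models fitted separately on non-overlapping treated and control ranges. All names (`fit_line`, `ite_hat`, `TRUE_EFFECT`) are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


TRUE_EFFECT = 1.0  # assumed constant treatment effect for this toy example


def mu0(x):
    """True control outcome (assumed quadratic)."""
    return x ** 2


def mu1(x):
    """True treated outcome: control outcome plus a constant effect."""
    return x ** 2 + TRUE_EFFECT


# Covariate imbalance: controls only cover [0, 1], treated only cover [1, 2].
x_control = [0.0, 0.25, 0.5, 0.75, 1.0]
x_treated = [1.0, 1.25, 1.5, 1.75, 2.0]

# Separate mean outcome estimates, one per arm.
a0, b0 = fit_line(x_control, [mu0(x) for x in x_control])
a1, b1 = fit_line(x_treated, [mu1(x) for x in x_treated])


def ite_hat(x):
    """Plug-in ITE estimate: difference of the two fitted outcome models."""
    return (a1 + b1 * x) - (a0 + b0 * x)


for x in (0.0, 1.0, 2.0):
    print(x, ite_hat(x), TRUE_EFFECT)
```

In this setup the plug-in estimate varies with x even though the true effect is constant, so any subgrouping built on it would see heterogeneity that is purely an artifact of imbalance and misspecification.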

__ Correctness__: The claims are correct.

__ Clarity__: The paper is well written.

__ Relation to Prior Work__: The related work should be explained better. In particular, why is it better not to rely on the ITE estimation algorithm? Where does this help?

__ Reproducibility__: Yes

__ Additional Feedback__: Update after rebuttal:
Thanks for clarifying the claims. I'm updating my score accordingly.

__ Summary and Contributions__: This paper presents an outlier detection approach based on self-supervised training. The proposed approach combines an investigation of pretrained features with self-supervised learning. The paper develops a novel method for subgroup analysis, which is both more reliable and more informative than previous methods via promising confidence estimates. Experiments demonstrate its effectiveness compared with various baselines.

__ Strengths__: The problem is meaningful and interesting.
The paper is clearly structured, which makes it easy to follow and review.
The methodology part is clearly presented and solid.
Experimental results validate the effectiveness of the proposed model on both synthetic and real-world datasets, especially in COVID-19 applications.
The supplementary materials validate the effectiveness of the proposed methods with a concrete experimental design, and the provided code supports reproducibility.

__ Weaknesses__: The methodology part lacks deeper understanding and innovation. How does quantifying the uncertainty benefit the ITE process, and how is the proposed method specifically tailored to the given task? More relevant analysis is expected.
The analysis of the experimental results is insufficient. Uncertainty usually brings crucial information to the target task. The proposed method quantifies the uncertainty in the methodology part, yet deeper insights and analysis are expected in the experiment part.

__ Correctness__: The claims and methodology parts are technically sound.

__ Clarity__: The paper is clearly structured, which makes it easy to follow and review.

__ Relation to Prior Work__: It clearly states and discusses the technical contributions.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The authors propose robust recursive partitioning (R2P), a recursive partitioning scheme which incorporates conformal prediction. Instead of a traditional homogeneity criterion (such as mean squared-error), R2P splits to optimize a "confident homogeneity" criterion which penalizes cases where the fixed subgroup means lie outside of covariate-dependent confidence intervals estimated by conformal prediction. This loss is regularized by the subgroup confidence interval widths, to avoid degenerate solutions.
R2P is straightforwardly extended from the regression setting into the causal inference setting for estimation of individual treatment effects. Experiments are run on two simulated datasets, as well as the IHDP and CPP semi-synthetic datasets, showing improved across-partition heterogeneity and within-partition homogeneity.
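As a rough illustration of the summary above, the sketch below pairs a basic split-conformal interval with a toy subgroup score. It is not the paper's definition. The names `conformal_halfwidth`, `confident_homogeneity`, and `lam` are hypothetical, the interval has constant width (whereas the paper's intervals are covariate-dependent), the subgroup mean is taken as the mean of interval midpoints, and the convex weighting by `lam` is an assumption.

```python
import math


def conformal_halfwidth(residuals, alpha):
    """Split-conformal half-width from held-out calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual;
    returns infinity when the calibration set is too small for that rank.
    """
    n = len(residuals)
    # Small tolerance guards against floating-point error in the product.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-12)
    if k > n:
        return math.inf
    return sorted(abs(r) for r in residuals)[k - 1]


def confident_homogeneity(intervals, lam):
    """Toy score for one subgroup (lower is better); an assumed form.

    intervals: list of (lo, hi) per-sample confidence intervals.
    lam: weight on average interval width versus the miss penalty.
    """
    mids = [(lo + hi) / 2 for lo, hi in intervals]
    group_mean = sum(mids) / len(mids)
    # Distance from the fixed subgroup mean to each interval (0 if inside).
    miss = sum(
        max(lo - group_mean, 0.0, group_mean - hi) for lo, hi in intervals
    ) / len(intervals)
    width = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return lam * width + (1 - lam) * miss


q = conformal_halfwidth([0.1, -0.4, 0.2, -0.3], alpha=0.2)
preds = [1.0, 1.2, 3.0]
intervals = [(p - q, p + q) for p in preds]
print(q, confident_homogeneity(intervals, lam=0.5))
```

With the width term switched off (`lam = 0`), a subgroup whose mean falls inside every member's interval scores zero, which is why some width regularization is needed to rule out degenerate solutions.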
---
Update: I thank the authors for their response. In particular I am grateful for helping me understand why they did not evaluate R2P as an ITE estimator, and including Table R1 in the rebuttal. I am increasing my score from a 5 to a 6.
My main concern remains the same: "it's difficult to disentangle the effects of a more powerful ITE estimator (the causal GP) from the added benefits of R2P". The authors acknowledge that this is a major drawback of the paper. I would insist they run appropriate ablation studies in future revisions.

__ Strengths__: The proposed R2P scheme incorporating conformal prediction is quite reasonable and well-motivated. As far as I can tell, this is a novel contribution and likely of relevance to the community.

__ Weaknesses__: The experiments section feels a bit rushed and could use improvement.
First, I'd like to see details added to the main text:
(1) Explanations of how synthetic datasets A and B were derived (this is already in the Appendix). In particular it's impossible to interpret Figure 3 without details on the data-generating process for synthetic dataset B.
(2) Explanations of how hyperparameters (lambda especially) were chosen (some of this is already in the Appendix). The experiments showing the effect of tuning lambda are helpful, but it looks like lambda was fixed to 0 for CPP and 0.5 for all other datasets. What's behind this choice of lambda? If the reader was interested in tuning lambda, what would the authors recommend as a selection criterion?
(3) Evaluation on effectiveness of predicting the ITE, for example via RMSE of the estimated ITEs such as in Johansson, Shalit, Sontag ICML 2016. While evaluation on variance across groups and variance within groups is intuitive and valuable, to me RMSE of estimated ITE is a more straightforward way to evaluate estimator effectiveness.
(4) Can we add, as a baseline for comparison, an ITE method that's more powerful than simple decision trees? For example, we could take predictions from a causal forest and divide subgroups by quantiles of predicted ITE. At the moment all baselines use fairly simple decision tree estimators. As a result it's difficult to disentangle the effects of a more powerful ITE estimator (the causal GP) from the added benefits of R2P. It'd be interesting as well to see R2P applied with a decision tree ITE estimator instead of the causal GP estimator.
(5) Table 2 is a bit confusing to me, since R2P isn't compared to any other methods. Would it make sense to compare to, again, subgroups defined by quantiles of predicted ITE as determined by a baseline estimator such as a causal forest?
Second, it's difficult to see how the experimental results show that R2P prevents false discovery. This may be implied by some of the differences in variance across subgroups and variance within subgroups, but it's not obvious to me and some explanation would be much appreciated. Is there a more direct way of measuring false discovery? For example a simulation closer to that of Figure 1? Otherwise I'm not sure how much of the claim that R2P prevents false discovery is empirically backed up.
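For points (3) to (5), the requested evaluation and baseline could look roughly like the sketch below. It assumes ground-truth ITEs are available, which holds only for simulated or semi-synthetic data, and that predicted ITEs come from some baseline estimator such as a causal forest (not implemented here). The function names `ite_rmse` and `quantile_subgroups` are hypothetical.

```python
import math


def ite_rmse(tau_true, tau_hat):
    """Root mean squared error between true and estimated ITEs."""
    n = len(tau_true)
    return math.sqrt(sum((t - h) ** 2 for t, h in zip(tau_true, tau_hat)) / n)


def quantile_subgroups(tau_hat, n_groups):
    """Assign samples to n_groups near-equal bins by rank of predicted ITE.

    Returns one integer label in [0, n_groups) per sample; label 0 holds
    the samples with the smallest predicted ITE.
    """
    n = len(tau_hat)
    order = sorted(range(n), key=lambda i: tau_hat[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * n_groups // n
    return labels


tau_true = [0.0, -1.0, 2.0, 1.0]
tau_hat = [0.3, -1.0, 2.0, 0.9]  # stand-in for a baseline's predictions
print(ite_rmse(tau_true, tau_hat))
print(quantile_subgroups(tau_hat, n_groups=2))
```

The resulting labels can then be scored with the same across-group and within-group variance metrics used for R2P, which would give Table 2 a point of comparison.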

__ Correctness__: For the most part claims and methods look correct to me. See the above section for some discussion on the claim that R2P prevents false discovery.

__ Clarity__: Early parts of the paper are quite well written, especially Sections 1-3. Latter parts feel a bit messier.
For example, on lines 284-287, the evaluation metrics of variance are defined in terms of D_l^te. But is this the variance of y for each data point? Or variance of tau? The text specifies the latter, but the notation could be more explicit.
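One way to make the notation explicit is to write the metrics directly in terms of the per-sample treatment effect tau. The sketch below is an assumed, unweighted form, not necessarily the paper's definition: across-group variance is the variance of subgroup mean effects, and within-group variance is the average of per-subgroup variances of tau. The name `subgroup_variances` is hypothetical.

```python
from statistics import mean, pvariance


def subgroup_variances(tau, labels):
    """Return (variance across subgroups, mean variance within subgroups).

    tau: per-sample treatment effects on the test set.
    labels: subgroup label for each sample, same length as tau.
    """
    groups = {}
    for t, g in zip(tau, labels):
        groups.setdefault(g, []).append(t)
    group_means = [mean(v) for v in groups.values()]
    v_across = pvariance(group_means)
    v_within = mean([pvariance(v) for v in groups.values()])
    return v_across, v_within


# Well-separated, internally homogeneous subgroups: high across, zero within.
print(subgroup_variances([1, 1, 3, 3], [0, 0, 1, 1]))
# Uninformative subgroups: zero across, high within.
print(subgroup_variances([0, 2, 0, 2], [0, 0, 1, 1]))
```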

__ Relation to Prior Work__: Relation to prior work is clearly discussed in Section 4.

__ Reproducibility__: Yes

__ Additional Feedback__: On lines 200-201: "Note that confident homogeneity does not necessarily improve as the group size shrinks because smaller groups lead to greater uncertainty". Could you explain this claim a bit further? It's not obvious to me that a smaller group will always yield wider confidence intervals.
Typo on line 271: "here abbreviate".
Figure 3: I assume these results are for only one simulation of dataset B. Is there a way to compare the treatment effects across identified subgroups for all 50 simulations?