Review for NeurIPS paper: Sparse Learning with CART

NeurIPS 2020

Sparse Learning with CART

Review 1

Summary and Contributions: This paper studies the CART algorithm with a new point of view based on Pearson correlation. They connect several parts of the algorithm to this notion of correlation which allow to study (asymptotic and finite case) consistency for sparse models. Following the author's response: I thank the authors for the newly added section. It is definitely a good and interesting point to extend to ensemble of trees / random forests.

Strengths: The main interest of this work is the new way of studying the CART algorithm. To the best of my knowledge, this was not done using Pearson correlation and this is of interest to encourage researchers to use alternative ways to show consistency or other desirable properties of tree-based models.

Weaknesses: It is pointed in the conclusion but I think it is still the main weakness of this work. This work focuses only on CART for which its soundness is already quite well established and it is very hard to see (or to be convinced) that such results are useful for methods made of ensemble of randomised trees that are used in practice nowadays (and sometimes without optimal split search). I would like a more in-depth discussion than the classical principle of ensemble (that is better in average than a single decision tree).

Correctness: Yes, I think so.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: In appendix, some refs are not well compiled (or missing).

Review 2

Summary and Contributions: UPDATE: I read the rebuttal and I believe the added experiments clearly improve the paper. The paper analyzes the statistical properties of splits for regression decision trees (DT) and find that decision splits are related to Pearson correlation. Specifically the pearson correlation of the split output and response variable, influences the location of the split (the higher the correlation, the better the split) and the error reduction in the tree. This is specially relevant in sparse problems in which the number of informative variables is much smaller than the number of variables (d0<<d). For this cases, the decision trees are consistent, which is not possible with other methods such as K-NN or kernel methods without a prior dimensionality reduction.

Strengths: The paper seems sound in their derivations and, as far as I am aware, novel.

Weaknesses: From my point of view I would include an empirical evaluation of some sort to illustrate the main results. Even using ad-hoc synthetic datasets would be of interest. Another aspect that is unclear to me is what are the consequences of "Suppose X is uniformly distributed" in the results since the data is not generally uniformly distributed.

Correctness: There is no empirical methodology as the paper is theoretical. Regarding the theorems they seem to be correct but I did not fully revise them

Clarity: The paper is clear although I would carry out a proof reading of it as there are some typos.

Relation to Prior Work: Yes. Another reference that may be of interest is: A. B. Nobel, "Analysis of a complexity-based pruning scheme for classification trees," in IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2362-2368, Aug. 2002, doi: 10.1109/TIT.2002.800482.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper sought to investigate theoretical and statistical properties of the CART methodology that are often overlooked. It provides an in-depth explanation and analysis of a CART algorithm, allowing others to replicate it. In doing so, it proves the reduction in training error among every recursive binary split. Additionally, models trained with this methodology have a high probability of having a bounded training error, even with arbitrary response sparsity. Proving the effectiveness of the model is important to anyone who is seeking to make a data-dependent decision. It also sheds light on the dependency decision stumps have on the most impactful data. Where other algorithms may have a stronger bias because of the assumption that all data is relevant, models following the CART methodology are less susceptible to approximation error because the recursive splits are based on the most relevant data. This helps any data-driven decision makers compare and contrast different supervised learning algorithms.

Strengths: see above

Weaknesses: see above

Correctness: see above

Clarity: Overall, I think this paper is well written because of its thorough and intuitive structure. However, it seems difficult for readers without expertise in the subject area to follow along with the mathematical notation. The summaries following the sections with mathematical notation serve as a good supplement for a complete understanding of the paper.s

Relation to Prior Work: see above

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper demonstrates that the training error of CART is governed by the Pearson correlation between the optimal decision stump and response data in each node. Based on this observation, the authors achieve an optimal complexity/goodness-of-fit trade-off and prove that the convergence rate of prediction errors is controlled by data dependent quantities.

Strengths: -There are solid analyses on the derivation of the connection between the Pearson correlation and training error, as well as the convergence of prediction errors. -The interpretability of the decision tree algorithms is demonstrated with thorough mathematical proof.

Weaknesses: -Given that the scope of this paper, as indicated in the title, is high-dimensional and sparse learning with CART, I would like to see a concrete performance evaluation of the proposed regression model on real datasets in addition to the theoretical deduction. -In Assumption 1, it is not clear why the quantity governs the convergence rate of both the training error and prediction. -In Ln. 56, the authors claim that the training error is bounded instead of designing the proxy task for approximation error, but the motivation is not clear and detailed to me. The authors' rebuttal provides detailed feedback on my concerns about the training error and convergence. The added experiments of comparing with simple KNN improve the paper to some extent.

Correctness: The methodology is correct in this paper.

Clarity: Most of the paper is clear and well written.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: See the weakness.