Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper considers the piecewise linear regression problem and proposes a split selection algorithm followed by a pruning operation to capture the underlying partitions of the piecewise linear model. The language of the paper is okay but it is hard to follow the paper at certain points due to the heavy notation. I would recommend the authors to provide a verbal description of the meaning of such equations and inequalities. The main idea of the paper is interesting but I'm not sure if it is enough to be published at NeurIPS. Theoretical results do not really provide any intuition about how well the algorithm will perform when implemented, although I acknowledge that proving such a result is tricky. Given that the theoretical contribution of the paper is not outstanding, I would've expected the experiments to provide more insight about the algorithm. However, the proposed algorithm is only compared against 20-30 year old methods, which does not say much about its performance. I would recommend the authors to see the following more recent articles on tree-based algorithms: Wang et al. "Local Supervised Learning through Space Partitioning" 2012. Vanli et al. "A Comprehensive Approach to Universal Piecewise Nonlinear Regression Based on Trees" 2014. Ikonomvska et al. "Online tree-based ensembles and option trees for regression on evolving data streams" 2015. Based on the above reasons, I think this paper is not ready for publication yet. As a side note, the supplementary material contains huge amount of typos, so I would recommend the authors to go over it carefully.
Some comments per section: Section 2 - Line 128: I am not sure how the distance measure between two pairs (pair of index and split value) should be interpreted/computed. - In case of presence of correlated covariates, is the algorithm able to discover the required linear transformation simultaneously with the required local split value? Section 3 - As for correlated covariates, please comment on how this is handled in Algorithm 2. I don't see any reference of matrix $P$. Experiments - How big is the ensemble size? - Please comment on how correlated covariates were handled for each of the datasets. Summary Originality: the proposed method is an easy extension of existing methodologies. Quality: overall quality is average because of the clarity problem (see below). Clarity: the paper overall is clearly written - however, could use some work in organization and maybe bring in one of the theoretical results inside the main writeup. Significance: because the work is largely based on existing methodologies, either a strong theoretical result or an empirical result should have been provided. Given that the theoretical result in this paper is not quite strong, the paper is more empirically oriented.
The proposed partitioning process fit the least square regression over the node and select the optimal split. In addition, for each leaf node, Lasso linear regression model is fitted. My concern is that this can be very slow for large-scale datasets. It is interesting to see the empirical time cost in the experiments. In the experiments, authors compared baselines such as GUIDE, CART and MARs on 9 datasets, and found CART is the worst method. It is reasonable to think CART might not be the best for base classifiers in the ensemble. Although authors compared WRF of SLRT with RF/WRF of CART, it is better to have comparing results with base classifier such as GUIDE and MARS to see how much the proposed method can improve. LASSO linear regression on leaf nodes is used for prediction. How did authors determine the regularization parameter in the experiments? How sensitive it is to affect the predictive results.
Originality: The paper is fairly original in that it proposes a new tree-splitting criterion that seems to work very well when the leaves are linear models rather than constants. It also provides a novel application of several pieces of previous work, including LASSO and random forests. There are adequate citations of related work. Quality: I did not carefully check the math or read the proofs in the supplemental material, but I did not observe any technical mistakes. There is not much discussion of the limitations of their approach. Clarity: The paper is written very clearly; I could understand everything without difficulty. Significance: This paper does seem to advance the state of the art in learning with tree-based models. Unless their experimental section is misleading, I expect this work to be generally useful to researchers and practitioners of tree-based modeling. Detailed notes: * You mention that the SLRT algorithm is consistent; does this property hold only in cases where the response variable is truly a piecewise-linear function of the regressors, with axis-aligned rectangular subspaces? I would expect the majority of real-world data not to conform to this particular shape. What can be said about SLRT in such cases? * Line 38: What is meant by "prevailing" in this sentence? You mention theoretical analysis, so I would assume you have some precise definition in mind, but I don't know what it would be. * Line 46: This sentence seems ungrammatical. Did you mean "Futhermore, by implanting SLRT as the base predictor..."? * Line 76: Typo: "method" should be "methods". * Line 305: Typo: "SIsson" should be "Sisson". * In Equations (2) and (4), you include examples that are equal to the split level in the Left child, but in Line 8 of Algorithm 1, you include them in the Right child. Is this a mistake? * How do you determine whether to use Algorithm 1 or Algorithm 2? Do you use some threshold for the sample correlations between pairs of regressors? * Line 193: Did you mean "subset or equal to" instead of "strict subset"? * Line 202: You don't want to "minimize the average predictive accuracy", so maybe replace "minimize" with "optimize". * Line 203: Typo, replace "optimially" with "optimal". * Line 232: I think the \sigma_i in the numerator should be \sigma_b.