NeurIPS 2020

Margins are Insufficient for Explaining Gradient Boosting


Review 1

Summary and Contributions: This paper demonstrates that the k’th margin bound is inadequate for explaining the performance of gradient boosting. Accordingly, it provides a refined margin bound for gradient boosting.

Strengths: While I am not an expert on this topic, I feel the material is well motivated and developed. The empirical and theoretical results provide some new insights into gradient boosting.

Weaknesses: UPDATE: Thank you for your response, which addressed most of my concerns. The only remaining issue is that controlling only the number of leaves can still be problematic, since the depth of the trees also matters [1].

[1] Reyzin L., Schapire R.E. How boosting the margin can also boost classifier complexity, ICML 2006.

=============================

I have several concerns and questions:

1. Line 97: what does “same size” mean? To make a fair comparison, the model complexity of the base learners must be the same. For decision trees, does this mean the same number of leaves? To me, the cleanest choice would be to use a decision stump as the base learner (see the sketch after this list).

2. The empirical results are also not convincing to me: 1) the results are averaged over only three runs, which is insufficient; 2) I would also like to see the standard deviations of the average accuracies; 3) the experiments are evaluated on only one dataset, and to make the conclusion more convincing the authors should compare on more datasets; 4) I also wonder whether the conclusion still holds with other base learners.

3. We know that AdaBoost is just a special case of gradient boosting with the exponential loss. Therefore, my feeling is that the analysis is really about gradient boosting with different loss functions, not about the behavior of gradient boosting itself.
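For concreteness, a minimal sketch of what a complexity-matched comparison could look like (assuming the scikit-learn API and a synthetic dataset; this is not the paper's actual setup, and the leaf budget of 8 is an arbitrary choice):

    # Hypothetical sketch: give AdaBoost and gradient boosting base learners of the
    # "same size" by fixing the number of leaves; a stump would correspond to 2 leaves.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    n_leaves = 8  # identical leaf budget for both ensembles (arbitrary choice)

    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_leaf_nodes=n_leaves),  # base_estimator= in older scikit-learn
        n_estimators=200, random_state=0,
    ).fit(X_tr, y_tr)

    gb = GradientBoostingClassifier(
        max_leaf_nodes=n_leaves, n_estimators=200, random_state=0,
    ).fit(X_tr, y_tr)

    print("AdaBoost test accuracy:", ada.score(X_te, y_te))
    print("GradBoost test accuracy:", gb.score(X_te, y_te))

Repeating such runs over many seeds and datasets, and reporting standard deviations, would address points 2.1–2.3 above.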

Correctness: I feel the theoretical analysis is correct, though I didn’t go through the proofs of the theorems in the paper.

Clarity: The overall structure of this paper is clear to me.

Relation to Prior Work: The authors give a review and discussion of previous work. The theoretical analysis is novel to me.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: The paper shows empirically that the margin of AdaBoost (AB) type ensembles grows with the number of combined classifiers (as already known), whereas the margin of gradient boosting (GB) diminishes as more base classifiers are combined (which was not known). Nevertheless, the generalization accuracy of GB is slightly better than that of AB. These observations contrast with the theoretical generalization bounds for both methods. Finally, the article derives a refined theoretical bound that is more in line with the observed values.
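A rough sketch of how this observation could be checked (assuming scikit-learn; for GB the raw score is divided by the accumulated learning rate as one crude normalization, which is an assumption and not necessarily the definition used in the paper):

    # Hypothetical sketch: track margins as the ensembles grow. With algorithm="SAMME"
    # AdaBoost's staged decision function is already normalized by the sum of the
    # estimator weights and lies in [-1, 1]; for gradient boosting we divide the raw
    # additive score by learning_rate * T as a crude, assumed normalization.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    s = 2 * y - 1  # labels in {-1, +1}

    ada = AdaBoostClassifier(n_estimators=300, algorithm="SAMME",  # may warn on newer scikit-learn
                             random_state=0).fit(X, y)
    gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1,
                                    random_state=0).fit(X, y)

    stages = zip(ada.staged_decision_function(X), gb.staged_decision_function(X))
    for t, (f_ada, f_gb) in enumerate(stages, start=1):
        if t % 100 == 0:
            margin_ada = np.median(s * np.ravel(f_ada))            # already in [-1, 1]
            margin_gb = np.median(s * np.ravel(f_gb)) / (0.1 * t)  # crude normalization
            print(f"T={t}: AdaBoost median margin {margin_ada:.3f}, "
                  f"GB median normalized margin {margin_gb:.3f}")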

Strengths: The paper is well written and sound (although I did not check the proofs in Section 5). The observation made about GB is really interesting and deserves attention.

Weaknesses: UPDATE: I read the authors' reply and I do not agree. In this text I will focus on the two-class problem, {-1, +1}, for simplicity. First, GB combines regressors, not classifiers, and their outputs cannot be normalized the way classifier outputs can. Second, the training of GB cannot be decoupled from the sigmoid, as the pseudo-residuals are computed as the sigmoid times the class (Friedman 1999, Section 4.5). In fact the raw output of GB, that is F(x), tends to the log-odds ratio of the two classes. This quantity is unbounded even if the individual regressors for the two-class problem are bounded in [-1, +1]. Doing some simple experiments, one can check that the raw outputs of GB tend (on the training data) to +inf for +1-class instances and to -inf for -1-class instances. This means that the pseudo-residuals tend to zero and the certainty of the ensemble tends to 100%. I also checked that if one divides the raw outputs of GB by the number of base learners (or by the sum of weights), the global outputs of the ensemble tend to 0 or to very low margins, but I believe this normalization is not related to the training process of GB at all. The outputs of the base learners in GB tend to decrease with the size of the ensemble for the instances already "learned" with high confidence. This occurs precisely because GB learns an unnormalized additive expansion. (A sketch of this experiment is given below.)

I only have one concern about the article: it is not clear to me how the margin and the output of GB are treated. GB uses regressors at each iteration. Hence, the outputs of the base learners of AB and GB are not directly comparable. It is true that the outputs of GB can be very small, as shown in Fig. 3 (I did not know, however, that most of them were so small). However, those outputs, after being linearly combined with the rest of the outputs, are passed through a sigmoid (for two-class classification) that produces the final output of GB as a probability. In this context, it is unclear to me how the margin of GB is computed. Is it computed as the probability of the correct class minus the probability of the other class? That would produce the standard [-1, 1] margin range. In any case, the small values of GB may produce larger changes after the sigmoid, which is different from AB, where the {-1, +1} outputs are directly summed up. From my point of view this should be clarified. I would also appreciate a deeper discussion of why the margin decreases with the number of base regressors in GB. Even if the outputs are small, that does not mean the margins should shrink: GB is designed to drive the pseudo-residuals to 0, and that would produce margin 1. I am willing to modify my score if this is clearly addressed.
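A small sketch of the experiment described above (assuming scikit-learn, whose GradientBoostingClassifier exposes the raw log-odds score via decision_function; the "probability margin" shown is only one possible definition, not necessarily the paper's):

    # Hypothetical sketch: on training data the raw additive score F(x) of gradient
    # boosting keeps growing, the sigmoid probabilities saturate (so the
    # pseudo-residuals go to ~0), while F(x) divided by the number of trees shrinks.
    import numpy as np
    from scipy.special import expit  # numerically stable sigmoid
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    s = 2 * y - 1  # labels in {-1, +1}

    gb = GradientBoostingClassifier(n_estimators=1000, learning_rate=0.1,
                                    random_state=0).fit(X, y)

    for t, F in enumerate(gb.staged_decision_function(X), start=1):
        if t in (10, 100, 1000):
            F = np.ravel(F)                 # raw log-odds score of the ensemble
            p = expit(F)                    # sigmoid -> P(y = +1 | x)
            prob_margin = s * (2 * p - 1)   # P(correct class) - P(other class)
            print(f"T={t:4d}: median |F| = {np.median(np.abs(F)):7.2f}, "
                  f"median |F|/T = {np.median(np.abs(F)) / t:.4f}, "
                  f"median probability margin = {np.median(prob_margin):.4f}")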

Correctness: I believe everything is correct up to Section 4. I did not check the derivations of Section 5.

Clarity: Yes, the paper is well written

Relation to Prior Work: Prior work is properly discussed.

Reproducibility: Yes

Additional Feedback: The legend and lines of the plots are very thin and difficult to read. There is a missing reference in Line 69. A small detail: L2 says "great practical performance with little fine-tuning." This is not really so, especially for GB. Just navigating the hundreds of hyper-parameters of LightGBM can cost a great amount of time, and in any case it needs to be tuned to get competitive results (learning rate, depth of the trees, etc.); I would nuance the sentence (see the sketch below). Another small detail: L19 says LightGBM is one of the best GB methods, and I am not so sure of that. I believe CatBoost and XGBoost, or even standard GB, are better than LightGBM in generalization capability; the benefit of LightGBM is its speed. The authors of [7] say it themselves in the abstract of their article: "LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving *almost the same accuracy*."
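To illustrate the tuning point, a minimal sketch of even a small LightGBM tuning pass (assuming the LightGBM scikit-learn wrapper and a synthetic dataset; the grid values are arbitrary, not recommendations):

    # Hypothetical sketch: even a "minimal" LightGBM search already involves several
    # interacting hyper-parameters; the grid below is an arbitrary example, not a
    # recommended configuration.
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    grid = {
        "learning_rate": [0.05, 0.1],
        "num_leaves": [31, 63],
        "n_estimators": [200, 500],
    }
    search = GridSearchCV(LGBMClassifier(random_state=0), grid, cv=3).fit(X, y)
    print(search.best_params_, search.best_score_)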


Review 3

Summary and Contributions: The main result of this paper is a new generalization bound for voting classifiers (in the context of boosting). The bound is stated as Theorem 1 in Section 5 (a strange place for the main result). It improves margin-based bounds for boosting, which depend on theta^{-2}, where theta is the margin parameter. Here, the bound depends on the maximum of theta^{-1} and Bla * theta^{-2}, where Bla is some strange moment of the average value of |f(x) - h(x)|^2, with f the voting classifier, x a random example, and h a random hypothesis sampled from the distribution induced by f. The authors show that the new bound correlates with the empirical relative performance of AdaBoost vs. LightGBM on some benchmark datasets.
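For reference, a classical margin bound for voting classifiers has roughly the following standard shape (paraphrased from the boosting literature, not copied from the paper), and the refined bound described above, with \Lambda standing in for the moment term called "Bla", has the shape sketched in the second display (the precise term and constants are those of the paper's Theorem 1):

    % Classical margin bound (standard form): with probability at least 1 - \delta
    % over a sample S of size m, every voting classifier f over a base class H and
    % every margin \theta > 0 satisfy
    \Pr_{\mathcal{D}}\!\left[ y f(x) \le 0 \right]
      \;\le\; \Pr_{S}\!\left[ y f(x) \le \theta \right]
      \;+\; O\!\left( \sqrt{ \frac{\ln|H| \, \ln m}{m \, \theta^{2}}
                              + \frac{\ln(1/\delta)}{m} } \right).

    % Shape of the refined bound as described in this review; \Lambda is a
    % placeholder for the moment of \mathbb{E}_{h \sim f} |f(x) - h(x)|^{2}:
    \Pr_{\mathcal{D}}\!\left[ y f(x) \le 0 \right]
      \;\le\; \Pr_{S}\!\left[ y f(x) \le \theta \right]
      \;+\; \tilde{O}\!\left( \sqrt{ \frac{\ln|H|}{m} \,
              \max\!\left\{ \frac{1}{\theta}, \; \frac{\Lambda}{\theta^{2}} \right\} } \right).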

Strengths: Solid theoretical contribution that aims at explaining experimental results.

Weaknesses: This is a retrospective explanation of existing results, which always carries the danger of over-fitting the theory to the results. A good theory should predict new phenomena or yield new algorithms. The authors note this, but do not derive a new algorithm from the new bound. I would give the paper a much higher score if the new bound had yielded a new algorithm that improves the state of the art in some respect.

Correctness: Proofs seem correct.

Clarity: Yes.

Relation to Prior Work: Reasonably well.

Reproducibility: Yes

Additional Feedback: