Summary and Contributions: The authors propose to cut the link from M to X to remove the bad confounder bias while keeping the good bias. This removal entails (or derives) normalized classifiers [38,39], which have been shown to be effective for the same task. Empirical validation is done with image classification on ImageNet-LT and instance segmentation on the LVIS dataset.
Strengths:
S1. Clear motivation and a derivation leading to the normalized classifier.
S2. Clear gains from the proposed method.
Weaknesses:
W1. Some parts of the derivation are not clear.
- Why does the do() operation yield the conditional probability in Eq. (3)? Without do(), isn't the conditional probability the same?
- There is no explanation of why f() and g() are expanded in such a way.
- Isn't the range of \alpha from 0 to 1? But in Figure 5, \alpha is swept from 0 to 5.
W2. The condition under which Assumption 1 breaks.
- This assumption seems valid only when the head/tail proportion is dramatically large (or small), i.e., when the head classes dominate 99% of the data. At what proportion does the assumption start to break?
===== after rebuttal =====
The authors' rebuttal clarifies many of my questions. I encourage the authors to clarify the items I asked about in the final version.
Correctness: C1. The derivation seems correct, but I am not 100% sure (see my questions in W1). C2. The empirical methodology seems correct.
Clarity: It reads well.
Relation to Prior Work: The literature review is not very thorough but all the necessary comparisons are there.
Summary and Contributions: In this paper, the authors focus on the effect of optimization momentum in long-tailed recognition problems and propose a causal inference framework that enables a theoretical comparison with previous works. Their solution, based on the causal graph, proves effective on two recognition benchmarks.
Strengths: 1) The theoretical grounding is solid and reasonable; the method matches the theory well and comes with a proper ablation study. 2) They revisit previous methods in Table 1 and Section 4.3 and provide adequate comparison and analysis based on their causal inference framework. 3) They perform experiments on both single-label (classification) and multi-label (segmentation) benchmarks to demonstrate the effectiveness of their method.
Weaknesses: 1) Have you conducted experimental comparisons with the works you mentioned, which are also quite recent and have code released? 2) The organization of the related work section might need refinement; e.g., the Hard Example Mining part seems too short and is not well discussed.
Correctness: Yes, most of the claims and methodology are correct.
Clarity: Yes, it’s well presented overall, especially the explanation of the causal graph they proposed.
Relation to Prior Work: Yes, the related work is clear.
Additional Feedback: The discussion of \alpha and Figure 5 in L151-153 is hard to follow before reading Section 4; maybe you can adjust the presentation here. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% After rebuttal period: The rebuttal has addressed most of my concerns on evaluation and clarity, so I keep my rating as "marginally above the acceptance threshold".
Summary and Contributions: The paper deals with learning classification models on unbalanced datasets. The authors argue that the standard momentum term often used in SGD is detrimental for learning classification problems with long-tailed class distributions, and they use a causal model to study the impact of momentum during such learning and to propose a way to amend this effect. The authors show that the existing two-stage training approach (where features are learned on unbalanced data and the final classifier is trained with re-balanced/re-weighted classes) can be regarded as a special case of their theoretical approach.
Strengths: The paper presents state-of-the-art results on ImageNet-LT, and favorable results in comparison to a now-standard baseline method in object detection and instance segmentation (Tan et al.'s equalization loss, on the LVIS benchmark). The content of the paper is relevant to the NeurIPS community, as it may address a shortcoming of one of the most fundamental techniques used in deep learning to date.
Weaknesses: It is a bit unclear how the multi-head strategy divides features and weights into K groups [Eq. 6]. Does it mean that those feature channels are treated as independent from each other during learning? How sensitive is the method to an inappropriate choice of K?
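For concreteness, my current reading of the K-group split in Eq. 6 is sketched below: the feature and each class weight are split channel-wise into K equal groups, each group is scored with a normalized dot product, and the group scores are averaged. The function name and the values of tau and gamma are my own assumptions, not taken from the paper.

```python
import numpy as np

def multi_head_logits(x, w, k=2, tau=16.0, gamma=1.0 / 32):
    """Possible reading of the multi-head normalized classifier:
    split feature x (D,) and class weights w (C, D) into k equal
    channel groups, score each group with a normalized dot product,
    and average the per-group scores."""
    xs = np.split(x, k)              # k feature groups, each (D/k,)
    ws = np.split(w, k, axis=1)      # k weight groups, each (C, D/k)
    logits = np.zeros(w.shape[0])
    for xk, wk in zip(xs, ws):
        logits += (wk @ xk) / ((np.linalg.norm(xk) + gamma)
                               * np.linalg.norm(wk, axis=1))
    return tau * logits / k
```

If this reading is right, the groups interact only through the shared feature extractor, which is exactly why I ask whether they are effectively independent during learning.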
Correctness: The presented formulation seems correct, but please refer to the weakness item mentioned above.
Clarity: Yes, the paper is well written, theoretically justified, with sufficient references, adequate terminology, and almost no typos.
Relation to Prior Work: The paper adequately references prior work, and further explores these prior works under their proposed causal model.
Additional Feedback: I've read the authors' feedback, the fellow reviewers' reviews, and the responses to their reviews. The authors adequately addressed my concerns. I keep my overall score as "a good submission; accept."
Summary and Contributions: The paper proposes a new perspective that SGD momentum deteriorates classifier performance in the presence of class imbalance. Intuitively speaking, momentum favors the direction of head classes and deviates the final prediction features from a neutral position. The paper formalizes the contribution of momentum in a causal diagram and derives a de-confounded training formulation based on the backdoor adjustment, together with a post-processing inference procedure based on the Total Direct Effect (TDE).
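As I understand the TDE inference step, it subtracts the part of the prediction explained by a running average of the feature direction, keeping only the direct effect. A minimal sketch of that idea follows; the function name, the default alpha, and the simplification to unnormalized logits are my assumptions, not the paper's exact formulation.

```python
import numpy as np

def tde_logits(x, w, d_hat, alpha=1.0):
    """Sketch of TDE-style inference: remove the component of the
    feature x (D,) that lies along the running 'momentum' direction
    d_hat (D,) before scoring with class weights w (C, D)."""
    d_unit = d_hat / (np.linalg.norm(d_hat) + 1e-12)
    x_along = (x @ d_unit) * d_unit   # projection of x onto d_hat
    # factual logits minus alpha times the counterfactual contribution
    return w @ x - alpha * (w @ x_along)
```

With alpha = 1 this removes the d_hat-aligned component entirely; intermediate alpha values trade off how much of the momentum-induced effect is kept, which connects to the alpha sweep discussed elsewhere in the paper.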
Strengths: The perspective of viewing momentum's effect on imbalanced classification through causal relations is novel and can potentially lead to a more theoretically grounded study of class imbalance. Experimentally, the paper shows good performance on two tasks, image classification and object detection, with one dataset each; the two tasks and respective datasets are representative of the class imbalance issue and have been used by previous works on the same topic. Class imbalance is a practical problem in machine learning, and this work attempts to explain the effect of optimization on the prediction results. Therefore, it is very relevant to the NeurIPS community.
Weaknesses: Despite its intriguing new perspective, the paper has some weaknesses that need to be further addressed. First, the paper lacks experimental comparisons to many other long-tailed classification methods such as LDAM, balanced loss, and BNN, even though they were mentioned in related work. Second, the multi-head strategy is not related to the claimed theoretical grounding, and it makes judging the effectiveness of the theoretical framework more difficult. From the reviewer's point of view, a fairer comparison would simply use K=1, as in other imbalanced classification frameworks. Third, the final form of the de-confounded training is very similar to previous works, with the only difference being the hyperparameter gamma in Equation 7. It is unclear to the reviewer whether the performance improvement comes from tuning this hyperparameter, which is not directly inspired by the theoretical framework.
Correctness: Yes, the claims and empirical methodology are correct.
Clarity: The paper is well written with good supporting materials in supplementary.
Relation to Prior Work: The work listed previous works and discussed its differences from them.
Additional Feedback: The authors have addressed my concerns on comparisons and the inclusion of multi-head strategy. I think it is a well-motivated and well-compared method.