Review for NeurIPS paper: Identifying Mislabeled Data using the Area Under the Margin Ranking

NeurIPS 2020

Review 1

Summary and Contributions: The paper introduces a new metric, the area under the margin (AUM), which measures how ambiguous or mislabeled an example is. The metric is then used to clean datasets such as CIFAR and ImageNet, and retraining on the cleaned data improves performance on each. *** I would like to thank the authors for clarifying my questions. I maintain my support for the paper.

Strengths: The paper improves over state-of-the-art techniques on a problem that matters to the community: identifying large portions of a dataset whose removal improves performance. The paper is concise, well-written, and enjoyable to read.

Weaknesses: I found no strong weaknesses in the paper.

Correctness: The claims and methodology seem sound and clean.

Clarity: yes

Relation to Prior Work: Yes, as far as I can tell.

Reproducibility: Yes

Additional Feedback: I would like to thank the authors for a fun paper! I had a few minor questions and remarks. 1) I think there might be an interesting relationship between how clean a dataset is and the recent work on double descent. Double descent is often observed when the data distribution has natural noise or, equivalently, when the model is misspecified. I conjecture that the benefit of your method would be proportional to the peak in test error observed in epoch-wise double descent. This could be an interesting discussion point in the paper. 2) I'm wondering about the effect of data augmentation on the performance of your method. On the one hand, when data augmentation is present, cleaning the dataset of ambiguous examples should have a compounding effect on performance, since the optimization encounters many more copies (e.g., flipped versions) of the bad examples. On the other hand, data augmentation serves as regularization, so the model may be smoother in some sense, which would reduce the effectiveness of your method. 3) Given the discussion in "limitations" I would presume that this is not true, but do you think that your method would be able to detect dataset poisoning? Minor comments/typos: Line 96: I am not sure about the empirical work on margin, but the theoretical work, as far as I know, has very little to do with what actually happens in real models. That is, the margin-based generalization bounds involve prohibitively large norms, rendering the bounds trivial. Figure 2 is slightly truncated. Line 266: Why is a 2% difference in top-1 error not significant?

Review 2

Summary and Contributions: The paper proposes an area under the margin (AUM) statistic to identify mislabeled samples by measuring the average difference between the logit values for a sample’s assigned class and its highest non-assigned class. Further, the paper proposes adding an extra class populated with specially re-assigned threshold samples to learn an upper bound on the AUM of mislabeled data. The experimental results on synthetic and real-world datasets show the effectiveness of the proposed method.
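As a rough illustration of the statistic summarized above, the following sketch computes a per-epoch margin (assigned-class logit minus the largest other logit) and averages it over epochs. The function names, the choice to average over every recorded epoch, and the toy logits are assumptions made for this example; they are not taken from the paper or its code.

```python
from typing import Sequence


def margin(logits: Sequence[float], assigned: int) -> float:
    """Assigned-class logit minus the largest logit among the other classes."""
    largest_other = max(z for i, z in enumerate(logits) if i != assigned)
    return logits[assigned] - largest_other


def area_under_margin(logits_per_epoch: Sequence[Sequence[float]], assigned: int) -> float:
    """Average of the per-epoch margins for a single training sample."""
    margins = [margin(logits, assigned) for logits in logits_per_epoch]
    return sum(margins) / len(margins)


# Made-up logits for one sample, recorded at three epochs (hypothetical data).
history = [[2.0, 1.0, 0.0], [3.0, 1.0, 0.5], [4.0, 0.0, 1.0]]

aum_agree = area_under_margin(history, assigned=0)     # label matches the model's preferred class
aum_disagree = area_under_margin(history, assigned=1)  # label conflicts with the model's preferred class
```

In this toy case the sample has a positive AUM when its assigned label matches the class the model favors and a negative AUM otherwise, which is the kind of separation the reviewed method relies on.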

Strengths: 1. The AUM method works with any classifier “out-of-the-box” without any changes to architecture or training procedure. 2. The empirical evaluations are sufficient and reproducible.

Weaknesses: 1) The method ignores the class membership of the removed data: does removing these samples introduce a new problem (e.g., class imbalance in the training set)? 2) The comparison with other methods may be unfair. In my opinion, the method should be compared with a set of methods doing the same thing, i.e., remove 20% of the data and send the remainder to a base learner to check the consequence of the removal. 3) In Section 3, the authors hypothesize that “at any given epoch during training, a mislabeled sample will in expectation have a smaller margin than a correctly-labeled sample”. Figure 2 uses DOG cases to illustrate this hypothesis. However, is this the case in general? 4) To find the threshold, a new class is added to estimate the upper bound of AUM for mislabeled data by Eqn (1). However, there are no clean positive samples from which to learn the new classifier, so the assigned logit value of threshold samples in Eqn (1) will be biased. 5) In Table 4, the results of AUM are worse than those of most compared methods when the degree of noise is 40%. Does this indicate that too much data is removed? 6) In practice, how should one choose a good set of samples for setting the threshold?

Correctness: yes

Clarity: yes, very clear.

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: The authors have addressed my major concerns about the left-out data. Overall, I think that this paper provides a reasonable and convenient tool for data cleaning. I will keep my score unchanged.

Review 3

Summary and Contributions: The paper addresses the problem of learning with label noise. The authors focus on identifying underlying noisy labels by exploiting the margin ranking. The experimental results on synthetic and real-world datasets verify the effectiveness of the proposed method. **After reading author response** I thank the authors for their positive response. After reading the response carefully, I feel that part of my confusion has been resolved. Thus, I changed my score to 5. However, my main reasons for rejection remain: limited contributions and inadequate experimental explanations.

Strengths: This paper uses the margin ranking to identify noisy labels, which prevents the memorization of label noise. The idea is simple and effective. The experimental settings are described in detail and are easy to follow. The synthetic experiments are sufficient and convincing.

Weaknesses: 1. Limited novelty. The authors use the margin ranking to find the data with noisy labels. As far as I know, the concept of margin has long been used for classification tasks, e.g., face recognition [1], [2] and semi-supervised learning [3]. Methods such as [4], which exploit the memorization effect to select confident samples (small-loss samples), also share similar ideas. Data with small loss have clean labels with high confidence and also have a larger margin ranking. The authors omit any discussion of, or comparison with, these existing works. 2. The logic and contribution are not clear enough. The proposed method seems simple, but the description leaves readers confused. Thus, I would like to see an algorithm flow, and I suggest that the authors re-emphasize the contributions of the paper. 3. Some necessary analysis is missing. For example, the authors use a 100,000-sample subset of Clothing1M and state that the goal is to match the size of the WebVision dataset. The training data in different datasets are usually distributed differently and pose different challenges to the proposed method. Perhaps the authors can add some discussion of these settings. In addition, I would like to see a detailed analysis of the difference between AUM and the original margin, which is critical to readers. [1] Hao Wang et al. CosFace: Large Margin Cosine Loss for Deep Face Recognition. CVPR 2018. [2] Jiankang Deng et al. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. arXiv:1801.09414. [3] Jinhao Dong et al. MarginGAN: Adversarial Training in Semi-Supervised Learning. NeurIPS 2019. [4] Bo Han et al. Co-teaching: Robust training of deep neural networks with extremely noisy labels. NeurIPS 2018.

Correctness: Yes, the claims and method are correct.

Clarity: No. In my opinion, the paper is not well written. Some of the issues are mentioned under "Weaknesses".

Relation to Prior Work: No. It is not clearly discussed, which is also mentioned in "Weaknesses".

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper focuses on learning with label noise, a problem in weakly-supervised learning. The authors propose a novel data cleansing method based on the AUM statistic, which can automatically distinguish mislabeled samples from correctly labeled ones.

Strengths: 1) This paper is self-contained and well-structured. For example, based on previous works and examples, the authors introduce the dynamics of SGD in distinguishing correct/incorrect signals. In Sec 3, the authors verify these assumptions via AUM and propose the corresponding data cleansing algorithm. 2) Using threshold samples to determine hyperparameters might be a good idea, as it can save a lot of computation. Fig 3 and Fig 5 verify this strategy on synthetic datasets with label noise.
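To make the threshold-sample idea in point 2) concrete, here is a minimal sketch of how a cutoff could be derived from the AUM values of threshold samples and then applied to the rest of the data. The reviews do not spell out the exact rule, so the use of a high percentile, the default value of 99, the nearest-rank percentile definition, the "at or below" comparison, and all function names and numbers are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def flag_suspected_mislabeled(
    aums: Sequence[float],
    threshold_aums: Sequence[float],
    pct: float = 99.0,  # illustrative default, an assumption of this sketch
) -> List[int]:
    """Return indices of samples whose AUM is at or below the threshold-sample cutoff."""
    cutoff = percentile_nearest_rank(threshold_aums, pct)
    return [i for i, a in enumerate(aums) if a <= cutoff]


# Hypothetical AUM values: threshold samples carry a deliberately wrong (extra-class)
# label, so their AUMs are expected to be low.
threshold_aums = [-3.0, -2.5, -2.0, -1.5, -1.0]
data_aums = [2.1, -2.4, 0.3, -1.2, 1.7]

flagged_strict = flag_suspected_mislabeled(data_aums, threshold_aums, pct=80.0)
flagged_default = flag_suspected_mislabeled(data_aums, threshold_aums)
```

The appeal noted by the reviewer is that the cutoff comes from the training run itself rather than from a separate hyperparameter sweep; the percentile is the only remaining knob in this sketch.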

Weaknesses: 1) The authors claim that the AUM is less prone to confusing difficult samples for mislabeled data (P2-68). I think it would be a good idea to provide some explanations or heuristics in comparison with previous works. Moreover, data cleansing has been widely exploited in the literature on label-noise learning, so it may be better to demonstrate the superiority of the proposed AUM-based methods over their counterparts. 2) The experimental results on Clothing1M are not very convincing. Using 10% of the training data, the test accuracy of state-of-the-art methods (e.g., PENCIL, Kun Yi et al.) can reach 73% with a ResNet-50 backbone. However, the test accuracy of AUM is only 67%. Could the authors provide some explanation? Maybe the sampled data in the experiment are imbalanced. 3) The authors claim that non-uniform label noise is not too common in practice. However, in real-world noisy datasets such as Clothing1M, it has been verified that non-uniform label noise widely exists. Kun Yi, Jianxin Wu. Probabilistic End-To-End Noise Correction for Learning With Noisy Labels. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7017-7025.

Correctness: correct

Clarity: well written

Relation to Prior Work: clear

Reproducibility: Yes

Additional Feedback: