Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
Label smoothing has become somewhat pervasive in various ML application areas (e.g., MT), but has mostly been thought of as a regularizer. The present work suggests a number of other properties of label smoothing that challenge this assumption, e.g., that label smoothing causes representations to cluster around class templates, that label smoothing improves model calibration, and that label-smoothed networks make worse "teachers" for knowledge distillation. These are important empirical/theoretical findings that are not obvious and should be interesting to a fairly broad audience. The paper is clearly written and the experiments are well designed to support the above claims.
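For readers unfamiliar with the mechanism under discussion, a minimal sketch of label smoothing as the paper studies it: the one-hot target is mixed with a uniform distribution over the K classes. The function name and the smoothing weight `alpha` here are illustrative, not taken from the paper.

```python
import numpy as np

def smooth_labels(one_hot, alpha=0.1):
    """Mix a one-hot target with the uniform distribution over K classes.

    alpha is the smoothing weight: alpha=0 recovers the hard label,
    larger alpha moves mass from the true class to all classes equally.
    """
    k = one_hot.shape[-1]
    return one_hot * (1.0 - alpha) + alpha / k

# Example: 3-class one-hot target for class 0
y = np.array([1.0, 0.0, 0.0])
print(smooth_labels(y, alpha=0.1))  # roughly [0.933, 0.033, 0.033]
```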
Some questions and comments:

-- In Figure 4, why does changing the temperature of a network that is already trained with label smoothing degrade calibration? Can the authors offer some insight?

-- In Figure 5, why does label smoothing slightly degrade the baseline performance of the student network? One would expect the student's baseline accuracy to improve when label smoothing is enabled.

-- The visualizations in the fourth row of Figure 1 suggest that label smoothing could be particularly useful for generalization on samples from semantically similar classes. Does this actually hold? (One could examine the confusion matrix of a classification task to check whether confusion between semantically similar classes is resolved more often when label smoothing is applied.)

-- The visualization idea is neat: it reveals how label smoothing forces training examples of the same class into tight clusters and encourages examples from a class to be equidistant from the other classes, and this holds uniformly across datasets and model architectures (rows 1-3 in Fig. 1). While interesting, if I were to rank the contributions of this work, I'd rank this last. I would suggest reorganizing the paper so that the visualization section appears after the sections on calibration and distillation.

-------

Post rebuttal: Thanks to the authors for addressing my questions. I think this is a strong submission and would really like to see it accepted. I'm raising my score to an 8.
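For context on the Figure 4 question above, temperature scaling is the standard post-hoc calibration knob being discussed: logits are divided by a temperature T before the softmax, so T > 1 softens the predictive distribution and lowers confidence. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T.

    T > 1 flattens the distribution (lower top-class confidence);
    T < 1 sharpens it. T = 1 is the ordinary softmax.
    """
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

p_sharp = softmax_with_temperature([2.0, 1.0, 0.0], T=1.0)
p_soft = softmax_with_temperature([2.0, 1.0, 0.0], T=2.0)
# The higher temperature reduces the top-class probability.
```

The reviewer's question is then: label smoothing already plays a confidence-reducing role during training, so why does applying this extra scaling at test time make calibration worse rather than redundant?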
Pros:

This paper provides an empirical study showing that label smoothing encourages the representations within each class to be close together while equidistant from the incorrect classes, which is intuitive but provides useful insight. The effect of reducing prediction confidence, and thereby improving calibration, is also intuitive and interesting. The authors further explain that label smoothing hurts distillation because it erases the relative confidence information between classes and examples, which also makes sense to me.

Cons:

1. Some of the findings in the paper are somewhat intuitive and natural, e.g., that label smoothing reduces confidence and helps calibration.

2. What matters more is how these findings give us additional insight, help us use label smoothing more effectively, or help us design better methods.

3. The experiments in Sections 2 and 4 are conducted only on image classification tasks. I wonder whether the phenomena hold in NLP tasks such as text classification and machine translation. Since the work is mainly an empirical study with no theoretical proof, analyses on more tasks are necessary.
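The distillation point above refers to the teacher's temperature-softened class probabilities, whose off-target entries carry the relative inter-class similarity ("dark knowledge") that label smoothing flattens out. A minimal sketch of how these soft targets are computed, assuming illustrative function names and example logits not taken from the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax of z / T, with the usual max-subtraction for stability."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_targets(teacher_logits, T=2.0):
    """Soft targets for the student: the teacher's probabilities at
    distillation temperature T. The ordering and spacing of the
    off-target entries encode relative class similarity."""
    return softmax(teacher_logits, T=T)

# A teacher confident in class 0, but notably less "wrong" about
# class 1 than class 2 -- information a student can exploit.
print(distillation_targets([5.0, 3.0, 0.5]))
```

The reviews' argument is that when the teacher itself is trained with label smoothing, these off-target probabilities are pushed toward uniform, so the soft targets lose exactly this structure.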