{"title": "When does label smoothing help?", "book": "Advances in Neural Information Processing Systems", "page_first": 4694, "page_last": 4703, "abstract": "The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.", "full_text": "When Does Label Smoothing Help?\n\nRafael M\u00fcller\u2217, Simon Kornblith, Geoffrey Hinton\n\nGoogle Brain\n\nToronto\n\nrafaelmuller@google.com\n\nAbstract\n\nThe generalization and learning speed of a multi-class neural network can often\nbe signi\ufb01cantly improved by using soft targets that are a weighted average of the\nhard targets and the uniform distribution over labels. 
Smoothing the labels in this way prevents the network from becoming over-confident, and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration, which can significantly improve beam search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.

1 Introduction

It is widely known that neural network training is sensitive to the loss that is minimized. Shortly after Rumelhart et al. [1] derived backpropagation for the quadratic loss function, several researchers noted that better classification performance and faster convergence could be attained by performing gradient descent to minimize cross entropy [2, 3]. However, even in these early days of neural network research, there were indications that other, more exotic objectives could outperform the standard cross-entropy loss [4, 5]. More recently, Szegedy et al.
[6] introduced label smoothing, which improves accuracy by computing cross entropy not with the "hard" targets from the dataset, but with a weighted mixture of these targets with the uniform distribution.

Label smoothing has been used successfully to improve the accuracy of deep learning models across a range of tasks, including image classification, speech recognition, and machine translation (Table 1). Szegedy et al. [6] originally proposed label smoothing as a strategy that improved the performance of the Inception architecture on the ImageNet dataset, and many state-of-the-art image classification models have incorporated label smoothing into training procedures ever since [7, 8, 9]. In speech recognition, Chorowski and Jaitly [10] used label smoothing to reduce the word error rate on the WSJ dataset. In machine translation, Vaswani et al. [11] attained a small but important improvement in BLEU score, despite worse perplexity.

Although label smoothing is a widely used "trick" to improve network performance, not much is known about why and when label smoothing should work. This paper tries to shed light upon the behavior of neural networks trained with label smoothing, and we describe several intriguing properties of these networks.

∗This work was done as part of the Google AI Residency.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Survey of literature label smoothing results on three supervised learning tasks.

DATA SET   ARCHITECTURE        METRIC        VALUE W/O LS   VALUE W/ LS
IMAGENET   INCEPTION-V2 [6]    TOP-1 ERROR   23.1           22.8
                               TOP-5 ERROR   6.3            6.1
EN-DE      TRANSFORMER [11]    BLEU          25.3           25.8
                               PERPLEXITY    4.67           4.92
WSJ        BILSTM+ATT. [10]    WER           8.9            7.0/6.7

Our contributions are as follows:

• We introduce a novel visualization method based on linear projections of the penultimate layer activations.
This visualization provides intuition regarding how representations differ between penultimate layers of networks trained with and without label smoothing.

• We demonstrate that label smoothing implicitly calibrates learned models so that the confidences of their predictions are more aligned with the accuracies of their predictions.

• We show that label smoothing impairs distillation, i.e., when teacher models are trained with label smoothing, student models perform worse. We further show that this adverse effect results from loss of information in the logits.

1.1 Preliminaries

Before describing our findings, we provide a mathematical description of label smoothing. Suppose we write the prediction of a neural network as a function of the activations in the penultimate layer as p_k = exp(x^T w_k) / Σ_{l=1}^L exp(x^T w_l), where p_k is the likelihood the model assigns to the k-th class, w_k represents the weights and biases of the last layer, and x is the vector containing the activations of the penultimate layer of a neural network concatenated with "1" to account for the bias. For a network trained with hard targets, we minimize the expected value of the cross-entropy between the true targets y_k and the network's outputs p_k, H(y, p) = Σ_{k=1}^K -y_k log(p_k), where y_k is "1" for the correct class and "0" for the rest. For a network trained with label smoothing of parameter α, we minimize instead the cross-entropy between the modified targets y_k^LS and the network's outputs p_k, where y_k^LS = y_k(1 - α) + α/K.

2 Penultimate layer representations

Training a network with label smoothing encourages the differences between the logit of the correct class and the logits of the incorrect classes to be a constant dependent on α.
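As a concrete illustration of the preliminaries above, the following NumPy sketch builds smoothed targets y_k^LS = y_k(1 - α) + α/K and compares the resulting cross-entropy with the hard-target loss; the four-class logits and α = 0.1 are arbitrary illustrative values, and the helper names are our own:

```python
import numpy as np

def smoothed_targets(y_hard, alpha):
    """y_LS = y * (1 - alpha) + alpha / K: mix hard targets with the uniform distribution."""
    K = y_hard.shape[-1]
    return y_hard * (1.0 - alpha) + alpha / K

def cross_entropy(targets, logits):
    """H(t, p) = -sum_k t_k log p_k, with p = softmax(logits)."""
    log_p = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    return -np.sum(targets * log_p, axis=-1)

# Illustrative 4-class example with the correct class in position 0.
y = np.array([1.0, 0.0, 0.0, 0.0])
logits = np.array([5.0, 1.0, 0.5, 0.0])

loss_hard = cross_entropy(y, logits)
loss_smooth = cross_entropy(smoothed_targets(y, alpha=0.1), logits)
# Unlike the hard-target loss, the smoothed loss keeps penalizing an ever-growing
# correct-class logit, so training has no incentive to push the gap between the
# correct and incorrect logits beyond a finite constant.
```

Because the uniform component of the target assigns mass to every incorrect class, the smoothed loss grows whenever the incorrect-class probabilities are driven toward zero, which is the mechanism behind the constant logit gap described above.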
By contrast, training a network with hard targets typically results in the correct logit being much larger than any of the incorrect logits and also allows the incorrect logits to be very different from one another. Intuitively, the logit x^T w_k of the k-th class can be thought of as a measure of the squared Euclidean distance between the activations of the penultimate layer x and a template w_k, since ||x - w_k||^2 = x^T x - 2x^T w_k + w_k^T w_k, where each class has a template w_k, the term x^T x is factored out when calculating the softmax outputs, and w_k^T w_k is usually constant across classes. Therefore, label smoothing encourages the activations of the penultimate layer to be close to the template of the correct class and equally distant from the templates of the incorrect classes. To observe this property of label smoothing, we propose a new visualization scheme based on the following steps: (1) pick three classes; (2) find an orthonormal basis of the plane crossing the templates of these three classes; (3) project the penultimate layer activations of examples from these three classes onto this plane. This visualization shows in 2-D how the activations cluster around the templates and how label smoothing enforces a structure on the distance between the examples and the clusters from the other classes.

In Fig. 1, we show results of visualizing penultimate layer representations of image classifiers trained on the datasets CIFAR-10, CIFAR-100 and ImageNet with the architectures AlexNet [12], ResNet-56 [13] and Inception-v4 [14], respectively. Table 2 shows the effect of label smoothing on the accuracy of these models. We start by describing visualization results for CIFAR-10 (first row of Fig. 1) for the classes "airplane," "automobile" and "bird." The first two columns represent examples from the training and validation set for a network trained without label smoothing (w/o LS).
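The three-step projection scheme described above can be sketched as follows; the templates and activations here are random stand-ins, whereas in the experiments they come from the last-layer weights and the penultimate-layer activations:

```python
import numpy as np

def template_plane_basis(w1, w2, w3):
    """Orthonormal basis (via Gram-Schmidt) of the plane through three class templates."""
    u = w2 - w1
    u /= np.linalg.norm(u)
    v = w3 - w1
    v -= (v @ u) * u          # remove the component along u
    v /= np.linalg.norm(v)
    return u, v

def project_activations(acts, w1, u, v):
    """Project penultimate-layer activations of shape (N, D) onto the 2-D template plane."""
    centered = acts - w1
    return np.stack([centered @ u, centered @ v], axis=-1)

rng = np.random.default_rng(0)
w1, w2, w3 = rng.normal(size=(3, 64))     # stand-ins for the three templates w_k
acts = rng.normal(size=(100, 64))         # stand-ins for penultimate activations
u, v = template_plane_basis(w1, w2, w3)
coords = project_activations(acts, w1, u, v)   # (100, 2) points ready to scatter-plot
```

Scatter-plotting `coords` for examples of the three chosen classes reproduces the kind of 2-D picture shown in Fig. 1.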
We observe that the projections are spread into defined but broad clusters. The last two columns show a network trained with a label smoothing factor of 0.1. We observe that now the clusters are much tighter, because label smoothing encourages each example in the training set to be equidistant from all the other classes' templates. Therefore, when looking at the projections, the clusters organize in regular triangles when training with label smoothing, whereas the regular triangle structure is less discernible in the case of training with hard targets (no label smoothing). Note that these networks have similar accuracies despite qualitatively different clustering of activations.

Figure 1: Visualization of penultimate layer's activations of: AlexNet/CIFAR-10 (first row), CIFAR-100/ResNet-56 (second row) and ImageNet/Inception-v4 with three semantically different classes (third row) and two semantically similar classes plus a third one (fourth row).

Table 2: Top-1 classification accuracies of networks trained with and without label smoothing used in visualizations.

DATA SET    ARCHITECTURE    ACCURACY (α = 0.0)   ACCURACY (α = 0.1)
CIFAR-10    ALEXNET         86.8 ± 0.2           86.7 ± 0.3
CIFAR-100   RESNET-56       72.1 ± 0.3           72.7 ± 0.3
IMAGENET    INCEPTION-V4    80.9                 80.9

In the second row, we investigate the activations' geometry for a different dataset/architecture pair (CIFAR-100/ResNet-56). Again, we observe the same behavior for the classes "beaver," "dolphin," and "otter." In contrast to the previous example, here the networks trained with label smoothing have better accuracy. Additionally, we observe the different scale of the projections between the networks trained with and without label smoothing.
With label smoothing, the difference between the logits of two classes has to be limited in absolute value to produce the desired soft target for the correct and incorrect classes. Without label smoothing, however, the projection can take much higher absolute values, which represent over-confident predictions.

Finally, we test our visualization scheme in an Inception-v4/ImageNet experiment and observe the effect of label smoothing for semantically similar classes, since ImageNet has many fine-grained classes (e.g. different breeds of dogs). The third row represents projections for semantically different classes (tench, meerkat and cleaver), with behavior similar to the previous experiments. The fourth row is more interesting, since we pick two semantically similar classes (toy poodle and miniature poodle) and observe the projection in the presence of a third, semantically different one (tench, in blue). With hard targets, the semantically similar classes cluster close to each other with an isotropic spread. On the contrary, with label smoothing these similar classes lie in an arc. In both cases, the semantically similar classes are harder to separate even on the training set, but label smoothing enforces that each example be equidistant to all remaining classes' templates, which gives rise to the arc-shaped behavior with respect to the other classes. We also observe that when training without label smoothing there is a continuous degree of change between the "tench" cluster and the "poodles" cluster, so we can potentially measure how much a particular poodle resembles a tench. When training with label smoothing, however, this information is virtually erased. This erasure of information is demonstrated in Section 4.
Finally, the figure shows that the effect of label smoothing on representations is independent of architecture, dataset and accuracy.

3 Implicit model calibration

By artificially softening the targets, label smoothing prevents the network from becoming over-confident. But does it improve the calibration of the model by making the confidence of its predictions more accurately represent their accuracy? In this section, we seek to answer this question. Guo et al. [15] have shown that modern neural networks are poorly calibrated and over-confident despite having better performance than better-calibrated networks from the past. To measure calibration, the authors computed the estimated expected calibration error (ECE). They demonstrated that a simple post-processing step, temperature scaling, can reduce ECE and calibrate the network. Temperature scaling consists of multiplying the logits by a scalar before applying the softmax operator. Here, we show that label smoothing also reduces ECE and can be used to calibrate a network without the need for temperature scaling.

Image classification. We start by investigating the calibration of image classification models. Fig. 2 (left) shows the 15-bin reliability diagram of a ResNet-56 trained on CIFAR-100. The dashed line represents perfect calibration, where the output likelihood (confidence) perfectly predicts the accuracy. Without temperature scaling, the model trained with hard targets (blue line without markers) is clearly over-confident, since in expectation the accuracy is always below the confidence. To calibrate the model, one can tune the softmax temperature a posteriori (blue line with crosses) to a temperature of 1.9. We observe that the reliability diagram slope is now much closer to a slope of 1 and the model is better calibrated. We also show that, in terms of calibration, label smoothing has a similar effect.
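The two quantities used throughout this section, binned ECE and temperature scaling, can be sketched as follows (the 15-bin setting matches the reliability diagrams in the text; function names are our own):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T                       # temperature scaling rescales the logits by 1/T
    z -= z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expected_calibration_error(logits, labels, T=1.0, n_bins=15):
    """ECE: average over confidence bins of |accuracy - confidence|, weighted by bin size."""
    probs = softmax(logits, T)
    conf = probs.max(axis=-1)
    correct = (probs.argmax(axis=-1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece
```

One could then sweep T on held-out data and pick the value that minimizes ECE, which is how the temperatures reported in Table 3 are obtained in spirit.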
By training the same model with α = 0.05 (green line), we obtain a model that is similarly calibrated compared to temperature scaling. In Table 3, we observe how varying label smoothing and temperature scaling affects ECE. Both methods can be used to reduce ECE to similar values, smaller than that of an uncalibrated network trained with hard targets.

We also performed experiments on ImageNet (Fig. 2, right). Again, the network trained with hard targets (blue curve without markers) is over-confident and achieves a high ECE of 0.071. Using temperature scaling (T = 1.4), ECE is reduced to 0.022 (blue curve with crosses). Although we did not tune α extensively, we found that label smoothing of α = 0.1 improves ECE to 0.035, resulting in better calibration compared to the unscaled network trained with hard targets.

These results are somewhat surprising in light of the penultimate layer visualizations of these networks shown in the previous section. Despite trying to collapse the training examples to tiny clusters, these networks generalize and are calibrated. Looking at the label smoothing representations for CIFAR-100 in Fig. 1 (second row, last two columns), we clearly observe this behavior. The red cluster is very tight in the training set, but in the validation set it spreads towards the center, representing the full range of confidences for each prediction.

Machine translation. We also investigate the calibration of a model trained on the English-to-German translation task using the Transformer architecture. This setup is interesting for two reasons. First, Vaswani et al.
[11] noted that label smoothing with α = 0.1 improved the BLEU score of their final model despite attaining worse perplexity compared to a model trained with hard targets (α = 0.0), so we compare both setups in terms of calibration to verify that label smoothing also improves calibration in this task. Second, compared to image classification, where calibration does not directly affect the metric we care about (accuracy), in language translation the network's soft outputs are inputs to a second algorithm (beam search) which is affected by calibration. Since beam search approximates a maximum-likelihood sequence detection algorithm (the Viterbi algorithm), we would intuitively expect better performance for a better-calibrated model, since the model's confidence better predicts the accuracy of the next token.

Figure 2: Reliability diagram of ResNet-56/CIFAR-100 (left) and Inception-v4/ImageNet (right).

Table 3: Expected calibration error (ECE) on different architectures/datasets.

                             BASELINE                TEMP. SCALING        LABEL SMOOTHING
DATA SET    ARCHITECTURE     ECE (T=1.0, α = 0.0)    ECE / T (α = 0.0)    ECE / α (T=1.0)
CIFAR-100   RESNET-56        0.150                   0.021 / 1.9          0.024 / 0.05
IMAGENET    INCEPTION-V4     0.071                   0.022 / 1.4          0.035 / 0.1
EN-DE       TRANSFORMER      0.056                   0.018 / 1.13         0.019 / 0.1

We start by looking at the reliability diagram (Fig. 3) for a Transformer network trained with hard targets (with and without temperature scaling) and a network trained with label smoothing (α = 0.1). We plot calibration of the next-token predictions assuming a correct prefix on the validation set.
The results are in agreement with the previous experiments on CIFAR-100 and ImageNet, and indeed the Transformer network [11] with label smoothing is better calibrated than the hard-targets alternative.

Figure 3: Reliability diagram of Transformer trained on the EN-DE dataset.

Despite being better calibrated and achieving better BLEU scores, label smoothing results in worse negative log-likelihoods (NLL) than hard targets. Moreover, temperature scaling with hard targets is insufficient to recover the BLEU score improvement obtained with label smoothing. In Fig. 4, we artificially vary the calibration of both networks using temperature scaling and analyze the effect upon BLEU and NLL. The left panel shows results for a network trained with hard targets. By increasing the temperature we can both reduce ECE (red, right y-axis) and slightly improve the BLEU score (blue, left y-axis), but the BLEU score improvement is not enough to match the BLEU score of the network trained with label smoothing (center panel). The network trained with label smoothing is "automatically calibrated," and changing the temperature degrades both calibration and BLEU score. Finally, in the right panel, we plot the NLL for both networks, where markers represent the network with label smoothing. The model trained with hard targets achieves better NLL at all temperature scaling settings. Thus, label smoothing improves translation quality measured by BLEU score despite worse NLL, and the difference in BLEU score performance is only partly explained by calibration.
Note that, for this experiment, the minimum of ECE predicts the best BLEU score slightly better than the minimum of NLL does.

Figure 4: Effect of calibration of Transformer upon BLEU score (blue lines) and NLL (red lines). Curves without markers reflect networks trained without label smoothing, while curves with markers represent networks trained with label smoothing.

4 Knowledge distillation

In this section, we study how the use of label smoothing to train a teacher network affects the ability to distill the teacher's knowledge into a student network. We show that, even when label smoothing improves the accuracy of the teacher network, teachers trained with label smoothing produce inferior student networks compared to teachers trained with hard targets. We first noticed this effect when trying to replicate a result in [16]. A non-convolutional teacher is trained on randomly translated MNIST digits with hard targets and dropout and gets 0.67% test error. Using distillation, this teacher can be used to train a narrower, unregularized student on untranslated digits to get 0.74% test error. If we use label smoothing instead of dropout, the teacher trains much faster and does slightly better (0.59%), but distillation produces a much worse student with 0.91% test error. Something goes seriously wrong with distillation when the teacher is trained with label smoothing.

In knowledge distillation, we replace the cross-entropy term H(y, p) with the weighted sum (1 - β)H(y, p) + βH(p^t(T), p(T)), where p_k(T) and p_k^t(T) are the outputs of the student and teacher, respectively, after temperature scaling with temperature T. β controls the balance between two tasks: fitting the hard targets and approximating the softened teacher.
The temperature can be viewed as a way to exaggerate the differences between the probabilities of incorrect answers.

Both label smoothing and knowledge distillation involve fitting a model using soft targets. Knowledge distillation is only useful if it provides an additional gain to the student compared to training the student with label smoothing, which is simpler to implement since it does not require training a teacher network. We quantify this gain experimentally. To demonstrate these ideas, we perform an experiment on the CIFAR-10 dataset. We train a ResNet-56 teacher and we distill to an AlexNet student. We are interested in four results:

1. the teacher's accuracy as a function of the label smoothing factor,

2. the student's baseline accuracy as a function of the label smoothing factor without distillation,

3. the student's accuracy after distillation with temperature scaling to control the smoothness of the teacher's provided targets (teacher trained with hard targets),

4. the student's accuracy after distillation with fixed temperature (T = 1.0 and teacher trained with label smoothing to control the smoothness of the teacher's provided targets).

To compare all solutions using a single smoothness index, we define the equivalent label smoothing factor γ, which for scenarios 1 and 2 is equal to α. For scenarios 3 and 4, the smoothness index is γ = E[Σ_{k=1}^K (1 - y_k) p_k^t(T) K/(K - 1)], which calculates the mass allocated by the teacher to incorrect classes, averaged over the training set. Since the training accuracy is nearly perfect, for all distillation experiments we consider only the case where β = 1, i.e., when the targets are the teacher output and the true labels are ignored.

Fig. 5 shows the results of this distillation experiment.
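The distillation objective and the equivalent smoothness index γ defined above can be sketched as follows (the function names and toy logits are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, y, T=1.0, beta=1.0):
    """(1 - beta) * H(y, p) + beta * H(p^t(T), p(T))."""
    hard = -np.sum(y * np.log(softmax(student_logits)), axis=-1)
    soft = -np.sum(softmax(teacher_logits, T) * np.log(softmax(student_logits, T)), axis=-1)
    return (1.0 - beta) * hard + beta * soft

def equivalent_smoothing_factor(teacher_logits, y, T=1.0):
    """gamma = E[sum_k (1 - y_k) p^t_k(T)] * K / (K - 1): teacher mass on incorrect classes."""
    p_t = softmax(teacher_logits, T)
    K = y.shape[-1]
    return float(np.mean(np.sum((1.0 - y) * p_t, axis=-1)) * K / (K - 1))
```

A perfectly uniform teacher yields γ = 1 and a teacher that puts all its mass on the correct class yields γ = 0, so γ plays the same role for distilled targets that α plays for smoothed targets.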
We first compare the performance of the teacher network (blue solid curve, top) and the student network (light blue solid curve, bottom) trained without distillation. For this particular setup, increasing α improves the teacher's accuracy up to values of α = 0.6, while label smoothing slightly degrades the baseline performance of the student networks.

Figure 5: Performance of distillation from ResNet-56 to AlexNet on CIFAR-10.

Figure 6: Estimated mutual information evolution during teacher training.

Next, we distill the teacher trained with hard targets to students using different temperatures, and calculate the corresponding γ for each temperature (red dashed curve). We observe that all distilled models outperform the baseline student trained with label smoothing. Finally, we distill information from teachers trained with label smoothing (α > 0), which have better accuracy (blue dashed curve). The figure shows that using these better-performing teachers is no better, and sometimes worse, than training the student directly with label smoothing, as the relative information between logits is "erased" when the teacher is trained with label smoothing.

To observe how label smoothing "erases" the information contained in the different similarities that an individual example has to different classes, we revisit the visualizations in Fig. 1. Note that we are interested in the visualization of the examples from the training set, since those are the ones used for distillation. For a teacher trained with hard targets (α = 0.0), we observe that examples are distributed in broad clusters, which means that different examples from the same class can have very different similarities to other classes. For a teacher trained with label smoothing, we observe the opposite behavior.
Label smoothing encourages examples to lie in tight, equally separated clusters, so every example of one class has very similar proximities to examples of the other classes. Therefore, a teacher with better accuracy is not necessarily the one that distills better.

One way to directly quantify information erasure is to calculate the mutual information between the input and the logits. Computing mutual information in high dimensions for unknown distributions is challenging, but here we simplify the problem in several ways. We measure the mutual information between X and Y, where X is a discrete variable representing the index of the training example and Y is a continuous variable representing the difference between two logits (out of K classes). The source of randomness comes from data augmentation: we approximate the distribution of Y as a Gaussian and estimate the mean and variance from the examples using Monte Carlo samples. The difference of the logits y can be written as y = f(d(z_x)), where z_x is the flattened input image indexed by x, d(·) is a random data augmentation function (random shifts, for example), and f(·) is a trained neural network taking an image as input and producing the difference between two logits as output (a single real-valued output). The mutual information and its respective approximation are

I(X; Y) = E_{X,Y}[log(p(y|x)) - log((1/N) Σ_{x'=1}^N p(y|x'))] and

Î(X; Y) = (1/N) Σ_{x=1}^N [-(f(d(z_x)) - µ_x)^2/(2σ^2) - log((1/N) Σ_{x'=1}^N e^{-(f(d(z_x)) - µ_{x'})^2/(2σ^2)})],

where µ_x = Σ_{l=1}^L f(d(z_x))/L and σ^2 = Σ_{x=1}^N (f(d(z_x)) - µ_x)^2/N, L is the number of Monte Carlo samples used to calculate the empirical mean, and N is the number of training examples used for mutual information estimation. Here the mutual information is between 0 and log(N).

Fig.
6 shows the estimated mutual information between a subset of the training examples (N = 600 from two classes) and the difference of the logits corresponding to these two classes. After initialization, the mutual information is very small, but as the network is trained, it first rapidly increases and then slowly decreases, especially for the network trained with label smoothing. This result confirms the intuitions from the previous sections. As the representations collapse to small clusters of points, much of the information that could have helped distinguish examples is lost. This results in lower estimated mutual information and poor distillation for teachers trained with label smoothing. For later stages of training, mutual information stays slightly above log(2), which corresponds to the extreme case where all training examples collapse to two separate clusters. In this case, all the information of the input is discarded except a single bit representing which class the example belongs to, resulting in no extra information in the teacher's logits compared to the information in the labels.

5 Related work

Pereyra et al. [17] showed that label smoothing provides consistent gains across many tasks. That work also proposed a new regularizer based on penalizing low-entropy predictions, which the authors term the "confidence penalty." They show that label smoothing is equivalent to the confidence penalty if the order of the KL divergence between the uniform distribution and the model's outputs is reversed. They also propose to use distributions other than uniform, resulting in unigram label smoothing (see Table 1), which is advantageous when the output labels' distribution is not balanced.
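The equivalence noted by Pereyra et al. can be checked numerically. The sketch below verifies the identity H(y^LS, p) = (1 - α)H(y, p) + α(KL(u||p) + log K), i.e. label smoothing adds a KL term with the arguments in the opposite order from the confidence penalty's KL(p||u); all values here are arbitrary:

```python
import numpy as np

K, alpha = 5, 0.1
rng = np.random.default_rng(1)
logits = rng.normal(size=K)
p = np.exp(logits - logits.max())
p /= p.sum()                              # the model's output distribution

y = np.zeros(K)
y[0] = 1.0                                # hard one-hot target
u = np.full(K, 1.0 / K)                   # uniform distribution
y_ls = y * (1 - alpha) + alpha / K        # smoothed target

def cross_entropy(t):
    """H(t, p) = -sum_k t_k log p_k."""
    return -np.sum(t * np.log(p))

kl_u_p = np.sum(u * np.log(u / p))        # KL(u || p)

lhs = cross_entropy(y_ls)
rhs = (1 - alpha) * cross_entropy(y) + alpha * (kl_u_p + np.log(K))
# lhs equals rhs: up to the constant alpha * log K, minimizing the smoothed loss is
# minimizing the hard-target loss plus alpha * KL(u || p).
```

Since KL(u||p) and KL(p||u) differ, the two regularizers are related but not identical, which matches the "reversed order of the KL divergence" statement above.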
Label smoothing also relates to DisturbLabel [18], which can be seen as label dropout; label smoothing is the marginalized version of label dropout.

Calibration of modern neural networks [15] has been widely investigated for image classification, but calibration of sequence models has been investigated only recently. Ott et al. [19] investigate the sequence-level calibration of machine translation models and conclude that they are remarkably well calibrated. Kumar and Sarawagi [20] investigate calibration of next-token prediction in language translation. They find that calibration of state-of-the-art models can be improved by a parametric model, resulting in a small increase in BLEU score. However, neither work investigates the relation between label smoothing during training and calibration. For speech recognition, Chorowski and Jaitly [10] investigate the effect of softmax temperature and label smoothing on decoding accuracy. The authors conclude that both temperature scaling and label smoothing improve word error rates after beam search (label smoothing performs best), but the relation between calibration and label smoothing/temperature scaling is not described.

Although we are unaware of any previous work that shows the adverse effect of label smoothing upon distillation, Kornblith et al. [21] previously demonstrated that label smoothing impairs the accuracy of transfer learning, which similarly depends on the presence of non-class-relevant information in the final layers of the network. Additionally, Chelombiev et al. [22] propose an improved mutual information estimator based on binning and show a correlation between compression of softmax layer representations and generalization, which may explain why networks trained with label smoothing generalize so well.
This relates to the information bottleneck theory [23, 24, 25], which explains generalization in terms of compression.

6 Conclusion and future work

Many state-of-the-art models are trained with label smoothing, but the inductive bias provided by this technique is not well understood. In this work, we have summarized and explained several behaviors observed while training deep neural networks with label smoothing. We focused on how label smoothing encourages representations in the penultimate layer to group in tight, equally distant clusters. This emergent property can be visualized in low dimensions thanks to a new visualization scheme that we proposed. Despite having a positive effect on generalization and calibration, label smoothing can hurt distillation. We explain this effect in terms of erasure of information. With label smoothing, the model is encouraged to treat each incorrect class as equally probable. With hard targets, less structure is enforced in later representations, enabling more logit variation across predicted classes and/or across examples. This can be quantified by estimating the mutual information between input example and output logit; as we have shown, label smoothing reduces this mutual information. This finding suggests a new research direction, focusing on the relationship between label smoothing and the information bottleneck principle, with implications for compression, generalization and information transfer. Finally, we performed extensive experiments on how label smoothing can implicitly calibrate a model's predictions.
This has a large impact on model interpretability, but, as we have shown, it can also be critical for downstream tasks that depend on calibrated likelihoods, such as beam search.

7 Acknowledgements

We would like to thank Mohammad Norouzi, William Chan, Kevin Swersky, Danijar Hafner and Rishabh Agrawal for the discussions and suggestions.

References

[1] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Nature, 323:19, 1986.

[2] Eric B Baum and Frank Wilczek. Supervised learning of probability distributions by neural networks. In Neural Information Processing Systems, pages 52–61, 1988.

[3] Sara Solla, Esther Levin, and Michael Fleisher. Accelerated learning in layered neural networks. Complex Systems, 2:625–640, 1988.

[4] Scott E Fahlman. An empirical study of learning speed in back-propagation networks. Technical report, Carnegie Mellon University, Computer Science Department, 1988.

[5] John A Hertz, Anders Krogh, and Richard G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 2018.

[6] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[7] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8697–8710, 2018.

[8] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.

[9] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.

[10] Jan Chorowski and Navdeep Jaitly. Towards better decoding and language model integration in sequence to sequence models. Proc. Interspeech 2017, pages 523–527, 2017.

[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[14] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[15] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330, 2017.

[16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[17] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.

[18] Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, and Qi Tian. DisturbLabel: Regularizing CNN on the loss layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4753–4762, 2016.

[19] Myle Ott, Michael Auli, David Grangier, et al. Analyzing uncertainty in neural machine translation. In International Conference on Machine Learning, pages 3953–3962, 2018.

[20] Aviral Kumar and Sunita Sarawagi. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802, 2019.

[21] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974, 2018.

[22] Ivan Chelombiev, Conor Houghton, and Cian O'Donnell. Adaptive estimators show information compression in deep neural networks. arXiv preprint arXiv:1902.09037, 2019.

[23] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

[24] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.

[25] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.