{"title": "L_DMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 6225, "page_last": 6236, "abstract": "Accurately annotating large-scale datasets is notoriously expensive in both time and money. Although acquiring low-quality annotated datasets can be much cheaper, using such datasets without particular treatment often badly damages the performance of trained models. Various methods have been proposed for learning with noisy labels. However, most methods only handle limited kinds of noise patterns, require auxiliary information or steps (e.g., knowing or estimating the noise transition matrix), or lack theoretical justification. In this paper, we propose a novel information-theoretic loss function, L_DMI, for training deep neural networks robust to label noise. The core of L_DMI is a generalized version of mutual information, termed Determinant based Mutual Information (DMI), which is not only information-monotone but also relatively invariant. To the best of our knowledge, L_DMI is the first loss function that is provably robust to instance-independent label noise, regardless of noise pattern, and it can be applied to any existing classification neural network straightforwardly without any auxiliary information. In addition to theoretical justification, we also empirically show that L_DMI outperforms all other counterparts in classification tasks on both image and natural language datasets, including Fashion-MNIST, CIFAR-10, Dogs vs.
Cats, MR with a variety of synthesized noise patterns and noise amounts, as well as a real-world dataset, Clothing1M.", "full_text": "LDMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise

Yilun Xu*, Peng Cao*
School of Electronics Engineering and Computer Science, Peking University
{xuyilun,caopeng2016}@pku.edu.cn

Yuqing Kong
The Center on Frontiers of Computing Studies, Computer Science Dept., Peking University
yuqing.kong@pku.edu.cn

Yizhou Wang
Computer Science Dept., Peking University
Deepwise AI Lab
Yizhou.Wang@pku.edu.cn

Abstract

Accurately annotating large-scale datasets is notoriously expensive in both time and money. Although acquiring low-quality annotated datasets can be much cheaper, using such datasets without particular treatment often badly damages the performance of trained models. Various methods have been proposed for learning with noisy labels. However, most methods only handle limited kinds of noise patterns, require auxiliary information or steps (e.g., knowing or estimating the noise transition matrix), or lack theoretical justification. In this paper, we propose a novel information-theoretic loss function, LDMI, for training deep neural networks robust to label noise. The core of LDMI is a generalized version of mutual information, termed Determinant based Mutual Information (DMI), which is not only information-monotone but also relatively invariant. To the best of our knowledge, LDMI is the first loss function that is provably robust to instance-independent label noise, regardless of noise pattern, and it can be applied to any existing classification neural network straightforwardly without any auxiliary information. In addition to theoretical justification, we also empirically show that LDMI outperforms all other counterparts in classification tasks on both image and natural language datasets, including Fashion-MNIST, CIFAR-10, Dogs vs. Cats, MR with a variety of synthesized noise patterns and noise amounts, as well as a real-world dataset, Clothing1M.

1 Introduction

Deep neural networks, together with large-scale accurately annotated datasets, have achieved remarkable performance in a great many classification tasks in recent years (e.g., [18, 11]). However, it is usually money- and time-consuming to find experts to annotate labels for large-scale datasets. While collecting labels from crowdsourcing platforms like Amazon Mechanical Turk is a potential way to get annotations cheaper and faster, the collected labels are usually very noisy. The noisy labels hamper the performance of deep neural networks, since the commonly used cross entropy loss is not noise-robust. This raises an urgent demand for designing noise-robust loss functions.

Some previous works have proposed loss functions for training deep neural networks with noisy labels. However, they either use auxiliary information [29, 12] (e.g., an additional set of clean data or the noise transition matrix) or steps [20, 33] (e.g., estimating the noise transition matrix), or make assumptions on the noise [7, 48] and thus can only handle limited kinds of noise patterns (see preliminaries for the definitions of different noise patterns).

*Equal Contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

One reason that the loss functions used in previous works are not robust to a certain noise pattern, say diagonally non-dominant noise, is that they are distance-based, i.e., the loss is the distance between the classifier's outputs and the labels (e.g. 0-1 loss, cross entropy loss).
When datapoints are labeled by a careless annotator who tends to label the a priori popular class (e.g., for medical images, given the prior knowledge of 10% malignant and 90% benign, a careless annotator labels "benign" when the underlying true label is "benign" and labels "benign" with 90% probability when the underlying true label is "malignant"), the collected noisy labels have a diagonally non-dominant noise pattern and are extremely biased to one class ("benign"). In this situation, the distance-based losses will prefer the "meaningless classifier" that always outputs the a priori popular class ("benign") over the classifier that outputs the true labels.

To address this issue, instead of using distance-based losses, we propose to employ an information-theoretic loss such that the classifier whose outputs have the highest mutual information with the labels has the lowest loss. The key observation is that the "meaningless classifier" has no information about anything and will be naturally eliminated by the information-theoretic loss. Moreover, the information-monotonicity of the mutual information guarantees that adding noise to a classifier's output will make this classifier less preferred by the information-theoretic loss.

However, the key observation is not sufficient. In fact, we want an information measure I to satisfy

I(classifier 1's output; noisy labels) > I(classifier 2's output; noisy labels)
⇔ I(classifier 1's output; clean labels) > I(classifier 2's output; clean labels).

Unfortunately, the traditional Shannon mutual information (MI) does not satisfy the above formula, while we find that a generalized information measure, namely DMI (Determinant based Mutual Information), satisfies the above formula. Like MI, DMI measures the correlation between two random variables.
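The careless-annotator example can be checked numerically. The sketch below (NumPy; the numbers are the hypothetical 10%/90% medical-imaging prior above, and the determinant-based measure is the paper's DMI) shows that an agreement-style, distance-based score prefers the constant "benign" classifier, while the determinant of the joint distribution prefers the truthful one:

```python
import numpy as np

# Hypothetical numbers from the medical-imaging example above:
# prior 10% malignant / 90% benign; the annotator labels true "benign"
# correctly and mislabels true "malignant" as "benign" 90% of the time.
prior = np.array([0.9, 0.1])              # [benign, malignant]
T = np.array([[1.0, 0.0],                 # benign    -> benign
              [0.9, 0.1]])                # malignant -> mostly benign

# Agreement with the noisy labels (a distance-style score):
p_noisy_benign = prior @ T[:, 0]             # Pr[noisy label = benign] = 0.99
agree_constant = p_noisy_benign              # h0 always answers "benign"
agree_truthful = (prior * np.diag(T)).sum()  # h* outputs the true label: 0.91

# Determinant of the joint distribution with the noisy labels:
joint_truthful = prior[:, None] * T          # joint of (Y, noisy Y)
joint_constant = np.array([[p_noisy_benign, 1 - p_noisy_benign],
                           [0.0, 0.0]])      # h0's output is constant
dmi_truthful = abs(np.linalg.det(joint_truthful))  # 0.009
dmi_constant = abs(np.linalg.det(joint_constant))  # 0.0

# The distance-style score prefers the meaningless classifier...
assert agree_constant > agree_truthful
# ...while the determinant-based measure prefers the truthful one.
assert dmi_truthful > dmi_constant
```

The constant classifier agrees with 99% of the noisy labels versus 91% for the truthful classifier, yet its joint-distribution determinant is zero, which is exactly why an information-theoretic loss eliminates it. DMI formalizes this determinant-based measure.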
It is defined as the determinant of the matrix that describes the joint distribution over the two variables. Intuitively, when two random variables are independent, their joint distribution matrix has low rank and zero determinant. Moreover, DMI is not only information-monotone like MI, but also relatively invariant because of the multiplication property of the determinant. The relative invariance of DMI makes it satisfy the above formula.

Based on DMI, we propose a noise-robust loss function LDMI, which is simply

LDMI(data; classifier) := − log[DMI(classifier's output; labels)].

As shown in Theorem 4.1 later, with LDMI, the following equation holds:

LDMI(noisy data; classifier) = LDMI(clean data; classifier) + noise amount,

and the noise amount is a constant given the dataset. The equation reveals that with LDMI, training with the noisy labels is theoretically equivalent to training with the clean labels in the dataset, regardless of the noise pattern, including the noise amount.

In summary, we propose a novel information-theoretic noise-robust loss function LDMI based on a generalized information measure, DMI. Theoretically, we show that LDMI is robust to instance-independent label noise. As an additional benefit, it can easily be applied to any existing classification neural network straightforwardly without any auxiliary information. Extensive experiments have been done on both image and natural language datasets, including Fashion-MNIST, CIFAR-10, Dogs vs. Cats and MR, with a variety of synthesized noise patterns and noise amounts, as well as a real-world dataset, Clothing1M. The results demonstrate the superior performance of LDMI.

2 Related Work

A series of works have attempted to design noise-robust loss functions.
In the context of binary classification, some loss functions (e.g., 0-1 loss [22], ramp loss [3], unhinged loss [40], savage loss [23]) have been proved to be robust to uniform or symmetric noise, and Natarajan et al. [26] presented a general way to modify any given surrogate loss function. Ghosh et al. [7] generalized the existing results for the binary classification problem to the multi-class classification problem and proved that MAE (Mean Absolute Error) is robust to diagonally dominant noise. Zhang et al. [48] showed that MAE performs poorly with deep neural networks, and they combined MAE and cross entropy loss to obtain a new loss function. Patrini et al. [29] provided two kinds of loss correction methods that require knowing the noise transition matrix. The noise transition matrix sometimes can be estimated from the noisy data [33, 20, 30]. Hendrycks et al. [12] proposed another loss correction technique that requires an additional set of clean data. To the best of our knowledge, we are the first to provide a loss function that is provably robust to instance-independent label noise without knowing the transition matrix, regardless of noise pattern and noise amount.

Instead of designing an inherently noise-robust loss function, several works used special architectures to deal with the problem of training deep neural networks with noisy labels. Some of them focused on estimating the noise transition matrix to handle the label noise and proposed a variety of ways to constrain the optimization [37, 43, 8, 39, 9, 44]. Some of them focused on finding ways to distinguish noisy labels from clean labels and used example re-weighting strategies to give the noisy labels less weight [31, 32, 21]. While these methods seem to perform well in practice, they cannot guarantee robustness to label noise theoretically and are also outperformed by our method empirically.

On the other hand, Zhang et al.
[46] have shown that deep neural networks can easily memorize completely random labels; thus several works have proposed frameworks to prevent this overfitting issue empirically in the setting of deep learning from noisy labels. For example, the teacher-student curriculum learning framework [14] and the co-teaching framework [10] have been shown to be helpful. Multi-task frameworks that jointly estimate true labels and learn to classify images have also been introduced [41, 19, 38, 45]. Explicit and implicit regularization methods can also be applied [47, 25]. We take a different perspective from them and focus on designing an inherently noise-robust loss function.

In this paper, we only consider instance-independent noise. There are also some works that investigate instance-dependent noise models (e.g. [5, 24]). They focus on the binary setting and assume that the noisy and true labels agree on average.

3 Preliminaries

3.1 Problem settings

We denote the set of classes by C and the size of C by C. We also denote the domain of datapoints by X. A classifier is denoted by h : X → ∆_C, where ∆_C is the set of all possible distributions over C. h represents a randomized classifier such that given x ∈ X, h(x)_c is the probability that h maps x into class c. Note that, fixing the input x, the randomness of a classifier is independent of everything else. There are N datapoints {xi}_{i=1}^N. For each datapoint xi, there is an unknown ground truth yi ∈ C.
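Concretely, the randomized classifier h : X → ∆_C above is, in practice, just a network whose softmax layer outputs a probability vector over the C classes; a minimal sketch with toy weights (illustrative only, not the paper's architecture):

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))       # toy linear "network": C = 3 classes, 4 features

def h(x):
    """A randomized classifier: maps a datapoint to a distribution over C."""
    return softmax(W @ x)

x = rng.normal(size=4)
p = h(x)                          # p[c] = h(x)_c, a valid distribution
assert p.shape == (3,) and abs(p.sum() - 1.0) < 1e-9 and (p >= 0).all()
```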
We assume that there is an unknown prior distribution Q_{X,Y} over X × C such that {(xi, yi)}_{i=1}^N are i.i.d. samples drawn from Q_{X,Y} and

Q_{X,Y}(x, y) = Pr[X = x, Y = y].

Note that here we allow the datapoints to be "imperfect" instances, i.e., there still exists uncertainty for Y conditioning on fully knowing X.

Traditional supervised learning aims to train a classifier h* that is able to classify new datapoints into their ground truth categories with access to {(xi, yi)}_{i=1}^N. However, in the setting of learning with noisy labels, instead, we only have access to {(xi, ˜yi)}_{i=1}^N where ˜yi is a noisy version of yi.

We use a random variable ˜Y to denote the noisy version of Y and T_{Y→˜Y} to denote the transition distribution between Y and ˜Y, i.e.

T_{Y→˜Y}(y, ˜y) = Pr[˜Y = ˜y | Y = y].

We use T_{Y→˜Y} to represent the C × C matrix format of T_{Y→˜Y}.

Generally speaking [29, 7, 48], label noise can be divided into several kinds according to the noise transition matrix T_{Y→˜Y}. It is defined as class-independent (or uniform) if a label is substituted by a uniformly random label regardless of the classes, i.e. Pr[˜Y = ˜c | Y = c] = Pr[˜Y = ˜c′ | Y = c], ∀˜c, ˜c′ ≠ c (e.g. T_{Y→˜Y} = [[0.7, 0.3], [0.3, 0.7]]). It is defined as diagonally dominant if for every row of T_{Y→˜Y}, the magnitude of the diagonal entry is larger than any non-diagonal entry, i.e. Pr[˜Y = c | Y = c] > Pr[˜Y = ˜c | Y = c], ∀˜c ≠ c (e.g. T_{Y→˜Y} = [[0.7, 0.3], [0.2, 0.8]]). It is defined as diagonally non-dominant if it is not diagonally dominant (e.g. the example mentioned in the introduction, T_{Y→˜Y} = [[1, 0], [0.9, 0.1]]).

We assume that the noise is independent of the datapoints conditioning on the ground truth, which is commonly assumed in the literature [29, 7, 48], i.e.,

Assumption 3.1 (Independent noise). X is independent of ˜Y conditioning on Y.

We also need the noisy version ˜Y to still be informative.

Assumption 3.2 (Informative noisy label). T_{Y→˜Y} is invertible, i.e., det(T_{Y→˜Y}) ≠ 0.

3.2 Information theory concepts

Since Shannon's seminal work [35], information theory has shown its powerful impact in a variety of fields, including several recent deep learning works [13, 4, 17]. Our work is also inspired by information theory. This section introduces several basic information theory concepts.

Information theory is commonly concerned with random variables. For every random variable W1, Shannon's entropy H(W1) := −∑_{w1} Pr[W1 = w1] log Pr[W1 = w1] measures the uncertainty of W1. For example, a deterministic W1 has the lowest entropy. For every two random variables W1 and W2, Shannon mutual information MI(W1, W2) := ∑_{w1,w2} Pr[W1 = w1, W2 = w2] log (Pr[W1 = w1, W2 = w2] / (Pr[W1 = w1] Pr[W2 = w2])) measures the amount of relevance between W1 and W2. For example, when W1 and W2 are independent, they have the lowest Shannon mutual information, zero.

Shannon mutual information is non-negative, symmetric, i.e., MI(W1, W2) = MI(W2, W1), and also satisfies a desired property, information-monotonicity, i.e., the mutual information between W1 and W2 will always decrease if either W1 or W2 has been "processed".

Fact 3.3 (Information-monotonicity [6]).
For all random variables W1, W2, W3, when W3 is less informative for W2 than W1, i.e., W3 is independent of W2 conditioning on W1,

MI(W3, W2) ≤ MI(W1, W2).

This property naturally induces that for all random variables W1, W2,

MI(W1, W2) ≤ MI(W2, W2) = H(W2)

since W2 is always the most informative random variable for itself.

Based on Shannon mutual information, a performance measure for a classifier h can be naturally defined. A high quality classifier's output h(X) should have high mutual information with the ground truth category Y. Thus, a classifier h's performance can be measured by MI(h(X), Y).

However, in our setting, we only have access to the i.i.d. samples of h(X) and ˜Y. A natural attempt is to measure a classifier h's performance by MI(h(X), ˜Y). Unfortunately, under this performance measure, the measurement based on noisy labels MI(h(X), ˜Y) may not be consistent with the measurement based on true labels MI(h(X), Y). (See a counterexample in Supplementary Material B.) That is, it does not always hold that

∀h, h′, MI(h(X), Y) > MI(h′(X), Y) ⇔ MI(h(X), ˜Y) > MI(h′(X), ˜Y).

Thus, we cannot use Shannon mutual information as the performance measure for classifiers. Here we find that a generalized mutual information, Determinant based Mutual Information (DMI) [16], satisfies the above formula, such that under the performance measure based on DMI, the measurement based on noisy labels is consistent with the measurement based on true labels.

Definition 3.4 (Determinant based Mutual Information [16]). Given two discrete random variables W1, W2, we define the Determinant based Mutual Information between W1 and W2 as

DMI(W1, W2) = |det(Q_{W1,W2})|

where Q_{W1,W2} is the matrix format of the joint distribution over W1 and W2.

DMI is a generalized version of Shannon's mutual information: it preserves all properties of Shannon mutual information, including non-negativity, symmetry and information-monotonicity, and it is additionally relatively invariant. DMI was initially proposed to address a mechanism design problem [16].

Lemma 3.5 (Properties of DMI [16]). DMI is non-negative, symmetric and information-monotone. Moreover, it is relatively invariant: for all random variables W1, W2, W3, when W3 is less informative for W2 than W1, i.e., W3 is independent of W2 conditioning on W1,

DMI(W2, W3) = DMI(W2, W1) |det(T_{W1→W3})|

where T_{W1→W3} is the matrix format of

T_{W1→W3}(w1, w3) = Pr[W3 = w3 | W1 = w1].

Proof. The non-negativity and symmetry follow directly from the definition, so we only need to prove the relative invariance. Note that

Q_{W2,W3}(w2, w3) = ∑_{w1} Q_{W1,W2}(w2, w1) Pr[W3 = w3 | W1 = w1]

as W3 is independent of W2 conditioning on W1. Thus,

Q_{W2,W3} = Q_{W2,W1} T_{W1→W3}

where Q_{W2,W3}, Q_{W2,W1}, T_{W1→W3} are the matrix formats of Q_{W2,W3}, Q_{W2,W1}, T_{W1→W3}, respectively. We have

det(Q_{W2,W3}) = det(Q_{W2,W1}) det(T_{W1→W3})

because of the multiplication property of the determinant (i.e. det(AB) = det(A) det(B) for every two square matrices A, B). Therefore, DMI(W2, W3) = DMI(W2, W1) |det(T_{W1→W3})|.

The relative invariance and the symmetry imply the information-monotonicity of DMI. When W3 is less informative for W2 than W1, i.e., W3 is independent of W2 conditioning on W1,

DMI(W3, W2) = DMI(W2, W3) = DMI(W2, W1) |det(T_{W1→W3})| ≤ DMI(W2, W1) = DMI(W1, W2)

because of the fact that for every square transition matrix T, det(T) ≤ 1 [34].

Based on DMI, an information-theoretic performance measure for each classifier h is naturally defined as DMI(h(X), ˜Y). Under this performance measure, the measurement based on noisy labels DMI(h(X), ˜Y) is consistent with the measurement based on clean labels DMI(h(X), Y), i.e., for every two classifiers h and h′,

DMI(h(X), Y) > DMI(h′(X), Y) ⇔ DMI(h(X), ˜Y) > DMI(h′(X), ˜Y).

4 LDMI: An Information-theoretic Noise-robust Loss Function

4.1 Method overview

Our loss function is defined as

LDMI(Q_{h(X),˜Y}) := − log(DMI(h(X), ˜Y)) = − log(|det(Q_{h(X),˜Y})|)

where Q_{h(X),˜Y} is the joint distribution over h(X), ˜Y and Q_{h(X),˜Y} is the C × C matrix format of Q_{h(X),˜Y}. The randomness of h(X) comes from both the randomness of h and the randomness of X. The log function here resolves many scaling issues². (² ∂(c|det(A)|)/∂A = c|det(A)|(A⁻¹)ᵀ while ∂ log(c|det(A)|)/∂A = (A⁻¹)ᵀ, for every invertible matrix A and every constant c.)

Figure 1 shows the computation of LDMI. In each step of iteration, we sample a batch of datapoints and their noisy labels {(xi, ˜yi)}_{i=1}^N. We denote the outputs of the classifier by a matrix O. Each column of O is a distribution over C, representing an output of the classifier. We denote the noisy labels by a 0-1 matrix L. Each row of L is a one-hot vector, representing a label, i.e.

O_{ci} = h(xi)_c,  L_{i˜c} = 1[˜yi = ˜c].

Figure 1: The computation of LDMI in each step of iteration

We define U := (1/N) O L, i.e.,

U_{c˜c} := (1/N) ∑_{i=1}^N O_{ci} L_{i˜c} = (1/N) ∑_{i=1}^N h(xi)_c 1[˜yi = ˜c].

We have E[U_{c˜c}] = Pr[h(X) = c, ˜Y = ˜c] = Q_{h(X),˜Y}(c, ˜c) (E means expectation, see proof in Supplementary Material B). Thus, U is an empirical estimation of Q_{h(X),˜Y}. By abusing notation a little bit, we define

LDMI({(xi, ˜yi)}_{i=1}^N; h) = − log(|det(U)|)

as the empirical loss function. Our formal training process is shown in Supplementary Material A.

4.2 Theoretical justification

Theorem 4.1 (Main Theorem). With Assumption 3.1 and Assumption 3.2, LDMI is:

legal: if there exists a ground truth classifier h* such that h*(X) = Y, then it must have the lowest loss, i.e., for all classifiers h,

LDMI(Q_{h*(X),˜Y}) ≤ LDMI(Q_{h(X),˜Y})

and the inequality is strict when h(X) is not a permutation of h*(X), i.e., there does not exist a permutation π : C → C s.t. h(x) = π(h*(x)), ∀x ∈ X;

noise-robust: for the set of all possible classifiers H,

arg min_{h∈H} LDMI(Q_{h(X),˜Y}) = arg min_{h∈H} LDMI(Q_{h(X),Y})

and in fact, training using noisy labels is the same as training using clean labels in the dataset except for a constant shift,

LDMI(Q_{h(X),˜Y}) = LDMI(Q_{h(X),Y}) + α;

information-monotone: for every two classifiers h, h′, if h′(X) is less informative for Y than h(X), i.e. h′(X) is independent of Y conditioning on h(X), then

LDMI(Q_{h′(X),˜Y}) ≥ LDMI(Q_{h(X),˜Y}).

Proof. The relative invariance of DMI (Lemma 3.5) implies

DMI(h(X), ˜Y) = DMI(h(X), Y) |det(T_{Y→˜Y})|.

Therefore,

LDMI(Q_{h(X),˜Y}) = LDMI(Q_{h(X),Y}) − log(|det(T_{Y→˜Y})|).

Thus, the information-monotonicity and the noise-robustness of LDMI follow, and the constant α = −log(|det(T_{Y→˜Y})|) ≥ 0. The legal property follows from the information-monotonicity of LDMI, as h*(X) = Y is the most informative random variable for Y itself, and from the fact that for every square transition matrix T, det(T) = 1 if and only if T is a permutation matrix [34].

5 Experiments

We evaluate our method on both synthesized and real-world noisy datasets with different deep neural networks to demonstrate that our method is independent of both architecture and data domain. We call our method DMI and compare it with: CE (the cross entropy loss), FW (the forward loss [29]), GCE (the generalized cross entropy loss [48]) and LCCN (the latent class-conditional noise model [44]). For the synthesized data, noises are added to the training and validation sets, and test accuracy is computed with respect to true labels. For our method, we pick the best learning rate from {1.0×10⁻⁴, 1.0×10⁻⁵, 1.0×10⁻⁶} and the best batch size from {128, 256} based on the minimum validation loss. For other methods, we use the best hyperparameters they provided in similar settings. The classifiers are pretrained with cross entropy loss first. All reported experiments were repeated five times.
We implement all networks and training procedures in Pytorch [28] and conduct all experiments on NVIDIA TITAN Xp GPUs.³ (³ Source codes are available at https://github.com/Newbeeer/L_DMI.) The explicit noise transition matrices are shown in Supplementary Material C. Due to space limit, we defer some additional experiments to Supplementary Material D.

5.1 An explanation experiment on Fashion-MNIST

To compare distance-based and information-theoretic loss functions as we mentioned in the third paragraph of the introduction, we conducted experiments on Fashion-MNIST [42]. It consists of 70,000 28×28 grayscale fashion product images from 10 classes, which is split into a 50,000-image training set, a 10,000-image validation set and a 10,000-image test set. For clean presentation, we only compare our information-theoretic loss function DMI with the distance-based loss function CE here and convert the labels in the dataset to two classes, bags and clothes, to synthesize a highly imbalanced dataset (10% bags, 90% clothes). We use a simple two-layer convolutional neural network as the classifier. Adam with default parameters and a learning rate of 1.0×10⁻⁴ is used as the optimizer during training. Batch size is set to 128.

We synthesize three cases of noise patterns: (1) with probability r, a true label is substituted by a random label through uniform sampling. (2) with probability r, bags → clothes, that is, a true label of the a priori less popular class, "bags", is flipped to the popular one, "clothes". This happens in the real world when the annotators are lazy. (e.g., a careless medical image annotator may be more likely to label "benign" since most images are in the "benign" category.) (3) with probability r, clothes → bags, that is, the a priori more popular class, "clothes", is flipped to the other one, "bags". This happens in the real world when the annotators are risk-averse and there will be smaller adverse effects if the annotators label the image to a certain class. (e.g., a risk-averse medical image annotator may be more likely to label "malignant", since it is usually safer when the annotator is not confident, even if it is less likely a priori.) Note that the parameter 0 ≤ r ≤ 1 in the above three cases also represents the amount of noise. When r = 0, the labels are clean and when r = 1, the labels are totally uninformative. Moreover, in cases (2) and (3), as r increases, the noise pattern changes from diagonally dominant to diagonally non-dominant.

Figure 2: Test accuracy (mean and std. dev.) on Fashion-MNIST.

As we mentioned in the introduction, distance-based loss functions will perform badly when the noise is diagonally non-dominant and the labels are biased to one class, since they prefer the meaningless classifier h0 that always outputs the majority class in the labels (∀x, h0(x) = "clothes", with accuracy 90%, in case (2) and ∀x, h0(x) = "bags", with accuracy 10%, in case (3)). The experiment results match our expectation. CE performs similarly to our DMI for diagonally dominant noises. For diagonally non-dominant noises, however, CE only obtains the meaningless classifier h0 while DMI still performs quite well.

5.2 Experiments on CIFAR-10, Dogs vs. Cats and MR

Figure 3: Test accuracy (mean) on CIFAR-10, Dogs vs. Cats and MR.

CIFAR-10 [1] consists of 60,000 32×32 color images from 10 classes, which is split into a 40,000-image training set, a 10,000-image validation set and a 10,000-image test set. Dogs vs. Cats [2] consists of 25,000 images from 2 classes, dogs and cats, which is split into a 12,500-image training set, a 6,250-image validation set and a 6,250-image test set.
MR [27] consists of 10,662 one-sentence movie reviews from 2 classes, positive and negative, which is split into a 7,676-sentence training set, a 1,919-sentence validation set and a 1,067-sentence test set. We use ResNet-34 [11], VGG-16 [36] and WordCNN [15] as the classifiers for CIFAR-10, Dogs vs. Cats and MR, respectively. SGD with a momentum of 0.9, a weight decay of 1.0×10⁻⁴ and a learning rate of 1.0×10⁻⁵ is used as the optimizer during training for CIFAR-10 and Dogs vs. Cats. Adam with default parameters and a learning rate of 1.0×10⁻⁴ is used as the optimizer during training for MR. Batch size is set to 128. We use per-pixel normalization, horizontal random flips and 32×32 random crops after padding with 4 pixels on each side as data augmentation for images in CIFAR-10 and Dogs vs. Cats. We use the same pre-processing pipeline as in [15] for sentences in MR. Following [44], the noise for CIFAR-10 is added between similar classes, i.e. truck → automobile, bird → airplane, deer → horse, cat → dog, with probability r. The noise for Dogs vs. Cats is added as cat → dog with probability r. The noise for MR is added as positive → negative with probability r.

As shown in Figure 3, our method DMI outperforms all other methods in almost every experiment and its accuracy drops slowly as the noise amount increases. GCE has great performance with diagonally dominant noises but it fails with diagonally non-dominant noises. This phenomenon matches its theory: it assumes that the label noise is diagonally dominant. FW needs to pre-estimate a noise transition matrix before training, and LCCN uses the output of the model to estimate the true labels.
These tasks become harder as the noise amount grows larger, so their performance also drops quickly as the noise amount increases.

5.3 Experiments on Clothing1M

Clothing1M [43] is a large-scale real-world dataset, which consists of 1 million images of clothes collected from shopping websites, with noisy labels from 14 classes assigned by the surrounding text provided by the sellers. It has an additional 14k and 10k clean data for validation and test, respectively. We use ResNet-50 [11] as the classifier and apply random crops of 224×224, random flips, and brightness and saturation adjustment as data augmentation. SGD with a momentum of 0.9 and a weight decay of 1.0×10⁻³ is used as the optimizer during training. We train the classifier with learning rates of 1.0×10⁻⁶ in the first 5 epochs and 0.5×10⁻⁶ in the second 5 epochs. Batch size is set to 256.

Table 1: Test accuracy (mean) on Clothing1M

Method    CE     FW     GCE    LCCN   DMI
Accuracy  68.94  70.83  69.09  71.63  72.46

As shown in Table 1, DMI also outperforms the other methods in the real-world setting.

6 Conclusion and Discussion

We propose a simple yet powerful loss function, LDMI, for training deep neural networks robust to label noise. It is based on a generalized version of mutual information, DMI. We provide theoretical validation of our approach and compare our approach experimentally with previous methods on both synthesized and real-world datasets. To the best of our knowledge, LDMI is the first loss function that is provably robust to instance-independent label noise, regardless of noise pattern and noise amount, and it can be applied to any existing classification neural network straightforwardly without any auxiliary information.

In the experiments, DMI sometimes does not have an advantage when the data is clean and is outperformed by GCE.
GCE performs a training optimization on MAE with some hyperparameters, while sacrificing the robustness a little bit theoretically. A possible future direction is to employ similar training optimizations in our method to improve its performance.

The current paper focuses on the instance-independent noise setting. That is, we assume that, conditioning on the latent ground truth label Y, Ỹ and X are independent. There may exist Y′ ≠ Y such that Ỹ and X are independent conditioning on Y′. Based on our theorem, training using Ỹ is then also the same as training using Y′. However, without any additional assumption, when we only have the conditional independence assumption, no algorithm can distinguish Y′ from Y. Moreover, the information-monotonicity of our loss function guarantees that if Y is more informative than Y′ with respect to X, the best hypothesis learned by our algorithm will be more similar to Y than to Y′. Thus, if we assume that the actual ground truth label Y is the most informative one, then our algorithm can learn to predict Y rather than other Y′s. An interesting future direction is to combine our method with additional assumptions to give a better prediction.

Acknowledgments

We would like to express our thanks for support from the following research grants: 2018AAA0102004, NSFC-61625201, NSFC-61527804.

References

[1] CIFAR-10 and CIFAR-100 datasets. https://www.cs.toronto.edu/~kriz/cifar.html. 2009.

[2] Dogs vs. Cats competition. https://www.kaggle.com/c/dogs-vs-cats. 2013.

[3] J Paul Brooks. Support vector machines with the ramp loss and the hard margin loss. Operations Research, 59(2):467–479, 2011.

[4] Peng Cao, Yilun Xu, Yuqing Kong, and Yizhou Wang. Max-MIG: an information theoretic approach for joint learning from crowds. 2018.

[5] Jiacheng Cheng, Tongliang Liu, Kotagiri Ramamohanarao, and Dacheng Tao.
Learning with bounded instance- and label-dependent label noise. arXiv preprint arXiv:1709.03768, 2017.

[6] Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.

[7] Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under label noise for deep neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[8] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. 2016.

[9] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang, and Masashi Sugiyama. Masking: A new perspective of noisy supervision. In Advances in Neural Information Processing Systems, pages 5836–5846, 2018.

[10] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pages 8527–8537, 2018.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[12] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, pages 10456–10465, 2018.

[13] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

[14] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Regularizing very deep neural networks on corrupted labels.
arXiv preprint arXiv:1712.05055, 2017.

[15] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[16] Yuqing Kong. Dominantly truthful multi-task peer prediction, with constant number of tasks. In ACM-SIAM Symposium on Discrete Algorithms (SODA 2020), to appear.

[17] Yuqing Kong and Grant Schoenebeck. Water from two rocks: Maximizing the mutual information. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 177–194. ACM, 2018.

[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[19] Kuang-Huei Lee, Xiaodong He, Lei Zhang, and Linjun Yang. CleanNet: Transfer learning for scalable image classifier training with label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5447–5456, 2018.

[20] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447–461, 2015.

[21] Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah M Erfani, Shu-Tao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. arXiv preprint arXiv:1806.02612, 2018.

[22] Naresh Manwani and PS Sastry. Noise tolerance under risk minimization. IEEE Transactions on Cybernetics, 43(3):1146–1151, 2013.

[23] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In Advances in Neural Information Processing Systems, pages 1049–1056, 2009.

[24] Aditya Krishna Menon, Brendan Van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instance-dependent corruption.
arXiv preprint arXiv:1605.00751, 2016.

[25] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2018.

[26] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196–1204, 2013.

[27] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 115–124. Association for Computational Linguistics, 2005.

[28] A Paszke, S Gross, S Chintala, and G Chanan. Tensors and dynamic neural networks in Python with strong GPU acceleration, 2017.

[29] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.

[30] Harish Ramaswamy, Clayton Scott, and Ambuj Tewari. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, pages 2052–2060, 2016.

[31] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.

[32] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.

[33] Clayton Scott. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels.
In Artificial Intelligence and Statistics, pages 838–846, 2015.

[34] Eugene Seneta. Non-negative Matrices and Markov Chains. Springer Science & Business Media, 2006.

[35] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.

[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[37] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.

[38] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5552–5560, 2018.

[39] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pages 5596–5605, 2017.

[40] Brendan Van Rooyen, Aditya Menon, and Robert C Williamson. Learning with symmetric label noise: The importance of being unhinged. In Advances in Neural Information Processing Systems, pages 10–18, 2015.

[41] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 839–847, 2017.

[42] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

[43] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.

[44] Jiangchao Yao, Hao Wu, Ya Zhang, Ivor W Tsang, and Jun Sun. Safeguarded dynamic label regression for noisy supervision. 2019.

[45] Kun Yi and Jianxin Wu. Probabilistic end-to-end noise correction for learning with noisy labels. arXiv preprint arXiv:1903.07788, 2019.

[46] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

[47] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[48] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, pages 8778–8788, 2018.