{"title": "Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels", "book": "Advances in Neural Information Processing Systems", "page_first": 8778, "page_last": 8788, "abstract": "Deep neural networks (DNNs) have achieved tremendous success in a variety of applications across many disciplines. Yet, their superior performance comes with the expensive cost of requiring correctly annotated large-scale datasets. Moreover, due to DNNs' rich capacity, errors in training labels can hamper performance. To combat this problem, mean absolute error (MAE) has recently been proposed as a noise-robust alternative to the commonly-used categorical cross entropy (CCE) loss. However, as we show in this paper, MAE can perform poorly with DNNs and large-scale datasets. Here, we present a theoretically grounded set of noise-robust loss functions that can be seen as a generalization of MAE and CCE. Proposed loss functions can be readily applied with any existing DNN architecture and algorithm, while yielding good performance in a wide range of noisy label scenarios. We report results from experiments conducted with CIFAR-10, CIFAR-100 and FASHION-MNIST datasets and synthetically generated noisy labels.", "full_text": "Generalized Cross Entropy Loss for Training Deep\n\nNeural Networks with Noisy Labels\n\nZhilu Zhang\nMert R. Sabuncu\nElectrical and Computer Engineering\n\nMeinig School of Biomedical Engineering\n\nCornell University\n\nzz452@cornell.edu, msabuncu@cornell.edu\n\nAbstract\n\nDeep neural networks (DNNs) have achieved tremendous success in a variety of\napplications across many disciplines. Yet, their superior performance comes with\nthe expensive cost of requiring correctly annotated large-scale datasets. Moreover,\ndue to DNNs\u2019 rich capacity, errors in training labels can hamper performance. 
To\ncombat this problem, mean absolute error (MAE) has recently been proposed as\na noise-robust alternative to the commonly-used categorical cross entropy (CCE)\nloss. However, as we show in this paper, MAE can perform poorly with DNNs and\nchallenging datasets. Here, we present a theoretically grounded set of noise-robust\nloss functions that can be seen as a generalization of MAE and CCE. The proposed loss\nfunctions can be readily applied with any existing DNN architecture and algorithm,\nwhile yielding good performance in a wide range of noisy label scenarios. We report\nresults from experiments conducted with CIFAR-10, CIFAR-100 and FASHION-MNIST\ndatasets and synthetically generated noisy labels.\n\n1 Introduction\n\nThe resurrection of neural networks in recent years, together with the recent emergence of large\nscale datasets, has enabled super-human performance on many classification tasks [21, 28, 30].\nHowever, supervised DNNs often require a large number of training samples to achieve a high level\nof performance. For instance, the ImageNet dataset [6] has 3.2 million hand-annotated images.\nAlthough crowdsourcing platforms like Amazon Mechanical Turk have made large-scale annotation\npossible, some error during the labeling process is often inevitable, and mislabeled samples can\nimpair the performance of models trained on these data. Indeed, the sheer capacity of DNNs to\nmemorize massive data with completely randomly assigned labels [42] proves their susceptibility to\noverfitting when trained with noisy labels. Hence, an algorithm that is robust against noisy labels\nfor DNNs is needed to resolve the potential problem. Furthermore, when examples are cheap and\naccurate annotations are expensive, it can be more beneficial to have datasets with more but noisier\nlabels than fewer but more accurate labels [18].\nClassification with noisy labels is a widely studied topic [8]. 
Yet, relatively little attention is given\nto directly formulating a noise-robust loss function in the context of DNNs. Our work is motivated\nby Ghosh et al. [9] who theoretically showed that mean absolute error (MAE) can be robust against\nnoisy labels under certain assumptions. However, as we demonstrate below, the robustness of MAE\ncan concurrently cause increased difficulty in training, and lead to a drop in performance. This limitation\nis particularly evident when using DNNs on complicated datasets. To combat this drawback, we\nadvocate the use of a more general class of noise-robust loss functions, which encompass both MAE\nand CCE. Compared to previous methods for DNNs, which often involve extra steps and algorithmic\nmodifications, changing only the loss function requires minimal intervention to existing architectures\nand algorithms, and thus can be promptly applied. Furthermore, unlike most existing methods, the\nproposed loss functions work for both closed-set and open-set noisy labels [40]. Open-set refers to\nthe situation where samples associated with erroneous labels do not always belong to a ground truth\nclass contained within the set of known classes in the training data. Conversely, closed-set means that\nall labels (erroneous and correct) come from a known set of labels present in the dataset.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThe main contributions of this paper are two-fold. First, we propose a novel generalization of CCE\nand present a theoretical analysis of the proposed loss functions in the context of noisy labels.\nSecond, we report a thorough empirical evaluation of the proposed loss functions using CIFAR-10,\nCIFAR-100 and FASHION-MNIST datasets, and demonstrate significant improvement in terms of\nclassification accuracy over the baselines of MAE and CCE, under both closed-set and open-set noisy\nlabels.\nThe rest of the paper is organized as follows. 
Section 2 discusses existing approaches to the problem.\nSection 3 introduces our noise-robust loss functions. Section 4 presents and analyzes the experiments\nand results. Finally, Section 5 concludes our paper.\n\n2 Related Work\n\nNumerous methods have been proposed for learning with noisy labels with DNNs in recent years.\nHere, we briefly review the relevant literature. Firstly, Sukhbaatar and Fergus [35] proposed accounting\nfor noisy labels with a confusion matrix so that the cross entropy loss becomes\n\n$$\\mathcal{L}(\\theta) = \\frac{1}{N} \\sum_{n=1}^{N} -\\log p(\\tilde{y} = \\tilde{y}_n \\mid x_n, \\theta) = \\frac{1}{N} \\sum_{n=1}^{N} -\\log\\Big( \\sum_{i=1}^{c} p(\\tilde{y} = \\tilde{y}_n \\mid y = i)\\, p(y = i \\mid x_n, \\theta) \\Big), \\tag{1}$$\n\nwhere $c$ represents the number of classes, $\\tilde{y}$ represents noisy labels, $y$ represents the latent true labels\nand $p(\\tilde{y} = \\tilde{y}_n \\mid y = i)$ is the $(\\tilde{y}_n, i)$\u2019th component of the confusion matrix. Usually, the real confusion\nmatrix is unknown. Several methods have been proposed to estimate it [11, 14, 32, 17, 12]. Yet,\naccurate estimations can be hard to obtain. Even with the real confusion matrix, training with the\nabove loss function might be suboptimal for DNNs. Assuming (1) a DNN with enough capacity\nto memorize the training set, and (2) a confusion matrix that is diagonally dominant, minimizing\nthe cross entropy with confusion matrix is equivalent to minimizing the original CCE loss. This\nis because the right hand side of Eq. 1 is minimized when $p(y = i \\mid x_n, \\theta) = 1$ for $i = \\tilde{y}_n$ and 0\notherwise, $\\forall n$.\nIn the context of support vector machines, several theoretically motivated noise-robust loss functions\nlike the ramp loss, the unhinged loss and the savage loss have been introduced [5, 38, 27]. More\ngenerally, Natarajan et al. [29] presented a way to modify any given surrogate loss function for binary\nclassification to achieve noise-robustness. However, little attention is given to alternative noise-robust\nloss functions for DNNs. Ghosh et al. 
[10, 9] proved and empirically demonstrated that MAE is\nrobust against noisy labels. This paper can be seen as an extension and generalization of their work.\nAnother popular approach attempts to clean up noisy labels. Veit et al. [39] suggested using\na label cleaning network in parallel with a classification network to achieve more noise-robust\nprediction. However, their method requires a small set of clean labels. Alternatively, one could\ngradually replace noisy labels by neural network predictions [33, 36]. Rather than using predictions\nfor training, Northcutt et al. [31] proposed to prune the correct samples based on softmax outputs. As\nwe demonstrate below, this is similar to one of our approaches. Instead of pruning the dataset once,\nour algorithm iteratively prunes the dataset while training until convergence.\nOther approaches include treating the true labels as a latent variable and the noisy labels as an\nobserved variable so that EM-like algorithms can be used to learn the true label distribution of the dataset\n[41, 18, 37]. Techniques to re-weight confident samples have also been proposed. Jiang et al. [16]\nused an LSTM network on top of a classification model to learn the optimal weights on each sample,\nwhile Ren et al. [34] used a small clean dataset and put more weight on noisy samples which have\ngradients closer to that of the clean dataset. In the context of binary classification, Liu et al. [24]\nderived an optimal importance weighting scheme for noise-robust classification. Our method can\nalso be viewed as re-weighting individual samples; instead of explicitly obtaining weights, we use\nthe softmax outputs at each iteration as the weightings. Lastly, Azadi et al. [2] proposed a regularizer\nthat encourages the model to select reliable samples for noise-robustness. Another method that uses\nknowledge distillation for noisy labels has also been proposed [23]. 
Both of these methods also\nrequire a smaller clean dataset to work.\n\n3 Generalized Cross Entropy Loss for Noise-Robust Classifications\n\n3.1 Preliminaries\n\nWe consider the problem of $c$-class classification. Let $\\mathcal{X} \\subset \\mathbb{R}^d$ be the feature space and $\\mathcal{Y} = \\{1, \\cdots, c\\}$\nbe the label space. In an ideal scenario, we are given a clean dataset $D = \\{(x_i, y_i)\\}_{i=1}^{n}$,\nwhere each $(x_i, y_i) \\in (\\mathcal{X} \\times \\mathcal{Y})$. A classifier is a function that maps input feature space to the label\nspace $f : \\mathcal{X} \\to \\mathbb{R}^c$. In this paper, we consider the common case where the function is a DNN with\nthe softmax output layer. For any loss function $\\mathcal{L}$, the (empirical) risk of the classifier $f$ is defined\nas $R_{\\mathcal{L}}(f) = \\mathbb{E}_D[\\mathcal{L}(f(x), y_x)]$, where the expectation is over the empirical distribution. The most\ncommonly used loss for classification is cross entropy. In this case, the risk becomes:\n\n$$R_{\\mathcal{L}}(f) = \\mathbb{E}_D[\\mathcal{L}(f(x; \\theta), y_x)] = -\\frac{1}{n} \\sum_{i=1}^{n} \\sum_{j=1}^{c} y_{ij} \\log f_j(x_i; \\theta), \\tag{2}$$\n\nwhere $\\theta$ is the set of parameters of the classifier, $y_{ij}$ corresponds to the $j$\u2019th element of the one-hot\nencoded label of the sample $x_i$, $y_i = e_{y_i} \\in \\{0, 1\\}^c$ such that $\\mathbf{1}^\\top y_i = 1 \\; \\forall i$, and $f_j$ denotes the $j$\u2019th\nelement of $f$. Note that $\\sum_{j=1}^{c} f_j(x_i; \\theta) = 1$, and $f_j(x_i; \\theta) \\geq 0, \\forall j, i, \\theta$, since the output layer is a\nsoftmax. The parameters of the DNN can be optimized with empirical risk minimization.\nWe denote a dataset with label noise by $D_\\eta = \\{(x_i, \\tilde{y}_i)\\}_{i=1}^{n}$ where the $\\tilde{y}_i$\u2019s are the noisy labels with\nrespect to each sample such that $p(\\tilde{y}_i = k \\mid y_i = j, x_i) = \\eta_{jk}^{(x_i)}$. In this paper, we make the common\nassumption that noise is conditionally independent of inputs given the true labels so that\n\n$$p(\\tilde{y}_i = k \\mid y_i = j, x_i) = p(\\tilde{y}_i = k \\mid y_i = j) = \\eta_{jk}.$$\n\nIn general, this noise is defined to be class dependent. Noise is uniform with noise rate $\\eta$, if\n$\\eta_{jk} = 1 - \\eta$ for $j = k$, and $\\eta_{jk} = \\frac{\\eta}{c-1}$ for $j \\neq k$. 
The risk of the classifier with respect to the noisy dataset\nis then defined as $R^{\\eta}_{\\mathcal{L}}(f) = \\mathbb{E}_{D_\\eta}[\\mathcal{L}(f(x), \\tilde{y}_x)]$.\nLet $f^*$ be the global minimizer of the risk $R_{\\mathcal{L}}(f)$. Then, the empirical risk minimization under loss\nfunction $\\mathcal{L}$ is defined to be noise tolerant [26] if $f^*$ is a global minimum of the noisy risk $R^{\\eta}_{\\mathcal{L}}(f)$.\nA loss function is called symmetric if, for some constant $C$,\n\n$$\\sum_{j=1}^{c} \\mathcal{L}(f(x), j) = C, \\quad \\forall x \\in \\mathcal{X}, \\; \\forall f. \\tag{3}$$\n\nThe main contribution of Ghosh et al. [10] is the proof that if the loss function is symmetric and\n$\\eta < \\frac{c-1}{c}$, then under uniform label noise, for any $f$, $R^{\\eta}_{\\mathcal{L}}(f^*) - R^{\\eta}_{\\mathcal{L}}(f) \\leq 0$. Hence, $f^*$ is also the\nglobal minimizer for $R^{\\eta}_{\\mathcal{L}}$ and $\\mathcal{L}$ is noise tolerant. Moreover, if $R_{\\mathcal{L}}(f^*) = 0$, then $\\mathcal{L}$ is also noise\ntolerant under class dependent noise.\nBeing a nonsymmetric and unbounded loss function, CCE is sensitive to label noise. On the contrary,\nMAE, as a symmetric loss function, is noise robust. For DNNs with a softmax output layer, MAE can\nbe computed as:\n\n$$\\mathcal{L}_{MAE}(f(x), e_j) = \\| e_j - f(x) \\|_1 = 2 - 2 f_j(x). \\tag{4}$$\n\nWith this particular configuration of DNN, the proposed MAE loss is, up to a constant of proportionality,\nthe same as the unhinged loss $\\mathcal{L}_{unh}(f(x), e_j) = 1 - f_j(x)$ [38].\n\n3.2 Lq Loss for Classification\n\nIn this section, we will argue that MAE has some drawbacks as a classification loss function for\nDNNs, which are normally trained on large scale datasets using stochastic gradient based techniques.\nLet\u2019s look at the gradient of the loss functions:\n\n$$\\sum_{i=1}^{n} \\frac{\\partial \\mathcal{L}(f(x_i; \\theta), y_i)}{\\partial \\theta} = \\begin{cases} \\sum_{i=1}^{n} -\\frac{1}{f_{y_i}(x_i; \\theta)} \\nabla_\\theta f_{y_i}(x_i; \\theta) & \\text{for CCE} \\\\ \\sum_{i=1}^{n} -\\nabla_\\theta f_{y_i}(x_i; \\theta) & \\text{for MAE/unhinged loss.} \\end{cases} \\tag{5}$$\n\nFigure 1: (a), (b) Test accuracy against number of epochs for training with CCE (orange) and MAE\n(blue) loss on clean data with (a) CIFAR-10 and (b) CIFAR-100 datasets. 
(c) Average softmax\nprediction for correctly (solid) and wrongly (dashed) labeled training samples, for CCE (orange) and\nLq ($q = 0.7$, blue) loss on CIFAR-10 with uniform noise ($\\eta = 0.4$).\n\nThus, in CCE, samples with softmax outputs that are less congruent with provided labels, and hence\nsmaller $f_{y_i}(x_i; \\theta)$ or larger $1/f_{y_i}(x_i; \\theta)$, are implicitly weighted more than samples with predictions\nthat agree more with provided labels in the gradient update. This means that, when training with CCE,\nmore emphasis is put on difficult samples. This implicit weighting scheme is desirable for training\nwith clean data, but can cause overfitting to noisy labels. Conversely, since the $1/f_{y_i}(x_i; \\theta)$ term is\nabsent in its gradient, MAE treats every sample equally, which makes it more robust to noisy labels.\nHowever, as we demonstrate empirically, this can lead to significantly longer training time before\nconvergence. Moreover, without the implicit weighting scheme to focus on challenging samples, the\nstochasticity involved in the training process can make learning difficult. As a result, classification\naccuracy might suffer.\nTo demonstrate this, we conducted a simple experiment using ResNet [13] optimized with the default\nsetting of Adam [19] on the CIFAR datasets [20]. Fig. 1(a) shows the test accuracy curve when trained\nwith CCE and MAE respectively on CIFAR-10. As illustrated clearly, it took significantly longer\nto converge when trained with MAE. In agreement with our analysis, there was also a compromise\nin classification accuracy due to the increased difficulty of learning useful features. These adverse\neffects become much more severe when using a more difficult dataset, such as CIFAR-100 (see\nFig. 1(b)). Not only do we observe significantly slower convergence, but also a substantial drop in test\naccuracy when using MAE. 
In fact, the maximum test accuracy achieved after 2000 epochs, a long\ntime after training using CCE has converged, was 38.29%, while CCE achieved a higher accuracy\nof 39.92% after merely 7 epochs! Despite its theoretical noise-robustness, due to the shortcoming\nduring training induced by its noise-robustness, we conclude that MAE is not suitable for DNNs with\nchallenging datasets like ImageNet.\nTo exploit the benefits of both the noise-robustness provided by MAE and the implicit weighting\nscheme of CCE, we propose using the negative Box-Cox transformation [4] as a loss function:\n\n$$\\mathcal{L}_q(f(x), e_j) = \\frac{(1 - f_j(x)^q)}{q}, \\tag{6}$$\n\nwhere $q \\in (0, 1]$. Using L\u2019H\u00f4pital\u2019s rule, it can be shown that the proposed loss function is equivalent\nto CCE in the limit $\\lim_{q \\to 0} \\mathcal{L}_q(f(x), e_j)$, and becomes MAE/unhinged loss when $q = 1$. Hence, this loss is\na generalization of CCE and MAE. Relatedly, Ferrari and Yang [7] viewed the maximization of Eq. 6\nas a generalization of maximum likelihood and termed the loss function Lq, which we also adopt.\nTheoretically, for any input $x$, the sum of Lq loss with respect to all classes is bounded by:\n\n$$\\frac{c - c^{(1-q)}}{q} \\leq \\sum_{j=1}^{c} \\frac{(1 - f_j(x)^q)}{q} \\leq \\frac{c - 1}{q}. \\tag{7}$$\n\nUsing this bound and under uniform noise with $\\eta \\leq 1 - \\frac{1}{c}$, we can show (see Appendix)\n\n$$A \\leq (R_{\\mathcal{L}_q}(f^*) - R_{\\mathcal{L}_q}(\\hat{f})) \\leq 0, \\tag{8}$$\n\nwhere $A = \\frac{\\eta[1 - c^{(1-q)}]}{q(c - 1 - \\eta c)} < 0$, $f^*$ is the global minimizer of $R_{\\mathcal{L}_q}(f)$, and $\\hat{f}$ is the global minimizer\nof $R^{\\eta}_{\\mathcal{L}_q}(f)$. The larger the $q$, the larger the constant $A$, and the tighter the bound of Eq. 8. In the\nextreme case of $q = 1$ (i.e., for MAE), $A = 0$ and $R_{\\mathcal{L}_q}(\\hat{f}) = R_{\\mathcal{L}_q}(f^*)$. In other words, for $q$ values\napproaching 1, the optimum of the noisy risk will yield a risk value (on the clean data) that is close to\nthat of $f^*$, which implies noise tolerance. It can also be shown that the difference $(R^{\\eta}_{\\mathcal{L}_q}(f^*) - R^{\\eta}_{\\mathcal{L}_q}(\\hat{f}))$ is\nbounded under class dependent noise, provided $R_{\\mathcal{L}_q}(f^*) = 0$ and $q_{ij} < q_{ii} \\; \\forall i \\neq j$ (see Thm 2 in\nAppendix).\nThe compromise on noise-robustness when using Lq over MAE prompts an easier learning process.\nLet\u2019s look at the gradients of Lq loss to see this:\n\n$$\\frac{\\partial \\mathcal{L}_q(f(x_i; \\theta), y_i)}{\\partial \\theta} = f_{y_i}(x_i; \\theta)^q \\Big( -\\frac{1}{f_{y_i}(x_i; \\theta)} \\nabla_\\theta f_{y_i}(x_i; \\theta) \\Big) = -f_{y_i}(x_i; \\theta)^{q-1} \\nabla_\\theta f_{y_i}(x_i; \\theta),$$\n\nwhere $f_{y_i}(x_i; \\theta) \\in [0, 1] \\; \\forall i$ and $q \\in (0, 1)$. Thus, relative to CCE, Lq loss weighs each sample by an\nadditional $f_{y_i}(x_i; \\theta)^q$ so that less emphasis is put on samples with weak agreement between softmax\noutputs and the labels, which should improve robustness against noise. Relative to MAE, a weighting\nof $f_{y_i}(x_i; \\theta)^{q-1}$ on each sample can facilitate learning by giving more attention to challenging\ndatapoints with labels that do not agree with the softmax outputs. On one hand, larger $q$ leads to a\nmore noise-robust loss function. On the other hand, too large a $q$ can make optimization strenuous.\nHence, as we will demonstrate empirically below, it is practically useful to set $q$ between 0 and 1,\nwhere a tradeoff equilibrium is achieved between noise-robustness and better learning dynamics.\n\n3.3 Truncated Lq Loss\n\nSince a tighter bound in $\\sum_{j=1}^{c} \\mathcal{L}(f(x), j)$ would imply stronger noise tolerance, we propose the\ntruncated Lq loss:\n\n$$\\mathcal{L}_{trunc}(f(x), e_j) = \\begin{cases} \\mathcal{L}_q(k) & \\text{if } f_j(x) \\leq k \\\\ \\mathcal{L}_q(f(x), e_j) & \\text{if } f_j(x) > k \\end{cases} \\tag{9}$$\n\nwhere $0 < k < 1$, and $\\mathcal{L}_q(k) = (1 - k^q)/q$. Note that, when $k \\to 0$, the truncated Lq loss becomes\nthe normal Lq loss. Assuming $k \\geq 1/c$, the sum of truncated Lq loss with respect to all classes is\nbounded by (see Appendix):\n\n$$d \\mathcal{L}_q(\\tfrac{1}{d}) + (c - d) \\mathcal{L}_q(k) \\leq \\sum_{j=1}^{c} \\mathcal{L}_{trunc}(f(x), e_j) \\leq c \\mathcal{L}_q(k), \\tag{10}$$\n\nwhere $d = \\max(1, \\frac{(1-q)^{1/q}}{k})$. It can be verified that the difference between upper and lower bounds\nfor the truncated Lq loss, $\\mathcal{L}_q(k)$, is smaller than that for the Lq loss of Eq. 7, if\n\n$$d[\\mathcal{L}_q(k) - \\mathcal{L}_q(\\tfrac{1}{d})] < \\frac{c^{(1-q)} - 1}{q}. \\tag{11}$$\n\nAs an example, when $k \\geq 0.3$, the above inequality is satisfied for all $q$ and $c$. When $k \\geq 0.2$, the\ninequality is satisfied for all $q$ and $c \\geq 10$. Since the derived bounds in Eq. 7 and Eq. 10 are tight,\nintroducing the threshold $k$ can thus lead to a more noise tolerant loss function.\nIf the softmax output for the provided label is below a threshold, truncated Lq loss becomes a constant.\nThus, the loss gradient is zero for that sample, and it does not contribute to learning dynamics. While\nEq. 10 suggests that a larger threshold $k$ leads to tighter bounds and hence more noise-robustness,\ntoo large a threshold would cause too many samples to be discarded from training. Ideally, we\nwould want the algorithm to train with all available clean data and ignore noisy labels. Thus the\noptimal choice of $k$ would depend on the noise in the labels. Hence, $k$ can be treated as a (bounded)\nhyper-parameter and optimized. In our experiments, we set $k = 0.5$, which yields a tighter bound for\nthe truncated Lq loss and which we observed to work well empirically.\nA potential problem arises when training directly with this loss function. When the threshold is\nrelatively large (e.g., $k = 0.5$ in a 10-class classification problem), at the beginning of the training\nphase, most of the softmax outputs can be significantly smaller than $k$, resulting in a dramatic drop\nin the number of effective samples. Moreover, it is suboptimal to prune samples based on softmax\nvalues at the beginning of training. 
To circumvent the problem, observe that, by definition of the\ntruncated Lq loss:\n\n$$\\operatorname*{argmin}_{\\theta} \\sum_{i=1}^{n} \\mathcal{L}_{trunc}(f(x_i; \\theta), y_i) = \\operatorname*{argmin}_{\\theta} \\sum_{i=1}^{n} v_i \\mathcal{L}_q(f(x_i; \\theta), y_i) + (1 - v_i) \\mathcal{L}_q(k), \\tag{12}$$\n\nwhere $v_i = 0$ if $f_{y_i}(x_i) \\leq k$ and $v_i = 1$ otherwise, and $\\theta$ represents the parameters of the classifier.\nOptimizing the above loss is the same as optimizing the following:\n\n$$\\operatorname*{argmin}_{\\theta} \\sum_{i=1}^{n} v_i \\mathcal{L}_q(f(x_i; \\theta), y_i) - v_i \\mathcal{L}_q(k) = \\operatorname*{argmin}_{\\theta, w \\in [0,1]^n} \\sum_{i=1}^{n} w_i \\mathcal{L}_q(f(x_i; \\theta), y_i) - \\mathcal{L}_q(k) \\sum_{i=1}^{n} w_i, \\tag{13}$$\n\nbecause for any $\\theta$, the optimal $w_i$ is 1 if $\\mathcal{L}_q(f(x_i; \\theta), y_i) \\leq \\mathcal{L}_q(k)$ and 0 if $\\mathcal{L}_q(f(x_i; \\theta), y_i) > \\mathcal{L}_q(k)$.\nHence, we can optimize the truncated Lq loss by optimizing the right hand side of Eq. 13. If $\\mathcal{L}_q$ is\nconvex with respect to the parameters $\\theta$, optimizing Eq. 13 is a biconvex optimization problem, and\nthe alternate convex search (ACS) algorithm [3] can be used to find the global minimum. ACS\niteratively optimizes $\\theta$ and $w$ while keeping the other set of parameters fixed. Despite the high\nnon-convexity of DNNs, we can apply ACS to find a local minimum. We refer to the update of $w$ as\n\"pruning\". At every step of iteration, pruning can be carried out easily by computing $f(x_i; \\theta^{(t)})$ for\nall training samples. Only samples with $f_{y_i}(x_i; \\theta^{(t)}) \\geq k$ and $\\mathcal{L}_q(f(x_i; \\theta), y_i) \\leq \\mathcal{L}_q(k)$ are kept for\nupdating $\\theta$ during that iteration (and hence $w_i = 1$). The additional computational complexity from\nthe pruning steps is negligible. 
Interestingly, the resulting algorithm is similar to that of self-paced\nlearning [22].\n\nAlgorithm 1 ACS for Training with Lq Loss\n\nInput: Noisy dataset $D_\\eta$, total iterations $T$, threshold $k$\nInitialize $w_i^{(0)} = 1 \\; \\forall i$\nUpdate $\\theta^{(0)} = \\operatorname*{argmin}_{\\theta} \\sum_{i=1}^{n} w_i^{(0)} \\mathcal{L}_q(f(x_i; \\theta), y_i) - \\mathcal{L}_q(k) \\sum_{i=1}^{n} w_i^{(0)}$\nwhile $t < T$ do\n    Update $w^{(t)} = \\operatorname*{argmin}_{w} \\sum_{i=1}^{n} w_i \\mathcal{L}_q(f(x_i; \\theta^{(t-1)}), y_i) - \\mathcal{L}_q(k) \\sum_{i=1}^{n} w_i$ [Pruning Step]\n    Update $\\theta^{(t)} = \\operatorname*{argmin}_{\\theta} \\sum_{i=1}^{n} w_i^{(t)} \\mathcal{L}_q(f(x_i; \\theta), y_i) - \\mathcal{L}_q(k) \\sum_{i=1}^{n} w_i^{(t)}$\nOutput: $\\theta^{(T)}$\n\n4 Experiments\n\nThe following setup applies to all of the experiments conducted. Noisy datasets were produced by\nartificially corrupting true labels. 10% of the training data was retained for validation. To realistically\nmimic a noisy dataset while justifiably analyzing the performance of the proposed loss function, only\nthe training and validation data were contaminated, and test accuracies were computed with respect\nto true labels. A mini-batch size of 128 was used. All networks used ReLUs in the hidden layers\nand softmax layers at the output. All reported experiments were repeated five times with random\ninitialization of neural network parameters and randomly generated noisy labels each time. We\ncompared the proposed functions with CCE, MAE and also the confusion matrix-corrected CCE, as\nshown in Eq. 1. Following [32], we term this \"forward correction\". All experiments were conducted\nwith identical optimization procedures and architectures, changing only the loss functions.\n\n4.1 Toward a Better Understanding of Lq Loss\n\nTo better grasp the behavior of Lq loss, we implemented different values of $q$ and uniform noise\nat different noise levels, and trained ResNet-34 with the default setting of Adam on CIFAR-10.\nAs shown in Fig. 2, when trained on clean dataset, increasing $q$ not only slowed down the rate of\nconvergence, but also lowered the classification accuracy. 
More interesting phenomena appeared\nwhen trained on noisy data. When CCE ($q = 0$) was used, the classifier first learned predictive\npatterns, presumably from the noise-free labels, before overfitting strongly to the noisy labels, in\nagreement with Arpit et al.\u2019s observations [1]. Training with increased $q$ values delayed overfitting\nand attained higher classification accuracies. One interpretation of this behavior is that the classifier\ncould learn more about predictive features before overfitting. This interpretation is supported by our\nplot of the average softmax values with respect to the correctly and wrongly labeled samples on the\ntraining set for CCE and Lq ($q = 0.7$) loss, and with 40% uniform noise (Fig. 1(c)). For CCE, the\naverage softmax for wrongly labeled samples remained small at the beginning, but grew quickly when\nthe model started overfitting. Lq loss, on the other hand, resulted in significantly smaller softmax\nvalues for wrongly labeled data. This observation further serves as an empirical justification for the\nuse of truncated Lq loss as described in Section 3.3.\n\nFigure 2: The test accuracy and validation loss against number of epochs for training with Lq loss at\ndifferent values of $q$. (a) and (d): $\\eta = 0.0$; (b) and (e): $\\eta = 0.2$; (c) and (f): $\\eta = 0.6$.\n\nWe also observed that there was a threshold of $q$ beyond which overfitting never kicked in before\nconvergence. When $\\eta = 0.2$ for instance, training with Lq loss with $q = 0.8$ produced an overfitting-free\ntraining process. Empirically, we noted that the noisier the data, the larger this threshold is.\nHowever, too large a $q$ hampers the classification accuracy, and thus a larger $q$ is not always\npreferred. In general, $q$ can be treated as a hyper-parameter that can be optimized, say via monitoring\nvalidation accuracy. 
In remaining experiments, we used q = 0.7, which yielded a good compromise\nbetween fast convergence and noise robustness (no over\ufb01tting was observed for \u2318 \uf8ff 0.5).\n4.2 Datasets\nCIFAR-10/CIFAR-100: ResNet-34 was used as the classi\ufb01er optimized with the loss functions\nmentioned above. Per-pixel mean subtraction, horizontal random \ufb02ip and 32 \u21e5 32 random crops after\npadding with 4 pixels on each side was performed as data preprocessing and augmentation. Following\n[15], we used stochastic gradient descent (SGD) with 0.9 momentum, a weight decay of 104 and\nlearning rate of 0.01, and divided it by 10 after 40 and 80 epochs (120 in total) for CIFAR-10, and\nafter 80 and 120 (150 in total) for CIFAR-100. To ensure a fair comparison, the identical optimization\nscheme was used for truncated Lq loss. We trained with the entire dataset for the \ufb01rst 40 epochs for\nCIFAR-10 and 80 for CIFAR-100, and started pruning and training with the pruned dataset afterwards.\nPruning was done every 10 epochs. To prevent over\ufb01tting, we used the model at the optimal epoch\n\n7\n\n\fTable 1: Average test accuracy and standard deviation (5 runs) on experiments with closed-set noise.\nWe report accuracies of the epoch where validation accuracy is maximum. Forward T and \u02c6T represent\nforward correction with the true and estimated confusion matrices, respectively [32]. q = 0.7 was\nused for all experiments with Lq loss and truncated Lq loss. 
Best 2 accuracies are bold faced.\n\nDatasets\n\nLoss Functions\n\nFASHION\nMNIST\n\nCIFAR-10\n\nCIFAR-100\n\nCCE\nMAE\n\nForward T\nForward \u02c6T\nLq\nTrunc Lq\nCCE\nMAE\n\nForward T\nForward \u02c6T\nLq\nTrunc Lq\nCCE\nMAE\n\nForward T\nForward \u02c6T\nTrunc Lq\n\nLq\n\n0.2\n\n93.24 \u00b1 0.12\n80.39 \u00b1 4.68\n93.64 \u00b1 0.12\n93.26 \u00b1 0.10\n93.35 \u00b1 0.09\n93.21 \u00b1 0.05\n86.98 \u00b1 0.44\n83.72 \u00b1 3.84\n88.63 \u00b1 0.14\n87.99 \u00b1 0.36\n89.83 \u00b1 0.20\n89.7 \u00b1 0.11\n58.72 \u00b1 0.26\n15.80 \u00b1 1.38\n63.16 \u00b1 0.37\n39.19 \u00b1 2.61\n66.81 \u00b1 0.42\n67.61 \u00b1 0.18\n\nUniform Noise\nNoise Rate \u2318\n\n0.4\n\n0.6\n\n92.09 \u00b1 0.18\n79.30 \u00b1 6.20\n92.69 \u00b1 0.20\n92.24 \u00b1 0.15\n92.58 \u00b1 0.11\n92.60 \u00b1 0.17\n81.88 \u00b1 0.29\n67.00 \u00b1 4.45\n85.07 \u00b1 0.29\n83.25 \u00b1 0.38\n87.13 \u00b1 0.22\n87.62 \u00b1 0.26\n48.20 \u00b1 0.65\n9.03 \u00b1 1.54\n54.65 \u00b1 0.88\n31.05 \u00b1 1.44\n61.77 \u00b1 0.24\n62.64 \u00b1 0.33\n\n90.29 \u00b1 0.35\n82.41 \u00b1 5.29\n91.16 \u00b1 0.16\n90.54 \u00b1 0.10\n91.30 \u00b1 0.20\n91.56 \u00b1 0.16\n74.14 \u00b1 0.56\n64.21 \u00b1 5.28\n79.12 \u00b1 0.64\n74.96 \u00b1 0.65\n82.54 \u00b1 0.23\n82.70 \u00b1 0.23\n37.41 \u00b1 0.94\n7.74 \u00b1 1.48\n44.62 \u00b1 0.82\n19.12 \u00b1 1.95\n53.16 \u00b1 0.78\n54.04 \u00b1 0.56\n\n0.8\n\n86.20 \u00b1 0.68\n74.73 \u00b1 5.26\n87.59 \u00b1 0.35\n85.57 \u00b1 0.86\n88.01 \u00b1 0.22\n88.33 \u00b1 0.38\n53.82 \u00b1 1.04\n38.63 \u00b1 2.62\n64.30 \u00b1 0.70\n54.64 \u00b1 0.44\n64.07 \u00b1 1.38\n67.92 \u00b1 0.60\n18.10 \u00b1 0.82\n3.76 \u00b1 0.27\n24.83 \u00b1 0.71\n8.99 \u00b1 0.58\n29.16 \u00b1 0.74\n29.60 \u00b1 0.51\n\n0.1\n\n94.06 \u00b1 0.05\n74.03 \u00b1 6.32\n94.33 \u00b1 0.10\n94.09 \u00b1 0.10\n93.51 \u00b1 0.17\n93.53 \u00b1 0.11\n90.69 \u00b1 0.17\n82.61 \u00b1 4.81\n91.32 \u00b1 0.21\n90.52 \u00b1 0.26\n90.91 \u00b1 0.22\n90.43 \u00b1 0.25\n66.54 \u00b1 0.42\n13.38 \u00b1 1.84\n71.05 \u00b1 
0.30\n45.96 \u00b1 1.21\n68.36 \u00b1 0.42\n68.86 \u00b1 0.14\n\nClass Dependent Noise\n\nNoise Rate \u2318\n\n0.2\n\n0.3\n\n93.72 \u00b1 0.14\n63.03 \u00b1 3.91\n94.03 \u00b1 0.11\n93.66 \u00b1 0.09\n93.24 \u00b1 0.14\n93.36 \u00b1 0.07\n88.59 \u00b1 0.34\n52.93 \u00b1 3.60\n90.35 \u00b1 0.26\n89.09 \u00b1 0.47\n89.33 \u00b1 0.17\n89.45 \u00b1 0.29\n59.20 \u00b1 0.18\n11.50 \u00b1 1.16\n71.08 \u00b1 0.22\n42.46 \u00b1 2.16\n66.59 \u00b1 0.22\n66.59 \u00b1 0.23\n\n92.72 \u00b1 0.21\n58.14 \u00b1 0.14\n93.91 \u00b1 0.14\n93.52 \u00b1 0.16\n92.21 \u00b1 0.27\n92.76 \u00b1 0.14\n86.14 \u00b1 0.40\n50.36 \u00b1 5.55\n89.25 \u00b1 0.43\n86.79 \u00b1 0.36\n85.45 \u00b1 0.74\n87.10 \u00b1 0.22\n51.40 \u00b1 0.16\n8.91 \u00b1 0.89\n70.76 \u00b1 0.26\n38.13 \u00b1 2.97\n61.45 \u00b1 0.26\n61.87 \u00b1 0.39\n\n0.4\n\n89.82 \u00b1 0.31\n56.04 \u00b1 3.76\n93.65 \u00b1 0.11\n88.53 \u00b1 4.81\n89.53 \u00b1 0.53\n91.62 \u00b1 0.34\n80.11 \u00b11.44\n45.52 \u00b1 0.13\n88.12 \u00b1 0.32\n83.55 \u00b1 0.58\n76.74\u00b1 0.61\n82.28 \u00b1 0.67\n42.74 \u00b1 0.61\n8.20 \u00b1 1.04\n70.82 \u00b1 0.45\n34.44 \u00b1 1.93\n47.22 \u00b1 1.15\n47.66 \u00b1 0.69\n\nTable 2: Average test accuracy on experiments with CIFAR-10. We replicated the exact experimental\nsetup as in [40]. The reported accuracies are the average last epoch accuracies after training for 100\nepochs. \u2318 = 40%. CCE, Forward and method by Wang et al. are adapted for direct comparison.\n\nNoise type\nCIFAR-10 + CIFAR-100 (open-set noise)\nCIFAR-10 (closed-set noise)\n\nCCE [40]\n\n62.92\n62.38\n\n64.18\n77.81\n\nForward [40] Wang, et al. [40] MAE Lq\n\n79.28\n78.15\n\n75.06\n74.31\n\n71.10\n64.79\n\nTrunc Lq\n79.55\n79.12\n\nbased on maximum validation accuracy for pruning. Uniform noise was generated by mapping a true\nlabel to a random label through uniform sampling. Following Patrini, et al. [32] class dependent noise\nwas generated by mapping TRUCK ! AUTOMOBILE, BIRD ! AIRPLANE, DEER ! 
HORSE,\nand CAT \u2194 DOG with probability \u03b7 for CIFAR-10. For CIFAR-100, we simulated class-dependent\nnoise by flipping each class into the next circularly with probability \u03b7.\nWe also tested noise-robustness of our loss function on open-set noise using CIFAR-10. For a direct\ncomparison, we followed the identical setup as described in [40]. For this experiment, the classifier\nwas trained for only 100 epochs. We observed that the validation loss plateaued after about 10 epochs, and\nhence started pruning the data afterwards at 10-epoch intervals. The open-set noise was generated by\nusing images from the CIFAR-100 dataset. A random CIFAR-10 label was assigned to these images.\nFASHION-MNIST: ResNet-18 was used. The same data preprocessing, augmentation, and\noptimization procedure as in CIFAR-10 were used for training. To generate realistic class\ndependent noise, we used the t-SNE [25] plot of the dataset to associate classes with similar\nembeddings, and mapped BOOT \u2192 SNEAKER, SNEAKER \u2192 SANDALS, PULLOVER \u2192 SHIRT,\nCOAT \u2194 DRESS with probability \u03b7.\n\n4.3 Results and Discussion\n\nExperimental results with closed-set noise are summarized in Table 1. For uniform noise, the proposed loss\nfunctions outperformed the baselines significantly, including forward correction with the ground truth\nconfusion matrices. In agreement with our theoretical expectations, truncating the Lq loss enhanced\nresults. For class dependent noise, in general Forward T offered the best performance, as it relied on\nthe knowledge of the ground truth confusion matrix. Truncated Lq loss produced similar accuracies\nas Forward \u02c6T for FASHION-MNIST and better results for CIFAR datasets, and outperformed the\nother baselines at most noise levels for all datasets. While using Lq loss improved over baselines for\nCIFAR-100, no improvements were observed for FASHION-MNIST and CIFAR-10 datasets. We\nbelieve this is in part because very similar classes were grouped together for the confusion matrices\nand consequently the DNNs might falsely put high confidence on wrongly labeled samples.\n\nIn general, classification accuracy for both uniform and class dependent noise would be further\nimproved relative to baselines with optimized q and k values and more epochs. Based on\nthe experimental results, we believe the proposed approach would work well when correctly labeled\ndata can be differentiated from wrongly labeled data based on softmax outputs, which is often the\ncase with large-scale data and expressive models. We also observed that MAE performed poorly for\nall datasets at all noise levels, presumably because DNNs like ResNet struggled to optimize with\nMAE loss, especially on challenging datasets such as CIFAR-100.\nTable 2 summarizes the results for open-set noise with CIFAR-10. Following Wang et al. [40], we\nreported the last-epoch test accuracy after training for 100 epochs. We also repeated the closed-set\nnoise experiment with their setup. Using Lq loss noticeably reduced overfitting, and using truncated\nLq loss achieved better results than the state-of-the-art method for open-set noise reported in [40].\nMoreover, our method is significantly easier to implement. Lastly, note that the poor performance of\nLq loss compared to MAE arises because the test accuracy reported here was measured long after the model\nstarted overfitting, since a shallow CNN without data augmentation was deployed for this experiment.\n\n5 Conclusion\n\nIn conclusion, we proposed theoretically grounded and easy-to-use classes of noise-robust loss\nfunctions, the Lq loss and the truncated Lq loss, for classification with noisy labels that can be\nemployed with any existing DNN algorithm. 
We empirically verified noise robustness on various\ndatasets with both closed- and open-set noise scenarios.\n\nAcknowledgments\nThis work was supported by NIH R01 grants (R01LM012719 and R01AG053949), the NSF NeuroNex\ngrant 1707312, and NSF CAREER grant (1748377).\n\nReferences\n[1] Devansh Arpit, Stanis\u0142aw Jastrz\u0119bski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. arXiv preprint arXiv:1706.05394, 2017.\n\n[2] Samaneh Azadi, Jiashi Feng, Stefanie Jegelka, and Trevor Darrell. Auxiliary image regularization for deep cnns with noisy labels. arXiv preprint arXiv:1511.07069, 2015.\n\n[3] Mokhtar S Bazaraa, Hanif D Sherali, and Chitharanjan M Shetty. Nonlinear programming: theory and algorithms. John Wiley & Sons, 2013.\n\n[4] George EP Box and David R Cox. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), pages 211\u2013252, 1964.\n\n[5] J Paul Brooks. Support vector machines with the ramp loss and the hard margin loss. Operations research, 59(2):467\u2013479, 2011.\n\n[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248\u2013255. IEEE, 2009.\n\n[7] Davide Ferrari, Yuhong Yang, et al. Maximum lq-likelihood estimation. The Annals of Statistics, 38(2):753\u2013783, 2010.\n\n[8] Beno\u00eet Fr\u00e9nay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845\u2013869, 2014.\n\n[9] Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under label noise for deep neural networks. In AAAI, pages 1919\u20131925, 2017.\n\n[10] Aritra Ghosh, Naresh Manwani, and PS Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93\u2013107, 2015.\n\n[11] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. 2016.\n\n[12] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang, and Masashi Sugiyama. Masking: A new perspective of noisy supervision. arXiv preprint arXiv:1805.08193, 2018.\n\n[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\n[14] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. arXiv preprint arXiv:1802.05300, 2018.\n\n[15] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646\u2013661. Springer, 2016.\n\n[16] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Regularizing very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 2017.\n\n[17] Ishan Jindal, Matthew Nokleby, and Xuewen Chen. Learning deep networks from noisy labels with dropout regularization. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 967\u2013972. IEEE, 2016.\n\n[18] Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017.\n\n[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[20] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.\n\n[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097\u20131105, 2012.\n\n[22] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189\u20131197, 2010.\n\n[23] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Jia Li. Learning from noisy labels with distillation. arXiv preprint arXiv:1703.02391, 2017.\n\n[24] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447\u2013461, 2016.\n\n[25] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579\u20132605, 2008.\n\n[26] Naresh Manwani and PS Sastry. Noise tolerance under risk minimization. IEEE transactions on cybernetics, 43(3):1146\u20131151, 2013.\n\n[27] Hamed Masnadi-Shirazi and Nuno Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and savageboost. In Advances in neural information processing systems, pages 1049\u20131056, 2009.\n\n[28] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.\n\n[29] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196\u20131204, 2013.\n\n[30] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520\u20131528, 2015.\n\n[31] Curtis G Northcutt, Tailin Wu, and Isaac L Chuang. Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv preprint arXiv:1705.01936, 2017.\n\n[32] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: a loss correction approach. stat, 1050:22, 2017.\n\n[33] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.\n\n[34] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.\n\n[35] Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2(3):4, 2014.\n\n[36] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. arXiv preprint arXiv:1803.11364, 2018.\n\n[37] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In Advances in Neural Information Processing Systems, pages 5601\u20135610, 2017.\n\n[38] Brendan Van Rooyen, Aditya Menon, and Robert C Williamson. Learning with symmetric label noise: The importance of being unhinged. In Advances in Neural Information Processing Systems, pages 10\u201318, 2015.\n\n[39] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In The Conference on Computer Vision and Pattern Recognition, 2017.\n\n[40] Yisen Wang, Weiyang Liu, Xingjun Ma, James Bailey, Hongyuan Zha, Le Song, and Shu-Tao Xia. Iterative learning with open-set noisy labels. arXiv preprint arXiv:1804.00092, 2018.\n\n[41] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691\u20132699, 2015.\n\n[42] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.", "award": [], "sourceid": 5289, "authors": [{"given_name": "Zhilu", "family_name": "Zhang", "institution": "Cornell University"}, {"given_name": "Mert", "family_name": "Sabuncu", "institution": "Cornell"}]}