{"title": "Data Parameters: A New Family of Parameters for Learning a Differentiable Curriculum", "book": "Advances in Neural Information Processing Systems", "page_first": 11095, "page_last": 11105, "abstract": "Recent works have shown that learning from easier instances first can help deep neural networks (DNNs) generalize better. However, knowing which data to present during different stages of training is a challenging problem. In this work, we address this problem by introducing data parameters. More specifically, we equip each sample and class in a dataset with a learnable parameter (data parameters), which governs their importance in the learning process. During training, at each iteration, as we update the model parameters, we also update the data parameters. These updates are done by gradient descent and do not require hand-crafted rules or design. When applied to the image classification task on the CIFAR10, CIFAR100, WebVision and ImageNet datasets, and the object detection task on the KITTI dataset, learning a dynamic curriculum via data parameters leads to consistent gains, without any increase in model complexity or training time. When applied to a noisy dataset, the proposed method learns to learn from clean images and improves over the state-of-the-art methods by 14%. To the best of our knowledge, our work is the first curriculum learning method to show gains on large-scale image classification and detection tasks.", "full_text": "Data Parameters: A New Family of Parameters for Learning a Differentiable Curriculum

Shreyas Saxena, Apple (shreyas_saxena@apple.com)
Oncel Tuzel, Apple (otuzel@apple.com)
Dennis DeCoste, Apple (ddecoste@apple.com)

Abstract

Recent works have shown that learning from easier instances first can help deep neural networks (DNNs) generalize better. However, knowing which data to present during different stages of training is a challenging problem.
In this work, we address this problem by introducing data parameters. More specifically, we equip each sample and class in a dataset with a learnable parameter (data parameters), which governs their importance in the learning process. During training, at each iteration, as we update the model parameters, we also update the data parameters. These updates are done by gradient descent and do not require hand-crafted rules or design. When applied to the image classification task on the CIFAR10, CIFAR100, WebVision and ImageNet datasets, and the object detection task on the KITTI dataset, learning a dynamic curriculum via data parameters leads to consistent gains, without any increase in model complexity or training time. When applied to a noisy dataset, the proposed method learns to learn from clean images and improves over the state-of-the-art methods by 14%. To the best of our knowledge, our work is the first curriculum learning method to show gains on large-scale image classification and detection tasks. Code is available at: https://github.com/apple/ml-data-parameters.

1 Introduction

Curriculum learning [1, 7, 12, 17, 35] has garnered a lot of attention in the field of machine learning. It draws inspiration from the learning principles underlying the cognitive processes of humans and animals, which start by learning easier concepts and then gradually transition to learning more complex concepts. Existing work has shown that with the help of this paradigm, DNNs can achieve better generalization [1, 2, 15].

The key to applying curriculum learning to different problems is to come up with a ranking function that assigns learning priorities to the training samples. A sample with a higher priority is supposed to be learned earlier than a sample with a lower priority. For the majority of early work in curriculum learning, the curriculum is provided by a pre-determined heuristic.
For instance, for the task of classifying shapes [1], shapes with less variation were assigned a higher priority. In [29], the authors approached grammar induction by assigning higher priority to short sentences. The main issues which limit the application of this approach are: (1) for many complex problems, it is not trivial to define what the easy examples or subtasks are; (2) in cases where humans can design a curriculum, it is assumed that the difficulty of learning a sample for humans correlates with the difficulty of learning the sample for a learning algorithm; and (3) even if one could define the curriculum, the pre-determined curriculum might not be appropriate at all learning stages of the dynamically learned model.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Learning a curriculum in an automatic manner is a hard task, since the ease or difficulty of an example is relative to the current state of the model. In order to overcome these issues, in this work, we introduce a new family of parameters for DNNs termed data parameters. More specifically, each class and data point has its own data parameter, governing its importance in the learning process. During learning, at every iteration, as we update the standard model parameters, we also update the data parameters using stochastic gradient descent. Learning data parameters for classes and instances leads to a dynamic and differentiable curriculum, without any need for human intervention. The main contributions of our work are:

1. We introduce a new class of parameters termed data parameters for every class and data point in the dataset. We show that data parameters can be learned using gradient descent, and doing so amounts to learning a dynamic and differentiable curriculum.
In our formulation, data parameters are involved only during training, and hence do not affect model complexity at inference.

2. We show that for image classification and object detection tasks, learning a curriculum for CNNs improves over the baseline by prioritizing classes and their instances. To the best of our knowledge, our paper is the first curriculum learning method to show gains on large-scale image classification tasks (ImageNet [5]) and on an object detection task (KITTI [9]).

3. We show that in the presence of noisy labels, the learnt curriculum prioritizes learning from clean labels. In doing so, our method outperforms the state-of-the-art by a significant margin.

4. We show that when presented with random labels, in comparison to a baseline DNN which memorizes the data, the learned curriculum resists memorizing corrupt data.

2 Learning a Dynamic Curriculum via Data Parameters

As suggested earlier, the main intuition behind our idea is simple: each class and data point in the training set has a parameter associated with it, which weighs the contribution of that class or data point in the gradient update of the model parameters. In contrast to existing works which set these parameters with a heuristic, in our work these parameters are learnable and are learnt along with the model parameters. Unlike model parameters, which are involved during training and inference, data parameters are only involved during training. Therefore, using data parameters during training does not affect the model complexity and run-time at inference. In the next section we formalize this intuition for a class-level curriculum, followed by an instance-level curriculum.

2.1 Learning a curriculum over classes

We first describe learning a dynamic curriculum over classes, where the contribution of each sample to the model learning is determined by its class.
This curriculum favors learning from easier classes at the earlier stages of training. The curriculum over classes is dynamic and is controlled by the class-level data parameters, which are also updated via the training process. In what follows, we will refer to class-level data parameters as class parameters.

Let $\{(x_i, y_i)\}_{i=1}^{N}$ denote the data, where $x_i \in \mathbb{R}^d$ denotes a single data point and $y_i \in \{1, \ldots, k\}$ denotes its target label. Let $\sigma^{class} \in \mathbb{R}^k$ denote the class parameters for the classes in the dataset. We denote the neural network function mapping the input data $x_i$ to logits $z^i \in \mathbb{R}^k$ as $z^i = f_\theta(x_i)$, where $\theta$ are the model parameters. During training, we pass the input sample $x_i$ through the DNN and compute its corresponding logits $z^i$, but instead of computing the softmax directly on the logits, we scale the logits of the instance with the parameter corresponding to the target class, $\sigma^{class}_{y_i}$. Note that scaling the logits with the parameter of the target class can be interpreted as a temperature scaling of the logits. The cross-entropy loss for a data point $x_i$ can then be written as

$$L_i = -\log(p^i_{y_i}), \qquad p^i_{y_i} = \frac{\exp(z^i_{y_i} / \sigma^{class}_{y_i})}{\sum_j \exp(z^i_j / \sigma^{class}_{y_i})} \quad (1)$$

where $p^i_{y_i}$, $z^i_{y_i}$ and $\sigma^{class}_{y_i}$ denote the probability, logit and parameter of the target class $y_i$ for data point $x_i$ respectively. If we set all class parameters to one, i.e. $\sigma^{class}_j = 1$, $j = 1 \ldots k$, we recover the gradient for the standard cross-entropy loss.

During training we solve

$$\min_{\theta, \sigma^{class}} \; \frac{1}{N} \sum_{i=1}^{N} L_i \quad (2)$$

where, in addition to the model parameters $\theta$, we also optimize the class-level parameters $\sigma^{class}$. The gradient of the loss with respect to the logits is given by:

$$\frac{\partial L_i}{\partial z^i_j} = \frac{p^i_j - \mathbb{1}(j = y_i)}{\sigma^{class}_{y_i}} \quad (3)$$

where $\mathbb{1}(j = y_i)$ takes value 1 when $j = y_i$ and value 0 otherwise.
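As a concrete illustration of equation (1), the scaled cross-entropy can be sketched in a few lines of NumPy. This is our own sketch, not the released implementation; all names are illustrative:

```python
import numpy as np

def class_scaled_ce(logits, y, sigma_class):
    """Eq. (1): cross-entropy with logits divided by the target class's parameter."""
    z = logits / sigma_class[y]        # temperature scaling by sigma^class_{y_i}
    z = z - z.max()                    # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[y]), p

logits = np.array([2.0, 0.5, -1.0])    # a correctly classified sample, y = 0
loss_std, p_std = class_scaled_ce(logits, 0, np.ones(3))            # sigma = 1
loss_flat, p_flat = class_scaled_ce(logits, 0, np.array([2.0, 1.0, 1.0]))
# With sigma > 1 the softmax flattens: the target probability drops, and the
# logit gradients of eq. (3), (p - 1(j=y)) / sigma, shrink in magnitude.
```

With all parameters set to one, the function reduces exactly to the standard softmax cross-entropy, matching the remark above.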
The gradient of the loss with respect to the parameter of the target class is given by:

$$\frac{\partial L_i}{\partial \sigma^{class}_{y_i}} = \frac{(1 - p^i_{y_i})}{(\sigma^{class}_{y_i})^2} \Big( z^i_{y_i} - \sum_{j \neq y_i} q^i_j z^i_j \Big) \quad (4)$$

where $q^i_j = \frac{p^i_j}{1 - p^i_{y_i}}$ is the probability distribution over the non-target classes (indexed by $j$, with $j \neq y_i$).

Effect of class parameters on learning: The class parameters are updated with the negative of the gradient given in equation (4): the parameter corresponding to the target class, $\sigma^{class}_{y_i}$, will increase if the logit of the target class is less than the expected value of the logits over the non-target classes (i.e. $z^i_{y_i} < \sum_{j \neq y_i} q^i_j z^i_j$) and vice versa. Therefore, during the course of learning, if data points of a certain class are being misclassified, the gradient update on the class parameters gradually increases the parameter associated with this class. Increasing the class parameter flattens the curvature of the loss function for instances of that class, thereby decaying the gradients w.r.t. the logits (see equation (3)). Decreasing the class parameter has the inverse effect, and accelerates learning.

2.2 Learning a curriculum over instances

In the previous section, we have detailed how we can learn a dynamic curriculum over the classes of a dataset. A natural extension of this framework is to have a dynamic curriculum over the instances in the dataset. In this case, in equation (1), rather than having a class parameter $\sigma^{class}_j$, $j \in \{1, \ldots, k\}$, for each class, we can have an instance parameter $\sigma^{inst}_i$, $i \in \{1, \ldots, N\}$, for each sample present in the dataset.

This parameterization helps us learn a curriculum over the instances of a class, which is useful when instances within a class have different levels of difficulty. For instance, consider the task of classifying images of an object.
In some instances, the object could be fully visible (easy), while in others, it could be occluded by other objects (hard). Another task is learning with noisy/corrupt labels. In this setting, the labels of some instances would be consistent with the input (easy), while the labels of other instances would not be (hard). In our experiments, we show that learning a curriculum over instances learns to ignore the noisy samples.

We can also learn a joint curriculum over classes and instances to have the benefits of both. In this case, during training, the parameter for a data point $x_i$ is set as the sum of its target's class parameter and its own instance parameter, i.e. $\sigma^*_i = \sigma^{class}_{y_i} + \sigma^{inst}_i$. In this setting, the gradient of the loss with respect to the logits (as in equation (3)) can be expressed as $\frac{\partial L_i}{\partial z^i_j} = \frac{p^i_j - \mathbb{1}(j = y_i)}{\sigma^*_i}$. Since the effective parameter of an instance is formed by the addition of the class- and instance-level parameters, the gradient for these parameters for a data point $x_i$ is the same and is given by:

$$\frac{\partial L_i}{\partial \sigma^*_i} = \frac{(1 - p^i_{y_i})}{(\sigma^*_i)^2} \Big( z^i_{y_i} - \sum_{j \neq y_i} q^i_j z^i_j \Big) \quad (5)$$

Note, however, that during training, instance parameters collect their gradient from individual samples (when sampled in a mini-batch), while class parameters average the gradient over all samples of the class present in a mini-batch.

Inference with data parameters: As explained earlier, during training we modify the logits of a sample with data parameters (class or instance parameters). During inference, we do not have parameters on the test set, and hence do not scale the logits with a data parameter. Not scaling the logits has no effect on the argmax of the softmax, but the classification probability is uncalibrated.
If one is interested in calibrated output, calibration can be done on a held-out validation set [11]. Note that this modus operandi, of not scaling logits at inference, maintains our claim: the use of data parameters does not affect the model's capacity and run-time at inference.

3 Experimental evaluation

In this section we first describe the implementation details of our method. Next, we show results of our method when applied to the tasks of image classification and detection. After that, we evaluate our dynamic curriculum framework in the presence of noisy labels. Finally, we show that our framework, when applied to all-random labels, acts as a strong regularizer and resists memorization. Note that, since our method modifies the logits at the very end of the forward pass, the gains reported below come without any additional computational overhead during training.

3.1 Implementation details

Optimizing data parameters $\sigma$ with gradient descent requires constrained optimization with the constraint $\sigma \geq 0$. Instead, we choose to optimize the log parameterization $\log(\sigma)$, which can be mapped back using the exponential map. The exponential map resolves the log parameterization to the positive domain, and allows us to perform unconstrained optimization.

In our loss function, in addition to the standard $\ell_2$ regularizer on the model parameters, $||\theta||^2$, we also have $\ell_2$ regularization on the data parameters, $||\log(\sigma^{class})||^2$ and $||\log(\sigma^{inst})||^2$, with their contribution controlled by a weight decay parameter. This regularizer favors the original softmax formulation with $\sigma = 1$, and prevents data parameters from attaining very high values.

Unless stated otherwise, the following implementation details hold for our experiments. For all numbers reported in this paper, we report the mean and standard deviation over 3 runs. We learn the class and instance parameters using stochastic gradient descent (SGD).
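A minimal sketch of this update scheme, assuming plain gradient descent (the momentum term is omitted) and using the analytic gradient of equation (4); the helper names are ours, not the released code:

```python
import numpy as np

def scaled_ce(z, y, s):
    """Eq. (1) for a single sample with scalar data parameter s."""
    zt = z / s
    p = np.exp(zt - zt.max())
    p = p / p.sum()
    return -np.log(p[y]), p

def grad_sigma(z, y, s):
    """Analytic gradient of eq. (4) with respect to the data parameter."""
    _, p = scaled_ce(z, y, s)
    q = np.delete(p, y) / (1.0 - p[y])
    return (1.0 - p[y]) / s ** 2 * (z[y] - q @ np.delete(z, y))

# Data parameters are stored as log(sigma): the exponential map keeps sigma
# positive under plain unconstrained gradient descent, and an l2 penalty on
# log(sigma) pulls it back toward sigma = 1 (i.e. log sigma = 0).
lr, wd = 0.1, 5e-4
log_sigma = 0.0                               # init: sigma = 1
z, y = np.array([-0.5, 1.2, 0.3]), 0          # a misclassified sample of class 0

sigma = np.exp(log_sigma)
g = grad_sigma(z, y, sigma) * sigma + wd * 2.0 * log_sigma   # chain rule + decay
log_sigma = log_sigma - lr * g
```

Because the sample is misclassified, the gradient of equation (4) is negative, so the step raises log(sigma) and the mapped-back sigma ends up above 1, flattening the loss for this sample.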
Class and instance parameters are initialized with $\sigma = 1$ and optimized using gradient descent with momentum 0.9. When learning a joint curriculum over classes and instances, class parameters are initialized as 1 and instance parameters are initialized as 0.001. This ensures that the sum of both parameters results in $\sigma = 1$, thereby recovering the original softmax formulation. For both sets of parameters we use separate optimizers with their respective learning rates and weight decays. The learning rate and weight decay for class parameters are set to 0.1 and 5e-4 (the same as for the model parameters of the DNN). The learning rate and weight decay for instance parameters vary depending upon the task, and are set using the validation set. When a class or instance is not present in a mini-batch, we do not update the momentum buffer associated with the data parameter of that class or instance.

3.2 Learning a curriculum for image classification

In this section we demonstrate the efficacy of our method when applied to the task of image classification. We evaluate our dynamic curriculum learning framework on the CIFAR100 [18] and ImageNet 2012 [5] classification datasets.

The CIFAR100 dataset contains 100 classes, with 50,000 images in the training set and 10,000 images in the test set. We evaluate our framework on CIFAR100 with WideResNet (depth: 28, widening factor: 10, dropout: 0) [38]. We first reproduce the results for WideResNet(^1) by setting the minibatch size, optimizer and learning rate schedule identical to the original paper [38], and report the numbers in Table 1.

The ImageNet dataset contains 1000 classes, with 1.28 million training samples. We report top-1 accuracy on the validation set, which consists of 50,000 images. We evaluate our framework with ResNet18 [14], using the implementation from PyTorch's website(^2). As per the standard settings, we train the

(^1) The authors report results as the median of 5 runs.
We reimplement their method and report the mean and standard deviation over three runs.

(^2) https://github.com/pytorch/examples/tree/master/imagenet

Figure 1: Left: Class-level dynamic curriculum on CIFAR100. The curriculum learnt over classes is dynamic in nature and adapts itself to different classes. Right: Instance-level dynamic curriculum on ImageNet. Two instances of the same class, as per their difficulty, are learnt at different points during training.

models for a total of 100 epochs, with a learning rate decay of 0.1 every 30 epochs, i.e. at epochs 30, 60 and 90. The weight decay for class- and instance-level data parameters is set to 1e-4 (the same as for the model parameters). The learning rates for class- and instance-level data parameters are set to 0.1 and 0.8 respectively.

We report results for ImageNet and CIFAR100 in Table 1. As seen from the table, on the CIFAR100 dataset, learning a curriculum over classes and instances leads to a statistically significant gain of 0.7% over the baseline for WideResNet. On the ImageNet dataset, using a dynamic curriculum translates to a gain of 0.7% over the baseline. In the table, we also show that learning a dynamic curriculum over classes alone performs better than the baseline, but shows a degradation of 0.2% in accuracy when compared with the class- and instance-level curriculum. This highlights the importance of using a curriculum over instances, and validates our hypothesis: instances within a class have varying levels of difficulty, and learning the order within a class is important. In Figure 1 (right), we plot the data parameters for two instances of the same class as they evolve during training. The two instances are learnt at different points during training, as per their difficulty.
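The class-level dynamics of Figure 1 can be reproduced in a toy simulation. The code below is our own illustration, not the paper's: holding the logits fixed, repeated gradient steps on the class parameters (equation (4)) lower sigma for a confidently classified class and raise it for a misclassified one:

```python
import numpy as np

def grad_sigma(z, y, s):
    """Gradient of the scaled cross-entropy w.r.t. a class parameter (eq. (4))."""
    zt = z / s
    p = np.exp(zt - zt.max())
    p = p / p.sum()
    q = np.delete(p, y) / (1.0 - p[y])
    return (1.0 - p[y]) / s ** 2 * (z[y] - q @ np.delete(z, y))

# One confidently correct logit vector for the "easy" class, and one
# misclassified logit vector for the "hard" class (binary problem).
z_easy, y_easy = np.array([2.0, -2.0]), 0     # target logit clearly on top
z_hard, y_hard = np.array([1.0, -1.0]), 1     # target logit below the other

sigma = np.ones(2)                            # sigma[0]: easy, sigma[1]: hard
lr = 0.1
for _ in range(20):                           # descend on each class parameter
    sigma[0] -= lr * grad_sigma(z_easy, y_easy, sigma[0])
    sigma[1] -= lr * grad_sigma(z_hard, y_hard, sigma[1])
```

After a few steps the hard class's parameter rises above 1 (flattening its loss and damping its logit gradients) while the easy class's parameter drops below 1, mirroring the curves in Figure 1 (left). The log-parameterization and momentum of the actual implementation are omitted here for brevity.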
For a description of the experiments on the WebVision dataset, see Section 3.4.

Comparison with the state-of-the-art: To the best of our knowledge, we are the first work to report gains on the ImageNet dataset due to curriculum learning(^3). There are existing works which report results of curriculum learning on the CIFAR100 dataset, but a direct comparison is not possible, since these works report results in different settings. Nevertheless, we summarize key results from the existing state-of-the-art below.

[2] proposes a curriculum learning framework where the sampling of data (the curriculum) for SGD is based on a lightweight estimate of sample uncertainty. With ResNet27 they obtain an improvement of 0.4% in accuracy. Inspired by the recent work 'Learning to Teach' [7], [36] proposes an extension where the teacher dynamically alters the loss function for the student model. Training a teacher model requires a separate held-out validation set and hence is not directly comparable. On the CIFAR100 dataset, they obtain an improvement of 1.1% over a slightly weaker ResNet-32 architecture which has 69.62% baseline accuracy. [13] proposes a dynamic non-uniform sampling method for curriculum learning. They employ transfer learning to sort the training data by difficulty, and evaluate various heuristics to guide the sampling for SGD. On CIFAR100, using VGG16 without data augmentation (with a baseline accuracy of 68.1%), they obtain a 0.6% improvement in accuracy.

The learnt curriculum is repeatable: To perform a qualitative evaluation, we visualize the dynamic curriculum learnt over classes in Figure 1 (left). We pick four random classes, and plot their class parameters over the course of training (mean and standard deviation over three runs). From the figure we can see that the curriculum is dynamic, and adapts to different classes.
More importantly, the low standard deviation implies that the order in which classes are learnt is repeatable and intrinsic to the dataset and model. Random runs on the same dataset with different architectures lead to different curricula.

(^3) [17] reports numbers on ImageNet, but they train on a dataset twice the size of ImageNet containing noisy labels. In Section 3.4, we make an explicit comparison with [17] for the task of learning with noisy labels.

Table 1: Results on image classification datasets. Across different datasets and CNN architectures, using our dynamic curriculum learning (DCL) framework leads to consistent gains.

Dataset     Model       Class  Instance  Baseline      DCL
CIFAR100    WRN-28-10     ✓       ✓      80.1 ± 0.2    80.8 ± 0.1
ImageNet    ResNet18      ✓       ✓      70.3 ± 0.1    71.0 ± 0.1
ImageNet    ResNet18      ✓       ✗      70.3 ± 0.1    70.8 ± 0.1
WebVision   ResNet18      ✓       ✓      66.3 ± 0.1    67.5 ± 0.1
WebVision   ResNet18      ✗       ✓      66.3 ± 0.1    67.1 ± 0.1

Figure 3: Dynamic curriculum over the course of training (left to right) for the object detection task. The thickness of a bounding box is proportional to the value of the parameter associated with it. The curriculum learns easier, unoccluded instances first, followed by partially occluded ones, and in the end learns heavily occluded instances.

3.3 Learning a curriculum for object detection

In this section we show that, when applied to an object detection task, our framework is able to recover a curriculum which first learns from unoccluded instances (easy), followed by partially occluded instances (medium), and finally learns the severely occluded instances (hard).

We apply our framework to the challenging KITTI dataset [9] for the task of 2D detection. The object detection benchmark from KITTI has three classes: cyclist, pedestrian and car.
The training set contains 7,481 images with 2D bounding box annotations. The evaluation of 2D detectors is done in three regimes: easy, medium and hard, defined as per the truncation and occlusion levels of the objects. We use the train/validation split provided by [3] to evaluate our performance for detecting car instances.

Figure 2: Detection mAP on KITTI.

Setting   Baseline      DCL
Easy      92.1 ± 0.16   92.8
Medium    87.3 ± 0.26   87.9
Hard      78.0 ± 0.6    79.3

As a baseline, we implement 2D detection using the Single Shot Detector (SSDNet) [24] architecture. In the SSDNet architecture, the network consists of standard convolutional layers, followed by anchors at multiple feature maps. Each anchor is assigned either to the background or to a bounding box annotation. Anchors assigned to a bounding box predict the bounding box offset and class label. To learn a curriculum over instances, we associate a learnable parameter with each bounding box annotation. Anchors assigned to a bounding box annotation use the value of the parameter associated with that instance to rescale their logits before predicting the target class label. Therefore, if the anchors assigned to a certain bounding box annotation are not able to predict the target label, the parameter associated with that bounding box instance will attain a high value. Anchors assigned to the background do not have a bounding box associated with them. To mitigate this issue, for each mini-batch, we compute the mean value of the instance parameters over the target bounding box instances, and use that for the negative anchors.
This ensures that positive and negative anchors learn at the same pace, while allowing the positive anchors to learn a curriculum over different instances. We have not tuned hyperparameters on this dataset; we set the learning rate for instance-level parameters to 0.1, and did not use weight decay or momentum.

As shown in Figure 2, we obtain an improvement of 0.7, 0.6 and 1.3 mAP in the easy, medium and hard settings compared to the baseline algorithm. In Figure 3 we show how the learnt curriculum attends to samples of different difficulty over the course of training.

Comparison with the state-of-the-art: To the best of our knowledge, we are the first work to report improvements on an object detection task with curriculum learning. The following works are the closest relevant state-of-the-art: [21, 33, 6] have explored the use of curriculum learning for weakly supervised object detection. To avoid getting stuck in a local minimum in the multiple instance learning framework, [21] uses segmentation maps along with the current bounding box proposals to define a curriculum. [33] trains an object detector using a small set of training data. This detector is evaluated on a large set of weakly labeled images, and is used to measure mAP per image [41]. mAP per image is used as a proxy for the intrinsic difficulty of an image, and is used to define a curriculum.

Figure 4: Visualization of the dynamic instance-level curriculum on the noisy CIFAR100 dataset under 40% label noise. Left: Mean and standard deviation of the instance parameters of clean and noisy data instances over the course of training. Right: Percentage of corrupt samples in the top 40% of data points sorted by their instance parameter value. See text for details.

3.4 Learning a curriculum for noisy labels

An ideal framework for learning a curriculum should remain useful when some of the labels in the dataset are noisy; in that case, the framework should prioritize learning from clean labels.
In this section, we first validate our dynamic curriculum learning framework in a controlled corrupted-label setting, followed by results on a real-world noisy dataset.

Results in the controlled corrupted-label setting: To compare with the relevant state-of-the-art, we follow the common setting of [17, 26] to train deep CNNs, where the label of each image is independently changed to a uniform random class with probability p, where p is the noise fraction and is set to 0.2, 0.4 or 0.8. The labels of the validation data remain clean for evaluation. We compare our approach with two state-of-the-art approaches [17, 26] in this setting. Both of these methods assign a weight to each sample in the training set, which is used to scale the gradients of these samples during training. [17] trains an auxiliary network (MentorNet) to assign weights to data points. [26] employs a meta-learning framework to learn the optimal weight of a sample.

We implement WideResNet-28-10 under settings identical to the ones reported in [26]. For all of our experiments with noisy labels, the learning rate for instance parameters is set to 0.2, and accuracy is reported at 84 epochs (set by cross-validation). As seen from the results in Table 2, our method outperforms the state-of-the-art MentorNet PD [17] by 14.5% on CIFAR10 and 14% on CIFAR100. We also compare our results with methods (MentorNet DD [17] and Robust Weighting [26]) which use additional clean data to learn the curriculum. Despite the fact that our method does not use additional clean data, we outperform these methods by 2% on CIFAR10 and 3% on CIFAR100.
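The corruption protocol above can be sketched as follows. This is our own helper, not the paper's code; in particular, we assume the replacement class is drawn uniformly over all classes and may coincide with the original label, a detail the text does not pin down:

```python
import numpy as np

def corrupt_labels(labels, num_classes, p, seed=0):
    """With probability p, replace each label by a uniformly drawn class.

    Assumption: the replacement is uniform over all num_classes classes, so a
    'corrupted' label can coincide with the original one.
    """
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(labels.shape[0]) < p          # which samples to corrupt
    noisy[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    return noisy, flip

clean = np.arange(1000) % 10                        # toy 10-class label vector
noisy, flip = corrupt_labels(clean, num_classes=10, p=0.4)
```

Roughly a fraction p of the labels are replaced, and all remaining labels are untouched, which is the setting in which an oracle trained on the clean subset (discussed below) is well defined.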
In the supplementary material, we perform the same analysis for 20% and 80% noise on CIFAR100, and show that DCL outperforms MentorNet PD [17] by 3% and 22% respectively.

Next, we measure the gap between our method and an oracle which learns only from the clean data. We establish the performance of the oracle by training our baseline DNN only on the clean data in each setting, i.e. in the setting with 40% noise, we train only on the 60% clean data. As can be seen from the table, under the 40% noise level, the gap between our method and the oracle is only 3%, both for CIFAR10 and CIFAR100.

In Figure 4 (left) we plot the mean instance parameter for noisy and clean data during the course of training. As seen from the figure, over the course of training, the learnt curriculum is able to separate the clean data from the noisy data by assigning high instance parameter values to the noisy instances. In Figure 4 (right) we plot the percentage of corrupt samples in the top 40% of the training data sorted by their instance parameter value. As seen from the plot, within 20 epochs, 95% of the noisy instances attain instance parameter values greater than those of all clean samples. For results under the 20% and 80% noise levels, see the supplementary material.

Method                            Additional Clean Data   CIFAR-10       CIFAR-100
MentorNet DD [17]                 Yes                     88.7           67.5
Robust Weighting [26]             Yes                     86.92 ± 0.19   61.34 ± 2.06
Baseline [26]                     No                      67.97 ± 0.62   50.66 ± 0.24
Reed Hard [26]                    No                      69.66 ± 1.21   51.34 ± 0.17
S Model [26]                      No                      70.64 ± 3.09   49.10 ± 0.58
MentorNet PD [17]                 No                      76.6           56.9
DCL (ours)                        No                      91.10 ± 0.70   70.93 ± 0.15
Baseline on clean data (oracle)   No                      94.24 ± 0.15   74.18 ± 0.19

Table 2: Performance of our method under uniform 40% label noise on the training set.
Dynamic curriculum learning (DCL) outperforms the state-of-the-art methods, including methods (top 2 rows) which use additional clean data. The bottom row indicates the performance of the baseline DNN trained only on the clean labels.

Results on a noisy dataset from the web: In this section we show results on the challenging WebVision 2017 dataset [22], a large-scale dataset which has corrupted labels and is extremely imbalanced. The WebVision 2017 dataset is constructed by crawling Google Image Search and Flickr using the 1000 classes from ImageNet as queries. It contains 1000 classes, with 2.4 million training images, without any human annotation. The dataset provides 50,000 manually-labeled images for evaluation.

We conducted experiments using ResNet18 with the same hyper-parameters as used for the ImageNet experiments, and report results in Table 1. Our baseline, the standard DNN training of ResNet18, obtains a top-1 accuracy of 66.3 ± 0.1%. Since the noise present in the dataset is at the instance level, we first evaluate the use of an instance-level curriculum. Using an instance curriculum leads to an improvement of 0.8%. Next, we evaluate the use of a joint class- and instance-level curriculum. Interestingly, even though the noise present in this dataset is at the instance level, learning a joint curriculum improves over the instance-level curriculum, and leads to an overall gain of 1.2% over the baseline.

3.5 Curriculum learning with all random labels

Recent work [39] has shown that, when presented with a dataset containing all random labels, standard DNNs are able to memorize the entire training dataset. They evaluated standard regularization methods such as weight decay, dropout and data augmentation, and found them to be ineffective at preventing memorization. We replicate their experimental setup and findings using VGG16 [28] (baseline) on the CIFAR100 dataset.
In\nthe \ufb01gure on the right, we plot the training accuracy curves for baseline and baseline with our\ndynamic curriculum. As seen from the plot, using our dynamic curriculum learning formulation\nresists memorizing the corrupt training data. As explained in Section 2.1, when data-points of a\ncertain class are misclassi\ufb01ed, the gradient update will increase the corresponding class parameter. In\nthis setting, where labels of images are random, over the course of training, the class parameter for\nall the classes keeps increasing, effectively decaying the magnitude of gradient update on the training\nset (see equation 4).\n\n4 Related work\n\nCurriculum learning has been an active topic of research in the machine learning community and has\nbeen used in various problems [4, 10, 15, 21, 27, 31, 32]. In this section, we give a brief overview of\nrelated work most relevant to the material we present in the paper. For a brief overview of curriculum\nlearning and a theoretical treatment, we refer the reader to [34].\nIn the early works of curriculum learning [1, 29], the curriculum was pre-determined and \ufb01xed during\nthe course of optimization. To address this limitation, [19] proposed Self Paced Learning (SPL)\nframework, where the curriculum is optimized jointly with the model parameters. In SPL [19], the\ndata points are assigned a weight variable, which are updated along with the model parameters using\nalternate minimization. More speci\ufb01cally at each iteration, weights of samples with a loss higher\nthan a pre-de\ufb01ned threshold \u03bb are set to 0. Over the course of training, while gradually increasing\n\u03bb, more samples are included in training from easy to hard in a self-paced manner. SPL has been\nwidely adopted and applied to various problems [20, 23, 25, 30]. 
Similar to SPL, our method learns a\n\n8\n\n0100200300400500Epochs20406080100Train AccuracyBaselineDynamic Curriculum\fdynamic curriculum and mitigates the issue of a using a pre-determined curriculum. Earlier works in\ncurriculum learning and SPL perform discrete sampling of data, which could lead to local minima.\nIn comparison, our method performs soft-differentiable sampling of data. Recently, several works\nhave been proposed to obtain better weighting strategies [7, 16, 17, 40] for SPL framework. Learning\nweight for each sample amounts to learning the scale for the gradient update of each sample. In\ncontrast, in our method, learning the data parameters amounts to learning the loss function speci\ufb01c to\neach data point and class. Another major difference between our method and SPL is that, the majority\nof SPL methods use the loss of a data-point as a proxy for establishing its hardness with respect to the\ncurrent model. This heuristic when applied to deep neural networks (DNNs) might be problematic,\nsince DNNs can easily memorize hard examples (e.g random labels[39]), making the loss of a sample\ndecorrelated with the intrinsic hardness of the sample.\nRecent works have explored meta-learning for modifying the loss function dynamically [36], to\nre-weight instances to enable learning with noisy labels [17, 26, 36] and to accelerate training of\nDNNs [7]. These methods involve training a teacher on a task, and then using the teacher to train\nthe student on the target task. In contrast, in our method, the parameters for instances and classes\n(viewed as teacher) and the parameters of the model (viewed as student) are trained jointly. 
Doing\nso ensures that the learnt curriculum is consistent with the current state of the model, and does not\nrequire a held out dataset.\nCurriculum learning has also been explored in the context of learning with noisy labels [12, 17, 37].\nMentorNet[17] trains an additional network for weighing samples in a noisy train set. Guo et al. [12]\npropose a novel curriculum learning framework by measuring the data complexity using clustering\ndensity. They apply their method on large-scale weakly-supervised web images and obtain state-of-\nthe-art results. For a comprehensive overview on label noise and noise robust algorithms we refer the\nreader to [8].\n\n5 Conclusion\n\nIn this work, we have introduced a new family of parameters termed \"data parameters\". We have\nshown that data parameters can be learnt using gradient descent, and doing so amounts to learning\na dynamic curriculum. Speci\ufb01cally, we equip each class and training data point with a learnable\nparameter (data parameters), which governs their importance during different stages of training.\nAlong with the model parameters, the data parameters are also learnt with gradient descent, thereby\nyielding a curriculum which evolves during the course of training. More importantly, post training,\nduring inference, data parameters are not used, and hence do not alter the model\u2019s complexity or\nrun-time at inference. We apply this dynamic curriculum learning framework to image classi\ufb01cation\nand object detection tasks, and show that our approach leads to consistent gains over the baseline\nDNNs. When applied to a noisy dataset, the dynamic curriculum priortizes learning from clean data,\nwhile effectively ignoring noisy data. Finally, when presented with dataset containing random labels,\nour framework resists memorizing the training data unlike standard DNNs.\n\n9\n\n\fReferences\n\n2009.\n\n[1] Yoshua Bengio, J\u00e9r\u00f4me Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. 
In ICML,\n\n[2] Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. Active bias: Training more accurate\n\nneural networks by emphasizing high variance samples. In NIPS, 2017.\n\n[3] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel\n\nUrtasun. 3d object proposals for accurate object class detection. In NIPS, 2015.\n\n[4] Monojit Choudhury, Kalika Bali, Sunayana Sitaram, and Ashutosh Baheti. Curriculum design for code-\nswitching: Experiments with language identi\ufb01cation and language modeling with deep neural networks. In\nInternational Conference on Natural Language Processing, 2017.\n\n[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical\n\nimage database. In CVPR, 2009.\n\n[6] Xuanyi Dong, Deyu Meng, Fan Ma, and Yi Yang. A dual-network progressive approach to weakly\n\nsupervised object detection. In ACM International Conference on Multimedia, 2017.\n\n[7] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In ICLR, 2018.\n[8] Beno\u00eet Fr\u00e9nay and Michel Verleysen. Classi\ufb01cation in the presence of label noise: a survey.\n\nIEEE\n\nTransactions on Neural Networks and Learning Systems, 2013.\n\n[9] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti\n\ndataset. The International Journal of Robotics Research, 2013.\n\n[10] Alex Graves, Marc G Bellemare, Jacob Menick, Remi Munos, and Koray Kavukcuoglu. Automated\n\ncurriculum learning for neural networks. In ICML, 2017.\n\n[11] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In\n\nICML, 2017.\n\narXiv, 2019.\n\nIn CPVR, 2016.\n\n[12] Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R Scott, and Dinglong\n\nHuang. Curriculumnet: Weakly supervised learning from large-scale web images. 
In ECCV, 2018.\n\n[13] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks.\n\n[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\n\n[15] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox.\n\nFlownet 2.0: Evolution of optical \ufb02ow estimation with deep networks. In CVPR, 2017.\n\n[16] Lu Jiang, Deyu Meng, Teruko Mitamura, and Alexander G Hauptmann. Easy samples \ufb01rst: Self-paced\n\nreranking for zero-example multimedia search. In ACM International conference on Multimedia, 2014.\n\n[17] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven\n\ncurriculum for very deep neural networks on corrupted labels. In ICML, 2018.\n\n[18] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of\n\nToronto, 2009.\n\nNIPS, 2010.\n\nIn CVPR, 2011.\n\n[19] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In\n\n[20] Yong Jae Lee and Kristen Grauman. Learning the easy things \ufb01rst: Self-paced visual category discovery.\n\n[21] Siyang Li, Xiangxin Zhu, Qin Huang, Hao Xu, and C-C Jay Kuo. Multiple instance curriculum learning\n\nfor weakly supervised object detection. arXiv, 2017.\n\n[22] Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. Webvision database: Visual learning\n\nand understanding from web data. arXiv preprint arXiv:1708.02862, 2017.\n\n[23] Jian Liang, Zhihang Li, Dong Cao, Ran He, and Jingdong Wang. Self-paced cross-modal subspace\n\nmatching. In SIGIR, 2016.\n\n[24] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and\n\nAlexander C Berg. Ssd: Single shot multibox detector. In ECCV, 2016.\n\n[25] Te Pi, Xi Li, Zhongfei Zhang, Deyu Meng, Fei Wu, Jun Xiao, and Yueting Zhuang. 
Self-paced boost\n\n[26] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust\n\nlearning for classi\ufb01cation. In IJCAI, 2016.\n\ndeep learning. In ICML, 2018.\n\n[27] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning\n\nwith memory-augmented neural networks. In ICML, 2016.\n\n[28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. In ICLR, 2015.\n\ndependency parsing. In NIPS, 2009.\n\n[29] Valentin I Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. Baby steps: How \u201cless is more\u201d in unsupervised\n\n[30] James S Supancic and Deva Ramanan. Self-paced learning for long-term tracking. In CVPR, 2013.\n[31] Maxwell Svetlik, Matteo Leonetti, Jivko Sinapov, Rishi Shah, Nick Walker, and Peter Stone. Automatic\n\ncurriculum graph generation for reinforcement learning agents. In AAAI, 2017.\n\n[32] Kevin Tang, Vignesh Ramanathan, Li Fei-Fei, and Daphne Koller. Shifting weights: Adapting object\n\ndetectors from image to video. In NIPS, 2012.\n\n[33] Jiasi Wang, Xinggang Wang, and Wenyu Liu. Weakly-and semi-supervised faster r-cnn with curriculum\n\nlearning. In ICPR, 2018.\n\n[34] Daphna Weinshall and Dan Amir. Theory of curriculum learning, with convex loss functions. arXiv, 2018.\n\n10\n\n\f[35] Daphna Weinshall, Gad Cohen, and Dan Amir. Curriculum learning by transfer learning: Theory and\n\n[36] Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Lai Jian-Huang, and Tie-Yan Liu. Learning to teach\n\n[37] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled\n\nexperiments with deep networks. arXiv, 2018.\n\nwith dynamic loss functions. In NIPS, 2018.\n\ndata for image classi\ufb01cation. In CVPR, 2015.\n\n[38] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. 
In BMVC, 2016.\n[39] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep\n\nlearning requires rethinking generalization. In ICLR, 2017.\n\n[40] Qian Zhao, Deyu Meng, Lu Jiang, Qi Xie, Zongben Xu, and Alexander G Hauptmann. Self-paced learning\n\nfor matrix factorization. In AAAI, 2015.\n\n[41] Hong-Yu Zhou, Bin-Bin Gao, and Jianxin Wu. Adaptive feeding: Achieving fast and accurate detections\n\nby adaptively combining object detectors. In ICCV, 2017.\n\n11\n\n\f", "award": [], "sourceid": 5946, "authors": [{"given_name": "Shreyas", "family_name": "Saxena", "institution": "Apple"}, {"given_name": "Oncel", "family_name": "Tuzel", "institution": "Apple"}, {"given_name": "Dennis", "family_name": "DeCoste", "institution": "Apple"}]}