{"title": "Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting", "book": "Advances in Neural Information Processing Systems", "page_first": 1919, "page_last": 1930, "abstract": "Current deep neural networks(DNNs) can easily overfit to biased training data with corrupted labels or class imbalance. Sample re-weighting strategy is commonly used to alleviate this issue by designing a weighting function mapping from training loss to sample weight, and then iterating between weight recalculating and classifier updating. Current approaches, however, need manually pre-specify the weighting function as well as its additional hyper-parameters. It makes them fairly hard to be generally applied in practice due to the significant variation of proper weighting schemes relying on the investigated problem and training data. To address this issue, we propose a method capable of adaptively learning an explicit weighting function directly from data. The weighting function is an MLP with one hidden layer, constituting a universal approximator to almost any continuous functions, making the method able to fit a wide range of weighting function forms including those assumed in conventional research. Guided by a small amount of unbiased meta-data, the parameters of the weighting function can be finely updated simultaneously with the learning process of the classifiers. Synthetic and real experiments substantiate the capability of our method for achieving proper weighting functions in class imbalance and noisy label cases, fully complying with the common settings in traditional methods, and more complicated scenarios beyond conventional cases. 
This naturally leads to its better accuracy than other state-of-the-art methods.", "full_text": "Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting

Jun Shu1, Qi Xie1, Lixuan Yi1, Qian Zhao1, Sanping Zhou1, Zongben Xu1, and Deyu Meng*2,1
1Xi'an Jiaotong University
2The Macau University of Science and Technology
*Corresponding author: dymeng@mail.xjtu.edu.cn

Abstract

Current deep neural networks (DNNs) can easily overfit to biased training data with corrupted labels or class imbalance. The sample re-weighting strategy is commonly used to alleviate this issue by designing a weighting function mapping from training loss to sample weight, and then iterating between weight recalculation and classifier updating. Current approaches, however, need to manually pre-specify the weighting function as well as its additional hyper-parameters. This makes them fairly hard to apply generally in practice, since the proper weighting scheme varies significantly with the investigated problem and training data. To address this issue, we propose a method capable of adaptively learning an explicit weighting function directly from data. The weighting function is an MLP with one hidden layer, constituting a universal approximator for almost any continuous function, which makes the method able to fit a wide range of weighting functions, including those assumed in conventional research. Guided by a small amount of unbiased meta-data, the parameters of the weighting function can be finely updated simultaneously with the learning process of the classifiers. Synthetic and real experiments substantiate the capability of our method to achieve proper weighting functions in class imbalance and noisy label cases, fully complying with the common settings in traditional methods, as well as in more complicated scenarios beyond conventional cases. This naturally leads to its better accuracy than other state-of-the-art methods. 
Source code is available at https://github.com/xjtushujun/meta-weight-net.

1 Introduction

DNNs have recently achieved impressive performance on various applications due to their powerful capacity for modeling complex input patterns. However, DNNs can easily overfit to biased training data1, like those containing corrupted labels [2] or with class imbalance [3], leading to poor generalization performance in such cases. This robust deep learning issue has been theoretically illustrated in multiple studies [4, 5, 6, 7, 8, 9].
In practice, however, such biased training data are commonly encountered. For instance, practically collected training samples often contain corrupted labels [10, 11, 12, 13, 14, 15, 16, 17]. A typical example is a dataset roughly collected from a crowdsourcing system [18] or search engines [19, 20], which would possibly yield a large amount of noisy labels. Another popular type of biased training data is that with class imbalance. Real-world datasets usually exhibit skewed distributions with a long-tailed configuration: a few classes account for most of the data, while most classes are

1We call the training data biased when they are generated from a joint sample-label distribution deviating from the distribution of the evaluation/test set [1].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

(a) Weight function in focal loss. (b) Weight function in SPL. (c) Meta-Weight-Net architecture. (d) MW-Net function learned in the class imbalance case. (e) MW-Net function learned in the corrupted labels case. (f) MW-Net function learned on the real Clothing1M dataset.

Figure 1: (a)-(b) weight functions set in focal loss and self-paced learning (SPL). (c) Meta-Weight-Net architecture. 
(d)-(f) Meta-Weight-Net functions learned in class imbalance (imbalance factor 100), noisy label (40% uniform noise), and real dataset cases, respectively, by our method.

under-represented. Effective learning with such training data, whose distribution is biased relative to the evaluation/test data, is thus an important yet challenging issue in machine learning [1, 21].
The sample reweighting approach is a commonly used strategy against this robust learning issue. The main methodology is to design a weighting function mapping from training loss to sample weight (with hyper-parameters), and then iterate between calculating weights from the current training loss values and minimizing the weighted training loss for classifier updating. There exist two entirely contradictory ideas for constructing such a loss-weight mapping. One makes the function monotonically increasing, as depicted in Fig. 1(a), i.e., makes the learning emphasize samples with larger loss values more, since they are more likely to be uncertain hard samples located on the classification boundary. Typical methods of this category include AdaBoost [22, 23], hard negative mining [24] and focal loss [25]. This sample weighting manner is known to be necessary for class imbalance problems, since it can prioritize the minority classes, which have relatively higher training losses.
On the contrary, the other methodology sets the weighting function as monotonically decreasing, as shown in Fig. 1(b), to take samples with smaller loss values as more important ones. The rationale is that these samples are more likely to be high-confidence ones with clean labels. 
Typical methods include self-paced learning (SPL) [26], iterative reweighting [27, 17] and multiple variants [28, 29, 30]. This weighting strategy has especially been used in noisy label cases, since it tends to suppress the effects of samples with extremely large loss values, which possibly have corrupted labels.
Although these sample reweighting methods help improve the robustness of a learning algorithm on biased training samples, they still have evident deficiencies in practice. On the one hand, current methods need to manually set a specific form of the weighting function based on certain assumptions about the training data. This, however, tends to be infeasible when we have little knowledge about the underlying data or when the label conditions are too complicated, as in the case where the training set is both imbalanced and noisy. On the other hand, even when we specify certain weighting schemes, like focal loss [25] or SPL [26], they inevitably involve hyper-parameters, like the focusing parameter in the former and the age parameter in the latter, that have to be manually preset or tuned by cross-validation. This tends to further raise their application difficulty and reduce their performance stability in real problems.
To alleviate the aforementioned issue, this paper presents an adaptive sample weighting strategy to automatically learn an explicit weighting function from data. The main idea is to parameterize the weighting function as an MLP (multilayer perceptron) network with only one hidden layer (as shown in Fig. 1(c)), called Meta-Weight-Net, which is theoretically a universal approximator for almost any continuous function [31], and then use a small unbiased validation set (meta-data) to guide the training of all its parameters. 
The explicit form of the weighting function, specifically suited to the learning task, can thus finally be attained.
In summary, this paper makes the following three-fold contributions:
1) We propose to automatically learn an explicit loss-weight function, parameterized by an MLP, from data in a meta-learning manner. Due to the universal approximation capability of this weight net, it can finely fit a wide range of weighting functions, including those used in conventional research.
2) Experiments verify that the weighting functions learned by our method highly comply with the manually preset weighting manners traditionally used for different training data biases, like the class imbalance and noisy label cases shown in Fig. 1(d) and 1(e), respectively. This shows that the weighting scheme learned by the proposed method helps reveal deeper insights into the data bias, especially in complicated bias cases where the extracted weighting function exhibits complex tendencies (as shown in Fig. 1(f)).
3) The insights into why the proposed method works can be well interpreted. Particularly, the updating equation for the Meta-Weight-Net parameters can be explained as follows: the weights of samples that comply better with the meta-data knowledge will be increased, while those of samples violating such meta-knowledge will be suppressed. This tallies with our common sense on the problem: we should reduce the influence of highly biased samples, while emphasizing unbiased ones.
The paper is organized as follows. Section 2 presents the proposed meta-learning method as well as the detailed algorithm and an analysis of its convergence property. 
Section 3 discusses related work. Section 4 presents experimental results, and the conclusion is finally made.

2 The Proposed Meta-Weight-Net Learning Method

2.1 The Meta-learning Objective

Consider a classification problem with the training set $\{x_i, y_i\}_{i=1}^N$, where $x_i$ denotes the $i$-th sample, $y_i \in \{0, 1\}^c$ is the label vector over $c$ classes, and $N$ is the number of the entire training data. $f(x, w)$ denotes the classifier, and $w$ denotes its parameters. In current applications, $f(x, w)$ is always set as a DNN. We thus also adopt a DNN, and call it the classifier network for convenience in the following.
Generally, the optimal classifier parameter $w^*$ can be extracted by minimizing the loss $\frac{1}{N}\sum_{i=1}^{N}\ell(y_i, f(x_i, w))$ calculated on the training set. For notation convenience, we denote $L_i^{train}(w) = \ell(y_i, f(x_i, w))$. In the presence of biased training data, sample re-weighting methods enhance the robustness of training by imposing the weight $V(L_i^{train}(w); \Theta)$ on the $i$-th sample loss, where $V(\ell; \Theta)$ denotes the weight net, and $\Theta$ represents the parameters contained in it. The optimal parameter $w$ is calculated by minimizing the following weighted loss:

$$w^*(\Theta) = \arg\min_w L^{train}(w; \Theta) \triangleq \frac{1}{N}\sum_{i=1}^{N} V(L_i^{train}(w); \Theta)\, L_i^{train}(w). \qquad (1)$$

Meta-Weight-Net: Our method aims to automatically learn the hyper-parameters $\Theta$ in a meta-learning manner. To this aim, we formulate $V(L_i(w); \Theta)$ as an MLP network with only one hidden layer containing 100 nodes, as shown in Fig. 1(c). We call this weight net Meta-Weight-Net, or MW-Net for easy reference. Each hidden node uses the ReLU activation function, and the output uses the Sigmoid activation function, to guarantee that the output is located in the interval of [0, 1]. 
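The one-hidden-layer mapping just described can be sketched in a few lines of NumPy (a minimal illustration, not the authors' released implementation; the 100-node width, ReLU hidden units and Sigmoid output follow the description above, while the Gaussian initialization is a placeholder assumption):

```python
import numpy as np

def mw_net(loss, theta):
    """V(loss; theta): map a scalar training loss to a sample weight in [0, 1]."""
    h = np.maximum(0.0, theta["W1"] @ np.atleast_1d(loss) + theta["b1"])  # ReLU hidden layer
    out = theta["W2"] @ h + theta["b2"]
    return 1.0 / (1.0 + np.exp(-out))  # Sigmoid keeps the weight inside [0, 1]

def init_theta(hidden=100, seed=0):
    """Placeholder Gaussian initialization for the MLP parameters (an assumption)."""
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(0.0, 0.1, (hidden, 1)), "b1": np.zeros(hidden),
            "W2": rng.normal(0.0, 0.1, (1, hidden)), "b2": np.zeros(1)}

theta = init_theta()
weight = mw_net(3.0, theta)        # weight assigned to a sample whose loss is 3.0
assert 0.0 <= float(weight) <= 1.0  # guaranteed by the Sigmoid output
```

In the actual method, the parameters $\Theta$ are of course not hand-set but updated by the meta-learning procedure of Section 2.2.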
Albeit simple, this net is known to be a universal approximator for almost any continuous function [31], and thus can fit a wide range of weighting functions, including those used in conventional research.
Meta learning process. The parameters contained in MW-Net can be optimized by using the meta-learning idea [32, 33, 34, 35]. Specifically, assume that we have a small amount of unbiased meta-data (i.e., with clean labels and a balanced data distribution) $\{x_i^{(meta)}, y_i^{(meta)}\}_{i=1}^{M}$, representing the meta-knowledge of the ground-truth sample-label distribution, where $M$ is the number of meta-samples and $M \ll N$. The optimal parameter $\Theta^*$ can be obtained by minimizing the following meta-loss:

$$\Theta^* = \arg\min_\Theta L^{meta}(w^*(\Theta)) \triangleq \frac{1}{M}\sum_{i=1}^{M} L_i^{meta}(w^*(\Theta)), \qquad (2)$$

where $L_i^{meta}(w) = \ell\big(y_i^{(meta)}, f(x_i^{(meta)}, w)\big)$ is calculated on the meta-data.

2.2 The Meta-Weight-Net Learning Method

Calculating the optimal $\Theta^*$ and $w^*$ requires two nested loops of optimization. Here we adopt an online strategy to update $\Theta$ and $w$ through a single optimization loop, to guarantee the efficiency of the algorithm.

Figure 2: Main flowchart of the proposed MW-Net Learning algorithm (steps 5-7 in Algorithm 1).

Formulating the learning manner of the classifier network. Following general network training practice, we employ SGD to optimize the training loss (1). Specifically, in each iteration of training, a mini-batch of training samples $\{(x_i, y_i), 1 \le i \le n\}$ is sampled, where $n$ is the mini-batch size. Then the updating equation of the classifier network parameter can be formulated by moving the current $w^{(t)}$ along the descent direction of the objective loss in Eq. 
(1) on a mini-batch of training data:

$$\hat{w}^{(t)}(\Theta) = w^{(t)} - \alpha \frac{1}{n}\sum_{i=1}^{n} V(L_i^{train}(w^{(t)}); \Theta)\, \nabla_w L_i^{train}(w)\Big|_{w^{(t)}}, \qquad (3)$$

where $\alpha$ is the step size.

Algorithm 1 The MW-Net Learning Algorithm
Input: Training data $D$, meta-data set $\hat{D}$, batch sizes $n$, $m$, max iterations $T$.
Output: Classifier network parameter $w^{(T)}$.
1: Initialize classifier network parameter $w^{(0)}$ and Meta-Weight-Net parameter $\Theta^{(0)}$.
2: for $t = 0$ to $T-1$ do
3:   $\{x, y\} \leftarrow$ SampleMiniBatch($D$, $n$).
4:   $\{x^{(meta)}, y^{(meta)}\} \leftarrow$ SampleMiniBatch($\hat{D}$, $m$).
5:   Formulate the classifier learning function $\hat{w}^{(t)}(\Theta)$ by Eq. (3).
6:   Update $\Theta^{(t+1)}$ by Eq. (4).
7:   Update $w^{(t+1)}$ by Eq. (5).
8: end for

Updating parameters of Meta-Weight-Net: After receiving the feedback of the classifier network parameter updating formulation $\hat{w}^{(t)}(\Theta)$2 from Eq. (3), the parameter $\Theta$ of the Meta-Weight-Net can then be readily updated guided by Eq. (2), i.e., by moving the current parameter $\Theta^{(t)}$ along the gradient of the objective of Eq. (2) calculated on the meta-data:

$$\Theta^{(t+1)} = \Theta^{(t)} - \beta \frac{1}{m}\sum_{i=1}^{m} \nabla_\Theta L_i^{meta}(\hat{w}^{(t)}(\Theta))\Big|_{\Theta^{(t)}}, \qquad (4)$$

where $\beta$ is the step size.
Updating parameters of the classifier network: Then, the updated $\Theta^{(t+1)}$ is employed to ameliorate the parameter $w$ of the classifier network, i.e.,

$$w^{(t+1)} = w^{(t)} - \alpha \frac{1}{n}\sum_{i=1}^{n} V(L_i^{train}(w^{(t)}); \Theta^{(t+1)})\, \nabla_w L_i^{train}(w)\Big|_{w^{(t)}}. \qquad (5)$$

The MW-Net Learning algorithm is summarized in Algorithm 1, and Fig. 2 illustrates its main implementation process (steps 5-7). 
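The structure of one iteration (steps 5-7 of Algorithm 1, i.e., Eqs. (3)-(5)) can be mimicked on a toy problem. The sketch below is only illustrative: it uses a hypothetical 1-D linear model in place of the DNN, a two-parameter weight net in place of the MLP, and finite differences in place of the automatic differentiation the paper relies on for the meta-gradient of step 6:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D "classifier" f(x, w) = w * x with squared loss (stand-in for the DNN).
def losses(w, X, Y):
    return (w * X - Y) ** 2

def grads(w, X, Y):
    return 2 * (w * X - Y) * X                 # dL_i/dw for each sample

def V(loss, theta):                            # toy 2-parameter weight net, not the MLP
    return sigmoid(theta[0] + theta[1] * loss)

def w_hat(w, theta, X, Y, alpha):              # step 5: Eq. (3) as a function of theta
    return w - alpha * np.mean(V(losses(w, X, Y), theta) * grads(w, X, Y))

def meta_loss(w, Xm, Ym):                      # Eq. (2) on the meta mini-batch
    return np.mean((w * Xm - Ym) ** 2)

def mwnet_iteration(w, theta, X, Y, Xm, Ym, alpha=0.1, beta=0.1, eps=1e-6):
    # Step 6: Eq. (4) -- meta-gradient w.r.t. theta through w_hat(theta);
    # finite differences stand in for automatic differentiation here.
    g = np.zeros_like(theta)
    for k in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[k] += eps
        tm[k] -= eps
        g[k] = (meta_loss(w_hat(w, tp, X, Y, alpha), Xm, Ym)
                - meta_loss(w_hat(w, tm, X, Y, alpha), Xm, Ym)) / (2 * eps)
    theta = theta - beta * g
    # Step 7: Eq. (5) -- update w using the freshly updated theta.
    return w_hat(w, theta, X, Y, alpha), theta

rng = np.random.default_rng(0)
X = rng.normal(size=32)
Y = 2.0 * X + 0.1 * rng.normal(size=32)        # training data, ground-truth slope 2
Xm = rng.normal(size=8)
Ym = 2.0 * Xm                                  # small clean meta set

w, theta = 0.0, np.zeros(2)
start = meta_loss(w, Xm, Ym)
for _ in range(50):
    w, theta = mwnet_iteration(w, theta, X, Y, Xm, Ym)
assert meta_loss(w, Xm, Ym) < start            # meta loss shrinks as w approaches 2
```

The key design point mirrored here is that step 6 differentiates the meta loss through $\hat{w}^{(t)}(\Theta)$, treating the classifier update as a function of $\Theta$ rather than a fixed value.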
All computations of gradients can be efficiently implemented by automatic differentiation techniques and generalized to any deep learning architecture for the classifier network. The algorithm can be easily implemented using popular deep learning frameworks like PyTorch [36]. It is easy to see that both the classifier network and the MW-Net gradually ameliorate their parameters during the learning process based on their values calculated in the last step, so that the weights can be updated in a stable manner, as clearly shown in Fig. 6.

2Notice that $\Theta$ here is a variable instead of a fixed quantity, which makes $\hat{w}^{(t)}(\Theta)$ a function of $\Theta$ and enables the gradient in Eq. (4) to be computed.

2.3 Analysis on the Weighting Scheme of Meta-Weight-Net

The computation of Eq. (4) by backpropagation can be rewritten as3:

$$\Theta^{(t+1)} = \Theta^{(t)} + \frac{\alpha\beta}{n}\sum_{j=1}^{n}\Big(\frac{1}{m}\sum_{i=1}^{m} G_{ij}\Big)\, \frac{\partial V(L_j^{train}(w^{(t)}); \Theta)}{\partial \Theta}\Big|_{\Theta^{(t)}}, \qquad (6)$$

where $G_{ij} = \frac{\partial L_i^{meta}(\hat{w})}{\partial \hat{w}}\big|_{\hat{w}^{(t)}}^{T}\, \frac{\partial L_j^{train}(w)}{\partial w}\big|_{w^{(t)}}$. Neglecting the coefficient $\frac{1}{m}\sum_{i=1}^{m} G_{ij}$, it is easy to see that each term in the sum orients to the ascent gradient of the weight function $V(L_j^{train}(w^{(t)}); \Theta)$. The coefficient $\frac{1}{m}\sum_{i=1}^{m} G_{ij}$ imposed on the $j$-th gradient term represents the similarity between the gradient of the $j$-th training sample computed on the training loss and the average gradient of the mini-batch of meta-data calculated on the meta loss. 
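This similarity interpretation can be checked numerically. With made-up gradient vectors for a hypothetical 3-parameter model (all values below are illustrative assumptions), a training gradient aligned with the meta gradients receives a positive coefficient, and an opposing one a negative coefficient:

```python
import numpy as np

# Hypothetical per-sample gradients in a 3-parameter model (illustrative values only).
meta_grads = np.array([[1.0, 0.0, 1.0],    # dL_i^meta/dw for each meta sample i
                       [0.8, 0.2, 1.1]])
g_clean = np.array([0.9, 0.1, 1.0])        # training sample aligned with the meta gradients
g_noisy = np.array([-1.0, 0.5, -0.9])      # training sample opposing them

def coef(g):
    """Coefficient (1/m) * sum_i G_ij from Eq. (6): mean inner product with meta gradients."""
    return float(np.mean(meta_grads @ g))

# Aligned sample: weight pushed up; opposing sample: weight pushed down.
assert coef(g_clean) > 0 > coef(g_noisy)
```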
That means that if the learning gradient of a training sample is similar to that of the meta samples, then it will be considered beneficial for getting the right results and its weight tends to be increased. Conversely, the weight of the sample inclines to be suppressed. This understanding is consistent with why the well-known MAML works [37, 38, 39].

2.4 Convergence of the MW-Net Learning algorithm

Our algorithm involves optimization of two-level objectives, and therefore we show theoretically that our method converges to the critical points of both the meta and training loss functions under some mild conditions, in Theorems 1 and 2, respectively. The proof is listed in the supplementary material.

Theorem 1. Suppose the loss function $\ell$ is Lipschitz smooth with constant $L$; $V(\cdot)$ is differentiable with a $\delta$-bounded gradient and twice differentiable with its Hessian bounded by $B$; and $\ell$ has $\rho$-bounded gradients with respect to training/meta data. Let the learning rate $\alpha_t$ satisfy $\alpha_t = \min\{1, \frac{k}{T}\}$ for some $k > 0$ such that $\frac{k}{T} < 1$, and let $\beta_t$, $1 \le t \le N$, be a monotonically descending sequence with $\beta_t = \min\{\frac{1}{L}, \frac{c}{\sigma\sqrt{T}}\}$ for some $c > 0$ such that $\frac{\sigma\sqrt{T}}{c} \ge L$, $\sum_{t=1}^{\infty}\beta_t \le \infty$ and $\sum_{t=1}^{\infty}\beta_t^2 \le \infty$. Then the proposed algorithm can achieve $\mathbb{E}[\|\nabla L^{meta}(\Theta^{(t)})\|_2^2] \le \epsilon$ in $O(1/\epsilon^2)$ steps. More specifically,

$$\min_{0 \le t \le T} \mathbb{E}[\|\nabla L^{meta}(\Theta^{(t)})\|_2^2] \le O\Big(\frac{C}{\sqrt{T}}\Big), \qquad (7)$$

where $C$ is some constant independent of the convergence process, and $\sigma$ is the variance of drawing a mini-batch of samples uniformly at random.
Theorem 2. 
Suppose the conditions in Theorem 1 hold; then we have:

$$\lim_{t \to \infty} \mathbb{E}[\|\nabla L^{train}(w^{(t)}; \Theta^{(t+1)})\|_2^2] = 0. \qquad (8)$$

3 Related Work

Sample Weighting Methods. The idea of reweighting examples dates back to dataset resampling [40, 41] and instance re-weighting [42], which pre-evaluate the sample weights as a pre-processing step by using certain prior knowledge about the task or data. To make the sample weights fit the data more flexibly, more recent research has focused on pre-designing a weighting function mapping from training loss to sample weight, and dynamically updating the weights during the training process [43, 44]. There are mainly two manners of designing the weighting function. One is to make it monotonically increasing, which is specifically effective in the class imbalance case. Typical methods include the boosting algorithm (like AdaBoost [22]) and many of its variants [45], hard example mining [24] and focal loss [25], which impose larger weights on samples with larger loss values. On the contrary, another series of methods specifies the weighting function as monotonically decreasing, which is especially used in noisy label cases. For example, SPL [26] and its extensions [28, 29], iterative reweighting [27, 17] and other recent work [46, 30] focus more on easy samples with smaller losses. The limitation of these methods is that they all need to manually pre-specify the form of the weighting function as well as its hyper-parameters, raising the difficulty of readily using them in real applications.

3Derivation can be found in the supplementary materials.

Meta Learning Methods. Inspired by meta-learning developments [47, 48, 49, 37, 50], some methods were recently proposed to learn an adaptive weighting scheme from data to make the learning more automatic and reliable. 
Typical methods along this line include FWL [51], learning to teach [52, 32] and MentorNet [21], whose weight functions are designed as a Bayesian function approximator, a DNN with an attention mechanism, and a bidirectional LSTM network, respectively. Instead of only taking loss values as inputs as classical methods do, the weighting functions they use (i.e., the meta-learners) have much more complex forms and require complicated input information (like sample features). This makes it hard for them not only to inherit the good properties of traditional methods, but also to be easily reproduced by general users.
A closely related method, called L2RW [1], adopts a meta-learning mechanism similar to ours. The major difference is that the weights are learned implicitly there, without an explicit weighting function. This, however, might lead to unstable weighting behavior during training and an inability to generalize. In contrast, with the explicit yet simple Meta-Weight-Net, our method can learn the weights in a more stable way, as shown in Fig. 6, and can easily be generalized from a certain task to other related ones (see the supplementary material).
Other Methods for Class Imbalance. Other methods for handling data imbalance include [53, 54], which try to transfer the knowledge learned from major classes to minor classes. Metric learning based methods have also been developed to effectively exploit the tailed data to improve the generalization ability, e.g., triple-header loss [55] and range loss [56].
Other Methods for Corrupted Labels. For handling the noisy label issue, multiple methods have been designed to correct noisy labels to their true ones via a supplemental clean-label inference step [11, 14, 57, 13, 21, 1, 15]. For example, GLC [15] proposed a loss correction approach to mitigate the effects of label noise on DNN classifiers. 
Other methods along this line include Reed [58], Co-teaching [16], D2L [59] and S-Model [12].

4 Experimental Results

To evaluate the capability of the proposed algorithm, we implement experiments on datasets with class imbalance and noisy label issues, and on a real-world dataset with more complicated data bias.

4.1 Class Imbalance Experiments

We use the Long-Tailed CIFAR dataset [60], which reduces the number of training samples per class according to the exponential function $n = n_i \mu^i$, where $i$ is the class index, $n_i$ is the original number of training images and $\mu \in (0, 1)$. The imbalance factor of a dataset is defined as the number of training samples in the largest class divided by that in the smallest. We trained ResNet-32 [61] with the softmax cross-entropy loss by SGD with momentum 0.9, weight decay $5\times10^{-4}$ and an initial learning rate of 0.1. The learning rate of ResNet-32 is divided by 10 after epochs 80 and 90 (for a total of 100 epochs), and the learning rate of MW-Net is fixed as $10^{-5}$. We randomly selected 10 images per class from the validation set as the meta-data set. The compared methods include: 1) BaseModel, which uses a softmax cross-entropy loss to train ResNet-32 on the training set; 2) Focal loss [25] and Class-Balanced [60], which represent the state-of-the-art of pre-defined sample reweighting techniques; 3) Fine-tuning, which fine-tunes the result of BaseModel on the meta-data set; 4) L2RW [1], which leverages an additional meta-dataset to adaptively assign weights to training samples.
Table 1 shows the classification accuracy of ResNet-32 on the test set, and confusion matrices are displayed in Fig. 3 (more details are listed in the supplementary material). 
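The long-tailed construction just described (per-class counts decaying as $n_i \mu^i$) can be sketched as follows; deriving $\mu$ from a target imbalance factor is our own convenience assumption for the sketch, not a prescription from [60]:

```python
def long_tailed_counts(n_per_class, num_classes, imbalance_factor):
    """Per-class counts n * mu^i, with mu chosen so that
    largest/smallest = imbalance_factor, i.e. mu^(num_classes - 1) = 1/imbalance_factor."""
    mu = (1.0 / imbalance_factor) ** (1.0 / (num_classes - 1))
    return [round(n_per_class * mu ** i) for i in range(num_classes)]

# CIFAR-10-like setting: 5000 images per class, imbalance factor 100.
counts = long_tailed_counts(n_per_class=5000, num_classes=10, imbalance_factor=100)
assert counts[0] == 5000 and counts[-1] == 50   # largest / smallest = 100
```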
It can be observed that: 1) Our algorithm evidently outperforms other competing methods on datasets with class imbalance, showing its robustness in such a data bias case; 2) When the imbalance factor is 1, i.e., all classes have the same number of samples, fine-tuning performs best, and our method still attains comparable performance; 3) When the imbalance factor is 200 on long-tailed CIFAR-100, the smallest class has only two samples. An extra fine-tuning step achieves a performance gain, while our method still performs well under such extreme data bias.
To understand the weighting scheme of MW-Net, we depict the tendency curve of weight with respect to loss of the learned MW-Net in Fig. 1(d), which complies with the classical optimal weighting manner for such data bias, i.e., larger weights should be imposed on samples with relatively large losses, which are more likely to be minority-class samples.

Table 1: Test accuracy (%) of ResNet-32 on long-tailed CIFAR-10 and CIFAR-100; the best and the second best results are highlighted in bold and italic bold, respectively.

Dataset Name   | Long-Tailed CIFAR-10                      | Long-Tailed CIFAR-100
Imbalance      | 200    100    50     20     10     1      | 200    100    50     20     10     1
BaseModel      | 65.68  70.36  74.81  82.23  86.39  92.89  | 34.84  38.32  43.85  51.14  55.71  70.50
Focal Loss     | 65.29  70.38  76.71  82.76  86.66  93.03  | 35.62  38.41  44.32  51.95  55.78  70.52
Class-Balanced | 68.89  74.57  79.27  84.36  87.49  92.89  | 36.23  39.60  45.32  52.59  57.99  70.50
Fine-tuning    | 66.08  71.33  77.42  83.37  86.42  93.23  | 38.22  41.83  46.40  52.11  57.44  70.72
L2RW           | 66.51  74.16  78.93  82.12  85.19  89.25  | 33.38  40.23  44.44  51.64  53.73  64.11
Ours           | 68.91  75.21  80.06  84.94  87.84  92.66  | 37.91  42.09  46.74  54.37  58.46  70.37

Table 2: Test accuracy comparison on CIFAR-10 and CIFAR-100 of WRN-28-10 with varying noise rates under uniform noise. 
Mean accuracy (±std) over 5 repetitions is reported ('—' means the method fails).

Datasets / Noise Rate | CIFAR-10 0% | CIFAR-10 40% | CIFAR-10 60% | CIFAR-100 0% | CIFAR-100 40% | CIFAR-100 60%
BaseModel    | 95.60±0.22 | 68.07±1.23 | 53.12±3.03 | 79.95±1.26 | 51.11±0.42 | 30.92±0.33
Reed-Hard    | 94.38±0.14 | 81.26±0.51 | 73.53±1.54 | 64.45±1.02 | 51.27±1.18 | 26.95±0.98
S-Model      | 83.79±0.11 | 79.58±0.33 | —          | 52.86±0.99 | 42.12±0.99 | —
Self-paced   | 90.81±0.34 | 86.41±0.29 | 53.10±1.78 | 59.79±0.46 | 46.31±2.45 | 19.08±0.57
Focal Loss   | 95.70±0.15 | 75.96±1.31 | 51.87±1.19 | 81.04±0.24 | 51.19±0.46 | 27.70±3.77
Co-teaching  | 88.67±0.25 | 74.81±0.34 | 73.06±0.25 | 61.80±0.25 | 46.20±0.15 | 35.67±1.25
D2L          | 94.64±0.33 | 85.60±0.13 | 68.02±0.41 | 66.17±1.42 | 52.10±0.97 | 41.11±0.30
Fine-tuning  | 95.65±0.15 | 80.47±0.25 | 78.75±2.40 | 80.88±0.21 | 52.49±0.74 | 38.16±0.38
MentorNet    | 94.35±0.42 | 87.33±0.22 | 82.80±1.35 | 73.26±1.23 | 61.39±3.99 | 36.87±1.47
L2RW         | 92.38±0.10 | 86.92±0.19 | 82.24±0.36 | 72.99±0.58 | 60.79±0.91 | 48.15±0.34
GLC          | 94.30±0.19 | 88.28±0.03 | 83.49±0.24 | 73.75±0.51 | 61.31±0.22 | 50.81±1.00
Ours         | 94.52±0.25 | 89.27±0.28 | 84.07±0.33 | 78.76±0.24 | 67.73±0.26 | 58.75±0.11

4.2 Corrupted Label Experiments

We study two settings of corrupted labels on the training set: 1) Uniform noise. The label of each sample is independently changed to a random class with probability $p$, following the same setting as in [2]. 2) Flip noise. The label of each sample is independently flipped to a similar class with total probability $p$. In our experiments, we randomly select two classes as similar classes with equal probability. 
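The uniform-noise setting described above can be sketched as follows (a toy version; whether the randomly drawn replacement class may coincide with the original label is our assumption here, so slightly fewer than a fraction $p$ of the labels actually change):

```python
import numpy as np

def corrupt_uniform(labels, p, num_classes, seed=0):
    """Uniform noise: each label is independently replaced, with probability p,
    by a class drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    flip = rng.random(len(labels)) < p            # which samples get re-drawn
    labels[flip] = rng.integers(0, num_classes, size=flip.sum())
    return labels

y = np.arange(10000) % 10                         # toy CIFAR-10-like label array
y_noisy = corrupt_uniform(y, p=0.4, num_classes=10)
# Roughly p * (1 - 1/num_classes) of the labels actually change,
# since a re-draw may hit the original class.
assert 0.3 < np.mean(y_noisy != y) < 0.42
```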
Two benchmark datasets are employed: CIFAR-10 and CIFAR-100 [62]. Both are popularly used for the evaluation of noisy labels [59, 16]. 1000 images with clean labels from the validation set are randomly selected as the meta-data set. We adopt a Wide ResNet-28-10 (WRN-28-10) [63] for uniform noise and ResNet-32 [61] for flip noise as our classifier network models4.
The compared methods include: BaseModel, referring to the same classifier network utilized in our method, but directly trained on the biased training data; the robust learning methods Reed [58], S-Model [12], SPL [26], Focal Loss [25], Co-teaching [16] and D2L [59]; Fine-tuning, which fine-tunes the result of BaseModel on the meta-data with clean labels to further enhance its performance; and the typical meta-learning methods MentorNet [21], L2RW [1] and GLC [15]. We also trained the baseline network only on the 1000 meta-images. Its performance is evidently worse than that of the proposed method due to neglecting the knowledge underlying the large amount of training samples. We thus do not include its results in the comparison.
All the baseline networks were trained using SGD with momentum 0.9, weight decay $5\times10^{-4}$ and an initial learning rate of 0.1. The learning rate of the classifier network is divided by 10 after epochs 36 and 38 (for a total of 40 epochs) for uniform noise, and after epochs 40 and 50 (for a total of 60 epochs) for flip noise. The learning rate of MW-Net is fixed as $10^{-3}$. We repeated the experiments 5 times with different random seeds for network initialization and label noise generation. We report the accuracy averaged over the 5 repetitions for each series of experiments and each competing method in Tables 2 and 3. It can be observed that our method achieves the best performance across almost all datasets and all noise rates, being second only for 40% flip noise. 
At 0% noise (the unbiased case), our method performs only slightly worse than the BaseModel. For the other corrupted label cases, the superiority of our method is evident. Besides, it can be seen that the performance gaps between ours and all other competing methods increase as the noise rate grows from 40% to 60% under uniform noise. Even with 60% label noise, our method can still obtain a relatively high classification accuracy, and attains more than 15% accuracy gain compared with the second best result on the CIFAR-100 dataset, which indicates the robustness of our method in such cases.

4We have tried different classifier network architectures under each noise setting to show that our algorithm is suitable to different deep learning architectures. We show this effect in Fig. 4, verifying the consistently good performance of our method in the two classifier network settings.

Figure 3: Confusion matrices for the BaseModel and ours on long-tailed CIFAR-10 with imbalance factor 200.

Figure 4: Performance comparison for different classifier networks (WRN-28-10 and ResNet-32) under CIFAR flip noise.

Table 3: Test accuracy comparison on CIFAR-10 and CIFAR-100 of ResNet-32 with varying noise rates under flip noise.

Datasets / Noise Rate | CIFAR-10 0% | CIFAR-10 20% | CIFAR-10 40% | CIFAR-100 0% | CIFAR-100 20% | CIFAR-100 40%
BaseModel    | 92.89±0.32 | 76.83±2.30 | 70.77±2.31 | 70.50±0.12 | 50.86±0.27 | 43.01±1.16
Reed-Hard    | 92.31±0.25 | 88.28±0.36 | 81.06±0.76 | 69.02±0.32 | 60.27±0.76 | 50.40±1.01
S-Model      | 83.61±0.13 | 79.25±0.30 | 75.73±0.32 | 51.46±0.20 | 45.45±0.25 | 43.81±0.15
Self-paced   | 88.52±0.21 | 87.03±0.34 | 81.63±0.52 | 67.55±0.27 | 63.63±0.30 | 53.51±0.53
Focal Loss   | 93.03±0.16 | 86.45±0.19 | 80.45±0.97 | 70.02±0.53 | 61.87±0.30 | 54.13±0.40
Co-teaching  | 89.87±0.10 | 82.83±0.85 | 75.41±0.21 | 63.31±0.05 | 54.13±0.55 | 44.85±0.81
D2L          | 92.02±0.14 | 87.66±0.40 | 83.89±0.46 | 68.11±0.26 | 63.48±0.53 | 51.83±0.33
Fine-tuning  | 93.23±0.23 | 82.47±3.64 | 74.07±1.56 | 70.72±0.22 | 56.98±0.50 | 46.37±0.25
MentorNet    | 92.13±0.30 | 86.36±0.31 | 81.76±0.28 | 70.24±0.21 | 61.97±0.47 | 52.66±0.56
L2RW         | 89.25±0.37 | 87.86±0.36 | 85.66±0.51 | 64.11±1.09 | 57.47±1.16 | 50.98±1.55
GLC          | 91.02±0.20 | 89.68±0.33 | 88.92±0.24 | 65.42±0.23 | 63.07±0.53 | 62.22±0.62
Ours         | 92.04±0.15 | 90.33±0.61 | 87.54±0.23 | 70.11±0.33 | 64.22±0.28 | 58.64±0.47

Fig. 4 shows the performance comparison between WRN-28-10 and ResNet-32 under a fixed flip noise setting. We can observe that the performance gains of our method over BaseModel take almost the same value for the two networks. This implies that the performance improvement of our method does not depend on the choice of the classifier network architecture.
As shown in Fig. 1(e), the shape of the learned weighting function is monotonically decreasing, complying with the traditional optimal setting for this bias condition, i.e., imposing smaller weights on samples with relatively large losses to suppress the effect of corrupted labels. Furthermore, we plot the weight distributions of clean and noisy training samples in Fig. 5. It can be seen that almost all large weights belong to clean samples, and the noisy samples' weights are smaller than those of clean samples, which implies that the trained Meta-Weight-Net can distinguish clean from noisy images.
Fig. 
6 plots the weight variation along training epochs under 40% uniform noise on the CIFAR-10 dataset for our method and L2RW. The y-axis denotes the difference of weights between adjacent epochs, and the x-axis denotes the epoch number. Ten noisy samples are randomly chosen to compute their mean curve, surrounded by a region illustrating the standard deviation over these samples at the corresponding epoch. It can be seen that the weights produced by our method change continuously, gradually stabilize along iterations, and finally converge. In comparison, the weights during the learning process of L2RW fluctuate much more wildly. This could explain the consistently better performance of our method compared with this competing method.

4.3 Experiments on Clothing1M

To verify the effectiveness of the proposed method on real-world data, we conduct experiments on the Clothing1M dataset [64], containing 1 million clothing images obtained from online shopping websites in 14 categories, e.g., T-shirt, Shirt, Knitwear. The labels are generated from the surrounding texts of the images provided by the sellers, and therefore contain many errors. We use the 7k clean data as the meta dataset. Following previous works [65, 66], we used a ResNet-50 pre-trained on ImageNet. For preprocessing, we resize each image to 256 × 256, crop the middle 224 × 224 as input, and perform normalization. We used SGD with momentum 0.9, weight decay 10−3, an initial learning rate of 0.01, and batch size 32. The learning rate of ResNet-50 is divided by 10 after epoch 5 (of 10 epochs in total), and the learning rate of MW-Net is fixed at 10−3.

The results are summarized in Table 4, which shows that the proposed method achieves the best performance. Fig. 1(f) plots the tendency curve of the learned MW-Net function, which reveals abundant data insights.
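The one-hidden-layer weighting function whose learned curve is discussed here can be sketched as follows (a minimal plain-Python sketch; the hidden width of 100 and the random parameter values are illustrative assumptions, not the trained network):

```python
import math
import random

def mw_net(loss, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer MLP mapping a scalar training
    loss to a sample weight: hidden ReLU layer, then a sigmoid output so
    the weight stays in (0, 1). A sketch of the weighting-function
    architecture described in the paper, not its trained parameters."""
    hidden = [max(0.0, wi * loss + bi) for wi, bi in zip(w1, b1)]
    z = sum(wi * hi for wi, hi in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid keeps the weight bounded

random.seed(0)
H = 100  # hidden width (an assumption for illustration)
w1 = [random.gauss(0, 0.1) for _ in range(H)]
b1 = [0.0] * H
w2 = [random.gauss(0, 0.1) for _ in range(H)]
b2 = 0.0

weights = [mw_net(l, w1, b1, w2, b2) for l in (0.1, 1.0, 5.0)]
assert all(0.0 < w < 1.0 for w in weights)
```

Being a universal approximator, such an MLP can realize the non-monotonic shapes described next (increasing on small losses, decreasing on large ones), unlike fixed monotone weighting schemes.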
Figure 5: Sample weight distributions on training data under the 40% uniform noise experiments.

Figure 6: Weight variation curves under the 40% uniform noise experiment on the CIFAR-10 dataset.

Table 4: Classification accuracy (%) of all competing methods on the Clothing1M test set.

#  Method                   Accuracy    #  Method                     Accuracy
1  Cross Entropy            68.94       5  Joint Optimization [66]    72.23
2  Bootstrapping [58]       69.12       6  LCCN [67]                  73.07
3  Forward [65]             69.84       7  MLNT [68]                  73.47
4  S-adaptation [12]        70.36       8  Ours                       73.72

Specifically, when the loss takes relatively small values, the weighting function inclines to increase with the loss, meaning that it tends to emphasize hard samples carrying informative knowledge for classification; while when the loss grows large, the weighting function begins to monotonically decrease, implying that it
tends to suppress samples with noisy labels, which have relatively large loss values. Such a complicated behavior cannot be finely delivered by conventional weighting functions.

5 Conclusion

We have proposed a novel meta-learning method for adaptively extracting sample weights to guarantee robust deep learning in the presence of training data bias. Compared with current reweighting methods that require manually setting the form of the weighting function, the new method can yield a rational one directly from data. The working principle of our algorithm can be well explained, and the procedure of our method can be easily reproduced (Appendix A provides the PyTorch implementation of our algorithm in fewer than 30 lines of code, and the complete training code is available at https://github.com/xjtushujun/meta-weight-net). Our empirical results show that the proposed method performs well in general data bias cases, such as class imbalance, corrupted labels, and more complicated real cases. Besides, such an adaptive weight learning approach is promising for other weight setting problems in machine learning, such as ensemble methods and multi-view learning.

Acknowledgments

This research was supported by the China NSFC projects under contracts 61661166011, 11690011, 61603292, 61721002, U1811461. The authors would also like to thank the anonymous reviewers for their constructive suggestions on improving the paper, especially on the proofs and theoretical analysis.

References

[1] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018.

[2] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

[3] Haibo He and Edwardo A. Garcia. Learning from imbalanced data.
IEEE Transactions on Knowledge and Data Engineering, 2008.

[4] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In NeurIPS, 2017.

[5] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In ICML, 2017.

[6] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

[7] Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In ICLR, 2018.

[8] Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2012.

[9] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 2018.

[10] Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep neural networks. In ICLR Workshop, 2015.

[11] Samaneh Azadi, Jiashi Feng, Stefanie Jegelka, and Trevor Darrell. Auxiliary image regularization for deep CNNs with noisy labels. In ICLR, 2016.

[12] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer.
In ICLR, 2017.

[13] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In ICCV, 2017.

[14] Arash Vahdat. Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS, 2017.

[15] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In NeurIPS, 2018.

[16] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, 2018.

[17] Zhilu Zhang and Mert R. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, 2018.

[18] Wei Bi, Liwei Wang, James T. Kwok, and Zhuowen Tu. Learning to predict from crowdsourced data. In UAI, 2014.

[19] Junwei Liang, Lu Jiang, Deyu Meng, and Alexander Hauptmann. Learning to detect concepts from webly-labeled video data. In IJCAI, 2016.

[20] Bohan Zhuang, Lingqiao Liu, Yao Li, Chunhua Shen, and Ian D. Reid. Attend in groups: a weakly-supervised deep learning framework for learning from web data. In CVPR, 2017.

[21] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.

[22] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[23] Yanmin Sun, Mohamed S. Kamel, Andrew K. C. Wong, and Yang Wang. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378, 2007.

[24] Tomasz Malisiewicz, Abhinav Gupta, and Alexei A. Efros. Ensemble of exemplar-SVMs for object detection and beyond.
In ICCV, 2011.

[25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[26] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In NeurIPS, 2010.

[27] Fernando De la Torre and Michael J. Black. A framework for robust subspace learning. International Journal of Computer Vision, 54(1):117–142, 2003.

[28] Lu Jiang, Deyu Meng, Teruko Mitamura, and Alexander G. Hauptmann. Easy samples first: Self-paced reranking for zero-example multimedia search. In ACM MM, 2014.

[29] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity. In NeurIPS, 2014.

[30] Yixin Wang, Alp Kucukelbir, and David M. Blei. Robust probabilistic modeling with Bayesian data reweighting. In ICML, 2017.

[31] Balázs Csanád Csáji. Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary, 24:48, 2001.

[32] Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Lai Jian-Huang, and Tie-Yan Liu. Learning to teach with dynamic loss functions. In NeurIPS, 2018.

[33] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In NeurIPS, 2016.

[34] Mostafa Dehghani, Aliaksei Severyn, Sascha Rothe, and Jaap Kamps. Learning to learn from weak supervision by full supervision. In NeurIPS Workshop, 2017.

[35] Luca Franceschi, Paolo Frasconi, Saverio Salzo, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning.
In ICML, 2018.

[36] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Workshop, 2017.

[37] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.

[38] Alex Nichol and John Schulman. Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.

[39] Amir Erfan Eshratifar, David Eigen, and Massoud Pedram. Gradient agreement as an optimization objective for meta-learning. arXiv preprint arXiv:1810.08178, 2018.

[40] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

[41] Qi Dong, Shaogang Gong, and Xiatian Zhu. Class rectification hard mining for imbalanced deep learning. In ICCV, 2017.

[42] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In ICML, 2004.

[43] Charles Elkan. The foundations of cost-sensitive learning. In IJCAI, 2001.

[44] Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A. Sohel, and Roberto Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 2018.

[45] Justin M. Johnson and Taghi M. Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 2019.

[46] Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. Active bias: Training more accurate neural networks by emphasizing high variance samples. In NeurIPS, 2017.

[47] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[48] Jun Shu, Zongben Xu, and Deyu Meng.
Small sample learning in big data era. arXiv preprint arXiv:1808.04572, 2018.

[49] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.

[50] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.

[51] Mostafa Dehghani, Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf. Fidelity-weighted learning. In ICLR, 2018.

[52] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. In ICLR, 2018.

[53] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In NeurIPS, 2017.

[54] Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, 2018.

[55] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In CVPR, 2016.

[56] Xiao Zhang, Zhiyuan Fang, Yandong Wen, Zhifeng Li, and Yu Qiao. Range loss for deep face recognition with long-tailed training data. In ICCV, 2017.

[57] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge J. Belongie. Learning from noisy large-scale datasets with minimal supervision. In CVPR, 2017.

[58] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. In ICLR Workshop, 2015.

[59] Xingjun Ma, Yisen Wang, Michael E. Houle, Shuo Zhou, Sarah M. Erfani, Shu-Tao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In ICML, 2018.

[60] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, 2019.

[61] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In CVPR, 2016.

[62] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[63] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.

[64] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In CVPR, 2015.

[65] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.

[66] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, 2018.

[67] Jiangchao Yao, Hao Wu, Ya Zhang, Ivor W. Tsang, and Jun Sun. Safeguarded dynamic label regression for noisy supervision. In AAAI, 2019.

[68] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S. Kankanhalli. Learning to learn from noisy labeled data. In CVPR, 2019.

[69] Julien Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In NeurIPS, 2013.