{"title": "Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples", "book": "Advances in Neural Information Processing Systems", "page_first": 1002, "page_last": 1012, "abstract": "Self-paced learning and hard example mining re-weight training instances to improve learning accuracy. This paper presents two improved alternatives based on lightweight estimates of sample uncertainty in stochastic gradient descent (SGD): the variance in predicted probability of the correct class across iterations of mini-batch SGD, and the proximity of the correct class probability to the decision threshold. Extensive experimental results on six datasets show that our methods reliably improve accuracy in various network architectures, including additional gains on top of other popular training techniques, such as residual learning, momentum, ADAM, batch normalization, dropout, and distillation.", "full_text": "Active Bias: Training More Accurate Neural\n\nNetworks by Emphasizing High Variance Samples\n\nHaw-Shiuan Chang, Erik Learned-Miller, Andrew McCallum\n\nUniversity of Massachusetts, Amherst\n140 Governors Dr., Amherst, MA 01003\n\n{hschang,elm,mccallum}@cs.umass.edu\n\nAbstract\n\nSelf-paced learning and hard example mining re-weight training instances to im-\nprove learning accuracy. This paper presents two improved alternatives based on\nlightweight estimates of sample uncertainty in stochastic gradient descent (SGD):\nthe variance in predicted probability of the correct class across iterations of mini-\nbatch SGD, and the proximity of the correct class probability to the decision\nthreshold. 
Extensive experimental results on six datasets show that our methods re-\nliably improve accuracy in various network architectures, including additional gains\non top of other popular training techniques, such as residual learning, momentum,\nADAM, batch normalization, dropout, and distillation.\n\n1\n\nIntroduction\n\nLearning easier material before harder material is often bene\ufb01cial to human learning. Inspired by\nthis observation, curriculum learning [5] has shown that learning from easier instances \ufb01rst can also\nimprove neural network training. When it is not known a priori which samples are easy, examples\nwith lower loss on the current model can be inferred to be easier and can be used in early training.\nThis strategy has been referred to as self-paced learning [25]. By decreasing the weight of dif\ufb01cult\nexamples in the loss function, the model may become more robust to outliers [33], and this method\nhas proven useful in several applications, especially with noisy labels [36].\nNevertheless, selecting easier examples for training often slows down the training process because\neasier samples usually contribute smaller gradients, and the current model has already learned how to\nmake correct predictions on these samples. On the other hand, and somewhat ironically, the opposite\nstrategy (i.e., sampling harder instances more often) has been shown to accelerate (mini-batch)\nstochastic gradient descent (SGD) in some cases, where the dif\ufb01culty of an example can be de\ufb01ned by\nits loss [18, 29, 44] or be proportional to the magnitude of its gradient [51, 1, 12, 13]. This strategy is\nsometimes referred to as hard example mining [44].\nIn the literature, we can see that these two opposing strategies work well in different situations.\nPreferring easier examples may be effective when either machines or humans try to solve a challenging\ntask containing more label noise or outliers. 
On the other hand, focusing on harder samples may\naccelerate and stabilize SGD in cleaner data by minimizing the variance of gradients [1, 12]. However,\nwe often do not know how noisy our training dataset is. Motivated by this practical need, this paper\nexplores new methods of re-weighting training examples that are effective in both scenarios.\nIntuitively, if a model has already predicted some examples correctly with high con\ufb01dence, those\nsamples may be too easy to contain useful information for improving that model further. Similarly,\nif some examples are always predicted incorrectly over many iterations of training, these examples\nmay just be too dif\ufb01cult/noisy and may degrade the model. This suggests that we should somehow\nprefer uncertain samples that are predicted incorrectly sometimes during training and correctly at\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: The proposed methods emphasize uncertain samples based on previous prediction history.\n\nother times, as illustrated in Figure 1. This preference is consistent with common variance reduction\nstrategies in active learning [43].\nPrevious studies suggest that \ufb01nding informative unlabeled samples to label is related to selecting\nalready-labeled samples to optimize the model parameters [14]. As reported in the previous stud-\nies [42, 6], models can sometimes achieve lower generalization error after being trained with only\na subset of actively selected training data. In other words, focusing on informative samples can be\nbene\ufb01cial even when all labels are available.\nWe propose two lightweight methods that actively emphasize uncertain samples to improve mini-\nbatch SGD for classi\ufb01cation. One method measures the variance of prediction probabilities, while the\nother one estimates the closeness between the prediction probabilities and the decision threshold. 
For logistic regression, both methods can be proven to reduce the uncertainty in the model parameters under reasonable approximations.
We present extensive experiments on CIFAR 10, CIFAR 100, MNIST (image classification), Question Type (sentence classification), CoNLL 2003, and OntoNote 5.0 (Named Entity Recognition), as well as on different architectures, including multi-class logistic regression, fully-connected networks, convolutional neural networks (CNNs) [26], and residual networks [16]. The results show that active bias makes neural networks more robust without prior knowledge of noise, and reduces the generalization error by 1%–18% even on training sets having few (if any) annotation errors.

2 Related work

As (deep) neural networks become more widespread, many methods have recently been proposed to improve SGD training. When using (mini-batch) SGD, the randomness of the gradient sometimes slows down the optimization, so one common approach is to use the gradient computed in previous iterations to stabilize the process. Examples include momentum [38], stochastic variance reduced gradient (SVRG) [21], and proximal stochastic variance reduced gradient (Prox-SVRG) [49]. Other work proposes variants of semi-stochastic algorithms that approximate the exact gradient direction and reduce the gradient variance [47, 34]. More recently, supervised optimization methods such as learning to learn [3] have also shown great potential on this problem.
In addition to the high variance of the gradient, another issue with SGD is the difficulty of tuning the learning rate.
Like quasi-Newton methods, several methods adaptively adjust learning rates based on local curvature [2, 40], while ADAGRAD [11] applies different learning rates to different dimensions. ADAM [23] combines several of these techniques and is widely used in practice.
More recently, some studies accelerate SGD by weighting each class differently [13] or weighting each sample differently as we do [18, 51, 29, 12, 1, 44], and their experiments suggest that these methods are often compatible with other techniques such as Prox-SVRG, ADAGRAD, or ADAM [29, 13]. Notice that Gao et al. [12] discuss the idea of selecting uncertain examples for SGD based on active learning, but their proposed methods choose each sample according to the magnitude of its gradient, as in ISSGD [1], which actually prefers more difficult examples.
The aforementioned methods focus on accelerating the optimization of a fixed loss function given a fixed model. Many of these methods adopt importance sampling. That is, if the method prefers to select harder examples, the learning rate corresponding to those examples will be lower. This makes gradient estimation unbiased [18, 51, 1, 12, 13], which guarantees convergence [51, 13].
On the other hand, to make models more robust to outliers, some approaches inject bias into the loss function in order to emphasize easier examples [37, 48, 27, 35]. Some variants of the strategy gradually increase the loss of hard examples [32], as in self-paced learning [25]. To alleviate the local minimum problem during training, other techniques that smooth the loss function have been proposed recently [8, 15].
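The unbiasedness of the importance-sampling scheme mentioned above is easy to check numerically: if sample i is drawn with probability P_s(i) and its contribution is scaled by 1/(|D| · P_s(i)), the expectation matches the uniform average over the dataset. A minimal sketch with stand-in losses (illustrative only, not the implementation from any cited work):

```python
import numpy as np

rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 2.0, size=1000)      # stand-in per-sample losses
p = (losses + 0.1) / np.sum(losses + 0.1)      # prefer harder (higher-loss) samples

# Importance-sampled estimate: draw indices by p, scale each term by 1/(N * p_i).
idx = rng.choice(len(losses), size=200_000, p=p)
est = np.mean(losses[idx] / (len(losses) * p[idx]))

uniform_mean = losses.mean()
# est converges to uniform_mean: the re-weighting keeps the estimate unbiased
# even though hard examples are drawn far more often.
```

The same identity is what lets the acceleration methods cited here keep their convergence guarantees while biasing the sampling distribution.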
Nevertheless, to our knowledge, it remains an unsolved challenge to balance the easy and difficult training examples so as to facilitate training while remaining robust to outliers.

3 Methods

In this section, we first discuss the baseline methods against which we shall compare and introduce some notation that we will use later on. We then present our two active bias methods, based on prediction variance and closeness to the decision threshold.

3.1 Baselines

Due to its simplicity and generally good performance, the most widely used version of SGD samples each training instance uniformly. This basic strategy has two variants. The first samples with replacement. Let D = {(x_i, y_i)}_i denote the training dataset. The probability of selecting each sample is equal (i.e., P_s(i|D) = 1/|D|), so we call it SGD Uniform (SGD-Uni). The second samples without replacement. Let S_e be the set of samples we have already used in the current epoch. Then, the sampling probability P_s(i|S_e, D) becomes (1/(|D| - |S_e|)) · 1_{i ∉ S_e}, where 1 is an indicator function. This version scans through all of the examples in each epoch, so we call it SGD-Scan.
We propose a simple baseline which selects harder examples with higher probability, as done by Loshchilov and Hutter [29]. Specifically, we let P_s(i|H, S_e, D) ∝ 1 - p̄_{H_i^{t-1}}(y_i|x_i) + ε_D, where H_i^{t-1} is the history of prediction probabilities, which stores every p(y_i|x_i) computed when x_i was selected to train the network before the current iteration t, H = ∪_i H_i^{t-1}, p̄_{H_i^{t-1}}(y_i|x_i) is the average probability of classifying sample i into its correct class y_i over all the stored p(y_i|x_i) in H_i^{t-1}, and ε_D is a smoothness constant. Notice that by only considering the p(y_i|x_i) already stored in H_i^{t-1}, we do not need to perform extra forward passes.
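This difficulty-based sampling distribution can be sketched directly from the stored histories (toy histories and our own function names, not the paper's released code):

```python
import numpy as np

def sgd_sd_probs(histories, eps_d=0.01):
    """P_s(i) proportional to 1 - mean(history_i) + eps_d, where history_i holds
    the correct-class probabilities p(y_i|x_i) recorded each time sample i
    was used for training. No extra forward passes are needed."""
    p_bar = np.array([np.mean(h) if h else 0.5 for h in histories])
    scores = 1.0 - p_bar + eps_d
    return scores / scores.sum()

# Three samples: consistently easy, uncertain, and consistently hard.
histories = [[0.9, 0.95, 0.92], [0.4, 0.7, 0.5], [0.1, 0.15, 0.1]]
probs = sgd_sd_probs(histories)
# The hardest sample receives the largest sampling probability.
```

A mini-batch can then be drawn with, e.g., `np.random.default_rng(0).choice(len(probs), size=2, p=probs)`; the smoothness constant eps_d keeps every sample reachable.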
We refer to this simple baseline as SGD Sampled by Difficulty (SGD-SD).
In practice, SGD-Scan often works better than SGD-Uni because it ensures that the model sees all of the training examples in each epoch. To emphasize difficult examples while applying SGD-Scan, we weight each sample differently in the loss function. That is, the loss function is modified as L = Σ_i v_i · loss_i(W) + λR(W), where W are the parameters of the model, loss_i(W) is the prediction loss, and λR(W) is the regularization term of the model. The weight of the ith sample can be set as v_i = (1/N_D)(1 - p̄_{H_i^{t-1}}(y_i|x_i) + ε_D), where N_D is a normalization constant making the average of the v_i equal to 1. We want to keep the average of the v_i fixed so that we do not change the global learning rate. We denote this method SGD Weighted by Difficulty (SGD-WD).
Models usually cannot fit outliers well, so SGD-SD and SGD-WD would not be robust to noise. To make a model unbiased, importance sampling can be used. That is, we can let P_s(i|H, S_e, D) ∝ 1 - p̄_{H_i^{t-1}}(y_i|x_i) + ε_D and v_i = N_D · (1 - p̄_{H_i^{t-1}}(y_i|x_i) + ε_D)^{-1}, which is similar to an approach used by Hinton [18]. We refer to this as SGD Importance-Sampled by Difficulty (SGD-ISD).
In addition, we propose two simple baselines that emphasize easy examples, as in self-paced learning. Based on the same naming convention, SGD Sampled by Easiness (SGD-SE) denotes that P_s(i|H, S_e, D) ∝ p̄_{H_i^{t-1}}(y_i|x_i) + ε_E, while SGD Weighted by Easiness (SGD-WE) sets v_i = (1/N_E)(p̄_{H_i^{t-1}}(y_i|x_i) + ε_E), where N_E normalizes the v_i's to have unit mean.

3.2 Prediction Variance

In the active learning setting, the prediction variance can be used to measure the uncertainty of each sample for either a regression or classification problem [41].
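Before turning to variance, the weighting baselines above can be sketched as follows (illustrative code under our own names, not the paper's implementation); note the normalization to unit mean, which leaves the global learning rate unchanged:

```python
import numpy as np

def difficulty_weights(p_bar, eps=0.01):
    """SGD-WD: v_i proportional to 1 - p_bar_i + eps, normalized to mean 1."""
    v = 1.0 - p_bar + eps
    return v / v.mean()

def easiness_weights(p_bar, eps=0.01):
    """SGD-WE: v_i proportional to p_bar_i + eps, normalized to mean 1."""
    v = p_bar + eps
    return v / v.mean()

p_bar = np.array([0.9, 0.5, 0.2])   # average correct-class probabilities
v_wd = difficulty_weights(p_bar)    # largest weight on the hardest sample
v_we = easiness_weights(p_bar)      # largest weight on the easiest sample
```

The per-sample losses are then multiplied by v_i inside an otherwise ordinary SGD-Scan epoch.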
In order to gain more information at each SGD iteration, we choose samples with high prediction variances.
Since the prediction variances are estimated on the fly, we would like to balance exploration and exploitation. Adopting the optimism-in-the-face-of-uncertainty heuristic from bandit problems [7], we draw the next sample based on the estimated prediction variance plus its confidence interval. Specifically, for SGD Sampled by Prediction Variance (SGD-SPV), we let var_i denote the prediction variance estimated from the history H_i^{t-1}, and set

P_s(i|H, S_e, D) ∝ std_i^conf(H) + ε_V, where std_i^conf(H) = sqrt( var_i + sqrt( 2 · var_i^2 / (|H_i^{t-1}| - 1) ) ),   (1)

and |H_i^{t-1}| is the number of stored prediction probabilities. Assuming p_{H_i^{t-1}}(y_i|x_i) is normally distributed under the uncertainty of the model parameters w, the variance of the prediction variance estimate can be estimated by 2 · var_i^2 · (|H_i^{t-1}| - 1)^{-1}. As we did in the baselines, adding the smoothness constant ε_V prevents the low variance instances from never being selected again. Similarly, another variant of the method sets v_i = (1/N_V)(std_i^conf(H) + ε_V), where N_V normalizes the v_i like the other weighted methods; we call this SGD Weighted by Prediction Variance (SGD-WPV).
As in SGD-WD, SGD-WE, or self-paced learning [4], we train an unbiased model for several burn-in epochs at the beginning so as to judge the sampling uncertainty reasonably and stably.
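The estimator in Eq. (1) can be transcribed almost directly; the sketch below uses the unbiased sample variance and toy histories (function and variable names are ours):

```python
import numpy as np

def std_conf(history, eps_v=0.01):
    """Eq. (1): sqrt of the estimated prediction variance plus the
    confidence term sqrt(2 * var^2 / (n - 1)) on that estimate,
    then the smoothness constant eps_v."""
    h = np.asarray(history, dtype=float)
    var = h.var(ddof=1)                        # estimated prediction variance
    var_of_var = 2.0 * var ** 2 / (len(h) - 1) # uncertainty of that estimate
    return np.sqrt(var + np.sqrt(var_of_var)) + eps_v

# A sample with oscillating predictions scores much higher than a
# consistently easy one, so SGD-SPV draws it more often.
uncertain = std_conf([0.3, 0.8, 0.4, 0.9])
easy = std_conf([0.95, 0.96, 0.94, 0.95])
```

Normalizing these scores over the dataset gives the SGD-SPV sampling distribution; dividing by their mean gives the SGD-WPV weights.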
Other implementation details will be described in the first section of the supplementary material.
Using a low learning rate, the model parameters w will be close to a good local minimum after sufficient burn-in epochs, and thus the posterior distribution of w can be locally approximated by a Gaussian distribution. Furthermore, the prediction distribution p(y_i|x_i, w) is often locally smooth with respect to the model parameters w (i.e., small changes of the model parameters only induce small changes in the prediction distribution), so a Gaussian tends to approximate the distribution of p_{H_i^{t-1}}(y_i|x_i) well in practice.

Example: logistic regression

Given a Gaussian prior Pr(W = w) = N(w|0, s_0 I) on the parameters, consider the probabilistic interpretation of logistic regression:

-log(Pr(Y, W = w|X)) = -Σ_i log(p(y_i|x_i, w)) + (c/s_0) ||w||^2,   (2)

where p(y_i|x_i, w) = 1/(1 + exp(-y_i w^T x_i)) and y_i ∈ {1, -1}.
Since the posterior distribution of W is log-concave [39], we can use Pr(W = w|Y, X) ≈ N(w|w_N, S_N), where w_N is the maximum a posteriori (MAP) estimate, and

S_N^{-1} = ∇_w ∇_w -log(Pr(Y, W|X)) = Σ_i p(y_i|x_i)(1 - p(y_i|x_i)) x_i x_i^T + (2c/s_0) I.   (3)

Then, we further approximate p(y_i|x_i, W) using the first-order Taylor expansion p(y_i|x_i, W) ≈ p(y_i|x_i, w) + g_i(w)^T (W - w), where g_i(w) = p(y_i|x_i, w)(1 - p(y_i|x_i, w)) x_i. We can compute the prediction variance [41] with respect to the uncertainty of W:

Var(p(y_i|x_i, W)) ≈ g_i(w)^T S_N g_i(w).   (4)

These approximations tell us several things. First, Var(p(y_i|x_i, W)) is proportional to p(y_i|x_i, w)^2 (1 - p(y_i|x_i, w))^2, so the prediction variance is larger when the sample i is closer to the boundary. Second, when we have more sample points close to the boundary, the variance of the parameters S_N is lower.
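Equations (2)-(4) can be checked numerically. The sketch below (toy data and our own variable names; plain gradient descent stands in for the MAP fit) builds the Laplace approximation S_N at a fitted w and the per-sample variances g_i(w)^T S_N g_i(w):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0])
y = np.where(rng.uniform(size=n) < 1 / (1 + np.exp(-X @ w_true)), 1, -1)

c_over_s0 = 0.01  # Gaussian-prior (L2) strength, c/s_0 in Eq. (2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit an approximate MAP estimate w_N by gradient descent on Eq. (2).
w = np.zeros(d)
for _ in range(2000):
    p = sigmoid(y * (X @ w))
    grad = -((1 - p) * y) @ X + 2 * c_over_s0 * w
    w -= 0.1 * grad / n

# Eq. (3): S_N^{-1} = sum_i p_i (1 - p_i) x_i x_i^T + (2c/s_0) I.
p = sigmoid(y * (X @ w))
S_inv = (X * (p * (1 - p))[:, None]).T @ X + 2 * c_over_s0 * np.eye(d)
S_N = np.linalg.inv(S_inv)

# Eq. (4): Var(p(y_i|x_i, W)) ~= g_i^T S_N g_i with g_i = p_i (1 - p_i) x_i,
# so samples near the decision boundary get the largest prediction variance.
G = X * (p * (1 - p))[:, None]
pred_var = np.einsum('ij,jk,ik->i', G, S_N, G)
```

The p_i(1 - p_i) factor in g_i is exactly why the variance peaks near the boundary, matching the first observation above.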
That is, when we emphasize samples with high prediction variances, the uncertainty of the parameters tends to be reduced, akin to the variance reduction strategy in active learning [30]. Third, with a Gaussian assumption on the posterior distribution Pr(W = w|Y, X) and the Taylor expansion, the distribution of p(y_i|x_i, W) in logistic regression becomes Gaussian, which justifies our previous assumption on p_{H_i^{t-1}}(y_i|x_i) for the confidence estimation of the prediction variance. Notice that there are other methods that can measure the prediction uncertainty, such as the mutual information between labels and parameters [19], but we found that the prediction variance works better in our experiments.

Figure 2: A toy example which compares different methods in a two-class logistic regression model. Panels: (a) sampling distribution, (b) training samples, (c) SGD-Scan parameter space, (d) SGD-Scan boundaries, (e) SGD-WD parameter space, (f) SGD-WD sample weights and boundaries, (g) SGD-WPV parameter space, (h) SGD-WPV sample weights and boundaries. To visualize the optimization path for the classifier parameters (the red paths in (c), (e), and (g)) in two dimensions, we fix the weight corresponding to the x-axis to 0.5 and only show the weight for the y-axis w[1] and the bias term b. The ith sample size in (f) and (h) is proportional to v_i. The toy example shows that SGD-WPV can train a more accurate model on a noisy dataset.

Figure 2 illustrates a toy example. Given the same learning rate, we can see that normal SGD in Figures 2c and 2d has higher uncertainty when there are many outliers, and emphasizing difficult examples in Figures 2e and 2f makes it worse.
On the other hand, the samples near the boundaries have higher prediction variances (i.e., larger circles or crosses in Figure 2h) and thus a higher impact on the loss function in SGD-WPV.
After the burn-in epochs, w becomes close to a local minimum using SGD. Then, the parameters estimated in each iteration can be viewed, approximately, as samples drawn from the posterior distribution of the parameters Pr(W = w|Y, X) [31]. Therefore, after running SGD long enough, the prediction variance estimated from the history H_i^{t-1} can be used to approximate Var(p(y_i|x_i, W)). Notice that if we directly apply the bias at the beginning, without running burn-in epochs, incorrect examples might be emphasized, which is also known as the local minimum problem in active learning [14]. For instance, in Figure 2, if burn-in epochs are not applied and the initial w is a vertical line on the left, the outliers close to the initial boundary would be emphasized, which slows down the convergence speed.
In this simple example, we can also see that the gradient magnitude is proportional to the difficulty because ∇_w log(p(y_i|x_i, w)) = (1 - p(y_i|x_i, w)) y_i x_i. This is why we believe the SGD acceleration methods based on gradient magnitude [1, 13] can be categorized as variants of preferring difficult examples, and are thus more vulnerable to outliers (like the samples on the left or right in Figure 2).

3.3 Threshold Closeness

Motivated by the previous analysis, we propose a simpler and more direct approach that selects samples whose correct class probability is close to the decision threshold. SGD Sampled by Threshold Closeness (SGD-STC) makes P_s(i|H, S_e, D) ∝ p̄_{H_i^{t-1}}(y_i|x_i)(1 - p̄_{H_i^{t-1}}(y_i|x_i)) + ε_T, where p̄_{H_i^{t-1}}(y_i|x_i) is the average probability of classifying sample i into its correct class y_i over all the stored p(y_i|x_i) in H_i^{t-1}.
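This threshold-closeness score is easy to sketch (toy averaged probabilities; the ε_T value is arbitrary). The weighted variant discussed below normalizes the same score to unit mean:

```python
import numpy as np

def stc_scores(p_bar, eps_t=0.01):
    """Threshold-closeness score p_bar_i * (1 - p_bar_i) + eps_t: maximal at
    the decision threshold p_bar = 0.5, small for samples that are either
    consistently correct (easy) or consistently wrong (hard/noisy)."""
    return p_bar * (1.0 - p_bar) + eps_t

p_bar = np.array([0.95, 0.5, 0.05])   # easy, threshold-close, hard/noisy
scores = stc_scores(p_bar)
probs = scores / scores.sum()         # SGD-STC sampling distribution
weights = scores / scores.mean()      # weighted variant, unit-mean weights
```

Unlike the pure difficulty score 1 - p̄, this score damps both extremes, which is what makes it robust to outliers while still avoiding samples the model has already mastered.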
When there are multiple classes, this measures the closeness to the threshold for distinguishing the correct class from the union of the rest of the classes (i.e., one-versus-rest). The method is similar to an approximation of the optimal allocation in stratified sampling proposed by Druck and McCallum [10].
Similarly, SGD Weighted by Threshold Closeness (SGD-WTC) chooses the weight of the ith sample as v_i = (1/N_T)(p̄_{H_i^{t-1}}(y_i|x_i)(1 - p̄_{H_i^{t-1}}(y_i|x_i)) + ε_T), where N_T = (1/|D|) Σ_j (p̄_{H_j^{t-1}}(y_j|x_j)(1 - p̄_{H_j^{t-1}}(y_j|x_j)) + ε_T). The weighting can be viewed as combining SGD-WD and SGD-WE by multiplying their weights together. Although other uncertainty estimates such as entropy are widely used in active learning and can also be viewed as measures of boundary closeness, we found that the proposed formula works better in our experiments.
When using logistic regression, after injecting the bias v_i into the loss function, approximating the prediction probability based on the previous history, and removing the regularization and smoothness constants (i.e., p(y_i|x_i, w) ≈ p̄_{H_i^{t-1}}(y_i|x_i), 1/s_0 = 0, and ε_T = 0), we can show that

Σ_i Var(p(y_i|x_i, W)) ≈ Σ_i g_i(w)^T S_N g_i(w) ≈ N_T · dim(w),   (5)

where dim(w) is the dimension of the parameters w. This ensures that the average prediction variance drops linearly as the number of training instances increases.

Table 1: Model architectures. Dropouts and L2 reg (regularization) are only applied to the fully-connected (FC) layer(s).

Dataset | # Conv layers | Filter size | Filter number | # Pooling layers | # FC layers | # BN layers | Dropout keep probs | L2 reg
MNIST | 2 | 5x5 | 32, 64 | 2 | 2 | 0 | 0.5 | 0.0005
CIFAR 10 | 0 | N/A | N/A | 0 | 1 | 0 | 1 | 0.01
CIFAR 100 | 26 or 62 | 3x3 | 16, 32, 64 | 0 | 1 | 13 or 31 | 1 | 0
Question Type | 1 | (2,3,4)x1 | 64 | 1 | 1 | 0 | 0.5 | 0.01
CoNLL 2003 | 3 | 3x1 | 100 | 0 | 1 | 0 | 0.5, 0.75 | 0.001
OntoNote 5.0 | 3 | 3x1 | 100 | 0 | 1 | 0 | 0.5, 0.75 | 0.001
MNIST | 0 | N/A | N/A | 0 | 2 | 0 | 1 | 0

Table 2: Optimization hyper-parameters and experiment settings

Dataset | Optimizer | Batch size | Learning rate | Learning rate decay | # Epochs | # Burn-in epochs | # Trials
MNIST | Momentum | 64 | 0.01 | 0.95 | 80 | 2 | 20
CIFAR 10 | SGD | 100 | 1e-6 | 0.5 (per 5 epochs) | 30 | 10 | 30
CIFAR 100 | Momentum | 128 | 0.1 | 0.1 (at 80, 100, 120 epochs) | 150 | 90 or 50 | 20
Question Type | ADAM | 64 | 0.001 | 1 | 250 | 50 | 100
CoNLL 2003 | ADAM | 128 | 0.0005 | 1 | 200 | 30 | 10
OntoNote 5.0 | ADAM | 128 | 0.0005 | 1 | 60 | 20 | 10
MNIST | SGD | | 0.1 | 1 | | |
The derivation can be found in the supplementary material.

4 Experiments

We test our methods on six different datasets. The results show that the active bias techniques consistently outperform standard uniform sampling (i.e., SGD-Uni and SGD-Scan) in deep models as well as shallow models. For each dataset, we use an existing, publicly available implementation for the problem and emphasize samples using the different methods. The architectures and hyper-parameters are summarized in Table 1. All neural networks use softmax and cross-entropy loss at the last layer. The optimization and experiment setups are listed in Table 2. As shown in the second column of Table 2, SGD in CNNs and residual networks actually refers to momentum or ADAM rather than vanilla SGD. All experiments use mini-batches.
Like most widely used neural network training techniques, the proposed techniques are not applicable to every scenario. For all the datasets we tried, we found that the proposed methods are not sensitive to the hyper-parameter setup except when applying a very complicated model to a relatively small dataset. If a complicated model achieves 100% training accuracy within a few epochs, the most uncertain examples would often be outliers, biasing the model towards overfitting.
To avoid this scenario, we modify the default hyper-parameter setup in the implementation of the text classifiers in Section 4.3 and Section 4.4 to achieve similar performance using simplified models. For all other models and datasets, we use the default hyper-parameters of the existing implementations, which should favor the SGD-Uni or SGD-Scan methods, since the default hyper-parameters are

Table 3: The average of the best testing error rates for different sampling methods and datasets (%). The confidence intervals are standard errors.
LR means logistic regression.

Datasets | Model | SGD-Uni | SGD-SD | SGD-ISD | SGD-SE | SGD-SPV | SGD-STC
MNIST | CNN | 0.55±0.01 | 0.52±0.01 | 0.57±0.01 | 0.54±0.01 | 0.51±0.01 | 0.51±0.01
Noisy MNIST | CNN | 0.83±0.01 | 1.00±0.01 | 0.84±0.01 | 0.69±0.01 | 0.64±0.01 | 0.63±0.01
CIFAR 10 | LR | 62.49±0.06 | 63.14±0.06 | 62.48±0.07 | 60.87±0.06 | 60.66±0.06 | 61.00±0.06
QT | CNN | 17.70±0.07 | 17.61±0.07 | 17.66±0.08 | 17.92±0.08 | 17.49±0.08 | 17.55±0.08

Table 4: The average of the best testing error rates and their standard errors for different weighting methods (%). For CoNLL 2003 and OntoNote 5.0, the values are 1-(F1 score). CNN, LR, RN 27, RN 63, and FC mean convolutional neural network, logistic regression, residual network with 27 layers, residual network with 63 layers, and fully-connected network, respectively.

Datasets | Model | SGD-Scan | SGD-WD | SGD-WE | SGD-WPV | SGD-WTC
MNIST | CNN | 0.54±0.01 | 0.48±0.01 | 0.56±0.01 | 0.48±0.01 | 0.48±0.01
Noisy MNIST | CNN | 0.81±0.01 | 0.92±0.01 | 0.72±0.01 | 0.61±0.02 | 0.63±0.01
CIFAR 10 | LR | 62.48±0.06 | 63.10±0.06 | 60.88±0.06 | 61.02±0.06 | 60.61±0.06
CIFAR 100 | RN 27 | 34.04±0.06 | 34.55±0.06 | 33.65±0.07 | 33.64±0.07 | 33.69±0.07
CIFAR 100 | RN 63 | 30.70±0.06 | 31.57±0.09 | 29.92±0.09 | 30.02±0.08 | 30.16±0.09
QT | CNN | 17.79±0.08 | 17.70±0.08 | 17.87±0.08 | 17.57±0.07 | 17.61±0.08
CoNLL 2003 | CNN | 11.62±0.04 | 11.50±0.05 | 11.73±0.04 | 11.24±0.06 | 11.18±0.03
OntoNote 5.0 | CNN | 17.80±0.05 | 17.65±0.06 | 18.40±0.05 | 17.51±0.05 | 17.82±0.03
MNIST | FC | 2.85±0.03 | 2.17±0.01 | 3.08±0.03 | 2.68±0.02 | 2.34±0.03
MNIST (distill) | FC | 2.27±0.01 | 2.13±0.02 | 2.35±0.01 | 2.18±0.02 | 2.07±0.02

optimized for these cases. To show the reliability of the proposed methods, we do not optimize the hyper-parameters for the proposed methods or the baselines.
Due to the randomness in all the SGD variants, we repeat the experiments and list the number of trials in Table 2. At the beginning of each trial, the network weights are trained with uniform-sampling SGD until the validation performance starts to saturate. After these burn-in epochs, we apply the different sampling/weighting methods and compare their performance. The number of burn-in epochs is determined by cross-validation, and the number of epochs in each trial is set large enough to let the testing error of most methods converge. In Tables 3 and 4, we evaluate the testing performance of each method after each epoch and report the best testing performance among epochs within each trial.
As previously discussed, there are various versions of preferring easy or difficult examples. Some of them require extra time to collect necessary statistics such as the gradient magnitude of each sample [12, 1], change the network architecture [15, 44], or involve an annealing schedule, as in self-paced learning [25, 32]. We tried self-paced learning on CIFAR 10 but found that performance usually remains the same and is sometimes sensitive to the hyper-parameters of the annealing schedule. This finding is consistent with the results from [4].
To simplify the comparison, we focus on testing the effects of a steady bias based on sample difficulty (e.g., comparing with SGD-SE and SGD-SD) and do not gradually change the preference during training as self-paced learning does.
It is not always easy to change the sampling procedure because of model or implementation constraints. For example, in the sequence labeling tasks (CoNLL 2003 and OntoNote 5.0), the words in the same sentence need to be trained together. Thus, we only compare the methods which modify the loss function (SGD-W*) with SGD-Scan for some models. For the other experiments, re-weighting examples (SGD-W*) generally gives us better performance than changing the sampling distribution (SGD-S*). This might be because we can better estimate the statistics of each sample.

4.1 MNIST

We apply our method to a CNN [26] for MNIST^1 using one of the TensorFlow tutorials.^2 The dataset has high testing accuracy, so most of the examples are too easy for the model after a few epochs. Selecting more difficult instances can accelerate learning or improve testing accuracy [18, 29, 13]. The results from SGD-SD and SGD-WD confirm this finding, while selecting uncertain examples can give us a similar or larger boost. Furthermore, we test the robustness of our methods by randomly

^1 http://yann.lecun.com/exdb/mnist/
^2 https://github.com/tensorflow/models/blob/master/tutorials/image/mnist
SGD-SPV and\nSGD-SE perform signi\ufb01cantly better than SGD-Uni here, consistent with the idea that avoiding\ndif\ufb01cult examples increases robustness to outliers.\nFor CIFAR 100 [24], we demonstrate that the proposed approaches can also work in very deep\nresidual networks [16].5 To show the method is not sensitive to the network depth and the number of\nburn-in epochs, we present results from the network with 27 layers and 90 burn-in epochs as well\nas the network with 63 layers and 50 burn-in epochs. Without changing architectures, emphasizing\nuncertain or easy examples gains around 0.5% in both settings, which is signi\ufb01cant considering the\nfact that the much deeper network shows only 3% improvement here.\nWhen training a neural network, gradually reducing the learning rate (i.e., the magnitude of gradients)\nusually improves performance. When dif\ufb01cult examples are sampled less, the magnitude of gradients\nwould be reduced. Thus, some of the improvement of SGD-SPV and SGD-SE might come from\nusing a lower effective learning rate. Nevertheless, since we apply the aggressive learning rate decay\nin the experiments of CIFAR 10 and CIFAR 100, we know that the improvements from SGD-SPV\nand SGD-SE cannot be entirely explained by its lower effective learning rate.\n\n4.3 Question Type\n\nTo investigate whether our methods are effective for smaller text datasets, we apply them to a sentence\nclassi\ufb01cation dataset (i.e. 
Question Type (QT) [28]), which contains 1000 training examples and 500 testing examples.6 We use the CNN architecture proposed by Kim [22].7 Like many other NLP tasks, the dataset is relatively small, and this CNN classifier does not inject noise into its inputs the way the residual network implementation for CIFAR 100 does, so this complicated model reaches 100% training accuracy within a few epochs.

To address this, we reduce the model complexity by (i) decreasing the number of filters from 128 to 64, (ii) decreasing the convolutional filter widths from 3,4,5 to 2,3,4, (iii) adding L2 regularization with scale 0.01, and (iv) performing PCA to reduce the dimension of the pre-trained word embeddings from 300 to 50 and fixing the word embeddings during training. With this smaller model, the proposed active bias methods perform better than the other baselines.

4.4 Sequence Tagging Tasks

We also test our methods on Named Entity Recognition (NER) using the CoNLL 2003 [46] and OntoNotes 5.0 [20] datasets and the CNN from Strubell et al. [45].8 As with Question Type, the model is too complex for our approaches, so we (i) use only 3 layers instead of 4, (ii) reduce the number of filters from 300 to 100, (iii) add L2 regularization with scale 0.001, and (iv) make the 50-dimensional word embeddings from Collobert et al. [9] non-trainable. The micro F1 of this smaller model drops only around 1%-2% from the original big model.
Table 4 shows that our methods achieve the lowest error rate (1-F1) on both benchmarks.

4.5 Distillation

Although state-of-the-art neural networks in many applications memorize examples easily [50], much simpler models can usually achieve similar performance, as in the previous two experiments. In practice, such models are often preferable due to their low computation and memory requirements.

3https://cs231n.github.io/assignments2016/assignment2/
4https://www.cs.toronto.edu/~kriz/cifar.html
5https://github.com/tensorflow/models/tree/master/resnet
6http://cogcomp.org/Data/QA/QC/
7https://github.com/dennybritz/cnn-text-classification-tf
8https://github.com/iesl/dilated-cnn-ner

We have shown that the proposed methods can improve these smaller models, as distillation does [17], so it is natural to check whether our methods work well together with distillation. We use an implementation9 that distills a shallow CNN with 3 convolution layers into a 2-layer fully-connected network on MNIST. The teacher network achieves 0.8% testing error, and the softmax temperature is set to 1.

Our approaches and the baselines simply apply the sample-dependent weights vi to the final loss function (i.e., the cross-entropy of the true labels plus the cross-entropy with respect to the prediction probability from the teacher network). On MNIST, SGD-WTC and SGD-WD achieve similar or better improvements compared with adding distillation to SGD-Scan. Furthermore, the best performance comes from distillation plus SGD-WTC, which shows that active bias is compatible with distillation on this dataset.

5 Conclusion

Deep learning researchers often gain accuracy by employing training techniques such as momentum, dropout, batch normalization, and distillation. This paper presents a new compatible sibling to these methods, which we recommend for wide use.
Our relatively simple and computationally lightweight techniques emphasize the uncertain examples (i.e., SGD-*PV and SGD-*TC).

The experiments confirm that the proper bias can be beneficial to generalization performance. When the task is relatively easy (both training and testing accuracy are high), preferring more difficult examples works well. On the contrary, when the dataset is challenging or noisy (both training and testing accuracy are low), emphasizing easier samples often leads to better performance. In both cases, the active bias techniques consistently lead to more accurate and robust neural networks, as long as the classifier does not memorize all the training samples easily (i.e., the case where training accuracy is high but testing accuracy is low).

Acknowledgements

This material is based on research sponsored by the National Science Foundation under Grant No. 1514053 and by DARPA under agreement numbers FA8750-13-2-0020 and HR0011-15-2-0036. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

References

[1] G. Alain, A. Lamb, C. Sankar, A. Courville, and Y. Bengio. Variance reduction in SGD by distributed importance sampling. arXiv preprint arXiv:1511.06481, 2015.

[2] S.-I. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409, 2000.

[3] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, 2016.

[4] V. Avramova. Curriculum learning with deep convolutional neural networks, 2015.

[5] Y.
Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.

[6] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6(Sep):1579–1619, 2005.

[7] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[8] P. Chaudhari, A. Choromanska, S. Soatto, and Y. LeCun. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.

9https://github.com/akamaus/mnist-distill

[9] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.

[10] G. Druck and A. McCallum. Toward interactive training and evaluation. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 947–956. ACM, 2011.

[11] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[12] J. Gao, H. Jagadish, and B. C. Ooi. Active sampler: Light-weight accelerator for complex data analytics at scale. arXiv preprint arXiv:1512.03880, 2015.

[13] S. Gopal. Adaptive sampling for SGD by exploiting side information. In ICML, 2016.

[14] A. Guillory, E. Chastain, and J. A. Bilmes. Active learning as non-convex optimization. In AISTATS, 2009.

[15] C. Gulcehre, M. Moczulski, F. Visin, and Y. Bengio. Mollifying networks. In ICLR, 2017.

[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[17] G. Hinton, O. Vinyals, and J. Dean.
Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.

[18] G. E. Hinton. To recognize shapes, first learn to generate images. Progress in Brain Research, 165:535–547, 2007.

[19] N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.

[20] E. Hovy, M. Marcus, M. Palmer, L. Ramshaw, and R. Weischedel. OntoNotes: the 90% solution. In HLT-NAACL, 2006.

[21] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.

[22] Y. Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.

[23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[24] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.

[25] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010.

[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[27] G.-H. Lee, S.-W. Yang, and S.-D. Lin. Toward implicit sample noise modeling: Deviation-driven matrix factorization. arXiv preprint arXiv:1610.09274, 2016.

[28] X. Li and D. Roth. Learning question classifiers. In COLING, 2002.

[29] I. Loshchilov and F. Hutter. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.

[30] D. J. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.

[31] S. Mandt, M. D. Hoffman, and D. M. Blei. A variational analysis of stochastic gradient algorithms. In ICML, 2016.

[32] S. Mandt, J. McInerney, F. Abrol, R. Ranganath, and D. Blei. Variational tempering.
In AISTATS, 2016.

[33] D. Meng, Q. Zhao, and L. Jiang. What objective does self-paced learning indeed optimize? arXiv preprint arXiv:1511.06049, 2015.

[34] Y. Mu, W. Liu, X. Liu, and W. Fan. Stochastic gradient made stable: A manifold propagation approach for large-scale optimization. IEEE Transactions on Knowledge and Data Engineering, 2016.

[35] C. G. Northcutt, T. Wu, and I. L. Chuang. Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv preprint arXiv:1705.01936, 2017.

[36] T. Pi, X. Li, Z. Zhang, D. Meng, F. Wu, J. Xiao, and Y. Zhuang. Self-paced boost learning for classification. In IJCAI, 2016.

[37] D. Pregibon. Resistant fits for some commonly used logistic models with medical applications. Biometrics, pages 485–498, 1982.

[38] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.

[39] J. D. Rennie. Regularized logistic regression is strictly convex. Unpublished manuscript. URL: people.csail.mit.edu/jrennie/writing/convexLR.pdf, 2005.

[40] T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In ICML, 2013.

[41] A. I. Schein and L. H. Ungar. Active learning for logistic regression: an evaluation. Machine Learning, 68(3):235–265, 2007.

[42] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In ICML, 2000.

[43] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.

[44] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.

[45] E. Strubell, P. Verga, D. Belanger, and A. McCallum. Fast and accurate sequence labeling with iterated dilated convolutions. arXiv preprint arXiv:1702.02098, 2017.

[46] E. F. Tjong Kim Sang and F. De Meulder.
Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In HLT-NAACL, 2003.

[47] C. Wang, X. Chen, A. J. Smola, and E. P. Xing. Variance reduction for stochastic gradient optimization. In NIPS, 2013.

[48] Y. Wang, A. Kucukelbir, and D. M. Blei. Reweighted data for robust probabilistic models. arXiv preprint arXiv:1606.03860, 2016.

[49] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[50] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

[51] P. Zhao and T. Zhang. Stochastic optimization with importance sampling. arXiv preprint arXiv:1412.2753, 2014.