{"title": "Decoupling \"when to update\" from \"how to update\"", "book": "Advances in Neural Information Processing Systems", "page_first": 960, "page_last": 970, "abstract": "Deep learning requires data. A useful approach to obtain data is to be creative and mine data from various sources, that were created for different purposes. Unfortunately, this approach often leads to noisy labels. In this paper, we propose a meta algorithm for tackling the noisy labels problem. The key idea is to decouple ``when to update'' from ``how to update''. We demonstrate the effectiveness of our algorithm by mining data for gender classification by combining the Labeled Faces in the Wild (LFW) face recognition dataset with a textual genderizing service, which leads to a noisy dataset. While our approach is very simple to implement, it leads to state-of-the-art results. We analyze some convergence properties of the proposed algorithm.", "full_text": "Decoupling \u201cwhen to update\u201d from \u201chow to update\u201d\n\nEran Malach\n\nSchool of Computer Science\nThe Hebrew University, Israel\n\neran.malach@mail.huji.ac.il\n\nShai Shalev-Shwartz\n\nSchool of Computer Science\nThe Hebrew University, Israel\nshais@cs.huji.ac.il\n\nAbstract\n\nDeep learning requires data. A useful approach to obtain data is to be creative and\nmine data from various sources, that were created for different purposes. Unfortu-\nnately, this approach often leads to noisy labels. In this paper, we propose a meta\nalgorithm for tackling the noisy labels problem. The key idea is to decouple \u201cwhen\nto update\u201d from \u201chow to update\u201d. We demonstrate the effectiveness of our algo-\nrithm by mining data for gender classi\ufb01cation by combining the Labeled Faces\nin the Wild (LFW) face recognition dataset with a textual genderizing service,\nwhich leads to a noisy dataset. While our approach is very simple to implement,\nit leads to state-of-the-art results. 
We analyze some convergence properties of the proposed algorithm.\n\n1 Introduction\n\nIn recent years, deep learning has achieved state-of-the-art results in a variety of tasks. However, neural networks are mostly trained using supervised learning, where a massive amount of labeled data is required. While collecting unlabeled data is relatively easy given the amount of data available on the web, providing accurate labeling is usually an expensive task. In order to overcome this problem, data science has become an art of extracting labels out of thin air. Some popular approaches to labeling are crowdsourcing, where the labeling is not done by experts, and mining available meta-data, such as text that is linked to an image in a webpage. Unfortunately, this gives rise to a problem of abundant noisy labels: labels may often be corrupted [19], which might deteriorate the performance of neural networks [12].\nLet us start with an intuitive explanation as to why noisy labels are problematic. Common neural network optimization algorithms start with a random guess of what the classifier should be, and then iteratively update the classifier based on stochastically sampled examples from a given dataset, optimizing a given loss function such as the hinge loss or the logistic loss. In this process, wrong predictions lead to an update of the classifier that would hopefully result in better classification performance. While at the beginning of the training process the predictions are likely to be wrong, as the classifier improves it will fail on fewer and fewer examples, thus making fewer and fewer updates. On the other hand, in the presence of noisy labels, as the classifier improves the effect of the noise increases: the classifier may give correct predictions, but will still have to update due to wrong labeling. 
Thus, in an advanced stage of the training process the majority of the updates may actually be due to wrongly labeled examples, and therefore will not allow the classifier to further improve.\nTo tackle this problem, we propose to decouple the decision of “when to update” from the decision of “how to update”. As mentioned before, in the presence of noisy labels, if we update only when the classifier's prediction differs from the available label, then at the end of the optimization process these few updates will probably be mainly due to noisy labels. We would therefore like a different update criterion that would let us decide whether it is worthwhile to update the classifier based on a given example. We would like to preserve the behavior of performing many updates at the beginning of the training process but only a few updates as we approach convergence.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nTo do so, we suggest training two predictors, and performing update steps only in case of disagreement between them. This way, when the predictors get better, the “area” of their disagreement gets smaller, and updates are performed only on examples that lie in the disagreement area, thereby preserving the desired behavior of the standard optimization process. 
On the other hand, since we do not perform an update based on disagreement with the label (which may be due to a problem in the label rather than a problem in the predictor), this method keeps the effective amount of noisy labels seen throughout the training process at a constant rate.\nThe idea of deciding “when to update” based on a disagreement between classifiers is closely related to approaches for active learning and selective sampling, a setup in which the learner does not have unlimited access to labeled examples, but rather has to query for each instance's label, provided at a given cost (see for example [34]). Specifically, the well-known query-by-committee algorithm maintains a version space of hypotheses and, at each iteration, decides whether to query the label of a given instance by sampling two hypotheses uniformly at random from the version space [35, 14]. Naturally, maintaining the version space of deep networks seems to be intractable. Our algorithm maintains only two deep networks, whose difference stems from their random initializations. Therefore, unlike the original query-by-committee algorithm, which samples from the version space at every iteration, we sample from the original hypothesis class only once (at initialization), and from there on we update these two hypotheses using the backpropagation rule whenever they disagree on the label. To the best of our knowledge, this algorithm has not been proposed or analyzed previously, neither in the active learning literature nor as a method for dealing with noisy labels.\nTo show that this method indeed improves the robustness of deep learning to noisy labels, we conduct an experiment that aims to study a real-world scenario of acquiring noisy labels for a given dataset. We consider the task of gender classification based on images. We did not have a dedicated dataset for this task. 
Instead, we relied on the Labeled Faces in the Wild (LFW) dataset, which contains images of different people along with their names, but with no information about their gender. To find the gender for each image, we use an online service that matches a gender to a given name (as suggested by [25]), a method which is naturally prone to noisy labels (due to unisex names). Applying our algorithm to an existing neural network architecture reduces the effect of the noisy labels, achieving better results than similar available approaches when tested on a clean subset of the data. We also performed a controlled experiment, in which the base algorithm is the perceptron, and show that using our approach leads to a noise-resilient algorithm which can handle extremely high label noise rates of up to 40%. The controlled experiments are detailed in Appendix B.\nIn order to provide theoretical guarantees for our meta-algorithm, we need to tackle two questions: 1. does the algorithm converge, and if so, how quickly? 2. does it converge to an optimum? We give a positive answer to the first question when the base algorithm is the perceptron and the noise is a label flip with constant probability. Specifically, we prove that the expected number of iterations required by the resulting algorithm is equal, up to a constant factor, to that of the perceptron in the noise-free setting. As for the second question, clearly, the convergence depends on the initialization of the two predictors. For example, if we initialize the two predictors to be the same predictor, the algorithm will not perform any updates. Furthermore, we derive lower bounds on the quality of the solution even if we initialize the two predictors at random. In particular, we show that for some distributions, the algorithm's error will be bounded away from zero, even in the case of linearly separable data. 
This raises the question of whether a better initialization procedure may be helpful. Indeed, we show that for the same distribution mentioned above, even if we add random label noise, if we initialize the predictors by performing a few vanilla perceptron iterations, then the algorithm performs much better. Despite this worst-case pessimism, we show that empirically, when working with natural data, the algorithm converges to a good solution. We leave a formal investigation of distribution-dependent upper bounds to future work.\n\n2 Related Work\n\nThe effects of noisy labels have been widely studied across many different learning algorithms (see for example the survey in [13]), and various solutions to this problem have been proposed, some of them with theoretically provable bounds, including methods like statistical queries, boosting, bagging and more [21, 26, 7, 8, 29, 31, 23, 27, 3]. Our focus in this paper is on the problem of noisy labels in the context of deep learning. Recently, there have been several works aiming at improving the resilience of deep learning to noisy labels. To the best of our knowledge, there are four main approaches. The first changes the loss function. The second adds a layer that tries to mimic the noise behavior. The third groups examples into buckets. The fourth tries to clean the data as a preprocessing step. Beyond these approaches, there are methods that assume a small clean data set and another large, noisy, or even unlabeled, data set [30, 6, 38, 1]. We now list some specific algorithms from these families.\n[33] proposed changing the cross-entropy loss function by adding a regularization term that takes into account the current prediction of the network. This method is inspired by a technique called minimum entropy regularization, detailed in [17, 16]. 
It was also found to be effective by [12], who suggested a further improvement of this method by effectively increasing the weight of the regularization term during the training procedure.\n[28] suggested using a probabilistic model that models the conditional probability of seeing a wrong label, where the correct label is a latent variable of the model. While [28] assume that the probability of label flips between classes is known in advance, a follow-up work by [36] extends this method to the case where these probabilities are unknown. An improved method, which takes into account the fact that some instances might be more likely to have a wrong label, has been proposed recently in [15]. In particular, they add another softmax layer to the network, which can use the output of the last hidden layer of the network in order to predict the probability of the label being flipped. Unfortunately, their method involves optimizing the biases of the additional softmax layer by first training it in a simpler setup (without using the last hidden layer), which implies a two-phase training that further complicates the optimization process. It is worth noting that there are some other works that suggest methods very similar to [36, 15], with a slightly different objective or training method [5, 20], or that otherwise suggest a complicated process involving estimation of the class-dependent noise probabilities [32]. Another method from the same family is the one described in [37], which suggests differentiating between “confusing” noise, where some features of the example make it hard to label, and completely random label noise, where the mislabeling has no clear reason.\n[39] suggested training the network to predict labels for a randomly selected group of images from the same class, instead of classifying each image individually. 
In their method, a group of images is fed as an input to the network, which merges their inner representations at a deeper level of the network, adds an attention model to each image, and produces a single prediction. Therefore, noisy labels may appear in groups together with correctly labeled examples, thus diminishing their impact. The final setup is rather complicated, involving many hyper-parameters, and does not provide a simple plug-and-play solution that makes an existing architecture robust to noisy labels.\nFrom the family of preprocessing methods, we mention [4, 10], which try to eliminate instances that are suspected to be mislabeled. Our method shares the motivation of disregarding contaminated instances, but without the cost of complicating the training process with a preprocessing phase.\nIn our experiments we test the performance of our method against methods that are as simple as training a vanilla neural network. In particular, from the family of modified loss functions we chose the two variants of the regularized cross-entropy loss suggested by [33] (soft and hard bootstrapping). From the family of methods that add a layer modeling the noise, we chose to compare to one of the models suggested in [15] (which is very similar to the model proposed by [36]), because this model does not require any assumptions or complication of the training process. We find that our method outperforms all of these competing methods, while being extremely simple to implement.\nFinally, as mentioned before, our “when to update” rule is closely related to approaches for active learning and selective sampling, and in particular to the query-by-committee algorithm. In [14], a thorough analysis is provided for various base algorithms implementing the query-by-committee update rule; in particular, they analyze the perceptron base algorithm under some strong distributional assumptions. 
In other works, an ensemble of neural networks is trained in an active learning setup to improve the generalization of neural networks [11, 2, 22]. Our method can be seen as a simplified member of the family of ensemble methods. As mentioned before, our motivation is very different from the active learning scenario, since our main goal is dealing with noisy labels rather than reducing the number of label queries. To the best of our knowledge, the algorithm we propose has not been used or analyzed in the past for the purpose of dealing with noisy labels in deep learning.\n\n3 Method\n\nAs mentioned before, to tackle the problem of noisy labels, we suggest changing the update rule commonly used in deep learning optimization algorithms in order to decouple the decision of “when to update” from “how to update”. In our approach, the decision of “when to update” does not depend on the label. Instead, it depends on a disagreement between two different networks. This method can be thought of as a meta-algorithm that uses two base classifiers, performing updates according to a base learning algorithm, but only on examples for which there is a disagreement between the two classifiers.\nTo put this formally, let X be an instance space and Y be the label space, and assume we sample examples from a distribution D̃ over X × Y, with possibly noisy labels. We wish to train a classifier h, coming from a hypothesis class H. We rely on an update rule, U, that updates h based on its current value as well as a mini-batch of b examples. The meta-algorithm receives as input a pair of classifiers h1, h2 ∈ H, the update rule U, and a mini-batch size b. A pseudo-code is given in Algorithm 1.\nNote that we do not specify how to initialize the two base classifiers, h1, h2. 
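To make the meta-algorithm concrete, here is a minimal NumPy sketch of Algorithm 1 with the perceptron as the base update rule U (the synthetic data generator and all names below are our own illustration, not the paper's experimental code):

```python
import numpy as np

def perceptron_update(w, batch):
    """Base update rule U: one perceptron step per example in the batch."""
    for x, y in batch:
        w = w + y * x
    return w

def update_by_disagreement(h1, h2, update, stream, batch_size):
    """Algorithm 1: apply the base update only to the examples of each
    mini-batch on which the two predictors disagree."""
    n_updates = 0
    for start in range(0, len(stream), batch_size):
        batch = stream[start:start + batch_size]
        disagree = [(x, y) for x, y in batch
                    if np.sign(h1 @ x) != np.sign(h2 @ x)]
        if disagree:
            h1 = update(h1, disagree)   # both predictors receive the
            h2 = update(h2, disagree)   # exact same update
            n_updates += len(disagree)
    return h1, h2, n_updates

# Toy run: linearly separable data corrupted by random label flips
# (this generator is our own illustration, not the paper's setup).
rng = np.random.default_rng(0)
d, n, mu = 10, 5000, 0.2
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)
y_noisy = np.where(rng.random(n) < mu, -y, y)   # flip each label w.p. mu

h1_0, h2_0 = rng.standard_normal(d), rng.standard_normal(d)
h1, h2, T = update_by_disagreement(h1_0, h2_0, perceptron_update,
                                   list(zip(X, y_noisy)), batch_size=1)
```

Because both predictors receive identical updates, the difference h1 − h2 stays fixed at its initial value throughout training; this is exactly the invariant used in the proof of Theorem 1 below.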
When using deep learning as the base algorithm, the easiest approach is perhaps to perform a random initialization. Another approach is to first train the two classifiers while following the regular “when to update” rule (which is based on the label y), possibly training each classifier on a different subset of the data, and to switch to the suggested update rule only at an advanced stage of the training process. We later show that the second approach is preferable.\nAt the end of the optimization process, we can simply return one of the trained classifiers. If a small accurately labeled test set is available, we can choose to return the classifier with the better accuracy on the clean test data.\n\nAlgorithm 1 Update by Disagreement\ninput: an update rule U, a batch size b, two initial predictors h1, h2 ∈ H\nfor t = 1, 2, . . . , N do\n  draw a mini-batch (x1, y1), . . . , (xb, yb) ∼ D̃^b\n  let S = {(xi, yi) : h1(xi) ≠ h2(xi)}\n  h1 ← U(h1, S)\n  h2 ← U(h2, S)\nend for\n\n4 Theoretical analysis\n\nSince a convergence analysis for deep learning is beyond our reach even in the noise-free setting, we focus on analyzing properties of our algorithm for linearly separable data which is corrupted by random label noise, while using the perceptron as the base algorithm.\nLet X = {x ∈ R^d : ‖x‖ ≤ 1}, Y = {±1}, and let D be a probability distribution over X × Y such that there exists w* for which D({(x, y) : y ⟨w*, x⟩ < 1}) = 0. The distribution we observe, denoted D̃, is a noisy version of D. Specifically, to sample (x, ỹ) ∼ D̃, one should sample (x, y) ∼ D and output (x, y) with probability 1 − µ and (x, −y) with probability µ. Here, µ is in [0, 1/2). Finally, let H be the class of linear classifiers, namely H = {x ↦ sign(⟨w, x⟩) : w ∈ R^d}. We use the perceptron's update rule with a mini-batch size of 1. 
That is, given the classifier w_t ∈ R^d, the update on example (x_t, y_t) ∈ X × Y is: w_{t+1} = U(w_t, (x_t, y_t)) := w_t + y_t x_t.\nAs mentioned in the introduction, to provide a full theoretical analysis of this algorithm, we need to account for two questions: 1. does the algorithm converge, and if so, how quickly? 2. does it converge to an optimum?\nTheorem 1 below provides a positive answer to the first question. It shows that the number of updates of our algorithm is larger only by a constant factor (that depends on the initial vectors and the amount of noise) relative to the bound for the vanilla perceptron in the noise-free case.\n\nTheorem 1 Suppose that the “Update by Disagreement” algorithm is run on a sequence of N random examples from D̃, with initial vectors w_0^(1), w_0^(2). Denote K = max_i ‖w_0^(i)‖, and let T be the number of updates performed by the “Update by Disagreement” algorithm. Then, E[T] ≤ 3 (4K + 1) ‖w*‖^2 / (1 − 2µ)^2, where the expectation is w.r.t. the randomness of sampling from D̃.\n\nProof It will be more convenient to rewrite the algorithm as follows. We perform N iterations, where at iteration t we receive (x_t, ỹ_t) and update w_{t+1}^(i) = w_t^(i) + τ_t ỹ_t x_t, where τ_t = 1 if sign(⟨w_t^(1), x_t⟩) ≠ sign(⟨w_t^(2), x_t⟩), and τ_t = 0 otherwise.\nObserve that we can write ỹ_t = θ_t y_t, where (x_t, y_t) ∼ D and θ_t is a random variable with P[θ_t = 1] = 1 − µ and P[θ_t = −1] = µ. We also use the notation v_t = y_t ⟨w*, x_t⟩ and ṽ_t = θ_t v_t. Our goal is to upper bound T̄ := E[T] = E[Σ_t τ_t].\nWe start by showing that\n\nE[Σ_{t=1}^N τ_t ṽ_t] ≥ (1 − 2µ) T̄.  (1)\n\nIndeed, since θ_t is independent of τ_t and v_t, we get that\n\nE[τ_t ṽ_t] = E[τ_t θ_t v_t] = E[θ_t] · E[τ_t v_t] = (1 − 2µ) E[τ_t v_t] ≥ (1 − 2µ) E[τ_t],\n\nwhere in the last inequality we used the fact that v_t ≥ 1 with probability 1 and τ_t is non-negative. Summing over t we obtain that Equation 1 holds.\nNext, we show that for i ∈ {1, 2},\n\n‖w_t^(i)‖^2 ≤ ‖w_0^(i)‖^2 + Σ_{t=1}^N τ_t (2 ‖w_0^(2) − w_0^(1)‖ + 1).  (2)\n\nIndeed, since the updates of w_{t+1}^(1) and w_{t+1}^(2) are identical, we have that ‖w_t^(1) − w_t^(2)‖ = ‖w_0^(1) − w_0^(2)‖ for every t. Now, whenever τ_t = 1 we have that either ỹ_t ⟨w_{t−1}^(1), x_t⟩ ≤ 0 or ỹ_t ⟨w_{t−1}^(2), x_t⟩ ≤ 0. Assume w.l.o.g. that ỹ_t ⟨w_{t−1}^(1), x_t⟩ ≤ 0. Then,\n\n‖w_t^(1)‖^2 = ‖w_{t−1}^(1) + ỹ_t x_t‖^2 = ‖w_{t−1}^(1)‖^2 + 2 ỹ_t ⟨w_{t−1}^(1), x_t⟩ + ‖x_t‖^2 ≤ ‖w_{t−1}^(1)‖^2 + 1.\n\nSecond,\n\n‖w_t^(2)‖^2 = ‖w_{t−1}^(2) + ỹ_t x_t‖^2 = ‖w_{t−1}^(2)‖^2 + 2 ỹ_t ⟨w_{t−1}^(2), x_t⟩ + ‖x_t‖^2 = ‖w_{t−1}^(2)‖^2 + 2 ỹ_t ⟨w_{t−1}^(1), x_t⟩ + 2 ỹ_t ⟨w_{t−1}^(2) − w_{t−1}^(1), x_t⟩ + ‖x_t‖^2 ≤ ‖w_{t−1}^(2)‖^2 + 2 ‖w_{t−1}^(2) − w_{t−1}^(1)‖ + 1 = ‖w_{t−1}^(2)‖^2 + 2 ‖w_0^(2) − w_0^(1)‖ + 1.\n\nTherefore, the above two equations imply that for i ∈ {1, 2}, ‖w_t^(i)‖^2 ≤ ‖w_{t−1}^(i)‖^2 + τ_t (2 ‖w_0^(2) − w_0^(1)‖ + 1). Summing over t we obtain that Equation 2 holds.\nEquipped with Equation 1 and Equation 2 we are ready to prove the theorem. Denote K = max_i ‖w_0^(i)‖ and note that ‖w_0^(2) − w_0^(1)‖ ≤ 2K. We prove the theorem by providing upper and lower bounds on E[⟨w_t^(i), w*⟩]. Combining the update rule with Equation 1 we get:\n\nE[⟨w_t^(i), w*⟩] = ⟨w_0^(i), w*⟩ + E[Σ_{t=1}^N τ_t ṽ_t] ≥ ⟨w_0^(i), w*⟩ + (1 − 2µ) T̄ ≥ −K ‖w*‖ + (1 − 2µ) T̄.\n\nTo construct an upper bound, first note that Equation 2 implies that\n\nE[‖w_t^(i)‖^2] ≤ ‖w_0^(i)‖^2 + (2 ‖w_0^(2) − w_0^(1)‖ + 1) T̄ ≤ K^2 + (4K + 1) T̄.\n\nUsing the above and Jensen's inequality, we get that\n\nE[⟨w_t^(i), w*⟩] ≤ E[‖w_t^(i)‖ ‖w*‖] ≤ ‖w*‖ √(E[‖w_t^(i)‖^2]) ≤ ‖w*‖ √(K^2 + (4K + 1) T̄).\n\nComparing the upper and lower bounds, we obtain that\n\n−K ‖w*‖ + (1 − 2µ) T̄ ≤ ‖w*‖ √(K^2 + (4K + 1) T̄).\n\nUsing √(a + b) ≤ √a + √b, the above implies that\n\n(1 − 2µ) T̄ − ‖w*‖ √(4K + 1) √T̄ − 2K ‖w*‖ ≤ 0.\n\nDenote α = ‖w*‖ √(4K + 1). Since ‖w*‖ must be at least 1 for the separability assumption to hold, we have 2K ‖w*‖ ≤ α^2, and the above also implies that (1 − 2µ) T̄ − α √T̄ − α^2 ≤ 0. Denote β = α / (1 − 2µ) ≥ α. Dividing by (1 − 2µ), standard algebraic manipulations give T̄ − β √T̄ − β^2 ≤ 0, hence √T̄ ≤ ((1 + √5)/2) β, and therefore T̄ ≤ ((1 + √5)/2)^2 β^2 ≤ 3 β^2 = 3 (4K + 1) ‖w*‖^2 / (1 − 2µ)^2. This concludes our proof.\n\nThe above theorem tells us that our algorithm converges quickly. We next address the second question, regarding the quality of the point to which the algorithm converges. As mentioned in the introduction, the convergence must depend on the initial predictors. Indeed, if w_0^(1) = w_0^(2), then the algorithm will not make any updates. The next question is what happens if we initialize w_0^(1) and w_0^(2) at random. The lemma below shows that this does not suffice to ensure convergence to the optimum, even if the data is linearly separable without noise. The proof of this lemma is given in Appendix A.\n\nLemma 1 Fix some δ ∈ (0, 1) and let d be an integer greater than 40 log(1/δ). There exists a distribution over R^d × {±1}, which is separable by a weight vector w* for which ‖w*‖^2 = d, such that running the “Update by Disagreement” algorithm, with the perceptron as the underlying update rule, and with every coordinate of w_0^(1), w_0^(2) initialized according to any symmetric distribution over R, will yield a solution whose error is at least 1/8, with probability of at least 1 − δ.\n\nTrying to circumvent the lower bound given in the above lemma, one may wonder what would happen if we initialize w_0^(1), w_0^(2) differently. Intuitively, maybe noisy labels are not such a big problem at the beginning of the learning process. Therefore, we can initialize w_0^(1), w_0^(2) by running the vanilla perceptron for several iterations, and only then switch to our algorithm. Trivially, for the distribution we constructed in the proof of Lemma 1, this approach will work, simply because in the noise-free setting both w_0^(1) and w_0^(2) will converge to vectors that give the same predictions as w*. But what would happen in the noisy setting, when we flip the label of every example with probability µ? The lemma below shows that the error of the resulting solution is likely to be of order µ^3. Here again, the proof is given in Appendix A.\n\nLemma 2 Consider a vector w* ∈ {±1}^d and the distribution D̃ over R^d × {±1} such that to sample a pair (x, ỹ) we first choose x uniformly at random from {e_1, . . . , e_d}, set y = ⟨w*, x⟩, and set ỹ = y with probability 1 − µ and ỹ = −y with probability µ. Let w_0^(1), w_0^(2) be the result of running the vanilla perceptron algorithm on random examples from D̃ for any number of iterations, and suppose that we then run the “Update by Disagreement” algorithm for an additional arbitrary number of iterations. 
Then, the error of the solution is likely to be Ω(µ^3).\n\nTo summarize, we see that without making additional assumptions on the data distribution, it is impossible to prove convergence of our algorithm to a good solution. In the next section we show that for natural data distributions, our algorithm converges to a very good solution.\n\n5 Experiments\n\nWe now demonstrate the merit of our suggested meta-algorithm using empirical evaluation. Our main experiment uses our algorithm with deep networks in a real-world scenario of noisy labels. In particular, we use a hypothesis class of deep networks and Stochastic Gradient Descent with momentum as the base update rule. The task is classifying face images according to gender. As training data, we use the Labeled Faces in the Wild (LFW) dataset, for which we had a labeling of the name of each face, but no gender labeling. To construct gender labels, we used an external service that provides gender labels based on names. This process resulted in noisy labels. We show that our method leads to state-of-the-art results on this task, compared to competing noise-robustness methods. We also performed controlled experiments to demonstrate our algorithm's performance on linear classification with varying levels of noise. These results are detailed in Appendix B.\n\n5.1 Deep Learning\n\nWe applied our algorithm with Stochastic Gradient Descent (SGD) with momentum as the base update rule to the task of labeling images of faces according to gender. The images were taken from the Labeled Faces in the Wild (LFW) benchmark [18]. This benchmark consists of 13,233 images of 5,749 different people collected from the web, labeled with the name of the person in the picture. 
Since the gender of each subject is not provided, we follow the method of [25] and use a service that determines a person's gender by their name (if the name is recognized), along with a confidence level. This method gives rise to “natural” noisy labels due to unisex names, and therefore allows us to experiment with a real-world setup of a dataset with noisy labels.\n\n[Figure 1: Images from the dataset tagged as female, shown with the name and the service's confidence - Kim (88%), Morgan (64%), Joan (82%), Leslie (88%) - grouped into correct and mislabeled examples.]\n\nWe constructed train and test sets as follows. We first took all the individuals on which the gender service gave 100% confidence. We divided this set at random into three subsets of equal size, denoted N1, N2, N3. We denote by N4 the individuals on which the confidence level is in [90%, 100%), and by N5 the individuals on which the confidence level is in [0%, 90%). Needless to say, the sets N1, . . . , N5 are pairwise disjoint.\nWe repeated each experiment three times, each time using a different Ni as the test set, for i ∈ {1, 2, 3}. Suppose N1 is the test set; then for the training set we used two configurations:\n\n1. A dataset consisting of all the images that belong to names in N2, N3, N4, N5, where unrecognized names were labeled as male (since the majority of subjects in LFW are males).\n\n2. A dataset consisting of all the images that belong to names in N2, N3, N4.\n\nWe use a network architecture suggested by [24], using an available tensorflow implementation (https://github.com/dpressel/rude-carnie). It should be noted that we did not change any parameters of the network architecture or the optimization process, and use the default parameters in the implementation. 
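The label-mining and split-construction logic described above can be sketched as follows (the `NAME_TO_GENDER` table is a hypothetical stand-in for the online genderizing service; its entries and confidence values are illustrative, not real service outputs):

```python
# Hypothetical stand-in for the online name-to-gender service: the names
# and confidence values below are illustrative, not real service outputs.
NAME_TO_GENDER = {
    "John": ("male", 1.00),
    "Mary": ("female", 1.00),
    "Morgan": ("female", 0.64),  # unisex name -> low confidence, likely noise
}

def mine_label(first_name):
    """Label an LFW image by its subject's first name. Unrecognized names
    default to 'male', as in dataset configuration #1 (the majority of
    subjects in LFW are male)."""
    return NAME_TO_GENDER.get(first_name, ("male", 0.0))

def confidence_split(conf):
    """Assign a name to the confidence buckets used to build the splits:
    100% -> N1/N2/N3 (divided at random), [90%, 100%) -> N4, below -> N5."""
    if conf == 1.0:
        return "N1/N2/N3"
    return "N4" if conf >= 0.9 else "N5"
```

Only the 100%-confidence names enter the clean test pool; the lower-confidence buckets N4 and N5 contribute training data of increasing noisiness, which is what distinguishes the two training configurations above.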
Since the numbers of male and female subjects in the dataset are not balanced, we use an objective of maximizing the balanced accuracy [9], i.e., the average of the accuracies obtained on each class.\nTraining is done for 30,000 iterations with mini-batches of 128 examples. In order to make the networks' disagreement meaningful, we initialize the two networks by training both of them normally (updating on all the examples) until iteration 5,000, at which point we switch to training with the “Update by Disagreement” rule. Since we are then no longer updating on all examples, we decrease the weight of batches that retain less than 10% of the examples in the original batch, in order to stabilize the gradients (code is available online at https://github.com/emalach/UpdateByDisagreement).\nWe inspect the balanced accuracy on our test data during the training process, comparing our method to vanilla neural network training, as well as to the soft and hard bootstrapping described in [33] and to the s-model described in [15], all of which use the same network architecture. We use the initialization parameters for [33, 15] that were suggested in the original papers. We show that while in the other methods the accuracy effectively decreases during the training process due to overfitting the noisy labels, in our method this effect is less substantial, allowing the network to keep improving.\nWe study two different scenarios: one in which a small clean test set is available for model selection, and therefore we can choose the iteration with the best test accuracy, and a more realistic scenario where there is no clean test data at hand. For the first scenario, we observe the balanced accuracy of the best available iteration. 
For the second scenario, we observe the balanced accuracy of the last iteration. As can be seen in Figure 2 and the supplementary results listed in Table 1 in Appendix B, our method outperforms the other methods in both situations. This is true for both datasets, although, as expected, the improvement in performance is less substantial on the cleaner dataset.\nThe second best algorithm is the s-model described in [15]. Since our method can be applied with any base algorithm, we also applied our method on top of the s-model. This yields even better performance, especially when the data is less noisy, where we obtain a significant improvement.\n\n[Figure 2: Balanced accuracy of all methods on clean test data, trained on the two different datasets (Dataset #1 - more noise; Dataset #2 - less noise).]\n\n6 Discussion\n\nWe have described an extremely simple approach for supervised learning in the presence of noisy labels. The basic idea is to decouple the “when to update” rule from the “how to update” rule. We achieve this by maintaining two predictors, and updating based on their disagreement. We have shown that this simple approach leads to state-of-the-art results.\nOur theoretical analysis shows that the approach leads to a fast convergence rate when the underlying update rule is the perceptron. We have also shown that proving that the method converges to an optimal solution must rely on distributional assumptions. There are several immediate open questions that we leave to future work. First, suggesting distributional assumptions that are likely to hold in practice and proving that the algorithm converges to an optimal solution under these assumptions. Second, extending the convergence proof beyond linear predictors. 
While obtaining absolute convergence guarantees seems beyond reach at the moment, coming up with oracle-based convergence guarantees may be feasible.

Acknowledgements: This research is supported by the European Research Council (TheoryDL project).

References

[1] Rie Kubota Ando and Tong Zhang. Two-view feature generation model for semi-supervised learning. In Proceedings of the 24th international conference on Machine learning, pages 25–32. ACM, 2007.

[2] Les E Atlas, David A Cohn, Richard E Ladner, Mohamed A El-Sharkawi, Robert J Marks, ME Aggoune, and DC Park. Training connectionist networks with queries and selective sampling. In NIPS, pages 566–573, 1989.

[3] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 449–458. ACM, 2014.

[4] Ricardo Barandela and Eduardo Gasca. Decontamination of training samples for supervised pattern recognition methods. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 621–630. Springer, 2000.

[5] Alan Joseph Bekker and Jacob Goldberger. Training deep neural-networks based on unreliable labels. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 2682–2686. IEEE, 2016.

[6] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pages 92–100. ACM, 1998.

[7] Jakramate Bootkrajang and Ata Kabán. Label-noise robust logistic regression and its applications. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 143–158.
Springer, 2012.

[8] Jakramate Bootkrajang and Ata Kabán. Boosting in the presence of label noise. arXiv preprint arXiv:1309.6818, 2013.

[9] Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M Buhmann. The balanced accuracy and its posterior distribution. In Pattern recognition (ICPR), 2010 20th international conference on, pages 3121–3124. IEEE, 2010.

[10] Carla E. Brodley and Mark A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.

[11] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine learning, 15(2):201–221, 1994.

[12] David Flatow and Daniel Penner. On the robustness of convnets to training on noisy labels. http://cs231n.stanford.edu/reports/flatow_penner_report.pdf, 2017.

[13] Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE transactions on neural networks and learning systems, 25(5):845–869, 2014.

[14] Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine learning, 28(2-3):133–168, 1997.

[15] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural networks using a noise adaptation layer. Under review for ICLR, 2017.

[16] Yves Grandvalet and Yoshua Bengio. Entropy regularization. Semi-supervised learning, pages 151–168, 2006.

[17] Yves Grandvalet, Yoshua Bengio, et al. Semi-supervised learning by entropy minimization. In NIPS, volume 17, pages 529–536, 2004.

[18] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments.
Technical Report 07-49, University of Massachusetts, Amherst, 2007.

[19] Panagiotis G Ipeirotis, Foster Provost, and Jing Wang. Quality management on amazon mechanical turk. In Proceedings of the ACM SIGKDD workshop on human computation, pages 64–67. ACM, 2010.

[20] Pravin Kakar and Alex Yong-Sang Chia. Probabilistic learning from mislabelled data for multimedia content recognition. In Multimedia and Expo (ICME), 2015 IEEE International Conference on, pages 1–6. IEEE, 2015.

[21] Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), 45(6):983–1006, 1998.

[22] Anders Krogh, Jesper Vedelsby, et al. Neural network ensembles, cross validation, and active learning. Advances in neural information processing systems, 7:231–238, 1995.

[23] Jan Larsen, L Nonboe, Mads Hintz-Madsen, and Lars Kai Hansen. Design of robust neural network classifiers. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 2, pages 1205–1208. IEEE, 1998.

[24] Gil Levi and Tal Hassner. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 34–42, 2015.

[25] Philip Masek and Magnus Thulin. Evaluation of face recognition apis and libraries. Master's thesis, University of Gothenburg, 2015.

[26] Ross A McDonald, David J Hand, and Idris A Eckley. An empirical comparison of three boosting algorithms on real data sets with artificial class noise. In International Workshop on Multiple Classifier Systems, pages 35–44. Springer, 2003.

[27] Aditya Krishna Menon, Brendan van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instance-dependent corruption. arXiv preprint arXiv:1605.00751, 2016.

[28] Volodymyr Mnih and Geoffrey E Hinton.
Learning to label aerial images from noisy data. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 567–574, 2012.

[29] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.

[30] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In Proceedings of the ninth international conference on Information and knowledge management, pages 86–93. ACM, 2000.

[31] Giorgio Patrini, Frank Nielsen, Richard Nock, and Marcello Carioni. Loss factorization, weakly supervised learning and label noise robustness. arXiv preprint arXiv:1602.02450, 2016.

[32] Giorgio Patrini, Alessandro Rozza, Aditya Menon, Richard Nock, and Lizhen Qu. Making neural networks robust to label noise: a loss correction approach. arXiv preprint arXiv:1609.03683, 2016.

[33] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.

[34] Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.

[35] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, pages 287–294. ACM, 1992.

[36] Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.

[37] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2699, 2015.

[38] Xiaojin Zhu.
Semi-supervised learning literature survey. Computer Sciences TR 1530, 2005.

[39] Bohan Zhuang, Lingqiao Liu, Yao Li, Chunhua Shen, and Ian Reid. Attend in groups: a weakly-supervised deep learning framework for learning from web data. arXiv preprint arXiv:1611.09960, 2016.