{"title": "Active Learning from Weak and Strong Labelers", "book": "Advances in Neural Information Processing Systems", "page_first": 703, "page_last": 711, "abstract": "An active learner is given a hypothesis class, a large set of unlabeled examples and the ability to interactively query labels to an oracle of a subset of these examples; the goal of the learner is to learn a hypothesis in the class that fits the data well by making as few label queries as possible.This work addresses active learning with labels obtained from strong and weak labelers, where in addition to the standard active learning setting, we have an extra weak labeler which may occasionally provide incorrect labels. An example is learning to classify medical images where either expensive labels may be obtained from a physician (oracle or strong labeler), or cheaper but occasionally incorrect labels may be obtained from a medical resident (weak labeler). Our goal is to learn a classifier with low error on data labeled by the oracle, while using the weak labeler to reduce the number of label queries made to this labeler. 
We provide an active learning algorithm for this setting, establish its statistical consistency, and analyze its label complexity to characterize when it can provide label savings over using the strong labeler alone.", "full_text": "Active Learning from Weak and Strong Labelers\n\nChicheng Zhang\n\nUC San Diego\n\nKamalika Chaudhuri\n\nUC San Diego\n\nchichengzhang@ucsd.edu\n\nkamalika@eng.ucsd.edu\n\nAbstract\n\nAn active learner is given a hypothesis class, a large set of unlabeled examples and\nthe ability to interactively query an oracle for the labels of a subset of these examples;\nthe goal of the learner is to learn a hypothesis in the class that fits the data well by\nmaking as few label queries as possible.\n\nThis work addresses active learning with labels obtained from strong and weak\nlabelers, where in addition to the standard active learning setting, we have an extra\nweak labeler which may occasionally provide incorrect labels. An example is\nlearning to classify medical images where either expensive labels may be obtained\nfrom a physician (oracle or strong labeler), or cheaper but occasionally incorrect\nlabels may be obtained from a medical resident (weak labeler). Our goal is to\nlearn a classifier with low error on data labeled by the oracle, while using the weak\nlabeler to reduce the number of label queries made to the oracle. 
We provide an\nactive learning algorithm for this setting, establish its statistical consistency, and\nanalyze its label complexity to characterize when it can provide label savings over\nusing the strong labeler alone.\n\n1\n\nIntroduction\n\nAn active learner is given a hypothesis class, a large set of unlabeled examples and the ability to\ninteractively make label queries to an oracle on a subset of these examples; the goal of the learner is\nto learn a hypothesis in the class that \ufb01ts the data well by making as few oracle queries as possible.\n\nAs labeling examples is a tedious task for any one person, many applications of active learning\ninvolve synthesizing labels from multiple experts who may have slightly different labeling patterns.\nWhile a body of recent empirical work [27, 28, 29, 25, 26, 11] has developed methods for combining\nlabels from multiple experts, little is known on the theory of actively learning with labels from\nmultiple annotators. For example, what kind of assumptions are needed for methods that use labels\nfrom multiple sources to work, when these methods are statistically consistent, and when they can\nyield bene\ufb01ts over plain active learning are all open questions.\n\nThis work addresses these questions in the context of active learning from strong and weak labelers.\nSpeci\ufb01cally, in addition to unlabeled data and the usual labeling oracle in standard active learning,\nwe have an extra weak labeler. The labeling oracle is a gold standard \u2013 an expert on the problem\ndomain \u2013 and it provides high quality but expensive labels. The weak labeler is cheap, but may pro-\nvide incorrect labels on some inputs. An example is learning to classify medical images where either\nexpensive labels may be obtained from a physician (oracle), or cheaper but occasionally incorrect\nlabels may be obtained from a medical resident (weak labeler). 
Our goal is to learn a classi\ufb01er in a\nhypothesis class whose error with respect to the data labeled by the oracle is low, while exploiting\nthe weak labeler to reduce the number of queries made to this oracle. Observe that in our model\nthe weak labeler can be incorrect anywhere, and does not necessarily provide uniformly noisy labels\neverywhere, as was assumed by some previous works [7, 23].\n\n1\n\n\fA plausible approach in this framework is to learn a difference classi\ufb01er to predict where the weak\nlabeler differs from the oracle, and then use a standard active learning algorithm which queries the\nweak labeler when this difference classi\ufb01er predicts agreement. Our \ufb01rst key observation is that\nthis approach is statistically inconsistent; false negative errors (that predict no difference when O\nand W differ) lead to biased annotation for the target classi\ufb01cation task. We address this problem\nby learning instead a cost-sensitive difference classi\ufb01er that ensures that false negative errors rarely\noccur. Our second key observation is that as existing active learning algorithms usually query labels\nin localized regions of space, it is suf\ufb01cient to train the difference classi\ufb01er restricted to this region\nand still maintain consistency. This process leads to signi\ufb01cant label savings. Combining these\ntwo ideas, we get an algorithm that is provably statistically consistent and that works under the\nassumption that there is a good difference classi\ufb01er with low false negative error.\n\nWe analyze the label complexity of our algorithm as measured by the number of label requests to\nthe labeling oracle. In general we cannot expect any consistent algorithm to provide label savings\nunder all circumstances, and indeed our worst case asymptotic label complexity is the same as that\nof active learning using the oracle alone. 
Our analysis characterizes when we can achieve label\nsavings, and we show that this happens for example if the weak labeler agrees with the labeling\noracle for some fraction of the examples close to the decision boundary. Moreover, when the target\nclassi\ufb01cation task is agnostic, the number of labels required to learn the difference classi\ufb01er is of a\nlower order than the number of labels required for active learning; thus in realistic cases, learning\nthe difference classi\ufb01er adds only a small overhead to the total label requirement, and overall we get\nlabel savings over using the oracle alone.\n\nRelated Work. There has been a considerable amount of empirical work on active learning where\nmultiple annotators can provide labels for the unlabeled examples. One line of work assumes a\ngenerative model for each annotator\u2019s labels. The learning algorithm learns the parameters of the\nindividual labelers, and uses them to decide which labeler to query for each example. [28, 29, 12]\nconsider separate logistic regression models for each annotator, while [18, 19] assume that each\nannotator\u2019s labels are corrupted with a different amount of random classi\ufb01cation noise. A second\nline of work [11, 15] that includes Pro-Active Learning, assumes that each labeler is an expert\nover an unknown subset of categories, and uses data to measure the class-wise expertise in order to\noptimally place label queries. In general, it is not known under what conditions these algorithms are\nstatistically consistent, particularly when the modeling assumptions do not strictly hold, and under\nwhat conditions they provide label savings over regular active learning.\n\n[24], the \ufb01rst theoretical work to consider this problem, consider a model where the weak labeler\nis more likely to provide incorrect labels in heterogeneous regions of space where similar examples\nhave different labels. 
Their formalization is orthogonal to ours \u2013 while theirs is more natural in a\nnon-parametric setting, ours is more natural for \ufb01tting classi\ufb01ers in a hypothesis class. In a NIPS\n2014 Workshop paper, [20] have also considered learning from strong and weak labelers; unlike\nours, their work is in the online selective sampling setting, and applies only to linear classi\ufb01ers and\nrobust regression. [10] study learning from multiple teachers in the online selective sampling setting\nin a model where different labelers have different regions of expertise.\n\nFinally, there is a large body of theoretical work [1, 8, 9, 13, 30, 2, 4] on learning a binary classi\ufb01er\nbased on interactive label queries made to a single labeler. In the realizable case, [21, 8] show\nthat a generalization of binary search provides an exponential improvement in label complexity\nover passive learning. The problem is more challenging, however, in the more realistic agnostic\ncase, where such approaches lead to inconsistency. The two styles of algorithms for agnostic active\nlearning are disagreement-based active learning (DBAL) [1, 9, 13, 4] and the more recent margin-\nbased or con\ufb01dence-based active learning [2, 30]. Our algorithm builds on recent work in DBAL [4,\n14].\n\n2 Preliminaries\n\nThe Model. We begin with a general framework for actively learning from weak and strong labelers.\nIn the standard active learning setting, we are given unlabelled data drawn from a distribution U over\nan input space X , a label space Y = {\u22121, 1}, a hypothesis class H , and a labeling oracle O to\nwhich we can make interactive queries.\n\n2\n\n\fIn our setting, we additionally have access to a weak labeling oracle W which we can query inter-\nactively. 
Querying W is significantly cheaper than querying O; however, querying W generates a\nlabel yW drawn from a conditional distribution PW (yW|x) which is not the same as the conditional\ndistribution PO(yO|x) of O.\nLet D be the data distribution over labelled examples such that: PD(x, y) = PU (x)PO(y|x). Our goal\nis to learn a classifier h in the hypothesis class H such that with probability \u2265 1\u2212\u03b4 over the sample,\nwe have: PD(h(x) \u2260 y) \u2264 minh\u2032\u2208H PD(h\u2032(x) \u2260 y) + \u03b5, while making as few (interactive) queries to\nO as possible.\n\nObserve that in this model W may disagree with the oracle O anywhere in the input space; this\nis unlike previous frameworks [7, 23] where labels assigned by the weak labeler are corrupted by\nrandom classification noise with a higher variance than the labeling oracle. We believe this feature\nmakes our model more realistic.\n\nSecond, unlike [24], mistakes made by the weak labeler do not have to be close to the decision\nboundary. This keeps the model general and simple, and allows greater flexibility to weak labelers.\nOur analysis shows that if W is largely incorrect close to the decision boundary, then our algorithm\nwill automatically make more queries to O in its later stages.\n\nFinally note that O is allowed to be non-realizable with respect to the target hypothesis class H .\n\nBackground on Active Learning Algorithms. The standard active learning setting is very similar\nto ours, the only difference being that we have access to the weak oracle W . There has been a long\nline of work on active learning [1, 6, 8, 13, 2, 9, 4, 30]. Our algorithms are based on a style called\ndisagreement-based active learning (DBAL). The main idea is as follows. Based on the examples\nseen so far, the algorithm maintains a candidate set Vt of classifiers in H that is guaranteed with\nhigh probability to contain h\u2217, the classifier in H with the lowest error. 
Given a randomly drawn\nunlabeled example xt , if all classifiers in Vt agree on its label, then this label is inferred; observe that\nwith high probability, this inferred label is h\u2217(xt). Otherwise, xt is said to be in the disagreement\nregion of Vt , and the algorithm queries O for its label. Vt is updated based on xt and its label, and the\nalgorithm continues.\n\nRecent works in DBAL [9, 4] have observed that it is possible to determine if an xt is in the disagreement region of Vt without explicitly maintaining Vt . Instead, a labelled dataset St is maintained;\nthe labels of the examples in St are obtained by either querying the oracle or direct inference. To\ndetermine whether an xt lies in the disagreement region of Vt , two constrained ERM procedures are\nperformed; empirical risk is minimized over St while constraining the classifier to output the label\nof xt as 1 and \u22121 respectively. If these two classifiers have similar training errors, then xt lies in\nthe disagreement region of Vt ; otherwise the algorithm infers a label for xt that agrees with the label\nassigned by h\u2217.\n\nMore Definitions and Notation. The error of a classifier h under a labelled data distribution Q is\ndefined as: errQ(h) = P(x,y)\u223cQ(h(x) \u2260 y); we use the notation err(h, S) to denote its empirical error\non a labelled data set S. We use the notation h\u2217 to denote the classifier with the lowest error under D\nand \u03bd to denote its error errD(h\u2217), where D is the target labelled data distribution.\nOur active learning algorithm implicitly maintains a (1\u2212 \u03b4)-confidence set for h\u2217 throughout the\nalgorithm. 
Given a set S of labelled examples, a set of classifiers V (S) \u2286 H is said to be a (1\u2212 \u03b4)-confidence set for h\u2217 with respect to S if h\u2217 \u2208 V with probability \u2265 1\u2212 \u03b4 over S.\nThe disagreement between two classifiers h1 and h2 under an unlabelled data distribution U , denoted\nby \u03c1U (h1, h2), is Px\u223cU (h1(x) \u2260 h2(x)). Observe that the disagreements under U form a pseudometric over H . We use BU (h, r) to denote a ball of radius r centered around h in this metric. The\ndisagreement region of a set V of classifiers, denoted by DIS(V ), is the set of all examples x \u2208 X\nsuch that there exist two classifiers h1 and h2 in V for which h1(x) \u2260 h2(x).\n\n3 Algorithm\n\nOur main algorithm is a standard single-annotator DBAL algorithm with a major modification: when\nthe DBAL algorithm makes a label query, we use an extra sub-routine to decide whether this query\nshould be made to the oracle or the weak labeler, and make it accordingly. How do we make this\ndecision? We try to predict if the weak labeler differs from the oracle on this example; if so, query the\noracle, otherwise, query the weak labeler.\n\nKey Idea 1: Cost Sensitive Difference Classifier. How do we predict if the weak labeler differs\nfrom the oracle? A plausible approach is to learn a difference classifier hd f in a hypothesis class\nH d f to determine if there is a difference. Our first key observation is that when the region where\nO and W differ cannot be perfectly modeled by H d f , the resulting active learning algorithm is\nstatistically inconsistent. Any false negative errors (that is, incorrectly predicting no difference)\nmade by the difference classifier lead to biased annotation for the target classification task, which in\nturn leads to inconsistency. 
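The query-routing rule described above -- predict whether W and O differ on the example and route the query accordingly -- can be sketched as follows. The function and oracle names here are illustrative, not part of the paper's pseudocode:

```python
def route_query(x, h_df, strong_oracle, weak_oracle):
    """Route one label query using a (hypothetical) difference classifier.

    h_df(x) = +1 means "W and O likely differ on x"; -1 means "they agree".
    """
    if h_df(x) == +1:
        return strong_oracle(x)   # predicted difference: pay for a query to O
    return weak_oracle(x)         # predicted agreement: use the cheap labeler W
```

In the full algorithm this rule is applied only to examples inside the disagreement region of the current confidence set; outside that region the label is inferred without any query.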
We address this problem by instead learning a cost-sensitive difference\nclassi\ufb01er and we assume that a classi\ufb01er with low false negative error exists in H d f . While training,\nwe constrain the false negative error of the difference classi\ufb01er to be low, and minimize the number\nof predicted positives (or disagreements between W and O) subject to this constraint. This ensures\nthat the annotated data used by the active learning algorithm has diminishing bias, thus ensuring\nconsistency.\n\nKey Idea 2: Localized Difference Classi\ufb01er Training. Unfortunately, even with cost-sensitive\ntraining, directly learning a difference classi\ufb01er accurately is expensive. If d\u2032 is the VC-dimension\nof the difference hypothesis class H d f , to learn a target classi\ufb01er to excess error \u03b5, we need a\ndifference classi\ufb01er with false negative error O(\u03b5), which, from standard generalization theory, re-\nquires \u02dcO(d\u2032/\u03b5) labels [5, 22]! Our second key observation is that we can save on labels by training\nthe difference classi\ufb01er in a localized manner \u2013 because the DBAL algorithm that builds the target\nclassi\ufb01er only makes label queries in the disagreement region of the current con\ufb01dence set for h\u2217.\nTherefore we train the difference classi\ufb01er only on this region and still maintain consistency. Addi-\ntionally this provides label savings because while training the target classi\ufb01er to excess error \u03b5, we\nneed to train a difference classi\ufb01er with only \u02dcO(d\u2032\u03c6k/\u03b5) labels where \u03c6k is the probability mass of\nthis disagreement region. The localized training process leads to an additional technical challenge:\nas the con\ufb01dence set for h\u2217 is updated, its disagreement region changes. 
We address this through an\nepoch-based DBAL algorithm, where the confidence set is updated and a fresh difference classifier\nis trained in each epoch.\n\nMain Algorithm. Our main algorithm (Algorithm 1) combines these two key ideas, and like [4],\nimplicitly maintains the (1\u2212 \u03b4)-confidence set for h\u2217 through a labeled dataset \u02c6Sk. In epoch k,\nthe target excess error is \u03b5k \u2248 1/2^k, and the goal of Algorithm 1 is to generate a labeled dataset \u02c6Sk\nthat implicitly represents a (1\u2212 \u03b4k)-confidence set on h\u2217. Additionally, \u02c6Sk has the property that the\nempirical risk minimizer over it has excess error \u2264 \u03b5k.\nA naive way to generate such an \u02c6Sk is by drawing \u02dcO(d/\u03b5k^2) labeled examples, where d is the VC\ndimension of H . Our goal, however, is to generate \u02c6Sk using a much smaller number of label queries,\nwhich is accomplished by Algorithm 5. This is done in two ways. First, like standard DBAL, we\ninfer the label of any x that lies outside the disagreement region of the current confidence set for h\u2217.\nAlgorithm 4 identifies whether an x lies in this region. Second, for any x in the disagreement region,\nwe determine whether O and W agree on x using a difference classifier; if there is agreement, we\nquery W , else we query O. The difference classifier used to determine agreement is retrained in the\nbeginning of each epoch by Algorithm 2, which ensures that the annotation has low bias.\n\nThe algorithms use a constrained ERM procedure CONS-LEARN. Given a hypothesis class H, a\nlabeled dataset S and a set of constraining examples C, CONS-LEARNH (C, S) returns a classifier in\nH that minimizes the empirical error on S subject to h(xi) = yi for each (xi, yi) \u2208 C.\n\nIdentifying the Disagreement Region. 
Algorithm 4 (deferred to the Appendix) identi\ufb01es if an\n\nunlabeled example x lies in the disagreement region of the current (1 \u2212 \u03b4)-con\ufb01dence set for h\u2217;\nrecall that this con\ufb01dence set is implicitly maintained through \u02c6Sk. The identi\ufb01cation is based on two\nERM queries. Let \u02c6h be the empirical risk minimizer on the current labeled dataset \u02c6Sk\u22121, and \u02c6h\u2032 be\nthe empirical risk minimizer on \u02c6Sk\u22121 under the constraint that \u02c6h\u2032(x) = \u2212\u02c6h(x). If the training errors\nof \u02c6h and \u02c6h\u2032 are very different, then, all classi\ufb01ers with training error close to that of \u02c6h assign the same\nlabel to x, and x lies outside the current disagreement region.\n\n4\n\n\fTraining the Difference Classi\ufb01er. Algorithm 2 trains a difference classi\ufb01er on a random set of\nexamples which lies in the disagreement region of the current con\ufb01dence set for h\u2217. The training\nprocess is cost-sensitive, and is similar to [16, 17, 5, 22]. A hard bound is imposed on the false-\nnegative error, which translates to a bound on the annotation bias for the target task. The number of\npositives (i.e., the number of examples where W and O differ) is minimized subject to this constraint;\nthis amounts to (approximately) minimizing the fraction of queries made to O.\n\nThe number of labeled examples used in training is large enough to ensure false negative error\nO(\u03b5k/\u03c6k) over the disagreement region of the current con\ufb01dence set; here \u03c6k is the probability mass\nof this disagreement region under U . This ensures that the overall annotation bias introduced by\nthis procedure in the target task is at most O(\u03b5k). 
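The two-ERM identification just described can be sketched as follows for a finite hypothesis class. `cons_learn`, the `slack` parameter, and the threshold classifiers used below are illustrative stand-ins, not the paper's implementation:

```python
def cons_learn(H, S, constraints):
    """CONS-LEARN over a finite class H (for illustration): among classifiers
    satisfying h(x) == y for all (x, y) in `constraints`, return one that
    minimizes empirical error on the labeled set S (None if infeasible)."""
    feasible = [h for h in H if all(h(x) == y for x, y in constraints)]
    if not feasible:
        return None
    return min(feasible, key=lambda h: sum(h(x) != y for x, y in S))

def in_disagreement_region(H, S, x, slack):
    """Two-ERM test: constrain the label of x to +1 and to -1 in turn;
    x lies in the disagreement region iff the two constrained training
    errors are within `slack` of each other."""
    errs = []
    for label in (+1, -1):
        h = cons_learn(H, S, [(x, label)])
        errs.append(float('inf') if h is None
                    else sum(h(xi) != yi for xi, yi in S))
    return abs(errs[0] - errs[1]) <= slack
```

For example, with 1-D threshold classifiers and a sample labeled negative below 0.2 and positive above 0.8, a point at 0.5 lies in the disagreement region while a point at 0.9 does not.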
As \u03c6k is small and typically diminishes with k,\nthis requires fewer labels than training the difference classifier globally, which would have required\n\u02dcO(d\u2032/\u03b5k) queries to O.\n\nAlgorithm 1 Active Learning Algorithm from Weak and Strong Labelers\n\n1: Input: Unlabeled distribution U , target excess error \u03b5, confidence \u03b4, labeling oracle O, weak\noracle W , hypothesis class H , hypothesis class for difference classifier H d f .\n2: Output: Classifier \u02c6h in H .\n3: Initialize: initial error \u03b50 = 1, confidence \u03b40 = \u03b4/4. Total number of epochs k0 = \u2308log(1/\u03b5)\u2309.\n4: Initial number of examples n0 = O((1/\u03b50^2)(d ln(1/\u03b50^2) + ln(1/\u03b40))).\n5: Draw a fresh sample and query O for its labels: \u02c6S0 = {(x1, y1), . . . , (xn0, yn0 )}. Let \u03c30 = \u03c3(n0, \u03b40).\n6: for k = 1, 2, . . . , k0 do\n7: Set target excess error \u03b5k = 2^-k, confidence \u03b4k = \u03b4/(4(k + 1)^2).\n8: # Train Difference Classifier\n9: \u02c6hd f k \u2190 Call Algorithm 2 with inputs unlabeled distribution U , oracles W and O, target excess error \u03b5k, confidence \u03b4k/2, previously labeled dataset \u02c6Sk\u22121.\n10: # Adaptive Active Learning using Difference Classifier\n11: \u03c3k, \u02c6Sk \u2190 Call Algorithm 5 with inputs unlabeled distribution U , oracles W and O, difference classifier \u02c6hd f k , target excess error \u03b5k, confidence \u03b4k/2, previously labeled dataset \u02c6Sk\u22121.\n12: end for\n13: return \u02c6h \u2190 CONS-LEARNH (\u2205, \u02c6Sk0 ).\n\nAdaptive Active Learning using the Difference Classifier. Finally, Algorithm 5 (deferred to the\nAppendix) is our main active learning procedure, which generates a labeled dataset \u02c6Sk that is implicitly used to maintain a tighter (1\u2212 \u03b4)-confidence set for h\u2217. 
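The epoch structure of Algorithm 1 can be sketched as follows. The callables are illustrative stand-ins, not the paper's subroutines: `train_diff` plays the role of Algorithm 2, `adaptive_label` the role of Algorithm 5, and `erm` the final CONS-LEARN call:

```python
import math

def active_learn_skeleton(eps, delta, train_diff, adaptive_label, erm):
    """Skeleton of Algorithm 1's epoch loop (illustrative signatures)."""
    k0 = math.ceil(math.log2(1.0 / eps))          # step 3: number of epochs
    S = adaptive_label(None, 1.0, delta / 4.0)    # step 5: initial labeled set from O
    for k in range(1, k0 + 1):
        eps_k = 2.0 ** -k                         # step 7: target excess error
        delta_k = delta / (4.0 * (k + 1) ** 2)    # step 7: per-epoch confidence
        h_df = train_diff(S, eps_k, delta_k / 2.0)       # step 9: Algorithm 2
        S = adaptive_label(h_df, eps_k, delta_k / 2.0)   # step 11: Algorithm 5
    return erm(S)                                 # step 13: ERM over the final S
```

Each epoch halves the target excess error, so reaching excess error \u03b5 takes \u2308log(1/\u03b5)\u2309 epochs, with a fresh difference classifier trained on the current disagreement region at the start of each.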
Specifically, Algorithm 5 generates\na \u02c6Sk such that the set Vk defined as:\n\nVk = {h : err(h, \u02c6Sk) \u2212 min_{\u02c6hk\u2208H} err(\u02c6hk, \u02c6Sk) \u2264 3\u03b5k/4}\n\nhas the property that:\n\n{h : errD(h) \u2212 errD(h\u2217) \u2264 \u03b5k/2} \u2286 Vk \u2286 {h : errD(h) \u2212 errD(h\u2217) \u2264 \u03b5k}\n\nThis is achieved by labeling, through inference or query, a large enough sample of unlabeled data\ndrawn from U . Labels are obtained from three sources \u2013 direct inference (if x lies outside the disagreement region as identified by Algorithm 4), querying O (if the difference classifier predicts a\ndifference), and querying W . How large should the sample be to reach the target excess error? If\nerrD(h\u2217) = \u03bd, then achieving an excess error of \u03b5 requires \u02dcO(d\u03bd/\u03b5k^2) samples, where d is the VC\ndimension of the hypothesis class. As \u03bd is unknown in advance, we use a doubling procedure in\nlines 4-14 to iteratively determine the sample size.\n\n1 Note that if in Algorithm 3, the upper confidence bound of Px\u223cU (in_disagr_region( \u02c6T , 3\u03b5/2, x) = 1) is lower\nthan \u03b5/64, then we can halt Algorithm 2 and return an arbitrary hd f in H d f . Using this hd f will still guarantee\nthe correctness of Algorithm 1.\n\nAlgorithm 2 Training Algorithm for Difference Classifier\n1: Input: Unlabeled distribution U , oracles W and O, target error \u03b5, hypothesis class H d f , confidence \u03b4, previous labeled dataset \u02c6T .\n2: Output: Difference classifier \u02c6hd f .\n3: Let \u02c6p be an estimate of Px\u223cU (in_disagr_region( \u02c6T , 3\u03b5/2, x) = 1), obtained by calling Algorithm 3 (deferred to the Appendix) with failure probability \u03b4/3.\n4: Let U\u2032 = \u2205, i = 1, and\n\nm = (64 \u00b7 1024 \u02c6p/\u03b5)(d\u2032 ln(512 \u00b7 1024 \u02c6p/\u03b5) + ln(72/\u03b4))   (1)\n\n5: repeat\n6: Draw an example xi from U .\n7: if in_disagr_region( \u02c6T , 3\u03b5/2, xi) = 1 then # xi is inside the disagreement region\n8: query both W and O for labels to get yi,W and yi,O.\n9: U\u2032 = U\u2032 \u222a {(xi, yi,O, yi,W )}\n10: i = i + 1\n11: end if\n12: until |U\u2032| = m\n13: Learn a classifier \u02c6hd f \u2208 H d f as the following constrained empirical risk minimizer:\n\n\u02c6hd f = argmin_{hd f \u2208H d f} \u2211_{i=1}^{m} 1(hd f (xi) = +1), s.t. \u2211_{i=1}^{m} 1(hd f (xi) = \u22121 \u2227 yi,O \u2260 yi,W ) \u2264 m\u03b5/(256 \u02c6p)   (2)\n\n14: return \u02c6hd f .\n\n4 Performance Guarantees\n\nWe now examine the performance of our algorithm, which is measured by the number of label\nqueries made to the oracle O. Additionally we require our algorithm to be statistically consistent,\nwhich means that the true error of the output classifier should converge to the true error of the best\nclassifier in H on the data distribution D.\n\nSince our framework is very general, we cannot expect any statistically consistent algorithm to\nachieve label savings over using O alone under all circumstances. For example, if labels provided\nby W are the complete opposite of O, no algorithm will achieve both consistency and label savings.\nWe next provide an assumption under which Algorithm 1 works and yields label savings.\n\nAssumption. The following assumption states that the difference hypothesis class contains a good cost-sensitive predictor of when O and W differ in the disagreement region of BU (h\u2217, r); a predictor is\ngood if it has low false-negative error and predicts a positive label with low frequency. 
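The constrained minimization in (2) can be sketched for a finite difference class as follows. The `budget` parameter (standing in for m\u03b5/(256 \u02c6p)) and the threshold classifiers in the usage below are illustrative assumptions:

```python
def train_difference_classifier(H_df, samples, budget):
    """Cost-sensitive ERM in the spirit of Eq. (2), over a finite class H_df
    (an illustrative sketch).  `samples` holds triples (x, y_O, y_W) drawn
    from the disagreement region.  Among classifiers whose false-negative
    count (predicting "agree", i.e. -1, where y_O != y_W) is at most
    `budget`, return one minimizing the number of predicted positives,
    i.e. the number of future queries routed to the oracle O."""
    def false_negatives(h):
        return sum(1 for x, y_o, y_w in samples if h(x) == -1 and y_o != y_w)
    def positives(h):
        return sum(1 for x, _, _ in samples if h(x) == +1)
    feasible = [h for h in H_df if false_negatives(h) <= budget]
    # The all-positive classifier is always feasible, so we assume H_df contains it.
    return min(feasible, key=positives)
```

With a zero false-negative budget, the returned classifier predicts a difference on every sample where W and O actually disagreed, while flagging as few other points as possible.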
If there is no\nsuch predictor, then we cannot expect an algorithm similar to ours to achieve label savings.\nAssumption 1. Let D be the joint distribution: PD (x, yO, yW ) = PU (x)PW (yW|x)PO(yO|x). For any\nr, \u03b7 > 0, there exists an hd f_{\u03b7,r} \u2208 H d f with the following properties:\n\nPD (hd f_{\u03b7,r}(x) = \u22121, x \u2208 DIS(BU (h\u2217, r)), yO \u2260 yW ) \u2264 \u03b7   (3)\n\nPD (hd f_{\u03b7,r}(x) = 1, x \u2208 DIS(BU (h\u2217, r))) \u2264 \u03b1(r, \u03b7)   (4)\n\nNote that (3), which states there is an hd f \u2208 H d f with low false-negative error, is minimally restrictive, and is trivially satisfied if H d f includes the constant classifier that always predicts 1.\nTheorem 1 shows that (3) is sufficient to ensure statistical consistency.\n\n(4) in addition states that the number of positives predicted by the classifier hd f_{\u03b7,r} is upper bounded\nby \u03b1(r, \u03b7). Note \u03b1(r, \u03b7) \u2264 PU (DIS(BU (h\u2217, r))) always; performance gain is obtained when \u03b1(r, \u03b7)\nis lower, which happens when the difference classifier predicts agreement on a significant portion of\nDIS(BU (h\u2217, r)).\n\nConsistency. Provided Assumption 1 holds, we next show that Algorithm 1 is statistically consistent. Establishing consistency is non-trivial for our algorithm as the output classifier is trained on\nlabels from both O and W .\nTheorem 1 (Consistency). Let h\u2217 be the classifier that minimizes the error with respect to D. If\nAssumption 1 holds, then with probability \u2265 1\u2212 \u03b4, the classifier \u02c6h output by Algorithm 1 satisfies:\nerrD(\u02c6h) \u2264 errD(h\u2217) + \u03b5.\n\nLabel Complexity. The label complexity of standard DBAL is measured in terms of the disagreement coefficient. The disagreement coefficient \u03b8(r) at scale r is defined as:\n\n\u03b8(r) = sup_{h\u2208H} sup_{r\u2032\u2265r} PU (DIS(BU (h, r\u2032)))/r\u2032;\n\nintuitively, this measures the rate of shrinkage of the disagreement\nregion with the radius of the ball BU (h, r) for any h in H . It was shown by [9] that the label complexity of DBAL for target excess generalization error \u03b5 is \u02dcO(d\u03b8(2\u03bd + \u03b5)(1 + \u03bd^2/\u03b5^2)) where the \u02dcO\nnotation hides factors logarithmic in 1/\u03b5 and 1/\u03b4. In contrast, the label complexity of our algorithm is stated in Theorem 2. Here we use the \u02dcO notation for convenience; we have the same\ndependence on log 1/\u03b5 and log 1/\u03b4 as the bounds for DBAL.\nTheorem 2 (Label Complexity). Let d be the VC dimension of H and let d\u2032 be the VC dimension\nof H d f . If Assumption 1 holds, and if the error of the best classifier in H on D is \u03bd, then with\nprobability \u2265 1\u2212 \u03b4, the following hold:\n\n1. The number of label queries made by Algorithm 1 to the oracle O in epoch k is at most:\n\nmk = \u02dcO( d(2\u03bd + \u03b5k\u22121)(\u03b1(2\u03bd + \u03b5k\u22121, \u03b5k\u22121/1024) + \u03b5k\u22121)/\u03b5k^2 + d\u2032 P(DIS(BU (h\u2217, 2\u03bd + \u03b5k\u22121)))/\u03b5k )   (5)\n\n2. The total number of label queries made by Algorithm 1 to the oracle O is at most:\n\n\u02dcO( sup_{r\u2265\u03b5} [\u03b1(2\u03bd + r, r/1024) + r]/(2\u03bd + r) \u00b7 d(\u03bd^2/\u03b5^2 + 1) + \u03b8(2\u03bd + \u03b5) d\u2032 (\u03bd/\u03b5 + 1) )   (6)\n\n4.1 Discussion\n\nThe first terms in (5) and (6) represent the labels needed to learn the target classifier, and the second\nterms represent the overhead in learning the difference classifier.\nIn the realistic agnostic case (where \u03bd > 0), as \u03b5 \u2192 0, the second terms are lower order compared\nto the label complexity of DBAL. 
Thus even if d\u2032 is somewhat larger than d, fitting the difference\nclassifier does not incur an asymptotically high overhead in the more realistic agnostic case. In the\nrealizable case, when d\u2032 \u2248 d, the second terms are of the same order as the first; therefore we should\nuse a simpler difference hypothesis class H d f in this case. We believe that the lower order overhead\nterm comes from the fact that there exists a classifier in H d f whose false negative error is very low.\n\nComparing Theorem 2 with the corresponding results for DBAL, we observe that instead of\n\u03b8(2\u03bd + \u03b5), we have the term sup_{r\u2265\u03b5} \u03b1(2\u03bd + r, r/1024)/(2\u03bd + r). Since sup_{r\u2265\u03b5} \u03b1(2\u03bd + r, r/1024)/(2\u03bd + r) \u2264 \u03b8(2\u03bd + \u03b5), the\nworst case asymptotic label complexity is the same as that of standard DBAL. This label complexity\nmay be considerably better however if sup_{r\u2265\u03b5} \u03b1(2\u03bd + r, r/1024)/(2\u03bd + r)\nis less than the disagreement coefficient.\nAs we expect, this will happen when the region of difference between W and O restricted to the disagreement regions is relatively small, and this region is well-modeled by the difference hypothesis\nclass H d f .\n\nAn interesting case is when the weak labeler differs from O close to the decision boundary and agrees\nwith O away from this boundary. In this case, any consistent algorithm should switch to querying O\nclose to the decision boundary. Indeed in earlier epochs, \u03b1 is low, and our algorithm obtains a good\ndifference classifier and achieves label savings. In later epochs, \u03b1 is high, the difference classifiers\nalways predict a difference and the label complexity of the later epochs of our algorithm is the same\norder as DBAL. 
In practice, if we suspect that we are in this case, we can switch to plain active learning once ε_k is small enough.

Case Study: Linear Classification under the Uniform Distribution. We provide a simple example where our algorithm provides a better asymptotic label complexity than DBAL. Let H be the class of homogeneous linear separators on the d-dimensional unit ball and let H_df = {hΔh′ : h, h′ ∈ H}. Furthermore, let U be the uniform distribution over the unit ball.

Figure 1: Linear classification over the unit ball with d = 2. Left: decision boundary of labeler O and h* = h_{w*}; the region where O differs from h* is shaded, with P({x : h_{w*}(x) ≠ y_O}) = ν. Middle: decision boundary of weak labeler W. Right: h̄_df, W and O, with P({x : h̄_df(x) = 1}) = g = o(√d ν). Note that {x : P(y_O ≠ y_W | x) > 0} ⊆ {x : h̄_df(x) = 1}.

Suppose that O is a deterministic labeler such that err_D(h*) = ν > 0. Moreover, suppose that W is such that there exists a difference classifier h̄_df with false negative error 0 for which P_U(h̄_df(x) = 1) ≤ g. Additionally, we assume that g = o(√d ν); observe that this is not a strict assumption on H_df, as ν could be as much as a constant. Figure 1 shows an example in d = 2 that satisfies these assumptions. In this case, as ε → 0, Theorem 2 gives the following label complexity bound.

Corollary 1.
With probability ≥ 1 − δ, the number of label queries made to oracle O by Algorithm 1 is

Õ( d max(g/ν, 1)(ν²/ε² + 1) + d^{3/2}(1 + ν/ε) ),

where the Õ notation hides factors logarithmic in 1/ε and 1/δ. As g = o(√d ν), this improves over the label complexity of DBAL, which is Õ(d^{3/2}(1 + ν²/ε²)).

Conclusion. In this paper, we take a step towards a theoretical understanding of active learning from multiple annotators through a learning-theoretic formalization for learning from weak and strong labelers. Our work shows that multiple annotators can be successfully combined to do active learning in a statistically consistent manner under a general setting with few assumptions; moreover, under reasonable conditions, this kind of learning can provide label savings over plain active learning.

An avenue for future work is to explore a more general setting where we have multiple labelers with expertise on different regions of the input space. Can we combine inputs from such labelers in a statistically consistent manner? Second, our algorithm is intended for a setting where W is biased, and it performs suboptimally when the label generated by W is a random corruption of the label provided by O. How can we account for both random noise and bias in active learning from weak and strong labelers?

Acknowledgements

We thank NSF for research support under IIS-1162581, and Jennifer Dy for introducing us to the problem of active learning from multiple labelers.

References

[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78–89, 2009.

[2] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In COLT, 2013.

[3] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Active learning with an ERM oracle, 2009.

[4] A.
Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.

[5] N. H. Bshouty and L. Burroughs. Maximizing agreements with one-sided error with applications to heuristic learning. Machine Learning, 59(1-2):99–123, 2005.

[6] D. A. Cohn, L. E. Atlas, and R. E. Ladner. Improving generalization with active learning. Machine Learning, 15(2), 1994.

[7] K. Crammer, M. J. Kearns, and J. Wortman. Learning from data of variable quality. In NIPS, 2005.

[8] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.

[9] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.

[10] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. JMLR, 13:2655–2697, 2012.

[11] P. Donmez and J. Carbonell. Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In CIKM, 2008.

[12] M. Fang, X. Zhu, B. Li, W. Ding, and X. Wu. Self-taught active learning from crowds. In ICDM, pages 858–863. IEEE, 2012.

[13] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.

[14] D. Hsu. Algorithms for Active Learning. PhD thesis, UC San Diego, 2010.

[15] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang. Repeated labeling using multiple noisy labelers. Data Mining and Knowledge Discovery, 28(2):402–441, 2014.

[16] A. T. Kalai, V. Kanade, and Y. Mansour. Reliable agnostic learning. J. Comput. Syst. Sci., 78(5):1481–1495, 2012.

[17] V. Kanade and J. Thaler. Distribution-independent reliable learning. In COLT, 2014.

[18] C. H. Lin, Mausam, and D. S. Weld. To re(label), or not to re(label). In HCOMP, 2014.

[19] C. H. Lin, Mausam, and D. S. Weld.
Reactive learning: Actively trading off larger noisier training sets against smaller cleaner ones. In ICML Workshop on Crowdsourcing and Machine Learning and ICML Active Learning Workshop, 2015.

[20] L. Malago, N. Cesa-Bianchi, and J. Renders. Online active learning with strong and weak annotators. In NIPS Workshop on Learning from the Wisdom of Crowds, 2014.

[21] R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.

[22] H. U. Simon. PAC-learning in the presence of one-sided classification noise. Ann. Math. Artif. Intell., 71(4):283–300, 2014.

[23] S. Song, K. Chaudhuri, and A. D. Sarwate. Learning from data with heterogeneous noise using SGD. In AISTATS, 2015.

[24] R. Urner, S. Ben-David, and O. Shamir. Learning from weak teachers. In AISTATS, pages 1252–1260, 2012.

[25] S. Vijayanarasimhan and K. Grauman. What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations. In CVPR, pages 2262–2269, 2009.

[26] S. Vijayanarasimhan and K. Grauman. Cost-sensitive active visual category learning. IJCV, 91(1):24–44, 2011.

[27] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In NIPS, pages 2424–2432, 2010.

[28] Y. Yan, R. Rosales, G. Fung, and J. G. Dy. Active learning from crowds. In ICML, pages 1161–1168, 2011.

[29] Y. Yan, R. Rosales, G. Fung, F. Farooq, B. Rao, and J. G. Dy. Active learning from multiple knowledge sources. In AISTATS, pages 1350–1357, 2012.

[30] C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning. In NIPS, 2014.