{"title": "A general agnostic active learning algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 353, "page_last": 360, "abstract": "", "full_text": "A general agnostic active learning algorithm\n\nSanjoy Dasgupta\nUC San Diego\ndasgupta@cs.ucsd.edu\n\nDaniel Hsu\nUC San Diego\ndjhsu@cs.ucsd.edu\n\nClaire Monteleoni\nUC San Diego\ncmontel@cs.ucsd.edu\n\nAbstract\n\nWe present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm\u2019s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally.\n\n1 Introduction\n\nActive learning addresses the issue that, in many applications, labeled data typically comes at a higher cost (e.g. in time, effort) than unlabeled data. An active learner is given unlabeled data and must pay to view any label. The hope is that significantly fewer labeled examples are used than in the supervised (non-active) learning model. Active learning applies to a range of data-rich problems such as genomic sequence annotation and speech recognition. In this paper we formalize, extend, and provide label complexity guarantees for one of the earliest and simplest approaches to active learning\u2014one due to Cohn, Atlas, and Ladner [1].\n\nThe scheme of [1] examines data one by one in a stream and requests the label of any data point about which it is currently unsure. 
For example, suppose the hypothesis class consists of linear separators in the plane, and assume that the data is linearly separable. Let the first six data be labeled as follows.\n\n[Figure: six points in the plane, labeled \u2295 and \u2296, with an arrow indicating a seventh point that lies among the \u2296s.]\n\nThe learner does not need to request the label of the seventh point (indicated by the arrow) because it is not unsure about the label: any straight line with the \u2295s and \u2296s on opposite sides has the seventh point with the \u2296s. Put another way, the point is not in the region of uncertainty [1], the portion of the data space for which there is disagreement among hypotheses consistent with the present labeled data.\n\nAlthough very elegant and intuitive, this approach to active learning faces two problems:\n\n1. Explicitly maintaining the region of uncertainty can be computationally cumbersome.\n\n2. Data is usually not perfectly separable.\n\nOur main contribution is to address these problems. We provide a simple generalization of the selective sampling scheme of [1] that tolerates adversarial noise and never requests many more labels than a standard agnostic supervised learner would to learn a hypothesis with the same error.\n\nIn the previous example, an agnostic active learner (one that does not assume a perfect separator exists) is actually still uncertain about the label of the seventh point, because all six of the previous labels could be inconsistent with the best separator. Therefore, it should still request the label. 
On the other hand, after enough points have been labeled, if an unlabeled point occurs at the position shown below, chances are its label is not needed.\n\n[Figure: many labeled points, with an unlabeled point positioned well inside the \u2296 region.]\n\nTo extend the notion of uncertainty to the agnostic setting, we divide the sampled data into two groups, S and T: S contains the data for which we have determined the label ourselves (we explain below how to ensure that they are consistent with the best separator in the class) and T contains the data for which we have explicitly requested a label. Now, somewhat counter-intuitively, the labels in S are completely reliable, whereas the labels in T could be inconsistent with the best separator. To decide if we are uncertain about the label of a new point x, we reduce to a supervised learning task: for each possible label \u02c6y \u2208 {\u00b11}, we learn a hypothesis h\u02c6y consistent with the labels in S \u222a {(x, \u02c6y)} and with minimal empirical error on T. If, say, the error of the hypothesis h+1 is much larger than that of h\u22121, we can safely infer that the best separator must also label x with \u22121 without requesting a label; if the error difference is only modest, we explicitly request a label. Standard generalization bounds for an i.i.d. sample let us perform this test by comparing empirical errors on S \u222a T.\n\nThe last claim may sound awfully suspicious, because S \u222a T is not i.i.d.! Indeed, this is in a sense the core sampling problem that has always plagued active learning: the labeled sample T might not be i.i.d. (due to the filtering of examples based on an adaptive criterion), while S only contains unlabeled examples (with made-up labels). Nevertheless, we prove that in our case, it is in fact correct to effectively pretend S \u222a T is an i.i.d. sample. 
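As a concrete illustration of the reduction just described, here is a minimal Python sketch of the uncertainty test. It assumes a small finite hypothesis class and a brute-force constrained learner (both hypothetical simplifications for illustration; the paper only assumes an abstract supervised learning oracle):

```python
import math

def erm(hypotheses, S, T):
    # Return a hypothesis consistent with every (x, y) in S that makes the
    # fewest mistakes on T, or None if nothing in the class is consistent with S.
    best, best_err = None, math.inf
    for h in hypotheses:
        if any(h(x) != y for x, y in S):
            continue
        err = sum(h(x) != y for x, y in T)
        if err < best_err:
            best, best_err = h, err
    return best

def infer_or_request(hypotheses, S, T, x, threshold):
    # The uncertainty test: learn h_{+1} and h_{-1}, compare their empirical
    # errors on S + T, and infer the label only when the gap exceeds the threshold.
    Z = S + T
    errs = {}
    for yhat in (+1, -1):
        h = erm(hypotheses, S + [(x, yhat)], T)
        errs[yhat] = math.inf if h is None else (
            sum(h(xi) != yi for xi, yi in Z) / len(Z) if Z else 0.0)
    for yhat in (+1, -1):
        if errs[-yhat] - errs[yhat] > threshold:
            return yhat, False          # safe to infer yhat; no label needed
    return None, True                   # genuinely uncertain: request the label
```

For thresholds on the line, for instance, a point lying to the right of a point already labeled \u2295 cannot be labeled \u22121 by any hypothesis consistent with S, so the \u22121 branch fails and the test infers +1 without a label request.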
A direct consequence is that the label complexity of our algorithm (the number of labels requested before achieving a desired error) is never much more than the usual sample complexity of supervised learning (and in some cases, is significantly less).\n\nAn important algorithmic detail is the specific choice of generalization bound we use in deciding whether to request a label or not. The usual additive bounds with rate n^(\u22121/2) are too loose, e.g. we know in the zero-error case the rate should be n^(\u22121). Our algorithm magnifies this small polynomial difference in the bound into an exponential difference in label complexity, so it is crucial for us to use a good bound. We use a normalized bound that takes into account the empirical error (computed on S \u222a T) of the hypothesis in question.\n\nIn this paper, we present and analyze a simple agnostic active learning algorithm for general hypothesis classes of bounded VC dimension. It extends the selective sampling scheme of Cohn et al. [1] to the agnostic setting, using normalized generalization bounds, which we apply in a simple but subtle manner. For certain hypothesis classes and distributions, our analysis yields improved label complexity guarantees over the standard sample complexity of supervised learning. We also demonstrate such improvements experimentally.\n\n1.1 Related work\n\nOur algorithm extends the selective sampling scheme of Cohn et al. [1] (described above) to the agnostic setting. Most previous work on active learning either makes strong distributional assumptions (e.g. separability, uniform input distribution) [1\u20138], or is generally computationally prohibitive [2, 4, 9]. 
See [10] for a discussion of these results.\n\nA natural way to formulate active learning in the agnostic setting is to ask the learner to return a hypothesis with error at most \u03bd + \u03b5 (where \u03bd is the error of the best hypothesis in the specified class) using as few labels as possible. A basic constraint on the label complexity was pointed out by K\u00e4\u00e4ri\u00e4inen [11], who showed that for any \u03bd \u2208 (0, 1/2), there are data distributions that force any active learner that achieves error at most \u03bd + \u03b5 to request \u2126((\u03bd/\u03b5)^2) labels. The first rigorously-analyzed agnostic active learning algorithm, called A2, was developed recently by Balcan, Beygelzimer, and Langford [9]. Like Cohn-Atlas-Ladner [1], this algorithm uses a region of uncertainty, although the lack of separability complicates matters and A2 ends up explicitly maintaining an \u03b5-net of the hypothesis space. Subsequently, Hanneke [12] characterized the label complexity of the A2 algorithm in terms of a parameter called the disagreement coefficient.\n\nOur work was inspired by both [1] and [9], and we have built heavily upon their insights. Our algorithm overcomes their complications by employing reductions to supervised learning.1 We bound the label complexity of our method in terms of the same parameter as used for A2 [12], and get a somewhat better dependence (linear rather than quadratic).\n\n2 Preliminaries\n\n2.1 Learning framework and uniform convergence\n\nLet X be the input space, D a distribution over X \u00d7 {\u00b11} and H a class of hypotheses h : X \u2192 {\u00b11} with VC dimension vcdim(H) = d < \u221e (the finiteness ensures the nth shatter coefficient S(H, n) is at most O(n^d) by Sauer\u2019s lemma). We denote by DX the marginal of D over X. 
In our active learning model, the learner receives unlabeled data sampled from DX; for any sampled point x, it can optionally request the label y sampled from the conditional distribution at x. This process can be viewed as sampling (x, y) from D and revealing only x to the learner, keeping the label y hidden unless the learner explicitly requests it. The error of a hypothesis h under D is errD(h) = Pr(x,y)\u223cD[h(x) \u2260 y], and on a finite sample Z \u2282 X \u00d7 {\u00b11}, the empirical error of h is err(h, Z) = \u03a3(x,y)\u2208Z 1[h(x) \u2260 y] / |Z|, where 1[\u00b7] is the 0-1 indicator function. We assume for simplicity that the minimal error \u03bd = inf{errD(h) : h \u2208 H} is achieved by a hypothesis h\u2217 \u2208 H.\n\nOur algorithm uses the following normalized uniform convergence bound [14, p. 200].\n\nLemma 1 (Vapnik and Chervonenkis [15]). Let F be a family of measurable functions f : Z \u2192 {0, 1} over a space Z. Denote by EZf the empirical average of f over a subset Z \u2282 Z. Let \u03b1n = \u221a((4/n) ln(8 S(F, 2n)/\u03b4)). If Z is an i.i.d. sample of size n from a fixed distribution over Z, then, with probability at least 1 \u2212 \u03b4, for all f \u2208 F:\n\n\u2212min(\u03b1n \u221a(EZf), \u03b1n^2 + \u03b1n \u221a(Ef)) \u2264 Ef \u2212 EZf \u2264 min(\u03b1n^2 + \u03b1n \u221a(EZf), \u03b1n \u221a(Ef)).\n\n2.2 Disagreement coefficient\n\nWe will bound the label complexity of our algorithm in terms of (a slight variation of) the disagreement coefficient \u03b8 introduced in [12] for analyzing the label complexity of A2.\n\nDefinition 1. The disagreement metric \u03c1 on H is defined by \u03c1(h, h\u2032) = Prx\u223cDX[h(x) \u2260 h\u2032(x)]. The disagreement coefficient \u03b8 = \u03b8(D, H, \u03b5) > 0 is\n\n\u03b8 = sup{ Prx\u223cDX[\u2203h \u2208 B(h\u2217, r) s.t. h(x) \u2260 h\u2217(x)] / r : r \u2265 \u03b5 + \u03bd }\n\nwhere B(h, r) = {h\u2032 \u2208 H : \u03c1(h, h\u2032) < r}, h\u2217 = arg inf h\u2208H errD(h), and \u03bd = errD(h\u2217).\n\nThe quantity \u03b8 bounds the rate at which the disagreement mass of the ball B(h\u2217, r) \u2013 the probability mass of points on which hypotheses in B(h\u2217, r) disagree with h\u2217 \u2013 grows as a function of the radius r. Clearly, \u03b8 \u2264 1/(\u03b5 + \u03bd); furthermore, it is a constant bounded independently of 1/(\u03b5 + \u03bd) in several cases previously considered in the literature [12]. For example, if H is homogeneous linear separators and DX is the uniform distribution over the unit sphere in Rd, then \u03b8 = \u0398(\u221ad).\n\n1 It has been noted that the Cohn-Atlas-Ladner scheme can easily be made tractable using a reduction to supervised learning in the separable case [13, p. 68]. Although our algorithm is most naturally seen as an extension of Cohn-Atlas-Ladner, a similar reduction to supervised learning (in the agnostic setting) can be used for A2 [10].\n\nAlgorithm 1\nInput: stream (x1, x2, . . . , xm) i.i.d. from DX\nInitially, S0 \u2190 \u2205 and T0 \u2190 \u2205.\nFor n = 1, 2, . . . , m:\n\n1. For each \u02c6y \u2208 {\u00b11}, let h\u02c6y \u2190 LEARNH(Sn\u22121 \u222a {(xn, \u02c6y)}, Tn\u22121).\n\n2. If err(h\u2212\u02c6y, Sn\u22121 \u222a Tn\u22121) \u2212 err(h\u02c6y, Sn\u22121 \u222a Tn\u22121) > \u2206n\u22121 for some \u02c6y \u2208 {\u00b11} (or if no such h\u2212\u02c6y is found), then Sn \u2190 Sn\u22121 \u222a {(xn, \u02c6y)} and Tn \u2190 Tn\u22121.\n\n3. Else request yn; Sn \u2190 Sn\u22121 and Tn \u2190 Tn\u22121 \u222a {(xn, yn)}.\n\nReturn hf = LEARNH(Sm, Tm).\n\nFigure 1: The agnostic selective sampling algorithm. See (1) for how to set \u2206n.\n\n3 Agnostic selective sampling\n\nHere we state and analyze our general algorithm for agnostic active learning. 
The main techniques employed by the algorithm are reductions to a supervised learning task and generalization bounds applied to differences of empirical errors.\n\n3.1 A general algorithm for agnostic active learning\n\nFigure 1 states our algorithm in full generality. The input is a stream of m unlabeled examples drawn i.i.d. from DX; for the time being, m can be thought of as \u02dcO((d/\u03b5)(1 + \u03bd/\u03b5)) where \u03b5 is the accuracy parameter.2\n\nFor S, T \u2282 X \u00d7 {\u00b11}, let LEARNH(S, T) denote a supervised learner that returns a hypothesis h \u2208 H consistent with S, and with minimum error on T. Algorithm 1 maintains two sets of labeled examples, S and T, each of which is initially empty. Upon receiving xn, it learns two hypotheses, h\u02c6y = LEARNH(S \u222a {(xn, \u02c6y)}, T) for \u02c6y \u2208 {\u00b11}, and then compares their empirical errors on S \u222a T. If the difference is large enough3, it is possible to infer how h\u2217 labels xn (as we show in Lemma 3). In this case, the algorithm adds xn, with this inferred label, to S. Otherwise, the algorithm requests the label yn and adds (xn, yn) to T. Thus, S contains examples with inferred labels consistent with h\u2217, and T contains examples with their requested labels. Because h\u2217 might err on some examples in T, we just insist that LEARNH find a hypothesis with minimal error on T. Meanwhile, by construction, h\u2217 is consistent with S, so we require LEARNH to only consider hypotheses consistent with S.\n\n3.2 Bounds for error differences\n\nWe still need to specify \u2206n, the threshold value for error differences that determines whether the algorithm requests a label or not. 
Intuitively, \u2206n should reflect how closely empirical errors on a sample approximate true errors on the distribution D. The setting of \u2206n can only depend on observable quantities, so we first clarify the distinction between empirical errors on Sn \u222a Tn and those with respect to the true (hidden) labels.\n\nDefinition 2. Let Sn and Tn be as defined in Algorithm 1. Let S!n be the set of labeled examples identical to those in Sn, except with the true hidden labels swapped in. Thus, for example, S!n \u222a Tn is an i.i.d. sample from D of size n. Finally, let\n\nerr!n(h) = err(h, S!n \u222a Tn) and errn(h) = err(h, Sn \u222a Tn).\n\n2 The \u02dcO notation suppresses log 1/\u03b4 and terms polylogarithmic in those that appear.\n\n3 If LEARNH cannot find a hypothesis consistent with S \u222a {(xn, y)} for some y, then it is clear that h\u2217(xn) = \u2212y. In this case, we simply add (xn, \u2212y) to S, regardless of \u2206n\u22121.\n\nIt is straightforward to apply Lemma 1 to empirical errors on S!n \u222a Tn, i.e. to err!n(h), but we cannot use such bounds algorithmically: we do not request the true labels for points in Sn and thus cannot reliably compute err!n(h). What we can compute are error differences err!n(h) \u2212 err!n(h\u2032) for pairs of hypotheses (h, h\u2032) that agree on (and make the same mistakes on) Sn, since for such pairs, we have err!n(h) \u2212 err!n(h\u2032) = errn(h) \u2212 errn(h\u2032).\n\nDefinition 3. For a pair (h, h\u2032) \u2208 H \u00d7 H, define g+h,h\u2032(x, y) = 1[h(x) \u2260 y \u2227 h\u2032(x) = y] and g\u2212h,h\u2032(x, y) = 1[h(x) = y \u2227 h\u2032(x) \u2260 y].\n\nWith this notation, we have err(h, Z) \u2212 err(h\u2032, Z) = EZ[g+h,h\u2032] \u2212 EZ[g\u2212h,h\u2032] for any Z \u2282 X \u00d7 {\u00b11}. Now, applying Lemma 1 to G = {g+h,h\u2032 : (h, h\u2032) \u2208 H \u00d7 H} \u222a {g\u2212h,h\u2032 : (h, h\u2032) \u2208 H \u00d7 H}, and noting that S(G, n) \u2264 S(H, n)^2, gives the following lemma.\n\nLemma 2. Let \u03b1n = \u221a((4/n) ln(8 S(H, 2n)^2/\u03b4)). With probability at least 1 \u2212 \u03b4 over an i.i.d. sample Z of size n from D, we have for all (h, h\u2032) \u2208 H \u00d7 H,\n\nerr(h, Z) \u2212 err(h\u2032, Z) \u2264 errD(h) \u2212 errD(h\u2032) + \u03b1n^2 + \u03b1n(\u221a(EZ[g+h,h\u2032]) + \u221a(EZ[g\u2212h,h\u2032])).\n\nCorollary 1. Let \u03b2n = \u221a((4/n) ln(8(n^2 + n) S(H, 2n)^2/\u03b4)). Then, with probability at least 1 \u2212 \u03b4, for all n \u2265 1 and all (h, h\u2032) \u2208 H \u00d7 H consistent with Sn, we have\n\nerrn(h) \u2212 errn(h\u2032) \u2264 errD(h) \u2212 errD(h\u2032) + \u03b2n^2 + \u03b2n(\u221a(errn(h)) + \u221a(errn(h\u2032))).\n\nProof. Applying Lemma 2 to each S!n \u222a Tn (replacing \u03b4 with \u03b4/(n^2 + n)) and a union bound implies, with probability at least 1 \u2212 \u03b4, the bounds in Lemma 2 hold simultaneously for all n \u2265 1 and all (h, h\u2032) \u2208 H \u00d7 H with S!n \u222a Tn in place of Z. The corollary follows because err!n(h) \u2212 err!n(h\u2032) = errn(h) \u2212 errn(h\u2032); and because g+h,h\u2032(x, y) \u2264 1[h(x) \u2260 y] and g\u2212h,h\u2032(x, y) \u2264 1[h\u2032(x) \u2260 y] for (h, h\u2032) consistent with Sn, so ES!n\u222aTn[g+h,h\u2032] \u2264 errn(h) and ES!n\u222aTn[g\u2212h,h\u2032] \u2264 errn(h\u2032).\n\nCorollary 1 implies that we can effectively apply the normalized uniform convergence bounds from Lemma 1 to empirical error differences on Sn \u222a Tn, even though Sn \u222a Tn is not an i.i.d. sample from D. 
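The identity behind Definition 3, err(h, Z) \u2212 err(h\u2032, Z) = EZ[g+] \u2212 EZ[g\u2212], holds pointwise and is easy to sanity-check numerically. The snippet below does so for a hypothetical pair of threshold classifiers on randomly labeled data (any pair of hypotheses would work equally well):

```python
import random

def emp_err(h, Z):
    # empirical error err(h, Z)
    return sum(1 for x, y in Z if h(x) != y) / len(Z)

def g_plus(h, hp, x, y):
    # g+_{h,h'}: h is wrong and h' is right
    return 1 if h(x) != y and hp(x) == y else 0

def g_minus(h, hp, x, y):
    # g-_{h,h'}: h is right and h' is wrong
    return 1 if h(x) == y and hp(x) != y else 0

random.seed(0)
Z = [(x, random.choice([-1, 1])) for x in [random.random() for _ in range(1000)]]
h = lambda x: 1 if x >= 0.3 else -1
hp = lambda x: 1 if x >= 0.7 else -1

lhs = emp_err(h, Z) - emp_err(hp, Z)
rhs = (sum(g_plus(h, hp, x, y) for x, y in Z)
       - sum(g_minus(h, hp, x, y) for x, y in Z)) / len(Z)
assert abs(lhs - rhs) < 1e-12   # err(h,Z) - err(h',Z) = E_Z[g+] - E_Z[g-]
```

The point of phrasing the difference this way is that g+ and g\u2212 are nonnegative {0, 1}-valued functions, so the normalized bound of Lemma 1 applies to each of them separately.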
In light of this, we use the following setting of \u2206n:\n\n\u2206n := \u03b2n^2 + \u03b2n(\u221a(errn(h+1)) + \u221a(errn(h\u22121)))   (1)\n\nwhere \u03b2n = \u221a((4/n) ln(8(n^2 + n) S(H, 2n)^2/\u03b4)) = \u02dcO(\u221a(d log n / n)) as per Corollary 1.\n\n3.3 Correctness and fall-back analysis\n\nWe now justify our setting of \u2206n with a correctness proof and fall-back guarantee.\n\nLemma 3. With probability at least 1 \u2212 \u03b4, the hypothesis h\u2217 = arg inf h\u2208H errD(h) is consistent with Sn for all n \u2265 0 in Algorithm 1.\n\nProof. Apply the bounds in Corollary 1 and proceed by induction on n. The base case is trivial since S0 = \u2205. Now assume h\u2217 is consistent with Sn. Suppose upon receiving xn+1, we discover errn(h+1) \u2212 errn(h\u22121) > \u2206n. We will show that h\u2217(xn+1) = \u22121 (assume both h+1 and h\u22121 exist, since it is clear h\u2217(xn+1) = \u22121 if h+1 does not exist). Suppose for the sake of contradiction that h\u2217(xn+1) = +1. We know that errn(h\u2217) \u2265 errn(h+1) (by the inductive hypothesis) and errn(h+1) \u2212 errn(h\u22121) > \u03b2n^2 + \u03b2n(\u221a(errn(h+1)) + \u221a(errn(h\u22121))). In particular, errn(h+1) > \u03b2n^2. Therefore,\n\nerrn(h\u2217) \u2212 errn(h\u22121) = (errn(h\u2217) \u2212 errn(h+1)) + (errn(h+1) \u2212 errn(h\u22121))\n> \u221a(errn(h+1)) (\u221a(errn(h\u2217)) \u2212 \u221a(errn(h+1))) + \u03b2n^2 + \u03b2n(\u221a(errn(h+1)) + \u221a(errn(h\u22121)))\n> \u03b2n(\u221a(errn(h\u2217)) \u2212 \u221a(errn(h+1))) + \u03b2n^2 + \u03b2n(\u221a(errn(h+1)) + \u221a(errn(h\u22121)))\n= \u03b2n^2 + \u03b2n(\u221a(errn(h\u2217)) + \u221a(errn(h\u22121))).\n\nNow Corollary 1 implies that errD(h\u2217) > errD(h\u22121), a contradiction.\n\nTheorem 1. Let \u03bd = inf h\u2208H errD(h) and d = vcdim(H). There exists a constant c > 0 such that the following holds. 
If Algorithm 1 is given a stream of m unlabeled examples, then with probability at least 1 \u2212 \u03b4, the algorithm returns a hypothesis with error at most \u03bd + c \u00b7 ((1/m)(d log m + log(1/\u03b4)) + \u221a((\u03bd/m)(d log m + log(1/\u03b4)))).\n\nProof. Lemma 3 implies that h\u2217 is consistent with Sm with probability at least 1 \u2212 \u03b4. Using the same bounds from Corollary 1 (already applied in Lemma 3) on h\u2217 and hf together with the fact errm(hf) \u2264 errm(h\u2217), we have errD(hf) \u2264 \u03bd + \u03b2m^2 + \u03b2m\u221a\u03bd + \u03b2m\u221a(errD(hf)), which in turn implies errD(hf) \u2264 \u03bd + 3\u03b2m^2 + 2\u03b2m\u221a\u03bd.\n\nSo, Algorithm 1 returns a hypothesis with error at most \u03bd + \u03b5 when m = \u02dcO((d/\u03b5)(1 + \u03bd/\u03b5)); this is (asymptotically) the usual sample complexity of supervised learning. Since the algorithm requests at most m labels, its label complexity is always at most \u02dcO((d/\u03b5)(1 + \u03bd/\u03b5)).\n\n3.4 Label complexity analysis\n\nWe can also bound the label complexity of our algorithm in terms of the disagreement coefficient \u03b8. This yields tighter bounds when \u03b8 is bounded independently of 1/(\u03b5 + \u03bd). The key to deriving our label complexity bounds based on \u03b8 is noting that the probability of requesting the (n + 1)th label is intimately related to \u03b8 and \u2206n (see [10] for the complete proof).\n\nLemma 4. There exists a constant c > 0 such that, with probability at least 1 \u2212 2\u03b4, for all n \u2265 1, the following holds. Let h\u2217(xn+1) = \u02c6y where h\u2217 = arg inf h\u2208H errD(h). 
Then, the probability that Algorithm 1 requests the label yn+1 is Prxn+1\u223cDX[Request yn+1] \u2264 c \u00b7 \u03b8 \u00b7 (\u03bd + \u03b2n^2), where \u03b8 = \u03b8(D, H, 3\u03b2m^2 + 2\u03b2m\u221a\u03bd) is the disagreement coefficient, \u03bd = errD(h\u2217), and \u03b2n = \u02dcO(\u221a(d log n / n)) is as defined in Corollary 1.\n\nNow we give our main label complexity bound for agnostic active learning.\n\nTheorem 2. Let m be the number of unlabeled data given to Algorithm 1, d = vcdim(H), \u03bd = inf h\u2208H errD(h), \u03b2m as defined in Corollary 1, and \u03b8 = \u03b8(D, H, 3\u03b2m^2 + 2\u03b2m\u221a\u03bd). There exists a constant c1 > 0 such that for any c2 \u2265 1, with probability at least 1 \u2212 2\u03b4:\n\n1. If \u03bd \u2264 (c2 \u2212 1)\u03b2m^2, Algorithm 1 returns a hypothesis with error as bounded in Theorem 1 and the expected number of labels requested is at most\n\n1 + c1 c2 \u03b8 \u00b7 (d log^2 m + log(1/\u03b4) log m).\n\n2. Else, the same holds except the expected number of labels requested is at most\n\n1 + c1 \u03b8 \u00b7 (\u03bdm + d log^2 m + log(1/\u03b4) log m).\n\nFurthermore, if L is the expected number of labels requested as per above, then with probability at least 1 \u2212 \u03b4\u2032, the algorithm requests no more than L + \u221a(3L log(1/\u03b4\u2032)) labels.\n\nProof. Follows from Lemma 4 and a Chernoff bound for the Poisson trials 1[Request yn].\n\nWith the substitution \u03b5 = 3\u03b2m^2 + 2\u03b2m\u221a\u03bd as per Theorem 1, Theorem 2 entails that for any hypothesis class and data distribution for which the disagreement coefficient \u03b8 = \u03b8(D, H, \u03b5) is bounded independently of 1/(\u03b5 + \u03bd) (see [12] for some examples), Algorithm 1 only needs \u02dcO(\u03b8 d log^2(1/\u03b5)) labels to achieve error \u03b5 \u2248 \u03bd and \u02dcO(\u03b8 d (log^2(1/\u03b5) + (\u03bd/\u03b5)^2)) labels to achieve error \u03b5 \u226a \u03bd. 
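To make the disagreement coefficient concrete, consider thresholds on the line under the uniform distribution with no noise (\u03bd = 0), a hypothetical example in the spirit of the cases catalogued in [12]. Here \u03c1(t, t\u2032) = |t \u2212 t\u2032|, so the disagreement region of B(t\u2217, r) is just the interval (t\u2217 \u2212 r, t\u2217 + r), and \u03b8 is a constant (about 2) independent of 1/\u03b5. A quick Monte Carlo estimate confirms this:

```python
import random

random.seed(1)
t_star = 0.5
xs = [random.random() for _ in range(100000)]   # draws from the uniform DX

def theta_estimate(eps):
    # sup over r >= eps of Pr[x in DIS(B(t_star, r))] / r; for thresholds,
    # DIS(B(t_star, r)) is the interval (t_star - r, t_star + r).
    sup_ratio = 0.0
    r = eps
    while r <= 0.5:
        mass = sum(1 for x in xs if abs(x - t_star) < r) / len(xs)
        sup_ratio = max(sup_ratio, mass / r)
        r *= 1.5
    return sup_ratio

est = theta_estimate(0.01)    # should come out close to 2
```

For richer classes the sup must also range over a cover of B(h\u2217, r) rather than an explicit interval, but the same empirical estimate applies.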
The latter matches the dependence on \u03bd/\u03b5 in the \u2126((\u03bd/\u03b5)^2) lower bound [11]. The linear dependence on \u03b8 improves on the quadratic dependence required by A2 [12]4. For an illustrative consequence of this, suppose DX is the uniform distribution on the sphere in Rd and H is homogeneous linear separators; in this case, \u03b8 = \u0398(\u221ad). Then the label complexity of A2 depends at least quadratically on the dimension, whereas the corresponding dependence for our algorithm is d^(3/2).\n\n4 It may be possible to reduce A2\u2019s quadratic dependence to a linear dependence by using normalized bounds, as we do here.\n\nFigure 2: (a & b) Labeling rate plots. The plots show the number of labels requested (vertical axis) versus the total number of points seen (labeled + unlabeled, horizontal axis) using Algorithm 1. (a) H = thresholds: under random misclassification noise with \u03bd = 0 (solid), 0.1 (dashed), 0.2 (dot-dashed); under the boundary noise model with \u03bd = 0.1 (lower dotted), 0.2 (upper dotted). (b) H = intervals: under random misclassification with (p+, \u03bd) = (0.2, 0.0) (solid), (0.1, 0.0) (dashed), (0.2, 0.1) (dot-dashed), (0.1, 0.1) (dotted). (c & d) Locations of label requests. (c) H = intervals, h\u2217 = [0.4, 0.6]. The top histogram shows the locations of the first 400 label requests (the x-axis is the unit interval); the bottom histogram is for all (2141) label requests. (d) H = boxes, h\u2217 = [0.15, 0.85]^2. The first 200 requests occurred at the \u00d7s, the next 200 at the \u25bds, and the final 109 at the \u25cbs.\n\n4 Experiments\n\nWe implemented Algorithm 1 in a few simple cases to experimentally demonstrate the label complexity improvements. 
In each case, the data distribution DX was uniform over [0, 1]; the stream length was m = 10000, and each experiment was repeated 20 times with different random seeds. Our first experiment studied linear thresholds on the line. The target hypothesis was fixed to be h\u2217(x) = sign(x \u2212 0.5). For this hypothesis class, we used two different noise models, each of which ensured inf h\u2208H errD(h) = errD(h\u2217) = \u03bd for a pre-specified \u03bd \u2208 [0, 1]. The first model was random misclassification: for each point x \u223c DX, we independently labeled it h\u2217(x) with probability 1 \u2212 \u03bd and \u2212h\u2217(x) with probability \u03bd. In the second model (also used in [7]), for each point x \u223c DX, we independently labeled it +1 with probability (x \u2212 0.5)/(4\u03bd) + 0.5 and \u22121 otherwise, thus concentrating the noise near the boundary. Our second experiment studied intervals on the line. Here, we only used random misclassification, but we varied the target interval length p+ = Prx\u223cDX[h\u2217(x) = +1].\n\nThe results show that the number of labels requested by Algorithm 1 was exponentially smaller than the total number of data seen (m) under the first noise model, and was polynomially smaller under the second noise model (see Figure 2 (a & b); we verified the polynomial vs. exponential distinction on separate log-log scale plots). In the case of intervals, we observe an initial phase (of duration roughly \u221d 1/p+) in which every label is requested, followed by a more efficient phase, confirming the known active-learnability of this class [4, 12]. These improvements show that our algorithm needed significantly fewer labels to achieve the same error as a standard supervised algorithm that uses labels for all points seen.\n\nAs a sanity check, we examined the locations of data for which Algorithm 1 requested a label. 
We looked at two particular runs of the algorithm: the first was with H = intervals, p+ = 0.2, m = 10000, and \u03bd = 0.1; the second was with H = boxes (d = 2), p+ = 0.49, m = 1000, and \u03bd = 0.01. In each case, the data distribution was uniform over [0, 1]^d, and the noise model was random misclassification. Figure 2 (c & d) shows that, early on, labels were requested everywhere. But as the algorithm progressed, label requests concentrated near the boundary of the target hypothesis.\n\n5 Conclusion and future work\n\nWe have presented a simple and natural approach to agnostic active learning. Our extension of the selective sampling scheme of Cohn et al. [1]\n\n1. simplifies the maintenance of the region of uncertainty with a reduction to supervised learning, and\n\n2. guards against noise with a subtle algorithmic application of generalization bounds.\n\nOur algorithm relies on a threshold parameter \u2206n for comparing empirical errors. We prescribe a very simple and natural choice for \u2206n \u2013 a normalized generalization bound from supervised learning \u2013 but one could hope for a more clever or aggressive choice, akin to those in [6] for linear separators.\n\nFinding consistent hypotheses when data is separable is often a simple task. In such cases, reduction-based active learning algorithms can be relatively efficient (answering some questions posed in [16]). On the other hand, agnostic learning suffers from severe computational intractability for many hypothesis classes (e.g. [17]), and of course, agnostic active learning is at least as hard in the worst case. Our reduction is relatively benign in that the learning problems created are only over samples from the original distribution, so we do not create pathologically hard instances (like those arising from hardness reductions) unless they are inherent in the data. 
Nevertheless, an important research direction is to develop algorithms that only require solving tractable (e.g. convex) optimization problems. A similar reduction-based scheme may be possible.\n\nReferences\n\n[1] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201\u2013221, 1994.\n\n[2] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133\u2013168, 1997.\n\n[3] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In COLT, 2005.\n\n[4] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.\n\n[5] S. Hanneke. Teaching dimension and the complexity of active learning. In COLT, 2007.\n\n[6] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.\n\n[7] R. Castro and R. Nowak. Upper and lower bounds for active learning. In Allerton Conference on Communication, Control and Computing, 2006.\n\n[8] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007.\n\n[9] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.\n\n[10] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. UCSD Technical Report CS2007-0898, http://www.cse.ucsd.edu/\u223cdjhsu/papers/cal.pdf, 2007.\n\n[11] M. K\u00e4\u00e4ri\u00e4inen. Active learning in the non-realizable case. In ALT, 2006.\n\n[12] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.\n\n[13] C. Monteleoni. Learning with online constraints: shifting concepts and active learning. PhD Thesis, MIT Computer Science and Artificial Intelligence Laboratory, 2006.\n\n[14] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture Notes in Artificial Intelligence, 3176:169\u2013207, 2004.\n\n[15] V. Vapnik and A. Chervonenkis. 
On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264\u2013280, 1971.\n\n[16] C. Monteleoni. Efficient algorithms for general active learning. In COLT. Open problem, 2006.\n\n[17] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In FOCS, 2006.", "award": [], "sourceid": 178, "authors": [{"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": null}, {"given_name": "Daniel", "family_name": "Hsu", "institution": ""}, {"given_name": "Claire", "family_name": "Monteleoni", "institution": ""}]}