{"title": "Agnostic Active Learning Without Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 199, "page_last": 207, "abstract": "We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification.", "full_text": "Agnostic Active Learning Without Constraints\n\nAlina Beygelzimer\n\nIBM Research\nHawthorne, NY\n\nDaniel Hsu\n\nRutgers University &\n\nUniversity of Pennsylvania\n\nbeygel@us.ibm.com\n\ndjhsu@rci.rutgers.edu\n\nJohn Langford\nYahoo! Research\nNew York, NY\n\nTong Zhang\n\nRutgers University\n\nPiscataway, NJ\n\njl@yahoo-inc.com\n\ntongz@rci.rutgers.edu\n\nAbstract\n\nWe present and analyze an agnostic active learning algorithm that works without\nkeeping a version space. This is unlike all previous approaches where a restricted\nset of candidate hypotheses is maintained throughout learning, and only hypothe-\nses from this set are ever returned. By avoiding this version space approach, our\nalgorithm sheds the computational burden and brittleness associated with main-\ntaining version spaces, yet still allows for substantial improvements over super-\nvised learning for classi\ufb01cation.\n\n1\n\nIntroduction\n\nIn active learning, a learner is given access to unlabeled data and is allowed to adaptively choose\nwhich ones to label. This learning model is motivated by applications in which the cost of labeling\ndata is high relative to that of collecting the unlabeled data itself. 
Therefore, the hope is that the active learner only needs to query the labels of a small number of the unlabeled data, and otherwise perform as well as a fully supervised learner. In this work, we are interested in agnostic active learning algorithms for binary classification that are provably consistent, i.e. that converge to an optimal hypothesis in a given hypothesis class.\n\nOne technique that has proved theoretically profitable is to maintain a candidate set of hypotheses (sometimes called a version space), and to query the label of a point only if there is disagreement within this set about how to label the point. The criteria for membership in this candidate set need to be carefully defined so that an optimal hypothesis is always included, but otherwise this set can be quickly whittled down as more labels are queried. This technique is perhaps most readily understood in the noise-free setting [1, 2], and it can be extended to noisy settings by using empirical confidence bounds [3, 4, 5, 6, 7].\n\nThe version space approach unfortunately has its share of significant drawbacks. The first is computational intractability: maintaining a version space and guaranteeing that only hypotheses from this set are returned is difficult for linear predictors and appears intractable for interesting nonlinear predictors such as neural nets and decision trees [1]. Another drawback of the approach is its brittleness: a single mishap (due to, say, modeling failures or computational approximations) might cause the learner to exclude the best hypothesis from the version space forever; this is an ungraceful failure mode that is not easy to correct. 
A third drawback is related to sample re-usability: if (labeled)\ndata is collected using a version space-based active learning algorithm, and we later decide to use\na different algorithm or hypothesis class, then the earlier data may not be freely re-used because its\ncollection process is inherently biased.\n\n1\n\n\fHere, we develop a new strategy addressing all of the above problems given an oracle that returns an\nempirical risk minimizing (ERM) hypothesis. As this oracle matches our abstraction of many super-\nvised learning algorithms, we believe active learning algorithms built in this way are immediately\nand widely applicable.\n\nOur approach instantiates the importance weighted active learning framework of [5] using a rejection\nthreshold similar to the algorithm of [4] which only accesses hypotheses via a supervised learning\noracle. However, the oracle we require is simpler and avoids strict adherence to a candidate set\nof hypotheses. Moreover, our algorithm creates an importance weighted sample that allows for\nunbiased risk estimation, even for hypotheses from a class different from the one employed by the\nactive learner. This is in sharp contrast to many previous algorithms (e.g., [1, 3, 8, 4, 6, 7]) that create\nheavily biased data sets. We prove that our algorithm is always consistent and has an improved label\ncomplexity over passive learning in cases previously studied in the literature. We also describe a\npractical instantiation of our algorithm and report on some experimental results.\n\n1.1 Related Work\n\nAs already mentioned, our work is closely related to the previous works of [4] and [5], both of\nwhich in turn draw heavily on the work of [1] and [3]. The algorithm from [4] extends the selective\nsampling method of [1] to the agnostic setting using generalization bounds in a manner similar\nto that \ufb01rst suggested in [3]. 
It accesses hypotheses only through a special ERM oracle that can enforce an arbitrary number of example-based constraints; these constraints define a version space, and the algorithm only ever returns hypotheses from this space, which can be undesirable as we previously argued. Other previous algorithms with comparable performance guarantees also require similar example-based constraints (e.g., [3, 5, 6, 7]). Our algorithm differs from these in that (i) it never restricts its attention to a version space when selecting a hypothesis to return, and (ii) it only requires an ERM oracle that enforces at most one example-based constraint, and this constraint is only used for selective sampling. Our label complexity bounds are comparable to those proved in [5] (though somewhat worse than those in [3, 4, 6, 7]).\n\nThe use of importance weights to correct for sampling bias is a standard technique for many machine learning problems (e.g., [9, 10, 11]) including active learning [12, 13, 5]. Our algorithm is based on the importance weighted active learning (IWAL) framework introduced by [5]. In that work, a rejection threshold procedure called loss-weighting is rigorously analyzed and shown to yield improved label complexity bounds in certain cases. Loss-weighting is more general than our technique in that it extends beyond zero-one loss to a certain subclass of loss functions such as logistic loss. On the other hand, the loss-weighting rejection threshold requires optimizing over a restricted version space, which is computationally undesirable. Moreover, the label complexity bound given in [5] only applies to hypotheses selected from this version space, and not when selected from the entire hypothesis class (as the general IWAL framework suggests). 
We avoid these deficiencies using a new rejection threshold procedure and a more subtle martingale analysis.\n\nMany of the previously mentioned algorithms are analyzed in the agnostic learning model, where no assumption is made about the noise distribution (see also [14]). In this setting, the label complexity of active learning algorithms cannot generally improve over supervised learners by more than a constant factor [15, 5]. However, under a parameterization of the noise distribution related to Tsybakov’s low-noise condition [16], active learning algorithms have been shown to have improved label complexity bounds over what is achievable in the purely agnostic setting [17, 8, 18, 6, 7]. We also consider this parameterization to obtain a tighter label complexity analysis.\n\n2 Preliminaries\n\n2.1 Learning Model\n\nLet D be a distribution over X × Y where X is the input space and Y = {±1} are the labels. Let (X, Y) ∈ X × Y be a pair of random variables with joint distribution D. An active learner receives a sequence (X1, Y1), (X2, Y2), . . . of i.i.d. copies of (X, Y), with the label Yi hidden unless it is explicitly queried. We use the shorthand a1:k to denote a sequence (a1, a2, . . . , ak) (so k = 0 corresponds to the empty sequence).\n\n2\n\n\fLet H be a set of hypotheses mapping from X to Y. For simplicity, we assume H is finite but does not completely agree on any single x ∈ X (i.e., ∀x ∈ X, ∃h, h′ ∈ H such that h(x) ≠ h′(x)). This keeps the focus on the relevant aspects of active learning that differ from passive learning. The error of a hypothesis h : X → Y is err(h) := Pr(h(X) ≠ Y). Let h∗ := arg min{err(h) : h ∈ H} be a hypothesis of minimum error in H. 
The goal of the active learner is to return a hypothesis h \u2208 H\nwith error err(h) not much more than err(h\u2217), using as few label queries as possible.\n\n2.2\n\nImportance Weighted Active Learning\n\nIn the importance weighted active learning (IWAL) framework of [5], an active learner looks at\nthe unlabeled data X1, X2, . . . one at a time. After each new point Xi, the learner determines a\nprobability Pi \u2208 [0, 1]. Then a coin with bias Pi is \ufb02ipped, and the label Yi is queried if and only if\nthe coin comes up heads. The query probability Pi can depend on all previous unlabeled examples\nX1:i\u22121, any previously queried labels, any past coin \ufb02ips, and the current unlabeled point Xi.\nFormally, an IWAL algorithm speci\ufb01es a rejection threshold function p : (X \u00d7Y \u00d7{0, 1})\u2217 \u00d7X \u2192\n[0, 1] for determining these query probabilities. Let Qi \u2208 {0, 1} be a random variable conditionally\nindependent of the current label Yi,\n\nand with conditional expectation\n\nQi \u22a5\u22a5 Yi | X1:i, Y1:i\u22121, Q1:i\u22121\n\nE[Qi|Z1:i\u22121, Xi] = Pi\n\n:= p(Z1:i\u22121, Xi).\n\nwhere Zj\n:= (Xj, Yj, Qj). That is, Qi indicates if the label Yi is queried (the outcome of\nthe coin toss). Although the notation does not explicitly suggest this, the query probability\nPi = p(Z1:i\u22121, Xi) is allowed to explicitly depend on a label Yj (j < i) if and only if it has\nbeen queried (Qj = 1).\n\n2.3\n\nImportance Weighted Estimators\n\nWe \ufb01rst review some standard facts about the importance weighting technique. For a function f :\nX \u00d7 Y \u2192 R, de\ufb01ne the importance weighted estimator of E[f (X, Y )] from Z1:n \u2208 (X \u00d7 Y \u00d7\n{0, 1})n to be\n\nNote that this quantity depends on a label Yi only if it has been queried (i.e., only if Qi = 1; it also\ndepends on Xi only if Qi = 1). 
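As a side illustration (ours, not part of the paper; all names in the sketch are invented), the unbiasedness of importance weighted estimates under past-dependent query probabilities can be checked in a few lines of Python:

```python
import random

def make_sample(n, h_star, p_fn, noise=0.1, rng=None):
    """Simulate IWAL-style data collection: draw (x_i, y_i), query y_i with
    probability p_i (which may depend on the past), and record the triple
    (x_i, y_i, 1/p_i) when the label is queried."""
    rng = rng or random.Random(0)
    queried = []
    for i in range(1, n + 1):
        x = rng.random()
        y = h_star(x) if rng.random() > noise else -h_star(x)
        p = p_fn(i, queried)
        if rng.random() < p:
            queried.append((x, y, 1.0 / p))
    return n, queried

def iw_error(n, queried, h):
    """Importance weighted error err(h, S_n) in the spirit of Eq. (1)."""
    return sum(w for (x, y, w) in queried if h(x) != y) / n

h_star = lambda x: 1 if x > 0.5 else -1   # optimal threshold classifier
h = lambda x: 1 if x > 0.6 else -1        # a slightly worse hypothesis
# A past-dependent rule: query everything until 20 labels are collected,
# then query each remaining label with probability 1/2 (weight 2).
p_fn = lambda i, q: 1.0 if len(q) < 20 else 0.5
est = [iw_error(*make_sample(4000, h_star, p_fn, rng=random.Random(s)), h)
       for s in range(50)]
avg = sum(est) / len(est)  # concentrates near err(h) = 0.18 in this setup
```

For this synthetic problem (uniform inputs, 10% label noise, h and h∗ differing on an interval of mass 0.1), err(h) = 0.9 · 0.1 + 0.1 · 0.9 = 0.18, and the importance weighted estimate averages out to that value even though only about half of the later labels are queried.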
f̂(Z1:n) := (1/n) Σ_{i=1}^n (Qi/Pi) · f(Xi, Yi).\n\nOur rejection threshold will be based on a specialization of this estimator, specifically the importance weighted empirical error of a hypothesis h:\n\nerr(h, Z1:n) := (1/n) Σ_{i=1}^n (Qi/Pi) · 1[h(Xi) ≠ Yi].\n\nIn the notation of Algorithm 1, this is equivalent to\n\nerr(h, Sn) := (1/n) Σ_{(Xi, Yi, 1/Pi) ∈ Sn} (1/Pi) · 1[h(Xi) ≠ Yi]    (1)\n\nwhere Sn ⊆ X × Y × R is the importance weighted sample collected by the algorithm.\n\nA basic property of these estimators is unbiasedness: E[f̂(Z1:n)] = (1/n) Σ_{i=1}^n E[E[(Qi/Pi) · f(Xi, Yi) | X1:i, Y1:i, Q1:i−1]] = (1/n) Σ_{i=1}^n E[(Pi/Pi) · f(Xi, Yi)] = E[f(X, Y)]. So, for example, the importance weighted empirical error of a hypothesis h is an unbiased estimator of its true error err(h). This holds for any choice of the rejection threshold that guarantees Pi > 0.\n\n3 A Deviation Bound for Importance Weighted Estimators\n\nAs mentioned before, the rejection threshold used by our algorithm is based on importance weighted error estimates err(h, Z1:n). Even though these estimates are unbiased, they are only reliable when\n\n3\n\n\fthe variance is not too large. To get a handle on this, we need a deviation bound for importance weighted estimators. This is complicated by two factors that rule out straightforward applications of some standard bounds:\n\n1. The importance weighted samples (Xi, Yi, 1/Pi) (or equivalently, the Zi = (Xi, Yi, Qi)) are not i.i.d. This is because the query probability Pi (and thus the importance weight 1/Pi) generally depends on Z1:i−1 and Xi.\n\n2. The effective range and variance of each term in the estimator are, themselves, random variables.\n\nTo address these issues, we develop a deviation bound using a martingale technique from [19]. Let f : X × Y → [−1, 1] be a bounded function. 
Consider any rejection threshold function p : (X × Y × {0, 1})* × X → (0, 1] for which Pn = p(Z1:n−1, Xn) is bounded below by some positive quantity (which may depend on n). Equivalently, the query probabilities Pn should have inverses 1/Pn bounded above by some deterministic quantity rmax (which, again, may depend on n). The a priori upper bound rmax on 1/Pn can be pessimistic, as the dependence on rmax in the final deviation bound will be very mild; it enters in as log log rmax. Our goal is to prove a bound on |f̂(Z1:n) − E[f(X, Y)]| that holds with high probability over the joint distribution of Z1:n.\n\nTo start, we establish bounds on the range and variance of each term Wi := (Qi/Pi) · f(Xi, Yi) in the estimator, conditioned on (X1:i, Y1:i, Q1:i−1). Let Ei[·] denote E[· | X1:i, Y1:i, Q1:i−1]. Note that Ei[Wi] = (Ei[Qi]/Pi) · f(Xi, Yi) = f(Xi, Yi), so if Ei[Wi] = 0, then Wi = 0. Therefore, the (conditional) range and variance are non-zero only if Ei[Wi] ≠ 0. For the range, we have |Wi| = (Qi/Pi) · |f(Xi, Yi)| ≤ 1/Pi, and for the variance, Ei[(Wi − Ei[Wi])²] ≤ (Ei[Qi²]/Pi²) · f(Xi, Yi)² ≤ 1/Pi. These range and variance bounds indicate the form of the deviations we can expect, similar to that of other classical deviation bounds.\n\nTheorem 1. Pick any t ≥ 0 and n ≥ 1. Assume 1 ≤ 1/Pi ≤ rmax for all 1 ≤ i ≤ n, and let Rn := 1/min({Pi : 1 ≤ i ≤ n ∧ f(Xi, Yi) ≠ 0} ∪ {1}). With probability at least 1 − 2(3 + log2 rmax)e^{−t/2},\n\n|(1/n) Σ_{i=1}^n (Qi/Pi) · f(Xi, Yi) − E[f(X, Y)]| ≤ √(2Rn t / n) + √(2t / n) + Rn t / (3n).\n\nWe defer all proofs to the appendices.\n\n4 Algorithm\n\nFirst, we state a deviation bound for the importance weighted error of hypotheses in a finite hypothesis class H that holds for all n ≥ 1. It is a simple consequence of Theorem 1 and union bounds; the form of the bound motivates certain algorithmic choices to be described below.\n\nLemma 1. Pick any δ ∈ (0, 1). For all n ≥ 1, let\n\nεn := 16 log(2(3 + n log2 n)n(n + 1)|H|/δ) / n = O(log(n|H|/δ) / n).    (3)\n\nLet (Z1, Z2, . . .) ∈ (X × Y × {0, 1})* be the sequence of random variables specified in Section 2.2 using a rejection threshold p : (X × Y × {0, 1})* × X → [0, 1] that satisfies p(z1:n, x) ≥ 1/n^n for all (z1:n, x) ∈ (X × Y × {0, 1})^n × X and all n ≥ 1. The following holds with probability at least 1 − δ. For all n ≥ 1 and all h ∈ H,\n\n|(err(h, Z1:n) − err(h∗, Z1:n)) − (err(h) − err(h∗))| ≤ √(εn / Pmin,n(h)) + εn / Pmin,n(h)    (4)\n\nwhere Pmin,n(h) := min({Pi : 1 ≤ i ≤ n ∧ h(Xi) ≠ h∗(Xi)} ∪ {1}).\n\nWe let C0 = O(log(|H|/δ)) ≥ 2 be a quantity such that εn (as defined in Eq. (3)) is bounded as εn ≤ C0 · log(n + 1)/n. The following absolute constants are used in the description of the rejection\n\n4\n\n\fAlgorithm 1\nNotes: see Eq. (1) for the definition of err (importance weighted error), and Section 4 for the definitions of C0, c1, and c2.\nInitialize: S0 := ∅.\nFor k = 1, 2, . . . , n:\n1. Obtain unlabeled data point Xk.\n2. 
Let\n\nhk := arg min{err(h, Sk−1) : h ∈ H}, and\nh′k := arg min{err(h, Sk−1) : h ∈ H ∧ h(Xk) ≠ hk(Xk)}.\n\nLet Gk := err(h′k, Sk−1) − err(hk, Sk−1), and let Pk := 1 if Gk ≤ √(C0 log k / (k − 1)) + C0 log k / (k − 1), and Pk := s otherwise, where s ∈ (0, 1) is the positive solution to the equation\n\nGk = (c1/√s − c1 + 1) · √(C0 log k / (k − 1)) + (c2/s − c2 + 1) · (C0 log k / (k − 1)).    (2)\n\n(This makes Pk = min{1, O((1/Gk² + 1/Gk) · C0 log k / (k − 1))}.)\n3. Toss a biased coin with Pr(heads) = Pk.\nIf heads, then query Yk, and let Sk := Sk−1 ∪ {(Xk, Yk, 1/Pk)}.\nElse, let Sk := Sk−1.\nReturn: hn+1 := arg min{err(h, Sn) : h ∈ H}.\nFigure 1: Algorithm for importance weighted active learning with an error minimization oracle.\n\nthreshold and the subsequent analysis: c1 := 5 + 2√2, c2 := 5, c3 := ((c1 + √2)/(c1 − 2))², c4 := (c1 + √c3)², c5 := c2 + c3.\n\nOur proposed algorithm is shown in Figure 1. The rejection threshold (Step 2) is based on the deviation bound from Lemma 1. First, the importance weighted error minimizing hypothesis hk and the “alternative” hypothesis h′k are found. Note that both optimizations are over the entire hypothesis class H (with h′k only being required to disagree with hk on xk); this is a key aspect where our algorithm differs from previous approaches. The difference Gk in importance weighted errors of the two hypotheses is then computed. If Gk ≤ √((C0 log k)/(k − 1)) + (C0 log k)/(k − 1), then the query probability Pk is set to 1. Otherwise, Pk is set to the positive solution s of the quadratic equation in Eq. (2). The functional form of Pk is roughly min{1, (1/Gk² + 1/Gk) · (C0 log k)/(k − 1)}. 
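To make Step 2 concrete, here is a minimal sketch (ours, not from the paper) of the query probability computation; it solves Eq. (2) in closed form via the substitution u = 1/√s, using the analysis constants c1 = 5 + 2√2 and c2 = 5, and, as an assumed default, the value C0 = 8 used later in the experiments:

```python
import math

def query_probability(G_k, k, C0=8.0, c1=5 + 2 * math.sqrt(2), c2=5.0):
    """Query probability P_k from Step 2 of Algorithm 1 (a sketch).

    G_k is the importance weighted error difference between the alternative
    hypothesis h'_k and the empirical minimizer h_k.  Defaults: the analysis
    constants c1, c2, and (an assumption) the experimental choice C0 = 8.
    """
    if k <= 1:
        return 1.0  # convention log(1)/0 = infinity: always query at k = 1
    eps = C0 * math.log(k) / (k - 1)
    if G_k <= math.sqrt(eps) + eps:
        return 1.0
    # Eq. (2): G_k = (c1/sqrt(s) - c1 + 1)*sqrt(eps) + (c2/s - c2 + 1)*eps.
    # With u = 1/sqrt(s) this is the upward-opening quadratic
    #   c2*eps*u^2 + c1*sqrt(eps)*u + ((1-c1)*sqrt(eps) + (1-c2)*eps - G_k) = 0,
    # whose positive root satisfies u > 1, giving s = 1/u^2 in (0, 1).
    a = c2 * eps
    b = c1 * math.sqrt(eps)
    c = (1 - c1) * math.sqrt(eps) + (1 - c2) * eps - G_k
    u = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return 1.0 / (u * u)
```

For small Gk the threshold test fires and the label is queried with probability 1; for larger Gk the probability decays roughly like (1/Gk² + 1/Gk) · (C0 log k)/(k − 1), matching the functional form noted in the text.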
It can be checked that Pk ∈ (0, 1] and that Pk is non-increasing with Gk. It is also useful to note that (log k)/(k − 1) is monotonically decreasing with k ≥ 1 (we use the convention log(1)/0 = ∞). In order to apply Lemma 1 with our rejection threshold, we need to establish the (very crude) bound Pk ≥ 1/k^k for all k.\n\nLemma 2. The rejection threshold of Algorithm 1 satisfies p(z1:n−1, x) ≥ 1/n^n for all n ≥ 1 and all (z1:n−1, x) ∈ (X × Y × {0, 1})^{n−1} × X.\n\nNote that this is a worst-case bound; our analysis shows that the probabilities Pk are more like 1/poly(k) in the typical case.\n\n5 Analysis\n\n5.1 Correctness\n\nWe first prove a consistency guarantee for Algorithm 1 that bounds the generalization error of the importance weighted empirical error minimizer. The proof actually establishes a lower bound on\n\n5\n\n\fthe query probabilities: Pi ≥ 1/2 for Xi such that hn(Xi) ≠ h∗(Xi). This offers an intuitive characterization of the weighting landscape induced by the importance weights 1/Pi.\n\nTheorem 2. The following holds with probability at least 1 − δ. For any n ≥ 1,\n\n0 ≤ err(hn) − err(h∗) ≤ err(hn, Z1:n−1) − err(h∗, Z1:n−1) + √(2C0 log n / (n − 1)) + 2C0 log n / (n − 1).\n\nThis implies, for all n ≥ 1,\n\nerr(hn) ≤ err(h∗) + √(2C0 log n / (n − 1)) + 2C0 log n / (n − 1).\n\nTherefore, the final hypothesis returned by Algorithm 1 after seeing n unlabeled data has roughly the same error bound as a hypothesis returned by a standard passive learner with n labeled data. A variant of this result under certain noise conditions is given in the appendix.\n\n5.2 Label Complexity Analysis\n\nWe now bound the number of labels requested by Algorithm 1 after n iterations. 
The following lemma bounds the probability of querying the label Yn; this is subsequently used to establish the final bound on the expected number of labels queried. The key to the proof is in relating empirical error differences and their deviations to the probability of querying a label. This is mediated through the disagreement coefficient, a quantity first used by [14] for analyzing the label complexity of the A2 algorithm of [3]. The disagreement coefficient θ := θ(h∗, H, D) is defined as\n\nθ(h∗, H, D) := sup{ Pr(X ∈ DIS(h∗, r)) / r : r > 0 }\n\nwhere\n\nDIS(h∗, r) := {x ∈ X : ∃h′ ∈ H such that Pr(h∗(X) ≠ h′(X)) ≤ r and h∗(x) ≠ h′(x)}\n\n(the disagreement region around h∗ at radius r). This quantity is bounded for many learning problems studied in the literature; see [14, 6, 20, 21] for more discussion. Note that the supremum can instead be taken over r > ε if the target excess error is ε, which allows for a more detailed analysis.\n\nLemma 3. Assume the bound from Eq. (4) holds for all h ∈ H and n ≥ 1. For any n ≥ 1,\n\nE[Qn] ≤ θ · 2 err(h∗) + O(θ · √(C0 log n / (n − 1)) + θ · C0 log² n / (n − 1)).\n\nTheorem 3. With probability at least 1 − δ, the expected number of labels queried by Algorithm 1 after n iterations is at most\n\n1 + θ · 2 err(h∗) · (n − 1) + O(θ · √(C0 n log n) + θ · C0 log³ n).\n\nThe bound is dominated by a linear term scaled by err(h∗), plus a sublinear term. 
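As a concrete illustration (ours, not from the paper), the disagreement coefficient can be estimated by Monte Carlo for threshold classifiers h_t(x) = 1[x > t] on X = [0, 1] with X uniform; here Pr(h_t(X) ≠ h_{t∗}(X)) = |t − t∗|, so DIS(h∗, r) is an interval of width 2r around the optimal threshold and θ = 2:

```python
import random

def disagreement_mass(t_star, r, xs):
    # For thresholds on uniform [0, 1], Pr(h_t != h_{t*}) = |t - t*|, so the
    # hypotheses within distance r of h* disagree with it exactly on the
    # interval (t* - r, t* + r): this is DIS(h*, r).
    return sum(1 for x in xs if abs(x - t_star) < r) / len(xs)

random.seed(0)
xs = [random.random() for _ in range(100_000)]
# theta = sup_r Pr(X in DIS(h*, r)) / r, approximated over a few radii
theta = max(disagreement_mass(0.5, r, xs) / r for r in (0.01, 0.05, 0.1, 0.2))
```

The estimate comes out near the exact value θ = 2 for this class; for larger radii the disagreement region eventually saturates at mass 1, which is why the ratio stays bounded.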
The linear term err(h∗) · n is unavoidable in the worst case, as evident from label complexity lower bounds [15, 5]. When err(h∗) is negligible (e.g., the data is separable) and θ is bounded (as is the case for many problems studied in the literature [14]), then the bound represents a polynomial label complexity improvement over supervised learning, similar to that achieved by the version space algorithm from [5].\n\n5.3 Analysis under Low Noise Conditions\n\nSome recent work on active learning has focused on improved label complexity under certain noise conditions [17, 8, 18, 6, 7]. Specifically, it is assumed that there exist constants κ > 0 and 0 < α ≤ 1 such that\n\nPr(h(X) ≠ h∗(X)) ≤ κ · (err(h) − err(h∗))^α    (5)\n\nfor all h ∈ H. This is related to Tsybakov’s low noise condition [16]. Essentially, this condition requires that low error hypotheses not be too far from the optimal hypothesis h∗ under the disagreement metric Pr(h∗(X) ≠ h(X)). Under this condition, Lemma 3 can be improved, which in turn yields the following theorem.\n\n6\n\n\fTheorem 4. Assume that for some value of κ > 0 and 0 < α ≤ 1, the condition in Eq. (5) holds for all h ∈ H. There is a constant cα > 0 depending only on α such that the following holds. With probability at least 1 − δ, the expected number of labels queried by Algorithm 1 after n iterations is at most\n\nθ · κ · cα · (C0 log n)^{α/2} · n^{1−α/2}.\n\nNote that the bound is sublinear in n for all 0 < α ≤ 1, which implies label complexity improvements whenever θ is bounded (an improved analogue of Theorem 2 under these conditions can be established using similar techniques). 
The previous algorithms of [6, 7] obtain even better rates\nunder these noise conditions using specialized data dependent generalization bounds, but these al-\ngorithms also required optimizations over restricted version spaces, even for the bound computation.\n\n6 Experiments\n\nAlthough agnostic learning is typically intractable in the worst case, empirical risk minimization can\nserve as a useful abstraction for many practical supervised learning algorithms in non-worst case\nscenarios. With this in mind, we conducted a preliminary experimental evaluation of Algorithm 1,\nimplemented using a popular algorithm for learning decision trees in place of the required ERM\noracle. Speci\ufb01cally, we use the J48 algorithm from Weka v3.6.2 (with default parameters) to select\nthe hypothesis hk in each round k; to produce the \u201calternative\u201d hypothesis h\u2032\nk, we just modify\nthe decision tree hk by changing the label of the node used for predicting on xk. Both of these\nprocedures are clearly heuristic, but they are similar in spirit to the required optimizations. We\nset C0 = 8 and c1 = c2 = 1\u2014these can be regarded as tuning parameters, with C0 controlling\nthe aggressiveness of the rejection threshold. We did not perform parameter tuning with active\nlearning although the importance weighting approach developed here could potentially be used for\nthat. Rather, the goal of these experiments is to assess the compatibility of Algorithm 1 with an\nexisting, practical supervised learning procedure.\n\n6.1 Data Sets\n\nWe constructed two binary classi\ufb01cation tasks using MNIST and KDDCUP99 data sets. For MNIST,\nwe randomly chose 4000 training 3s and 5s for training (using the 3s as the positive class), and used\nall of the 1902 testing 3s and 5s for testing. For KDDCUP99, we randomly chose 5000 examples\nfor training, and another 5000 for testing. 
In both cases, we reduced the dimension of the data to 25 using PCA.\n\nTo demonstrate the versatility of our algorithm, we also conducted a multi-class classification experiment using the entire MNIST data set (all ten digits, so 60000 training data and 10000 testing data). This required modifying how h′k is selected: we force h′k(xk) ≠ hk(xk) by changing the label of the prediction node for xk to the next best label. We used PCA to reduce the dimension to 40.\n\n6.2 Results\n\nWe examined the test error as a function of (i) the number of unlabeled data seen, and (ii) the number of labels queried. We compared the performance of the active learner described above to a passive learner (one that queries every label, so (i) and (ii) are the same) using J48 with default parameters.\n\nIn all three cases, the test errors as a function of the number of unlabeled data were roughly the same for both the active and passive learners. This agrees with the consistency guarantee from Theorem 2. We note that this is a basic property not satisfied by many active learning algorithms (this issue is discussed further in [22]).\n\nIn terms of test error as a function of the number of labels queried (Figure 2), the active learner had minimal improvement over the passive learner on the binary MNIST task, but a substantial improvement over the passive learner on the KDDCUP99 task (even at small numbers of label queries). For the multi-class MNIST task, the active learner had a moderate improvement over the passive learner. Note that KDDCUP99 is far less noisy (more separable) than the MNIST 3s vs 5s task, so the results are in line with the label complexity behavior suggested by Theorem 3, which states that the label complexity improvement may scale with the error of the optimal hypothesis. 
7\n\n\f[Figure 2 here: four panels plotting test error against the number of labels queried for Passive and Active learners: MNIST 3s vs 5s, KDDCUP99, KDDCUP99 (close-up), and MNIST multi-class (close-up).]\n\nFigure 2: Test errors as a function of the number of labels queried.\n\nAlso, the results from the MNIST tasks suggest that the active learner may require an initial random sampling phase during which it is equivalent to the passive learner, and the advantage manifests itself after this phase. This again is consistent with the analysis (also see [14]), as the disagreement coefficient can be large at initial scales, yet much smaller as the number of (unlabeled) data increases and the scale becomes finer.\n\n7 Conclusion\n\nThis paper provides a new active learning algorithm based on error minimization oracles, a departure from the version space approach adopted by previous works. The algorithm we introduce here motivates computationally tractable and effective methods for active learning with many classifier training algorithms. The overall algorithmic template applies to any training algorithm that (i) operates by approximate error minimization and (ii) for which the cost of switching a class prediction (as measured by example errors) can be estimated. 
Furthermore, although these properties might\nonly hold in an approximate or heuristic sense, the created active learning algorithm will be \u201csafe\u201d\nin the sense that it will eventually converge to the same solution as a passive supervised learning\nalgorithm. Consequently, we believe this approach can be widely used to reduce the cost of labeling\nin situations where labeling is expensive.\n\nRecent theoretical work on active learning has focused on improving rates of convergence. However,\nin some applications, it may be desirable to improve performance at much smaller sample sizes, per-\nhaps even at the cost of improved rates as long as consistency is ensured. Importance sampling and\nweighting techniques like those analyzed in this work may be useful for developing more aggressive\nstrategies with such properties.\n\nAcknowledgments\n\nThis work was completed while DH was at Yahoo! Research and UC San Diego.\n\n8\n\n\fReferences\n[1] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning,\n\n15(2):201\u2013221, 1994.\n\n[2] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information\n\nProcessing Systems 18, 2005.\n\n[3] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Twenty-Third International\n\nConference on Machine Learning, 2006.\n\n[4] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In Advances in\n\nNeural Information Processing Systems 20, 2007.\n\n[5] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Twenty-Sixth\n\nInternational Conference on Machine Learning, 2009.\n\n[6] S. Hanneke. Adaptive rates of convergence in active learning. In Twenty-Second Annual Conference on\n\nLearning Theory, 2009.\n\n[7] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Manuscript,\n\n2009.\n\n[8] M.-F. Balcan, A. Broder, and T. Zhang. 
Margin based active learning. In Twentieth Annual Conference on Learning Theory, 2007.\n\n[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.\n\n[10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal of Computing, 32:48–77, 2002.\n\n[11] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.\n\n[12] M. Sugiyama. Active learning for misspecified models. In Advances in Neural Information Processing Systems 18, 2005.\n\n[13] F. Bach. Active learning for misspecified generalized linear models. In Advances in Neural Information Processing Systems 19, 2006.\n\n[14] S. Hanneke. A bound on the label complexity of agnostic active learning. In Twenty-Fourth International Conference on Machine Learning, 2007.\n\n[15] M. Kääriäinen. Active learning in the non-realizable case. In Seventeenth International Conference on Algorithmic Learning Theory, 2006.\n\n[16] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135–166, 2004.\n\n[17] R. Castro and R. Nowak. Upper and lower bounds for active learning. In Allerton Conference on Communication, Control and Computing, 2006.\n\n[18] R. Castro and R. Nowak. Minimax bounds for active learning. In Twentieth Annual Conference on Learning Theory, 2007.\n\n[19] T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Eighteenth Annual Conference on Learning Theory, 2005.\n\n[20] E. Friedman. Active learning for smooth problems. In Twenty-Second Annual Conference on Learning Theory, 2009.\n\n[21] L. Wang. Sufficient conditions for agnostic active learnable. In Advances in Neural Information Processing Systems 22, 2009.\n\n[22] S. 
Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Twenty-Fifth International Confer-\n\nence on Machine Learning, 2008.\n\n9\n\n\f", "award": [], "sourceid": 363, "authors": [{"given_name": "Alina", "family_name": "Beygelzimer", "institution": null}, {"given_name": "Daniel", "family_name": "Hsu", "institution": ""}, {"given_name": "John", "family_name": "Langford", "institution": ""}, {"given_name": "Tong", "family_name": "Zhang", "institution": ""}]}