{"title": "Efficient and Parsimonious Agnostic Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2755, "page_last": 2763, "abstract": "We develop a new active learning algorithm for the streaming settingsatisfying three important properties: 1) It provably works for anyclassifier representation and classification problem including thosewith severe noise. 2) It is efficiently implementable with an ERMoracle. 3) It is more aggressive than all previous approachessatisfying 1 and 2. To do this, we create an algorithm based on a newlydefined optimization problem and analyze it. We also conduct the firstexperimental analysis of all efficient agnostic active learningalgorithms, evaluating their strengths and weaknesses in differentsettings.", "full_text": "Ef\ufb01cient and Parsimonious Agnostic Active Learning\n\nTzu-Kuo Huang\n\nMicrosoft Research, NYC\n\nAlekh Agarwal\n\nMicrosoft Research, NYC\n\ntkhuang@microsoft.com\n\nalekha@microsoft.com\n\nDaniel Hsu\n\nColumbia University\ndjhsu@cs.columbia.edu\n\nJohn Langford\n\nMicrosoft Research, NYC\n\njcl@microsoft.com\n\nRobert E. Schapire\n\nMicrosoft Research, NYC\n\nschapire@microsoft.com\n\nAbstract\n\nWe develop a new active learning algorithm for the streaming setting satisfying\nthree important properties: 1) It provably works for any classi\ufb01er representation\nand classi\ufb01cation problem including those with severe noise. 2) It is ef\ufb01ciently\nimplementable with an ERM oracle. 3) It is more aggressive than all previous\napproaches satisfying 1 and 2. To do this, we create an algorithm based on a\nnewly de\ufb01ned optimization problem and analyze it. 
We also conduct the first experimental analysis of all efficient agnostic active learning algorithms, evaluating their strengths and weaknesses in different settings.

1 Introduction

Given a label budget, what is the best way to learn a classifier?
Active learning approaches to this question are known to yield exponential improvements over supervised learning under strong assumptions [7]. Under much weaker assumptions, streaming-based agnostic active learning [2, 4, 5, 9, 18] is particularly appealing since it is known to work for any classifier representation and any label distribution with an i.i.d. data source.1 Here, a learning algorithm decides for each unlabeled example in sequence whether or not to request a label, never revisiting this decision. Restated then: What is the best possible active learning algorithm which works for any classifier representation, any label distribution, and is computationally tractable?
Computational tractability is a critical concern, because most known algorithms for this setting [e.g., 2, 16, 18] require explicit enumeration of classifiers, implying exponentially worse computational complexity compared to typical supervised learning algorithms. Active learning algorithms based on empirical risk minimization (ERM) oracles [4, 5, 13] can overcome this intractability by using passive classification algorithms as the oracle to achieve a computationally acceptable solution.
Achieving generality, robustness, and acceptable computation has a cost. For the above methods [4, 5, 13], a label is requested on nearly every unlabeled example where two empirically good classifiers disagree. This results in a poor label complexity, well short of information-theoretic limits [6] even for general robust solutions [18].
Until now.
In Section 3, we design a new algorithm called ACTIVE COVER (AC) for constructing query probability functions that minimize the probability of querying inside the disagreement region—the set of points where good classifiers disagree—and never query otherwise. This requires a new algorithm that maintains a parsimonious cover of the set of empirically good classifiers. The cover is a result of solving an optimization problem (in Section 4) specifying the properties of a desirable query probability function. The cover size provides a practical knob between computation and label complexity, as demonstrated by the complexity analysis we present in Section 4.
Also in Section 3, we prove that AC effectively maintains a set of good classifiers, achieves good generalization error, and has a label complexity bound tighter than previous approaches. The label complexity bound depends on the disagreement coefficient [10], which does not completely capture the advantage of the algorithm. At the end of Section 3 we provide an example of a hard active learning problem where AC is substantially superior to previous tractable approaches. Together, these results show that AC is better, and sometimes substantially better, in theory.
Do agnostic active learning algorithms work in practice? No previous work has addressed this question empirically. Doing so is important because analysis cannot reveal the degree to which existing classification algorithms effectively provide an ERM oracle. We conduct an extensive study in Section 5 by simulating the interaction of the active learning algorithm with a streaming supervised dataset.

1See the monograph of Hanneke [11] for an overview of the existing literature, including alternative settings where additional assumptions are placed on the data source (e.g., separability) [8, 3, 1].
Results on a wide array of datasets show that agnostic active learning typically outperforms passive learning, and the magnitude of improvement depends on how carefully the active learning hyper-parameters are chosen.
More details (theory, proofs and empirical evaluation) are in the long version of this paper [14].

2 Preliminaries

Let P be a distribution over X × {±1}, and let H ⊆ {±1}^X be a set of binary classifiers, which we assume is finite for simplicity.2 Let EX[·] denote expectation with respect to X ∼ PX, the marginal of P over X. The expected error of a classifier h ∈ H is err(h) := Pr_(X,Y)∼P(h(X) ≠ Y), and the error minimizer is denoted by h∗ := arg min_{h∈H} err(h). The (importance weighted) empirical error of h ∈ H on a multiset S of importance weighted and labeled examples drawn from X × {±1} × R+ is err(h, S) := Σ_{(x,y,w)∈S} w · 1(h(x) ≠ y)/|S|. The disagreement region for a subset of classifiers A ⊆ H is DIS(A) := {x ∈ X | ∃h, h′ ∈ A such that h(x) ≠ h′(x)}. The regret of a classifier h ∈ H relative to another h′ ∈ H is reg(h, h′) := err(h) − err(h′), and the analogous empirical regret on S is reg(h, h′, S) := err(h, S) − err(h′, S). When the second classifier h′ in (empirical) regret is omitted, it is taken to be the (empirical) error minimizer in H.
A streaming-based active learner receives i.i.d. labeled examples (X1, Y1), (X2, Y2), . . . from P one at a time; each label Yi is hidden unless the learner decides on the spot to query it. The goal is to produce a classifier h ∈ H with low error err(h), while querying as few labels as possible.
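As a concrete illustration of the definitions in this section (not code from the paper), the importance-weighted empirical error err(h, S) and the disagreement region DIS(A) for a small, explicitly enumerated classifier set can be computed as below; representing classifiers as plain Python functions and restricting DIS to a finite pool of points are our own simplifications.

```python
def err(h, S):
    """Importance-weighted empirical error err(h, S) on a multiset S of
    (x, y, w) triples drawn from X x {+1, -1} x R+."""
    if not S:
        return 0.0
    return sum(w * (h(x) != y) for x, y, w in S) / len(S)

def DIS(A, xs):
    """Disagreement region of a set of classifiers A, restricted to the
    finite pool xs (the paper defines it over all of X)."""
    return {x for x in xs if len({h(x) for h in A}) > 1}
```

Note that a weight-zero example contributes nothing to err(h, S) while still counting toward |S|, matching the role of the ignored examples described later (Footnote 3).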
In the IWAL framework [4], a decision whether or not to query a label is made randomly: the learner picks a probability p ∈ [0, 1], and queries the label with that probability. Whenever p > 0, an unbiased error estimate can be produced using inverse probability weighting [12]. Specifically, for any classifier h, an unbiased estimator E of err(h) based on (X, Y) ∼ P and p is as follows: if Y is queried, then E = 1(h(X) ≠ Y)/p; else, E = 0. It is easy to check that E(E) = err(h). Thus, when the label is queried, we produce the importance weighted labeled example (X, Y, 1/p).3

3 Algorithm and Statistical Guarantees

Our new algorithm, shown as Algorithm 1, breaks the example stream into epochs. The algorithm admits any epoch schedule so long as the epoch lengths satisfy τm+1 ≤ 2τm. For technical reasons, we always query the first 3 labels to kick-start the algorithm. At the start of epoch m, AC computes a query probability function Pm : X → [0, 1] which will be used for sampling the data points to query during the epoch. This is done by maintaining a few objects of interest during each epoch in Step 4: (1) the best classifier hm+1 on the sample Z̃m collected so far, where Z̃m has a mix of queried and predicted labels; (2) a radius Δm, which is based on the level of concentration we want various empirical quantities to satisfy; and (3) the set Am+1 consisting of all the classifiers with empirical regret at most Δm on Z̃m. Within the epoch, Pm determines the probability of querying an example in the disagreement region for this set Am of "good" classifiers; examples outside this

2The assumption that H is finite can be relaxed to VC-classes using standard arguments.
3If the label is not queried, we produce an ignored example of weight zero; its only purpose is to maintain the correct count of querying opportunities.
This ensures that 1/|S| is the correct normalization in err(h, S).

Algorithm 1 ACTIVE COVER (AC)
input: Constants c1, c2, c3, confidence δ, error radius γ, parameters α, β, ξ for (OP), epoch schedule 0 = τ0 < 3 = τ1 < τ2 < τ3 < . . . < τM satisfying τm+1 ≤ 2τm for m ≥ 1.
initialize: epoch m = 0, Z̃0 := ∅, Δ0 := c1√ε1 + c2ε1 log 3, where εm := 32 log(|H|τm/δ)/τm.
1: Query the labels {Yi} of the first three unlabeled examples {Xi}, i = 1, 2, 3, and set A1 := H, P1 ≡ Pmin,1 = 1, and S = {(Xj, Yj, 1)}, j = 1, 2, 3.
2: for i = 4, . . . , n do
3:   if i = τm + 1 then
4:     Set Z̃m = Z̃m−1 ∪ S, and S = ∅. Let
         hm+1 := arg min_{h∈H} err(h, Z̃m),  Δm := c1√(εm err(hm+1, Z̃m)) + c2εm log τm,  and
         Am+1 := {h ∈ H | err(h, Z̃m) − err(hm+1, Z̃m) ≤ γΔm}.
5:     Compute the solution Pm+1(·) to the problem (OP) and increment m := m + 1.
6:   end if
7:   if next unlabeled point Xi ∈ Dm := DIS(Am) then
8:     Toss coin with bias Pm(Xi); add example (Xi, Yi, 1/Pm(Xi)) to S if outcome is heads, otherwise add (Xi, 1, 0) to S (see Footnote 3).
9:   else
10:    Add example with predicted label (Xi, hm(Xi), 1) to S.
11:  end if
12: end for
13: Return hM+1 := arg min_{h∈H} err(h, Z̃M).

region are not queried but given labels predicted by hm (so error estimates are not unbiased). AC computes Pm by solving the optimization problem (OP), which is further discussed below.
The objective function of (OP) encourages small query probabilities in order to minimize the label complexity. The constraints (1) in (OP) bound the variance in our importance-weighted regret estimates for every h ∈ H.
This is key to ensuring good generalization, as we will later use Bernstein-style bounds which rely on our random variables having a small variance. More specifically, the LHS of the constraints measures the variance in our empirical regret estimates for h, measured only on the examples in the disagreement region Dm. This is because the importance weights in the form of 1/Pm(X) are only applied to these examples; outside this region we use the predicted labels with an importance weight of 1. The RHS of the constraint consists of three terms. The first term ensures the feasibility of the problem, as P(X) ≡ 1/(2α²) for X ∈ Dm will always satisfy the constraints. The second empirical regret term makes the constraints easy to satisfy for bad hypotheses—this is crucial to rule out large label complexities in case there are bad hypotheses that disagree very often with hm. A benefit of this is easily seen when −hm ∈ H, which might have a terrible regret, but would force a near-constant query probability on the disagreement region if β = 0. Finally, the third term will be on the same order as the second one for hypotheses in Am, and is only included to capture the allowed level of slack in our constraints, which will be exploited for the efficient implementation in Section 4. In addition to controlled variance, good concentration also requires the random variables of interest to be appropriately bounded. This is ensured through the constraints (2), which impose a minimum query probability on the disagreement region. Outside the disagreement region, we use the predicted label with an importance weight of 1, so that our estimates will always be bounded (albeit biased) in this region. Note that this optimization problem is written with respect to the marginal distribution of the data points PX, meaning that we might have infinitely many of the latter constraints.
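To make the feasibility claim explicit: with the constant choice P(X) ≡ 1/(2α²) on Dm, the left-hand side of each variance constraint (1) evaluates, for every h ∈ H, to

```latex
\mathbb{E}_X\!\left[\frac{I^m_h(X)}{P(X)}\right]
  = 2\alpha^2\,\mathbb{E}_X\!\left[I^m_h(X)\right]
  \le b_m(h),
```

since b_m(h) contains the term 2α²E_X[I^m_h(X)], the empirical regret term is nonnegative (hm is the empirical error minimizer on Z̃m−1), and the ξ term is nonnegative.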
In Section 4, we describe how to solve this optimization problem efficiently, using access only to unlabeled examples drawn from PX.
Algorithm 1 requires several input parameters, which must satisfy:

α ≥ 1,  ξ ≤ 1/(8nεM log n),  β² ≤ 1/(γnεM log n),  γ ≥ 216,  c1 ≥ 2α√6,  c2 ≥ 216c1²,  c3 ≥ 1.

The first three parameters, α, β and ξ, control the tightness of the variance constraints (1). The next three parameters, γ, c1 and c2, control the threshold that defines the set of empirically good classifiers; c3 is used in the minimum probability (4) and can simply be set to 1.

Optimization Problem (OP) to compute Pm

min_P  EX[ 1/(1 − P(X)) ]
s.t.   ∀h ∈ H:  EX[ I^m_h(X)/P(X) ] ≤ bm(h),                                   (1)
       ∀x ∈ X: 0 ≤ P(x) ≤ 1,  and  ∀x ∈ Dm: P(x) ≥ Pmin,m,                      (2)

where I^m_h(x) = 1(h(x) ≠ hm(x) ∧ x ∈ Dm),
bm(h) = 2α²EX[I^m_h(X)] + 2β²γ reg(h, hm, Z̃m−1)τm−1Δm−1 + ξτm−1Δ²m−1,           (3)
Pmin,m = min( (c3/√(nεM)) · √( τm−1 err(hm, Z̃m−1) + log τm−1 ), 1/2 ).          (4)

Epoch Schedules: The algorithm takes an arbitrary epoch schedule subject to τm < τm+1 ≤ 2τm. Two natural extremes are unit-length epochs, τm = m, and doubling epochs, τm+1 = 2τm. The main difference lies in the number of times (OP) is solved, which is a substantial computational consideration. Unless otherwise stated, we assume the doubling epoch schedule, where the query probability and ERM classifier are recomputed only O(log n) times.

Generalization and Label Complexity.
We present guarantees on the generalization error and label complexity of Algorithm 1 assuming a solver for (OP), which we provide in the next section. Our first theorem provides a bound on generalization error. Define

errm(h) := (1/τm) Σ_{j=1}^{m} (τj − τj−1) E_(X,Y)∼P[1(h(X) ≠ Y ∧ X ∈ DIS(Aj))],
Δ∗0 := Δ0  and  Δ∗m := c1√(εm errm(h∗)) + c2εm log τm for m ≥ 1.

Essentially Δ∗m is a population counterpart of the quantity Δm used in Algorithm 1, and crucially relies on errm(h∗), the true error of h∗ restricted to the disagreement region at epoch m. This quantity captures the inherent noisiness of the problem, and modulates the transition between O(1/√n) and O(1/n) type error bounds, as we see next.
Theorem 1. Pick any 0 < δ < 1/e such that |H|/δ > √192. Then, recalling that h∗ = arg min_{h∈H} err(h), we have for all epochs m = 1, 2, . . . , M, with probability at least 1 − δ:

reg(h, h∗) ≤ 16γΔ∗m for all h ∈ Am+1,   (5)
and reg(h∗, hm+1, Z̃m) ≤ 216Δm.          (6)

The proof is in Section 7.2.2 of [14]. Since we use γ ≥ 216, the bound (6) implies that h∗ ∈ Am for all epochs m. This also maintains that all the predicted labels used by our algorithm are identical to those of h∗, since no disagreement amongst classifiers in Am was observed on those examples. This observation will be critical to our proofs, where we will exploit the fact that using labels predicted by h∗ instead of observed labels on certain examples only introduces a bias in favor of h∗, thereby ensuring that we never mistakenly drop the optimal classifier from Am. The bound (5) shows that every classifier in Am+1 has a small regret to h∗.
Since the ERM classifier hm+1 is always in Am+1, this yields our main generalization error bound on the classifier hM+1 output by Algorithm 1. Additionally, it also clarifies the definition of the sets Am as the sets of good classifiers: these are classifiers which indeed have small population regret relative to h∗. In a realizable setting where h∗ has zero error, Δ∗m = Õ(1/τm), leading to a Õ(1/n) regret after n unlabeled examples are presented to the algorithm. On the other extreme, if errm(h∗) is a constant, then the regret is O(1/√n). There are also interesting regimes in between, where err(h∗) might be a constant, but errm(h∗) measured over the disagreement region decreases rapidly. More specifically, we show in Appendix E of [14] that the expected regret of the classifier returned by Algorithm 1 achieves the optimal rate [6] under the Tsybakov [17] noise condition.
Next, we provide a label complexity guarantee in terms of the disagreement coefficient [11]:

θ = θ(h∗) := sup_{r>0} PX{x | ∃h ∈ H s.t. h∗(x) ≠ h(x), PX{x′ | h(x′) ≠ h∗(x′)} ≤ r}/r.

Theorem 2. With probability at least 1 − δ, the number of label queries made by Algorithm 1 after n examples over M epochs is 4θ errM(h∗)n + θ · Õ(√(n errM(h∗) log(|H|/δ)) + log(|H|/δ)).

The theorem is proved in Appendix D of [14]. The first term of the label complexity bound is linear in the number of unlabeled examples, but can be quite small if θ is small, or if errM(h∗) ≈ 0—it is indeed 0 in the realizable setting.
The second term grows at most as Õ(√n), but also becomes a constant for realizable problems. Consequently, we attain a logarithmic label complexity in the realizable setting. In noisy settings, our label complexity improves upon that of predecessors such as [5, 13]. Beygelzimer et al. [5] obtain a label complexity of θ√n, exponentially worse for realizable problems. A related algorithm, Oracular CAL [13], has label complexity scaling with √(n err(h∗)) but a worse dependence on θ. In all comparisons, the use of errM(h∗) provides a qualitatively superior analysis to all previous results depending on err(h∗), since this captures the fact that noisy labels outside the disagreement region do not affect the label complexity. Finally, as in our regret analysis, we show in Appendix E of [14] that the label complexity of Algorithm 1 achieves the information-theoretic lower bound [6] under Tsybakov's low-noise condition [17].
Section 4.2.2 of [14] gives an example where the label complexity of Algorithm 1 is significantly smaller than both IWAL and Oracular CAL by virtue of rarely querying in the disagreement region. The example considers a distribution and a classifier space with the following structure: (i) for most examples a single good classifier predicts differently from the remaining classifiers; (ii) on a few examples, half the classifiers predict one way and half the other. In the first case, little advantage is gained from a label because it provides evidence against only a single classifier. ACTIVE COVER queries over the disagreement region with a probability close to Pmin in case (i) and probability 1 in case (ii), while others query with probability Ω(1) everywhere, implying O(√n) times more queries.

4 Efficient implementation

The computation of hm is an ERM operation, which can be performed efficiently whenever an efficient passive learner is available. However, several other hurdles remain.
Testing for x ∈ DIS(Am) in the algorithm, as well as finding a solution to (OP), are considerably more challenging. The epoch schedule helps, but (OP) is still solved O(log n) times, necessitating an extremely efficient solver.
Starting with the first issue, we follow Dasgupta et al. [9], who cleverly observed that x ∈ Dm := DIS(Am) can be efficiently determined using a single call to an ERM oracle. Specifically, to apply their method, we use the oracle to find4 h′ = arg min{err(h, Z̃m−1) | h ∈ H, h(x) ≠ hm(x)}. It can then be argued that x ∈ Dm = DIS(Am) if and only if the easily-measured regret of h′ (that is, reg(h′, hm, Z̃m−1)) is at most γΔm−1. Solving (OP) efficiently is a much bigger challenge because it is enormous: There is one variable P(x) for every point x ∈ X, one constraint (1) for each classifier h, and bound constraints (2) on P(x) for every x. This leads to infinitely many variables and constraints, with an ERM oracle being the only computational primitive available.
We eliminate the bound constraints using barrier functions. Notice that the objective EX[1/(1 − P(X))] is already a barrier at P(x) = 1. To enforce the lower bound (2), we modify the objective to

EX[ 1/(1 − P(X)) ] + µ² EX[ 1(X ∈ Dm)/P(X) ],   (7)

where µ is a parameter chosen momentarily to ensure P(x) ≥ Pmin,m for all x ∈ Dm. Thus, the modified goal is to minimize (7) over non-negative P subject only to (1). We solve the problem in the dual, where we have a large but finite number of optimization variables, and efficiently maximize the dual using coordinate ascent with access to an ERM oracle over H.
Let λh ≥ 0 denote the Lagrange multiplier for the constraint (1) for classifier h. Then for any λ, we can minimize the Lagrangian over each primal variable P(X), yielding the solution

Pλ(x) = 1(x ∈ Dm) · qλ(x)/(1 + qλ(x)),  where  qλ(x) = √( µ² + Σ_{h∈H} λh I^m_h(x) )   (8)

and I^m_h(x) = 1(h(x) ≠ hm(x) ∧ x ∈ Dm). Clearly, µ/(1 + µ) ≤ Pλ(x) ≤ 1 for all x ∈ Dm, so all the bound constraints (2) in (OP) are satisfied if we choose µ = 2Pmin,m. Plugging the solution Pλ into the Lagrangian, we obtain the dual problem of maximizing the dual objective

D(λ) = EX[ 1(X ∈ Dm)(1 + qλ(X))² ] − Σ_{h∈H} λh bm(h) + C0   (9)

over λ ≥ 0. The constant C0 is equal to 1 − Pr(Dm), where Pr(Dm) = Pr(X ∈ Dm). An algorithm to approximately solve this problem is presented in Algorithm 2. The algorithm takes a parameter ε > 0 specifying the degree to which all of the constraints (1) are to be approximated.

Algorithm 2 Coordinate ascent algorithm to solve (OP)
input: Accuracy parameter ε > 0. initialize: λ ← 0.
1: loop
2:   Rescale: λ ← s · λ, where s = arg max_{s∈[0,1]} D(s · λ).
3:   Find h̄ = arg max_{h∈H} EX[I^m_h(X)/Pλ(X)] − bm(h).
4:   if EX[I^m_h̄(X)/Pλ(X)] − bm(h̄) ≤ ε then
5:     return λ
6:   else
7:     Update λh̄ ← λh̄ + 2( EX[I^m_h̄(X)/Pλ(X)] − bm(h̄) ) / EX[I^m_h̄(X)/qλ(X)³].
8:   end if
9: end loop

4See Appendix F of [15] for how to deal with one constraint with an unconstrained oracle.
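As an illustrative sketch only (not the authors' implementation): with a small, explicitly enumerated H, a finite unlabeled sample standing in for EX, and a crude grid search in place of the exact rescaling line search, the coordinate-ascent scheme can be mimicked as below. The matrix I, the mask in_D, and all names are our own hypothetical setup; in the paper, Step 3 is a single ERM-oracle call rather than a loop over H.

```python
import numpy as np

def solve_op(I, in_D, b, mu, eps, max_iter=1000):
    """Coordinate ascent on the (sample version of the) dual objective (9).

    I    : (|H|, n) 0/1 array, I[h, i] = 1(h disagrees with h_m on x_i, x_i in D_m)
    in_D : (n,) boolean mask for the disagreement region D_m
    b    : (|H|,) constraint levels b_m(h)
    mu   : barrier parameter (mu = 2 * P_min,m in the paper)
    eps  : additive slack allowed in the variance constraints (1)
    """
    lam = np.zeros(I.shape[0])

    def q(l):
        # q_lambda(x) = sqrt(mu^2 + sum_h lambda_h * I_h(x)), from eq. (8)
        return np.sqrt(mu**2 + l @ I)

    def D(l):
        # Sample dual objective (9); the constant C0 does not affect the argmax.
        return np.mean(in_D * (1.0 + q(l)) ** 2) - l @ b

    for _ in range(max_iter):
        # Step 2: rescale lambda by the best s in [0, 1] (grid search here).
        s = max(np.linspace(0.0, 1.0, 51), key=lambda s: D(s * lam))
        lam = s * lam
        # Step 3: most violated constraint; I/P vanishes off D_m since I does.
        ql = q(lam)
        P = ql / (1.0 + ql)
        viol = (I / P).mean(axis=1) - b
        h = int(np.argmax(viol))
        if viol[h] <= eps:          # Step 4: all constraints hold up to eps
            return lam, P
        # Step 7: coordinate update with step 2 * violation / E[I_h / q^3].
        lam[h] += 2.0 * viol[h] / np.mean(I[h] / ql**3)
    return lam, q(lam) / (1.0 + q(lam))
```

On exit, P plays the role of Pλ̂ on Dm; the indicator in eq. (8) is left implicit because I is zero outside Dm, and by construction P ≥ µ/(1 + µ) everywhere, matching the bound constraints (2) when µ = 2Pmin,m.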
Since D is concave, the rescaling step can be solved using a straightforward numerical line search. The main implementation challenge is in finding the most violated constraint (Step 3). Fortunately, this step can be reduced to a single call to an ERM oracle. To see this, note that the constraint violation on classifier h can be written as

EX[ I^m_h(X)/P(X) ] − bm(h) = EX[ (1/P(X) − 2α²) · 1(h(X) ≠ hm(X)) · 1(X ∈ Dm) ]
  − 2β²γτm−1Δm−1(err(h, Z̃m−1) − err(hm, Z̃m−1)) − ξτm−1Δ²m−1.

The second term of the right-hand expression is simply the scaled risk (classification error) of h with respect to the actual labels. The first term is the risk of h in predicting samples which have been labeled according to hm, with importance weights of 1/P(x) − 2α² if x ∈ Dm and 0 otherwise; note that these weights may be positive or negative. The last two terms do not depend on h. Thus, given access to PX (or samples approximating it, discussed shortly), the most violated constraint can be found by solving an ERM problem defined on the labeled samples in Z̃m−1 and samples drawn from PX labeled by hm, with appropriate importance weights detailed in Appendix F.1 of [14]. When all primal constraints are approximately satisfied, the algorithm stops. We have the following guarantee on the convergence of the algorithm.
Theorem 3.
When run on the m-th epoch, Algorithm 2 halts in at most Pr(Dm)/(8P³min,m ε²) iterations and outputs a solution λ̂ ≥ 0 such that Pλ̂ satisfies the simple bound constraints in (2) exactly, the variance constraints in (1) up to an additive factor of ε, and

EX[ 1/(1 − Pλ̂(X)) ] ≤ EX[ 1/(1 − P∗(X)) ] + 4Pmin,m Pr(Dm),   (10)

where P∗ is the solution to (OP). Furthermore, ‖λ̂‖1 ≤ Pr(Dm)/ε.

If ε is set to ξ²τm−1Δ²m−1, an amount of constraint violation tolerable in our analysis, the number of iterations (hence the number of ERM oracle calls) in Theorem 3 is at most O(τ²m−1). The proof is in Appendix F.2 of [14].

Table 1: Summary of performance metrics

            OAC     IWAL0   IWAL1   ORA-OAC  ORA-IWAL0  ORA-IWAL1  PASSIVE
AUC-GAIN∗   0.151   0.150   0.142   0.125    0.115      0.121      0.095
AUC-GAIN    0.065   0.085   0.081   0.078    0.073      0.075      0.072

Solving (OP) with expectation over samples: So far we considered solving (OP) defined on the unlabeled data distribution PX, which is unavailable in practice. A natural substitute for PX is an i.i.d. sample drawn from it. In Appendix F.3 of [14] we show that solving a properly-defined sample variant of (OP) leads to a solution to the original (OP) with similar guarantees as in Theorem 3.

5 Experiments with Agnostic Active Learning

While AC is efficient in the number of ERM oracle calls, it needs to store all past examples, resulting in large space complexity. As Theorem 3 suggests, the query probability function (8) may need as many as O(τ²i) classifiers, further increasing storage demand. Aiming at a scalable implementation, we consider an online approximation of AC, given in Section 6.1 of [14].
The main differences from AC are: (1) instead of a batch ERM oracle, it invokes an online oracle; and (2) instead of repeatedly solving (OP) from scratch, it maintains a fixed-size set of classifiers (and hence non-zero dual variables), called the cover, for representing the query probability, and updates the cover with every new example in a manner similar to the coordinate ascent algorithm for solving (OP). We conduct an empirical comparison of the following efficient agnostic active learning algorithms:
OAC: Online approximation of ACTIVE COVER (Algorithm 3 in Section 6.1 of [14]).
IWAL0 and IWAL1: The algorithm of [5] and a variant that uses a tighter threshold.
ORA-OAC, ORA-IWAL0, and ORA-IWAL1: Oracular-CAL [13] versions of OAC, IWAL0 and IWAL1.
PASSIVE: Passive learning on a labeled sub-sample drawn uniformly at random.
Details about these algorithms are in Section 6.2 of [14]. The high-level differences among these algorithms are best explained in the context of the disagreement region: OAC does importance-weighted querying of labels with an optimized query probability in the disagreement region, while using predicted labels outside; IWAL0 and IWAL1 maintain a non-zero minimum query probability everywhere; ORA-OAC, ORA-IWAL0 and ORA-IWAL1 query labels in their respective disagreement regions with probability 1, using predicted labels otherwise.
We implemented these algorithms in Vowpal Wabbit (http://hunch.net/~vw/), a fast learning system based on online convex optimization, using logistic regression as the ERM oracle. We performed experiments on 22 binary classification datasets with varying sizes (10³ to 10⁶) and diverse feature characteristics. Details about the datasets are in Appendix G.1 of [14]. Our goal is to evaluate the test error improvement per label query achieved by different algorithms.
To simulate the streaming setting, we randomly permuted the datasets, ran the active learning algorithms through the first 80% of data, and evaluated the learned classifiers on the remaining 20%. We repeated this process 9 times to reduce variance due to random permutation. For each active learning algorithm, we obtain the test error rates of classifiers trained at doubling numbers of label queries, starting from 10 up to 10240. Formally, let error_{a,p}(d, j, q) denote the test error of the classifier returned by algorithm a using hyper-parameter setting p on the j-th permutation of dataset d immediately after hitting the q-th label budget, 10·2^(q−1), 1 ≤ q ≤ 11. Let query_{a,p}(d, j, q) be the actual number of label queries made, which can be smaller than 10·2^(q−1) when algorithm a reaches the end of the training data before hitting that label budget. To evaluate an algorithm, we consider the area under its curve of test error against log number of label queries:

AUC_{a,p}(d, j) = Σ_{q=1}^{10} (1/2)( error_{a,p}(d, j, q + 1) + error_{a,p}(d, j, q) ) · log2( query_{a,p}(d, j, q + 1) / query_{a,p}(d, j, q) ).

A good active learning algorithm has a small value of AUC, which indicates that the test error decreases quickly as the number of label queries increases. We use a logarithmic scale for the number of label queries to focus on the performance with few label queries, where active learning is the most relevant. More details about hyper-parameters are in Appendix G.2 of [14].

(a) Best hyper-parameter per dataset    (b) Best fixed hyper-parameter

Figure 1: Average relative improvement in test error vs.
number of label queries.

We measure the performance of each algorithm a by the following two aggregated metrics:

AUC-GAIN∗(a) := mean_d max_p median_{1≤j≤9} { (AUC_base(d, j) − AUC_{a,p}(d, j)) / AUC_base(d, j) },
AUC-GAIN(a)  := max_p mean_d median_{1≤j≤9} { (AUC_base(d, j) − AUC_{a,p}(d, j)) / AUC_base(d, j) },

where AUC_base denotes the AUC of PASSIVE using a default hyper-parameter setting, i.e., a learning rate of 0.4 (see Appendix G.2 of [14]). The first metric shows the maximal gain each algorithm achieves with the best hyper-parameter setting for each dataset, while the second shows the gain from using the single hyper-parameter setting that performs the best on average across datasets.
Results and Discussion. Table 1 gives a summary of the performance of the different algorithms. When using hyper-parameters optimized on a per-dataset basis (top row in Table 1), OAC achieves the largest improvement over the PASSIVE baseline, with IWAL0 achieving almost the same improvement and IWAL1 improving slightly less. Oracular-CAL variants perform worse, but still do better than PASSIVE with the best learning rate for each dataset, which leads to an average of 9.5% improvement in AUC over the default learning rate. When using the best fixed hyper-parameter setting across all datasets (bottom row in Table 1), all active learning algorithms achieve less improvement compared with PASSIVE (7% improvement with the best fixed learning rate). In particular, OAC gets only 6.5% improvement. This suggests that careful tuning of hyper-parameters is critical for OAC and an important direction for future work.
Figure 1(a) describes the behaviors of the different algorithms in more detail. For each algorithm a we identify the best fixed hyper-parameter setting
For each algorithm a, we identify the best fixed hyper-parameter setting
\[
p^* := \operatorname*{arg\,max}_{p} \; \operatorname*{mean}_{d} \; \operatorname*{median}_{1 \le j \le 9} \left\{ \frac{\mathrm{AUC}_{\mathrm{base}}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{\mathrm{base}}(d, j)} \right\}, \tag{11}
\]
and plot the relative test error improvement by a using p^*, averaged across all datasets, at the 11 label budgets:
\[
\left\{ \left( 10 \cdot 2^{q-1}, \; \operatorname*{mean}_{d} \; \operatorname*{median}_{1 \le j \le 9} \left\{ \frac{\mathrm{error}_{\mathrm{base}}(d, j, q) - \mathrm{error}_{a,p^*}(d, j, q)}{\mathrm{error}_{\mathrm{base}}(d, j, q)} \right\} \right) \right\}_{q=1}^{11}. \tag{12}
\]
All algorithms, including PASSIVE, perform similarly during the first few hundred label queries. IWAL0 performs the best at label budgets larger than 80, while IWAL1 does almost as well. ORA-OAC is the next best, followed by ORA-IWAL1 and ORA-IWAL0. OAC performs worse than PASSIVE except at label budgets between 320 and 1280. In Figure 1(b), we plot results obtained by each algorithm a using the best hyper-parameter setting for each dataset d:
\[
p^*_d := \operatorname*{arg\,max}_{p} \; \operatorname*{median}_{1 \le j \le 9} \left\{ \frac{\mathrm{AUC}_{\mathrm{base}}(d, j) - \mathrm{AUC}_{a,p}(d, j)}{\mathrm{AUC}_{\mathrm{base}}(d, j)} \right\}. \tag{13}
\]
As expected, all algorithms perform better, but OAC benefits the most from using the best hyper-parameter setting per dataset.
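The selections in Eqs. (11) and (13) are arg-max computations over the same relative-gain statistic, differing only in whether the mean over datasets is taken. A hedged sketch (assuming a nested-list layout auc_alg[d][p][j], auc_base[d][j]; these names are illustrative, not from the paper):

```python
from statistics import median

def best_fixed_setting(auc_alg, auc_base):
    """p* of Eq. (11): the single hyper-parameter index maximizing the mean
    (over datasets) of the median (over permutations) relative AUC gain."""
    def score(p):
        gains = [
            median((auc_base[d][j] - auc_alg[d][p][j]) / auc_base[d][j]
                   for j in range(len(auc_base[d])))
            for d in range(len(auc_alg))
        ]
        return sum(gains) / len(gains)
    return max(range(len(auc_alg[0])), key=score)

def best_setting_per_dataset(auc_alg, auc_base, d):
    """p*_d of Eq. (13): the best hyper-parameter index for dataset d alone."""
    def score(p):
        return median((auc_base[d][j] - auc_alg[d][p][j]) / auc_base[d][j]
                      for j in range(len(auc_base[d])))
    return max(range(len(auc_alg[d])), key=score)
```

Eq. (12)'s curve is then obtained by evaluating the same median-over-permutations, mean-over-datasets relative gain of the test errors at each of the 11 budgets using the index returned by best_fixed_setting.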
Appendix G.3 of [14] gives more detailed results, including test error rates obtained by all algorithms at different label query budgets for individual datasets.
In sum, when using the best fixed hyper-parameter setting, IWAL0 outperforms the other algorithms. When using the best hyper-parameter setting tuned for each dataset, OAC and IWAL0 perform equally well and better than the other algorithms.

References

[1] Maria-Florina Balcan and Phil Long. Active and passive learning of linear separators under log-concave distributions. In Conference on Learning Theory, pages 288–316, 2013.

[2] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65–72. ACM, 2006.

[3] Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, pages 35–50. Springer-Verlag, 2007.

[4] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In ICML, 2009.

[5] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.

[6] R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.

[7] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201–221, 1994.

[8] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.

[9] S. Dasgupta, D. Hsu, and C. Monteleoni.
A general agnostic active learning algorithm. In NIPS, 2007.

[10] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009.

[11] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

[12] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc., 47:663–685, 1952.

[13] Daniel J. Hsu. Algorithms for Active Learning. PhD thesis, University of California at San Diego, 2010.

[14] Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. arXiv preprint arXiv:1506.08669, 2015.

[15] Nikos Karampatziakis and John Langford. Online importance weight aware updates. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), pages 392–399, 2011.

[16] Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. J. Mach. Learn. Res., 11:2457–2485, December 2010.

[17] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32:135–166, 2004.

[18] Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning.
In Advances in Neural Information Processing Systems, pages 442–450, 2014.