{"title": "Active Learning with a Drifting Distribution", "book": "Advances in Neural Information Processing Systems", "page_first": 2079, "page_last": 2087, "abstract": "We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time.  We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise.  We further prove minimax lower bounds for this problem.", "full_text": "Active Learning with a Drifting Distribution\n\nLiu Yang\n\nMachine Learning Department\n\nCarnegie Mellon University\n\nliuy@cs.cmu.edu\n\nAbstract\n\nWe study the problem of active learning in a stream-based setting, allowing the\ndistribution of the examples to change over time. We prove upper bounds on\nthe number of prediction mistakes and number of label requests for established\ndisagreement-based active learning algorithms, both in the realizable case and\nunder Tsybakov noise. We further prove minimax lower bounds for this problem.\n\n1 Introduction\nMost existing analyses of active learning are based on an i.i.d. assumption on the data. In this work,\nwe assume the data are independent, but we allow the distribution from which the data are drawn to\nshift over time, while the target concept remains fixed. We consider this problem in a stream-based\nselective sampling model, and are interested in two quantities: the number of mistakes the algorithm\nmakes on the first T examples in the stream, and the number of label requests among the first T\nexamples in the stream.\n\nIn particular, we study scenarios in which the distribution may drift within a fixed totally bounded\nfamily of distributions. 
Unlike previous models of distribution drift [Bar92, CMEDV10], the minimax\nnumber of mistakes (or excess number of mistakes, in the noisy case) can be sublinear in the\nnumber of samples.\n\nWe specifically study the classic CAL active learning strategy [CAL94] in this context, and bound\nthe number of mistakes and label requests the algorithm makes in the realizable case, under conditions\non the concept space and the family of possible distributions. We also exhibit lower bounds\non these quantities that match our upper bounds in certain cases. We further study a noise-robust\nvariant of CAL, and analyze its number of mistakes and number of label requests in noisy scenarios\nwhere the noise distribution remains fixed over time but the marginal distribution on X may shift.\nIn particular, we upper bound these quantities under Tsybakov\u2019s noise conditions [MT99]. We also\nprove minimax lower bounds under these same conditions, though there is a gap between our upper\nand lower bounds.\n\n2 Definition and Notations\nAs in the usual statistical learning problem, there is a standard Borel space X , called the instance\nspace, and a set C of measurable classifiers h : X \u2192 {\u22121, +1}, called the concept space. We\nadditionally have a space D of distributions on X , called the distribution space. Throughout, we\nsuppose that the VC dimension of C, denoted d below, is finite.\nFor any \u00b51, \u00b52 \u2208 D, let \u2016\u00b51 \u2212 \u00b52\u2016 = sup_A \u00b51(A) \u2212 \u00b52(A) denote the total variation pseudo-distance\nbetween \u00b51 and \u00b52, where the set A in the sup ranges over all measurable subsets of X . 
For any\n\u01eb > 0, let D\u01eb denote a minimal \u01eb-cover of D, meaning that D\u01eb \u2286 D and \u2200\u00b51 \u2208 D, \u2203\u00b52 \u2208 D\u01eb s.t.\n\u2016\u00b51 \u2212 \u00b52\u2016 < \u01eb, and that D\u01eb has minimal possible size |D\u01eb| among all subsets of D with this property.\nIn the learning problem, there is an unobservable sequence of distributions D1, D2, . . ., with each\nDt \u2208 D, and an unobservable time-independent regular conditional distribution, which we represent\nby a function \u03b7 : X \u2192 [0, 1]. Based on these quantities, we let Z = {(Xt, Yt)}_{t=1}^{\u221e} denote an infinite\nsequence of independent random variables, such that \u2200t, Xt \u223c Dt, and the conditional distribution\nof Yt given Xt satisfies \u2200x \u2208 X , P(Yt = +1|Xt = x) = \u03b7(x). Thus, the joint distribution of\n(Xt, Yt) is specified by the pair (Dt, \u03b7), and the distribution of Z is specified by the collection\n{Dt}_{t=1}^{\u221e} along with \u03b7. We also denote by Zt = {(X1, Y1), (X2, Y2), . . . , (Xt, Yt)} the first t\nsuch labeled examples. 
Note that the \u03b7 conditional distribution is time-independent, since we are\nrestricting ourselves to discussing drifting marginal distributions on X , rather than drifting concepts.\nConcept drift is an important and interesting topic, but is beyond the scope of our present discussion.\nIn the active learning protocol, at each time t, the algorithm is presented with the value Xt, and\nis required to predict a label \u02c6Yt \u2208 {\u22121, +1}; then after making this prediction, it may optionally\nrequest to observe the true label value Yt; as a means of book-keeping, if the algorithm requests a\nlabel Yt on round t, we define Qt = 1, and otherwise Qt = 0.\n\nWe are primarily interested in two quantities. The first, \u02c6MT = \u2211_{t=1}^{T} I[\u02c6Yt \u2260 Yt], is the cumulative\nnumber of mistakes up to time T . The second quantity of interest, \u02c6QT = \u2211_{t=1}^{T} Qt, is the total\nnumber of labels requested up to time T . In particular, we will study the expectations of these\nquantities: \u00afMT = E[\u02c6MT] and \u00afQT = E[\u02c6QT]. We are particularly interested in the asymptotic\ndependence of \u00afQT and \u00afMT \u2212 \u00afM\u2217_T on T , where \u00afM\u2217_T = inf_{h\u2208C} E[\u2211_{t=1}^{T} I[h(Xt) \u2260 Yt]]. We refer\nto \u00afQT as the expected number of label requests, and to \u00afMT \u2212 \u00afM\u2217_T as the expected excess number\nof mistakes. For any distribution P on X , we define erP (h) = E_{X\u223cP} [\u03b7(X)I[h(X) = \u22121] + (1 \u2212\n\u03b7(X))I[h(X) = +1]], the probability of h making a mistake for X \u223c P and Y with conditional\nprobability of being +1 equal \u03b7(X). 
Note that, abbreviating ert(h) = erDt (h) = P(h(Xt) \u2260 Yt),\nwe have \u00afM\u2217_T = inf_{h\u2208C} \u2211_{t=1}^{T} ert(h).\nScenarios in which both \u00afMT \u2212 \u00afM\u2217_T and \u00afQT are o(T ) (i.e., sublinear) are considered desirable, as\nthese represent cases in which we do \u201clearn\u201d the proper way to predict labels, while asymptotically\nusing far fewer labels than passive learning. Once we have established conditions under which this is\npossible, we may then further explore the trade-off between these two quantities.\nWe will additionally make use of the following notions. For V \u2286 C, let diam_t(V ) =\nsup_{h,g\u2208V} Dt({x : h(x) \u2260 g(x)}). For h : X \u2192 {\u22121, +1}, \u00afer_{s:t}(h) = (1/(t\u2212s+1)) \u2211_{u=s}^{t} eru(h),\nand for finite S \u2286 X \u00d7 {\u22121, +1}, \u02c6er(h; S) = (1/|S|) \u2211_{(x,y)\u2208S} I[h(x) \u2260 y]. Also let C[S] = {h \u2208 C :\n\u02c6er(h; S) = 0}. Finally, for a distribution P on X and r > 0, define BP (h, r) = {g \u2208 C : P (x :\nh(x) \u2260 g(x)) \u2264 r}.\n\n2.1 Assumptions\nIn addition to the assumption of independence of the Xt variables and that d < \u221e, each result\nbelow is stated under various additional assumptions. The weakest such assumption is that D is\ntotally bounded, in the following sense. For each \u01eb > 0, let D\u01eb denote a minimal subset of D such\nthat \u2200D \u2208 D, \u2203D\u2032 \u2208 D\u01eb s.t. \u2016D \u2212 D\u2032\u2016 < \u01eb: that is, a minimal \u01eb-cover of D. We say that D is totally\nbounded if it satisfies the following assumption.\nAssumption 1. \u2200\u01eb > 0, |D\u01eb| < \u221e.\nIn some of the results below, we will be interested in deriving specific rates of convergence. Doing so\nrequires us to make stronger assumptions about D than mere total boundedness. We will specifically\nconsider the following condition, in which c, m \u2208 [0,\u221e) are constants.\nAssumption 2. 
\u2200\u01eb > 0, |D\u01eb| < c \u00b7 \u01eb^{\u2212m}.\nFor an example of a class D satisfying the total boundedness assumption, consider X = [0, 1]^n, and\nlet D be the collection of distributions that have a uniformly continuous density function with respect\nto the Lebesgue measure on X , with modulus of continuity at most some value \u03c9(\u01eb) for each value\nof \u01eb > 0, where \u03c9(\u01eb) is a fixed real-valued function with lim_{\u01eb\u21920} \u03c9(\u01eb) = 0.\nAs a more concrete example, when \u03c9(\u01eb) = L\u01eb for some L \u2208 (0,\u221e), this corresponds to the family\nof Lipschitz continuous density functions with Lipschitz constant at most L. In this case, we have\n|D\u01eb| \u2264 O(\u01eb^{\u2212n}), satisfying Assumption 2.\n\n3 Related Work\nWe discuss active learning under distribution drift, with fixed target concept. There are several\nbranches of the literature that are highly relevant to this, including domain adaptation [MMR09,\nMMR08], online learning [Lit88], learning with concept drift, and empirical processes for independent\nbut not identically distributed data [vdG00].\n\nStream-based Active Learning with a Fixed Distribution [DKM09] show that a certain modified\nperceptron-like active learning algorithm can achieve a mistake bound O(d log(T )) and query\nbound \u02dcO(d log(T )), when learning a linear separator under a uniform distribution on the unit sphere,\nin the realizable case. [DGS10] also analyze the problem of learning linear separators under a uniform\ndistribution, but allowing Tsybakov noise. 
They find that with \u00afQT = \u02dcO(d^{2\u03b1/(\u03b1+2)} T^{2/(\u03b1+2)}) queries,\nit is possible to achieve an expected excess number of mistakes \u00afMT \u2212 M\u2217_T = \u02dcO(d^{(\u03b1+1)/(\u03b1+2)} \u00b7 T^{1/(\u03b1+2)}).\nAt this time, we know of no work studying the number of mistakes and queries achievable by active\nlearning in a stream-based setting where the distribution may change over time.\n\nStream-based Passive Learning with a Drifting Distribution There has been work on learning\nwith a drifting distribution and fixed target, in the context of passive learning. [Bar92, BL97] study\nthe problem of learning a subset of a domain from randomly chosen examples when the probability\ndistribution of the examples changes slowly but continually throughout the learning process; they\ngive upper and lower bounds on the best achievable probability of misclassification after a given\nnumber of examples. They consider learning problems in which a changing environment is modeled\nby a slowly changing distribution on the product space. The allowable drift is restricted by ensuring\nthat consecutive probability distributions are close in total variation distance. However, this assumption\nallows for certain malicious choices of distribution sequences, which shift the probability mass\ninto smaller and smaller regions where the algorithm is uncertain of the target\u2019s behavior, so that\nthe number of mistakes grows linearly in the number of samples in the worst case. More recently,\n[FM97] have investigated learning when the distribution changes as a linear function of time. They\npresent algorithms that estimate the error of functions, using knowledge of this linear drift.\n\n4 Active Learning in the Realizable Case\nThroughout this section, suppose C is a fixed concept space and h\u2217 \u2208 C is a fixed target function:\nthat is, ert(h\u2217) = 0. 
The family of scenarios in which this is true is often collectively referred\nto as the realizable case. We begin our analysis by studying this realizable case because it greatly\nsimplifies the analysis, laying bare the core ideas in plain form. We will discuss more general\nscenarios, in which ert(h\u2217) \u2265 0, in later sections, where we find that essentially the same principles\napply there as in this initial realizable-case analysis.\n\nWe will be particularly interested in the performance of the following simple algorithm, due to\n[CAL94], typically referred to as CAL after its discoverers. The version presented here is specified in\nterms of a passive learning subroutine A (mapping any sequence of labeled examples to a classifier).\nIn it, we use the notation DIS(V ) = {x \u2208 X : \u2203h, g \u2208 V s.t. h(x) \u2260 g(x)}, also used below.\nCAL\n1. t \u2190 0, Q0 \u2190 \u2205, and let \u02c6h0 = A(\u2205)\n2. Do\n3.   t \u2190 t + 1\n4.   Predict \u02c6Yt = \u02c6ht\u22121(Xt)\n5.   If max_{y\u2208{\u22121,+1}} min_{h\u2208C} \u02c6er(h; Qt\u22121 \u222a {(Xt, y)}) = 0\n6.     Request Yt, let Qt = Qt\u22121 \u222a {(Xt, Yt)}\n7.   Else let Y\u2032t = argmin_{y\u2208{\u22121,+1}} min_{h\u2208C} \u02c6er(h; Qt\u22121 \u222a {(Xt, y)}), and let Qt \u2190 Qt\u22121 \u222a {(Xt, Y\u2032t)}\n8.   Let \u02c6ht = A(Qt)\nBelow, we let A1IG denote the one-inclusion graph prediction strategy of [HLW94]. Specifically,\nthe passive learning algorithm A1IG is specified as follows. For a sequence of data points U \u2208 X^{t+1},\nthe one-inclusion graph is a graph, where each vertex represents a distinct labeling of U that can be\nrealized by some classifier in C, and two vertices are adjacent if and only if their corresponding\nlabelings for U differ by exactly one label. We use the one-inclusion graph to define a classifier\nbased on t training points as follows. Given t labeled data points L = {(x1, y1), . . 
. , (xt, yt)}, and\none test point xt+1 we are asked to predict a label for, we first construct the one-inclusion graph\non U = {x1, . . . , xt+1}; we then orient the graph (give each edge a unique direction) in a way that\nminimizes the maximum out-degree, and breaks ties in a way that is invariant to permutations of the\norder of points in U; after orienting the graph in this way, we examine the subset of vertices whose\ncorresponding labeling of U is consistent with L; if there is only one such vertex, then we predict for\nxt+1 the corresponding label from that vertex; otherwise, if there are two such vertices, then they are\nadjacent in the one-inclusion graph, and we choose the one toward which the edge is directed and\nuse the label for xt+1 in the corresponding labeling of U as our prediction for the label of xt+1. See\n[HLW94] and subsequent work for detailed studies of the one-inclusion graph prediction strategy.\n\n4.1 Learning with a Fixed Distribution\nWe begin the discussion with the simplest case: namely, when |D| = 1.\nDefinition 1. [Han07, Han11] Define the disagreement coefficient of h\u2217 under a distribution P as\n\n\u03b8P (\u01eb) = sup_{r>\u01eb} P (DIS(BP (h\u2217, r))) /r.\n\nTheorem 1. For any distribution P on X , if D = {P}, then running CAL with A =\nA1IG achieves expected mistake bound \u00afMT = O(d log(T )) and expected query bound \u00afQT =\nO(\u03b8P (\u01ebT ) d log^2(T )), for \u01ebT = d log(T )/T .\n\nFor completeness, the proof is included in the supplemental materials.\n\n4.2 Learning with a Drifting Distribution\nWe now generalize the above results to any sequence of distributions from a totally bounded space\nD. Throughout this section, let \u03b8D(\u01eb) = sup_{P\u2208D} \u03b8P (\u01eb).\nFirst, we prove a basic result stating that CAL can achieve a sublinear number of mistakes, and\nunder conditions on the disagreement coefficient, also a sublinear number of queries.\nTheorem 2. 
If D is totally bounded (Assumption 1), then CAL (with A any empirical risk minimization\nalgorithm) achieves an expected mistake bound \u00afMT = o(T ), and if \u03b8D(\u01eb) = o(1/\u01eb), then CAL\nmakes an expected number of queries \u00afQT = o(T ).\n\nProof. As mentioned, given that erQt\u22121 (h\u2217) = 0, we have that Y\u2032t in Step 7 must equal h\u2217(Xt),\nso that the invariant erQt (h\u2217) = 0 is maintained for all t by induction. In particular, this implies\nQt = Zt for all t.\nFix any \u01eb > 0, and enumerate the elements of D\u01eb so that D\u01eb = {P1, P2, . . . , P|D\u01eb|}. For each t \u2208 N,\nlet k(t) = argmin_{k\u2264|D\u01eb|} \u2016Pk \u2212 Dt\u2016, breaking ties arbitrarily. Let\n\nL(\u01eb) = \u2308(8/\u221a\u01eb)(d ln(24/\u221a\u01eb) + ln(4/\u221a\u01eb))\u2309.\n\nFor each i \u2264 |D\u01eb|, if k(t) = i for infinitely many t \u2208 N, then let Ti denote the smallest value of T\nsuch that |{t \u2264 T : k(t) = i}| = L(\u01eb). If k(t) = i only finitely many times, then let Ti denote the\nlargest index t for which k(t) = i, or Ti = 1 if no such index t exists.\nLet T\u01eb = max_{i\u2264|D\u01eb|} Ti and V\u01eb = C[Z_{T\u01eb}]. We have that \u2200t > T\u01eb, diam_t(V\u01eb) \u2264 diam_{k(t)}(V\u01eb) + \u01eb.\nFor each i, let Li be a sequence of L(\u01eb) i.i.d. pairs (X, Y ) with X \u223c Pi and Y = h\u2217(X), and let\nVi = C[Li]. 
Then \u2200t > T\u01eb,\nE[diam_{k(t)}(V\u01eb)] \u2264 E[diam_{k(t)}(V_{k(t)})] + \u2211_{s\u2264Ti : k(s)=k(t)} \u2016Ds \u2212 Pk(s)\u2016 \u2264 E[diam_{k(t)}(V_{k(t)})] + L(\u01eb)\u01eb.\nBy classic results in the theory of PAC learning [AB99, Vap82] and our choice of L(\u01eb), \u2200t >\nT\u01eb, E[diam_{k(t)}(V_{k(t)})] \u2264 \u221a\u01eb.\n\nCombining the above arguments,\n\nE[\u2211_{t=1}^{T} diam_t(C[Zt\u22121])] \u2264 T\u01eb + \u2211_{t=T\u01eb+1}^{T} E[diam_t(V\u01eb)] \u2264 T\u01eb + \u01ebT + \u2211_{t=T\u01eb+1}^{T} E[diam_{k(t)}(V\u01eb)]\n\u2264 T\u01eb + \u01ebT + L(\u01eb)\u01ebT + \u2211_{t=T\u01eb+1}^{T} E[diam_{k(t)}(V_{k(t)})]\n\u2264 T\u01eb + \u01ebT + L(\u01eb)\u01ebT + \u221a\u01eb T.\n\nLet \u01ebT be any nonincreasing sequence in (0, 1) such that 1 \u226a T_{\u01ebT} \u226a T . Since |D\u01eb| < \u221e for all\n\u01eb > 0, we must have \u01ebT \u2192 0. Thus, noting that lim_{\u01eb\u21920} L(\u01eb)\u01eb = 0, we have\n\nE[\u2211_{t=1}^{T} diam_t(C[Zt\u22121])] \u2264 T_{\u01ebT} + \u01ebT T + L(\u01ebT )\u01ebT T + \u221a\u01ebT T \u226a T. (1)\n\nThe result on \u00afMT now follows by noting that any \u02c6ht\u22121 \u2208 C[Zt\u22121] has ert(\u02c6ht\u22121) \u2264\ndiam_t(C[Zt\u22121]), so\n\n\u00afMT = E[\u2211_{t=1}^{T} ert(\u02c6ht\u22121)] \u2264 E[\u2211_{t=1}^{T} diam_t(C[Zt\u22121])] \u226a T.\n\nSimilarly, for r > 0, we have\nP(Request Yt) = E[P(Xt \u2208 DIS(C[Zt\u22121])|Zt\u22121)] \u2264 E[P(Xt \u2208 DIS(C[Zt\u22121] \u222a BDt (h\u2217, r)))]\n\u2264 E[\u03b8D(r) \u00b7 max{diam_t(C[Zt\u22121]), r}] \u2264 \u03b8D(r) \u00b7 r + \u03b8D(r) \u00b7 E[diam_t(C[Zt\u22121])].\nLetting rT = T^{\u22121} E[\u2211_{t=1}^{T} diam_t(C[Zt\u22121])], we see that rT \u2192 0 by (1), and since \u03b8D(\u01eb) =\no(1/\u01eb), we also have \u03b8D(rT )rT \u2192 0, so that \u03b8D(rT )rT T \u226a T . 
Therefore, \u00afQT equals\n\n\u2211_{t=1}^{T} P(Request Yt) \u2264 \u03b8D(rT ) \u00b7 rT \u00b7 T + \u03b8D(rT ) \u00b7 E[\u2211_{t=1}^{T} diam_t(C[Zt\u22121])] = 2\u03b8D(rT ) \u00b7 rT \u00b7 T \u226a T.\n\nWe can also state a more specific result in the case when we have some more detailed information\non the sizes of the finite covers of D.\nTheorem 3. If Assumption 2 is satisfied, then CAL (with A any empirical risk minimization\nalgorithm) achieves an expected mistake bound \u00afMT and expected number of queries \u00afQT such that \u00afMT =\nO(T^{m/(m+1)} d^{1/(m+1)} log^2 T) and \u00afQT = O(\u03b8D(\u01ebT ) T^{m/(m+1)} d^{1/(m+1)} log^2 T), where \u01ebT = (d/T )^{1/(m+1)}.\nProof. Fix \u01eb > 0, enumerate D\u01eb = {P1, P2, . . . , P|D\u01eb|}, and for each t \u2208 N, let k(t) =\nargmin_{1\u2264k\u2264|D\u01eb|} \u2016Dt \u2212 Pk\u2016. Let {X\u2032t}_{t=1}^{\u221e} be a sequence of independent samples, with X\u2032t \u223c Pk(t),\nand Z\u2032t = {(X\u20321, h\u2217(X\u20321)), . . . , (X\u2032t, h\u2217(X\u2032t))}. Then\n\nE[\u2211_{t=1}^{T} diam_t(C[Zt\u22121])] \u2264 E[\u2211_{t=1}^{T} diam_t(C[Z\u2032t\u22121])] + \u2211_{t=1}^{T} \u2016Dt \u2212 Pk(t)\u2016\n\u2264 E[\u2211_{t=1}^{T} diam_t(C[Z\u2032t\u22121])] + \u01ebT \u2264 \u2211_{t=1}^{T} E[diam_{Pk(t)}(C[Z\u2032t\u22121])] + 2\u01ebT.\n\nThe classic convergence rates results from PAC learning [AB99, Vap82] imply\n\n\u2211_{t=1}^{T} E[diam_{Pk(t)}(C[Z\u2032t\u22121])] = \u2211_{t=1}^{T} O(d log t / |{i \u2264 t : k(i) = k(t)}|)\n\u2264 O(d log T ) \u00b7 \u2211_{t=1}^{T} 1/|{i \u2264 t : k(i) = k(t)}| \u2264 O(d log T ) \u00b7 |D\u01eb| \u00b7 \u2211_{u=1}^{\u2308T/|D\u01eb|\u2309} 1/u \u2264 O(d|D\u01eb| log^2(T )).\n\nThus, \u2211_{t=1}^{T} E[diam_t(C[Zt\u22121])] \u2264 O(d|D\u01eb| log^2(T ) + \u01ebT) \u2264 O(d \u00b7 \u01eb^{\u2212m} log^2(T ) + \u01ebT).\nTaking \u01eb = (T/d)^{\u22121/(m+1)}, this is O(d^{1/(m+1)} \u00b7 T^{m/(m+1)} log^2(T )). We therefore have\n\n\u00afMT \u2264 E[\u2211_{t=1}^{T} sup_{h\u2208C[Zt\u22121]} ert(h)] \u2264 E[\u2211_{t=1}^{T} diam_t(C[Zt\u22121])] \u2264 O(d^{1/(m+1)} \u00b7 T^{m/(m+1)} log^2(T )).\n\nSimilarly, letting \u01ebT = (d/T )^{1/(m+1)}, \u00afQT is at most\n\nE[\u2211_{t=1}^{T} Dt(DIS(C[Zt\u22121]))] \u2264 E[\u2211_{t=1}^{T} Dt(DIS(BDt (h\u2217, max{diam_t(C[Zt\u22121]), \u01ebT})))]\n\u2264 E[\u2211_{t=1}^{T} \u03b8D(\u01ebT ) \u00b7 max{diam_t(C[Zt\u22121]), \u01ebT}]\n\u2264 E[\u2211_{t=1}^{T} \u03b8D(\u01ebT ) \u00b7 diam_t(C[Zt\u22121])] + \u03b8D(\u01ebT ) T \u01ebT \u2264 O(\u03b8D(\u01ebT ) \u00b7 d^{1/(m+1)} \u00b7 T^{m/(m+1)} log^2(T )).\n\nWe can additionally construct a lower bound for this scenario, as follows. Suppose C contains a full\ninfinite binary tree for which all classifiers in the tree agree on some point. That is, there is a set of\npoints {xb : b \u2208 {0, 1}^k, k \u2208 N} such that, for b1 = 0 and \u2200b2, b3, . . . \u2208 {0, 1}, \u2203h \u2208 C such that\nh(x_{(b1,...,bj\u22121)}) = bj for j \u2265 2. For instance, this is the case for linear separators (and most other\nnatural \u201cgeometric\u201d concept spaces).\nTheorem 4. 
For any C as above, for any active learning algorithm, \u2203 a set D satisfying Assumption 2,\na target function h\u2217 \u2208 C, and a sequence of distributions {Dt}_{t=1}^{T} in D such that the achieved\n\u00afMT and \u00afQT satisfy \u00afMT = \u2126(T^{m/(m+1)}), and \u00afMT = O(T^{m/(m+1)}) \u21d2 \u00afQT = \u2126(T^{m/(m+1)}).\n\nThe proof is analogous to that of Theorem 9 below, and is therefore omitted for brevity.\n\n5 Learning with Noise\nIn this section, we extend the above analysis to allow for various types of noise conditions commonly\nstudied in the literature. For this, we will need to study a noise-robust variant of CAL, below\nreferred to as Agnostic CAL (or ACAL). We prove upper bounds achieved by ACAL, as well as\n(non-matching) minimax lower bounds.\n\n5.1 Noise Conditions\nThe following assumption may be referred to as a strictly benign noise condition, which essentially\nsays the model is specified correctly in that h\u2217 \u2208 C, and though the labels may be stochastic, they\nare not completely random, but rather each is slightly biased toward the h\u2217 label.\nAssumption 3. h\u2217 = sign(\u03b7 \u2212 1/2) \u2208 C and \u2200x, \u03b7(x) \u2260 1/2.\nA particularly interesting special case of Assumption 3 is given by Tsybakov\u2019s noise conditions,\nwhich essentially control how common it is to have \u03b7 values close to 1/2. Formally:\nAssumption 4. \u03b7 satisfies Assumption 3 and for some c > 0 and \u03b1 \u2265 0,\n\u2200t > 0, P(|\u03b7(x) \u2212 1/2| < t) < c \u00b7 t^\u03b1.\nIn the setting of shifting distributions, we will be interested in conditions for which the above\nassumptions are satisfied simultaneously for all distributions in D. We formalize this in the following.\nAssumption 5. 
Assumption 4 is satisfied for all D \u2208 D, with the same c and \u03b1 values.\n5.2 Agnostic CAL\nThe following algorithm is essentially taken from [DHM07, Han11], adapted here for this stream-based\nsetting. It is based on a subroutine: LEARN(L,Q) = argmin_{h\u2208C: \u02c6er(h;L)=0} \u02c6er(h;Q) if min_{h\u2208C} \u02c6er(h;L) =\n0, and otherwise LEARN(L,Q) = \u2205.\n\nACAL\n1. t \u2190 0, Lt \u2190 \u2205, Qt \u2190 \u2205, let \u02c6ht be any element of C\n2. Do\n3.   t \u2190 t + 1\n4.   Predict \u02c6Yt = \u02c6ht\u22121(Xt)\n5.   For each y \u2208 {\u22121, +1}, let h(y) = LEARN(Lt\u22121,Qt\u22121)\n6.   If either y has h(\u2212y) = \u2205 or\n       \u02c6er(h(\u2212y);Lt\u22121 \u222a Qt\u22121) \u2212 \u02c6er(h(y);Lt\u22121 \u222a Qt\u22121) > \u02c6Et\u22121(Lt\u22121,Qt\u22121)\n7.     Lt \u2190 Lt\u22121 \u222a {(Xt, y)}, Qt \u2190 Qt\u22121\n8.   Else Request Yt, and let Lt \u2190 Lt\u22121, Qt \u2190 Qt\u22121 \u222a {(Xt, Yt)}\n9.   Let \u02c6ht = LEARN(Lt,Qt)\n10.  If t is a power of 2\n11.    Lt \u2190 \u2205, Qt \u2190 \u2205\nThe algorithm is expressed in terms of a function \u02c6Et(L,Q), defined as follows. Let \u03b4i be\na nonincreasing sequence of values in (0, 1). Let \u03be1, \u03be2, . . . denote a sequence of independent\nUniform({\u22121, +1}) random variables, also independent from the data. For V \u2286 C, let\n\u02c6Rt(V ) = sup_{h1,h2\u2208V} (1/(t \u2212 2^{\u230alog2(t\u22121)\u230b})) \u2211_{m=2^{\u230alog2(t\u22121)\u230b}+1}^{t} \u03bem \u00b7 (h1(Xm) \u2212 h2(Xm)),\n\u02c6Dt(V ) = sup_{h1,h2\u2208V} (1/(t \u2212 2^{\u230alog2(t\u22121)\u230b})) \u2211_{m=2^{\u230alog2(t\u22121)\u230b}+1}^{t} |h1(Xm) \u2212 h2(Xm)|,\n\u02c6Ut(V, \u03b4) = 12 \u02c6Rt(V ) + 34 \u221a(\u02c6Dt(V ) ln(32t^2/\u03b4)/t) + 752 ln(32t^2/\u03b4)/t.\nAlso, for any finite sets L,Q \u2286 X \u00d7 Y, let C[L] = {h \u2208\nC : \u02c6er(h;L) = 0}, \u02c6C(\u01eb;L,Q) = {h \u2208 C[L] : \u02c6er(h;L \u222a Q) \u2212 min_{g\u2208C[L]} \u02c6er(g;L \u222a Q) \u2264 \u01eb}. Then\ndefine \u02c6Ut(\u01eb, \u03b4;L,Q) = \u02c6Ut(\u02c6Ct(\u01eb;L,Q), \u03b4), and (letting Z\u01eb = {j \u2208 Z : 2^j \u2265 \u01eb})\n\n\u02c6Et(L,Q) = inf{\u01eb > 0 : \u2200j \u2208 Z\u01eb, min_{m\u2208N} \u02c6Ut(\u01eb, \u03b4_{\u230alog(t)\u230b};L,Q) \u2264 2^{j\u22124}}.\n\n5.3 Learning with a Fixed Distribution\nThe following results essentially follow from [Han11], adapted to this stream-based setting.\nTheorem 5. For any strictly benign (P, \u03b7), if 2^{\u22122^i} \u226a \u03b4i \u226a 2^{\u2212i}/i, ACAL achieves an expected\nexcess number of mistakes \u00afMT \u2212 M\u2217_T = o(T ), and if \u03b8P (\u01eb) = o(1/\u01eb), then ACAL makes an\nexpected number of queries \u00afQT = o(T ).\nTheorem 6. For any (P, \u03b7) satisfying Assumption 4, if D = {P}, ACAL achieves an expected\nexcess number of mistakes \u00afMT \u2212 M\u2217_T = \u02dcO(d^{(\u03b1+1)/(\u03b1+2)} \u00b7 T^{1/(\u03b1+2)} log(1/\u03b4_{\u230alog(T)\u230b}) + \u2211_{i=0}^{\u230alog(T)\u230b} \u03b4i 2^i) and\nan expected number of queries \u00afQT = \u02dcO(\u03b8P (\u01ebT ) \u00b7 d^{\u03b1/(\u03b1+2)} \u00b7 T^{2/(\u03b1+2)} log(1/\u03b4_{\u230alog(T)\u230b}) + \u2211_{i=0}^{\u230alog(T)\u230b} \u03b4i 2^i),\nwhere \u01ebT = T^{\u2212\u03b1/(\u03b1+2)}.\nCorollary 1. For any (P, \u03b7) satisfying Assumption 4, if D = {P} and \u03b4i = 2^{\u2212i} in ACAL, the\nalgorithm achieves an expected number of mistakes \u00afMT and expected number of queries \u00afQT such\nthat, for \u01ebT = T^{\u2212\u03b1/(\u03b1+2)}, \u00afMT \u2212 M\u2217_T = \u02dcO(d^{(\u03b1+1)/(\u03b1+2)} \u00b7 T^{1/(\u03b1+2)}), and \u00afQT = \u02dcO(\u03b8P (\u01ebT ) \u00b7 d^{\u03b1/(\u03b1+2)} \u00b7 T^{2/(\u03b1+2)}).\n\n5.4 Learning with a Drifting Distribution\nWe can now state our results concerning ACAL, which are analogous to Theorems 2 and 3 proved\nearlier for CAL in the realizable case.\nTheorem 7. If D is totally bounded (Assumption 1) and \u03b7 satisfies Assumption 3, then ACAL with\n\u03b4i = 2^{\u2212i} achieves an excess expected mistake bound \u00afMT \u2212 M\u2217_T = o(T ), and if additionally\n\u03b8D(\u01eb) = o(1/\u01eb), then ACAL makes an expected number of queries \u00afQT = o(T ).\n\nThe proof of Theorem 7 essentially follows from a combination of the reasoning for Theorem 2 and\nTheorem 8 below. Its proof is omitted.\nTheorem 8. 
If Assumptions 2 and 5 are satisfied, then ACAL achieves an expected excess number\nof mistakes \u00afMT \u2212 M\u2217_T = \u02dcO(T^{((\u03b1+2)m+1)/((\u03b1+2)(m+1))} log(1/\u03b4_{\u230alog(T)\u230b}) + \u2211_{i=0}^{\u230alog(T)\u230b} \u03b4i 2^i), and an expected\nnumber of queries \u00afQT = \u02dcO(\u03b8D(\u01ebT ) T^{((\u03b1+2)(m+1)\u2212\u03b1)/((\u03b1+2)(m+1))} log(1/\u03b4_{\u230alog(T)\u230b}) + \u2211_{i=0}^{\u230alog(T)\u230b} \u03b4i 2^i), where \u01ebT =\nT^{\u2212\u03b1/((\u03b1+2)(m+1))}.\n\nThe proof of this result is in many ways similar to that given above for the realizable case, and is\nincluded among the supplemental materials.\nWe immediately have the following corollary for a specific \u03b4i sequence.\nCorollary 2. With \u03b4i = 2^{\u2212i} in ACAL, the algorithm achieves expected number of mistakes \u00afMT and\nexpected number of queries \u00afQT such that, for \u01ebT = T^{\u2212\u03b1/((\u03b1+2)(m+1))},\n\n\u00afMT \u2212 M\u2217_T = \u02dcO(T^{((\u03b1+2)m+1)/((\u03b1+2)(m+1))}) and \u00afQT = \u02dcO(\u03b8D(\u01ebT ) \u00b7 T^{((\u03b1+2)(m+1)\u2212\u03b1)/((\u03b1+2)(m+1))}).\n\nJust as in the realizable case, we can also state a minimax lower bound for this noisy setting.\nTheorem 9. 
For any C as in Theorem 4, for any active learning algorithm, \u2203 a set D satisfying\nAssumption 2, a conditional distribution \u03b7, such that Assumption 5 is satisfied, and a sequence of\ndistributions {Dt}_{t=1}^{T} in D such that the \u00afMT and \u00afQT achieved by the learning algorithm satisfy\n\u00afMT \u2212 M\u2217_T = \u2126(T^{(1+m\u03b1)/(\u03b1+2+m\u03b1)}) and \u00afMT \u2212 M\u2217_T = O(T^{(1+m\u03b1)/(\u03b1+2+m\u03b1)}) \u21d2 \u00afQT = \u2126(T^{(2+m\u03b1)/(\u03b1+2+m\u03b1)}).\nThe proof is included in the supplemental material.\n\n6 Discussion\nQuerying before Predicting: One interesting alternative to the above framework is to allow the\nlearner to make a label request before making its label predictions. From a practical perspective, this\nmay be more desirable and in many cases quite realistic. From a theoretical perspective, analysis\nof this alternative framework essentially separates out the mistakes due to over-confidence from the\nmistakes due to recognized uncertainty. In some sense, this is related to the KWIK model of learning\nof [LLW08].\n\nAnalyzing the above procedures in this alternative model yields several interesting details. Specifically,\nthe natural modification of CAL produces a method that (in the realizable case) makes the\nsame number of label requests as before, except that now it makes zero mistakes, since CAL will\nrequest a label if there is any uncertainty about its label.\n\nOn the other hand, the analysis of the natural modification to ACAL can be far more subtle, when\nthere is noise. In particular, because the version space is only guaranteed to contain the best classifier\nwith high confidence, there is still a small probability of making a prediction that disagrees\nwith the best classifier h\u2217 on each round that we do not request a label. 
So controlling the number\nof mistakes in this setting comes down to controlling the probability of removing h\u2217 from\nthe version space. However, this confidence parameter appears in the analysis of the number of\nqueries, so that we have a natural trade-off between the number of mistakes and the number of\nlabel requests. In particular, under Assumptions 2 and 5, this procedure achieves an expected\nexcess number of mistakes \u00afMT \u2212 M\u2217_T \u2264 \u2211_{i=1}^{\u230alog(T)\u230b} \u03b4i 2^i, and an expected number of queries\n\u00afQT = \u02dcO(\u03b8D(\u01ebT ) \u00b7 T^{((\u03b1+2)(m+1)\u2212\u03b1)/((\u03b1+2)(m+1))} log(1/\u03b4_{\u230alog(T)\u230b}) + \u2211_{i=0}^{\u230alog(T)\u230b} \u03b4i 2^i), where \u01ebT = T^{\u2212\u03b1/((\u03b1+2)(m+1))}.\nIn particular, given any nondecreasing sequence MT , we can set this \u03b4i sequence to maintain\n\u00afMT \u2212 M\u2217_T \u2264 MT for all T .\n\nOpen Problems: What is not implied by the results above is any sort of trade-off between the\nnumber of mistakes and the number of queries. Intuitively, such a trade-off should exist; however,\nas CAL lacks any parameter to adjust the behavior with respect to this trade-off, it seems we need a\ndifferent approach to address that question. In the batch setting, the analogous question is the trade-off\nbetween the number of label requests and the number of unlabeled examples needed. In the\nrealizable case, that trade-off is tightly characterized by Dasgupta\u2019s splitting index analysis [Das05].\nIt would be interesting to determine whether the splitting index tightly characterizes the mistakes-vs-queries\ntrade-off in this stream-based setting as well.\n\nIn the batch setting, in which unlabeled examples are considered free, and performance is only measured\nas a function of the number of label requests, [BHV10] have found that there is an important\ndistinction between the verifiable label complexity and the unverifiable label complexity. In particular,\nwhile the former is sometimes no better than passive learning, the latter can always provide\nimprovements for VC classes. Is there such a thing as unverifiable performance measures in the\nstream-based setting? To be concrete, we have the following open problem. Is there a method for\nevery VC class that achieves O(log(T )) mistakes and o(T ) queries in the realizable case?\n\nconcept drift. In COLT, pages 168\u2013180, 2010.\n[Das05] S. Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.\n[DGS10] O. Dekel, C. Gentile, and K. Sridharan. Robust selective sampling from single and multiple teachers. In Conference on Learning Theory, 2010.\n[DHM07] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. Technical Report CS2007-0898, Department of Computer Science and Engineering, University of California, San Diego, 2007.\n[DKM09] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Journal of Machine Learning Research, 10:281\u2013299, 2009.\n[FM97] Y. Freund and Y. Mansour. Learning under persistent drift. In Proceedings of the Third European Conference on Computational Learning Theory, EuroCOLT \u201997, pages 109\u2013118, 1997.\n[Han07] S. Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.\n[Han11] S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333\u2013361, 2011.\n[HLW94] D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115:248\u2013292, 1994.\n[Lit88] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285\u2013318, 1988.\n[LLW08] L. Li, M. L. Littman, and T. J. Walsh. Knows what it knows: A framework for self-aware learning. In International Conference on Machine Learning, 2008.\n[MMR08] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems (NIPS), pages 1041\u20131048, 2008.\n[MMR09] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.\n[MT99] E. Mammen and A.B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27:1808\u20131829, 1999.\n[Vap82] V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.\n[vdG00] S. van de Geer. Empirical Processes in M-Estimation (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge University Press, 2000.\n\nReferences\n\n[AB99]\n\n[Bar92]\n\n[BHV10]\n\n[BL97]\n\n[CAL94]\n\nM. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge\nUniversity Press, 1999.\nP. L. Bartlett. Learning with a slowly changing distribution. 
In Proceedings of the \ufb01fth annual\nworkshop on Computational learning theory, COLT \u201992, pages 243\u2013252, 1992.\nM.-F. Balcan, S. Hanneke, and J. Wortman Vaughan. The true sample complexity of active\nlearning. Machine Learning, 80(2\u20133):111\u2013139, September 2010.\nR. D. Barve and P. M. Long. On the complexity of learning from drifting distributions.\nComput., 138(2):170\u2013193, 1997.\nD. Cohn, L. Atlas, and R. Ladner.\nLearning, 15(2):201\u2013221, 1994.\n\nImproving generalization with active learning. Machine\n\nInf.\n\n[CMEDV10] K. Crammer, Y. Mansour, E. Even-Dar, and J. Wortman Vaughan. Regret minimization with\n\n,\n\n9\n\n\f", "award": [], "sourceid": 1169, "authors": [{"given_name": "Liu", "family_name": "Yang", "institution": null}]}