{"title": "Coarse sample complexity bounds for active learning", "book": "Advances in Neural Information Processing Systems", "page_first": 235, "page_last": 242, "abstract": null, "full_text": "Coarse sample complexity bounds for active learning\n\nSanjoy Dasgupta, UC San Diego, dasgupta@cs.ucsd.edu\n\nAbstract\nWe characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.\n\n1 Introduction\nThe goal of active learning is to learn a classifier in a setting where data comes unlabeled, and any labels must be explicitly requested and paid for. The hope is that an accurate classifier can be found by buying just a few labels. So far the most encouraging theoretical results in this field are [7, 6], which show that if the hypothesis class is that of homogeneous (i.e. through the origin) linear separators, the data is distributed uniformly over the unit sphere in Rd, and the labels correspond perfectly to one of the hypotheses (i.e. the separable case), then at most O(d log d/ε) labels are needed to learn a classifier with error less than ε. This is exponentially smaller than the usual Ω(d/ε) sample complexity of learning linear classifiers in a supervised setting. However, generalizing this result is non-trivial. For instance, if the hypothesis class is expanded to include non-homogeneous linear separators, then even in just two dimensions, under the same benign input distribution, we will see that there are some target hypotheses for which active learning does not help much, for which Ω(1/ε) labels are needed. In fact, in this example the label complexity of active learning depends heavily on the specific target hypothesis, and ranges from O(log 1/ε) to Ω(1/ε). In this paper, we consider arbitrary hypothesis classes H of VC dimension d < ∞, and learning problems which are separable. 
We characterize the sample complexity of active learning in terms of a parameter which takes into account: (1) the distribution P over the input space X; (2) the specific target hypothesis h* ∈ H; and (3) the desired accuracy ε.\n\nSpecifically, we notice that the distribution P induces a natural topology on H, and we define a splitting index ρ which captures the relevant local geometry of H in the vicinity of h*, at scale ε. We show that this quantity fairly tightly describes the sample complexity of active learning: any active learning scheme requires Ω(1/ρ) labels, and there is a generic active learner which always uses at most Õ(d/ρ) labels.¹ This ρ is always at least Ω(ε); if it is Θ(ε), we just get the usual sample complexity of supervised learning. But sometimes ρ is a constant, and in such instances active learning gives an exponential improvement in the number of labels needed. We look at various hypothesis classes and derive splitting indices for target hypotheses at different levels of accuracy. For homogeneous linear separators and the uniform input distribution, we easily find ρ to be a constant, perhaps the most direct proof yet of the efficacy of active learning in this case. Most proofs have been omitted for want of space; the full details, along with more examples, can be found at [5].\n\n¹ The Õ(·) notation hides factors polylogarithmic in d, 1/ε, 1/τ, and 1/δ.\n\n2 Sample complexity bounds\n2.1 Motivating examples\n\nLinear separators in R1. Our first example is taken from [3, 4]. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {hw : w ∈ R}:\n\nhw(x) = 1 if x ≥ w, 0 if x < w.\n\n- - - - - - [w] + + + +\n\nVC theory tells us that if the underlying distribution P is separable (can be classified perfectly by some hypothesis in H), then in order to achieve an error rate less than ε, it is enough to draw m = O(1/ε) random labeled examples from P, and to return any classifier consistent with them. 
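As a concrete rendering of this supervised baseline, here is a minimal simulation; this is a sketch, not from the paper: the uniform distribution on [0, 1], the constants, and all function names are illustrative assumptions.

```python
import random

def train_threshold(labeled):
    """Return a threshold w consistent with labeled (x, y) pairs,
    assuming the data are separable by some h_w(x) = 1(x >= w)."""
    lo = max((x for x, y in labeled if y == 0), default=float("-inf"))
    hi = min((x for x, y in labeled if y == 1), default=float("inf"))
    assert lo < hi, "not separable by a threshold"
    return hi  # any w in (lo, hi] is consistent; pick the smallest positive point

random.seed(0)
w_star, eps = 0.3, 0.01            # hypothetical target threshold and accuracy
m = int(1 / eps)                   # m = O(1/eps) labeled draws, as in the VC bound
sample = [(x, int(x >= w_star)) for x in (random.random() for _ in range(m))]
w_hat = train_threshold(sample)
error = abs(w_hat - w_star)        # disagreement mass under P = uniform[0, 1]
print(error)
```

The point of the contrast developed next in the text is that the same pool, drawn unlabeled, needs only O(log m) label requests.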
But suppose we instead draw m unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of 0's followed by a sequence of 1's, and the goal is to discover the point w at which the transition occurs. This can be done with a binary search which asks for just log m = O(log 1/ε) labels. Thus, in this case active learning gives an exponential improvement in the number of labels needed. Can we always achieve a label complexity proportional to log 1/ε rather than 1/ε? A natural next step is to consider linear separators in two dimensions.\n\nLinear separators in R2. Let H be the hypothesis class of linear separators in R2, and suppose the input distribution P is some density supported on the perimeter of the unit circle. It turns out that the positive results of the one-dimensional case do not generalize: there are some target hypotheses in H for which Ω(1/ε) labels are needed to find a classifier with error rate less than ε, no matter what active learning scheme is used. To see this, consider the following possible target hypotheses (Figure 1, left): h0, for which all points are positive; and hi (1 ≤ i ≤ 1/ε), for which all points are positive except for a small slice Bi of probability mass ε. The slices Bi are explicitly chosen to be disjoint, with the result that Ω(1/ε) labels are needed to distinguish between these hypotheses. For instance, suppose nature chooses a target hypothesis at random from among the hi, 1 ≤ i ≤ 1/ε. Then, to identify this target with probability at least 1/2, it is necessary to query points in at least (about) half the Bi's. Thus for these particular target hypotheses, active learning offers no improvement in sample complexity. What about other target hypotheses in H, for instance those in which the positive and negative regions are more evenly balanced? Consider the following active learning scheme:\n\nFigure 1: Left: The data lie on the circumference of a circle. 
Each Bi is an arc of probability mass ε. Right: The same distribution P, lifted to 3-d, and with trace amounts of another distribution P' mixed in.\n\n1. Draw a pool of O(1/ε) unlabeled points.\n2. From this pool, choose query points at random until at least one positive and one negative point have been found. (If all points have been queried, then halt.)\n3. Apply binary search to find the two boundaries between positive and negative on the perimeter of the circle.\n\nFor any h ∈ H, define i(h) = min{positive mass of h, negative mass of h}. It is not hard to see that when the target hypothesis is h, step (2) asks for O(1/i(h)) labels (with probability at least 9/10, say) and step (3) asks for O(log 1/ε) labels. Thus even within this simple hypothesis class, the label complexity of active learning can run anywhere from O(log 1/ε) to Ω(1/ε), depending on the specific target hypothesis.\n\nLinear separators in R3. In our two previous examples, the amount of unlabeled data needed was O(1/ε), exactly the usual sample complexity of supervised learning. We next turn to a case in which it is helpful to have significantly more unlabeled data than this. Consider the distribution of the previous 2-d example: for concreteness, fix P to be uniform over the unit circle in R2. Now lift it into three dimensions by adding to each point x = (x1, x2) a third coordinate x3 = 1. Let H consist of homogeneous linear separators in R3. Clearly the bad cases of the previous example persist. Suppose, now, that a trace amount of a second distribution P' is mixed in with P (Figure 1, right), where P' is uniform on the circle {x1² + x2² = 1, x3 = 0}. The \"bad\" linear separators in H cut off just a small portion of P but nonetheless divide P' perfectly in half. 
This permits a three-stage algorithm: (1) using binary search on points from P', approximately identify the two places at which the target hypothesis h* cuts P'; (2) use this to identify a positive and a negative point of P (look at the midpoints of the positive and negative intervals of P'); (3) do binary search on points from P. Steps (1) and (3) each use just O(log 1/ε) labels. This O(log 1/ε) label complexity is made possible by the presence of P' and is only achievable if the amount of unlabeled data is Ω(1/τ), where τ is the trace fraction of P' in the mix; this could potentially be enormous. With less unlabeled data, the usual Ω(1/ε) label complexity applies.\n\nFigure 2: (a) x is a cut through H; (b) splitting edges.\n\n2.2 Basic definitions\n\nThe sample complexity of supervised learning is commonly expressed as a function of the error rate ε and the underlying distribution P. For active learning, the previous three examples demonstrate that it is also important to take into account the target hypothesis and the amount of unlabeled data. The main goal of this paper is to present one particular formalism by which this can be accomplished. Let X be an instance space with underlying distribution P. Let H be the hypothesis class, a set of functions from X to {0, 1} whose VC dimension is d < ∞. We are operating in a non-Bayesian setting, so we are not given a measure (prior) on the space H. In the absence of a measure, there is no natural notion of the \"volume\" of the current version space. However, the distribution P does induce a natural distance function on H, a pseudometric: d(h, h') = P{x : h(x) ≠ h'(x)}. We can likewise define the notion of a neighborhood: B(h, r) = {h' ∈ H : d(h, h') ≤ r}. We will be dealing with a separable learning scenario, in which all labels correspond perfectly to some concept h* ∈ H, and the goal is to find h ∈ H such that d(h*, h) ≤ ε. 
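The induced pseudometric d(h, h') = P{x : h(x) ≠ h'(x)} can be estimated from unlabeled data alone, since it never consults labels. A small sketch of such an estimator; the choice of distribution, sample size, and all names are illustrative assumptions, not the paper's:

```python
import random

def disagreement(h1, h2, pool):
    """Monte Carlo estimate of the induced pseudometric
    d(h1, h2) = P{x : h1(x) != h2(x)}, using unlabeled draws from P."""
    return sum(h1(x) != h2(x) for x in pool) / len(pool)

# Example: two thresholds on the line under P = uniform[0, 1];
# here d(h_0.3, h_0.5) = P[0.3, 0.5) = 0.2 exactly.
random.seed(1)
pool = [random.random() for _ in range(100_000)]
h1 = lambda x: int(x >= 0.3)
h2 = lambda x: int(x >= 0.5)
d_hat = disagreement(h1, h2, pool)
print(d_hat)  # close to 0.2
```

With 100,000 draws the estimate concentrates tightly around the true disagreement mass, which is why unlabeled data is the cheap resource in this model.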
To do this, it is sufficient to whittle down the version space to the point where it has diameter at most ε, and to then return any of the remaining hypotheses. Likewise, if the diameter of the current version space is more than ε, then any hypothesis chosen from it will have error more than ε/2 with respect to the worst-case target. Thus, in a non-Bayesian setting, active learning is about reducing the diameter of the version space. If our current version space is S ⊆ H, how can we quantify the amount by which a point x ∈ X reduces its diameter? Let Hx+ denote the classifiers that assign x a value of 1, Hx+ = {h ∈ H : h(x) = 1}, and let Hx− be the remainder, which assign it a value of 0. We can think of x as a cut through hypothesis space; see Figure 2(a). In this example, x is clearly helpful, but it doesn't reduce the diameter of S. And we cannot say that it reduces the average distance between hypotheses, since again there is no measure on H. What x seems to be doing is reducing the diameter in a certain \"direction\". Is there some notion in arbitrary metric spaces which captures this intuition? Consider any finite Q ⊆ H × H. We will think of an element (h, h') ∈ Q as an edge between vertices h and h'. For us, each such edge will represent a pair of hypotheses which need to be distinguished from one another: that is, they are relatively far apart, so there is no way to achieve our target accuracy if both of them remain in the version space. We would hope that for any finite set of edges Q, there are queries that will remove a substantial fraction of them. To this end, a point x ∈ X is said to ρ-split Q if its label is guaranteed to reduce the number of edges by a fraction ρ > 0, that is, if\n\nmax{|Q ∩ (Hx+ × Hx+)|, |Q ∩ (Hx− × Hx−)|} ≤ (1 − ρ)|Q|.\n\nFor instance, in Figure 2(b), the edges are 3/5-split by x. If our target accuracy is ε, we only really care about edges of length more than ε. 
So define\n\nQε = {(h, h') ∈ Q : d(h, h') > ε}.\n\nFinally, we say that a subset of hypotheses S ⊆ H is (ρ, ε, τ)-splittable if for all finite edge-sets Q ⊆ S × S,\n\nP{x : x ρ-splits Qε} ≥ τ.\n\nParaphrasing: at least a τ fraction of the distribution P is useful for splitting S.² This gives a sense of how many unlabeled samples are needed. If τ is minuscule, then there are good points to query, but these will emerge only in an enormous pool of unlabeled data. It will soon transpire that the parameters ρ, τ play roughly the following roles:\n\n# labels needed ≈ 1/ρ, # of unlabeled points needed ≈ 1/τ.\n\nA first step towards understanding them is to establish a trivial lower bound on ρ.\n\nLemma 1 Pick any 0 < ε, τ < 1, and any set S. Then S is (ε(1 − τ), ε, ετ)-splittable.\n\nProof. Pick any finite edge-set Q ⊆ S × S. Let Z denote the number of edges of Qε cut by a point x chosen at random from P. Since these edges have length at least ε, such an x has at least an ε chance of cutting any given one of them, whereby EZ ≥ ε|Qε|. Now,\n\nε|Qε| ≤ EZ ≤ P(Z ≥ ε(1 − τ)|Qε|) · |Qε| + ε(1 − τ)|Qε|,\n\nwhich after rearrangement becomes P(Z ≥ ε(1 − τ)|Qε|) ≥ ετ, as claimed.\n\nThus, ρ is always Ω(ε); but of course, we hope for a much larger value. We will now see that the splitting index roughly characterizes the sample complexity of active learning.\n\n2.3 Lower bound\n\nWe start by showing that if some region of the hypothesis space has a low splitting index, then it must contain hypotheses which are not conducive to active learning.\n\nTheorem 2 Fix a hypothesis space H and distribution P. Suppose that for some ρ, τ < 1 and ε < 1/2, S ⊆ H is not (ρ, ε, τ)-splittable. Then any active learner which achieves an accuracy of ε on all target hypotheses in S, with confidence > 3/4 (over the random sampling of data), either needs ≥ 1/τ unlabeled samples or ≥ 1/ρ labels.\n\nProof. Let Qε be the set of edges of length > ε which defies splittability, with vertices V = {h : (h, h') ∈ Qε for some h' ∈ H}. 
We'll show that in order to distinguish between hypotheses in V, either ≥ 1/τ unlabeled samples or ≥ 1/ρ queries are needed.\n\nSo pick fewer than 1/τ unlabeled samples. With probability at least (1 − τ)^(1/τ) ≥ 1/4, none of these points ρ-splits Qε; put differently, each of these potential queries has a bad outcome (+ or −) in which at most a ρ fraction of the edges of Qε is eliminated. In this case there must be a target hypothesis in V for which at least 1/ρ labels are required.\n\nIn our examples, we will apply this lower bound through the following simple corollary.\n\nCorollary 3 Suppose that in some neighborhood B(h0, 2ε), there are hypotheses h1, . . . , hN such that: (1) d(h0, hi) > ε for all i; and (2) the \"disagree sets\" {x : h0(x) ≠ hi(x)} are disjoint for different i. Then for any τ and any ρ > 1/N, the set B(h0, 2ε) is not (ρ, ε, τ)-splittable. Any active learning scheme which achieves an accuracy of ε on all of B(h0, 2ε) must use at least N labels for some of the target hypotheses, no matter how much unlabeled data is available.\n\nIn this case the distance metric on h0, h1, . . . , hN can accurately be depicted as a star with h0 at the center and with spokes leading to each hi. Each query only cuts off one spoke, so N queries are needed.\n\n² Whenever an edge of length l can be constructed in S, then by taking Q to consist solely of this edge, we see that τ ≤ l. Thus we typically expect τ to be at most about ε, although of course it might be a good deal smaller than this.\n\n2.4 Upper bound\n\nWe now show a loosely matching upper bound on sample complexity, via an algorithm (Figure 3) which repeatedly halves the diameter of the remaining version space.\n\nLet S0 be an ε0-cover of H\nfor t = 1, 2, . . . , T = lg 2/ε:\n    St = split(St-1, 1/2^t)\nreturn any h ∈ ST\n\nfunction split(S, Δ):\n    Let Q0 = {(h, h') ∈ S × S : d(h, h') > Δ}\n    Repeat for t = 0, 1, 2, . . .:\n        Draw m unlabeled points xt1, . . . , xtm\n        Query the xti which maximally splits Qt\n        Let Qt+1 be the remaining edges\n    until Qt+1 = ∅\n    return remaining hypotheses in S\n\nFigure 3: A generic active learner.\n\nFor some ε0 less than half the target error rate ε, the algorithm starts with an ε0-cover of H: a set of hypotheses S0 ⊆ H such that any h ∈ H is within distance ε0 of S0. It is well known that it is possible to find such an S0 of size 2(2e/ε0 · ln 2e/ε0)^d [9] (Theorem 5). The ε0-cover serves as a surrogate for the hypothesis class; for instance, the final hypothesis is chosen from it. The algorithm is hopelessly intractable and is meant only to demonstrate the following upper bound.\n\nTheorem 4 Let the target hypothesis be some h* ∈ H. Pick any target accuracy ε > 0 and confidence level δ > 0. Suppose B(h*, 4Δ) is (ρ, Δ, τ)-splittable for all Δ ≥ ε/2. Then there is an appropriate choice of ε0 and m for which, with probability at least 1 − δ, the algorithm will draw Õ((1/ε) + (d/ρτ)) unlabeled points, make Õ(d/ρ) queries, and return a hypothesis with error at most ε.\n\nThis theorem makes it possible to derive label complexity bounds which are fine-tuned to the specific target hypothesis. At the same time, it is extremely loose in that no attempt has been made to optimize logarithmic factors.\n\n3 Examples\n3.1 Simple boundaries on the line\n\nReturning to our first example, let X = R and H = {hw : w ∈ R}, where each hw is a threshold function hw(x) = 1(x ≥ w). Suppose P is the underlying distribution on X; for simplicity we'll assume it is a density, although the discussion can easily be generalized.\n\nThe distance measure P induces on H is d(hw, hw') = P{x : hw(x) ≠ hw'(x)} = P{x : w ≤ x < w'} = P[w, w') (assuming w ≤ w'). Pick any accuracy ε > 0 and consider any finite set of edges Q = {(hwi, hw'i) : i = 1, . . .
, n}, where without loss of generality the wi are in nondecreasing order, and where each edge has length greater than ε: P[wi, w'i) > ε. Pick w so that P[wn/2, w) = ε. It is easy to see that any x ∈ [wn/2, w) must eliminate at least half the edges in Q. Therefore, H is (ρ = 1/2, ε, τ = ε)-splittable for any ε > 0. This echoes the simple fact that actively learning H is just a binary search.\n\n3.2 Intervals on the line\n\nThe next case we consider is almost identical to our earlier example of 2-d linear separators (and the results carry over to that example, within constant factors). The hypotheses correspond to intervals on the real line: X = R and H = {ha,b : a, b ∈ R}, where ha,b(x) = 1(a ≤ x ≤ b). Once again assume P is a density. The distance measure it induces is d(ha,b, ha',b') = P{x : x ∈ [a, b] ∪ [a', b'], x ∉ [a, b] ∩ [a', b']} = P([a, b] Δ [a', b']), where S Δ T denotes the symmetric difference (S ∪ T) \ (S ∩ T). Even in this very simple class, some hypotheses are much easier to actively learn than others.\n\nHypotheses not amenable to active learning. Divide the real line into 1/ε disjoint intervals, each with probability mass ε, and let {hi : i = 1, . . . , 1/ε} denote the hypotheses taking value 1 on the corresponding intervals. Let h0 be the everywhere-zero concept. Then these hi satisfy the conditions of Corollary 3; their star-shaped configuration forces a ρ-value of ε, and active learning doesn't help at all in choosing amongst them.\n\nHypotheses amenable to active learning. The bad hypotheses are the ones whose intervals have small probability mass. We'll now see that larger concepts are not so bad; in particular, for any h whose interval has mass > 4ε, B(h, 4ε) is (ρ = Ω(1), ε, τ = Ω(ε))-splittable. Pick any ε > 0 and any ha,b such that P[a, b] = r > 4ε. Consider a set of edges Q whose endpoints are in B(ha,b, 4ε) and which all have length > ε. In the figure below, all lengths denote probability masses. 
Any concept in B(ha,b, 4ε) (more precisely, its interval) must lie within the outer box and must contain the inner box (this inner box might be empty).\n\n(Figure: the interval [a, b], of mass r; the outer box extends [a, b] by probability mass 4ε on each side, and the inner box shrinks it by 4ε on each side.)\n\nAny edge (ha',b', ha'',b'') ∈ Q has length > ε, so [a', b'] Δ [a'', b''] (either a single interval or a union of two intervals) has total mass > ε and lies between the inner and outer boxes. Now pick x at random from the distribution P restricted to the space between the two boxes. This space has mass at most 16ε and at least 4ε, of which at least ε is occupied by [a', b'] Δ [a'', b'']. Therefore x separates ha',b' from ha'',b'' with probability at least 1/16.\n\nNow let's look at all of Q. The expected number of edges split by our x is at least |Q|/16, and therefore the probability that more than |Q|/32 edges are split is at least 1/32. So P{x : x (1/32)-splits Q} ≥ 4ε/32 = ε/8.\n\nTo summarize, for any hypothesis ha,b, let i(ha,b) = P[a, b] denote the probability mass of its interval. Then for any h ∈ H and any ε < i(h)/4, the set B(h, 4ε) is (1/32, ε, ε/8)-splittable. In short, once the version space is whittled down to B(h, i(h)/4), efficient active learning is possible. And the initial phase of getting to B(h, i(h)/4) can be managed by random sampling, using Õ(1/i(h)) labels: not too bad when i(h) is large.\n\n3.3 Linear separators under the uniform distribution\n\nThe most encouraging positive result for active learning to date has been for learning homogeneous (through the origin) linear separators with data drawn uniformly from the surface of the unit sphere in Rd. The splitting indices for this case [5] bring this out immediately:\n\nTheorem 5 For any h ∈ H and any ε ≤ 1/(32√(2d)), B(h, 4ε) is (1/8, ε, Ω(ε/√d))-splittable.\n\n4 Related work and open problems\n\nThere has been a lot of work on a related model in which the points to be queried are synthetically constructed, rather than chosen from unlabeled data [1]. 
The expanded role of P in our model makes it substantially different, although a few intuitions do carry over; for instance, Corollary 3 generalizes the notion of teaching dimension [8]. We have already discussed [7, 4, 6]. One other technique which seems useful for active learning is to look at the unlabeled data and then place bets on certain target hypotheses, for instance the ones with large margin. This insight, nicely formulated in [2, 10], is not specific to active learning and is orthogonal to the search issues considered in this paper. In all the positive examples in this paper, a random data point which intersects the version space has a good chance of Ω(1)-splitting it. This permits a naive active learning strategy, also suggested in [3]: just pick a random point whose label you are not yet sure of. On what kinds of problems will this work, and what are prototypical cases where more intelligent querying is needed?\n\nAcknowledgements. I'm grateful to Yoav Freund for introducing me to this field; to Peter Bartlett, John Langford, Adam Kalai and Claire Monteleoni for helpful discussions; and to the anonymous NIPS reviewers for their detailed and perceptive comments.\n\nReferences\n[1] D. Angluin. Queries revisited. ALT, 2001.\n[2] M.-F. Balcan and A. Blum. A PAC-style model for learning from labeled and unlabeled data. Eighteenth Annual Conference on Learning Theory, 2005.\n[3] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.\n[4] S. Dasgupta. Analysis of a greedy active learning strategy. NIPS, 2004.\n[5] S. Dasgupta. Full version of this paper at www.cs.ucsd.edu/~dasgupta/papers/sample.ps.\n[6] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. Eighteenth Annual Conference on Learning Theory, 2005.\n[7] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning Journal, 28:133–168, 1997.\n[8] S. 
Goldman and M. Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.\n[9] D. Haussler. Decision-theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.\n[10] J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.\n", "award": [], "sourceid": 2943, "authors": [{"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": null}]}