{"title": "Information, Prediction, and Query by Committee", "book": "Advances in Neural Information Processing Systems", "page_first": 483, "page_last": 490, "abstract": null, "full_text": "Information, prediction, and query by \n\ncommittee \n\nYoav Freund \n\nComputer and Information Sciences \nUniversity of California, Santa Cruz \n\nyoavQcse.ucsc.edu \n\nH. Sebastian Seung \nAT &T Bell Laboratories \nMurray Hill, New Jersey \nseungQphysics.att.com \n\nEli Shamir \n\nInstitute of Computer Science \nHebrew University, Jerusalem \n\nsharnirQcs.huji.ac.il \n\nN aft ali Tishby \n\nInstitute of Computer Science and \nCenter for Neural Computation \nHebrew University, Jerusalem \n\ntishbyQcs.huji.ac.il \n\nAbstract \n\nWe analyze the \"query by committee\" algorithm, a method for fil(cid:173)\ntering informative queries from a random stream of inputs. We \nshow that if the two-member committee algorithm achieves infor(cid:173)\nmation gain with positive lower bound, then the prediction error \ndecreases exponentially with the number of queries. We show that, \nin particular, this exponential decrease holds for query learning of \nthresholded smooth functions. \n\n1 \n\nIntroduction \n\nFor the most part, research on supervised learning has utilized a random input \nparadigm, in which the learner is both trained and tested on examples drawn at \nrandom from the same distribution. In contrast, in the query paradigm, the learner \nis given the power to ask questions, rather than just passively accept examples. \nWhat does the learner gain from this additional power? Can it attain the same \nprediction performance with fewer examples? \n\nMost work on query learning has been in the constructive paradigm, in which the \n\n483 \n\n\f484 \n\nFreund, Seung, Shamir, and Tishby \n\nlearner constructs inputs on which to query the teacher. 
For some classes of boolean functions and finite automata that are not PAC learnable from random inputs, there are algorithms that can successfully PAC learn using "membership queries" [Val84, Ang88]. Query algorithms are also known for neural network learning [Bau91]. The general relevance of these positive results is unclear, since each is specific to the learning of a particular concept class. Moreover, as shown by Eisenberg and Rivest in [ER90], constructed membership queries cannot be used to reduce the number of examples required for PAC learning. That is because random examples provide the learner with information not only about the correct mapping, but also about the distribution of future test inputs. This information is lacking if the learner must construct inputs.

In the statistical literature, some attempt has been made towards a more fundamental understanding of query learning, there called "sequential design of experiments."¹ It has been suggested that the optimal experiment (query) is the one with maximal Shannon information [Lin56, Fed72, Mac92]. Similar suggestions have been made in the perceptron learning literature [KR90]. Although the use of an entropic measure seems sensible, its relationship with prediction error has remained unclear.

Understanding this relationship is a main goal of the present work, and enables us to prove a positive result about the power of queries. Our work is derived within the query filtering paradigm, rather than the constructive paradigm. In this paradigm, proposed by [CAL90], the learner is given access to a stream of inputs drawn at random from a distribution. The learner sees every input, but chooses whether or not to query the teacher for the label. This paradigm is realistic in contexts where it is cheap to get unlabeled examples, but expensive to label them.
It avoids the problems with the constructive paradigm described in [ER90] because it gives the learner free access to the input distribution.

In [CAL90] there are several suggestions for query filters together with some empirical tests of their performance on simple problems. Seung et al. [SOS92] have suggested a filter called "query by committee," and analytically calculated its performance for some perceptron-type learning problems. For these problems, they found that the prediction error decreases exponentially fast in the number of queries. In this work we present a more complete and general analysis of query by committee, and show that such an exponential decrease is guaranteed for a general class of learning problems.

We work in a Bayesian model of concept learning [HKS91] in which the target concept f is chosen from a concept class C according to some prior distribution P. The concept class consists of boolean-valued functions defined on some input space X. An example is an input x ∈ X along with its label f(x). For any set of examples, we define the version space to be the set of all hypotheses in C that are consistent with the examples. As each example arrives, it eliminates inconsistent hypotheses, and the probability of the version space (with respect to P) is reduced. The instantaneous information gain (i.i.g.) is defined as the logarithm of the ratio of version space probabilities before and after receiving the example. In this work, we study a particular kind of learner, the Gibbs learner, which chooses a hypothesis at random from the version space.

¹The paradigm of (non-sequential) experimental design is analogous to what might be called "batch query learning," in which all of the inputs are chosen by the learner before a single label is received from the teacher.
In Bayesian terms, it chooses from the posterior distribution on the concept class, which is the restriction of the prior distribution to the version space.

If an unlabeled input x is provided, the expected i.i.g. of its label can be defined by taking the expectation with respect to the probabilities of the unknown label. The input x divides the version space into two parts: those hypotheses that label it as a positive example, and those that label it negative. Let the probability ratios of these two parts to the whole be X and 1 − X. Then the expected i.i.g. is

    H(X) = −X log X − (1 − X) log(1 − X).    (1)

The goal of the learner is to minimize its prediction error, its probability of error on an input drawn from the input distribution D. In the case of random input learning, every input x is drawn independently from D. Since the expected i.i.g. tends to zero (see [HKS91]), it seems that random input learning is inefficient. We will analyze query construction and filtering algorithms that are designed to achieve high information gain.

The rest of the paper is organized as follows. In section 2 we exhibit query construction algorithms for the high-low game. The bisection algorithm for high-low illustrates that constructing queries with high information gain can improve prediction performance. But the failure of bisection for multi-dimensional high-low exposes a deficiency of the query construction paradigm. In section 3 we define the query filtering paradigm, and discuss the relation between information gain and prediction error for queries filtered by a committee of Gibbs learners. In section 4 lower bounds for information gain are proved for the learning of some nontrivial concept classes. Section 5 is a summary and discussion of open problems.
\n\n2 Query construction and the high-low game \n\nIn this section, we give examples of query construction algorithms for the high-low \ngame and its generalizations. In the high-low game, the concept class C consists of \nfunctions of the form \n\n{ I w < x \nfw(x) = 0: w > x \n\n(2) \n\nwhere 0 ~ w, x ~ 1. Thus both X and C are naturally mapped to the interval [0,1]. \nBoth P, the prior distribution for the parameter w, and V, the input distribution for \nx, are assumed to be uniform on [0,1]. Given any sequence of examples, the version \nspace is [XL, XR] where XL is the largest negative example and XR is the smallest \npositive example. The posterior distribution is uniform in the interval [XL, XR] and \nvanishes outside. \nThe prediction error of a Gibbs learner is Pr(fv(x) I- fw(x)) where x is chosen \nfrom V, and v and w from the posterior distribution. It is easy to show that \nPr(fv (x) I- fw (x)) = (XR - xL)/3. Since the prediction error is proportional to the \nversion space volume, always querying on the midpoint (XR + xL)/2 causes the \nprediction error after m queries to decrease like 2- m . This is in contrast to the case \nof random input learning, for which the prediction error decreases like l/m. \n\n\f486 \n\nFreund, Seung, Shamir, and Tishby \n\nThe strategy of bisection is clearly maximally informative, since it achieves \n1i(lj2) = 1 bit per query, and can be applied to the learning of any concept class. \nNaive intuition suggests that it should lead to rapidly decreasing prediction error, \nbut this is not necessarily so. Generalizing the high-low game to d dimensions \nprovides a simple counterexample. The target concepts are functions of the form \n\n(3) \n\nfw(i, x) = {6: Wi < X \n\nWi> X \n\nThe prior distribution of 'Iii is uniform on the concept class C = [0, l]d. The inputs \nare pairs (i, x), where i takes on the values 1, ... , d with equal probability, and x is \nuniformly distributed on [0, 1]. 
Since this is basically d concurrent high-low games (one for each component of w), the version space is a product of subintervals of [0,1]. For d = 2, the concept class is the unit square, and the version space is a rectangle. The prediction error is proportional to the perimeter of the rectangle. A sequence of queries with i = 1 can bisect the rectangle along one dimension, yielding 1 bit per query, while the perimeter tends to a finite constant. Hence the prediction error tends to a finite constant, in spite of the maximal information gain.

3 The committee filter: information and prediction

The dilemma of the previous section was that constructing queries with high information gain does not necessarily lead to rapidly decreasing prediction error. This is because the constructed query distribution may have nothing to do with the input distribution D. This deficiency can be avoided in a different paradigm in which the query distribution is created by filtering D. Suppose that the learner receives a stream of unlabeled inputs x_1, x_2, ... drawn independently from the distribution D. After seeing each input x_i, the learner has the choice of whether or not to query the teacher for the correct label f_i = f(x_i).

In [SOS92] it was suggested to filter queries that cause disagreement in a committee of Gibbs learners. In this paper we concentrate on committees with two members. The algorithm is:

Query by a committee of two
Repeat the following until n queries have been accepted:

1. Draw an unlabeled input x ∈ X at random from D.
2. Select two hypotheses h_1, h_2 from the posterior distribution. In other words, pick two hypotheses that are consistent with the labeled examples seen so far.
3. If h_1(x) ≠ h_2(x), then query the teacher for the label of x, and add it to the training set.
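The three steps above can be sketched concretely for the high-low game of section 2, where sampling a hypothesis from the posterior simply means drawing a threshold uniformly from the current version space interval. This is a minimal illustrative sketch (our own code, not the authors'; all names are ours):

```python
import random

def committee_filter_highlow(target, n_queries, rng):
    """Two-member committee filter for the high-low game on [0, 1].

    After any set of examples the version space is an interval
    [x_L, x_R], so a Gibbs learner just draws a threshold from it.
    Returns the final version space and the number of inputs seen.
    """
    x_lo, x_hi = 0.0, 1.0              # version space of consistent thresholds
    accepted = seen = 0
    while accepted < n_queries:
        x = rng.random()               # 1. random unlabeled input from D
        seen += 1
        h1 = rng.uniform(x_lo, x_hi)   # 2. two Gibbs hypotheses
        h2 = rng.uniform(x_lo, x_hi)
        if (h1 < x) != (h2 < x):       # 3. committee disagrees: query teacher
            accepted += 1
            if target < x:             # teacher's label f_w(x) = 1:
                x_hi = min(x_hi, x)    #   x is the new smallest positive example
            else:                      # teacher's label f_w(x) = 0:
                x_lo = max(x_lo, x)    #   x is the new largest negative example
    return x_lo, x_hi, seen
```

On a typical run the version space shrinks roughly geometrically with the number of accepted queries, while the number of inputs examined between consecutive queries grows.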
\n\nThe committee filter tends to select examples that split the version space into two \nparts of comparable size, because if one of the parts contains most of the version \nspace, then the probability that the two hypotheses will disagree is very small. \nMore precisely, if x cuts the version space into parts of size X and 1 - X, then the \nprobability of accepting x is 2x(1 - X). One can show that the i.i.g. of the queries \nis lower bounded by that obtained from random inputs. \n\n\fInformation, prediction, and query by committee \n\n487 \n\nIn this section, we assume something stronger: that the expected i.i.g. of the com(cid:173)\nmittee has positive lower bound. Conditions under which this assumption holds will \nbe discussed in the next section. The bound implies that the cumulative information \ngain increases linearly with the number of queries n. But the version space resulting \nfrom the queries alone must be larger than the version space that would result if \nthe learner knew all of the labels. Hence the cumulative information gain from the \nqueries is upper bounded by the cumulative information gain which would be ob(cid:173)\ntained from the labels of all m inputs, which behaves like O(dlog r;;) for a concept \nclass C with finite VC dimension d ([HKS91]). These O(n) and O(log m) behaviors \nare consistent only if the gap between consecutive queries increases exponentially \nfast. This argument is depicted in Fi!~ure 1. \n\nCumulative \nInformation \nGain \n\nCumulative \nInformation \nof Queries \n\nExpected \n\n~~;~~ \n\nRandom \n\nr------- _r------------\n\nExamples r ----Gap between example: \n\naccepted as queries \n\n_ _ \n\nx \n\nx \n\nx \n\nx \n\nNumber of \n\nRandom Examples \n\nFigure 1: Each tag on the x axis denotes a random example in a specific typical \nsequence. The symbol X under a tag denotes the fact that the example was chosen \nas a query. 
\n\nRecall that an input is accepted if it provokes disagreement between the two Gibbs \nlearners that constitute the committee. Thus a large gap between consecutive \nqueries is equivalent to a small probability of disagreement. But in our Bayesian \nframework the probability of disagreement between two Gibbs learners is equal to \nthe probability of disagreement between a Gibbs learner and the teacher, which is \nthe expected prediction error. Thus the prediction error is exponentially small as a \nfunction of the number of queries. The exact statement of the result is given below, \na detailed proof of which will be published elsewhere. \nTheorem 1 Suppose that a concept class C has VC-dimension d < 00 and the \nexpected information gained by the two member committee algorithm is bounded by \nc > 0, independent of the query number and of the previous queries. Then the \nprobability that one of the two committee members makes a mistake on a randomly \nchosen example with respect to a randomly chosen fEe is bounded by \n\n(3+0(e-Cln\u00bb~exp (-2(d~ l)n) \n\n(4) \n\nfor some constant Cl > 0, where n is the number of queries asked so far. \n\n\f488 \n\nFreund, Seung, Shamir, and Tishby \n\n4 Lower bounds on the information gain \n\nTheorem 1 is applicable to learning problems for which the committee achieves i.i.g. \nwith positive lower bound. A simple case of this is the d-dimensional high-low game \nof section 2, for which the i.i.g. is 7 /(121n 2) R: 0.84, independent of dimension. This \nexact result is simple to derive because the high-low game is geometrically trivial: \nall version spaces are similar to each other. In general, the shape of the version space \nis more complex, and depends on the randomness of the examples. Nevertheless, \nthe expected i.i.g. can be lower bounded even for some learning problems with \nnontrivial version space geometry. 
\n\n4.1 The information gain for convex version spaces \nDefine a class of functions f w by \n\nfw(x, t) = { ~: \n\n... \n\n... \nt \n, \nw\u00b7x> \ntij\u00b7x