{"title": "Estimating Average-Case Learning Curves Using Bayesian, Statistical Physics and VC Dimension Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 855, "page_last": 862, "abstract": null, "full_text": "Estimating Average-Case Learning Curves Using Bayesian, Statistical Physics and VC Dimension Methods \n\nDavid Haussler \nUniversity of California \nSanta Cruz, California \n\nMichael Kearns* \nAT&T Bell Laboratories \nMurray Hill, New Jersey \n\nManfred Opper \nInstitut für Theoretische Physik \nUniversität Giessen, Germany \n\nRobert Schapire \nAT&T Bell Laboratories \nMurray Hill, New Jersey \n\nAbstract \n\nIn this paper we investigate an average-case model of concept learning, and give results that place the popular statistical physics and VC dimension theories of learning curve behavior in a common framework. \n\n1 INTRODUCTION \n\nIn this paper we study a simple concept learning model in which the learner attempts to infer an unknown target concept f, chosen from a known concept class F of {0,1}-valued functions over an input space X. At each trial i, the learner is given a point x_i ∈ X and asked to predict the value of f(x_i). If the learner predicts f(x_i) incorrectly, we say the learner makes a mistake. After making its prediction, the learner is told the correct value. \n\nThis simple theoretical paradigm applies to many areas of machine learning, including much of the research in neural networks. The quantity of fundamental interest in this setting is the learning curve, which is the function of m defined as the probability the learning algorithm makes a mistake predicting f(x_{m+1}), having already seen the examples (x_1, f(x_1)), ..., (x_m, f(x_m)). \n\n*Contact author. Address: AT&T Bell Laboratories, 600 Mountain Avenue, Room 2A-423, Murray Hill, New Jersey 07974. Electronic mail: mkearns@research.att.com. \n\nIn this paper we study learning curves in an average-case setting that admits a prior distribution over the concepts in F. We examine learning curve behavior for the optimal Bayes algorithm and for the related Gibbs algorithm that has been studied in statistical physics analyses of learning curve behavior. For both algorithms we give new upper and lower bounds on the learning curve in terms of the Shannon information gain. \n\nThe main contribution of this research is in showing that the average-case or Bayesian model provides a unifying framework for the popular statistical physics and VC dimension theories of learning curves. By beginning in an average-case setting and deriving bounds in information-theoretic terms, we can gradually recover a worst-case theory by removing the averaging in favor of combinatorial parameters that upper bound certain expectations. \n\nDue to space limitations, the paper is technically dense and almost all derivations and proofs have been omitted. We strongly encourage the reader to refer to our longer and more complete versions [4, 6] for additional motivation and technical detail. \n\n2 NOTATIONAL CONVENTIONS \n\nLet X be a set called the instance space. A concept class F over X is a (possibly infinite) collection of subsets of X. We will find it convenient to view a concept f ∈ F as a function f : X → {0,1}, where we interpret f(x) = 1 to mean that x ∈ X is a positive example of f, and f(x) = 0 to mean x is a negative example. The symbols P and D are used to denote probability distributions. The distribution P is over F, and D is over X. When F and X are countable we assume that these distributions are defined as probability mass functions.
For uncountable F and X they are assumed to be probability measures over some appropriate σ-algebra. All of our results hold for both countable and uncountable F and X. \n\nWe use the notation E_{f∈P}[X(f)] for the expectation of the random variable X with respect to the distribution P, and Pr_{f∈P}[cond(f)] for the probability with respect to the distribution P of the set of all f satisfying the predicate cond(f). Everything that needs to be measurable is assumed to be measurable. \n\n3 INFORMATION GAIN AND LEARNING \n\nLet F be a concept class over the instance space X. Fix a target concept f ∈ F and an infinite sequence of instances x = x_1, ..., x_m, x_{m+1}, ... with x_m ∈ X for all m. For now we assume that the fixed instance sequence x is known in advance to the learner, but that the target concept f is not. Let P be a probability distribution over the concept class F. We think of P in the Bayesian sense as representing the prior beliefs of the learner about which target concept it will be learning. \n\nIn our setting, the learner receives information about f incrementally via the label sequence f(x_1), ..., f(x_m), f(x_{m+1}), .... At time m, the learner receives the label f(x_m). For any m ≥ 1 we define (with respect to x, f) the mth version space \n\nF_m(x, f) = {f̂ ∈ F : f̂(x_1) = f(x_1), ..., f̂(x_m) = f(x_m)} \n\nand the mth volume V_m^P(x, f) = P[F_m(x, f)]. We define F_0(x, f) = F for all x and f, so V_0^P(x, f) = 1. The version space at time m is simply the class of all concepts in F consistent with the first m labels of f (with respect to x), and the mth volume is the measure of this class under P. For the first part of the paper, the infinite instance sequence x and the prior P are fixed; thus we simply write F_m(f) and V_m(f). Later, when the sequence x is chosen randomly, we will reintroduce this dependence explicitly.
We adopt this notational practice of omitting any dependence on a fixed x in many other places as well. \n\nFor each m ≥ 0 let us define the mth posterior distribution P_m(x, f) = P_m by restricting P to the mth version space F_m(f); that is, for all (measurable) S ⊆ F, P_m[S] = P[S ∩ F_m(f)]/P[F_m(f)] = P[S ∩ F_m(f)]/V_m(f). \n\nHaving already seen f(x_1), ..., f(x_m), how much information (assuming the prior P) does the learner expect to gain by seeing f(x_{m+1})? If we let I_{m+1}(x, f) (abbreviated I_{m+1}(f) since x is fixed for now) be a random variable whose value is the (Shannon) information gained from f(x_{m+1}), then it can be shown that the expected information is \n\nE_{f∈P}[I_{m+1}(f)] = E_{f∈P}[-log (V_{m+1}(f)/V_m(f))] = E_{f∈P}[-log X_{m+1}(f)]   (1) \n\nwhere we define the (m+1)st volume ratio by X_{m+1}^P(x, f) = X_{m+1}(f) = V_{m+1}(f)/V_m(f). \n\nWe now return to our learning problem, which we define to be that of predicting the label f(x_{m+1}) given only the previous labels f(x_1), ..., f(x_m). The first learning algorithm we consider is called the Bayes optimal classification algorithm, or the Bayes algorithm for short. For any m and b ∈ {0,1}, define F_m^b(x, f) = F_m^b(f) = {f̂ ∈ F_m(x, f) : f̂(x_{m+1}) = b}. Then the Bayes algorithm is: \n\nIf P_m[F_m^1(f)] > P_m[F_m^0(f)], predict f(x_{m+1}) = 1. \nIf P_m[F_m^1(f)] < P_m[F_m^0(f)], predict f(x_{m+1}) = 0. \nIf P_m[F_m^1(f)] = P_m[F_m^0(f)], flip a fair coin to predict f(x_{m+1}). \n\nIt is well known that if the target concept f is drawn at random according to the prior distribution P, then the Bayes algorithm is optimal in the sense that it minimizes the probability that f(x_{m+1}) is predicted incorrectly.
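As a concrete illustration, the Bayes rule just described can be implemented directly for a small finite class. The sketch below uses an illustrative class of threshold concepts with a uniform prior; the class, prior, and helper names are our own choices, not the paper's.

```python
import random

# Toy finite class of threshold concepts f_t(x) = 1 iff x >= t on X = {0,1,2,3};
# the class and uniform prior are illustrative choices, not from the paper.
thresholds = [0, 1, 2, 3, 4]
prior = {t: 1.0 / len(thresholds) for t in thresholds}

def label(t, x):
    """The concept f_t applied to instance x."""
    return 1 if x >= t else 0

def version_space(examples):
    """Concepts consistent with the labelled examples seen so far (F_m)."""
    return [t for t in thresholds if all(label(t, x) == y for x, y in examples)]

def volume(examples):
    """V_m = P[F_m], the prior mass of the version space."""
    return sum(prior[t] for t in version_space(examples))

def bayes_predict(examples, x_next):
    """Bayes rule: predict the label carrying more posterior mass; coin on ties."""
    vs = version_space(examples)
    w1 = sum(prior[t] for t in vs if label(t, x_next) == 1)
    w0 = sum(prior[t] for t in vs if label(t, x_next) == 0)
    if w1 > w0:
        return 1
    if w1 < w0:
        return 0
    return random.randint(0, 1)  # tie broken by a fair coin, as in the text

# After seeing f(1) = 0, the surviving thresholds are {2, 3, 4}; most of their
# prior mass labels x = 2 with 0, so the Bayes prediction for f(2) is 0.
print(version_space([(1, 0)]), bayes_predict([(1, 0)], 2))
```

Since the posterior is just the prior restricted to the version space, no explicit renormalization is needed to compare the two masses.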
Furthermore, if we let Bayes_{m+1}^P(x, f) (abbreviated Bayes_{m+1}(f) since x is fixed for now) be a random variable whose value is 1 if the Bayes algorithm makes a mistake predicting f(x_{m+1}) and 0 otherwise, then it can be shown that the probability of a mistake for a random f is \n\nE_{f∈P}[Bayes_{m+1}(f)] = Pr_{f∈P}[X_{m+1}(f) < 1/2] + (1/2) Pr_{f∈P}[X_{m+1}(f) = 1/2].   (2) \n\nDespite the optimality of the Bayes algorithm, it suffers the drawback that its hypothesis at any time m may not be a member of the target class F. (Here we define the hypothesis of an algorithm at time m to be the (possibly probabilistic) mapping f̂ : X → {0,1} obtained by letting f̂(x) be the prediction of the algorithm when x_{m+1} = x.) This drawback is absent in our second learning algorithm, which we call the Gibbs algorithm [6]: \n\nGiven f(x_1), ..., f(x_m), choose a hypothesis concept f̂ randomly from P_m. Given x_{m+1}, predict f(x_{m+1}) = f̂(x_{m+1}). \n\nThe Gibbs algorithm is the \"zero-temperature\" limit of the learning algorithm studied in several recent papers [2, 3, 8, 9]. If we let Gibbs_{m+1}^P(x, f) (abbreviated Gibbs_{m+1}(f) since x is fixed for now) be a random variable whose value is 1 if the Gibbs algorithm makes a mistake predicting f(x_{m+1}) and 0 otherwise, then it can be shown that the probability of a mistake for a random f is \n\nE_{f∈P}[Gibbs_{m+1}(f)] = E_{f∈P}[1 - X_{m+1}(f)].   (3) \n\nNote that by the definition of the Gibbs algorithm, Equation (3) is exactly the average probability of mistake of a consistent hypothesis, using the distribution on F defined by the prior. Thus bounds on this expectation provide an interesting contrast to those obtained via VC dimension analysis, which always gives bounds on the probability of mistake of the worst consistent hypothesis. \n\n4 THE MAIN INEQUALITY \n\nIn this section we state one of our main results: a chain of inequalities that upper and lower bounds the expected error for both the Bayes and Gibbs algorithms by simple functions of the expected information gain.
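The expectations in Equations (2) and (3) can be evaluated exactly on a small finite class, which makes the ordering of the two mistake probabilities easy to check empirically. A minimal sketch, using an illustrative class of threshold concepts (our own choices, not the paper's):

```python
from fractions import Fraction

# Exact instantaneous mistake probabilities of the Bayes and Gibbs rules,
# averaged over a target drawn from the prior, on a toy class of threshold
# concepts f_t(x) = 1 iff x >= t (all choices here are illustrative).
thresholds = range(5)                        # concepts t = 0..4 on X = {0,1,2,3}
prior = {t: Fraction(1, 5) for t in thresholds}

def label(t, x):
    return 1 if x >= t else 0

def mistake_probs(xs, x_next):
    """Return (Pr[Bayes errs], Pr[Gibbs errs]) at x_next after seeing xs."""
    bayes = gibbs = Fraction(0)
    for f in thresholds:                     # target f drawn from the prior
        vs = [t for t in thresholds
              if all(label(t, x) == label(f, x) for x in xs)]
        vol = sum(prior[t] for t in vs)      # version-space volume V_m
        w_correct = sum(prior[t] for t in vs
                        if label(t, x_next) == label(f, x_next))
        ratio = w_correct / vol              # the volume ratio X_{m+1}(f)
        gibbs += prior[f] * (1 - ratio)      # Gibbs errs with prob 1 - ratio
        if ratio < Fraction(1, 2):           # Bayes errs when the wrong label
            bayes += prior[f]                # holds the majority of the mass,
        elif ratio == Fraction(1, 2):
            bayes += prior[f] / 2            # and half the time on exact ties
    return bayes, gibbs

b, g = mistake_probs([0, 2], 1)
assert b <= g <= 2 * b                       # Bayes no worse; Gibbs at most 2x
```

Exact rationals avoid any floating-point ambiguity in the tie test `ratio == 1/2`.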
More precisely, using the characterizations of the expectations in terms of the volume ratio X_{m+1}(f) given by Equations (1), (2) and (3), we can prove the following, which we refer to as the main inequality: \n\nH^{-1}(E_{f∈P}[I_{m+1}(f)]) ≤ E_{f∈P}[Bayes_{m+1}(f)] ≤ E_{f∈P}[Gibbs_{m+1}(f)] ≤ (1/2) E_{f∈P}[I_{m+1}(f)].   (4) \n\nHere we have defined an inverse to the binary entropy function H(p) = -p log p - (1-p) log(1-p) by letting H^{-1}(q), for q ∈ [0,1], be the unique p ∈ [0,1/2] such that H(p) = q. Note that the bounds given depend on properties of the particular prior P, and on properties of the particular fixed sequence x. These upper and lower bounds are equal (and therefore tight) at both extremes E_{f∈P}[I_{m+1}(f)] = 1 (maximal information gain) and E_{f∈P}[I_{m+1}(f)] = 0 (minimal information gain). To obtain a weaker but perhaps more convenient lower bound, it can also be shown that there is a constant c_0 > 0 such that for all p > 0, H^{-1}(p) ≥ c_0 p / log(2/p). Finally, if all that is wanted is a direct comparison of the performances of the Gibbs and Bayes algorithms, we can also show: \n\nE_{f∈P}[Gibbs_{m+1}(f)] ≤ 2 E_{f∈P}[Bayes_{m+1}(f)].   (5) \n\n5 THE MAIN INEQUALITY: CUMULATIVE VERSION \n\nIn this section we state a cumulative version of the main inequality: namely, bounds on the expected cumulative number of mistakes made in the first m trials (rather than just the instantaneous expectations). \n\nFirst, for the cumulative information gain, it can be shown that E_{f∈P}[Σ_{i=1}^m I_i(f)] = E_{f∈P}[-log V_m(f)]. This expression has a natural interpretation. The first m instances x_1, ..., x_m of x induce a partition Π_m^F(x) of the concept class F defined by Π_m^F(x) = Π_m^F = {F_m(x, f) : f ∈ F}. Note that |Π_m^F| is always at most 2^m, but may be considerably smaller, depending on the interaction between F and x_1, ..., x_m. It is clear that E_{f∈P}[-log V_m(f)] = -Σ_{π∈Π_m^F} P[π] log P[π].
Thus the expected cumulative information gained from the labels of x_1, ..., x_m is simply the entropy of the partition Π_m^F under the distribution P. We shall denote this entropy by H^P(Π_m^F(x)) = H_m^P(x) = H_m^P. Now analogous to the main inequality for the instantaneous case (Inequality (4)), we can show: \n\nc_0 H_m^P / log(2m/H_m^P) ≤ m H^{-1}(H_m^P / m) ≤ E_{f∈P}[Σ_{i=1}^m Bayes_i(f)] ≤ E_{f∈P}[Σ_{i=1}^m Gibbs_i(f)] ≤ (1/2) H_m^P.   (6) \n\nHere we have applied the inequality H^{-1}(p) ≥ c_0 p / log(2/p) in order to give the lower bound in more convenient form. As in the instantaneous case, the upper and lower bounds here depend on properties of the particular P and x. When the cumulative information gain is maximum (H_m^P = m), the upper and lower bounds are tight. \n\nThese bounds on learning performance in terms of a partition entropy are of special importance to us, since they will form the crucial link between the Bayesian setting and the Vapnik-Chervonenkis dimension theory. \n\n6 MOVING TO A WORST-CASE THEORY: BOUNDING THE INFORMATION GAIN BY THE VC DIMENSION \n\nAlthough we have given upper bounds on the expected cumulative number of mistakes for the Bayes and Gibbs algorithms in terms of H_m^P(x), we are still left with the problem of evaluating this entropy, or at least obtaining reasonable upper bounds on it. We can intuitively see that the \"worst case\" for learning occurs when the partition entropy H_m^P(x) is as large as possible. In our context, the entropy is qualitatively maximized when two conditions hold: (1) the instance sequence x induces a partition of F that is the largest possible, and (2) the prior P gives equal weight to each element of this partition. \n\nIn this section, we move away from our Bayesian average-case setting to obtain worst-case bounds by formalizing these two conditions in terms of combinatorial parameters depending only on the concept class F.
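The partition induced on a concept class by the first m instances, and its entropy, can be computed explicitly for a small class. A minimal sketch with an illustrative class of interval concepts (the class, instances, and prior are our choices, not the paper's); it also confirms that the entropy never exceeds the log of the number of partition cells:

```python
import math
from collections import defaultdict

# The partition of F induced by the first m instances, and its entropy, for a
# toy class of interval concepts f_[a,b](x) = 1 iff a <= x <= b
# (class, instances and prior are illustrative choices).
points = [2, 5, 8]                          # the instances x_1, ..., x_m
concepts = [(a, b) for a in range(10) for b in range(a, 10)]
prior = {c: 1.0 / len(concepts) for c in concepts}

def label(c, x):
    a, b = c
    return 1 if a <= x <= b else 0

# One partition cell per distinct labelling of x_1..x_m; a cell's probability
# is the total prior mass of the concepts producing that labelling.
cells = defaultdict(float)
for c in concepts:
    cells[tuple(label(c, x) for x in points)] += prior[c]

entropy = -sum(p * math.log2(p) for p in cells.values() if p > 0)
# Only 7 of the 2^3 = 8 labellings are realised (an interval cannot produce
# the pattern (1, 0, 1)), and the entropy is at most log2 of that count.
assert len(cells) == 7
assert entropy <= math.log2(len(cells)) + 1e-9
```

The gap between the number of realised labellings and 2^m is exactly the combinatorial slack that the VC dimension machinery of the next section exploits.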
In doing so, we form the link between the theory developed so far and the VC dimension theory. \n\nThe second of the two conditions above is easily quantified. Since the entropy of a partition is at most the logarithm of the number of classes in it, a trivial upper bound on the entropy which holds for all priors P is H_m^P(x) ≤ log |Π_m^F(x)|. VC dimension theory provides an upper bound on log |Π_m^F(x)| as follows. For any sequence x = x_1, x_2, ... of instances and for m ≥ 1, let dim_m(F, x) denote the largest d ≥ 0 such that there exists a subsequence x_{i_1}, ..., x_{i_d} of x_1, ..., x_m with |Π_d^F((x_{i_1}, ..., x_{i_d}))| = 2^d; that is, for every possible labeling of x_{i_1}, ..., x_{i_d} there is some target concept in F that gives this labeling. The Vapnik-Chervonenkis (VC) dimension of F is defined by dim(F) = max{dim_m(F, x) : m ≥ 1 and x_1, x_2, ... ∈ X}. It can be shown [7, 10] that for all x and m ≥ d ≥ 1, \n\nlog |Π_m^F(x)| ≤ (1 + o(1)) dim_m(F, x) log (m / dim_m(F, x))   (7) \n\nwhere o(1) is a quantity that goes to zero as α = m/dim_m(F, x) goes to infinity. \n\nIn all of our discussions so far, we have assumed that the instance sequence x is fixed in advance, but that the target concept f is drawn randomly according to P. We now move to the completely probabilistic model, in which f is drawn according to P, and each instance x_m in the sequence x is drawn randomly and independently according to a distribution D over the instance space X (this infinite sequence of draws from D will be denoted x ∈ D*). Under these assumptions, it follows from Inequalities (6) and (7), and the observation above that H_m^P(x) ≤ log |Π_m^F(x)|, that for any P and any D, \n\nE_{f∈P, x∈D*}[Σ_{i=1}^m Bayes_i(x, f)] ≤ E_{f∈P, x∈D*}[Σ_{i=1}^m Gibbs_i(x, f)] \n≤ (1/2) E_{x∈D*}[log |Π_m^F(x)|] \n≤ ((1 + o(1))/2) E_{x∈D*}[dim_m(F, x) log (m / dim_m(F, x))] \n≤ (1 + o(1)) (dim(F)/2) log (m / dim(F)).   (8) \n\nThe expectation E_{x∈D*}[log |Π_m^F(x)|] is the VC entropy defined by Vapnik and Chervonenkis in their seminal paper on uniform convergence [11]. \n\nIn terms of instantaneous mistake bounds, using more sophisticated techniques [4], we can show that for any P and any D, \n\nE_{f∈P, x∈D*}[Bayes_m(x, f)] ≤ E_{x∈D*}[dim_m(F, x)/m] ≤ dim(F)/m   (9) \n\nE_{f∈P, x∈D*}[Gibbs_m(x, f)] ≤ E_{x∈D*}[2 dim_m(F, x)/m] ≤ 2 dim(F)/m.   (10) \n\nHaussler, Littlestone and Warmuth [5] construct specific D, P and F for which the last bound given by Inequality (8) is tight to within a factor of 1/ln 2 ≈ 1.44; thus this bound cannot be improved by more than this factor in general. Similarly, the bound given by Inequality (9) cannot be improved by more than a factor of 2 in general.¹ \n\n¹It follows that the expected total number of mistakes of the Bayes and the Gibbs algorithms differ by a factor of at most about 1.44 in each of these cases; this was not previously known. \n\nFor specific D, P and F, however, it is possible to improve the general bounds given in Inequalities (8), (9) and (10) by more than the factors indicated above. We calculate the instantaneous mistake bounds for the Bayes and Gibbs algorithms in the natural case that F is the set of homogeneous linear threshold functions on R^d and both the distribution D and the prior P on possible target concepts (represented also by vectors in R^d) are uniform on the unit sphere in R^d. This class has VC dimension d.
In this case, under certain reasonable assumptions used in statistical mechanics, it can be shown that for m ≫ d ≫ 1, \n\nE_{f∈P, x∈D*}[Bayes_m(x, f)] ≈ 0.44 d/m \n\n(compared with the upper bound of d/m given by Inequality (9) for any class of VC dimension d) and \n\nE_{f∈P, x∈D*}[Gibbs_m(x, f)] ≈ 0.62 d/m \n\n(compared with the upper bound of 2d/m in Inequality (10)). The ratio of these asymptotic bounds is √2. We can also show that this performance advantage of Bayes over Gibbs is quite robust even when P and D vary, and there is noise in the examples [6]. \n\n7 OTHER RESULTS AND CONCLUSIONS \n\nWe have a number of other results, and briefly describe here one that may be of particular interest to neural network researchers. In the case that the class F has infinite VC dimension (for instance, if F is the class of all multi-layer perceptrons of finite size), we can still obtain bounds on the number of cumulative mistakes by decomposing F into F_1, F_2, ..., F_i, ..., where each F_i has finite VC dimension, and by decomposing the prior P over F as a linear sum P = Σ_{i=1}^∞ a_i P_i, where each P_i is an arbitrary prior over F_i, and Σ_{i=1}^∞ a_i = 1. A typical decomposition might let F_i be all multi-layer perceptrons of a given architecture with at most i weights, in which case d_i = O(i log i) [1]. Here we can show an upper bound on the cumulative mistakes during the first m examples of roughly H({a_i}) + [Σ_{i=1}^∞ a_i d_i] log m for both the Bayes and Gibbs algorithms, where H({a_i}) = -Σ_{i=1}^∞ a_i log a_i. The quantity Σ_{i=1}^∞ a_i d_i plays the role of an \"effective VC dimension\" relative to the prior weights {a_i}. In the case that x is also chosen randomly, we can bound the probability of mistake on the mth trial by roughly (1/m)(H({a_i}) + [Σ_{i=1}^∞ a_i d_i] log m). \n\nIn our current research we are working on extending the basic theory presented here to the problems of learning with noise (see Opper and Haussler [6]), learning multi-valued functions, and learning with other loss functions.
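The cumulative-mistake bound for a decomposed class is easy to evaluate numerically. A minimal sketch; the prior weights a_i and dimensions d_i below are made-up illustrative numbers, not values from the paper:

```python
import math

# Evaluating the cumulative-mistake bound H({a_i}) + (sum_i a_i d_i) log m for
# a hypothetical decomposition of F into subclasses F_i of finite VC dimension;
# the weights and dimensions are illustrative choices only.
def mixture_bound(a, d, m):
    """H({a_i}) + (sum_i a_i * d_i) * log2(m)."""
    h = -sum(ai * math.log2(ai) for ai in a if ai > 0)   # entropy of the weights
    eff_dim = sum(ai * di for ai, di in zip(a, d))       # "effective VC dimension"
    return h + eff_dim * math.log2(m)

a = [0.5, 0.25, 0.125, 0.125]     # prior weights a_i, summing to 1
d = [10, 40, 90, 160]             # finite VC dimensions d_i of F_1, ..., F_4
print(mixture_bound(a, d, 1000))  # grows only logarithmically with m
```

Note how front-loading the prior mass on the low-dimension subclasses keeps the effective dimension, and hence the bound, small even though the union has infinite VC dimension.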
\n\nPerhaps the most important general conclusion to be drawn from the work presented here is that the various theories of learning curves based on diverse ideas from information theory, statistical physics and the VC dimension are all in fact closely related, and can be naturally and beneficially placed in a common Bayesian framework. \n\nAcknowledgements \n\nWe are greatly indebted to Ron Rivest for his valuable suggestions and guidance, and to Sara Solla and Naftali Tishby for insightful ideas in the early stages of this investigation. We also thank Andrew Barron, Andy Kahn, Nick Littlestone, Phil Long, Terry Sejnowski and Haim Sompolinsky for stimulating discussions on these topics. This research was supported by ONR grant N00014-91-J-1162, AFOSR grant AFOSR-89-0506, ARO grant DAAL03-86-K-0171, DARPA contract N00014-89-J-1988, and a grant from the Siemens Corporation. This research was conducted in part while M. Kearns was at the M.I.T. Laboratory for Computer Science and the International Computer Science Institute, and while R. Schapire was at the M.I.T. Laboratory for Computer Science and Harvard University. \n\nReferences \n\n[1] E. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989. \n\n[2] J. Denker, D. Schwartz, B. Wittner, S. Solla, R. Howard, L. Jackel, and J. Hopfield. Automatic learning, rule extraction and generalization. Complex Systems, 1:877-922, 1987. \n\n[3] G. Györgyi and N. Tishby. Statistical theory of learning a rule. In Neural Networks and Spin Glasses. World Scientific, 1990. \n\n[4] D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In Computational Learning Theory: Proceedings of the Fourth Annual Workshop. Morgan Kaufmann, 1991. \n\n[5] D. Haussler, N. Littlestone, and M. Warmuth. Predicting {0,1}-functions on randomly drawn points. Technical Report UCSC-CRL-90-54, University of California Santa Cruz, Computer Research Laboratory, Dec. 1990. \n\n[6] M. Opper and D. Haussler. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Computational Learning Theory: Proceedings of the Fourth Annual Workshop. Morgan Kaufmann, 1991. \n\n[7] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory (Series A), 13:145-147, 1972. \n\n[8] H. Sompolinsky, N. Tishby, and H. Seung. Learning from examples in large neural networks. Physical Review Letters, 65:1683-1686, 1990. \n\n[9] N. Tishby, E. Levin, and S. Solla. Consistent inference of probabilities in layered networks: predictions and generalizations. In IJCNN International Joint Conference on Neural Networks, volume II, pages 403-409. IEEE, 1989. \n\n[10] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982. \n\n[11] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.", "award": [], "sourceid": 489, "authors": [{"given_name": "David", "family_name": "Haussler", "institution": null}, {"given_name": "Michael", "family_name": "Kearns", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}, {"given_name": "Robert", "family_name": "Schapire", "institution": null}]}