{"title": "Learning with ensembles: How overfitting can be useful", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 196, "abstract": null, "full_text": "Learning with ensembles: How over-fitting can be useful \n\nPeter Sollich \nDepartment of Physics \nUniversity of Edinburgh, U.K. \nP.Sollich@ed.ac.uk \n\nAnders Krogh* \nNORDITA, Blegdamsvej 17 \n2100 Copenhagen, Denmark \nkrogh@sanger.ac.uk \n\nAbstract \n\nWe study the characteristics of learning with ensembles. Solving exactly the simple model of an ensemble of linear students, we find surprisingly rich behaviour. For learning in large ensembles, it is advantageous to use under-regularized students, which actually over-fit the training data. Globally optimal performance can be obtained by choosing the training set sizes of the students appropriately. For smaller ensembles, optimization of the ensemble weights can yield significant improvements in ensemble generalization performance, in particular if the individual students are subject to noise in the training process. Choosing students with a wide range of regularization parameters makes this improvement robust against changes in the unknown level of noise in the training data. \n\n1 INTRODUCTION \n\nAn ensemble is a collection of a (finite) number of neural networks or other types of predictors that are trained for the same task. A combination of many different predictors can often improve predictions, and in statistics this idea has been investigated extensively, see e.g. [1, 2, 3]. In the neural networks community, ensembles of neural networks have been investigated by several groups, see for instance [4, 5, 6, 7]. Usually the networks in the ensemble are trained independently and then their predictions are combined. 
\n\nIn this paper we study an ensemble of linear networks trained on different but overlapping training sets. The limit in which all the networks are trained on the full data set and the one where all the data sets are different has been treated in [8]. In this paper we treat the case of intermediate training set sizes and overlaps exactly, yielding novel insights into ensemble learning. Our analysis also allows us to study the effect of regularization and of having different predictors in an ensemble. \n\n*Present address: The Sanger Centre, Hinxton, Cambs CB10 1RQ, UK. \n\n\fLearning with Ensembles: How Overfitting Can Be Useful \n\n191 \n\n2 GENERAL FEATURES OF ENSEMBLE LEARNING \n\nWe consider the task of approximating a target function f0 from R^N to R. It will be assumed that we can only obtain noisy samples of the function, and the (now stochastic) target function will be denoted y(x). The inputs x are taken to be drawn from some distribution P(x). Assume now that an ensemble of K independent predictors f_k(x) of y(x) is available. A weighted ensemble average is denoted by a bar, like \n\nf̄(x) = Σ_k w_k f_k(x), (1) \n\nwhich is the final output of the ensemble. One can think of the weight w_k as the belief in predictor k, and we therefore constrain the weights to be positive and to sum to one. For an input x we define the error of the ensemble ε(x), the error of the kth predictor ε_k(x), and its ambiguity a_k(x): \n\nε(x) = (y(x) − f̄(x))², (2) \nε_k(x) = (y(x) − f_k(x))², (3) \na_k(x) = (f_k(x) − f̄(x))². (4) \n\nThe ensemble error can be written as ε(x) = ε̄(x) − ā(x) [7], where ε̄(x) = Σ_k w_k ε_k(x) is the average error over the individual predictors and ā(x) = Σ_k w_k a_k(x) is the average of their ambiguities, which is the variance of the output over the ensemble. 
By averaging over the input distribution P(x) (and implicitly over the target outputs y(x)), one obtains the ensemble generalization error \n\nε = ε̄ − ā, (5) \n\nwhere ε(x) averaged over P(x) is simply denoted ε, and similarly for ε̄ and ā. The first term on the right is the weighted average of the generalization errors of the individual predictors, and the second is the weighted average of the ambiguities, which we refer to as the ensemble ambiguity. An important feature of equation (5) is that it separates the generalization error into a term that depends on the generalization errors of the individual students and another term that contains all correlations between the students. The latter can be estimated entirely from unlabeled data, i.e., without any knowledge of the target function to be approximated. The relation (5) also shows that the more the predictors differ, the lower the error will be, provided the individual errors remain constant. \n\nIn this paper we assume that the predictors are trained on a sample of p examples of the target function, (x^μ, y^μ), where y^μ = f0(x^μ) + η^μ and η^μ is some additive noise (μ = 1, …, p). The predictors, to which we refer as students in this context because they learn the target function from the training examples, need not be trained on all the available data. In fact, since training on different data sets will generally increase the ambiguity, it is possible that training on subsets of the data will improve generalization. An additional advantage is that, by holding out for each student a different part of the total data set for the purpose of testing, one can use the whole data set for training the ensemble while still getting an unbiased estimate of the ensemble generalization error. Denoting this estimate by ε̂, one has \n\nε̂ = ε̄_test − ā, (6) \n\nwhere ε̄_test = Σ_k w_k ε_test,k is the average of the students' test errors. 
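The decomposition (5) is an exact identity for any predictors and any positive weights summing to one, and is easy to verify numerically. A minimal sketch (all data synthetic, variable names ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: K predictors evaluated on n inputs (all values made up).
K, n = 5, 1000
y = rng.normal(size=n)                         # noisy target outputs y(x)
f = y + rng.normal(size=(K, n))                # predictor outputs f_k(x)
w = rng.random(K); w /= w.sum()                # positive weights summing to one

fbar = w @ f                                   # ensemble output, eq. (1)
eps = np.mean((y - fbar) ** 2)                 # ensemble error ε
eps_bar = w @ np.mean((y - f) ** 2, axis=1)    # weighted individual errors ε̄
amb = w @ np.mean((f - fbar) ** 2, axis=1)     # weighted ambiguities ā

print(np.isclose(eps, eps_bar - amb))          # ε = ε̄ − ā, eq. (5)
```

Note that `amb` is computed from the predictor outputs alone; no target values enter, which is exactly why the ambiguity term can be estimated from unlabeled data.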
As already pointed out, the estimate ā of the ensemble ambiguity can be found from unlabeled data. \n\n\f192 \n\nP. SOLLICH, A. KROGH \n\nSo far, we have not mentioned how to find the weights w_k. Often uniform weights are used, but optimization of the weights in some way is tempting. In [5, 6] the training set was used to perform the optimization, i.e., the weights were chosen to minimize the ensemble training error. This can easily lead to over-fitting, and in [7] it was suggested to minimize the estimated generalization error (6) instead. If this is done, the estimate (6) acquires a bias; intuitively, however, we expect this effect to be small for large ensembles. \n\n3 ENSEMBLES OF LINEAR STUDENTS \n\nIn preparation for our analysis of learning with ensembles of linear students, we now briefly review the case of a single linear student, sometimes referred to as 'linear perceptron learning'. A linear student implements the input-output mapping \n\nf(x) = (1/√N) wᵀx, \n\nparameterized in terms of an N-dimensional parameter vector w with real components; the scaling factor 1/√N is introduced here for convenience, and ᵀ denotes the transpose of a vector. The student parameter vector w should not be confused with the ensemble weights w_k. The most common method for training such a linear student (or parametric inference models in general) is minimization of the sum-of-squares training error \n\nE = Σ_μ (y^μ − f(x^μ))² + λw², \n\nwhere μ = 1, …, p numbers the training examples. To prevent the student from fitting noise in the training data, a weight decay term λw² has been added. The size of the weight decay parameter λ determines how strongly large parameter vectors are penalized; large λ corresponds to a stronger regularization of the student. For a linear student, the global minimum of E can easily be found. 
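Concretely, minimizing E is ridge regression, and its global minimum has the standard closed form. A sketch (function names ours), assuming the training inputs are stored as the rows of a p × N matrix:

```python
import numpy as np

def train_linear_student(X, y, lam):
    """Minimize E = Σ_μ (y^μ − f(x^μ))² + λw² for f(x) = wᵀx/√N.

    X: (p, N) training inputs, y: (p,) outputs, lam: weight decay λ > 0."""
    p, N = X.shape
    A = X / np.sqrt(N)                 # absorb the 1/√N scaling into the design
    # Normal equations of the regularized problem: (AᵀA + λI) w = Aᵀy
    return np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)

def student_output(w, X):
    """f(x) = wᵀx/√N for each row x of X."""
    return X @ w / np.sqrt(X.shape[1])
```

Since λ > 0 makes E strictly convex, the returned w is the unique global minimizer; any perturbation of it can only increase E.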
However, in practical applications using non-linear networks this is generally not true, and training can be thought of as a stochastic process yielding a different solution each time. We crudely model this by considering white noise added to gradient descent updates of the parameter vector w. This yields a limiting distribution of parameter vectors P(w) ∝ exp(−E/2T), where the 'temperature' T measures the amount of noise in the training process. \n\nWe focus our analysis on the 'thermodynamic limit' N → ∞ at constant normalized number of training examples, α = p/N. In this limit, quantities such as the training or generalization error become self-averaging, i.e., their averages over all training sets become identical to their typical values for a particular training set. Assume now that the training inputs x^μ are chosen randomly and independently from a Gaussian distribution P(x) ∝ exp(−x²/2), and that training outputs are generated by a linear target function corrupted by additive noise, i.e., y^μ = w0ᵀx^μ/√N + η^μ, where the η^μ are zero mean noise variables with variance σ². Fixing the length of the parameter vector of the target function to w0² = N for simplicity, the generalization error of a linear student with weight decay λ and learning noise T becomes [9] \n\nε = (σ² + T)G + λ(σ² − λ) ∂G/∂λ. (7) \n\nOn the r.h.s. of this equation we have dropped the term arising from the noise on the target function alone, which is simply σ², and we shall follow this convention throughout. The 'response function' G is [10, 11] \n\nG = G(α, λ) = (1 − α − λ + √((1 − α − λ)² + 4λ)) / (2λ). (8) \n\nFor zero training noise, T = 0, and for any α, the generalization error (7) is minimized when the weight decay is set to λ = σ²; its value is then σ²G(α, σ²), which is the minimum achievable generalization error [9]. 
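Equations (7) and (8) are straightforward to evaluate numerically; a sketch (function names ours, with ∂G/∂λ taken by a central finite difference):

```python
import numpy as np

def G(alpha, lam):
    """Response function G(α, λ) of eq. (8)."""
    u = 1.0 - alpha - lam
    return (u + np.sqrt(u * u + 4.0 * lam)) / (2.0 * lam)

def gen_error(alpha, lam, sigma2, T, h=1e-6):
    """Single-student generalization error, eq. (7) (σ² noise term dropped,
    following the text); ∂G/∂λ approximated by a central difference."""
    dG = (G(alpha, lam + h) - G(alpha, lam - h)) / (2.0 * h)
    return (sigma2 + T) * G(alpha, lam) + lam * (sigma2 - lam) * dG

# At T = 0 and λ = σ², the second term of (7) vanishes, so the error
# equals σ²·G(α, σ²), the stated minimum achievable value.
alpha, sigma2 = 1.0, 0.2
print(gen_error(alpha, sigma2, sigma2, 0.0), sigma2 * G(alpha, sigma2))
```

Comparing `gen_error(alpha, lam, sigma2, 0.0)` across λ confirms that λ = σ² beats an under-regularized choice such as λ = 0.05, consistent with the optimality statement above.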
\n\n3.1 ENSEMBLE GENERALIZATION ERROR \n\nWe now consider an ensemble of K linear students with weight decays λ_k and learning noises T_k (k = 1, …, K). Each student has an ensemble weight w_k and is trained on Nα_k training examples, with students k and l sharing Nα_kl training examples (of course, α_kk = α_k). As above, we consider noisy training data generated by a linear target function. The resulting ensemble generalization error can be calculated by diagrammatic [10] or response function [11] methods. We refer the reader to a forthcoming publication for details and only state the result: \n\nε = Σ_kl w_k w_l ε_kl, (9) \n\nwhere the error correlations ε_kl are given by \n\n(10) \n\nHere ρ_k is defined as ρ_k = λ_k G(α_k, λ_k). The Kronecker delta in the last term of (10) arises because the training noises of different students are uncorrelated. The generalization errors and ambiguities of the individual students are \n\nε_k = ε_kk, a_k = ε_kk − 2 Σ_l w_l ε_kl + Σ_lm w_l w_m ε_lm; \n\nthe result for the ε_k can be shown to agree with the single student result (7). In the following sections, we shall explore the consequences of the general result (9). We will concentrate on the case where the training set of each student is sampled randomly from the total available data set of size Nα. For the overlap of the training sets of students k and l (k ≠ l) one then has α_kl/α = (α_k/α)(α_l/α) and hence \n\nα_kl = α_k α_l / α (11) \n\nup to fluctuations which vanish in the thermodynamic limit. For finite ensembles one can construct training sets for which α_kl < α_k α_l / α. This is an advantage, because it results in a smaller generalization error, but for simplicity we use (11). \n\n4 LARGE ENSEMBLE LIMIT \n\nWe now use our main result (9) to analyse the generalization performance of an ensemble with a large number K of students, in particular when the sizes of the training sets for the individual students are chosen optimally. 
If the ensemble weights w_k are approximately uniform (w_k ≈ 1/K), the off-diagonal elements of the matrix (ε_kl) dominate the generalization error for large K, and the contributions from the training noises T_k are suppressed. For the special case where all students are identical and are trained on training sets of identical size, α_k = (1 − c)α, the ensemble generalization error is shown in Figure 1 (left). The minimum at a nonzero value of c, which is the fraction of the total data set held out for testing each student, can clearly be seen. This confirms our intuition: when the students are trained on smaller, less overlapping training sets, the increase in error of the individual students can be more than offset by the corresponding increase in ambiguity. \n\nThe optimal training set sizes α_k can be calculated analytically: \n\nc_k = 1 − α_k/α = (1 − λ_k/σ²) / (1 + G(α, σ²)). (12) \n\n\f194 \n\nP. SOLLICH, A. KROGH \n\nFigure 1: Generalization error and ambiguity for an infinite ensemble of identical students. Solid line: ensemble generalization error, ε; dotted line: average generalization error of the individual students, ε̄; dashed line: ensemble ambiguity, ā. For both plots α = 1 and σ² = 0.2. The left plot corresponds to under-regularized students with λ = 0.05 < σ². Here the generalization error of the ensemble has a minimum at a nonzero value of c. This minimum exists whenever λ < σ². The right plot shows the case of over-regularized students (λ = 0.3 > σ²), where the generalization error is minimal at c = 0. 
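Equation (12) gives the optimal held-out fraction directly; a sketch (function names ours) reproducing the two regimes discussed around Figure 1:

```python
import numpy as np

def G(alpha, lam):
    """Response function G(α, λ), eq. (8)."""
    u = 1.0 - alpha - lam
    return (u + np.sqrt(u * u + 4.0 * lam)) / (2.0 * lam)

def optimal_c(lam_k, alpha, sigma2):
    """Optimal test-set fraction c_k = 1 − α_k/α, eq. (12).

    A valid solution (c_k > 0) exists only for under-regularized
    students, λ_k < σ²; at λ_k = σ² the formula gives c_k = 0."""
    return (1.0 - lam_k / sigma2) / (1.0 + G(alpha, sigma2))

# Parameters of Figure 1: α = 1, σ² = 0.2.
print(optimal_c(0.05, 1.0, 0.2))   # under-regularized: hold out a nonzero fraction
print(optimal_c(0.2, 1.0, 0.2))    # λ = σ²: c = 0, train on everything
```

A negative return value for λ_k > σ² signals the over-regularized regime, where no data should be held out (Figure 1, right).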
\n\nThe resulting generalization error is ε = σ²G(α, σ²) + O(1/K), which is the globally minimal generalization error that can be obtained using all available training data, as explained in Section 3. Thus, a large ensemble with optimally chosen training set sizes can achieve globally optimal generalization performance. However, we see from (12) that a valid solution c_k > 0 exists only for λ_k < σ², i.e., if the ensemble is under-regularized. This is exemplified, again for an ensemble of identical students, in Figure 1 (right), which shows that for an over-regularized ensemble the generalization error is a monotonic function of c and thus minimal at c = 0. \n\nWe conclude this section by discussing how the adaptation of the training set sizes could be performed in practice, for simplicity confining ourselves to an ensemble of identical students, where only one parameter c = c_k = 1 − α_k/α has to be adapted. If the ensemble is under-regularized, one expects a minimum of the generalization error at some nonzero c, as in Figure 1. One could therefore start by training all students on a large fraction of the total data set (corresponding to c ≈ 0), and then gradually and randomly remove training examples from the students' training sets. Using (6), the generalization error of each student could be estimated from their performance on the examples on which they were not trained, and one would stop removing training examples when the estimate stops decreasing. The resulting estimate of the generalization error will be slightly biased; however, for a large enough ensemble the risk of a strongly biased estimate from systematically testing all students on too 'easy' training examples seems small, due to the random selection of examples. \n\n5 REALISTIC ENSEMBLE SIZES \n\nWe now discuss some effects that occur in learning with ensembles of 'realistic' sizes. 
\nIn an over-regularized ensemble, nothing can be gained by making the students more diverse by training them on smaller, less overlapping training sets. One would also expect this kind of 'diversification' to be unnecessary or even counterproductive when the training noise is high enough to provide sufficient 'inherent' diversity of students. In the large ensemble limit we saw that this effect is suppressed, but it does indeed occur in finite ensembles. Figure 2 shows the dependence of the generalization error on c for an ensemble of 10 identical, under-regularized students with identical training noises T_k = T. For small T, the minimum of ε at nonzero c persists. For larger T, ε is monotonically increasing with c, implying that further diversification of students beyond that caused by the learning noise is wasteful. The plot also shows the performance of the optimal single student (with λ chosen to minimize the generalization error at the given T), demonstrating that the ensemble can perform significantly better by effectively averaging out learning noise. \n\n\fLearning with Ensembles: How Overfitting Can Be Useful \n\n195 \n\nFigure 2: The generalization error of an ensemble with 10 identical students as a function of the test set fraction c. From bottom to top, the curves correspond to training noise T = 0, 0.1, 0.2, …, 1.0. The star on each curve shows the error of the optimal single perceptron (i.e., with optimal weight decay for the given T) trained on all examples, which is independent of c. The parameters for this example are: α = 1, λ = 0.05, σ² = 0.2. \n\nFor realistic ensemble sizes, the presence of learning noise generally reduces the potential for performance improvement by choosing optimal training set sizes. 
In such cases one can still adapt the ensemble weights to optimize performance, again on the basis of the estimate of the ensemble generalization error (6). An example is shown in Figure 3 for an ensemble of size K = 10 with the weight decays λ_k equally spaced on a logarithmic axis between 10⁻³ and 1. \n\nFigure 3: The generalization error of an ensemble of 10 students with different weight decays (marked by stars on the σ²-axis) as a function of the noise level σ². Left: training noise T = 0; right: T = 0.1. The dashed lines are for the ensemble with uniform weights, and the solid line is for optimized ensemble weights. The dotted lines are for the optimal single perceptron trained on all data. The parameters for this example are: α = 1, c = 0.2. \n\nFor both of the temperatures T shown, the ensemble with uniform weights performs worse than the optimal single student. With weight optimization, the generalization performance approaches that of the optimal single student for T = 0, and is actually better at T = 0.1 over the whole range of noise levels σ² shown. Even the best single student from the ensemble can never perform better than the optimal single student, so combining the student outputs in a weighted ensemble average is superior to simply choosing the best member of the ensemble by cross-validation, i.e., on the basis of its estimated generalization error. The reason is that the ensemble average suppresses the learning noise on the individual students. 
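The weight optimization itself needs only the students' test errors and their outputs on (unlabeled) inputs: minimize the estimate (6) over the simplex of positive, normalized weights. A minimal sketch, not the authors' actual procedure (all names and the optimization method ours): a softmax parameterization keeps the weights on the simplex, the gradient is taken by finite differences, and the best weights seen are retained.

```python
import numpy as np

def estimate(w, test_err, F):
    """ε̂(w) = Σ_k w_k ε_test,k − ā(w), eq. (6); F holds the outputs f_k(x)
    of each student on a set of (unlabeled) inputs, one row per student."""
    fbar = w @ F
    ambiguity = w @ np.mean((F - fbar) ** 2, axis=1)
    return w @ test_err - ambiguity

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def optimize_weights(test_err, F, steps=300, lr=0.1, h=1e-6):
    """Minimize ε̂ over the simplex; starts from uniform weights."""
    K = len(test_err)
    theta = np.zeros(K)                    # softmax(0) = uniform weights
    best_w = softmax(theta)
    best = estimate(best_w, test_err, F)
    for _ in range(steps):
        base = estimate(softmax(theta), test_err, F)
        g = np.zeros(K)
        for i in range(K):                 # finite-difference gradient in θ
            t = theta.copy(); t[i] += h
            g[i] = (estimate(softmax(t), test_err, F) - base) / h
        theta -= lr * g
        w = softmax(theta)
        v = estimate(w, test_err, F)
        if v < best:
            best, best_w = v, w
    return best_w

# Toy demo: 10 students with increasingly noisy outputs (all data synthetic).
rng = np.random.default_rng(2)
K, n = 10, 500
y = rng.normal(size=n)
F = y + np.linspace(0.2, 2.0, K)[:, None] * rng.normal(size=(K, n))
test_err = np.mean((y - F) ** 2, axis=1)   # stand-in for held-out test errors
w_opt = optimize_weights(test_err, F)
```

Because the uniform starting point is kept as a candidate, the optimized weights can never score worse on ε̂ than uniform weighting.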
\n\n6 CONCLUSIONS \n\nWe have studied ensemble learning in the simple, analytically solvable scenario of an ensemble of linear students. Our main findings are: In large ensembles, one should use under-regularized students in order to maximize the benefits of the variance-reducing effects of ensemble learning. In this way, the globally optimal generalization error on the basis of all the available data can be reached by optimizing the training set sizes of the individual students. At the same time, an estimate of the generalization error can be obtained. For ensembles of more realistic size, we found that for students subjected to a large amount of noise in the training process it is unnecessary to increase the diversity of students by training them on smaller, less overlapping training sets. In this case, optimizing the ensemble weights can still yield substantially better generalization performance than an optimally chosen single student trained on all data with the same amount of training noise. This improvement is most insensitive to changes in the unknown noise level σ² if the weight decays of the individual students cover a wide range. We expect most of these conclusions to carry over, at least qualitatively, to ensemble learning with nonlinear models, and this correlates well with experimental results presented in [7]. \n\nReferences \n\n[1] C. Granger, Journal of Forecasting 8, 231 (1989). \n[2] D. Wolpert, Neural Networks 5, 241 (1992). \n[3] L. Breiman, Tutorial at NIPS 7 and personal communication. \n[4] L. Hansen and P. Salamon, IEEE Trans. Pattern Anal. and Mach. Intell. 12, 993 (1990). \n[5] M. P. Perrone and L. N. Cooper, in Neural Networks for Speech and Image Processing, ed. R. J. Mammone (Chapman-Hall, 1993). \n[6] S. Hashem, Optimal Linear Combinations of Neural Networks, Tech. Rep. PNL-SA-25166, submitted to Neural Networks (1995). \n[7] A. Krogh and J. Vedelsby, in NIPS 7, ed. G. Tesauro et al., p. 231 (MIT Press, 1995). \n[8] R. Meir, in NIPS 7, ed. G. Tesauro et al., p. 295 (MIT Press, 1995). \n[9] A. Krogh and J. A. Hertz, J. Phys. A 25, 1135 (1992). \n[10] J. A. Hertz, A. Krogh, and G. I. Thorbergsson, J. Phys. A 22, 2133 (1989). \n[11] P. Sollich, J. Phys. A 27, 7771 (1994). \n", "award": [], "sourceid": 1044, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}, {"given_name": "Anders", "family_name": "Krogh", "institution": null}]}