{"title": "Bias, Variance and the Combination of Least Squares Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 295, "page_last": 302, "abstract": null, "full_text": "Bias, Variance and the Combination of \n\nLeast Squares Estimators \n\nRonny Meir \n\nFaculty of Electrical Engineering \n\nTechnion, Haifa 32000 \n\nIsrael \n\nrmeirGee.technion.ac.il \n\nAbstract \n\nWe consider the effect of combining several least squares estimators \non the expected performance of a regression problem. Computing \nthe exact bias and variance curves as a function of the sample size \nwe are able to quantitatively compare the effect of the combination \non the bias and variance separately, and thus on the expected error \nwhich is the sum of the two. Our exact calculations, demonstrate \nthat the combination of estimators is particularly useful in the case \nwhere the data set is small and noisy and the function to be learned \nis unrealizable. For large data sets the single estimator produces \nsuperior results. Finally, we show that by splitting the data set \ninto several independent parts and training each estimator on a \ndifferent subset, the performance can in some cases be significantly \nimproved. \n\nKey words: Bias, Variance, Least Squares, Combination. \n\n1 \n\nINTRODUCTION \n\nMany of the problems related to supervised learning can be boiled down to the \nquestion of balancing bias and variance. While reducing bias can usually be ac(cid:173)\ncomplished quite easily by simply increasing the complexity of the class of models \nstudied, this usually comes at the expense of increasing the variance in such a way \nthat the overall expected error (which is the sum of the two) is often increased. \n\n\f296 \n\nRonny Meir \n\nThus, many efforts have been devoted to the issue of decreasing variance, while at(cid:173)\ntempting to keep the concomitant increase in bias as small as possible. 
One of the methods which has become popular recently in the neural network community is variance reduction by combining estimators, although the idea has been around in the statistics and econometrics literature at least since the late sixties (see Granger 1989 for a review). Nevertheless, it seems that not much analytic work has been devoted to a detailed study of the effect of noise and an effectively finite sample size on the bias/variance balance. It is the explicit goal of this paper to study in detail a simple problem of linear regression, where the full bias/variance curve can be computed exactly for any effectively finite sample size and noise level. We believe that this simple and exactly solvable model can afford us insight into more complex non-linear problems, which are at the heart of much of the recent work in neural networks.

A further aspect of our work is related to the question of the partitioning of the data set between the various estimators. Thus, while most studies assume that each estimator is trained on the complete data set, it is possible to envisage a situation where the data set is broken up into several subsets, using each subset of data to form a different estimator. While such a scheme seems wasteful from the bias point of view, we will see that it in fact produces superior forecasts in some situations. This, perhaps surprising, result is due to a large decrease in variance resulting from the independence of the estimators, in the case where the data subsets are independent.

2 ON THE COMBINATION OF ESTIMATORS

The basic objective of regression is the following: given a finite training set, D, composed of n input/output pairs, D = {(x_μ, y_μ)}, μ = 1, ..., n, drawn according to an unknown distribution P(x, y), find a function ('estimator'), f(x; D), which 'best' approximates y.
Using the popular mean-squared error criterion and taking expectations with respect to the data distribution, one finds the well-known separation of the error into bias and variance terms respectively (Geman et al. 1992):

ε(x) = (E_D f(x; D) − E[y|x])² + E_D[ (f(x; D) − E_D f(x; D))² ] .    (1)

We consider a data source of the form y = g(x) + η, where the 'target' function g(x) is an unknown (and potentially non-linear) function and η is a Gaussian random variable with zero mean and variance σ². Clearly then E[y|x] = g(x).

In the usual scenario for parameter estimation one uses the complete data set, D, to form an estimator f(x; D). In this paper we consider the case where the data set D is broken up into K subsets (not necessarily disjoint), such that D = ∪_{k=1}^{K} D^(k), and a separate estimator is found for each subset. The full estimator is then given by the linear combination (Granger 1989)

f(x; D) = Σ_{k=1}^{K} b_k f_k(x; D^(k)) .    (2)

The optimal values of the parameters b_k can be easily obtained if the data distribution, P(x, y), is known, by simply minimizing the mean-squared error (Granger 1989). In the more typical case where this distribution is unknown, one may resort to other schemes such as least-squares fitting for the parameter vector b = (b_1, ..., b_K). The bias and variance of the combined estimator can be simply expressed in this case, and are given by

B(x; g) = ( Σ_k b_k E_D[f_k(x)] − g(x) )² ;  V(x; g) = Σ_{k,k'} b_k b_{k'} ( E_D[f_k(x) f_{k'}(x)] − E_D[f_k(x)] E_D[f_{k'}(x)] ) ,    (3)

where E_D[·] denotes an average with respect to the data. It is immediately apparent that the variance term is composed of two contributions. The first term, corresponding to k = k', simply computes a weighted average of the single-estimator variances, while the second term measures the average covariance between the different estimators.
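The two contributions to the variance can be illustrated with a small numerical sketch (the setup below, with a shared and a private fluctuation per estimator, is a hypothetical illustration rather than the paper's model): under uniform weights b_k = 1/K the k = k' term decays as 1/K, while the covariance term sets a floor that no amount of combination removes.

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 10, 200_000

# Each of K estimators fluctuates around the same mean with a shared
# component (pairwise covariance c) and a private component (variance s2).
s2, c = 1.0, 0.25
shared = rng.normal(0.0, np.sqrt(c), size=trials)
private = rng.normal(0.0, np.sqrt(s2), size=(trials, K))
f = shared[:, None] + private            # f_k for k = 1..K, one row per trial

combined = f.mean(axis=1)                # uniform combination, b_k = 1/K
var_combined = combined.var()

# First (k = k') contribution decays as s2/K; the covariance term c survives.
var_theory = s2 / K + c
```

Making the estimators weakly correlated (small c) is therefore the only way to push the combined variance below the covariance floor, which is the point made above.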
While the first term in the variance can be seen to decay as 1/K in the case where all the weights b_k are of the same order of magnitude, the second term remains finite unless the covariances between the estimators are very small. It would thus seem beneficial to attempt to make the estimators as weakly correlated as possible in order to decrease the variance. Observe that in the extreme case where the data sets are independent of each other, the second term in the variance vanishes identically. Note that the bias term depends only on single-estimator properties and can thus be computed from the theory of the single estimator. As mentioned above, however, the second term in the variance expression explicitly depends on correlations between the different estimators, and thus requires the computation of quantities beyond those of single estimators.

3 THE SINGLE LINEAR ESTIMATOR

Before considering the case of a combination of estimators, we first review the case of a single linear estimator, given by f(x; D) = w^T x, where w is estimated from the data set D. Following Bös et al. (1993) we further assume that the data arises through an equation of the form y = g(x) + η with g(x) = g(w_0^T x). Looking back at equations (3) it is clear that the bias and variance are explicit functions of x and the weight vector w_0. In order to remove the explicit dependence we compute in what follows expectations with respect to the probability distributions of x and w_0, denoted respectively by E_P[·] and E_0[·]. Thus, we define the averaged bias and variance by B = E_0 E_P[B(x; w_0)] and V = E_0 E_P[V(x; w_0)], and the expected error is then ε = B + V.

In this work we consider least-squares estimation, which corresponds to minimizing the empirical error ε_emp(w, D) = ||Xw − Y||², where X is the n × d data matrix, Y is the n × 1 output vector and w is a d × 1 weight vector. The components of the 'target' vector Y are given by y_μ = g(w_0^T x_μ) + η_μ, where the η_μ are i.i.d. normal random variables with zero mean and variance σ². Note that while we take the estimator itself to be linear, we allow the target function g(·) to be non-linear. This is meant to model the common situation where the model we are trying to fit is inadequate, since the correct model (even if it exists) is usually unknown.

Thus, the least squares estimator is given by w ∈ argmin_w ε_emp(w, D). Since in this case the error function is quadratic, it possesses either a unique global minimum or a degenerate manifold of minima, in the case where the Hessian matrix, X^T X, is singular.

The solution to the least squares problem is well known (see for example Scharf 1991), and will be briefly summarized. When the number of examples, n, is smaller than the input dimension, d, the problem is underdetermined and there are many solutions with zero empirical error. The solutions can be written out explicitly in the form

w = X^T (X X^T)^{-1} Y + [I − X^T (X X^T)^{-1} X] V    (n < d),    (4)

where V is an arbitrary d-dimensional vector. It should be noted that any vector w satisfying this equation (and thus any least-squares estimator) becomes singular as n approaches d from below, since the matrix X X^T becomes singular. The minimal norm solution, often referred to as the Moore-Penrose solution, is given in this case by the choice V = 0. It is common in the literature to neglect the study of the underdetermined regime, since the solution is not unique in this case. We, however, will pay specific attention to this case, corresponding to the often prevalent situation where the amount of data is small, attempting to show that the combination of estimators approach can significantly improve the quality of predictors in this regime. Moreover, many important inverse problems in signal processing fall into this category (Scharf 1991).
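The structure of the underdetermined solution set is easy to verify numerically. The sketch below (dimensions and seed are arbitrary choices) builds the Moore-Penrose solution and a second solution with a non-zero free vector V; both achieve zero empirical error, and the Moore-Penrose choice has the smaller norm.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                                  # n < d: underdetermined regime
X = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))
Y = rng.normal(size=n)

XXt_inv = np.linalg.inv(X @ X.T)
w_mp = X.T @ XXt_inv @ Y                       # Moore-Penrose solution: V = 0
P_null = np.eye(d) - X.T @ XXt_inv @ X         # projector onto null(X)
w_gen = w_mp + P_null @ rng.normal(size=d)     # solution with an arbitrary V

err_mp = np.linalg.norm(X @ w_mp - Y)          # both residuals vanish
err_gen = np.linalg.norm(X @ w_gen - Y)
```

In practice `np.linalg.pinv(X) @ Y` returns the same minimal-norm solution directly.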
\nIn the overdetermined case, n > d (assuming the matrix X to be of full rank), \na zero error solution is possible only if the function g( .) is linear and there is no \nnoise, namely El1]2] = O. In any other case, the problem becomes unrealizable and \nthe minimum error is non-zero. In any event, in this regime the unique solution \nminimizing the empirical error is given by \nW = (XT X)-l XTy \n\n(n > d). \n\n(5) \n\nIt is eay to see that this estimator is unbiased for linear g(.). \nIn order to compute the bias and variance for this model we use Eqs. (3) with \nK = 1 and bTl: = 1. In order to actually compute the expectations with respect to \nx and the weight vector Wo we assume in what follows that the random vector x \nis distributed according to a multi-dimensional normal distributions of zero mean \nand covariance matrix (lid)!. The vector Wo is similarly distributed with unit \ncovariance matrix. The reason for the particular scaling chosen for the covariance \nmatrices will become clear below. In the remainder of the paper we will be concerned \nwith exact calculations in the so called thermodynamic limit: n, d -+ 00 and a = \nnl d finite. This limit is particularly useful in that the central limit theorem allows \none to make precise statements about the behavior of the system, for an effectively \nfinite sample size, a. We note in passing that in the thermodynamic limit, d -+ 00, \nwe have Ei x; -+ 1 with probability 1 and similarly for (lid) Ei W5i' Using these \nsimple distributions we can, after some algerbra, directly compute the bias and \nvariance. Denoting R = Eo[wT . wo], r = Eollwll 2 , Q = Eollwl1 2 , one can show \nthat the bias and variance are given by \n\nB = 7' - 2ugR + g2 \n\n(6) \nIn the above equations we have used g2 = f Dug2(u) and ug = f Du ug(u) where \nthe Gaussian measure Du is defined by Du = (e-u~/2 1...j2-i)du. We note in passing \n\nv = Q -r. 
\n\n\fBias, Variance and the Combination of Least Squares Estimators \n\n299 \n\nthat the same result is obtained for any i.i.d variables, Xi, with zerO mean and \nvariance lid. We thus note that a complete calculation of the expected bias and \nvariance requires the explicit computation of the variables R, rand Q defined \nabove. In principle, with the explicit expressions (4) and (5) at hand one may \nproceed to compute all the quantities relevant to the evaluation of the bias and \nvariance. Unfortunately, it turns out that a direct computation of r, Rand Q using \nthese expressions is a rather difficult task in the theory of random matrices, keeping \nin mind the potential non-linearity of the function g{.). A way to solve the problem \ncan be undertaken via a slightly indirect route, using tools from statistical physics. \nThe variables Rand Q above have been recently computed by Bos et al. (1993) \nusing replicas and by Opper and Kinzel (1994) by a direct calculation. The variable \nr can be computed along similar lines resulting in the following expressions for the \nbias and variance (given for brevity for the Moore-Penrose solution): \n\n0'< 1 : \n\nB \n\na> 1 : \n\nv = _0'_ [g2 + u2 _ 0'(2 _ a )ttg2] \n\n1-0' \n\n(7) \n\nWe see from this solution that for Q > 1 the bias is constant, while the variance \nis monotonically decreasing with the sample size Q. For Q < 1, there are of course \nmultiple solutions corresponding to different normalizations Q. It is easy to see, \nhowever, that the Moore-Penrose solution, gives rise to the smallest variance of all \nleast-squares estimators (the bias is unaffected by the normalization of the solution) . \nThe expected (or generalization) error is given simply by \u00a3 = B + V, and is thus \nsmallest for the Moore-Penrose solution. Note that this latter result is independent \nof whether the function g(.) is linear or not. 
We note in passing that in the simple case where the target function g(·) is linear and the data is noise-free (σ² = 0), one obtains the particularly simple result ε = 1 − α for α < 1 and ε = 0 above α = 1. Note, however, that in any other case the expected error is a non-linear function of the normalized sample size α.

4 COMBINING LINEAR ESTIMATORS

Having summarized the situation in the case of a single estimator, we proceed to the case of K linear estimators. In this case we assume that the complete data set, D, is broken up into K subsets D^(k) of size n_k = α_k d each. In particular we consider two extreme situations: (i) the data sets are independent of each other, namely D^(k) ∩ D^(k') = ∅ for all k ≠ k', and (ii) D^(k) = D for all k. We refer to the first situation as the non-overlapping case, and to the second case, where all estimators are trained on the same data set, as the fully-overlapping case. Denoting by w^(k) the k'th least-squares estimator, based on training set D^(k), we define the following quantities:

R^(k) = (1/d) E_0[E_D[w^(k)]^T w_0] ,  r^(k,k') = (1/d) E_0[E_D[w^(k)]^T E_D[w^(k')]] ,  ρ^(k,k') = (1/d) E_0[E_D[w^(k)T w^(k')]].    (8)

Making use of Eqs. (3) and the probability distribution of w_0, one then straightforwardly finds for the mixture estimator

B = Σ_{k,k'} b_k b_{k'} r^(k,k') − 2ūg Σ_k b_k R^(k) + ḡ² ;  V = Σ_{k,k'} b_k b_{k'} [ρ^(k,k') − r^(k,k')].

4.1 NON-OVERLAPPING DATA SETS

For α > K we find the choice K = 1 always yields a lower expected error. Thus, while we have shown that for small sample sizes the effect of splitting the data set into independent subsets is helpful, this is no longer the case if the sample size is sufficiently large, in which case a single estimator based on the complete data set is superior. For α → ∞, however, one finds that all uniform mixtures converge (to leading order) at the same rate, namely ε_K(α) ≈ ε_∞ + (ε_∞ + σ²)/α, where ε_∞ = ḡ² − ūg².
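The small-sample claim can be probed with a brief simulation (a sketch under assumed sizes: d = 40, α = 0.8, noise σ = 0.5, K = 4; none of these values come from the paper). With a small, noisy data set and minimal-norm estimators, the uniform mixture trained on disjoint subsets typically beats the single estimator trained on the full data:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, K, sigma, trials = 40, 32, 4, 0.5, 200   # alpha = n/d = 0.8 < 1

err_single = err_split = 0.0
for _ in range(trials):
    w0 = rng.normal(size=d)                       # target weights
    X = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))
    Y = X @ w0 + sigma * rng.normal(size=n)       # noisy data

    # Single Moore-Penrose estimator on the full data set.
    w_single = np.linalg.pinv(X) @ Y

    # Uniform mixture of K estimators on disjoint subsets (alpha_k = 0.2).
    parts = np.array_split(np.arange(n), K)
    w_mix = np.mean([np.linalg.pinv(X[p]) @ Y[p] for p in parts], axis=0)

    # Expected error of a linear estimator: E_x[(w'x - w0'x)^2] = |w - w0|^2 / d.
    err_single += np.sum((w_single - w0) ** 2) / d / trials
    err_split += np.sum((w_mix - w0) ** 2) / d / trials
```

With these settings the single estimator sits close to the divergence at α = 1, so its variance dominates and the split mixture wins; increasing n well past d reverses the ordering, in line with the text.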
For finite values of α, however, the value of K has a strong effect on the quality of the estimator.

4.2 FULLY OVERLAPPING DATA SETS

We focus now on the case where all estimators are formed from the same data set D, namely D^(k) = D for all k. Since there is a unique solution to the least-squares estimation problem for α > 1, all least-squares estimators coincide in this regime. Thus, we focus here on the case α < 1, where multiple least-squares estimators coexist. We further assume that only mixtures of estimators of the same norm Q are allowed. For the uniform mixture, b_k = 1/K, the expressions above become

B = r − 2ūg·R + ḡ² ;  V = q + (Q − q)/K − r ,

where q = ρ^(k,k') for k ≠ k'. Clearly the expression for the bias in this case is identical to that obtained for the single estimator, since all estimators are based on the same data set and the bias term depends only on single-estimator properties. The variance term, however, is modified due to the correlation between the estimators, expressed through the variable ρ^(k,k'). Since the variance for the case of a single estimator is Q − r, and since q ≤ Q, it is clear that in this case the variance is reduced while the bias remains unchanged. Thus we conclude that the mixture of estimators in this case indeed produces performance superior to that of the single estimator. However, it can be seen that in the case of the Moore-Penrose solution, corresponding to choosing the smallest possible norm Q, the expected error is minimal. We thus conclude that for α < 1 the Moore-Penrose pseudo-inverse solution yields the lowest expected error, and this cannot be improved on by combining least-squares estimators obtained from the full data set D.

Recall that we have shown in the previous section that (for small and noisy data sets) combining estimators formed using non-overlapping data subsets produced results superior to those of any single estimator trained on the complete data set.
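The fully-overlapping statement has a direct numerical counterpart (a sketch; sizes and seed are arbitrary): any uniform mixture of zero-error least-squares solutions built from the same data set is again a zero-error solution, its norm is typically reduced relative to a single member, but it can never undercut the Moore-Penrose solution.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, K = 15, 40, 8                     # alpha < 1: a manifold of solutions
X = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n, d))
Y = rng.normal(size=n)

pinvX = np.linalg.pinv(X)
w_mp = pinvX @ Y                        # minimal-norm (Moore-Penrose) solution
P_null = np.eye(d) - pinvX @ X          # projector onto null(X)

# K distinct zero-error solutions from the SAME data, and their uniform mixture.
sols = [w_mp + P_null @ rng.normal(size=d) for _ in range(K)]
w_mix = np.mean(sols, axis=0)

residual = np.linalg.norm(X @ w_mix - Y)                  # mixture still fits exactly
norm_gap = np.linalg.norm(w_mix) - np.linalg.norm(w_mp)   # always >= 0
```

The mixture averages away the random null-space components, shrinking the norm toward the Moore-Penrose value without ever passing it, which is why the pseudo-inverse solution cannot be improved on in this setting.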
\nAn interesting conclusion of these results is that splitting the data set into non(cid:173)\noverlapping subsets is a better strategy than training each estimator with the full \ndata. As mentioned previously, the basic reason for this is the independence of \nthe estimators formed in this fashion, which helps to reduce the variance term \nmore drastically than in the case where the estimators are dependent (having been \nexposed to overlapping data sets). \n\n5 CONCLUSIONS \n\nIn this paper we have studied the effect of combining different estimators on the \nperformance of linear regression. In particular we have focused on the case of linear \n\n\f302 \n\nRonny Meir \n\nleast-squares estimation, computing exactly the full bias and variance curves for the \ncase where the input dimension is very large (the so called thermodynamic limit). \nWhile we have focused specifically on the case of linear estimators, it should not be \nhard to extend these results to simple non-linear functions ofthe form f(w T \u00b7x) (see \nsection 2). The case of a combination of more complex estimators (such as multi(cid:173)\nlayered neural networks) is much more demanding, as even the case of a single such \nnetwork is rather difficult to analyzes. \nSeveral positive conclusions we can draw from our study are the following. First, the \ngeneral claim that combining experts is always helpful is clearly fallacious. While \nwe have shown that combining estimators is beneficial in some cases (such as small \nnoisy data sets), this is not the case in general. Second, we have shown that in some \nsituations (specifically unrealizable rules and small sample size) it is advantageous \nto split the data into several non-overlapping subsets. It turns out that in this case \nthe decrease in variance resulting from the independence of the different estimators, \nis larger than the concomitant increase in bias. 
It would be interesting to try to generalize our results to the case where the data is split in a more efficient manner. Third, our results agree with the general notion that when attempting to learn an unrealizable function (whether due to noise or to a mismatch with the target function) the best option is to learn with errors.

Ultimately one would like to have a general theory for combining empirical estimators. Our work has shown that noise and finite sample size are expected to produce non-trivial effects which are impossible to observe when considering only the asymptotic limit.

Acknowledgements

The author thanks Manfred Opper for a very helpful conversation, and the Ollendorff Center of the Electrical Engineering Department at the Technion for financial support.

References

S. Bös, W. Kinzel and M. Opper 1993, The generalization ability of perceptrons with continuous outputs, Phys. Rev. A 47:1384-1391.

S. Geman, E. Bienenstock and R. Doursat 1992, Neural networks and the bias/variance dilemma, Neural Computation 4:1-58.

C.W.J. Granger 1989, Combining forecasts - twenty years later, J. of Forecast. 8:167-173.

M. Opper and W. Kinzel 1994, Statistical mechanics of generalization, in Physics of Neural Networks, J.L. van Hemmen, E. Domany and K. Schulten eds., Springer-Verlag, Berlin.

L.L. Scharf 1991, Statistical Signal Processing: Detection, Estimation and Time Series Analysis, Addison-Wesley, Massachusetts.
", "award": [], "sourceid": 996, "authors": [{"given_name": "Ronny", "family_name": "Meir", "institution": null}]}