{"title": "Combining Neural Network Regression Estimates with Regularized Linear Weights", "book": "Advances in Neural Information Processing Systems", "page_first": 564, "page_last": 570, "abstract": "", "full_text": "Combining Neural Network Regression \n\nEstimate1s with Regularized Linear \n\nWeights \n\nChristopher J. Merz and Michael J. Pazzani \n\nDept. of Information and Computer Science \n\nUniversity of California, Irvine, CA 92717-3425 U.S.A. \n\n{ cmerz,pazzani }@ics.uci.edu \n\nCategory: Algorithms and Architectures. \n\nAbstract \n\nWhen combining a set of learned models to form an improved es(cid:173)\ntimator, the issue of redundancy or multicollinearity in the set of \nmodels must be addressed. A progression of existing approaches \nand their limitations with respect to the redundancy is discussed. \nA new approach, PCR *, based on principal components regres(cid:173)\nsion is proposed to address these limitations. An evaluation of the \nnew approach on a collection of domains reveals that: 1) PCR* \nwas the most robust combination method as the redundancy of the \nlearned models increased, 2) redundancy could be handled without \neliminating any of the learned models, and 3) the principal compo(cid:173)\nnents of the learned models provided a continuum of \"regularized\" \nweights from which PCR * could choose. \n\n1 \n\nINTRODUCTION \n\nto improve classification and regres(cid:173)\n\nCombining a set of learned modelsl \nlearning \nsion estimates has been an area of much research \n[Wolpert, 1992, Merz, 1995, Perrone and Cooper, 1992, \nand neural networks \nLeblanc and Tibshirani, 1993, \nMeir, 1995, \nKrogh and Vedelsby, 1995, Tresp, 1995, Chan and Stolfo, 1995]. The challenge of \nthis problem is to decide which models to rely on for prediction and how much \nweight to give each. \n\nBreiman, 1992, \n\nin machine \n\n1 A learned model may be anything from a decision/regression tree to a neural network. 
\n\n\fCombining Neural Network Regression Estimates \n\n565 \n\nThe goal of combining learned models is to obtain a more accurate prediction than \ncan be obtained from any single source alone. One major issue in combining a set \nof learned models is redundancy. Redundancy refers to the amount of agreement or \nlinear dependence between models when making a set of predictions. The more the \nset agrees, the more redundancy is present. In statistical terms, this is referred to \nas the multicollinearity problem. \nThe focus of this paper is to explore and evaluate the properties of existing meth(cid:173)\nods for combining regression estimates (Section 2), and to motivate the need for \nmore advanced methods which deal with multicollinearity in the set of learned mod(cid:173)\nels (Section 3). In particular, a method based on principal components regression \n(PCR, [Draper and Smith, 1981]) is described, and is evaluated emperically demon(cid:173)\nstrating the it is a robust and efficient method for finding a set of combining weights \nwith low prediction error (Section 4). Finally, Section 5 draws some conclusions. \n\n2 MOTIVATION \n\nThe problem of combining a set of learned models is defined using the terminology \nof [Perrone and Cooper, 1992]. Suppose two sets of data are given: a training set \n'DTrain = (xm, Ym) and a test set 'DTelt = (Xl, Yl). Now suppose 'DTrain is used to \nbuild a set of functions, :F = fi(X), each element of which approximates f(x). The \ngoal is to find the best approximation of f(x) using :F. \nTo date, most approaches to this problem limit the space of approximations of f( x) \nto linear combinations of the elements of :F, i.e., \nj(x) = L Cidi(X) \n\nN \n\ni=l \n\nwhere Cij is the coefficient or weight of fj(x). \nThe focus of this paper is to evaluate and address the limitations of these ap(cid:173)\nproaches. 
To do so, a brief summary of these approaches is now provided, progressing from simpler to more complex methods and pointing out their limitations along the way.

The simplest method for combining the members of F is the unweighted average (i.e., α_i = 1/N). Perrone and Cooper refer to this as the Basic Ensemble Method (BEM), written as

    f_BEM(x) = (1/N) Σ_{i=1}^{N} f_i(x).

This equation can also be written in terms of the misfit function for each f_i(x). These functions describe the deviations of the elements of F from the true solution and are written as

    m_i(x) = f(x) - f_i(x).

Thus,

    f_BEM(x) = f(x) - (1/N) Σ_{i=1}^{N} m_i(x).

Perrone and Cooper show that as long as the m_i(x) are mutually independent with zero mean, the error in estimating f(x) can be made arbitrarily small by increasing the population size of F. Since these assumptions break down in practice, they developed a more general approach which finds the "optimal"^2 weights while allowing the m_i(x)'s to be correlated and have non-zero means. This Generalized Ensemble Method (GEM) is written as

    f_GEM(x) = Σ_{i=1}^{N} α_i f_i(x) = f(x) - Σ_{i=1}^{N} α_i m_i(x)

where

    α_i = (Σ_j (C^{-1})_ij) / (Σ_k Σ_j (C^{-1})_kj),

C is the symmetric sample covariance matrix for the misfit functions, and the goal is to minimize Σ_{i,j} α_i α_j C_ij. Note that the misfit functions are calculated on the training data and f(x) is not required. The main disadvantage of this approach is that it involves taking the inverse of C, which can be "unstable". That is, redundancy in the members of F leads to linear dependence in the rows and columns of C, which in turn leads to unreliable estimates of C^{-1}.

To circumvent this sensitivity to redundancy, Perrone and Cooper propose a method for discarding member(s) of F when the strength of agreement with another member exceeds a certain threshold.
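As a concrete illustration (this is not code from the paper), the BEM and GEM weight computations above can be sketched in a few lines of numpy; the use of the uncentered sample moment matrix for C is an assumption of this sketch:

```python
import numpy as np

def bem_weights(n_models):
    """Basic Ensemble Method: uniform weights, alpha_i = 1/N."""
    return np.full(n_models, 1.0 / n_models)

def gem_weights(preds, y):
    """Generalized Ensemble Method (sketch).

    Minimizes sum_ij alpha_i alpha_j C_ij subject to sum_i alpha_i = 1,
    where C_ij is the sample moment of the misfits m_i(x) = f(x) - f_i(x),
    giving the closed form alpha = C^-1 1 / (1^T C^-1 1).
    preds is an (n_samples, n_models) matrix of model predictions.
    """
    misfits = y[:, None] - preds           # m_i(x) for each model, columnwise
    C = misfits.T @ misfits / len(y)       # N x N misfit moment matrix
    ones = np.ones(C.shape[0])
    c_inv_one = np.linalg.solve(C, ones)   # ill-conditioned if models are redundant
    return c_inv_one / (ones @ c_inv_one)  # normalize so the weights sum to 1
```

The `np.linalg.solve` call fails, or returns wildly varying weights, precisely when the models are highly redundant; the discarding heuristic just described is one response to that instability.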
Unfortunately, this approach only checks for linear dependence (or redundancy) between pairs of members, f_i(x) and f_j(x) for i ≠ j. In fact, f_i(x) could be a linear combination of several other members of F and the instability problem would still be manifest. Also, depending on how high the threshold is set, a member of F could be discarded while still having some degree of uniqueness and utility. An ideal method for weighting the members of F would neither discard any models nor suffer when there is redundancy in the model set.

The next approach reviewed is linear regression (LR)^3, which also finds the "optimal" weights for the f_i(x) with respect to the training data. In fact, GEM and LR are both considered "optimal" because they are closely related: GEM is a form of linear regression with the added constraint that Σ_{i=1}^{N} α_i = 1. The weights for LR are found as follows^4:

    f_LR(x) = Σ_{i=1}^{N} α_i f_i(x)

where α = (F^T F)^{-1} F^T y and F is the matrix of the models' predictions on the training data (F_mi = f_i(x_m)).

Like GEM, LR and LRC are subject to the multicollinearity problem because finding the α_i's involves taking the inverse of a matrix. That is, if the F matrix is composed of f_i(x) which strongly agree with other members of F, some linear dependence will be present.

^2 Optimal here refers to weights which minimize mean squared error for the training data.
^3 Actually, it is a form of linear regression without the intercept term. The more general form, denoted by LRC, would be formulated the same way but with an additional member, f_0, which always predicts 1. According to [Leblanc and Tibshirani, 1993], the extra constant term will not be necessary (i.e., it will equal zero) because in practice E[f_i(x)] = E[f(x)].
^4 Note that the constraint Σ_{i=1}^{N} α_i = 1 for GEM is a form of regularization [Leblanc and Tibshirani, 1993]. The purpose of regularizing the weights is to provide an estimate which is less biased by the training sample. Thus, one would not expect GEM and LR to produce identical weights.
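The LR combination can likewise be sketched as an ordinary least-squares fit. In this sketch `np.linalg.lstsq` is used instead of forming (F^T F)^{-1} explicitly, since its SVD-based solver degrades more gracefully under the collinearity just described; that substitution is ours, not the paper's:

```python
import numpy as np

def lr_weights(preds, y, intercept=False):
    """Unconstrained linear-regression combination (LR) -- illustrative sketch.

    With intercept=True this becomes the LRC variant, equivalent to adding
    an extra member f_0 that always predicts 1.
    preds is an (n_samples, n_models) matrix of model predictions.
    """
    F = preds if not intercept else np.column_stack([np.ones(len(y)), preds])
    alpha, *_ = np.linalg.lstsq(F, y, rcond=None)  # minimizes ||F alpha - y||^2
    return alpha
```

When the target really is an exact linear blend of the models, the fitted weights recover that blend; with highly redundant models the weights remain finite but can vary wildly from sample to sample, which is the variance-inflation problem discussed in the next section.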
\n\n\fCombining Neural Network Regression Estimates \n\n567 \n\nGiven the limitations of these methods, the goal of this research was to find a method \nwhich finds weights for the learned models with low prediction error without discard(cid:173)\ning any of the original models, and without being subject to the multicollinearity \nproblem. \n\n3 METHODS FOR HANDLING MULTICOLLINEARITY \n\nIn the abovementioned methods, multicollinearity leads to inflation of the vari(cid:173)\nance of the estimated weights, Ck. Consequently, the weights obtained from fit(cid:173)\nting the model to a particular sample may be far from their true values. To \ncircumvent this problem, approaches have been developed which: 1) constrain \nthe estimated regression coefficients so as to improve prediction performance (Le., \nridge regression, RIDG E [Montgomery and Friedman 1993], and principal compo(cid:173)\nnents regression), 2) search for the coefficients via gradient descent procedures (i.e., \nWidrow-Hofflearning, GD and EG+- [Kivinen and Warmuth, 1994]), or build mod(cid:173)\nels which make decorrelated errors by adjusting the bias of the learning algorithm \n[Opitz and Shavlik, 1995] or the data which it sees [Meir, 1995]. The third approach \nameliorates, but does not solve, the problem because redundancy is an inherent part \nof the task of combining estimators. \n\nThe focus of this paper is on the first approach. \n[Leblanc and Tibshirani, 1993] have proposed several ways of constraining or regu(cid:173)\nlarizing the weights to help produce estimators with lower prediction error: \n\nLeblanc and Tibshirani \n\n1. Shrink a towards (1/ K, 1/ K, ... ,1/ K)T where K is the number of learned \n\nmodels. \n\n2. 2:~1 Ckj = 1 \n3. Ckj ~ O,i = 1,2 ... K \n\nBreiman [Breiman, 1992] provides an intuitive justification for these constraints by \npointing out that the more strongly they are satisfied, the more interpolative the \nweighting scheme is. 
In the extreme case, a uniformly weighted set of learned models is likely to produce a prediction between the maximum and minimum predicted values of the learned models. Without these constraints, there is no guarantee that the resulting predictor will stay near that range and generalization may be poor. The next subsection describes a variant of principal components regression and explains how it provides a continuum of regularized weights for the original learned models.

3.1 PRINCIPAL COMPONENTS REGRESSION

When dealing with the above-mentioned multicollinearity problem, principal components regression [Draper and Smith, 1981] may be used to summarize and extract the "relevant" information from the learned models. The main idea of PCR is to map the original learned models to a set of (independent) principal components, in which each component is a linear combination of the original learned models, and then to build a regression equation using the best subset of the principal components to predict f(x).

The advantage of this representation is that the components are sorted according to how much of the information (or variance) in the original learned models they account for. Given this representation, the goal is to choose the number of principal components to include in the final regression, retaining the first k which meet a preselected stopping criterion. The basic approach is summarized as follows:

1. Do a principal components analysis (PCA) on the covariance matrix of the learned models' predictions on the training data (i.e., do a PCA on the covariance matrix of M, where M_ij is the j-th model's response for the i-th training example) to produce a set of principal components, PC = {PC_1, ..., PC_N}.

2. Use a stopping criterion to decide on k, the number of principal components to use.
3. Do a least squares regression on the selected components (i.e., include PC_i for i ≤ k).

4. Derive the weights, α^_i, for the original learned models by expanding

    f_PCR*(x) = β_1 PC_1 + ... + β_k PC_k

according to

    PC_i = γ_{i,1} f_1(x) + ... + γ_{i,N} f_N(x),

and collecting the coefficients of the f_i(x). Note that γ_{i,j} is the j-th coefficient of the i-th principal component.

The second step is very important because choosing too few or too many principal components may result in underfitting or overfitting, respectively. Ten-fold cross-validation is used to select k here.

Examining the spectrum of N weight sets derived in step four reveals that PCR* provides a continuum of weight sets spanning from highly constrained (i.e., weights generated from PCR_1 satisfy all three regularization constraints) to completely unconstrained (i.e., PCR_N is equivalent to unconstrained linear regression). To see that the weights, α^, derived from PCR_1 are (nearly) uniform, recall that the first principal component accounts for where the learned models agree. Because the learned models are all fairly accurate, they agree quite often, so their first principal component weights, γ_{1,*}, will be similar. The γ-weights are in turn multiplied by a constant when PCR_1 is regressed upon. Thus, the resulting α^_i's will be fairly uniform. The later principal components serve as refinements to those already included, producing less constrained weight sets, until finally PCR_N is included, resulting in an unconstrained estimator much like LR, LRC and GEM.

4 EXPERIMENTAL RESULTS

The set of learned models, F, was generated using Backpropagation [Rumelhart, 1986]. For each dataset, a network topology was developed which gave good performance. The collection of networks built differed only in their initial weights^5.
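The four steps above can be sketched as follows. This is a simplified reading, not the paper's code: k is passed in directly rather than chosen by ten-fold cross-validation, and the centering of the predictions before the PCA step is our assumption:

```python
import numpy as np

def pcr_star_weights(preds, y, k):
    """PCR* sketch: regress y on the first k principal components of the
    models' predictions, then expand back to weights on the original models.

    preds: (n_samples, n_models) matrix M of model predictions.
    Returns alpha, the combining weights for the original models.
    """
    X = preds - preds.mean(axis=0)              # center predictions for the PCA
    C = np.cov(X, rowvar=False)                 # covariance of models' predictions
    eigvals, V = np.linalg.eigh(C)              # eigh returns ascending order
    V = V[:, np.argsort(eigvals)[::-1]][:, :k]  # gamma_{i,j}: top-k components
    scores = X @ V                              # PC_i evaluated on training data
    beta, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)  # step 3
    return V @ beta                             # step 4: alpha = sum_i beta_i gamma_i
```

With k = 1 the returned weights are nearly uniform when the models largely agree, and with k = N they coincide with the unconstrained least-squares weights, matching the continuum described above.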
Three data sets were chosen: cpu and housing (from the UCI repository), and bodyfat (from the Statistics Library at Carnegie Mellon University). Due to space limitations, the data sets reported on were chosen because they were representative of the basic trends found in a larger collection of datasets. The combining methods evaluated consist of all the methods discussed in Sections 2 and 3, as well as PCR_1 and PCR_N (to demonstrate PCR*'s most and least regularized weight sets,

^5 There was no extreme effort to produce networks with more decorrelated errors. Even with such networks, the issue of extreme multicollinearity would still exist because E[f_i(x)] = E[f_j(x)] for all i and j.

Table 1: Results (average residual error; column headings give the size of F)

Method   bodyfat-10  bodyfat-50  cpu-10  cpu-50   housing-10  housing-50
BEM      1.03        1.04        38.57   38.62    2.77        2.79
GEM      1.02        0.86        46.59   227.54   2.57        2.72
LR       1.02        3.09        44.9    238.0    2.72        6.44
RIDGE    1.02        0.826       44.8    191.0    2.55        2.72
GD       1.03        1.03        38.9    38.8     2.77        2.79
EG±      1.04        1.04        38.4    38.0     2.77        2.75
PCR_1    1.07        1.05        39.0    39.0     2.78        2.76
PCR_N    1.02        0.848       44.8    249.9    2.72        2.57
PCR*     0.99        0.786       40.3    40.8     2.56        2.70

respectively). The more computationally intensive procedures based on stacking and bootstrapping proposed by [Leblanc and Tibshirani, 1993, Breiman, 1992] were not evaluated here because they require many more models (i.e., neural networks) to be generated for each of the elements of F.

There were 20 trials run for each of the datasets. On each trial the data was randomly divided into 70% training data and 30% test data. These trials were rerun for varying sizes of F (10 and 50, respectively). As more models are included, the linear dependence amongst them goes up, showing how well the multicollinearity problem is handled^6.
Table 1 shows the average residual errors for each of the methods on the three data sets. Each row is a particular method and each column is the size of F for a given data set. Bold-faced entries in the original table indicate methods which were not significantly different from the method with the lowest error (via two-tailed paired t-tests with p ≤ 0.05).

PCR* is the only approach which is among the leaders for all three data sets. For the bodyfat and housing data sets the weights produced by BEM, PCR_1, GD, and EG± tended to be too constrained, while the weights for LR tended to be too unconstrained for the larger collection of models. The less constrained weights of GEM, LR, RIDGE, and PCR_N severely harmed performance in the cpu domain, where uniform weighting performed better.

The biggest demonstration of PCR*'s robustness is its ability to gravitate towards the more constrained weights produced by the earlier principal components when appropriate (i.e., in the cpu dataset). Similarly, it uses the less constrained principal components closer to PCR_N when that is preferable, as in the bodyfat and housing domains.

5 CONCLUSION

This investigation suggests that the principal components of a set of learned models can be useful when combining the models to form an improved estimator. It was demonstrated that the principal components provide a continuum of weight sets, from highly regularized to unconstrained. An algorithm, PCR*, was developed which attempts to automatically select the subset of these components which provides the lowest prediction error. Experiments on a collection of domains demonstrated PCR*'s ability to robustly handle redundancy in the set of learned models. Future work will be to improve upon PCR* and expand it to the classification task.
^6 This is verified by observing the eigenvalues of the principal components and values in the covariance matrix of the models in F.

References

[Breiman et al., 1984] Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.

[Breiman, 1992] Breiman, L. (1992). Stacked Regressions. Dept. of Statistics, Berkeley, TR No. 367.

[Chan and Stolfo, 1995] Chan, P.K., Stolfo, S.J. (1995). A Comparative Evaluation of Voting and Meta-Learning on Partitioned Data. Proceedings of the Twelfth International Machine Learning Conference (90-98). San Mateo, CA: Morgan Kaufmann.

[Draper and Smith, 1981] Draper, N.R., Smith, H. (1981). Applied Regression Analysis. New York, NY: John Wiley and Sons.

[Kivinen and Warmuth, 1994] Kivinen, J., and Warmuth, M. (1994). Exponentiated Gradient Descent Versus Gradient Descent for Linear Predictors. Dept. of Computer Science, UC-Santa Cruz, TR No. ucsc-crl-94-16.

[Krogh and Vedelsby, 1995] Krogh, A., and Vedelsby, J. (1995). Neural Network Ensembles, Cross Validation, and Active Learning. In Advances in Neural Information Processing Systems 7. San Mateo, CA: Morgan Kaufmann.

[Hansen and Salamon, 1990] Hansen, L.K., and Salamon, P. (1990). Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12 (993-1001).

[Leblanc and Tibshirani, 1993] Leblanc, M., Tibshirani, R. (1993). Combining Estimates in Regression and Classification. Dept. of Statistics, University of Toronto, TR.

[Meir, 1995] Meir, R. (1995). Bias, Variance and the Combination of Estimators. In Advances in Neural Information Processing Systems 7. San Mateo, CA: Morgan Kaufmann.

[Merz, 1995] Merz, C.J. (1995). Dynamical Selection of Learning Algorithms. In Fisher, D. and Lenz, H. (Eds.),
Learning from Data: Artificial Intelligence and Statistics, 5. Springer Verlag.

[Montgomery and Friedman, 1993] Montgomery, D.C., and Friedman, D.J. (1993). Prediction Using Regression Models with Multicollinear Predictor Variables. IIE Transactions, vol. 25, no. 3, 73-85.

[Opitz and Shavlik, 1995] Opitz, D.W., Shavlik, J.W. (1996). Generating Accurate and Diverse Members of a Neural-Network Ensemble. In Advances in Neural Information Processing Systems 8. Touretzky, D.S., Mozer, M.C., and Hasselmo, M.E., eds. Cambridge, MA: MIT Press.

[Perrone and Cooper, 1992] Perrone, M.P., Cooper, L.N. (1993). When Networks Disagree: Ensemble Methods for Hybrid Neural Networks. In Neural Networks for Speech and Image Processing, edited by Mammone, R.J. New York: Chapman and Hall.

[Rumelhart, 1986] Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning Internal Representations by Error Propagation. Parallel Distributed Processing, 1, 318-362. Cambridge, MA: MIT Press.

[Tresp, 1995] Tresp, V., Taniguchi, M. (1995). Combining Estimators Using Non-Constant Weighting Functions. In Advances in Neural Information Processing Systems 7. San Mateo, CA: Morgan Kaufmann.

[Wolpert, 1992] Wolpert, D.H. (1992). Stacked Generalization. Neural Networks, 5, 241-259.
", "award": [], "sourceid": 1201, "authors": [{"given_name": "Christopher", "family_name": "Merz", "institution": null}, {"given_name": "Michael", "family_name": "Pazzani", "institution": null}]}