{"title": "Combining Classifiers Using Correspondence Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 591, "page_last": 597, "abstract": "", "full_text": "Combining Classifiers Using \n\nCorrespondence Analysis \n\nChristopher J. Merz \n\nDept. of Information and Computer Science \n\nUniversity of California, Irvine, CA 92697-3425 U.S.A. \n\ncmerz@ics.uci.edu \n\nCategory: Algorithms and Architectures. \n\nAbstract \n\nSeveral effective methods for improving the performance of a sin(cid:173)\ngle learning algorithm have been developed recently. The general \napproach is to create a set of learned models by repeatedly apply(cid:173)\ning the algorithm to different versions of the training data, and \nthen combine the learned models' predictions according to a pre(cid:173)\nscribed voting scheme. Little work has been done in combining the \npredictions of a collection of models generated by many learning \nalgorithms having different representation and/or search strategies. \nThis paper describes a method which uses the strategies of stack(cid:173)\ning and correspondence analysis to model the relationship between \nthe learning examples and the way in which they are classified by \na collection of learned models. A nearest neighbor method is then \napplied within the resulting representation to classify previously \nunseen examples. The new algorithm consistently performs as well \nor better than other combining techniques on a suite of data sets. \n\n1 \n\nIntroduction \n\nCombining the predictions of a set of learned models! to improve classification \nand regression estimates has been an area of much research in machine learn(cid:173)\ning and neural networks [Wolpert, 1992, Merz and Pazzani, 1997, Perrone, 1994, \nBreiman, 1996, Meir, 1995]. The challenge of this problem is to decide which models \nto rely on for prediction and how much weight to give each. 
The goal of combining learned models is to obtain a more accurate prediction than can be obtained from any single source alone. \n\n¹ A learned model may be anything from a decision/regression tree to a neural network. \n\nRecently, several effective methods have been developed for improving the performance of a single learning algorithm by combining multiple learned models generated using the algorithm. Some examples include bagging [Breiman, 1996], boosting [Freund, 1995], and error-correcting output codes [Kong and Dietterich, 1995]. The general approach is to use a particular learning algorithm and a model generation technique to create a set of learned models and then combine their predictions according to a prescribed voting scheme. The models are typically generated by varying the training data using resampling techniques such as bootstrapping [Efron and Tibshirani, 1993] or data partitioning [Meir, 1995]. Though these methods are effective, they are limited to a single learning algorithm by either their model generation technique or their method of combining. \n\nLittle work has been done in combining the predictions of a collection of models generated by many learning algorithms, each having different representation and/or search strategies. Existing approaches typically place more emphasis on the model generation phase than on the combining phase [Opitz and Shavlik, 1996]. As a result, the combining method is rather limited. The focus of this work is to present a more elaborate combining scheme, called SCANN, capable of handling any set of learned models, and to evaluate it on some real-world data sets. A more detailed analytical and empirical study of the SCANN algorithm is presented in [Merz, 1997]. 
\n\nThis paper describes a combining method applicable to model sets that are homogeneous or heterogeneous in their representation and/or search techniques. Section 2 describes the problem and explains some of the caveats of solving it. The SCANN algorithm (Section 3) uses the strategies of stacking [Wolpert, 1992] and correspondence analysis [Greenacre, 1984] to model the relationship between the learning examples and the way in which they are classified by a collection of learned models. A nearest neighbor method is then applied to the resulting representation to classify previously unseen examples. \n\nIn an empirical evaluation on a suite of data sets (Section 4), the naive approach of taking the plurality vote (PV) frequently exceeds the performance of the constituent learners. SCANN, in turn, matches or exceeds the performance of PV and several other stacking-based approaches. The analysis reveals that SCANN is not sensitive to having many poor constituent learned models, and it is not prone to overfit by reacting to insignificant fluctuations in the predictions of the learned models. \n\n2 Problem Definition and Motivation \n\nThe problem of generating a set of learned models is defined as follows. Suppose two sets of data are given: a learning set L = {(x_i, y_i), i = 1, ..., I} and a test set T = {(x_t, y_t), t = 1, ..., T}. Here x_i is a vector of input values, which are either nominal or numeric, and y_i ∈ {C_1, ..., C_C}, where C is the number of classes. Now suppose L is used to build a set of N functions, F = {f_n(x)}, each element of which approximates f(x), the underlying function. \n\nThe goal here is to combine the predictions of the members of F so as to find the best approximation of f(x). Previous work [Perrone, 1994] has indicated that the ideal conditions for combining occur when the errors of the learned models are uncorrelated. 
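The benefit of uncorrelated errors is easy to see in a small simulation. The sketch below is our illustration, not from the paper: it builds three classifiers whose errors are independent by construction, checks that their pairwise error correlations are near zero, and shows that a plurality vote over them is more accurate than any single one.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)  # true labels for a binary problem

def noisy_model(y, flip_prob, rng):
    """Simulate a learned model by flipping each true label independently."""
    flips = rng.random(y.size) < flip_prob
    return np.where(flips, 1 - y, y)

# Three models whose errors are independent by construction (30% error each).
preds = np.stack([noisy_model(y, 0.3, rng) for _ in range(3)])
errors = (preds != y).astype(float)

# Pairwise error correlations are near zero for independent errors.
corr = np.corrcoef(errors)

# Plurality (here: majority) vote over the three models.
vote = (preds.sum(axis=0) >= 2).astype(int)

acc_individual = 1 - errors.mean(axis=1)
acc_vote = (vote == y).mean()
print(acc_individual, acc_vote)  # the vote beats each single model
```

With three independent models at 70% accuracy, the majority vote is correct whenever at least two models agree with the truth, about 0.7^3 + 3(0.7^2)(0.3) ≈ 78% of the time; as the errors become correlated this advantage shrinks, which is the failure mode PV exhibits on the abalone data set in Section 4.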
The approaches taken thus far attempt to generate learned models which make uncorrelated errors by using the same algorithm and presenting different samples of the training data [Breiman, 1996, Meir, 1995], or by adjusting the search heuristic slightly [Opitz and Shavlik, 1996, Ali and Pazzani, 1996]. \n\nNo single learning algorithm has the right bias for a broad selection of problems. Therefore, another way to achieve diversity in the errors of the learned models generated is to use completely different learning algorithms which vary in their method of search and/or representation. The intuition is that the learned models generated would be more likely to make errors in different ways. Though it is not a requirement of the combining method described in the next section, the group of learning algorithms used to generate F will be heterogeneous in their search and/or representation methods (i.e., neural networks, decision lists, Bayesian classifiers, decision trees with and without pruning, etc.). In spite of efforts to diversify the errors committed, it is still likely that some of the errors will be correlated because the learning algorithms have the same goal of approximating f, and they may use similar search strategies and representations. A robust combining method must take this into consideration. \n\n3 Approach \n\nThe approach taken consists of three major components: Stacking, Correspondence Analysis, and Nearest Neighbor (SCANN). Sections 3.1-3.3 give a detailed description of each component, and Section 3.4 explains how they are integrated to form the SCANN algorithm. \n\n3.1 Stacking \n\nOnce a diverse set of models has been generated, the issue of how to combine them arises. Wolpert [Wolpert, 1992] provided a general framework for doing so called stacked generalization or stacking. 
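Stacking's construction of the level-1 training data, described step by step below, can be sketched in code. This is our toy illustration, not the paper's implementation: the helper `make_level1` and the two trivial "learning algorithms" are hypothetical names invented here.

```python
import numpy as np

def make_level1(X, y, algorithms, V=5):
    """Stacking's level-1 data: for each of V partitions, train every
    algorithm on the other V-1 partitions, then record its predictions
    on the held-out partition, paired with the true labels."""
    folds = np.array_split(np.arange(len(y)), V)
    rows, targets = [], []
    for v in range(V):
        train_idx = np.concatenate([folds[u] for u in range(V) if u != v])
        fitted = [algo(X[train_idx], y[train_idx]) for algo in algorithms]
        preds = np.stack([f(X[folds[v]]) for f in fitted], axis=1)
        rows.append(preds)
        targets.append(y[folds[v]])
    return np.concatenate(rows), np.concatenate(targets)

# Two toy "learning algorithms"; each returns a fitted predictor function.
def nearest_mean(X, y):
    classes = np.unique(y)
    centers = np.stack([X[y == c].mean(axis=0) for c in classes])
    return lambda Q: classes[((Q[:, None, :] - centers) ** 2).sum(-1).argmin(axis=1)]

def majority_class(X, y):
    top = np.bincount(y).argmax()
    return lambda Q: np.full(len(Q), top)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) + np.repeat([[0.0, 0.0], [3.0, 3.0]], 50, axis=0)
y = np.repeat([0, 1], 50)

L1_X, L1_y = make_level1(X, y, [nearest_mean, majority_class])
print(L1_X.shape, L1_y.shape)  # the models' predictions form the new input space
```

Because every prediction in the level-1 data was made on examples held out from training, the combiner fitted to it sees an honest estimate of each model's out-of-sample behavior.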
The goal of stacking is to combine the members of F based on information learned about their particular biases with respect to L0.² \n\n² Henceforth L will be referred to as L0 for clarity. \n\nThe basic premise of stacking is that this problem can be cast as another induction problem where the input space is the (approximated) outputs of the learned models, and the output space is the same as before, i.e., \n\nL1 = {((f̂_1(x_i), ..., f̂_N(x_i)), y_i), i = 1, ..., I}. \n\nThe approximated outputs of each learned model, represented as f̂_n(x_i), are generated using the following in-sample/out-of-sample approach: \n\n1. Divide the L0 data up into V partitions. \n2. For each partition, v: \n   - Train each algorithm on all but partition v to get {f̂_n^(-v)}. \n   - Test each learned model in {f̂_n^(-v)} on partition v. \n   - Pair the predictions on each example in partition v (i.e., the new input space) with the corresponding output, and append the new examples to L1. \n3. Return L1. \n\n3.2 Correspondence Analysis \n\nCorrespondence Analysis (CA) [Greenacre, 1984] is a method for geometrically exploring the relationship between the rows and columns of a matrix whose entries are categorical. The goal here is to explore the relationship between the training examples and how they are classified by the learned models. \n\nTable 1: Correspondence Analysis calculations. \n\nStage 1 \n  N : (I x J) indicator matrix -- records the votes of the learned models. \n  n = sum_{i=1}^{I} sum_{j=1}^{J} n_ij : grand total of table N. \n  r_i = n_{i+}/n : row masses. \n  c_j = n_{+j}/n : column masses. \n  P = (1/n) N : correspondence matrix. \n  D_c : (J x J) diagonal matrix with the masses c on the diagonal. \n  D_r : (I x I) diagonal matrix with the masses r on the diagonal. \n  A = D_r^{-1/2} (P - r c^T) D_c^{-1/2} : standardized residuals. \nStage 2 \n  A = U Gamma V^T : SVD of A. \nStage 3 \n  F = D_r^{-1/2} U Gamma : principal coordinates of rows. \n  G = D_c^{-1/2} V Gamma : principal coordinates of columns. 
To do this, the prediction matrix, M, is explored, where m_{in} = f̂_n(x_i) (1 ≤ i ≤ I and 1 ≤ n ≤ N). It is also important to see how the predictions for the training examples relate to their true class labels, so the class labels are appended to form M', an (I x J) matrix (where J = N + 1). For proper application of correspondence analysis, M' must be converted to an (I x (J * C)) indicator matrix, N, where n_{i,(j*C+c)} is one exactly when m_{ij} = C_c, and zero otherwise. \n\nThe calculations of CA may be broken down into three stages (see Table 1). Stage one consists of some preprocessing calculations performed on N which lead to the standardized residual matrix, A. In the second stage, a singular value decomposition (SVD) is performed on A to redefine it in terms of three matrices: U (I x K), Gamma (K x K), and V (J x K), where K = min(I - 1, J - 1). These matrices are used in the third stage to determine F (I x K) and G (J x K), the coordinates of the rows and columns of N, respectively, in the new space. It should be noted that not all K dimensions are necessary; Section 3.4 describes how the final number of dimensions, K*, is determined. \n\nIntuitively, in the new geometric representation, two rows f_p* and f_q* will lie close to one another when examples p and q receive similar predictions from the collection of learned models. Likewise, rows g_r* and g_s* will lie close to one another when the learned models corresponding to r and s make similar predictions for the set of examples. Finally, each column, r, has a learned model, f', and a class label, c', with which it is associated; f_p* will lie closer to g_r* when model f' predicts class c'. \n\n3.3 Nearest Neighbor \n\nThe nearest neighbor algorithm is used to classify points in a weighted Euclidean space. In this scenario, each possible class will be assigned coordinates in the space derived by correspondence analysis. 
Unclassified examples will be mapped into the new space (as described below), and the class label corresponding to the closest class point is assigned to the example. \n\nSince the actual class assignments for each example reside in the last C columns of N, their coordinates in the new space can be found by looking in the last C rows of G. For convenience, these class points will be called Class_1, ..., Class_C. \n\nTable 2: Experimental results. \n\nData set      PV error  SCANN vs PV  S-BP vs PV  S-Bayes vs PV  Best Ind. vs PV \nabalone       80.35     .490         .499        .487           .535 (NN) \nbal           13.81     .900         .859        .992           .911 (BP) \nbreast         4.31     .886         .920        .881           .938 (BP) \ncredit        13.99     .999         1.012       1.001          1.054 (BP) \ndementia      32.78     .989         1.037       .932           1.048 (C4.5) \nglass         31.44     1.158        1.215       1.079          1.155 (OC1) \nheart         18.17     1.008        .998        1.007          .962 (BP) \nionosphere     3.05     .964         .691        .990           2.175 (C4.5) \niris           4.44     1.289        1.299       .908           1.150 (OC1) \nkrk            6.67     1.017        1.467       .893           1.159 (NN) \nliver         29.33     1.033        1.080       .903           1.138 (CN2) \nlymphography  17.78     1.030        1.149       1.008          .983 (PEBLS) \nmusk          13.51     1.024        1.077       1.008          1.113 (PEBLS) \nretardation   32.64     1.035        1.162       1.109          .936 (Bayes) \nsonar         23.02     1.100        1.017       1.007          1.048 (BP) \nvote           5.24     .889         .812        1.103          .927 (C4.5) \nwave          21.94     .835         .970        1.000          1.200 (PEBLS) \nwdbc           4.27     .990         .960        .972           1.164 (NN) \n\nTo classify an unseen example, x_Test, the predictions of the learned models on x_Test must be converted to a row profile, r̃^T, of length J * C, where r̃_{(j*C+c)} is 1/J exactly when m_j = C_c, and zero otherwise. However, since the example is unclassified, its prediction vector is of length (J - 1) and can only be used to fill the first ((J - 1) * C) entries in r̃^T. For this reason, C different versions are generated, i.e., r̃_1, ...
, r̃_C, where each one "hypothesizes" that x_Test belongs to one of the C classes (by putting 1/J in the appropriate column). Locating these profiles in the scaled space is a matter of simple matrix multiplication, i.e., f̃_c^T = r̃_c^T G Gamma^{-1}. The f̃_c which lies closest to a class point, say Class_{c'}, is considered the "correct" hypothesized class, and x_Test is assigned the class label c'. \n\n3.4 The SCANN Algorithm \n\nNow that the three main parts of the approach have been described, a summary of the SCANN algorithm can be given as a function of L0 and the constituent learning algorithms, A. The first step is to use L0 and A to generate the stacking data, L1, capturing the approximated predictions of each learned model. Next, L1 is used to form the indicator matrix, N. A correspondence analysis is performed on N to derive the scaled space, A = U Gamma V^T. The number of dimensions retained from this new representation, K*, is the value which optimizes classification on L1. The resulting scaled space is used to derive the row/column coordinates F and G, thus geometrically capturing the relationships between the examples, the way in which they are classified, and their position relative to the true class labels. Finally, the nearest neighbor strategy exploits the new representation by predicting which class is most likely according to the predictions made on a novel example. \n\n4 Experimental Results \n\nThe constituent learning algorithms, A, spanned a variety of search and/or representation techniques: backpropagation (BP) [Rumelhart et al., 1986], CN2 [Clark and Niblett, 1989], C4.5 [Quinlan, 1993], OC1 [Salzberg and Beigel, 1993], PEBLS [Cost, 1993], nearest neighbor (NN), and naive Bayes. Depending on the data set, anywhere from five to eight instantiations of algorithms were applied. The combining strategies evaluated were PV, SCANN, and two other learners trained on L1: S-BP and S-Bayes. 
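To make the correspondence-analysis stage concrete, here is a sketch of the Table 1 calculations on a tiny hand-built indicator matrix, followed by the nearest-class-point assignment of Section 3.3. This is our own simplified illustration, not the author's implementation; it omits the K* dimension selection and the C hypothesized test-time profiles.

```python
import numpy as np

def correspondence_analysis(N):
    """Stages 1-3 of Table 1: standardized residuals, SVD, and the
    principal coordinates F (rows) and G (columns) of indicator matrix N."""
    n = N.sum()
    P = N / n                          # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    Dr_isqrt = np.diag(1 / np.sqrt(r))
    Dc_isqrt = np.diag(1 / np.sqrt(c))
    A = Dr_isqrt @ (P - np.outer(r, c)) @ Dc_isqrt  # standardized residuals
    U, gamma, Vt = np.linalg.svd(A, full_matrices=False)
    F = Dr_isqrt @ U * gamma           # principal coordinates of rows
    G = Dc_isqrt @ Vt.T * gamma        # principal coordinates of columns
    return F, G, gamma

# Toy indicator matrix: 4 examples, one "model" plus the true class, C = 2,
# so J*C = 4 columns; a one marks "column block j takes class c".
N = np.array([
    [1, 0, 1, 0],   # model predicts class 0, true class 0
    [1, 0, 1, 0],
    [0, 1, 0, 1],   # model predicts class 1, true class 1
    [0, 1, 0, 1],
], dtype=float)

F, G, gamma = correspondence_analysis(N)

# The last C rows of G are the class points Class_1..Class_C; each example's
# row point should fall nearest the class point of its true label.
class_points = G[-2:]
d = ((F[:, None, :] - class_points[None, :, :]) ** 2).sum(-1)
print(d.argmin(axis=1))
```

With a perfectly informative model, each example's row point coincides with its true class point, so the nearest-class-point rule recovers the labels exactly; with noisier models the row points spread out and the rule becomes a genuine nearest neighbor decision.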
\n\nThe data sets used were taken from the UCI Machine Learning Database Repository [Merz and Murphy, 1996], except for the unreleased medical data sets: retardation and dementia. Thirty runs per data set were conducted using a training/test partition of 70/30 percent. The results are reported in Table 2. The first column gives the mean error rate over the 30 runs of the baseline combiner, PV. The next three columns ("SCANN vs PV", "S-BP vs PV", and "S-Bayes vs PV") report the ratio of each combining strategy's error rate to the error rate of PV. The column labeled "Best Ind. vs PV" reports the ratio with respect to the model with the best average error rate, with the winning algorithm indicated alongside each entry. A value less than 1 in the "a vs b" columns represents an improvement by method a over method b. Ratios reported in boldface in the original table indicate that the difference between method a and method b is significant at a level better than 1 percent using a two-tailed sign test. \n\nOver the 18 data sets, SCANN holds a statistically significant advantage on 7 sets, improving upon PV's classification error by 3-50 percent. Unlike the other combiners, SCANN posts no statistically significant losses to PV (there were 4 such losses each for S-BP and S-Bayes). With the exception of the retardation data set, SCANN consistently performs as well as or better than the best individual learned model. In the direct comparison of SCANN with S-BP and S-Bayes, SCANN posts 5 and 4 significant wins, respectively, and no losses. \n\nThe most dramatic improvement of the combiners over PV came in the abalone data set. A closer look at the results revealed that 7 of the 8 learned models were very poor classifiers with error rates around 80 percent, and the errors of the poor models were highly correlated. 
This empirically demonstrates PV's known sensitivity to learned models with highly correlated errors. On the other hand, PV performs well on the glass and wave data sets, where the errors of the learned models are measured to be fairly uncorrelated. Here, SCANN performs similarly to PV, but S-BP and S-Bayes appear to overfit by making erroneous predictions based on insignificant variations in the predictions of the learned models. \n\n5 Conclusion \n\nA novel method has been introduced for combining the predictions of heterogeneous or homogeneous classifiers. It draws upon the methods of stacking, correspondence analysis, and nearest neighbor. In an empirical analysis, the method proves to be insensitive to poor learned models and matches the performance of plurality voting as the errors of the learned models become less correlated. \n\nReferences \n\n[Ali and Pazzani, 1996] Ali, K. and Pazzani, M. (1996). Error reduction through learning multiple descriptions. Machine Learning, 24(3):173-202. \n\n[Breiman, 1996] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123-140. \n\n[Clark and Niblett, 1989] Clark, P. and Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3(4):261-283. \n\n[Cost, 1993] Cost, S. and Salzberg, S. (1993). A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10(1):57-78. \n\n[Efron and Tibshirani, 1993] Efron, B. and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman and Hall, London and New York. \n\n[Freund, 1995] Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285. Also appeared in COLT '90. \n\n[Greenacre, 1984] Greenacre, M. J. (1984). Theory and Applications of Correspondence Analysis. Academic Press, London. \n\n[Kong and Dietterich, 1995] Kong, E. B. and Dietterich, T. G. (1995). 
Error-correcting output coding corrects bias and variance. In Proceedings of the 12th International Conference on Machine Learning, pages 313-321. Morgan Kaufmann. \n\n[Meir, 1995] Meir, R. (1995). Bias, variance and the combination of least squares estimators. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems, volume 7, pages 295-302. The MIT Press. \n\n[Merz, 1997] Merz, C. (1997). Using correspondence analysis to combine classifiers. Submitted to Machine Learning. \n\n[Merz and Murphy, 1996] Merz, C. and Murphy, P. (1996). UCI repository of machine learning databases. \n\n[Merz and Pazzani, 1997] Merz, C. J. and Pazzani, M. J. (1997). Combining neural network regression estimates with regularized linear weights. In Mozer, M., Jordan, M., and Petsche, T., editors, Advances in Neural Information Processing Systems, volume 9. The MIT Press. \n\n[Opitz and Shavlik, 1996] Opitz, D. W. and Shavlik, J. W. (1996). Generating accurate and diverse members of a neural-network ensemble. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems, volume 8, pages 535-541. The MIT Press. \n\n[Perrone, 1994] Perrone, M. P. (1994). Putting it all together: Methods for combining neural networks. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems, volume 6, pages 1188-1189. Morgan Kaufmann Publishers, Inc. \n\n[Quinlan, 1993] Quinlan, R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA. \n\n[Rumelhart et al., 1986] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E., McClelland, J. 
L., and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press. \n\n[Salzberg and Beigel, 1993] Murthy, S. K., Kasif, S., Salzberg, S., and Beigel, R. (1993). OC1: Randomized induction of oblique decision trees. In Proceedings of AAAI-93. AAAI Press. \n\n[Wolpert, 1992] Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5:241-259. \n", "award": [], "sourceid": 1394, "authors": [{"given_name": "Christopher", "family_name": "Merz", "institution": null}]}