{"title": "Combinations of Weak Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 494, "page_last": 500, "abstract": null, "full_text": "Combinations of Weak Classifiers \n\nChuanyi Ji and Sheng Ma \n\nDepartment of Electrical, Computer and Systems Engineering \n\nRensselaer Polytechnic Institute, Troy, NY 12180 \n\nchuanyi@ecse.rpi.edu, shengm@ecse.rpi.edu \n\nAbstract \n\nTo obtain classification systems with both good generalization performance and efficiency in space and time, we propose a learning method based on combinations of weak classifiers, where the weak classifiers are linear classifiers (perceptrons) which can do a little better than making random guesses. A randomized algorithm is proposed to find the weak classifiers. They are then combined through a majority vote. As demonstrated through systematic experiments, the method developed is able to obtain combinations of weak classifiers with good generalization performance and a fast training time on a variety of test problems and real applications. \n\n1 Introduction \n\nThe problem we will investigate in this work is how to develop a classifier with both good generalization performance and efficiency in space and time in a supervised learning environment. The generalization performance is measured by the probability of classification error of a classifier. A classifier is said to be efficient if its size and the (average) time needed to develop it scale nicely (polynomially) with the dimension of the feature vectors and the other parameters of the training algorithm. \n\nThe method we propose to tackle this problem is based on combinations of weak classifiers [8][6], where the weak classifiers are classifiers which can do a little better than random guessing. 
It has been shown by Schapire and Freund [8][4] that the computational power of weak classifiers is equivalent to that of a well-trained classifier, and an algorithm has been given to boost the performance of weak classifiers. What has not been investigated is the type of weak classifiers that can be used and how to find them. In practice, these ideas have been applied with success in hand-written character recognition to boost the performance of an already well-trained classifier. But the original idea of combining a large number of weak classifiers has not been used in solving real problems. An independent work by Kleinberg [6] suggests that in addition to a good generalization performance, combinations of weak classifiers also provide advantages in computation time, since weak classifiers are computationally easier to obtain than well-trained classifiers. However, since the proposed method is based on an assumption which is difficult to realize, discrepancies have been found between the theory and the experimental results [7]. The recent work by Breiman [1][2] also suggests that combinations of classifiers can be computationally efficient, especially when used to learn large data sets. \n\nThe focus of this work is to investigate the following problems: (1) how to find weak classifiers; (2) what are the performance and efficiency of combinations of weak classifiers; and (3) what are the advantages of using combined weak classifiers compared with other pattern classification methods? \n\nWe will develop a randomized algorithm to obtain weak classifiers. We will then provide simulation results on both synthetic and real problems to show the capabilities and efficiency of combined weak classifiers. The extended version of this work, with some of the theoretical analysis, can be found in [5]. 
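Why a majority vote over many barely-better-than-chance classifiers can be strong at all is illustrated by the following short sketch (ours, not from the paper): if 2L + 1 weak classifiers each predict correctly with probability 1/2 + 1/ν and err independently, the majority-vote error is a binomial tail probability that shrinks as L grows.

```python
from math import comb

def majority_vote_error(p_correct, n_classifiers):
    """Error of a majority vote over n independent classifiers, each of
    which is correct with probability p_correct (n must be odd)."""
    assert n_classifiers % 2 == 1, "use 2L + 1 classifiers so votes cannot tie"
    k_needed = n_classifiers // 2 + 1  # votes needed for a correct majority
    p_correct_majority = sum(
        comb(n_classifiers, k) * p_correct**k * (1 - p_correct)**(n_classifiers - k)
        for k in range(k_needed, n_classifiers + 1)
    )
    return 1.0 - p_correct_majority

# A weakness factor of nu = 25 gives per-classifier accuracy 1/2 + 1/25 = 0.54;
# the binomial tail drives the combined error down as 2L + 1 grows.
p = 0.5 + 1.0 / 25
errors = {n: majority_vote_error(p, n) for n in (1, 101, 1001)}
```

Independence is the idealized part of this calculation; the training-set partitioning of Section 3.1 exists precisely to decorrelate the errors of successive weak classifiers.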
\n\n2 Weak Classifiers \n\nIn the present work, we choose linear classifiers (perceptrons) as weak classifiers. Let 1/2 - 1/ν be the required generalization error of a weak classifier, where ν ≥ 2 is called the weakness factor and is used to characterize the strength of a classifier. The larger the ν, the weaker the weak classifier. A set of weak classifiers is combined through a simple majority vote. \n\n3 Algorithm \n\nOur algorithm for combinations of weak classifiers consists of two steps: (1) generating individual weak classifiers through a simple randomized algorithm; and (2) combining a collection of weak classifiers through a simple majority vote. \n\nThree parameters need to be chosen a priori for the algorithm: a weakness factor ν; a number θ (1/2 ≤ θ < 1) which will be used as a threshold to partition the training set; and the number 2L + 1 of weak classifiers to be generated, where L is a positive integer. \n\n3.1 Partitioning the Training Set \n\nThe method we use to partition a training set is motivated by that given in [4]. Suppose a combined classifier already consists of K (K ≥ 1) weak classifiers. In order to generate a (new) weak classifier, the entire training set of N training samples is partitioned into two subsets: a set of M1 samples which contains all the misclassified samples and a small fraction of the samples correctly classified by the existing combined classifier; and the remaining N - M1 training samples. The M1 samples are called \"cares\", since they will be used to select a new weak classifier, while the rest of the samples are the \"don't-cares\". \n\nThe threshold θ is used to determine which samples should be assigned as cares. For the n-th training sample (1 ≤ n ≤ N), a performance index a(n) is recorded, where a(n) is the fraction of the weak classifiers in the existing combined classifier which classify the n-th sample correctly. 
If a(n) < θ, this sample is assigned to the cares. Otherwise, it is a don't-care. This is done for all N samples. Through partitioning a training set in this way, a newly-generated weak classifier is forced to learn the samples which have not been learned by the existing weak classifiers. In the meantime, a properly-chosen θ can ensure that enough samples are used to obtain each weak classifier. \n\n3.2 Random Sampling \n\nTo achieve a fast training time, we obtain a weak classifier by randomly sampling the space of all possible linear classifiers. \n\nAssume that a feature vector x ∈ R^d is distributed over a compact region D. The direction of the hyperplane characterized by a linear classifier is first generated by randomly selecting the elements of its weight vector from a uniform distribution over (-1, 1)^d. The threshold of the hyperplane is then determined by randomly picking an x ∈ D and letting the hyperplane pass through x. This generates random hyperplanes which pass through the region D and whose directions are distributed randomly over all directions. Such a randomly selected classifier is then tested on all the cares. If it misclassifies a fraction of the cares no larger than 1/2 - 1/ν - ε (ε > 0 and small), the classifier is kept and will be used in the combination. Otherwise, it is discarded. This process is repeated until a weak classifier is obtained. \n\nA newly-generated weak classifier is then combined with the existing ones through a simple majority vote. The entire training set is then tested on the new combined classifier to produce a new set of cares and don't-cares. The whole process is repeated until the total number 2L + 1 of weak classifiers has been generated. The algorithm can be easily extended to multiple classes. Details can be found in [5]. 
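A minimal two-class sketch of the procedure above (our own reconstruction with illustrative parameter values; the function names are not from the paper, and the multi-class extension of [5] is omitted):

```python
import numpy as np

def train_cw(X, y, nu=8.0, theta=0.55, L=30, eps=0.01, rng=None):
    """Sketch of the two-class algorithm of Sections 3.1-3.2.
    X: (N, d) features; y: labels in {-1, +1}.
    Returns weight vectors and thresholds of the 2L + 1 weak classifiers."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    max_err = 0.5 - 1.0 / nu - eps        # acceptance threshold on the cares
    W, T = [], []
    n_correct = np.zeros(N)               # per-sample count of correct weak votes
    for _ in range(2 * L + 1):
        # 3.1 Partition: a(n) = fraction of existing classifiers correct on
        # sample n; samples with a(n) < theta are the "cares".
        a = n_correct / len(W) if W else np.zeros(N)
        cares = a < theta
        if not cares.any():               # everything learned: use all samples
            cares[:] = True
        # 3.2 Random sampling: direction uniform over (-1, 1)^d, hyperplane
        # forced through a randomly picked training point; keep the classifier
        # once its error on the cares is at most 1/2 - 1/nu - eps.  A weak
        # enough nu makes acceptance fast (Table 3 reports 2-7 average tries).
        while True:
            w = rng.uniform(-1.0, 1.0, size=d)
            t = w @ X[rng.integers(N)]
            err = np.mean(vote(w, t, X[cares]) != y[cares])
            if err <= max_err:
                break
            if 1.0 - err <= max_err:      # the flipped labeling is weak too
                w, t = -w, -t
                break
        W.append(w)
        T.append(t)
        n_correct += (vote(w, t, X) == y)
    return np.array(W), np.array(T)

def vote(w, t, X):
    """A single linear weak classifier: which side of the hyperplane."""
    return np.where(X @ w - t >= 0.0, 1, -1)

def predict(W, T, X):
    """Simple majority vote over the 2L + 1 weak classifiers."""
    votes = np.where(X @ W.T - T >= 0.0, 1, -1)
    return np.where(votes.sum(axis=1) >= 0, 1, -1)
```

Acceptance by pure random sampling is what keeps training cheap: no gradient computation is needed, only a handful of tries per weak classifier when ν is large enough.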
\n\n4 Experimental Results \n\nExtensive simulations have been carried out on both synthetic and real problems using our algorithm. One synthetic problem is chosen to test the efficiency of our method. Real applications from standard databases are selected to compare the generalization performance of combinations of weak classifiers (CW) with that of other methods such as K-Nearest-Neighbor classifiers (K-NN)1, artificial neural networks (ANN), combinations of neural networks (CNN), and stochastic discrimination (SD). \n\n4.1 A Synthetic Problem: Two Overlapping Gaussians \n\nTo test the scaling properties of combinations of weak classifiers, a non-linearly separable problem is chosen from a standard database called ELENA2. The problem is a two-class classification problem, where the distributions of samples in both classes are multi-variate Gaussians with the same mean but different variances for each independent variable. There is a considerable amount of overlap between the samples of the two classes, and the problem is non-linearly separable. The average generalization error and the standard deviations are given in Figure 1 for our algorithm, based on 20 runs, and for the other classifiers. The Bayes error is also given to show the theoretical limit. The results show that the performance of k-NN degrades very quickly. The performance of ANN is better than that of k-NN but still deviates more and more from the Bayes error as d gets large. The combination of weak classifiers continues to follow the trend of the Bayes error. \n\n1 The best result over different k is reported. \n2 /pub/neural-nets/ELENA/databases/Benchmarks.ps.Z on ftp.dice.ucl.ac.be \n\nFigure 1: Performance versus the dimension of the feature vectors \n\nAlgorithms                   Card1 (%) Error/σ   Diabetes1 (%) Error/σ   Gene1 (%) Error/σ \nCombined Weak Classifiers    11.3/0.85           22.70/0.70              11.80/0.52 \nk Nearest Neighbor           15.67               25.8                    22.87 \nNeural Networks              13.64/0.85          23.52/0.72              13.47/0.44 \nCombined Neural Networks     13.02/0.33          22.79/0.57              12.08/0.23 \n\nTable 1: Performance on Card1, Diabetes1 and Gene1. σ: standard deviation \n\n4.2 Proben1 Data Sets \n\nThree data sets, Card1, Diabetes1 and Gene1, were selected to test our algorithm from the Proben1 databases, which contain data sets from real applications3. \n\nThe Card1 data set is for a problem of determining whether a customer's credit-card application can be approved, based on information given in 51-dimensional feature vectors; 345 out of 690 examples are used for training and the rest for testing. The Diabetes1 data set is for determining whether diabetes is present based on 8-dimensional input patterns; 384 examples are used for training and the same number of samples for testing. The Gene1 data set is for deciding whether a DNA sequence is from a donor, an acceptor or neither, from 120-dimensional binary feature vectors; 1588 samples out of a total of 3175 were used for training, and the rest for testing. \n\nThe average generalization errors as well as the standard deviations are reported in Table 1. The results from combinations of weak classifiers are based on 25 runs. The results for neural networks and for combinations of well-trained neural networks are from the database. As demonstrated by the results, combinations of weak classifiers have been able to achieve generalization performance comparable to or better than that of combinations of well-trained neural networks. 
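For reference, data resembling the ELENA two-Gaussians benchmark of Section 4.1 can be generated with a sketch like the following; the dimension and per-class variances here are our illustrative assumptions, not the database's exact parameters:

```python
import numpy as np

def overlapping_gaussians(n_per_class, d, var1=1.0, var2=4.0, seed=0):
    """Synthetic stand-in for the ELENA two-Gaussians problem: both classes
    share the same (zero) mean but have different per-dimension variances,
    so the classes overlap heavily and the problem is non-linearly separable."""
    rng = np.random.default_rng(seed)
    X1 = rng.normal(0.0, np.sqrt(var1), size=(n_per_class, d))  # class +1
    X2 = rng.normal(0.0, np.sqrt(var2), size=(n_per_class, d))  # class -1
    X = np.vstack([X1, X2])
    y = np.array([1] * n_per_class + [-1] * n_per_class)
    return X, y

X, y = overlapping_gaussians(1000, d=8)
```

With equal means, the Bayes decision depends only on the squared norm of x, so the optimal boundary is non-linear; no single linear classifier can realize it, which is what makes this problem a natural test for combinations.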
\n\n4.3 Hand-written Digit Recognition \n\nHand-written digit recognition is chosen to test our algorithm, since one of the previously developed methods on combinations of weak classifiers (stochastic discrimination [6]) was applied to this problem. For the purpose of comparison, the same set of data as used in [6] (from the NIST database) is utilized to train and to test our algorithm. The data set contains 10000 digits written by different people. Each digit is represented by 16 by 16 black-and-white pixels. The first 4997 digits are used to form a training set, and the rest are for testing. The performance of our algorithm, k-NN, neural networks, and stochastic discrimination is given in Table 2. The results for our method are based on 5 runs, while the results for the other methods are from [6]. The results show that the performance of our algorithm is slightly worse (by 0.3%) than that of stochastic discrimination, which uses a different method for multi-class classification [6]. \n\n3 Available by anonymous ftp from ftp.ira.uka.de, as /pub/papers/techreports/1994/1994-21.ps.z. \n\nAlgorithms                   (%) Error/σ \nCombined Weak Classifiers    4.23/0.1 \nk Nearest Neighbor           4.84 \nNeural Networks              5.33 \nStochastic Discrimination    3.92 \n\nTable 2: Performance on hand-written digit recognition. \n\nParameters      Gaussians   Card1   Diabetes1   Gene1   Digits \n1/2 + 1/ν       0.54        0.55    0.51        0.51    0.51 \nθ               0.53        0.54    0.51        0.54    0.51 \n2L + 1          20000       4000    1000        1000    2000 \nAverage Tries   2           3       7           4       2 \n\nTable 3: Parameters used in our experiments. \n\n4.4 Effects of the Weakness Factor \n\nExperiments are done to test the effects of ν on the problem of the two 8-dimensional overlapping Gaussians. 
The performance and the average training time (CPU time on a Sun Sparc-10) of combined weak classifiers based on 10 runs are given for different ν's in Figures 2 and 3, respectively. The results indicate that as ν increases, an individual weak classifier is obtained more quickly, but more weak classifiers are needed to achieve a good performance. When a proper ν is chosen, a nice scaling property can be observed in the training time. \n\nA record of the parameters used in all the experiments on real applications is provided in Table 3. The average tries, which are the average number of times needed to sample the classifier space to obtain an acceptable weak classifier, are also given in the table to characterize the training time for these problems. \n\n4.5 Training Time \n\nTo compare the learning time with off-line Back-Propagation (BP), a feedforward two-layer neural network with 10 sigmoidal hidden units is trained by gradient descent to learn the problem of the two 8-dimensional overlapping Gaussians. 2500 training samples are used. The performance versus CPU time4 is plotted for both our algorithm and BP in Figure 4. For our algorithm, 2000 weak classifiers are combined. For BP, 1000 epochs are used. The figure shows that our algorithm is much faster than the BP algorithm. Moreover, when several well-trained neural networks are combined to achieve a better performance, the cost in training time will be even higher. Therefore, compared to combinations of well-trained neural networks, combining weak classifiers is computationally much cheaper. \n\n4 Both algorithms are run on a Sun Sparc-10 workstation. \n\nFigure 2: Performance versus the number of weak classifiers for different ν. \n\nFigure 3: Training time versus the number of weak classifiers for different ν. \n\n5 Discussions \n\nFrom the experimental results, we observe that the performance of combined weak classifiers is comparable to, or even better than, that of combinations of well-trained classifiers, and out-performs individual neural-network classifiers and k-Nearest-Neighbor classifiers. Moreover, whereas the k-nearest-neighbor classifiers suffer from the curse of dimensionality, a nice scaling property in terms of the dimension of the feature vectors has been observed for combined weak classifiers. Another important observation obtained from the experiments is that the weakness factor directly impacts the size of a combined classifier and the training time. Therefore, the choice of the weakness factor is important for obtaining efficient combined weak classifiers. It has been shown in our theoretical analysis on learning an underlying perceptron [5] that ν should be at least as large as O(d ln d) to obtain a polynomial training time, and the price paid to accomplish this is a space complexity which is polynomial in d as well. This cost can be observed in our experimental results as the need for a large number of weak classifiers. \n\nFigure 4: Performance versus CPU time \n\nAcknowledgement \n\nSpecial thanks are due to Tin Kam Ho for providing the NIST data, related references and helpful discussions. 
Support from the National Science Foundation (ECS-9312594 and (CAREER) IRI-9502518) is gratefully acknowledged. \n\nReferences \n\n[1] L. Breiman, \"Bias, Variance and Arcing Classifiers,\" Technical Report TR-460, Department of Statistics, University of California, Berkeley, April 1996. \n\n[2] L. Breiman, \"Pasting Bites Together for Prediction in Large Data Sets and On-Line,\" ftp.stat.berkeley.edu/users/breiman, 1996. \n\n[3] H. Drucker, R. Schapire and P. Simard, \"Improving Performance in Neural Networks Using a Boosting Algorithm,\" Advances in Neural Information Processing Systems, 42-49, 1993. \n\n[4] Y. Freund and R. Schapire, \"A Decision-Theoretic Generalization of On-Line Learning and An Application to Boosting,\" http://www.research.att.com/orgs/ssr/people/yoav or schapire. \n\n[5] C. Ji and S. Ma, \"Combinations of Weak Classifiers,\" IEEE Trans. Neural Networks, Special Issue on Neural Networks and Pattern Recognition, vol. 8, 32-42, Jan. 1997. \n\n[6] E.M. Kleinberg, \"Stochastic Discrimination,\" Annals of Mathematics and Artificial Intelligence, vol. 1, 207-239, 1990. \n\n[7] E.M. Kleinberg and T. Ho, \"Pattern Recognition by Stochastic Modeling,\" Proceedings of the Third International Workshop on Frontiers in Handwriting Recognition, 175-183, Buffalo, May 1993. \n\n[8] R.E. Schapire, \"The Strength of Weak Learnability,\" Machine Learning, vol. 5, 197-227, 1990. \n", "award": [], "sourceid": 1325, "authors": [{"given_name": "Chuanyi", "family_name": "Ji", "institution": null}, {"given_name": "Sheng", "family_name": "Ma", "institution": null}]}