A Comparative Study of the Practical Characteristics of Neural Network and Conventional Pattern Classifiers

Advances in Neural Information Processing Systems, pages 970-976

Kenney Ng
BBN Systems and Technologies
Cambridge, MA 02138

Richard P. Lippmann
Lincoln Laboratory, MIT
Lexington, MA 02173-9108

Abstract

Seven different pattern classifiers were implemented on a serial computer and compared using artificial and speech recognition tasks. Two neural network classifiers (radial basis function and high order polynomial GMDH network) and five conventional classifiers (Gaussian mixture, linear tree, K nearest neighbor, KD-tree, and condensed K nearest neighbor) were evaluated. Classifiers were chosen to be representative of different approaches to pattern classification and to complement and extend those evaluated in a previous study (Lee and Lippmann, 1989). This and the previous study both demonstrate that classification error rates can be equivalent across different classifiers when they are powerful enough to form minimum error decision regions, when they are properly tuned, and when sufficient training data are available. Practical characteristics such as training time, classification time, and memory requirements, however, can differ by orders of magnitude. These results suggest that the selection of a classifier for a particular task should be guided not so much by small differences in error rate as by practical considerations concerning memory usage, computational resources, ease of implementation, and restrictions on training and classification times.

1 INTRODUCTION

Few studies have compared the practical characteristics of adaptive pattern classifiers using real data.
There has frequently been an over-emphasis on back-propagation classifiers and artificial problems, and a focus on classification error rate as the main performance measure. No study has compared the practical trade-offs in training time, classification time, memory requirements, and complexity provided by the many alternative classifiers that have been developed (e.g. see Lippmann, 1989).

The purpose of this study was to better understand and explore the practical characteristics of classifiers not included in a previous study (Lee and Lippmann, 1989; Lee, 1989). Seven different neural network and conventional pattern classifiers were evaluated. These included radial basis function (RBF), high order polynomial GMDH network, Gaussian mixture, linear decision tree, K nearest neighbor (KNN), KD tree, and condensed K nearest neighbor (CKNN) classifiers. All classifiers were implemented on a serial computer (Sun 3/110 workstation with FPA) and tested using a digit recognition task (7 digits, 22 cepstral inputs, 16 talkers, 70 training and 112 testing patterns per talker), a vowel recognition task (10 vowels, 2 formant frequency inputs, 67 talkers, 338 training and 333 testing patterns), and two artificial tasks with two input dimensions that require either a single convex or two disjoint decision regions. The tasks are as in (Lee and Lippmann, 1989), and details of the experiments are described in (Ng, 1990).

2 TUNING EXPERIMENTS

Internal parameters or weights of the classifiers were determined using training data. Global free parameters that provided low error rates were found experimentally, using cross-validation and the training data or by using test data.
Global parameters included an overall basis function width scale factor for the RBF classifier, the order of the nodal polynomials for the GMDH network, and the number of nearest neighbors for the KNN, KD tree, and CKNN classifiers.

Experiments were also performed to match the complexity of each classifier to that of the training data. Many classifiers exhibit a characteristic divergence between training and testing error rates as a function of their complexity. Poor performance results when a classifier is too simple to model the complexity of the training data, and also when it is too complex and "over-fits" the training data. Cross-validation and statistical techniques were used to determine the correct size of the linear tree and GMDH classifiers, where training and test set error rates diverged substantially. An information theoretic measure (Predicted Square Error) was used to limit the complexity of the GMDH classifier. This classifier was allowed to grow by adding layers and widening layers to find the number of layers and the layer width which minimized predicted square error. Nodes in the linear tree were pruned using 10-fold cross-validation and a simple statistical test to determine the minimum size tree that provides good performance. Training and test set error rates did not diverge for the RBF and Gaussian mixture classifiers. Test set performance was thus used to determine the number of Gaussian centers for these classifiers.

A new multi-scale radial basis function classifier was developed. It has multiple radial basis functions centered on each basis function center, with widths that vary over 1 1/2 orders of magnitude. Multi-scale RBF classifiers provided error rates similar to those of more conventional RBF classifiers, but eliminated the need to search for a good value of the global basis function width scale factor.

The CKNN classifier used in this study was also new.
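As a concrete illustration, the multi-scale RBF idea (several basis function widths per center, so no single global width scale must be tuned) might be sketched as follows. The Gaussian basis, the number of scales, the log-spaced width grid, and the least-squares output layer are illustrative assumptions, not implementation details taken from the paper:

```python
import numpy as np

def multiscale_rbf_design(X, centers, n_scales=4, base_width=1.0):
    """Design matrix with n_scales Gaussian widths per center,
    log-spaced over 1.5 orders of magnitude (assumed range)."""
    widths = base_width * np.logspace(0.0, 1.5, n_scales)
    cols = []
    for c in centers:
        d2 = np.sum((X - c) ** 2, axis=1)      # squared distance to center
        for w in widths:
            cols.append(np.exp(-d2 / (2.0 * w ** 2)))
    return np.column_stack(cols)

def fit_multiscale_rbf(X, y, centers, n_classes, n_scales=4, base_width=1.0):
    """Fit linear output weights onto one-hot class targets by least squares."""
    Phi = multiscale_rbf_design(X, centers, n_scales, base_width)
    T = np.eye(n_classes)[y]                   # one-hot targets
    W, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    return W

def predict_multiscale_rbf(X, centers, W, n_scales=4, base_width=1.0):
    """Classify by the largest output among the class discriminants."""
    Phi = multiscale_rbf_design(X, centers, n_scales, base_width)
    return np.argmax(Phi @ W, axis=1)
```

Because every center contributes basis functions at several widths, the least-squares fit can weight whichever scale suits the local data density, which is one plausible reading of why the global width search becomes unnecessary.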
The new CKNN classifier was developed to reduce memory requirements and dependency on training data order. In the more conventional CKNN classifier, training patterns are presented sequentially and classified using a KNN rule. Patterns are stored as exemplars only if they are classified incorrectly. In the new CKNN classifier, this conventional CKNN training procedure is repeated N times with different orderings of the training patterns. All exemplar patterns stored under any ordering are combined into a new reduced set of training patterns, which is further pruned by using it as training data for a final pass of conventional CKNN training. This approach typically required less memory than a KNN or a conventional CKNN classifier. Other experiments described in (Chang and Lippmann, 1990) demonstrate how genetic search algorithms can further reduce KNN classifier memory requirements.

Figure 1: Decision Regions Created by (A) RBF and (B) GMDH Classifiers for the Vowel Problem. (Both panels plot F2 (Hz) against F1 (Hz).)

3 DECISION REGIONS

Classifiers differ not only in their structure and training but also in how decision regions are formed. Decision regions formed by the RBF classifier for the vowel problem are shown in Figure 1A. Boundaries are smooth, spline-like curves that can form arbitrarily complex regions. This improves generalization for many real problems where data for different classes form one or more roughly ellipsoidal clusters. Decision regions for the high-order polynomial (GMDH) network classifier are shown in Figure 1B. Decision region boundaries are smooth and well behaved only in regions of the input space that are densely sampled by the training data.
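The multi-ordering CKNN condensing procedure described in Section 2 can be sketched roughly as follows; the function names, the choice of k, and the number of orderings are illustrative, and this is a sketch of the idea rather than the authors' implementation:

```python
import numpy as np

def knn_classify(x, ex_X, ex_y, k=1):
    """K-nearest-neighbor vote over the current exemplar set."""
    if len(ex_X) == 0:
        return None                             # empty set: always "wrong"
    d = np.sum((np.asarray(ex_X) - x) ** 2, axis=1)
    nearest = np.asarray(ex_y)[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

def condense_one_pass(X, y, order, k=1):
    """Conventional CKNN pass: store a pattern as an exemplar only if
    the exemplars accumulated so far misclassify it."""
    ex_X, ex_y = [], []
    for i in order:
        if knn_classify(X[i], ex_X, ex_y, k) != y[i]:
            ex_X.append(X[i])
            ex_y.append(y[i])
    return np.array(ex_X), np.array(ex_y)

def multi_order_cknn(X, y, n_orders=5, k=1, seed=0):
    """Repeat condensing over n_orders random orderings, take the union
    of all stored exemplars, then prune with a final conventional pass."""
    rng = np.random.default_rng(seed)
    keep = np.zeros(len(X), dtype=bool)
    for _ in range(n_orders):
        order = rng.permutation(len(X))
        ex_X, ex_y = [], []
        for i in order:
            if knn_classify(X[i], ex_X, ex_y, k) != y[i]:
                ex_X.append(X[i])
                ex_y.append(y[i])
                keep[i] = True                  # remember union membership
    union = np.flatnonzero(keep)
    # final pruning pass over the combined exemplar set
    return condense_one_pass(X[union], y[union], np.arange(len(union)), k)
```

Averaging over orderings in this way removes the sensitivity of the stored exemplar set to any single presentation order, and the final pass discards exemplars that the combined set makes redundant.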
Decision boundaries are erratic in regions where there is little training data, due to the high polynomial order of the discriminant functions formed by the GMDH classifier. As a result, the GMDH classifier generalizes poorly in regions with little training data. Decision boundaries for the linear tree classifier are hyperplanes. This classifier may also generalize poorly if the data fall in ellipsoidal clusters.

4 ERROR RATES

Figure 2 shows the classification (test set) error rates for all classifiers on the bullseye, disjoint, vowel, and digit problems. The solid line in each plot represents the