{"title": "Practical Characteristics of Neural Network and Conventional Pattern Classifiers on Artificial and Speech Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 168, "page_last": 177, "abstract": null, "full_text": "168

Lee and Lippmann

Practical Characteristics of Neural Network and Conventional Pattern Classifiers on Artificial and Speech Problems*

Yuchun Lee
Digital Equipment Corp.
40 Old Bolton Road, OGO1-2U11
Stow, MA 01775-1215

Richard P. Lippmann
Lincoln Laboratory, MIT
Room B-349
Lexington, MA 02173-9108

ABSTRACT

Eight neural net and conventional pattern classifiers (Bayesian-unimodal Gaussian, k-nearest neighbor, standard back-propagation, adaptive-stepsize back-propagation, hypersphere, feature-map, learning vector quantizer, and binary decision tree) were implemented on a serial computer and compared using two speech recognition and two artificial tasks. Error rates were statistically equivalent on almost all tasks, but classifiers differed by orders of magnitude in memory requirements, training time, classification time, and ease of adaptivity. Nearest-neighbor classifiers trained rapidly but required the most memory. Tree classifiers provided rapid classification but were complex to adapt. Back-propagation classifiers typically required long training times and had intermediate memory requirements. These results suggest that classifier selection should often depend more heavily on practical considerations concerning memory and computation resources, and restrictions on training and classification times, than on error rate.

*This work was sponsored by the Department of the Air Force and the Air Force Office of Scientific Research.

1 Introduction

A shortcoming of much recent neural network pattern classification research has been an overemphasis on back-propagation classifiers and a focus on classification error rate as the main measure of performance. This research often ignores the many alternative classifiers that have been developed (see e.g. [10]) and the practical tradeoffs these classifiers provide in training time, memory requirements, classification time, complexity, and adaptivity. The purpose of this research was to explore these tradeoffs and gain experience with many different classifiers. Eight neural net and conventional pattern classifiers were used. These included Bayesian-unimodal Gaussian, k-nearest neighbor (kNN), standard back-propagation, adaptive-stepsize back-propagation, hypersphere, feature-map (FM), learning vector quantizer (LVQ), and binary decision tree classifiers.

Figure 1: Four problems used to test classifiers.
- Bullseye: dimensionality 2; training set size 500; testing set size 500; 2 classes.
- Disjoint: dimensionality 2; training set size 500; testing set size 500; 2 classes.
- Digit: dimensionality 22 (cepstra); training set size 70; testing set size 112; 16 training sets; 16 testing sets; 7 digit classes; talker dependent.
- Vowel: dimensionality 2 (formants); training set size 338; testing set size 330; 10 vowel classes; talker independent.

Classifiers were implemented on a serial computer and tested using the four problems shown in Fig. 1. The upper two artificial problems (Bullseye and Disjoint) require simple two-dimensional convex or disjoint decision regions for minimum error classification.
The lower digit recognition task (7 digits, 22 cepstral parameters, 16 talkers, 70 training and 112 testing patterns per talker) and the vowel recognition task (10 vowels, 2 formant parameters, 67 talkers, 338 training and 330 testing patterns) use real speech data and require more complex decision regions. These tasks are described in [6, 11], and details of the experiments are available in [9].

2 Training and Classification Parameter Selection

Initial experiments were performed to select sizes of classifiers that provided good performance with limited training data, and also to select high-performing versions of each type of classifier. These experiments determined the number of nodes and hidden layers in back-propagation classifiers, the pruning techniques to use with tree and hypersphere classifiers, and the numbers of exemplars or kernel nodes to use with feature-map and LVQ classifiers.

2.1 Back-Propagation Classifiers

In standard back-propagation, weights typically are updated only after each trial or cycle. A trial is defined as a single training-pattern presentation, and a cycle as a sequence of trials that samples all patterns in the training set. In group updating, weights are updated every T trials, while in trial-by-trial training, weights are updated on every trial. Furthermore, in trial-by-trial updating, training patterns can be presented sequentially, where each pattern is guaranteed to be presented every T trials, or randomly, where patterns are randomly selected from the training set. Initial experiments demonstrated that random trial-by-trial training provided the best convergence rate and error reduction during training. It was thus used whenever possible with all back-propagation classifiers.

All back-propagation classifiers used a single hidden layer and an output layer with as many nodes as classes.
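
Such a network trained with random trial-by-trial updates might be sketched as follows. This is a hypothetical NumPy reimplementation for illustration only, not the study's C code; the stepsize, hidden-layer size, and cycle count are illustrative, and the "1 of m" target construction matches the desired-output scheme described below.

```python
# Sketch: single-hidden-layer back-propagation classifier with random
# trial-by-trial updating (weights change after every randomly drawn pattern).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, labels, n_hidden=8, stepsize=0.5, cycles=200, seed=0):
    rng = np.random.default_rng(seed)
    n_classes = labels.max() + 1
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, n_classes))
    b2 = np.zeros(n_classes)
    for _ in range(cycles):
        # one cycle = as many random trials as there are training patterns
        for i in rng.integers(0, len(X), size=len(X)):
            x = X[i]
            d = np.eye(n_classes)[labels[i]]        # "1 of m" desired output
            h = sigmoid(x @ W1 + b1)
            y = sigmoid(h @ W2 + b2)
            # gradient of the squared error, propagated back one layer
            dy = (y - d) * y * (1 - y)
            dh = (dy @ W2.T) * h * (1 - h)
            W2 -= stepsize * np.outer(h, dy); b2 -= stepsize * dy
            W1 -= stepsize * np.outer(x, dh); b1 -= stepsize * dh
    return W1, b1, W2, b2

def classify(X, W1, b1, W2, b2):
    # decision = class of the output node with the highest value
    return np.argmax(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), axis=1)
```

A usage sketch: draw two-dimensional patterns, label them by the sign of the first coordinate, train, and check the training-set predictions.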
The classification decision corresponded to the class of the node in the output layer with the highest output value. During training, the desired output pattern, D, was a vector with all elements set to 0 except for the element corresponding to the correct class of the input pattern, which was set to 1. The mean-square difference between the actual output and this desired output is minimized when the output of each node is exactly the Bayes a posteriori probability of the corresponding class [1, 10]. Back-propagation with this "1 of m" desired output is thus well justified theoretically because it attempts to estimate minimum-error Bayes probability functions. The number of hidden nodes used in each back-propagation classifier was determined experimentally as described in [6, 7, 9, 11].

Three "improved" back-propagation classifiers with the potential for reduced training times were studied. The first, the adaptive-stepsize classifier, has a global stepsize that is adjusted after every training cycle as described in [4]. The second, the multiple-adaptive-stepsize classifier, has multiple stepsizes (one for each weight) which are adjusted after every training cycle as described in [8]. The third classifier uses the conjugate gradient method [9, 12] to minimize the output mean-square error.

The goal of the three "improved" versions of back-propagation was to shorten the often lengthy training time observed with standard back-propagation. These improvements relied on fundamental assumptions about the error surfaces. However, only the multiple-adaptive-stepsize algorithm was used for the final classifier comparison, due to the poor performance of the other two algorithms.
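
The per-weight stepsize idea can be sketched as follows. This is a simplified, delta-bar-delta-style rule in the spirit of [8], not the exact algorithm used in the study: each weight's stepsize grows additively while successive gradient signs agree and shrinks multiplicatively when they flip, and the constants below are illustrative assumptions.

```python
# Sketch: gradient descent with one adaptive stepsize per weight.
import numpy as np

def adapt_stepsizes(grad_fn, w, steps=100, kappa=0.01, phi=0.5, eta0=0.05):
    eta = np.full_like(w, eta0)          # one stepsize per weight
    prev_grad = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        eta = np.where(g * prev_grad > 0, eta + kappa, eta)  # signs agree: grow
        eta = np.where(g * prev_grad < 0, eta * phi, eta)    # signs flip: shrink
        w = w - eta * g
        prev_grad = g
    return w

# Hypothetical quadratic error surface with very different curvatures per
# dimension, where a single global stepsize is hard to choose well.
grad = lambda w: np.array([20.0, 0.2]) * w
w_final = adapt_stepsizes(grad, np.array([1.0, 1.0]))
```

On this surface, the steeply curved dimension keeps a small stepsize while the shallow dimension's stepsize grows, so both coordinates are driven toward the minimum at the origin.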
The adaptive-stepsize classifier often could not achieve adequately low error rates because the global stepsize (η) frequently converged too quickly to zero during training. The multiple-adaptive-stepsize classifier did not train faster than a standard back-propagation classifier with a carefully selected stepsize value; nevertheless, it eliminated the need for pre-selecting the stepsize parameter. The conjugate gradient classifier worked well on simple problems but almost always converged rapidly to a local minimum that produced high error rates on the more complex speech problems.

Figure 2: Decision regions formed by the hypersphere classifier (A) and by the binary decision tree classifier (B) on the test set for the vowel problem (axes: first formant F1 and second formant F2, in Hz). Inputs consist of the first two formants for ten vowels in the words who'd, hawed, hod, hud, had, heed, hid, head, heard, and hood, as described in [6, 9].

2.2 Hypersphere Classifier

Hypersphere classifiers build decision regions from nodes that form separate hypersphere decision regions. Many different types of hypersphere classifiers have been developed [2, 13]. Experiments discussed in [9] led to the selection of a specific version of hypersphere classifier with "pruning". Each hypersphere can only shrink in size, centers are not repositioned, an ambiguous response (positive outputs from hyperspheres corresponding to different classes) is mediated using a nearest-neighbor rule, and hyperspheres that do not contribute to classification performance are pruned from the classifier for proper "fitting" of the data and to reduce memory usage.
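
A minimal sketch of the shrink-only idea follows. This is a heavily simplified illustration, not the classifier from [9]: center selection and pruning are omitted, every training pattern becomes a center, and each radius simply shrinks until it excludes every pattern of another class, with ambiguous or uncovered inputs mediated by the nearest-neighbor rule.

```python
# Sketch: shrink-only hyperspheres with a nearest-neighbor fallback.
import numpy as np

def fit_hyperspheres(X, labels, r_init=10.0):
    centers, classes, radii = X.copy(), labels.copy(), []
    for c, lab in zip(centers, classes):
        other = X[labels != lab]
        # shrink: radius ends just inside the closest other-class pattern
        d = np.linalg.norm(other - c, axis=1).min() if len(other) else r_init
        radii.append(min(r_init, 0.999 * d))
    return centers, classes, np.array(radii)

def classify_hypersphere(x, centers, classes, radii):
    d = np.linalg.norm(centers - x, axis=1)
    covering = set(classes[d <= radii])
    if len(covering) == 1:
        return covering.pop()
    # ambiguous (or no) response: mediate with the nearest-neighbor rule
    return classes[np.argmin(d)]
```

Boundaries produced this way are arcs of hyperspheres, with linear nearest-neighbor segments where spheres of different classes overlap, matching the qualitative picture in Fig. 2(A).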
Decision regions formed by a hypersphere classifier for the vowel classification problem are shown in the left side of Fig. 2. Separate regions in this figure correspond to different vowels. Decision region boundaries contain arcs, which are segments of hyperspheres (circles in two dimensions), and linear segments caused by the application of the nearest-neighbor rule for ambiguous responses.

2.3 Binary Decision Tree Classifier

Binary decision tree classifiers from [3] were used in all experiments. Each node in a tree has only two immediate offspring, and the splitting decision is based on only one of the input dimensions. Decision boundaries are thus overlapping hyper-rectangles with sides parallel to the axes of the input space, and decision regions become more complex as more nodes are added to the tree. Decision trees for each problem were grown until they classified all the training data exactly and were then pruned back, using the test data to determine when to stop pruning. A complete description of the decision tree classifier used is provided in [9], and decision regions formed by this classifier for the vowel problem are shown in the right side of Fig. 2.

2.4 Other Classifiers

The remaining four classifiers were tuned by selecting coarse sizing parameters to "fit" the problem posed. These parameters include the number of exemplars in the LVQ and feature-map classifiers and k in the k-nearest neighbor classifier. Different types of covariance matrices (full, diagonal, and various types of grand averaging) were also tried for the Bayesian-unimodal Gaussian classifier. The best sizing parameter values for classifiers were almost always not those that best classified the training set. For the purpose of this study, training data was used to determine internal parameters or weights in classifiers. The size of a classifier and its coarse sizing parameters were selected using the test data.
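
This coarse-parameter selection can be sketched with k of the k-nearest-neighbor classifier as the example: each candidate k is scored on held-out data and the best one is kept. The code is an illustration under assumed names, not the study's implementation; here the held-out set plays the role the test set played in the study.

```python
# Sketch: choosing the coarse sizing parameter k for a kNN classifier
# by minimizing error on held-out data.
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k):
    d = np.linalg.norm(X_train - x, axis=1)
    votes = y_train[np.argsort(d)[:k]]           # labels of k nearest patterns
    return Counter(votes.tolist()).most_common(1)[0][0]

def select_k(X_train, y_train, X_held, y_held, candidates=(1, 3, 5, 7)):
    def error(k):
        preds = [knn_classify(x, X_train, y_train, k) for x in X_held]
        return np.mean(np.array(preds) != y_held)
    return min(candidates, key=error)            # first k with lowest error
```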
In real applications, when a test set is not available, alternative methods such as cross-validation [3, 14] would be used.

3 Classifier Comparison

All eight classifiers were evaluated on the four problems using simulations programmed in C on a Sun 3/110 workstation with a floating-point accelerator. Classifiers were trained until their training error rate converged.

3.1 Error Rates

Error rates for all classifiers on all problems are shown in Fig. 3. The middle solid lines in this figure correspond to the average error rate over all classifiers for each problem. The shaded area is one binomial standard deviation above and below this average.

Figure 3: Error rates for all classifiers on all four problems. The middle solid lines correspond to the average error rate over all classifiers for each problem. The shaded area is one binomial standard deviation above and below the average error rate.

As can be seen, there are only three cases where the error rate of any one classifier is substantially different from the average error. These exceptions are the Bayesian-unimodal Gaussian classifier on the disjoint problem and the decision tree classifier on the digit and disjoint problems. The Bayesian-unimodal Gaussian classifier performed poorly on the disjoint problem because it was unable to form the required bimodal disjoint decision regions. The decision tree classifier performed poorly on the digit problem because the small amount of training data (10 patterns per class) was adequately classified by a minimal 13-node tree which didn't generalize well and didn't even use all 22 input dimensions.
The decision tree classifier worked well for the disjoint problem because it forms decision regions parallel to both input axes, as required for this problem.

3.2 Practical Characteristics

In contrast to the small differences in error rate, differences between classifiers on practical performance issues such as training time, classification time, and memory usage were large. Figure 4 shows that the classifiers differed by orders of magnitude in training time. On the log scale of Fig. 4, the k-nearest neighbor classifier stands out distinctively as the fastest-trained classifier, by many orders of magnitude.

Figure 4: Training time of all classifiers on all four problems (log scale).

Depending on the problem, the Bayesian-unimodal Gaussian, hypersphere, decision tree, and feature-map classifiers also have reasonably short training times. LVQ and back-propagation classifiers often required the longest training times. It should be noted that alternative implementations, for example using parallel computers, would lead to different results.

Adaptivity, the ability to adapt using new patterns after complete training, also differed across classifiers. The k-nearest neighbor and hypersphere classifiers are able to incorporate new information most readily. Others, such as back-propagation and LVQ classifiers, are more difficult to adapt, and some, such as decision tree classifiers, are not designed to handle further adaptation after training is complete.

The binary decision tree can classify patterns much faster than the other classifiers.
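
The reason is sketched below: traversing an axis-parallel binary tree costs a few scalar comparisons per pattern, versus one distance computation per stored exemplar for distance-based classifiers. The tiny tree here is hand-built for illustration, not one grown by the CART procedure of [3].

```python
# Sketch: classification by an axis-parallel binary decision tree.
# Each internal node: (feature index, threshold, left subtree, right subtree).
# Each leaf: a class label.
tree = (0, 0.5,                 # is x[0] <= 0.5 ?
        (1, 0.5, "a", "b"),     #   yes: split again on x[1]
        "c")                    #   no: class "c"

def tree_classify(x, node):
    # a handful of threshold tests, regardless of training-set size
    while isinstance(node, tuple):
        feat, thresh, left, right = node
        node = left if x[feat] <= thresh else right
    return node
```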
Unlike most classifiers, which depend on "distance" calculations between the input pattern and all stored exemplars, the decision tree classifier requires only a few numerical comparisons. The decision tree classifier was therefore many orders of magnitude faster in classification than the other classifiers. However, decision tree classifiers require the most complex training algorithm. As a rough measure of ease of implementation, subjectively gauged by the number of lines in the training program, the decision tree classifier is many times more complex than the simplest training program, that of the k-nearest neighbor classifier. However, the k-nearest neighbor classifier is one of the slowest in classification when implemented serially without complex search techniques such as k-d trees [5]. These techniques greatly reduce classification time but make adaptation to new training data more difficult and increase complexity.

Figure 5: Classification memory usage (bytes) versus training program complexity (lines of code) for all classifiers on all four problems.

4 Trade-Offs Between Performance Criteria

No one classifier out-performed the rest on all performance criteria. The selection of a "best" classifier depends on practical problem constraints, which differ across problems. Without knowing these constraints or associating explicit costs with the various performance criteria, a "best" classifier cannot be meaningfully determined.
Instead, there are numerous trade-off relationships between the various criteria.

One trade-off, shown in Fig. 5, is classification memory usage versus the complexity of the training algorithm. The far upper left corner, where training is very simple but memory is not efficiently utilized, contains the k-nearest neighbor classifier. In contrast, the binary decision tree classifier is in the lower right corner, where overall memory usage is minimized but the training process is very complex. Other classifiers are intermediate.

Figure 6: Training time versus classification memory usage (bytes) of all classifiers on the vowel problem.

Figure 6 shows the relationship between training time and classification memory usage for the vowel problem. The k-nearest neighbor classifier consistently provides the shortest training time but requires the most memory. The hypersphere classifier optimizes these two criteria well across all four problems. Back-propagation classifiers frequently require long training times and intermediate amounts of memory.

5 Summary

This study explored practical characteristics of neural net and conventional pattern classifiers. Results demonstrate that classification error rates can be equivalent across classifiers when classifiers are powerful enough to form minimum-error decision regions, when they are rigorously tuned, and when sufficient training data is provided. Practical characteristics such as training time, memory requirements, and classification time, however, differed by orders of magnitude.
In practice, these factors are more likely to affect classifier selection. Selection will often be driven by practical considerations concerning memory and computation resources; restrictions on training, test, and adaptation times; and ease of use and implementation. The many existing neural net and conventional classifiers allow system designers to trade these characteristics off. Tradeoffs will vary with implementation hardware (e.g. serial versus parallel, analog versus digital) and with details of the problem (e.g. dimension of the input vector, complexity of decision regions). Our current research efforts are exploring these tradeoffs on more difficult problems and studying additional classifiers, including radial-basis-function classifiers, high-order networks, and Gaussian mixture classifiers.

References

[1] A. R. Barron and R. L. Barron. Statistical learning networks: A unifying view. In 1988 Symposium on the Interface: Statistics and Computing Science, Reston, Virginia, April 21-23, 1988.

[2] B. G. Batchelor. Classification and data analysis in vector space. In B. G. Batchelor, editor, Pattern Recognition, chapter 4, pages 67-116. Plenum Press, London, 1978.

[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.

[4] L. W. Chan and F. Fallside. An adaptive training algorithm for back propagation networks. Computer Speech and Language, 2:205-218, 1987.

[5] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209-226, September 1977.

[6] W. M. Huang and R. P. Lippmann. Neural net and traditional classifiers. In D.
Anderson, editor, Neural Information Processing Systems, pages 387-396, New York, 1988. American Institute of Physics.

[7] W. Y. Huang and R. P. Lippmann. Comparisons between conventional and neural net classifiers. In 1st International Conference on Neural Networks, pages IV-485. IEEE, June 1987.

[8] R. A. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks, 1:295-307, 1988.

[9] Y. Lee. Classifiers: Adaptive modules in pattern recognition systems. Master's thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, Cambridge, MA, May 1989.

[10] R. P. Lippmann. Pattern classification using neural networks. IEEE Communications Magazine, 27(11):47-54, November 1989.

[11] R. P. Lippmann and B. Gold. Neural classifiers useful for speech recognition. In 1st International Conference on Neural Networks, pages IV-417. IEEE, June 1987.

[12] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, editors. Numerical Recipes. Cambridge University Press, New York, 1986.

[13] D. L. Reilly, L. N. Cooper, and C. Elbaum. A neural model for category learning. Biological Cybernetics, 45:35-41, 1982.

[14] M. Stone. Cross-validation choice and assessment of statistical predictions. Journal of the Royal Statistical Society, B-36:111-147, 1974.
", "award": [], "sourceid": 259, "authors": [{"given_name": "Yuchun", "family_name": "Lee", "institution": null}, {"given_name": "Richard", "family_name": "Lippmann", "institution": null}]}