{"title": "Information Measure Based Skeletonisation", "book": "Advances in Neural Information Processing Systems", "page_first": 1080, "page_last": 1087, "abstract": null, "full_text": "Information Measure Based Skeletonisation \n\nSowmya Ramachandran \n\nDepartment of Computer Science \n\nUniversity of Texas at Austin \n\nAustin, TX 78712-1188 \n\nLorien Y. Pratt * \n\nDepartment of Computer Science \n\nRutgers University \n\nNew Brunswick, NJ 08903 \n\nAbstract \n\nAutomatic determination of proper neural network topology by trimming \nover-sized networks is an important area of study, which has previously \nbeen addressed using a variety of techniques. In this paper, we present \nInformation Measure Based Skeletonisation (IMBS), a new approach to \nthis problem where superfluous hidden units are removed based on their \ninformation measure (1M). This measure, borrowed from decision tree in(cid:173)\nduction techniques, reflects the degree to which the hyperplane formed \nby a hidden unit discriminates between training data classes. We show \nthe results of applying IMBS to three classification tasks and demonstrate \nthat it removes a substantial number of hidden units without significantly \naffecting network performance. \n\n1 \n\nINTRODUCTION \n\nNeural networks can be evaluated based on their learning speed, the space and time \ncomplexity of the learned network, and generalisation performance. Pruning over(cid:173)\nsized networks (skeletonisation) has the potential to improve networks along these \ndimensions as follows: \n\n\u2022 Learning Speed: Empirical observation indicates that networks which have \nbeen constrained to have fewer parameters lack flexibility during search, and \nso tend to learn slower. Training a network that is larger than necessary and \n\n*This work was partially supported by DOE #DE-FG02-91ER61129, through subcon(cid:173)\n\ntract #097P753 from the University of Wisconsin. 
\n\n1080 \n\n\fInformation Measure Based Skeletonisation \n\n1081 \n\ntrimming it back to a reduced architecture could lead to improved learning \nspeed . \n\n\u2022 Network Complexity: Skeletonisation improves both space and time complexity \n\nby reducing the number of weights and hidden units . \n\n\u2022 Generalisation: Skeletonisation could constrain networks to generalise better \n\nby reducing the number of parameters used to fit the data. \n\nVarious techniques have been proposed for skeletonisation. One approach [Hanson \nand Pratt, 1989, Chauvin, 1989, Weigend et al., 1991] is to add a cost term or \nbias to the objective function. This causes weights to decay to zero unless they \nare reinforced. Another technique is to measure the increase in error caused by \nremoving a parameter or a unit, as in [Mozer and Smolensky, 1989, Le Cun et al., \n1990]. Parameters that have the least effect on the error may be pruned from the \nnetwork. \n\nIn this paper, we present Information Measure Based Skeletonisation (IMBS), an \nalternate approach to this problem, in which superfluous hidden units in a single \nhidden-layer network are removed based on their information measure (1M). This \nidea is somewhat related to that presented in [Siestma and Dow, 1991], though we \nuse a different algorithm for detecting superfluous hidden units. \n\nWe also demonstrate that when IMBS is applied to a vowel recognition task, to \na subset of the Peterson-Barney 10-vowel classification problem, and to a heart \ndisease diagnosis problem, it removes a substantial number of hidden units without \nsignificantly affecting network performance. \n\n2 1M AND THE HIDDEN LAYER \n\nSeveral decision tree induction schemes use a particular information-theoretic mea(cid:173)\nsure, called 1M, of the degree to which an attribute separates (discriminates between \nthe classes of) a given set of training data [Quinlan, 1986]. 
IM is a measure of the information gained by knowing the value of an attribute for the purpose of classification. The higher the IM of an attribute, the greater the uniformity of class data in the subsets of feature space it creates.\n\nA useful simplification of the sigmoidal activation function used in back-propagation networks [Rumelhart et al., 1986] is to reduce this function to a threshold by mapping activations greater than 0.5 to 1 and less than 0.5 to 0. In this simplified model, the hidden units form hyperplanes in the feature space which separate data. Thus, they can be considered analogous to binary-valued attributes, and the IM of each hidden unit can be calculated as in decision tree induction [Quinlan, 1986].\n\nFigure 1 shows the training data for a fabricated two-feature, two-class problem and a possible configuration of the hyperplanes formed by each hidden unit at the end of training. Hyperplane h1's higher IM corresponds to the fact that it separates the two classes better than h2.\n\n[Figure: training points labelled 0 and 1 cut by hyperplanes h1 and h2; an IM of .0115 is marked on the plot.]\n\nFigure 1: Hyperplanes and their IM. Arrows indicate regions where hidden units have activations > 0.5.\n\n3 IM TO DETECT SUPERFLUOUS HIDDEN UNITS\n\nOne of the important goals of training is to adjust the set of hyperplanes formed by the hidden layer so that they separate the training data.[1] We define superfluous units as those whose corresponding hyperplanes are not necessary for the proper separation of training data. For example, in Figure 1, hyperplane h2 is superfluous because:\n\n1. h1 separates the data better than h2 and\n2. h2 does not separate the data in either of the two regions created by h1.\n\nThe IMBS algorithm to identify superfluous hidden units, shown in Figure 2, recursively finds hidden units that are necessary to separate the data and classifies the rest as superfluous.
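Under the threshold simplification, the recursion of Figure 2 is exactly greedy decision-tree induction over the hidden units, with IM computed as the information gain of a binary attribute [Quinlan, 1986]. A minimal Python sketch (the data layout and function names are ours, not the paper's):

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def im(activations, labels):
    """IM of one hidden unit: information gained by splitting the
    data at the 0.5 activation threshold."""
    above = [y for a, y in zip(activations, labels) if a > 0.5]
    below = [y for a, y in zip(activations, labels) if a <= 0.5]
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in (above, below) if s)
    return entropy(labels) - remainder

def ident_superfluous_hu(hidden_acts, labels):
    """hidden_acts[j][i]: activation of hidden unit j on pattern i.
    Returns the indices of hidden units the recursion never needed."""
    useful = set()

    def pick_best_hu(pattern_idx):
        ys = [labels[i] for i in pattern_idx]
        if len(set(ys)) <= 1:      # pure region: no further split needed
            return
        best = max(range(len(hidden_acts)),
                   key=lambda j: im([hidden_acts[j][i] for i in pattern_idx], ys))
        useful.add(best)
        ds1 = [i for i in pattern_idx if hidden_acts[best][i] > 0.5]
        ds2 = [i for i in pattern_idx if hidden_acts[best][i] <= 0.5]
        if ds1 and ds2:            # guard (ours): best unit must actually split
            pick_best_hu(ds1)
            pick_best_hu(ds2)

    pick_best_hu(list(range(len(labels))))
    return [j for j in range(len(hidden_acts)) if j not in useful]

# Unit 0 separates the two classes perfectly; unit 1 never does.
acts = [[0.9, 0.8, 0.1, 0.2],   # hidden unit 0
        [0.6, 0.4, 0.6, 0.4]]   # hidden unit 1
print(ident_superfluous_hu(acts, [1, 1, 0, 0]))   # -> [1]
```

The final guard is our addition for the case where no unit splits an impure region; the paper assumes training has fully separated the data (see the footnote above and the extensions in Section 5).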
It is similar to the decision tree induction algorithm in [Quinlan, 1986].\n\nThe hidden layer is skeletonised by removing the superfluous hidden units. Since the removal of these units perturbs the inputs to the output layer, the network will have to be trained further after skeletonisation to recover lost performance.\n\n4 RESULTS\n\nWe have tested IMBS on three classification problems, as follows:\n\n1. Train a network to an acceptable level of performance.\n2. Identify and remove superfluous hidden units.\n3. Train the skeletonised network further to an acceptable level of performance.\n\nWe will refer to the stopping point of training at step 1 as the skeletonisation point (SP); further training will be referred to in terms of SP + number of training epochs.\n\n[1] This again is not strictly true for hidden units with sigmoidal activation, but holds for the approximate model.\n\nInput:\n    Training data\n    Hidden unit activations for each training data pattern.\nOutput:\n    List of superfluous hidden units.\nMethod:\n    main ident-superfluous-hu\n    begin\n        data-set ← training data\n        useful-hu-list ← nil\n        pick-best-hu(data-set, useful-hu-list)\n        output hidden units that are not in useful-hu-list\n    end\n    procedure pick-best-hu(data-set, useful-hu-list)\n    begin\n        if all the data in data-set belong to the same class then return\n        calculate the IM of each hidden unit\n        h1 ← hidden unit with the best IM\n        add h1 to useful-hu-list\n        ds1 ← all the data in data-set for which h1 has an activation > 0.5\n        ds2 ← all the data in data-set for which h1 has an activation <= 0.5\n        pick-best-hu(ds1, useful-hu-list)\n        pick-best-hu(ds2, useful-hu-list)\n    end\n\nFigure 2: IMBS: An Algorithm for Identifying Superfluous Hidden Units\n\nFor each problem, data was divided into a training set and a test set.
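Step 2 of the protocol above amounts to deleting each superfluous unit's incoming weights and bias together with the corresponding column of the output layer's weights; steps 1 and 3 are ordinary back-propagation on the full and reduced networks. A sketch of the removal step (the list-of-lists weight layout and names are ours):

```python
def remove_hidden_units(W_in, b_hid, W_out, superfluous):
    """Drop pruned hidden units from a single-hidden-layer network.

    W_in[j]  -- incoming weight vector of hidden unit j (one entry per input)
    b_hid[j] -- bias of hidden unit j
    W_out[k] -- weights of output unit k (one entry per hidden unit)
    """
    drop = set(superfluous)
    keep = [j for j in range(len(W_in)) if j not in drop]
    return ([W_in[j] for j in keep],
            [b_hid[j] for j in keep],
            [[row[j] for j in keep] for row in W_out])

# Hidden unit 1 was found superfluous; delete its weights everywhere.
W_in, b_hid, W_out = remove_hidden_units(
    [[1, 2], [3, 4], [5, 6]], [0.1, 0.2, 0.3],
    [[7, 8, 9], [10, 11, 12]], superfluous=[1])
print(W_out)   # -> [[7, 9], [10, 12]]
```

Because the surviving output-layer weights were fitted with the pruned units' activations present, the reduced network is then retrained (step 3) to recover performance.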
Several networks were run for a few epochs with different back-propagation parameters η (learning rate) and α (momentum) to determine their locally optimal values.\n\nFor each problem, we chose an initial architecture and trained 10 networks with different random initial weights for the same number of epochs. The performance of the original (i.e. the network before skeletonisation) and the skeletonised networks, measured as the number of correct classifications of the training and test sets, was recorded both at SP and after further training. The retrained skeletonised network was compared with the original network at SP as well as the original network that had been trained further for the same number of weight updates.[2] All training was via the standard back-propagation algorithm with a sigmoidal activation function and updates after every pattern presentation [Rumelhart et al., 1986]. A paired T-test [Siegel, 1988] was used to measure the significance of the difference in performance between the skeletonised and original networks. Our experimental results are summarised in Figure 3 and Tables 1 and 2; detailed experimental conditions are given below.\n\n[2] This was ensured by adjusting the number of epochs a network was trained after skeletonisation according to the number of hidden units in the network. Thus, a network with 10 hidden units was trained on twice as many epochs as one with 20 hidden units.\n\n[Figure: three panels (PB Vowel, Robinson vowel, Heart disease) plotting correct classifications against weight updates; plotted values not recovered.]\n\nFigure 3: Summary of experimental results. Circles represent skeletonised networks; triangles represent unskeletonised networks for comparison. Note that when performance drops upon skeletonisation, the original performance level is recovered within a few weight updates. In all cases, hidden unit count is reduced.\n\n4.1 PETERSON-BARNEY DATA\n\nIMBS was first evaluated on a 3-class subset of the Peterson-Barney 10-vowel classification data set, originally described in [Peterson and Barney, 1952], and recreated by [Watrous, 1991]. This data consists of the formant values F1 and F2 for each of two repetitions of each of ten vowels by 76 speakers (1520 utterances). The vowels were pronounced in isolated words consisting of the consonant \"h\", followed by a vowel, followed by \"d\". This set was randomly divided into a 2/3, 1/3 training/test split, with 298 and 150 patterns, respectively.\n\nOur initial architecture was a fully connected network with 2 input units, one hidden layer with 20 units, and 3 output units. We trained the networks with η = 1.0 and α = 0.001 until the TSS (total sum of squared error) scores seemed to reach a plateau. The networks were trained for 2000 epochs and then skeletonised.\n\nThe skeletonisation procedure removed an average of 10.1 (50.5%) hidden units. Though the average performance of the skeletonised networks was worse than that of the original, this difference was not statistically significant at the p = 0.001 level.\n\n4.2 ROBINSON VOWEL RECOGNITION\n\nUsing data from [Robinson, 1989], we trained networks to perform speaker-independent recognition of the 11 steady-state vowels of British English using a training set of LPC-derived log area ratios. Training and test sets were as used by [Robinson, 1989], with 528 and 462 patterns, respectively.
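The paired T-test [Siegel, 1988] used for these comparisons reduces to a t statistic over the 10 per-network differences in correct classifications. A minimal sketch with illustrative numbers (made up, not the paper's data):

```python
import math

def paired_t(diffs):
    """t statistic for a paired test on per-network performance differences
    (original minus skeletonised correct-classification counts)."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)

# Illustrative differences for 10 network pairs.
diffs = [3, -1, 2, 0, 1, -2, 4, 1, 0, 2]
t_stat = paired_t(diffs)
# Two-sided critical value for df = 9 at p = 0.001 is about 4.781 (t table);
# |t| below it means the difference is not significant at that level.
not_significant = abs(t_stat) < 4.781
```

With these numbers t ≈ 1.73, well inside the critical value, which is the sense in which the differences reported below are "not statistically significant at the p = 0.001 level".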
\n\nThe initial network architecture was fully connected, with 10 input units, 11 output units, and 30 hidden units. Networks were trained with η = 1.0 and α = 0.01, until the performance on the training set exceeded 95%. The networks were trained for 1500 epochs and then skeletonised. The skeletonisation procedure removed an average of 5.8 (19.3%) hidden units. The difference in performance was not statistically significant at the p = 0.001 level.\n\nTable 1: Performance of unskeletonised networks\n\n[Table body not recovered from the scanned original.]\n\nTable 2: Mean difference in the number of correct classifications between the original and skeletonised networks. Positive differences indicate that the original network did better after further training. The numbers in brackets indicate the 99.9% confidence intervals for the mean.\n\ncomparison points          mean difference\nOriginal  Skeletonised     Training set            Test set\nPeterson-Barney\nSP        SP                3.10 [-0.83, 7.03]     -0.10 [-2.05, 1.84]\nSP        SP+1010          -0.10 [-1.76, 1.56]      0.70 [-0.73, 2.13]\nSP+500    SP+1010           0.20 [-1.52, 1.91]      0.30 [-1.30, 1.90]\nRobinson Vowel\nSP        SP                1.70 [-2.40, 5.80]      2.40 [-2.39, 7.19]\nSP        SP+620           -8.20 [-20.33, 3.93]    -4.40 [-18.26, 9.46]\nSP+500    SP+620           -0.30 [-3.15, 2.55]     -0.30 [-8.36, 7.76]\nHeart Disease\nSP        SP               20.80 [-5.66, 47.26]    12.20 [-1.65, 26.05]\nSP        SP+33             0.00 [-4.28, 4.28]      0.00 [-2.85, 2.85]\nSP+14     SP+33             0.60 [-4.55, 5.75]      0.40 [-3.03, 3.83]\n\n4.3 HEART DISEASE DATA\n\nUsing a 14-attribute set of diagnosis information, we trained networks on a heart disease diagnosis problem [Detrano et al., 1989]. Training and test data were chosen randomly in a 2/3, 1/3 split of 820 and 410 patterns, respectively. The initial networks were fully connected, with 25 input units, one hidden layer with 20 units, and 2 output units.
The networks were trained with α = 1.25 and η = 0.005. Training was stopped when the TSS scores seemed to reach a plateau. The networks were trained for 300 epochs and then skeletonised.\n\nThe skeletonisation procedure removed an average of 9.6 (48%) hidden units. Here, removing superfluous units degraded the performance by an average of 2.5% on the training set and 3.0% on the test set. However, after being trained further for only 30 epochs, the skeletonised networks recovered to do as well as the original networks.\n\n5 CONCLUSION AND EXTENSIONS\n\nWe have introduced an algorithm, called IMBS, which uses an information measure borrowed from decision tree induction schemes to skeletonise over-sized back-propagation networks. Empirical tests showed that IMBS removed a substantial percentage of hidden units without significantly affecting network performance.\n\nPotential extensions to this work include:\n\n\u2022 Using decision tree reduction schemes to allow for trimming not only superfluous hyperplanes, but also those responsible for overfitting the training data, in an effort to improve generalisation.\n\n\u2022 Extending IMBS to better identify superfluous hidden units under conditions of less than 100% performance on the training data.\n\n\u2022 Extending IMBS to work for networks with more than one hidden layer.\n\n\u2022 Performing more rigorous empirical evaluation.\n\n\u2022 Making IMBS less sensitive to the hyperplane-as-threshold assumption. In particular, a model with variable-width hyperplanes (depending on the sigmoidal gain) may be effective.\n\nAcknowledgements\n\nOur thanks to Haym Hirsh and Tom Lee for insightful comments on earlier drafts of this paper, to Christian Roehr for an update to the IMBS algorithm, and to Vince Sgro, David Lubinsky, David Loewenstern and Jack Mostow for feedback on later drafts.
Matthias Pfister, M.D., of University Hospital in Zurich, Switzerland was responsible for collection of the heart disease data. We used software distributed with [McClelland and Rumelhart, 1988] for many of our simulations.\n\nReferences\n\n[Chauvin, 1989] Chauvin, Y. 1989. A back-propagation algorithm with optimal use of hidden units. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 1. Morgan Kaufmann, San Mateo, CA. 519-526.\n\n[Detrano et al., 1989] Detrano, R.; Janosi, A.; Steinbrunn, W.; Pfisterer, M.; Schmid, J.; Sandhu, S.; Guppy, K.; Lee, S.; and Froelicher, V. 1989. International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology 64:304-310.\n\n[Hanson and Pratt, 1989] Hanson, Stephen Jose and Pratt, Lorien Y. 1989. Comparing biases for minimal network construction with back-propagation. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 1. Morgan Kaufmann, San Mateo, CA. 177-185.\n\n[Le Cun et al., 1990] Le Cun, Yann; Denker, John; Solla, Sara A.; Howard, Richard E.; and Jackel, Lawrence D. 1990. Optimal brain damage. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2. Morgan Kaufmann, San Mateo, CA.\n\n[McClelland and Rumelhart, 1988] McClelland, James L. and Rumelhart, David E. 1988. Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises. Cambridge, MA, The MIT Press.\n\n[Mozer and Smolensky, 1989] Mozer, Michael C. and Smolensky, Paul 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 1. Morgan Kaufmann, San Mateo, CA. 107-115.\n\n[Peterson and Barney, 1952] Peterson, G. E. and Barney, H. L. 1952. Control methods used in a study of the vowels. J. Acoust. Soc. Am. 24(2):175-184.\n\n[Quinlan, 1986] Quinlan, J. R. 1986. Induction of decision trees. Machine Learning 1(1):81-106.\n\n[Robinson, 1989] Robinson, Anthony John 1989. Dynamic Error Propagation Networks. Ph.D. Dissertation, Cambridge University, Engineering Department.\n\n[Rumelhart et al., 1986] Rumelhart, D.; Hinton, G.; and Williams, R. 1986. Learning representations by back-propagating errors. Nature 323:533-536.\n\n[Siegel, 1988] Siegel, Andrew F. 1988. Statistics and Data Analysis: An Introduction. John Wiley and Sons. Chapter 15, 336-339.\n\n[Sietsma and Dow, 1991] Sietsma, Jocelyn and Dow, Robert J. F. 1991. Creating artificial neural networks that generalize. Neural Networks 4:67-79.\n\n[Watrous, 1991] Watrous, Raymond L. 1991. Current status of Peterson-Barney vowel formant data. Journal of the Acoustical Society of America 89(3):2459-2460.\n\n[Weigend et al., 1991] Weigend, Andreas S.; Rumelhart, David E.; and Huberman, Bernardo A. 1991. Generalization by weight-elimination with application to forecasting. In Lippmann, R. P.; Moody, J. E.; and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3. Morgan Kaufmann, San Mateo, CA. 875-882.\n", "award": [], "sourceid": 484, "authors": [{"given_name": "Sowmya", "family_name": "Ramachandran", "institution": null}, {"given_name": "Lorien", "family_name": "Pratt", "institution": null}]}