{"title": "A Comparative Study of a Modified Bumptree Neural Network with Radial Basis Function Networks and the Standard Multi Layer Perceptron", "book": "Advances in Neural Information Processing Systems", "page_first": 240, "page_last": 246, "abstract": null, "full_text": "A Comparative Study Of A Modified Bumptree Neural Network With Radial Basis Function Networks and the Standard Multi-Layer Perceptron \n\nRichard T.J. Bostock and Alan J. Harget \n\nDepartment of Computer Science & Applied Mathematics \n\nAston University \n\nBirmingham \n\nEngland \n\nAbstract \n\nBumptrees are geometric data structures introduced by Omohundro (1991) to provide efficient access to a collection of functions on a Euclidean space of interest. We describe a modified bumptree structure that has been employed as a neural network classifier, and compare its performance on several classification tasks against that of radial basis function networks and the standard multi-layer perceptron. \n\n1 INTRODUCTION \n\nA number of neural network studies have demonstrated the utility of the multi-layer perceptron (MLP) and shown it to be a highly effective paradigm. Studies have also shown, however, that the MLP is not without its problems; in particular, it requires an extensive training time, is susceptible to local minima, and its performance depends on its internal network architecture. In an attempt to improve upon the generalisation performance and computational efficiency, a number of studies have been undertaken, principally concerned with investigating the parametrisation of the MLP. It is well known, for example, that the generalisation performance of the MLP is affected by the number of hidden units in the network, which has to be determined empirically since theory provides no guidance. 
A number of investigations have been conducted into the possibility of automatically determining the number of hidden units during the training phase (Bostock, 1992). The results show that architectures can be attained which give satisfactory, although generally sub-optimal, performance. \n\nAlternative network architectures such as the Radial Basis Function (RBF) network have also been studied in an attempt to improve upon the performance of the MLP network. The RBF network uses basis functions in which the weights are effective over only a small portion of the input space. This is in contrast to the MLP network, where the weights are used in a more global fashion, thereby encoding the characteristics of the training set in a more compact form. RBF networks can be rapidly trained, thus making them particularly suitable for situations where on-line incremental learning is required. The RBF network has been successfully applied in a number of areas such as speech recognition (Renals, 1992) and financial forecasting (Lowe, 1991). Studies indicate that the RBF network provides a viable alternative to the MLP approach and thus offers encouragement that networks employing local solutions are worthy of further investigation. \n\nIn the past few years there has been an increasing interest in neural network architectures based on tree structures. Important work in this area has been carried out by Omohundro (1991) and Gentric and Withagen (1993). These studies suggest that neural networks employing a tree-based structure should offer the same benefits of reduced training time as those offered by the RBF network. The particular tree-based architecture examined in this study is the bumptree, which provides efficient access to collections of functions on a Euclidean space of interest. 
A bumptree can be viewed as a natural generalisation of several other geometric data structures including oct-trees, k-d trees, balltrees (Omohundro, 1987) and boxtrees (Omohundro, 1989). \n\nIn this paper we present the results of a comparative study of the performance of the three types of neural networks described above over a wide range of classification problems. The performance of the networks was assessed in terms of the percentage of correct classifications on a test, or generalisation, data set, and the time taken to train the network. Before discussing the results obtained we shall give an outline of the implementation of our bumptree neural network, since this is more novel than the other two networks. \n\n2 THE BUMPTREE NEURAL NETWORK \n\nBumptree neural networks share many of the underlying principles of decision trees but differ from them in the manner in which patterns are classified. Decision trees partition the problem space into increasingly small areas. Classification is then achieved by determining the lowest branch of the tree which contains a reference to the specified point. The bumptree neural network described in this paper also employs a tree-based structure to partition the problem space, with each branch of the tree being based on multiple dimensions. Once the problem space has been partitioned, each branch can be viewed as an individual neural network modelling its own local area of the problem space, able to deal with patterns from multiple output classes. \n\nBumptrees model the problem space by subdividing it, allowing each division to be described by a separate function. Initial partitioning of the problem space is achieved by randomly assigning values to the root-level functions. A learning algorithm is applied to determine the area of influence of each function, and an associated error is calculated. 
If the error exceeds some threshold of acceptability then the area in question is further subdivided by the addition of two functions; this process continues until satisfactory performance is achieved. The bumptree employed in this study is essentially a binary tree in which each leaf of the tree corresponds to a function of interest, although the possibility exists that one of the functions could effectively be redundant if it fails to attract any of the patterns from its parent function. \n\nA number of problems had to be resolved in the design and implementation of the bumptree. Firstly, an appropriate procedure had to be adopted for partitioning the problem space. Secondly, consideration had to be given to the type of learning algorithm to be employed. And finally, the mechanism for calculating the output of the network had to be determined. A detailed discussion of these issues and the solutions adopted now follows. \n\n2.1 PARTITIONING THE PROBLEM SPACE \n\nThe bumptree used in this study employed Gaussian functions to partition the problem space, with two functions being added each time the space was partitioned. Patterns were assigned to whichever of the functions had the higher activation level, with the restriction that the functions below the root level could only be active on patterns that activated their parents. To calculate the activation of the Gaussian function the following expression was used: \n\nA_{fp} = \\exp\\left( - \\sum_{i=1}^{i_{max}} \\frac{(In_{pi} - C_{fi})^2}{\\sigma_{fi}^2} \\right)   (1) \n\nwhere A_{fp} is the activation of function f on pattern p over all the input dimensions, \\sigma_{fi} is the radius of function f in input dimension i, C_{fi} is the centre of function f in input dimension i, and In_{pi} is the ith dimension of the pth input vector. \n\nIt was found that the locations and radii of the functions had an important impact on the performance of the network. 
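As an illustration, the Gaussian activation of expression (1) can be sketched in a few lines. This is a hedged reconstruction, not the authors' code: the function name is ours, and the exact normalisation of the exponent is an assumption consistent with the symbol definitions above.

```python
import numpy as np

def gaussian_activation(centre, radius, pattern):
    # Activation of one bumptree function on one input pattern:
    # exp(-sum_i ((In_pi - C_fi)^2 / sigma_fi^2)) -- our reading of (1).
    centre = np.asarray(centre, dtype=float)
    radius = np.asarray(radius, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    return float(np.exp(-np.sum(((pattern - centre) / radius) ** 2)))

# With unit radii (the paper's choice for inputs normalised to [0, 1]),
# a pattern at the centre gives the maximum activation of 1.0.
print(gaussian_activation([0.5, 0.5], [1.0, 1.0], [0.5, 0.5]))  # → 1.0
```

Patterns are then simply assigned to whichever candidate function returns the larger value.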
In the original bumptree introduced by Omohundro, every function below the root level was required to be wholly enclosed by its parent function. This restriction was found to degrade the performance of the bumptree, particularly if a function had a very small radius, since this would produce very low levels of activation for most patterns. In our studies we relaxed this constraint by setting the radius of each function to one, since the data presented to the bumptree was always normalised between zero and one. This modification led to an improved performance. \n\nA number of different techniques were examined in order to position the functions effectively in the problem space. The first approach considered, and the simplest, involved selecting two initial sets of centres for the root function, with the centre in each dimension being allocated a value between zero and one. The functions at the lower levels of the tree were assigned in a similar manner, with the requirement that their centres fell within the area of the problem space for which their parent function was active. The use of non-hierarchical clustering techniques such as the Forgy method or the K-means clustering technique developed by MacQueen provided other alternatives for positioning the functions. The approach finally adopted for this study was the multiple-initial function (MIF) technique. \n\nIn the MIF procedure ten sets of function centres were initially defined by random assignment, and each pattern in the training set was assigned to the function with the highest activation level. A \"goodness\" measure was then determined for each function over all patterns for which the function was active. The goodness measure was defined as the square of the error between the calculated and observed values divided by the number of active patterns. 
The function with the best value was retained, and the remaining functions that were active on one or more patterns had their centres averaged in each dimension to provide a second function. The two functions were then added to the network structure and the patterns assigned to the function which gave the greater activation. \n\n2.2 THE LEARNING ALGORITHM \n\nA bumptree neural network comprises a number of functions, each function having its own individual weight and bias parameters and each function being responsive to different characteristics in the training set. The bumptree employed a weighted value for every input-to-output connection and a single bias value for each output unit. Several different learning algorithms for determining the weight and bias values were considered, together with a genetic algorithm approach (Williams, 1993). A one-shot learning algorithm was finally adopted since this gave good results and was computationally efficient. The algorithm used a pseudo-matrix-inversion technique to determine the weight and bias parameters of each function after a single presentation of the relevant patterns in the training set had been made. The output of any function for a given pattern p was determined from \n\nao_{ipz} = \\sum_{j=1}^{j_{max}} a_{ijz} x_j^{(p)} + \\mu_{iz}   (2) \n\nwhere ao_{ipz} is the output of the zth output unit of the ith function on the pth pattern, j is the input unit, j_{max} is the total number of input units, a_{ijz} is the weight that connects the jth input unit to the zth output unit for the ith function, x_j^{(p)} is the element of the pth pattern concerned with the jth input dimension, and \\mu_{iz} is the bias value for the zth output unit. 
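Equation (2) is simply an affine map applied per output unit. A minimal sketch (the function name and array shapes are our illustrative choices, not the paper's implementation):

```python
import numpy as np

def function_output(weights, bias, pattern):
    # ao_z = sum_j a_jz * x_j + mu_z  -- equation (2) for one function,
    # with weights of shape (j_max, z_max) and bias of shape (z_max,).
    return (np.asarray(pattern, dtype=float) @ np.asarray(weights, dtype=float)
            + np.asarray(bias, dtype=float))

# Two inputs, one output unit: 1*1.0 + 2*0.5 + bias 0.5 = 2.5
print(function_output([[1.0], [2.0]], [0.5], [1.0, 0.5]))
```

The predicted class of a pattern is then the output unit with the largest value.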
\n\nThe weight and bias parameters were determined by minimising the squared error given in (3), where E_i is the error of the ith function across all output dimensions (z_{max}), for all patterns upon which the function is active (p_{max}). The desired output for the zth output dimension is t_{pz}, and ao_{ipz} is the actual output of the ith function on the zth dimension of the pth pattern. The weight values are again represented by a_{ijz} and the bias by \\mu_{iz}. \n\nE_i = \\sum_{p=1}^{p_{max}} \\sum_{z=1}^{z_{max}} ( t_{pz} - ao_{ipz} )^2   (3) \n\nAfter the derivatives with respect to a_{ijz} and \\mu_{iz} were determined, it was a simple task to arrive at the three matrices used to calculate the weight and bias values for the individual functions. Problems were encountered in the matrix inversion when dealing with functions which were only active on a few patterns and which were far removed from the root level of the tree; this led to difficulties with singular matrices. It was found that the problem could be overcome by using the Gauss-Jordan singular decomposition technique for the pseudo-inversion of the matrices. \n\n2.3 CALCULATION OF THE NETWORK OUTPUT \n\nThe difficulty in determining the output of the bumptree was that there were usually functions at different levels of the tree that gave slightly different outputs for each active pattern. Several different approaches were studied in order to resolve the difficulty, including using the normalised output of all the active functions in the tree irrespective of their level in the structure. The technique which gave good results and was used in this study calculated the output for a pattern solely from the output of the lowest-level active function in the tree. The final output class of a pattern was given by the output unit with the highest level of activation. \n\n3 NETWORK PERFORMANCES \n\nThe performance of the bumptree neural network was compared against that of the standard MLP and RBF networks on a number of different problems. 
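Before turning to the experiments, the one-shot fit of Section 2.2 can be sketched as follows. This is a hedged illustration, not the authors' code: the function name and the trick of folding the bias into the design matrix are ours; only the idea of minimising (3) via an SVD-based pseudo-inverse comes from the paper.

```python
import numpy as np

def one_shot_fit(patterns, targets):
    # Least-squares weights/bias for one function, minimising the squared
    # error (3) with a pseudo-inverse. patterns: (p_max, j_max) array of
    # the patterns the function is active on; targets: (p_max, z_max).
    # Appending a column of ones folds the bias mu_z into a single solve;
    # np.linalg.pinv is SVD-based, so near-singular designs (functions
    # active on only a few patterns) are still handled.
    X = np.hstack([np.asarray(patterns, dtype=float),
                   np.ones((len(patterns), 1))])
    coeffs = np.linalg.pinv(X) @ np.asarray(targets, dtype=float)
    return coeffs[:-1], coeffs[-1]   # weights (j_max, z_max), bias (z_max,)

# Recover an exactly linear mapping: targets = patterns @ [[1], [2]] + 0.5
pats = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
tgts = pats @ np.array([[1.], [2.]]) + 0.5
w, b = one_shot_fit(pats, tgts)
print(np.allclose(w, [[1.], [2.]]), np.allclose(b, [0.5]))  # → True True
```

Because each function is fitted once, on its active patterns only, training cost stays far below an iterative scheme such as backpropagation.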
The bumptree used the MIF placing technique, in which the radius of each function was set to one. This particular implementation of the bumptree will now be referred to as the MIF bumptree. The MLP used the standard backpropagation algorithm (Rumelhart, 1986) with a learning rate of 0.25 and a momentum value of 0.9. The initial weights and bias values of the network were set to random values between -2 and +2. The number of hidden units assigned to the network was determined empirically over several runs by varying the number of hidden units until the best generalisation performance was attained. The RBF network used four different types of function: Gaussian, multi-quadratic, inverse multi-quadratic and thin-plate splines. The RBF network placed the functions using sample points within the problem space covered by the training set. \n\n3.1 INITIAL STUDIES \n\nIn the initial studies, a set of classical non-linear problems was used to compare the performance of the three types of networks. The set consisted of the XOR, Parity(6) and Encoder(8) problems. The average results obtained over 10 runs for each of the data sets are shown in Table 1; the figures presented are the percentage of patterns correctly classified in the training set, together with the standard deviation. \n\nTable 1. Percentage of patterns correctly classified on the three data sets for each network type. \n\nDATA SET     MLP    RBF            MIF \nXOR          100    100            100 \nParity(6)    100    92.1 ± 4.7     98.3 ± 4.2 \nEncoder(8)   100    82.5 ± 16.8    100 \n\nFor the XOR problem the MLP network required an average of 222 iterations with an architecture of 4 hidden units; for the parity problem, an architecture of 10 hidden units and an average of 1133 iterations; and finally, for the encoder problem the network required an average of 1900 iterations for an architecture consisting of three hidden units. 
\n\nThe RBF network correctly classified all the patterns of the XOR data set when four multi-quadratic, inverse multi-quadratic or Gaussian functions were used. For the Parity(6) problem the best result was achieved with a network employing between 60 and 64 inverse multi-quadratic functions. In the case of the encoder problem the best performance was obtained using a network of 8 multi-quadratic functions. \n\nThe MIF bumptree required two functions to achieve perfect classification for the XOR and encoder problems, and an average of 40 functions in order to achieve the best performance on the parity problem. Thus in the case of the XOR and encoder problems no further functions were required in addition to the root functions. \n\nA comparison of the training times taken by each of the networks revealed considerable differences. The MLP required the most extensive training time since it used the backpropagation training algorithm, which is an iterative procedure. The RBF network required less training time than the MLP, but suffered from the fact that for all the patterns in the training set the activity of all the functions had to be calculated in order to arrive at the optimal weights. The bumptree proved to have the quickest training time for the parity and encoder problems and a training time comparable to that taken by the RBF network for the XOR problem. This superiority arose because the bumptree used a non-iterative training procedure, and a function was only trained on those members of the training set for which the function was active. \n\nIn considering the sensitivity of the different networks to the parameters chosen, some interesting results emerge. The performance of the MLP was found to be dependent on the number of hidden units assigned to the network. 
When insufficient hidden units were allocated, the performance of the MLP degraded. The performance of the RBF network was also found to be highly influenced by the values taken for various parameters, in particular the number and type of functions employed by the network. The bumptree, on the other hand, was assigned the same set of parameters for all the problems studied and was found to be less sensitive than the other two networks to the parameter settings. \n\n3.2 COMPARISON OF GENERALISATION PERFORMANCE \n\nThe performance of the three different networks was also measured on a set of four 'real-world' problems, which allowed the generalisation performance of each network to be determined. A summary of the results taken over 10 runs is given in Table 2. \n\nTable 2. Performance of the networks on the training and generalisation data sets of the test problems. \n\nDATA          NETWORK   FUNCTIONS / HIDDEN UNITS   TRAINING      TEST \nIris          MLP       4                          100           95.7 ± 0.6 \n              RBF       75 Gaussians               100           96.0 ± 0.0 \n              MIF       8                          100           97.5 ± 0.4 \nSkin Cancer   MLP       6                          88.7 ± 4.3    79.2 ± 1.7 \n              RBF       10 multi-quad.             84.4 ± 3.2    80.3 ± 4.4 \n              MIF       4                          79.8 ± 5.2    80.8 ± 1.9 \nVowel Data    MLP       20                         82.4 ± 5.3    77.1 ± 6.6 \n              RBF       50 thin-plate spl.         82.1 ± 1.5    77.8 ± 1.4 \n              MIF       104                        86.5 ± 5.6    73.6 ± 4.6 \nDiabetes      MLP       16                         82.5 ± 2.7    78.9 ± 1.2 \n              RBF       25 thin-plate spl.         76.0 ± 0.8    78.9 ± 0.9 \n              MIF       3                          76.5 ± 1.2    80.0 ± 1.1 \n\nAll three networks produced a comparable performance on the test problems, but in the case of the bumptree this was achieved with a training time substantially less than that required by the other networks. Inspection of the results also shows that the bumptree in general required fewer functions than the RBF network. 
\n\n\f246 \n\nBostock and Harget \n\nThe results shown above for the bumptree were obtained with the same set of parameters \nused in the initial study which further confirms its lack of sensitivity to parameter \nsettings. \n\n4. CONCLUSION \nA comparative study of the performance of three different types of networks, one of which \nis novel, has been conducted on a wide range of problems. The results show that the \nperformance of the bumptree compared very favourably, both in terms of generalisation \nand training times, with the more traditional MLP and RBF networks. In addition, the \nperformance of the bumptree proved to be less sensitive to the parameters settings than \nthe other networks. These results encourage us to continue further investigation of the \nbumptree neural network and lead us to conclude that it has a valid place in the list of \ncurrent neural networks. \n\nAcknowledgement \nWe gratefully acknowledge the assistance given by Richard Rohwer. \n\n(1991) Time Series Prediction by Adaptive Networks: A \n\nReferences \nBostock R.T 1. & Harget Al. (1992) Towards a Neural Network Based System for Skin \nCancer Diagnosis: lEE Third International Conference on Artificial Neural Networks: \nP21S-220. \nBroomhead D.S. & Lowe D. (1988) Radial Basis Functions, Multi-Variable Functional \nInterpolation and Adaptive Networks: RSRE Memorandum No. 4148, Royal Signals and \nRadar Establishment, Malvern, England. \nGentric P. & Withagen H.C.A.M. (1993) Constructive Methods for a New Classifier \nBased on a Radial Basis Function Network Accelerated by a Tree: Report, Eindhoven \nTechnical University, Eindhoven, Holland. \nLowe D. & Webb A.R. \nDynamical Systems Perspective: lEE Proceedings-F, vol. 128(1), Feb.\" P17-24. \nMoody J. & Darken C. (1988) Learning With Localized Receptive Fields: Research \nReport YALE UID CSIRR-649. \nOmohundro S.M. (1987) Efficient Algorithms With Neural Network Behaviour; in \nComplex Systems 1 (1987): P273-347. \nOmohundro S.M. 
(1989) Five Balltree Construction Algorithms: International Computer Science Institute Technical Report TR-89-063. \nOmohundro S.M. (1991) Bumptrees for Efficient Function, Constraint, and Classification Learning: Advances in Neural Information Processing Systems 3, P693-699. \nRenals S. & Rohwer R.J. (1989) Phoneme Classification Experiments Using Radial Basis Functions: Proceedings of the IJCNN, P461-467. \nRumelhart D.E., Hinton G.E. & Williams R.J. (1986) Learning Internal Representations by Error Propagation: in Parallel Distributed Processing, vol. 1, P318-362. Cambridge, MA: MIT Press. \nWilliams B.V., Bostock R.T.J., Bounds D.G. & Harget A.J. (1993) The Genetic Bumptree Classifier: Proceedings of the BNSS Symposium on Artificial Neural Networks: to be published. \n", "award": [], "sourceid": 823, "authors": [{"given_name": "Richard", "family_name": "Bostock", "institution": null}, {"given_name": "Alan", "family_name": "Harget", "institution": null}]}