{"title": "A Comparative Study of a Modified Bumptree Neural Network with Radial Basis Function Networks and the Standard Multi Layer Perceptron", "book": "Advances in Neural Information Processing Systems", "page_first": 240, "page_last": 246, "abstract": null, "full_text": "A  Comparative  Study  Of  A  Modified \nBumptree  Neural  Network  With  Radial  Basis \nFunction  Networks  and  the  Standard  Multi(cid:173)\nLayer  Perceptron. \n\nRichard  T .J.  Bostock  and  Alan  J.  Harget \n\nDepartment of Computer Science &  Applied Mathematics \n\nAston University \n\nBinningham \n\nEngland \n\nAbstract \n\nBumptrees  are  geometric  data  structures  introduced  by  Omohundro \n(1991)  to  provide  efficient  access  to  a  collection  of functions  on  a \nEuclidean space of interest.  We describe a modified bumptree structure \nthat has been employed as  a neural network classifier, and compare its \nperformance on  several classification  tasks against that of radial basis \nfunction networks and the standard mutIi-Iayer perceptron. \n\n1 \n\nINTRODUCTION \n\nA  number of neural  network  studies  have  demonstrated  the  utility  of the  multi-layer \nperceptron  (MLP)  and shown  it to  be a  highly effective paradigm.  Studies  have  also \nshown,  however,  that  the  MLP is  not without its  problems,  in  particular  it requires  an \nextensive  training  time, is susceptible  to  local  minima problems and  its  perfonnance is \ndependent  upon  its  internal  network  architecture.  In  an  attempt  to  improve  upon  the \ngeneralisation  performance and computational efficiency a number of studies have been \nundertaken principally concerned with investigating the parametrisation of the MLP.  It is \nwell known, for example, that the generalisation performance of the MLP is affected by \nthe number of hidden units in the network, which have to be determined empirically since \ntheory provides  no guidance.  A number of investigations have been conducted into the \npossibility of automatically determining  the  number of hidden  units  during  the  training \nphase (BostOCk,  1992).  The results  show that architectures  can  be  attained which  give \nsatisfactory, although generally sub-optimal, perfonnance. \n\nAlternative network architectures such as  the Radial Basis Function (RBF) network have \nalso been studied in an  attempt to improve upon  the performance of the  MLP  network. \nThe RBF network  uses  basis functions  in  which  the  weights  are  effective over only  a \nsmall  portion  of the  input  space.  This  is  in  contrast  to  the  MLP network  where  the \nweights  are  used  in  a  more  global  fashion,  thereby  encoding the  characteristics  of the \ntraining set in  a more compact form.  RBF networks can  be rapidly trained  thus making \n\n240 \n\n\fModified Bumptree Neural Network and Standard Multi-Layer Perceptron \n\n241 \n\nthem  particularly suitable for situations where on-line incremental learning is required. \nThe RBF network has  been  successfully applied  in  a  number of areas  such  as  speech \nrecognition (Renals,  1992) and financial forecasting  (Lowe,  1991).  Studies indicate that \nthe  RBF  network  provides  a  viable  alternative  to  the  MLP approach  and  thus  offers \nencouragement  that  networks  employing  local  solutions  are  worthy  of  further \ninvestigation. \n\nIn the past few years  there has been an increasing interest in neural network architectures \nbased on tree structures.  
Important work in this area has been carried out by Omohundro (1991) and Gentric and Withagen (1993). These studies suggest that neural networks employing a tree-based structure should offer the same benefit of reduced training time as the RBF network. The particular tree-based architecture examined in this study is the bumptree, which provides efficient access to collections of functions on a Euclidean space of interest. A bumptree can be viewed as a natural generalisation of several other geometric data structures including oct-trees, k-d trees, balltrees (Omohundro, 1987) and boxtrees (Omohundro, 1989).\n\nIn this paper we present the results of a comparative study of the performance of the three types of neural networks described above over a wide range of classification problems. The performance of the networks was assessed in terms of the percentage of correct classifications on a test, or generalisation, data set and the time taken to train the network. Before discussing the results obtained we shall give an outline of the implementation of our bumptree neural network, since this is more novel than the other two networks.\n\n2 THE BUMPTREE NEURAL NETWORK\n\nBumptree neural networks share many of the underlying principles of decision trees but differ from them in the manner in which patterns are classified. Decision trees partition the problem space into increasingly small areas; classification is then achieved by determining the lowest branch of the tree which contains a reference to the specified point. The bumptree neural network described in this paper also employs a tree-based structure to partition the problem space, with each branch of the tree being based on multiple dimensions. Once the problem space has been partitioned, each branch can be viewed as an individual neural network modelling its own local area of the problem space and able to deal with patterns from multiple output classes.\n\nBumptrees model the problem space by subdividing it, allowing each division to be described by a separate function. Initial partitioning of the problem space is achieved by randomly assigning values to the root level functions. A learning algorithm is applied to determine the area of influence of each function and an associated error is calculated. If the error exceeds some threshold of acceptability then the area in question is further subdivided by the addition of two functions; this process continues until satisfactory performance is achieved. The bumptree employed in this study is essentially a binary tree in which each leaf of the tree corresponds to a function of interest, although the possibility exists that one of the functions could effectively be redundant if it fails to attract any of the patterns from its parent function.\n\nA number of problems had to be resolved in the design and implementation of the bumptree. Firstly, an appropriate procedure had to be adopted for partitioning the problem space. Secondly, consideration had to be given to the type of learning algorithm to be employed. And finally, the mechanism for calculating the output of the network had to be determined. A detailed discussion of these issues and the solutions adopted now follows.\n
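The subdivision scheme just described can be sketched in a few lines of Python. The fragment below is illustrative rather than the authors' implementation: the fit, error and activation routines are supplied by the caller, and the splitting threshold and the random choice of the two child centres are placeholder assumptions.\n\nimport numpy as np\n\nclass BumptreeNode:\n    def __init__(self, centre):\n        self.centre = centre    # gaussian centre (numpy array, one value per input dimension)\n        self.children = []      # two children after a split, empty for a leaf\n        self.model = None       # local weights and bias fitted to the node's patterns\n\ndef grow(node, X, T, fit, error, activation, threshold, depth):\n    # Fit a local model on the patterns owned by this node; if its error is still\n    # unacceptable, add two child functions and hand every pattern to the child\n    # with the higher activation, then recurse (Section 2).\n    node.model = fit(X, T)\n    if error(node.model, X, T) <= threshold or depth == 0 or len(X) < 2:\n        return\n    # child centres drawn near the parent centre and clipped to [0, 1]: an assumption\n    node.children = [BumptreeNode(np.clip(node.centre + np.random.uniform(-0.25, 0.25, node.centre.shape), 0.0, 1.0)) for _ in range(2)]\n    owner = np.argmax([activation(child.centre, X) for child in node.children], axis=0)\n    for k, child in enumerate(node.children):\n        if (owner == k).any():\n            grow(child, X[owner == k], T[owner == k], fit, error, activation, threshold, depth - 1)\n\nA child that attracts no patterns is simply never recursed into, which corresponds to the possibility of a redundant function noted above.\n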
\n2.1 PARTITIONING THE PROBLEM SPACE\n\nThe bumptree used in this study employed gaussian functions to partition the problem space, with two functions being added each time the space was partitioned. Patterns were assigned to whichever of the functions had the higher activation level, with the restriction that the functions below the root level could only be active on patterns that activated their parents. To calculate the activation of the gaussian function the following expression was used:\n\n(1)\n\nwhere A_fp is the activation of function f on pattern p over all the input dimensions, a_fi is the radius of function f in input dimension i, c_fi is the centre of function f in input dimension i, and In_pi is the ith dimension of the pth input vector.\n\nIt was found that the locations and radii of the functions had an important impact on the performance of the network. In the original bumptree introduced by Omohundro every function below the root level was required to be wholly enclosed by its parent function. This restriction was found to degrade the performance of the bumptree, particularly if a function had a very small radius, since this would produce very low levels of activation for most patterns. In our studies we relaxed this constraint by setting the radius of each function to one, since the data presented to the bumptree was always normalised between zero and one. This modification led to an improved performance.\n\nA number of different techniques were examined in order to position the functions effectively in the problem space. The first approach considered, and the simplest, involved selecting two initial sets of centres for the root functions, with the centre in each dimension being allocated a value between zero and one. The functions at the lower levels of the tree were assigned in a similar manner, with the requirement that their centres fell within the area of the problem space for which their parent function was active. Non-hierarchical clustering techniques such as the Forgy method or the K-means clustering technique developed by MacQueen provided other alternatives for positioning the functions. The approach finally adopted for this study was the multiple-initial-function (MIF) technique.\n\nIn the MIF procedure ten sets of function centres were initially defined by random assignment and each pattern in the training set was assigned to the function with the highest activation level. A \"goodness\" measure was then determined for each function over all patterns for which the function was active; it was defined as the square of the error between the calculated and observed values divided by the number of active patterns. The function with the best value was retained, and the remaining functions that were active on one or more patterns had their centres averaged in each dimension to provide a second function. The functions were then added to the network structure and the patterns assigned to the function which gave the greater activation.\n
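As a concrete illustration of the gaussian activation in (1) and of the MIF goodness measure, a short Python sketch is given below. It is a reconstruction from the quantities defined above (centre c_fi, radius a_fi, input In_pi) rather than the authors' code, and the exact normalisation of the exponent is an assumption.\n\nimport numpy as np\n\ndef gaussian_activation(centre, radius, X):\n    # Activation of one bump function on each row of X (patterns x dimensions).\n    # Assumed form: exp(-sum_i ((x_i - c_i) / a_i)^2).  With the radius fixed to one\n    # and inputs normalised to [0, 1] (Section 2.1) this reduces to exp(-||x - c||^2).\n    d2 = np.sum(((X - centre) / radius) ** 2, axis=1)\n    return np.exp(-d2)\n\ndef mif_goodness(calculated, observed, n_active):\n    # MIF goodness measure: the square of the error between the calculated and\n    # observed values divided by the number of active patterns (lower is better).\n    return np.sum((calculated - observed) ** 2) / n_active\n\n# Example: activations of a function centred at the origin of the unit square.\nX = np.random.rand(5, 2)\nactivations = gaussian_activation(np.zeros(2), 1.0, X)\n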
\n2.2 THE LEARNING ALGORITHM\n\nA bumptree neural network comprises a number of functions, each function having its own individual weight and bias parameters and each function being responsive to different characteristics in the training set. The bumptree employed a weighted value for every input to output connection and a single bias value for each output unit. Several different learning algorithms for determining the weight and bias values were considered, together with a genetic algorithm approach (Williams, 1993). A one-shot learning algorithm was finally adopted since this gave good results and was computationally efficient. The algorithm used a pseudo-matrix inversion technique to determine the weight and bias parameters of each function after a single presentation of the relevant patterns in the training set had been made. The output of any function for a given pattern p was determined from\n\nao_ipz = sum_{j=1..jmax} a_ijz * x_j(p) + b_iz    (2)\n\nwhere ao_ipz is the output of the zth output unit of the ith function on the pth pattern, j is the input unit, jmax is the total number of input units, a_ijz is the weight that connects the jth input unit to the zth output unit for the ith function, x_j(p) is the element of the pth pattern concerned with the jth input dimension, and b_iz is the bias value for the zth output unit.\n\nThe weight and bias parameters were determined by minimising the squared error given in (3), where E_i is the error of the ith function across all output dimensions (zmax) for all patterns upon which the function is active (pmax), t_pz is the desired output for the zth output dimension on the pth pattern, and ao_ipz is the actual output of the ith function on the zth dimension of the pth pattern. The weight values are again represented by a_ijz and the bias by b_iz.\n\nE_i = sum_{p=1..pmax} sum_{z=1..zmax} (t_pz - ao_ipz)^2    (3)\n\nOnce the derivatives with respect to a_ijz and b_iz were determined it was a simple task to arrive at the three matrices used to calculate the weight and bias values for the individual functions. Problems were encountered in the matrix inversion when dealing with functions which were only active on a few patterns and which were far removed from the root level of the tree; this led to difficulties with singular matrices. It was found that the problem could be overcome by using the Gauss-Jordan singular decomposition technique for the pseudo-inversion of the matrices.\n\n2.3 CALCULATION OF THE NETWORK OUTPUT\n\nThe difficulty in determining the output of the bumptree was that there were usually functions at different levels of the tree that gave slightly different outputs for each active pattern. Several different approaches were studied in order to resolve the difficulty, including using the normalised output of all the active functions in the tree irrespective of their level in the structure. The technique which gave good results, and was used in this study, calculated the output for a pattern solely from the output of the lowest level active function in the tree. The final output class of a pattern was then given by the output unit with the highest level of activation.\n
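The one-shot fit of Section 2.2 is, for each function, a linear least-squares problem over the patterns that function is active on. The sketch below is illustrative only: it uses numpy's pseudo-inverse in place of the Gauss-Jordan routine mentioned above, and the device of appending a constant column to absorb the bias is an implementation assumption.\n\nimport numpy as np\n\ndef fit_function(X, T):\n    # X: active patterns (pmax x jmax), T: desired outputs (pmax x zmax).\n    # Minimises the squared error (3); returns the weight matrix for this function\n    # (jmax x zmax) and one bias value per output unit.\n    Xa = np.hstack([X, np.ones((X.shape[0], 1))])   # constant column absorbs the bias\n    W = np.linalg.pinv(Xa) @ T                      # one-shot least-squares solution\n    return W[:-1, :], W[-1, :]\n\ndef function_output(weights, bias, X):\n    # Equation (2): ao_pz = sum_j a_jz * x_j(p) + b_z for every pattern in X.\n    return X @ weights + bias\n\ndef predicted_class(weights, bias, X):\n    # Section 2.3: the predicted class is the output unit with the highest activation\n    # of the lowest-level active function (here, this function).\n    return np.argmax(function_output(weights, bias, X), axis=1)\n\nAppending the constant column yields the same weights as fitting an explicit bias term, while keeping a single pseudo-inversion per function.\n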
\n3 NETWORK PERFORMANCES\n\nThe performance of the bumptree neural network was compared against that of the standard MLP and RBF networks on a number of different problems. The bumptree used the MIF placing technique, with the radius of each function set to one; this particular implementation of the bumptree will be referred to as the MIF bumptree. The MLP used the standard backpropagation algorithm (Rumelhart, 1986) with a learning rate of 0.25 and a momentum value of 0.9. The initial weights and bias values of the network were set to random values between -2 and +2, and the number of hidden units assigned to the network was determined empirically over several runs by varying the number of hidden units until the best generalisation performance was attained. The RBF network used four different types of function: gaussian, multi-quadratic, inverse multi-quadratic and thin plate splines. The RBF network placed the functions using sample points within the problem space covered by the training set.\n\n3.1 INITIAL STUDIES\n\nIn the initial studies, a set of classical non-linear problems was used to compare the performance of the three types of networks. The set consisted of the XOR, Parity(6) and Encoder(8) problems. The average results obtained over 10 runs for each of the data sets are shown in Table 1; the figures presented are the percentage of patterns correctly classified in the training set together with the standard deviation.\n\nTable 1. Percentage of Patterns Correctly Classified for the Three Data Sets for each Network Type.\n\nDATA SET      MLP    RBF              MIF\nXOR           100    100              100\nParity(6)     100    92.1 \u00b1 4.7      98.3 \u00b1 4.2\nEncoder(8)    100    82.5 \u00b1 16.8     100\n\nFor the XOR problem the MLP network required an average of 222 iterations with an architecture of 4 hidden units; for the parity problem, an architecture of 10 hidden units and an average of 1133 iterations; and for the encoder problem, an average of 1900 iterations with an architecture consisting of three hidden units.\n\nThe RBF network correctly classified all the patterns of the XOR data set when four multi-quadratic, inverse multi-quadratic or gaussian functions were used. For the Parity(6) problem the best result was achieved with a network employing between 60 and 64 inverse multi-quadratic functions. In the case of the encoder problem the best performance was obtained using a network of 8 multi-quadratic functions.\n\nThe MIF bumptree required two functions to achieve perfect classification for the XOR and encoder problems and an average of 40 functions to achieve its best performance on the parity problem. Thus in the case of the XOR and encoder problems no functions were required in addition to the root functions.\n\nA comparison of the training times taken by each of the networks revealed considerable differences. The MLP required the most extensive training time since it used the backpropagation training algorithm, which is an iterative procedure.
The RBF network required less training time than the MLP, but suffered from the fact that for all the patterns in the training set the activity of all the functions had to be calculated in order to arrive at the optimal weights. The bumptree proved to have the quickest training time for the parity and encoder problems and a training time comparable to that of the RBF network for the XOR problem. This superiority arose because the bumptree used a non-iterative training procedure, and a function was only trained on those members of the training set for which the function was active.\n\nIn considering the sensitivity of the different networks to the parameters chosen, some interesting results emerge. The performance of the MLP was found to be dependent on the number of hidden units assigned to the network; when insufficient hidden units were allocated the performance of the MLP degraded. The performance of the RBF network was also found to be highly influenced by the values taken for various parameters, in particular the number and type of functions employed by the network. The bumptree, on the other hand, was assigned the same set of parameters for all the problems studied and was found to be less sensitive than the other two networks to the parameter settings.\n\n3.2 COMPARISON OF GENERALISATION PERFORMANCE\n\nThe performance of the three networks was also measured on a set of four 'real-world' problems, which allowed the generalisation performance of each network to be determined. A summary of the results taken over 10 runs is given in Table 2.\n\nTable 2. Performance of the Networks on the Training and Generalisation Data Sets of the Test Problems.\n\nDATA          NETWORK   FUNCTIONS / HIDDEN UNITS   TRAINING         TEST\nIris          MLP       4 hidden units             100              95.7 \u00b1 0.6\nIris          RBF       75 gaussians               100              96.0 \u00b1 0.0\nIris          MIF       8 functions                100              97.5 \u00b1 0.4\nSkin Cancer   MLP       6 hidden units             88.7 \u00b1 4.3      79.2 \u00b1 1.7\nSkin Cancer   RBF       10 multi-quad              84.4 \u00b1 3.2      80.3 \u00b1 4.4\nSkin Cancer   MIF       4 functions                79.8 \u00b1 5.2      80.8 \u00b1 1.9\nVowel Data    MLP       20 hidden units            82.4 \u00b1 5.3      77.1 \u00b1 6.6\nVowel Data    RBF       50 thin plate spl.         82.1 \u00b1 1.5      77.8 \u00b1 1.4\nVowel Data    MIF       104 functions              86.5 \u00b1 5.6      73.6 \u00b1 4.6\nDiabetes      MLP       16 hidden units            82.5 \u00b1 2.7      78.9 \u00b1 1.2\nDiabetes      RBF       25 thin plate spl.         76.0 \u00b1 0.8      78.9 \u00b1 0.9\nDiabetes      MIF       3 functions                76.5 \u00b1 1.2      80.0 \u00b1 1.1\n\nAll three networks produce a comparable performance on the test problems, but in the case of the bumptree this was achieved with a training time substantially less than that required by the other networks. Inspection of the results also shows that the bumptree generally required fewer functions than the RBF network.\n\nThe results shown above for the bumptree were obtained with the same set of parameters used in the initial study, which further confirms its lack of sensitivity to parameter settings.\n\n4 CONCLUSION\n\nA comparative study of the performance of three different types of network, one of which is novel, has been conducted on a wide range of problems. The results show that the performance of the bumptree compared very favourably, both in terms of generalisation and training times, with the more traditional MLP and RBF networks.
In addition, the performance of the bumptree proved to be less sensitive to the parameter settings than that of the other networks. These results encourage us to continue further investigation of the bumptree neural network and lead us to conclude that it has a valid place among current neural networks.\n\nAcknowledgement\n\nWe gratefully acknowledge the assistance given by Richard Rohwer.\n\nReferences\n\nBostock R.T.J. & Harget A.J. (1992) Towards a Neural Network Based System for Skin Cancer Diagnosis: IEE Third International Conference on Artificial Neural Networks, P215-220.\nBroomhead D.S. & Lowe D. (1988) Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks: RSRE Memorandum No. 4148, Royal Signals and Radar Establishment, Malvern, England.\nGentric P. & Withagen H.C.A.M. (1993) Constructive Methods for a New Classifier Based on a Radial Basis Function Network Accelerated by a Tree: Report, Eindhoven Technical University, Eindhoven, Holland.\nLowe D. & Webb A.R. (1991) Time Series Prediction by Adaptive Networks: A Dynamical Systems Perspective: IEE Proceedings-F, vol. 138(1), Feb., P17-24.\nMoody J. & Darken C. (1988) Learning With Localized Receptive Fields: Research Report YALEU/DCS/RR-649.\nOmohundro S.M. (1987) Efficient Algorithms With Neural Network Behaviour: Complex Systems 1, P273-347.\nOmohundro S.M. (1989) Five Balltree Construction Algorithms: International Computer Science Institute Technical Report TR-89-063.\nOmohundro S.M. (1991) Bumptrees for Efficient Function, Constraint, and Classification Learning: Advances in Neural Information Processing Systems 3, P693-699.\nRenals S. & Rohwer R.J. (1989) Phoneme Classification Experiments Using Radial Basis Functions: Proceedings of the IJCNN, P461-467.\nRumelhart D.E., Hinton G.E. & Williams R.J. (1986) Learning Internal Representations by Error Propagation: in Parallel Distributed Processing, vol. 1, P318-362. Cambridge, MA: MIT Press.\nWilliams B.V., Bostock R.T.J., Bounds D.G. & Harget A.J. (1993) The Genetic Bumptree Classifier: Proceedings of the BNSS Symposium on Artificial Neural Networks, to be published.\n", "award": [], "sourceid": 823, "authors": [{"given_name": "Richard", "family_name": "Bostock", "institution": null}, {"given_name": "Alan", "family_name": "Harget", "institution": null}]}