{"title": "Supervised Learning with Growing Cell Structures", "book": "Advances in Neural Information Processing Systems", "page_first": 255, "page_last": 262, "abstract": null, "full_text": "Supervised Learning with Growing Cell Structures\n\nBernd Fritzke\n\nInstitut für Neuroinformatik\n\nRuhr-Universität Bochum\n\nGermany\n\nAbstract\n\nWe present a new incremental radial basis function network suitable for classification and regression problems. Center positions are continuously updated through soft competitive learning. The width of the radial basis functions is derived from the distance to topological neighbors. During training the observed error is accumulated locally and used to determine where to insert the next unit. In the case of classification problems, this leads to the placement of units near class borders rather than near frequency peaks, as is done by most existing methods. The resulting networks need few training epochs and seem to generalize very well. This is demonstrated by examples.\n\n1 INTRODUCTION\n\nFeed-forward networks of localized (e.g., Gaussian) units are an interesting alternative to the more frequently used networks of global (e.g., sigmoidal) units. It has been shown that with localized units one hidden layer suffices in principle to approximate any continuous function, whereas with sigmoidal units two layers are necessary.\n\nIn the following we consider radial basis function networks similar to those proposed by Moody & Darken (1989) or Poggio & Girosi (1990). Such networks consist of one layer $L$ of Gaussian units. Each unit $c \\in L$ has an associated vector $w_c \\in \\mathbb{R}^n$ indicating the position of the Gaussian in input vector space and a standard deviation $\\sigma_c$. 
For a given input datum $\\xi \\in \\mathbb{R}^n$ the activation of unit $c$ is described by\n\n$$D_c(\\xi) = \\exp\\left(-\\frac{\\|\\xi - w_c\\|^2}{\\sigma_c^2}\\right) \\tag{1}$$\n\nOn top of the layer $L$ of Gaussian units there are $m$ single layer perceptrons. Here, $m$ is the output dimensionality of the problem, which is given by a number of input/output pairs¹ $(\\xi, \\zeta) \\in (\\mathbb{R}^n \\times \\mathbb{R}^m)$. Each of the single layer perceptrons computes a weighted sum of the activations in $L$:\n\n$$O_i(\\xi) = \\sum_{j \\in L} w_{ij} D_j(\\xi), \\qquad i \\in \\{1, \\ldots, m\\} \\tag{2}$$\n\nWith $w_{ij}$ we denote the weighted connection from local unit $j$ to output unit $i$. Training a single layer perceptron to minimize the square error is a very well understood problem which can be solved incrementally by the delta rule or directly by linear algebra techniques (Moore-Penrose inverse). Therefore, the only (but severe) difficulty when using radial basis function networks is choosing the number of local units and their respective parameters, namely center position $w$ and width $\\sigma$.\n\nOne extreme approach is to use one unit per data point and to position the units directly at the data points. If one chooses the width of the Gaussians sufficiently small, it is possible to construct a network which correctly classifies the training data, no matter how complicated the task is (Fritzke, 1994). However, the network size is very large and might even be infinite in the case of a continuous stream of non-repeating stochastic input data. Moreover, such a network can be expected to generalize poorly.\n\nMoody & Darken (1989), in contrast, propose to use a fixed number of local units (which is usually considerably smaller than the total number of data points). These units are first distributed by an unsupervised clustering method (e.g., k-means). 
Thereafter, the weights to the output units are determined by gradient descent. Although good results are reported for this method, it is rather easy to come up with examples where it would not perform well: k-means positions the units based on the density of the training data, specifically near density peaks. However, to approximate the optimal Bayesian a posteriori classifier it would be better to position units near class borders. Class borders, however, often lie in regions with a particularly low data density. Therefore, all methods based on k-means-like unsupervised placement of the Gaussians are in danger of performing poorly with a fixed number of units or, similarly undesirable, of needing a huge number of units to achieve decent performance.\n\nFrom this one can conclude that, in the case of radial basis function networks, it is essential to use the class labels not only for the training of the connection weights but also for the placement of the local units. Doing this forms the core of the method proposed below.\n\n¹ Throughout this article we assume a classification problem and use the corresponding terminology. However, the described method is suitable for regression problems as well.\n\n2 SUPERVISED GROWING CELL STRUCTURES\n\nIn the following we present an incremental radial basis function network which is able to simultaneously determine a suitable number of local units, their center positions and widths as well as the connection weights to the output units. The basic idea is a very simple one:\n\n0. Start with a very small radial basis function network.\n1. Train the current network with some I/O-pairs from the training data.\n2. 
Use the observed accumulated error to determine where in input vector space to insert new units.\n3. If the network does not perform well enough, go to 1.\n\nOne should note that during the training phase (Step 1) error is accumulated over several data items and this accumulated error is used to determine where to insert new units (Step 2). This is different from the approach of Platt (1991), where insertions are based on single poorly mapped patterns. In both cases, however, the goal is to position new units in regions where the current network does not perform well rather than in regions from which many data items stem.\n\nIn our model the center positions of new units are interpolated from the positions of existing units. Specifically, after some adaptation steps we determine the unit $q$ which has accumulated the maximum error and insert a new unit in between $q$ and one of its neighbors in input vector space. The interpolation procedure makes it necessary to allow the center positions of existing units to change. Otherwise, all new units would be restricted to the convex hull of the centers of the initial network.\n\nWe do not necessarily insert a new unit in between $q$ and its nearest neighbor. Rather, we would like to choose one of the units with adjacent Voronoi regions². In the two-dimensional case these are the direct neighbors of $q$ in the Delaunay triangulation (Delaunay-neighbors) induced by all center positions. In higher-dimensional spaces there exists an equivalent based on hypertetrahedrons which, however, is very hard to compute. For this reason, we arrange our units in a certain topological structure (see below) which has the property that if two units are direct neighbors in that structure they are mostly Delaunay-neighbors. 
By this we get, with very little computational effort, an approximate subset of the Delaunay-neighbors which seems to be sufficient for practical purposes.\n\n² The Voronoi region of a unit $c$ denotes the part of the input vector space which consists of points for which $c$ is the nearest unit.\n\n³ A historical reason for this specific approach is the fact that the model was developed from an unsupervised network (see Fritzke, 1993) where the k-dimensional neighborhood was needed to reduce dimensionality. We currently investigate an alternative (and more\n\n2.1 NETWORK STRUCTURE\n\nThe structure of our network is very similar to standard radial basis function networks. The only difference is that we arrange the local units in a k-dimensional topological structure consisting of connected simplices³ (lines for $k = 1$, triangles for $k = 2$, tetrahedrons for $k = 3$ and hypertetrahedrons for larger $k$). This arrangement is done to facilitate the interpolation and adaptation steps described below. The initial network consists of one k-dimensional simplex ($k + 1$ local units fully connected with each other). The neighborhood connections are not weighted and do not directly influence the behavior of the network. They are, however, used to determine the width of the Gaussian functions associated with the units. For each Gaussian unit $c$, let $N_c$ denote the set of its direct topological neighbors in the topological structure. Then the width of $c$ is defined as\n\n$$\\sigma_c = \\frac{1}{|N_c|} \\sum_{d \\in N_c} \\|w_c - w_d\\|_2 \\tag{3}$$\n\nwhich is the mean distance to the topological neighbors. 
If topological neighbors have similar center positions (which will be ensured by the way adaptation and insertion are done), then this leads to a covering of the input vector space with partially overlapping Gaussian functions.\n\n2.2 ADAPTATION\n\nIt was mentioned above that several adaptation steps are done before a new unit is inserted. A single adaptation step is done as follows (see fig. 1):\n\n\u2022 Choose an I/O-pair $(\\xi, \\zeta)$, $\\xi \\in \\mathbb{R}^n$, $\\zeta \\in \\mathbb{R}^m$, from the training data.\n\u2022 Determine the unit $s$ closest to $\\xi$ (the so-called best-matching unit).\n\u2022 Move the centers of $s$ and its direct topological neighbors towards $\\xi$:\n\n$$\\Delta w_s = \\varepsilon_b (\\xi - w_s)$$\n$$\\Delta w_c = \\varepsilon_n (\\xi - w_c) \\quad \\text{for all } c \\in N_s$$\n\n$\\varepsilon_b$ and $\\varepsilon_n$ are small constants with $\\varepsilon_b \\gg \\varepsilon_n$.\n\n\u2022 Compute for each local unit $c \\in L$ the activation $D_c(\\xi)$ (see eqn. 1).\n\u2022 Compute for each output unit $i$ the activation $O_i$ (see eqn. 2).\n\u2022 Compute the square error by\n\n$$\\mathrm{SE} = \\sum_{i=1}^{m} (\\zeta_i - O_i)^2$$\n\n\u2022 Accumulate error at the best-matching unit $s$:\n\n$$\\Delta \\mathrm{err}_s = \\mathrm{SE}$$\n\n\u2022 Make a delta-rule step for the weights ($\\alpha$ denotes the learning rate):\n\n$$\\Delta w_{ij} = \\alpha D_j (\\zeta_i - O_i), \\qquad i \\in \\{1, \\ldots, m\\},\\ j \\in L$$\n\nSince together with the best-matching unit its direct topological neighbors are always adapted, neighboring units tend to have similar center positions. This property can be used to determine suitable center positions for new units, as will be demonstrated in the following.\n\na) Before ... b) during, and ... c) ... after adaptation\n\nFigure 1: One adaptation step. The center positions of the current network are shown and the change caused by a single input signal. The observed error SE for this pattern is added to the local error variable of the best-matching unit.\n\na) Before ... b) ... 
and after insertion\n\nFigure 2: Insertion of a new unit. The dotted lines indicate the Voronoi fields. The unit $q$ has accumulated the most error and, therefore, a new unit is inserted between $q$ and one of its direct neighbors.\n\n2.3 INSERTION OF NEW UNITS\n\nAfter a constant number $\\lambda$ of adaptation steps a new unit is inserted. For this purpose the unit $q$ with maximum accumulated error is determined. Obviously, $q$ lies in a region of the input vector space where many misclassifications occur. One possible reason for this is that the gradient descent procedure is unable to find suitable weights for the current network. This, in turn, might be caused by the coarse resolution in this region of the input vector space: if data items from different classes are covered by the same local unit and activate this unit to about the same degree, then their vectors of local unit activations may be nearly identical, which makes it hard for the subsequent single layer perceptrons to distinguish among them. Moreover, even if the activation vectors are sufficiently different, they still might not be linearly separable.\n\naccurate) approximation of the Delaunay triangulation which is based on the \"Neural-Gas\" method proposed by Martinetz & Schulten (1991). 
\n\na) two spiral problem: 194 points in two classes b) decision regions for Cascade-Correlation (reprinted with permission from Fahlman & Lebiere, 1990)\n\nFigure 3: Two spiral problem and learning results of a constructive network.\n\nThe insertion of a new local unit near $q$ is likely to improve the situation: this unit will probably be activated to a different degree by the data items in this region and will, therefore, make the problem easier for the single layer perceptrons.\n\nWhat exactly are we doing? We choose one of the direct topological neighbors of $q$, say a unit $f$ (see also fig. 2). Currently this is the neighbor with the maximum accumulated error. Other choices, however, have shown good results as well, e.g., the neighbor with the most distant center position or even a randomly picked neighbor. We insert a new unit $r$ in between $q$ and $f$ and initialize its center by\n\n$$w_r = \\frac{1}{2}(w_q + w_f) \\tag{4}$$\n\nWe connect the new unit with $q$ and $f$ and with all common neighbors of $q$ and $f$. The original connection between $q$ and $f$ is removed. By this we get a structure of k-dimensional simplices again. The new unit gets weights to the output units which are interpolated from the weights of its neighbors. The same is done for the initial error variable, which is linearly interpolated from the variables of the neighbors of $r$. After the interpolation all the weights of $r$ and its neighbors and the error variables of these units are multiplied by a factor $|N_r| / (|N_r| + 1)$. This is done to disturb the output of the network as little as possible⁴. However, by far the most important\n\n⁴ The redistribution of the error variable is again a relic from the unsupervised version (Fritzke, 1993). 
There we count signals rather than accumulate error. An elaborate scheme for redistributing the signal counters is necessary to get good local estimates of the probability density. For the supervised version this redistribution is harder to justify since the insertion of a new unit in general makes previous error information void. However, even though there is still some room for simplification, the described scheme already works very well in its present form.\n\ndecision seems to be to insert the new unit near the unit with maximum error. The weights and the error variables adjust quickly after some learning steps.\n\na) final network with 145 cells b) decision regions\n\nFigure 4: Performance of the Growing Cell Structures on the two spiral benchmark.\n\n2.4 SIMULATION RESULTS\n\nSimulations with the two spiral problem (fig. 3a) have been performed. This classification benchmark has been widely used before, so that results for comparison are readily available. Figure 3b shows the result of another constructive algorithm. The data consist of 194 points arranged on two interlaced spirals in the plane. Each spiral corresponds to one class. Due to the high nonlinearity of the task it is particularly difficult for networks consisting of global units (e.g., multi-layer perceptrons). However, the varying density of data points (which is higher in the center of the spirals) makes it also a challenge for networks of local units.\n\nAs for most learning problems, the interesting aspect is not learning the training examples but rather the performance on new data, which is often denoted as generalization. 
Baum & Lang (1991) defined a test set of 576 points for this problem consisting of three equidistant test points between each pair of adjacent same-class training points. For their best network they reported 29 errors on the test set on average.\n\nIn figure 4 a typical network generated by our method can be seen as well as the corresponding decision regions. No errors are made on the test set of Baum and Lang. Table 1 shows the necessary training cycles for several algorithms. The new growing network uses far fewer cycles than the other networks.\n\nTable 1: Training epochs necessary for the two spiral problem\n\nnetwork model            epochs  test error  reported in\nBackpropagation          20000   yes         Lang & Witbrock (1989)\nCross Entropy BP         10000   yes         Lang & Witbrock (1989)\nCascade-Correlation      1700    yes         Fahlman & Lebiere (1990)\nGrowing Cell Structures  180     no          Fritzke (1993)\n\nOther experiments have been performed with a vowel recognition problem (Fritzke, 1993). In all simulations we obtained significantly better generalization results than Robinson (1989), who in his thesis investigated the performance of several connectionist and conventional algorithms on the same problem. The necessary number of training cycles for our method was lower by a factor of about 37 than the numbers reported by Robinson (1993, personal communication).\n\nREFERENCES\n\nBaum, E. B. & K. E. Lang [1991], \"Constructing hidden units using examples and queries,\" in Advances in Neural Information Processing Systems 3, R.P. Lippmann, J.E. Moody & D.S. Touretzky, eds., Morgan Kaufmann Publishers, San Mateo, 904-910.\n\nFahlman, S. E. & C. 
Lebiere [1990], \"The Cascade-Correlation Learning Architecture,\" in Advances in Neural Information Processing Systems 2, D.S. Touretzky, ed., Morgan Kaufmann Publishers, San Mateo, 524-532.\n\nFritzke, B. [1993], \"Growing Cell Structures - a self-organizing network for unsupervised and supervised learning,\" International Computer Science Institute, TR-93-026, Berkeley.\n\nFritzke, B. [1994], \"Making hard problems linearly separable - incremental radial basis function approaches,\" (submitted to ICANN'94: International Conference on Artificial Neural Networks), Sorrento, Italy.\n\nLang, K. J. & M. J. Witbrock [1989], \"Learning to tell two spirals apart,\" in Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton & T. Sejnowski, eds., Morgan Kaufmann, San Mateo, 52-59.\n\nMartinetz, T. M. & K. J. Schulten [1991], \"A \"neural-gas\" network learns topologies,\" in Artificial Neural Networks, T. Kohonen, K. Mäkisara, O. Simula & J. Kangas, eds., North-Holland, Amsterdam, 397-402.\n\nMoody, J. & C. Darken [1989], \"Learning with Localized Receptive Fields,\" in Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton & T. Sejnowski, eds., Morgan Kaufmann, San Mateo, 133-143.\n\nPlatt, J. C. [1991], \"A Resource-Allocating Network for Function Interpolation,\" Neural Computation 3, 213-225.\n\nPoggio, T. & F. Girosi [1990], \"Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks,\" Science 247, 978-982.\n\nRobinson, A. J. [1989], \"Dynamic Error Propagation Networks,\" Cambridge University, PhD Thesis, Cambridge.\n", "award": [], "sourceid": 791, "authors": [{"given_name": "Bernd", "family_name": "Fritzke", "institution": null}]}