{"title": "Constructive Learning Using Internal Representation Conflicts", "book": "Advances in Neural Information Processing Systems", "page_first": 279, "page_last": 284, "abstract": null, "full_text": "Constructive Learning Using Internal Representation Conflicts \n\nLaurens R. Leerink and Marwan A. Jabri \n\nSystems Engineering & Design Automation Laboratory \n\nDepartment of Electrical Engineering \n\nThe University of Sydney \n\nSydney, NSW 2006, Australia \n\nAbstract \n\nWe present an algorithm for the training of feedforward and recurrent \nneural networks. It detects internal representation conflicts \nand uses these conflicts in a constructive manner to add new neurons \nto the network. The advantages are twofold: (1) starting with \na small network, neurons are only allocated when required; (2) by \ndetecting and resolving internal conflicts at an early stage, learning \ntime is reduced. Empirical results on two real-world problems substantiate \nthe faster learning speed; when applied to the training \nof a recurrent network on a well researched sequence recognition \ntask (the Reber grammar), training times are significantly less than \npreviously reported. \n\n1 Introduction \n\nSelecting the optimal network architecture for a specific application is a nontrivial \ntask, and several algorithms have been proposed to automate this process. The \nfirst class of network adaptation algorithms starts out with a redundant architecture \nand proceeds by pruning away seemingly unimportant weights (Sietsma and Dow, \n1988; Le Cun et al, 1990). A second class of algorithms starts off with a sparse \narchitecture and grows the network to the complexity required by the problem. \nSeveral algorithms have been proposed for growing feedforward networks. 
The \nupstart algorithm of Frean (1990) and the cascade-correlation algorithm of Fahlman \nand Lebiere (1990) are examples of this approach. \n\nThe cascade-correlation algorithm has also been extended to recurrent networks \n(Fahlman, 1991), and has been shown to produce good results. The recurrent \ncascade-correlation (RCC) algorithm adds a fully connected layer to the network \nafter every step, in the process attempting to correlate the output of the additional \nlayer with the error. In contrast, our proposed algorithm uses the statistical properties \nof the weight adjustments produced during batch learning to add additional \nunits. \n\nThe RCC algorithm will be used as a baseline against which the performance of \nour method will be compared. In a recent paper, Chen et al (1993) presented an \nalgorithm which adds one recurrent neuron with small weights every N epochs. \nHowever, no significant improvement in training speed was reported over training \nthe corresponding fixed size network, and the algorithm will not be further analyzed. \nTo the authors' knowledge, little work besides the two papers mentioned has applied \nconstructive algorithms to recurrent networks. \n\nIn the majority of our empirical studies we have used partially recurrent neural \nnetworks, and in this paper we will focus our attention on such networks. The motivation \nfor the development of this algorithm partly stemmed from the long training \ntimes experienced with the problems of phoneme and word recognition from continuous \nspeech. However, the algorithm is directly applicable to feedforward networks. \nThe same criteria and method used to add recurrent neurons to a recurrent network \ncan be used for adding neurons to any hidden layer of a feedforward network. 
\n\n2 Architecture \n\nIn a standard feedforward network, the outputs only depend on the current inputs, \nthe network architecture and the weights in the network. However, because of the \ntemporal nature of several applications, in particular speech recognition, it might \nbe necessary for the network to have a short term memory. \n\nPartially recurrent networks, often referred to as Jordan (1989) or Elman (1990) \nnetworks, are well suited to these problems. The architecture examined in this \npaper is based on the work done by Robinson and Fallside (1991), who have applied \ntheir recurrent error propagation network to continuous speech recognition. \n\nA common feature of all partially recurrent networks is that there is a special set \nof neurons called context units which receive feedback signals from a previous time \nstep. Let the values of the context units at time t be represented by C(t). During \nnormal operation the input vector at time t is applied to the input nodes I(t), and \nduring the feedforward calculation values are produced at both the output nodes \nO(t + 1) and the context units C(t + 1). The values of the context units are then \ncopied back to the input layer for use as input in the following time step. \n\nSeveral training algorithms exist for training partially recurrent neural networks, \nbut for tasks with large training sets the back-propagation through time algorithm (Werbos, \n1990) is often used. This method is computationally efficient and does not use \nany approximations in following the gradient. For an application where the time \ninformation is spread over T input patterns, the algorithm simply duplicates the \nnetwork T times, which results in a feedforward network that can be trained by a \nvariation of the standard backpropagation algorithm. 
\n\n3 The Algorithm \n\nFor partially recurrent networks consisting of input, output and context neurons, \nthe following assertions can be made: \n\n\u2022 The role of the context units in the network is to extract and store all \nrelevant prior information from the sequence pertaining to the classification \nproblem. \n\n\u2022 For weights entering context units the weight update values accumulated \nduring batch learning will eventually determine what context information \nis stored in the unit (the sum of the weight update values is larger than the \ninitial random weights). \n\n\u2022 We assume that initially the number of context units in the network is \ninsufficient to implement this extraction and storage of information (we \nstart training with a small network). Then, at different moments in time \nduring the recognition of long temporal sequences, a context unit could be \nrequired to preserve several different contexts. \n\n\u2022 These conflicts are manifested as distinct peaks in the distribution of the \nweight update values during the epoch. \n\nAll but the last assertion follow directly from the network architecture and require no \nfurther elaboration. The peaks in the distribution of the weight update values are a \nresult of the training algorithm attempting to adjust the value of the context units in \norder to provide a context value that will resolve short-term memory requirements. \n\nAfter the algorithm had been developed, it was discovered that this aspect of the \nweight update values had been used in the past by Wynne-Jones (1992) and in \nthe Meiosis Networks of Hanson (1990). 
The method of Wynne-Jones (1992) in \nparticular is very closely related; in this case principal component analysis of the \nweight updates and the Hessian matrix is used to detect oscillating nodes in fully \ntrained feedforward networks. This aspect of backpropagation training is fully \ndiscussed in Wynne-Jones (1992), to which the reader is referred for further details. \n\nThe above assertions lead to the proposed training algorithm, which states that if \nthere are distinct maxima in the distribution of weight update values of the weights \nentering a context unit, then this is an indication that the batch learning algorithm \nrequires this context unit for the storage of more than one context. \n\nIf this conflict can be resolved, the network can effectively store all the contexts \nrequired, leading to a reduction in training time and potentially an increase in \nperformance. \n\nThe training algorithm is given below (the modality of the distribution is defined as \nthe number of distinct maxima): \n\nFor all context units { \n\n    Set N = modality of the distribution of weight update values; \n    If N > 1 then { \n\n        Add N-1 new context units to the network which are identical \n        (in terms of weighted inputs) to the current context unit. \n\n        Adjust each of these N context units (including the \n        original) by the weight update value determined by each \n        maximum (the average value of the mode). \n\n        Adjust all weights leaving these N context units so that the \n        addition of the new units does not affect any subsequent layers \n        (division by N). This ensures that the network retains all \n        previously acquired knowledge. 
\n\n    } \n\n} \n\nThe main problem in the implementation of the above algorithm is the automatic \ndetection of significant maxima in the distribution of weight updates. A standard \nstatistical approach for the determination of the modality (the number of maxima) \nof a distribution of noisy data is to fit a curve of a certain predetermined order to \nthe data. The maxima (and minima) are then found by setting the derivative to \nzero. This method was found to be unsuitable, mainly because after curve fitting it \nwas difficult to determine the significance of the detected peaks. \n\nIt was decided that only instances of bi-modality and tri-modality were to be identified, \ncorresponding to the addition of one or two context units respectively. The following \nheuristic was constructed: \n\n\u2022 Calculate the mean and standard deviation of the weight update values. \n\u2022 Obtain the maximum value in the distribution. \n\u2022 If there are any peaks larger than 60% of the maximum outside one standard \ndeviation of the mean, regard these as significant. \n\nThis heuristic provided adequate identification of the modalities. The distribution \nwas divided into three areas using the mean \u00b1 the standard deviation as boundaries. \nDepending on the number of maxima detected, the average within each area is used \nto adjust the weights. \n\n4 Discussion \n\nAccording to our algorithm it follows that if at least one weight entering a context \nunit has a multi-modal distribution, then that context unit is duplicated. In the \ncase where multi-modality is detected in more than one weight, context units are \nadded according to the highest modality. \n\nAlthough this algorithm increases the computational load during training, the standard \ndeviation of the weight updates rapidly decreases as the network converges. 
\nThe narrowing of the distribution makes it more difficult to determine the modality. \nIn practice it was only found useful to apply the algorithm during the initial \ntraining epochs, typically during the first 20. \n\nDuring simulations in which strong multi-modalities were detected in certain nodes, \nthe multi-modalities would frequently persist in the newly created nodes. In this \nmanner a strong bi-modality would cause one node to split into two, the two nodes \nto grow to four, etc. This behaviour was prevented by disabling the splitting of \na node for a variable number of epochs after a multi-modality had been detected. \nDisabling splitting for two epochs provided good results. \n\n5 Simulation Results \n\nThe algorithm was evaluated empirically on two different tasks: \n\n\u2022 Phoneme recognition from continuous multi-speaker speech using the \nTIMIT (Garofolo, 1988) acoustic-phonetic database. \n\n\u2022 Sequence Recognition: Learning a finite-state grammar from examples of \nvalid sequences. \n\nFor the phoneme recognition task the algorithm decreased training times by a factor \nof 2 to 10, depending on the size of the network and the size of the training set. \n\nThe sequence recognition task has been studied by other researchers in the past, notably \nFahlman (1991). Fahlman compared the performance of the recurrent cascade \ncorrelation (RCC) network with that of previous results by Cleeremans et al (1989) \nwho used an Elman (1990) network. It was concluded that the RCC algorithm \nprovides the same or better performance than the Elman network with fewer training \ncycles on a smaller training set. 
Our simulations have shown that the recurrent \nerror propagation network of Robinson and Fallside (1991), when trained with our \nconstructive algorithm and a learning rate adaptation heuristic, can provide the \nsame performance as the RCC architecture in 40% fewer training epochs using a \ntraining set of the same size. The resulting network has the same number of weights \nas the minimum size RCC network which correctly solves this problem. \n\nConstructive algorithms are often criticized in terms of efficiency, i.e. \"Is the increase \nin learning speed due to the algorithm or just the additional degrees of \nfreedom resulting from the added neuron and associated weights?\". To address this \nquestion several simulations were conducted on the speech recognition task, comparing \nthe performance and learning time of a network with N fixed context units \nto that of a network that starts with a small number of context units and grows \nto a maximum of N context units. Results indicate that the constructive algorithm \nconsistently trains faster, even though both networks often have the same \nfinal performance. \n\n6 Summary \n\nIn this paper the statistical properties of the weight update values obtained during \nthe training of a simple recurrent network using back-propagation through time \nhave been examined. An algorithm has been presented for using these properties to \ndetect internal representation conflicts during training and to use this information \nto add recurrent units to the network. Simulation results show that the algorithm \ndecreases training time compared to networks which have a fixed number of context \nunits. The algorithm has not been applied to feedforward networks, but can in \nprinciple be added to all training algorithms that operate in batch mode. 
\n\nReferences \n\nChen, D., Giles, C.L., Sun, G.Z., Chen, H.H., Lee, Y.C., Goudreau, M.W. (1993). \nConstructive Learning of Recurrent Neural Networks. In 1993 IEEE International \nConference on Neural Networks, 111:1196-1201. Piscataway, NJ: IEEE Press. \n\nCleeremans, A., Servan-Schreiber, D., and McClelland, J.L. (1989). Finite State \nAutomata and Simple Recurrent Networks. Neural Computation 1:372-381. \n\nElman, J.L. (1990). Finding Structure in Time. Cognitive Science 14:179-211. \n\nFahlman, S.E. and C. Lebiere (1990). The Cascade Correlation Learning Architecture. \nIn D. S. Touretzky (ed.), Advances in Neural Information Processing Systems \n2, 524-532. San Mateo, CA: Morgan Kaufmann. \n\nFahlman, S.E. (1991). The Recurrent Cascade Correlation Architecture. Technical \nReport CMU-CS-91-100. School of Computer Science, Carnegie Mellon University. \n\nFrean, M. (1990). The Upstart Algorithm: A Method for Constructing and Training \nFeedforward Neural Networks. Neural Computation 2:198-209. \n\nGarofolo, J.S. (1988). Getting Started with the DARPA TIMIT CD-ROM: an \nAcoustic Phonetic Continuous Speech Database. National Institute of Standards \nand Technology (NIST), Gaithersburg, Maryland. \n\nHanson, S.J. (1990). Meiosis Networks. In D. S. Touretzky (ed.), Advances in Neural \nInformation Processing Systems 2, 533-541. San Mateo, CA: Morgan Kaufmann. \n\nJordan, M.I. (1989). Serial Order: A Parallel, Distributed Processing Approach. In \nAdvances in Connectionist Theory: Speech, eds. J.L. Elman and D.E. Rumelhart. \nHillsdale: Erlbaum. \n\nLe Cun, Y., J.S. Denker, and S.A. Solla (1990). Optimal Brain Damage. In D. S. \nTouretzky (ed.), Advances in Neural Information Processing Systems 2, 598-605. \nSan Mateo, CA: Morgan Kaufmann. \n\nReber, A.S. (1967). 
Implicit learning of artificial grammars. Journal of Verbal \nLearning and Verbal Behavior 5:855-863. \n\nRobinson, A.J. and Fallside, F. (1991). An error propagation network speech recognition \nsystem. Computer Speech and Language 5:259-274. \n\nSietsma, J. and R.J.F. Dow (1988). Neural Net Pruning - Why and How. In IEEE \nInternational Conference on Neural Networks (San Diego 1988), 1:325-333. \n\nWynne-Jones, M. (1992). Node Splitting: A Constructive Algorithm for Feed-Forward \nNeural Networks. In D. S. Touretzky (ed.), Advances in Neural Information \nProcessing Systems 4, 1072-1079. San Mateo, CA: Morgan Kaufmann. \n\nWerbos, P.J. (1990). Backpropagation Through Time, How It Works and How to \nDo It. Proceedings of the IEEE, 78:1550-1560. \n", "award": [], "sourceid": 802, "authors": [{"given_name": "Laurens", "family_name": "Leerink", "institution": null}, {"given_name": "Marwan", "family_name": "Jabri", "institution": null}]}