{"title": "Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 215, "page_last": 223, "abstract": null, "full_text": "Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks \n\n215 \n\nAlex Waibel \n\nCarnegie-Mellon University, Pittsburgh, PA 15213, \n\nATR Interpreting Telephony Research Laboratories, Osaka, Japan \n\nAbstract \n\nIn this paper1 we show that neural networks for speech recognition can be constructed in a modular fashion by exploiting the hidden structure of previously trained phonetic subcategory networks. The performance of the resulting larger phonetic nets was found to be as good as the performance of the subcomponent nets by themselves. This approach avoids the excessive learning times that would be necessary to train larger networks and allows for incremental learning. Large time-delay neural networks constructed incrementally by applying these modular training techniques achieved a recognition performance of 96.0% for all consonants. \n\n1. Introduction \n\nRecently we have demonstrated that connectionist architectures capable of capturing some critical aspects of the dynamic nature of speech can achieve superior recognition performance for difficult but small phonemic discrimination tasks, such as discrimination of the voiced consonants B, D and G [Waibel 89, Waibel 88a]. Encouraged by these results, we wanted to explore the question of how we might expand on these models to make them useful for the design of speech recognition systems. A problem that emerges as we attempt to apply neural network models to the full speech recognition problem is the problem of scaling. 
Simply extending neural networks to ever larger structures and retraining them as one monolithic net quickly exceeds the capabilities of the fastest and largest supercomputers. The search complexity of finding good solutions in a huge space of possible network configurations also soon assumes unmanageable proportions. Moreover, having to decide on all possible classes for recognition ahead of time, as well as collecting sufficient data to train such a large monolithic network, is impractical to say the least. In an effort to extend our models from small recognition tasks to large scale speech recognition systems, we must therefore explore modularity and incremental learning as design strategies to break up a large learning task into smaller subtasks. Breaking up a large task into subtasks to be tackled by individual black boxes interconnected in ad hoc arrangements, on the other hand, would mean abandoning one of the most attractive aspects of connectionism: the ability to perform complex constraint satisfaction in a massively parallel and interconnected fashion, in view of an overall optimal performance goal. In this paper we demonstrate, based on a set of experiments aimed at phoneme recognition, that it is indeed possible to construct large neural networks incrementally by exploiting the hidden structure of smaller pretrained subcomponent networks. \n\n1 An extended version of this paper will also appear in the Proceedings of the 1989 International Conference on Acoustics, Speech and Signal Processing. Copyright: IEEE. Reprinted with permission. \n\n2. Small Phonemic Classes by Time-Delay Neural Networks \n\nIn our previous work, we have proposed a Time-Delay Neural Network architecture (as shown on the left of Fig. 1 for B, D, G) as an approach to phoneme discrimination that achieves very high recognition scores [Waibel 89, Waibel 88a]. Its multilayer architecture, its shift-invariance and the time-delayed connections of its units all contributed to its performance by allowing the net to develop complex, non-linear decision surfaces and insensitivity to misalignments, and by incorporating contextual information into decision making (see [Waibel 89, Waibel 88a] for detailed analysis and discussion). It is trained by the back-propagation procedure [Rumelhart 86] using shared weights for different time-shifted positions of the net [Waibel 89, Waibel 88a]. In spirit it has similarities to other models recently proposed [Watrous 88, Tank 87]. This network, however, had only been trained for the voiced stops B, D, G, and we began our extensions by training similar networks for the other phonemic classes in our database. \n\n(Figure 1 graphic: two TDNN diagrams over an input of 16 melscale coefficients at a 10 msec frame rate, with hidden layers, an integration stage and an output layer.) \n\nFigure 1. 
The TDNN architecture: BDG-net (left), BDGPTK-net (right) \n\nAll phoneme tokens in our experiments were extracted using phonetic handlabels from a large vocabulary database of 5240 common Japanese words. Each word in the database was spoken in isolation by one male native Japanese speaker. All utterances were recorded in a soundproof booth and digitized at a 12 kHz sampling rate. The database was then split into a training set and a testing set of 2620 utterances each. A 150 msec range around a phoneme boundary was excised for each phoneme token and 16 melscale filterbank coefficients computed every 10 msec [Waibel 89, Waibel 88a]. The preprocessed training and testing data was then used to train or to evaluate our TDNNs' performance for various phoneme classes. For each class, TDNNs with an architecture similar to the BDG-net in Fig. 1 were trained. A total of seven nets aimed at the major coarse phonetic classes in Japanese were trained, including the voiced stops B, D, G, the voiceless stops P, T, K, the nasals M, N and syllabic nasals, the fricatives S, Sh, H and Z, the affricates Ch, Ts, the liquids and glides R, W, Y and finally the set of vowels A, I, U, E and O. Each of these nets was given between two and five phonemes to distinguish and the pertinent input data was presented for learning. Note that each net was trained only within each respective coarse class and has no notion of phonemes from other classes yet. Evaluation of each net on test data within each of these subcategories revealed that an average rate of 98.8% can be achieved (see [Waibel 88b] for a more detailed tabulation of results). \n\n3. Scaling TDNNs to Larger Phonemic Classes \n\nWe have seen that TDNNs achieve superior recognition performance on difficult but small recognition tasks. 
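The key property of the TDNN layers described above is that one set of weights is reused at every time-shifted position of the input, which is what gives the net its insensitivity to misalignments. A minimal sketch of a single time-delay unit (illustrative only; layer sizes, delays and weight values in the paper's nets differ):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tdnn_unit(frames, weights, bias, delays=2):
    """Apply one time-delay unit at every shifted position of the input.

    frames  : list of T spectral frames (each a list of coefficients)
    weights : weights[d][i] for delay d and coefficient i; the SAME weights
              are applied at every time-shifted position (weight sharing)
    Returns the unit's activation at each of the T - delays positions.
    """
    outputs = []
    for t in range(len(frames) - delays):
        net = bias
        for d in range(delays + 1):          # current frame plus delayed frames
            for i, coeff in enumerate(frames[t + d]):
                net += weights[d][i] * coeff
        outputs.append(sigmoid(net))
    return outputs
```

Because the weights are shared, shifting the input by one frame simply shifts the unit's output sequence by one position, which is the shift-invariance the text refers to.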
To train these networks, substantial computational resources were needed. This raises the question of how our networks could be extended to encompass all phonemes or handle speech recognition in general. To shed light on this question of scaling, we consider first the problem of extending our networks from the task of voiced stop consonant recognition (hence the BDG-task) to the task of distinguishing among all stop consonants (the BDGPTK-task). \n\nFor a network aimed at the discrimination of the voiced stops (a BDG-net), approximately 6000 connections had to be trained over about 800 training tokens. An identical net (also with approximately 6000 connections2) can achieve discrimination among the voiceless stops (\"P\", \"T\" and \"K\"). To extend our networks to the recognition of all stops, i.e., the voiced and the unvoiced stops (B, D, G, P, T, K), a larger net is required. We have trained such a network for experimental purposes. To allow for the necessary number of features to develop, we have given this net 20 units in the first hidden layer, 6 units in hidden layer 2 and 6 output units. On the right of Fig. 1 we show this net in actual operation with a \"G\" presented at its input. Eventually a high performance network was obtained that achieves 98.3% correct recognition over a 1613-token BDGPTK test database, but it took inordinate amounts of learning to arrive at the trained net (18 days on a 4-processor Alliant!). Although going from voiced stops to all stops is only a modest increase in task size, about 18,000 connections had to be trained. To make matters worse, not only should the number of connections be increased with task size, but in general the amount of training data required for good generalization of a larger net has to be increased as well. 
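The gap between connections and free parameters comes from the weight sharing: back-propagation iterates over one copy of each weight per time-shifted position, while only a single shared set is actually free. A small counting sketch for one shared-weight layer (numbers below are illustrative of the layer geometry, not taken from the paper):

```python
def tdnn_connection_counts(n_positions, n_units, window, n_in):
    """Count trained connections vs. free parameters for one shared-weight layer.

    n_positions : number of time-shifted positions the layer is applied at
    n_units     : hidden units in the layer
    window      : frames each unit sees (current frame + delays)
    n_in        : coefficients per input frame
    Backprop touches n_positions copies of every weight, but only one
    shared copy per unit (plus its bias) is a free parameter.
    """
    free_params = (window * n_in + 1) * n_units   # unique shared weights + biases
    connections = free_params * n_positions       # what each backprop pass iterates over
    return connections, free_params
```

For example, an 8-unit first hidden layer with a 3-frame window over 16 coefficients, applied at 13 positions, yields 5096 connections but only 392 free parameters, which is the flavor of the 6000-connection figure quoted above (the paper's totals also include the higher layers).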
Naturally, there are practical limits to the size of a training database, and more training data translates into even more learning time. Learning is further complicated by the increased complexity of the higher dimensional weight space in large nets, as well as the limited precision of our simulators. Despite progress towards faster learning algorithms [Haffner 88, Fahlman 88], it is clear that we cannot hope for one single monolithic network to be trained within reasonable time as we increase size to handle larger and larger tasks. Moreover, requiring that all classes be considered and samples of each class be presented during training is undesirable for practical reasons as we contemplate the design of large neural systems. Alternative ways to modularly construct and incrementally train such large neural systems must therefore be explored. \n\n2 Note that these are connections over which a back-propagation pass is performed during each iteration. Since many of them share the same weights, only a small fraction (about 500) of them are actually free parameters. \n\n3.1. Experiments with Modularity \n\nFour experiments were performed to explore methodologies for constructing phonetic neural nets from smaller component subnets. As a task we used again stop consonant recognition (BDGPTK), although other tasks have recently been explored with similar success (BDG and MNsN) [Waibel 88c]. As in the previous section, we used a large database of 5240 common Japanese words spoken in isolation, from which the testing and training tokens for the voiced stops (the BDG-set) and for the voiceless stops (the PTK-set) were extracted. Two separate TDNNs have been trained. 
On testing data the BDG-net used here performed 98.3% correct for the BDG-set and the PTK-net achieved 98.7% correct recognition for the PTK-set. As a first naive attempt, we have now simply run a speech token from either set (i.e., B, D, G, P, T or K) through both a BDG-net and a PTK-net and selected the class with the highest activation from either net as the recognition result. As might have been expected (the component nets had only been trained for their respective classes), poor recognition performance (60.5%) resulted from the 6 class experiment. This is partially due to the inhibitory property of the TDNN that we have observed elsewhere [Waibel 89]. To combine the two networks more effectively, therefore, portions of the net had to be retrained. \n\n(Figure 2 graphic: the combined network in operation with a \"G\" token at the input, showing the input layer, hidden layers 1 and 2, and the output layer.) \n\nFigure 2. BDGPTK-net trained from hidden units from a BDG- and a PTK-net. 
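The naive combination scheme above, running a token through both subnets and taking the most active output unit overall, can be sketched in a few lines (an illustrative reimplementation of the scheme that scored only 60.5%, not the paper's code):

```python
def merge_by_max_activation(bdg_out, ptk_out,
                            labels=("B", "D", "G", "P", "T", "K")):
    """Pick the class whose output unit is most active across two subnets.

    bdg_out : activations of the BDG-net's three output units
    ptk_out : activations of the PTK-net's three output units
    """
    activations = list(bdg_out) + list(ptk_out)
    best = max(range(len(activations)), key=activations.__getitem__)
    return labels[best]
```

The failure mode the text describes is visible here: each subnet was never trained to stay quiet on out-of-class tokens, so a PTK-net may out-shout the BDG-net on a "D", and nothing in this rule can correct for that.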
\n\nWe start by assuming that the fIrst hidden layer in either net already contains all the lower \n\n\fConsonant Recognition by Modular Construction \n\n219 \n\nlevel acoustic phonetic features  we need for proper identification of the stops and freeze \nthe connections from  the input layer (the speech data) to the first hidden layer's 8 units in \nthe  BOO-net  and  the  8  units  in  the  PTK-neL  Back-propagation  learning  is  then \nperformed only on the connections between these 16 (= 2 X 8) units in hidden layer 1 and \nhidden  layer  2  and  between  hidden  layer  2  and  the  combined  BooPTK-net's  output. \nThis  network is  shown  in  Fig.2  with  a  \"G\"  token  presented  as  input.  Only  the  higher \nlayer connections  had  to  be retrained  (for about one day)  in  this  case  and  the  resulting \nnetwork  achieved  a  recognition  performance  of  98.1 %  over  the \ntesting  data. \nCombination  of  the  two  subnets  has  therefore  yielded  a  good  net  although  a  slight \nperformance degradation compared to the subnets was observed.  This degradation could \nbe explained by the increased complexity of the task. but also by the inability of this net \nto develop lower level acoustic-phonetic features in hidden layer 1.  Such features may in \nfact be needed for discrimination between the two stop classes. in addition  to the within(cid:173)\nclass features. \nIn  a  third  experiment.  we  therefore  flrst  train  a  separate  fiNN  to  perform  the \nvoiced/unvoiced (V /UV)  distinction between the Boo- and the PTK-task.  The network \nhas a very similar structure as the BOO-net. except that only four hidden units were used \nin  hidden layer  1 and two in  hidden  layer 2 and  at the output.  This V/UV-net achieved \nbetter  than  99%  voiCed/unvoiced  classification  on  the  test  data  and  its  hidden  units \ndeveloped in the process are now used as additional features for the BooPTK-task.  
The connections from the input to the first hidden layer of the BDG-, the PTK- and the V/UV-nets are frozen and only the connections that combine the 20 units in hidden layer 1 to the higher layers are retrained. Training of the V/UV-net and subsequent combination training took between one and two days. The resulting net was evaluated as before on our testing database and achieved a recognition score of 98.4% correct. \n\n(Figure 3 graphic: the merged network, with the frozen BDG and PTK feature units and the free glue units in hidden layer 1 feeding free higher layers up to the output layer.) \n\nFigure 3. Combination of a BDG-net and a PTK-net using 4 additional units in hidden layer 1 as free \"Connectionist Glue\". \n\nIn the previous experiment, good results could be obtained by adding units that we believed to be the useful class distinctive features that were missing in our second experiment. In a fourth experiment, we have now examined an approach that allows the network to be free to discover any additional features that might be useful to merge the two component networks. Instead of previously training a class distinctive network, we now add four units to hidden layer 1, whose connections to the input are free to learn any missing discriminatory features to supplement the 16 frozen BDG and PTK features. We call these units the \"connectionist glue\" that we apply to merge two distinct networks into a new combined net. This network is shown in Fig. 3. The hidden units of hidden layer 1 from the BDG-net are shown on the left and those from the PTK-net on the right. The connections from the moving input window to these units have been trained individually on BDG- and PTK-data, respectively, and, as before, remain fixed during combination learning. 
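Mechanically, "freezing" the copied feature detectors during combination learning just means excluding them from the weight update; a schematic sketch of one such gradient step (a per-weight frozen mask is an assumption made here for illustration, not the paper's implementation):

```python
def combination_step(weights, grads, frozen, lr=0.1):
    """One gradient-descent step of combination learning.

    weights : current weight values (flat list, schematic)
    grads   : back-propagated gradients for those weights
    frozen  : True for weights copied from pretrained subnets (left untouched),
              False for free glue and higher-layer weights (updated)
    """
    return [w if is_frozen else w - lr * g
            for w, g, is_frozen in zip(weights, grads, frozen)]
```

Backprop still flows through the frozen units so the free weights above and beside them can adapt; only the frozen weights themselves stop moving, which is why only about a day of retraining was needed.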
In the middle of hidden layer 1 we show the 4 free \"glue\" units. Combination learning now searches for an optimal combination of the existing BDG- and PTK-features and also supplements these by learning additional interclass discriminatory features. Combination retraining with \"glue\" required a two day training run. Performance evaluation of this network over the BDGPTK test database yielded a recognition rate of 98.4%. \n\nIn addition to the techniques described so far, it may be useful to free all connections in a large modularly constructed network for an additional small amount of fine tuning. This has been done for the BDGPTK-net shown in Fig. 3, yielding some additional performance improvements. Each iteration of the full network is indeed very slow, but convergence is reached after only a few additional tuning iterations. The resulting network finally achieved (over testing data) a recognition score of 98.6%. \n\n3.2. Steps for the Design of Large Scale Neural Nets \n\nMethod | bdg | ptk | bdgptk \nIndividual TDNNs | 98.3% | 98.7% | - \nTDNN: Max. Activation | - | - | 60.5% \nRetrain BDGPTK | - | - | 98.3% \nRetrain Combined Higher Layers | - | - | 98.1% \nRetrain with V/UV-units | - | - | 98.4% \nRetrain with Glue | - | - | 98.4% \nAll-Net Fine Tuning | - | - | 98.6% \n\nTable 3-1: From BDG to BDGPTK; Modular Scaling Methods. \n\nTable 3-1 summarizes the major results from our experiments. In the first row it shows the recognition performance of the two initial TDNNs trained individually to perform the BDG- and the PTK-tasks, respectively. Underneath, we show the results from the various experiments described in the previous section. The results indicate that larger TDNNs can indeed be trained incrementally, without requiring excessive amounts of training and without loss in performance. 
The total incremental training time was between one third and one half of that of a full monolithically trained net, and the resulting networks appear to perform slightly better. Even more astonishingly, they appear to achieve performance as high as the subcomponent BDG- and PTK-nets alone. As a strategy for the efficient construction of larger networks, we have found the following concepts to be extremely effective: modular, incremental learning; class distinctive learning; connectionist glue; partial and selective learning; and all-net fine tuning. \n\n4. Recognition of all Consonants \n\nThe incremental learning techniques explored so far can now be applied to the design of networks capable of recognizing all consonants. \n\n4.1. Network Architecture \n\n(Figure 4 graphic: the all-consonant network, with output units B, D, G, P, T, K, M, N, sN, S, Sh, H, Z, Ch, Ts, R, W, Y fed by frozen subcategory hidden layers and free combining connections.) \n\nFigure 4. Modular Construction of an All Consonant Network \n\nOur consonant TDNN (shown in Fig. 4) was constructed modularly from networks aimed at the consonant subcategories, i.e., the BDG-, PTK-, MNsN-, SShHZ-, TsCh- and the RWY-tasks. Each of these nets had been trained before to discriminate between the consonants within each class. Hidden layers 1 and 2 were then extracted from these nets, i.e., their weights copied and frozen in a new combined consonant TDNN. In addition, an interclass discrimination net was trained that distinguishes between the consonant subclasses and thus hopefully provides missing featural information for interclass discrimination, much like the V/UV network described in the previous section. 
The structure of this network was very similar to the other subcategory TDNNs, except that we have allowed for 20 units in hidden layer 1 and 6 hidden units (one for each coarse consonant class) in hidden layer 2. The weights leading into hidden layers 1 and 2 were then also copied from this interclass discrimination net into the consonant network and frozen. Three connections were then established to each of the 18 consonant output categories (B, D, G, P, T, K, M, N, sN, S, Sh, H, Z, Ch, Ts, R, W and Y): one to connect an output unit with the appropriate interclass discrimination unit in hidden layer 2, one with the appropriate intra-class discrimination unit from hidden layer 2 of the corresponding subcategory net, and one with the always activated threshold unit (not shown in Fig. 4). The overall network architecture is shown in Fig. 4 for the case of an incoming test token (e.g., a \"G\"). For simplicity, Fig. 4 shows only the hidden layers from the BDG-, PTK-, SShHZ- and the interclass discrimination nets. At the output, only the two connections leading to the correctly activated \"G\"-output unit are shown. Units and connections pertaining to the other subcategories, as well as connections leading to the 17 other output units, are omitted for clarity in Fig. 4. All free weights were initialized with small random values and then trained. \n\n4.2. Results \n\nTask | Recognition Rate (%) \nbdg | 98.6 \nptk | 98.7 \nmnN | 96.6 \nsshhz | 99.3 \nchts | 100.0 \nrwy | 99.9 \ncons. class | 96.7 \nAll consonant TDNN | 95.0 \nAll-Net Fine Tuning | 95.9 \n\nTable 4-1: Consonant Recognition Performance Results. \n\nTable 4-1 summarizes our results for the consonant recognition task. 
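The three-connection output wiring described above is simple enough to write down directly; a sketch of one consonant output unit (weight and argument names are illustrative, and the sigmoid output nonlinearity is an assumption consistent with the rest of the architecture):

```python
import math

def consonant_output(inter_act, intra_act, w_inter, w_intra, w_thresh):
    """Activation of one of the 18 consonant output units.

    Exactly three connections feed it: the matching coarse-class unit in the
    interclass net's hidden layer 2 (inter_act), the matching phoneme unit in
    the corresponding subcategory net's hidden layer 2 (intra_act), and the
    always-on threshold unit. Only the three weights are free parameters.
    """
    net = w_inter * inter_act + w_intra * intra_act + w_thresh * 1.0
    return 1.0 / (1.0 + math.exp(-net))
```

The design intent is that an output unit fires only when both evidence sources agree: the interclass net says the token is in the right coarse class and the subcategory net says it is the right phoneme within that class.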
In the first 6 rows, the recognition results (measured over the available test data in their respective subclasses) are given. The entry \"cons. class\" shows the performance of the interclass discrimination net in identifying the coarse phonemic subclass of an unknown token: 96.7% of all tokens were correctly categorized into one of the six consonant subclasses. After completion of combination learning, the entire net was evaluated over 3061 consonant test tokens and achieved a 95.0% recognition accuracy. All-net fine tuning was then performed by freeing up all connections in the network to allow for small additional adjustments in the interest of better overall performance. After completion of all-net fine tuning, the performance of the network improved to 96.0% correct. To put these recognition results into perspective, we have compared them with several other competing recognition techniques and found that our incrementally trained net compares favorably [Waibel 88b]. \n\n5. Conclusion \n\nThe serious problems associated with scaling smaller phonemic subcomponent networks to larger phonemic tasks are overcome by careful modular design. 
Modular design is achieved by several important strategies: selective and incremental learning of subcomponent tasks, exploitation of previously learned hidden structure, the application of connectionist glue or class distinctive features to allow separate networks to \"grow\" together, partial training of portions of a larger net and, finally, all-net fine tuning for making small additional adjustments in a large net. Our findings suggest that judicious application of a number of connectionist design techniques could lead to the successful design of high performance large scale connectionist speech recognition systems. \n\nReferences \n\n[Fahlman 88] Fahlman, S.E. An Empirical Study of Learning Speed in Back-Propagation Networks. Technical Report CMU-CS-88-162, Carnegie-Mellon University, June, 1988. \n\n[Haffner 88] Haffner, P., Waibel, A. and Shikano, K. Fast Back-Propagation Learning Methods for Neural Networks in Speech. In Proceedings of the Fall Meeting of the Acoustical Society of Japan. October, 1988. \n\n[Rumelhart 86] Rumelhart, D.E., Hinton, G.E. and Williams, R.J. Learning Internal Representations by Error Propagation. In McClelland, J.L. and Rumelhart, D.E. (editors), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, chapter 8, pages 318-362. MIT Press, Cambridge, MA, 1986. \n\n[Tank 87] Tank, D.W. and Hopfield, J.J. Neural Computation by Concentrating Information in Time. In Proceedings of the National Academy of Sciences, pages 1896-1900. April, 1987. \n\n[Waibel 88a] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. and Lang, K. Phoneme Recognition: Neural Networks vs. Hidden Markov Models. In IEEE International Conference on Acoustics, Speech, and Signal Processing. April, 1988. \n\n[Waibel 88b] Waibel, A., Sawai, H. and Shikano, K. 
Modularity and Scaling in Large Phonemic Neural Networks. Technical Report TR-I-0034, ATR Interpreting Telephony Research Laboratories, July, 1988. \n\n[Waibel 88c] Waibel, A. Connectionist Glue: Modular Design of Neural Speech Systems. In Touretzky, D.S., Hinton, G.E. and Sejnowski, T.J. (editors), Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann, 1988. \n\n[Waibel 89] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. and Lang, K. Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech and Signal Processing, March, 1989. \n\n[Watrous 88] Watrous, R. Speech Recognition Using Connectionist Networks. PhD thesis, University of Pennsylvania, October, 1988. \n", "award": [], "sourceid": 185, "authors": [{"given_name": "Alex", "family_name": "Waibel", "institution": null}]}