{"title": "Connected Letter Recognition with a Multi-State Time Delay Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 712, "page_last": 719, "abstract": null, "full_text": "Connected Letter Recognition with a \n\nMulti-State Time Delay Neural Network \n\nHermann Hild  and  Alex Waibel \n\nSchool of Computer Science \nCarnegie Mellon  University \n\nPittsburgh,  PA  15213-3891, USA \n\nAbstract \n\nThe  Multi-State  Time  Delay  Neural  Network  (MS-TDNN)  inte(cid:173)\ngrates  a nonlinear time alignment procedure  (DTW) and the high(cid:173)\naccuracy  phoneme spotting capabilities of a TDNN into  a  connec(cid:173)\ntionist speech recognition system with word-level classification and \nerror  backpropagation.  We  present  an  MS-TDNN  for  recognizing \ncontinuously  spelled  letters,  a  task  characterized  by  a  small  but \nhighly confusable vocabulary.  Our MS-TDNN  achieves  98.5/92.0% \nword  accuracy  on  speaker  dependent/independent  tasks,  outper(cid:173)\nforming previously reported results on the same databases.  We pro(cid:173)\npose training techniques  aimed at improving sentence  level  perfor(cid:173)\nmance, including free  alignment  across  word  boundaries,  word  du(cid:173)\nration  modeling and error backpropagation on  the sentence  rather \nthan the word level.  Architectures  integrating submodules special(cid:173)\nized  on  a  subset  of speakers  achieved further  improvements. \n\n1 \n\nINTRODUCTION \n\nThe recognition of spelled strings of letters is  essential for  all  applications involving \nproper names, addresses or other large sets of special words which due to their sheer \nsize can not be in the basic vocabulary of a recognizer.  The high confusability of the \nEnglish letters makes the seemingly easy task a very challenging one,  currently only \naddressed  by  a  few  systems,  e.g.  those  of R.  Cole  et.  al.  
[JFC90, FC90, CFGJ91] for isolated spoken letter recognition. Their connectionist systems first find a broad phonetic segmentation, from which a letter segmentation is derived, which is then classified by another network. In this paper, we present the MS-TDNN as a connectionist speech recognition system for connected letter recognition. After describing the baseline architecture, training techniques aimed at improving sentence level performance and architectures with gender-specific subnets are introduced. \n\nFigure 1: The MS-TDNN recognizing the excerpted word 'B'. Only the activations for the words 'SIL', 'A', 'B', and 'C' are shown. (Layers, top to bottom: Word Layer, 27 word units, only 4 shown; DTW Layer, 27 word templates, only 4 shown; Phoneme Layer, 59 phoneme units, only 9 shown; Hidden Layer, 15 hidden units; Input Layer, 16 melscale FFT coefficients.) \n\nBaseline Architecture. Time Delay Neural Networks (TDNNs) can combine the robustness and discriminative power of Neural Nets with a time-shift invariant architecture to form high accuracy phoneme classifiers [WHH+89]. The Multi-State TDNN (MS-TDNN) [HFW91, Haf92, HW92], an extension of the TDNN, is capable of classifying words (represented as sequences of phonemes) by integrating a nonlinear time alignment procedure (DTW) into the TDNN architecture. Figure 1 shows an MS-TDNN in the process of recognizing the excerpted word 'B', represented by 16 melscale FFT coefficients at a 10-msec frame rate. 
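As an illustration of the word-level classification just described (not the paper's actual implementation), DTW scoring of a word over frame-wise phoneme activations can be sketched as follows; the function name, array shapes and the average-score convention are assumptions made for this sketch:

```python
import numpy as np

def dtw_word_score(phone_scores, word_phonemes):
    # phone_scores: (n_frames, n_phonemes) frame-level phoneme activations
    # word_phonemes: phoneme indices spelling the word (its DTW template)
    local = phone_scores[:, word_phonemes]      # copy activations into the template
    T, S = local.shape
    acc = np.full((T, S), -np.inf)              # accumulated path scores
    acc[0, 0] = local[0, 0]
    for t in range(1, T):
        for s in range(S):
            best_prev = acc[t - 1, s]           # stay in the same phoneme state
            if s > 0:
                best_prev = max(best_prev, acc[t - 1, s - 1])  # advance one state
            acc[t, s] = best_prev + local[t, s]
    return acc[-1, -1] / T                      # average activation on the best path

# toy run: a two-phoneme word aligned against 5 frames of 3 phoneme scores
scores = np.array([[.9, .1, .0], [.8, .2, .0], [.2, .7, .1],
                   [.1, .8, .1], [.1, .9, .0]])
print(dtw_word_score(scores, [0, 1]))           # about 0.82
```

A word whose phoneme sequence matches the activation pattern scores higher than one that does not, which is what the linear word output units collect.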
The first three layers constitute a standard TDNN, which uses sliding windows with time delayed connections to compute a score for each phoneme (state) for every frame; these are the activations in the \"Phoneme Layer\". In the \"DTW Layer\", each word to be recognized is modeled by a sequence of phonemes. The corresponding activations are simply copied from the Phoneme Layer into the word models of the DTW Layer, where an optimal alignment path is found for each word. The activations along these paths are then collected in the word output units. All units in the DTW and Word Layer are linear and have no biases. 15 (25 to 100) hidden units per frame were used for speaker-dependent (-independent) experiments; the entire 26 letter network has approximately 5200 (8600 to 34500) parameters. \n\nTraining starts with \"bootstrapping\", during which only the front-end TDNN is used with fixed phoneme boundaries as targets. In a second phase, training is performed with word level targets. Phoneme boundaries are freely aligned within given word boundaries in the DTW layer. The error derivatives are backpropagated from the word units through the alignment path and the front-end TDNN. \n\nThe choice of sensible objective functions is of great importance. Let Y = (y_1, ..., y_n) be the output and T = (t_1, ..., t_n) the target vector. For training on the phoneme level (bootstrapping), there is a target vector T for each frame in time, representing the correct phoneme j in a \"1-out-of-n\" coding, i.e. t_i = delta_ij. To see why the standard Mean Square Error (MSE = sum_{i=1}^{n} (y_i - t_i)^2) is problematic for \"1-out-of-n\" codings for large n (n = 59 in our case), consider for example that for a target (1.0, 0.0, ..., 0.0) the output vector (0.0, ... 
, 0.0) has only half the error of the more desirable output (1.0, 0.2, ..., 0.2). To avoid this problem, we are using \n\nE(T, Y) = - sum_{i=1}^{n} ln(1 - (t_i - y_i)^2), \n\nwhich (like cross entropy) punishes \"outliers\" with an error approaching infinity for |t_i - y_i| approaching 1.0. For the word level training, we have achieved best results with an objective function similar to the \"Classification Figure of Merit (CFM)\" [HW90], which tries to maximize the distance d = y_c - y_hi between the correct score y_c and the highest incorrect score y_hi instead of using absolute target values of 1.0 and 0.0 for correct and incorrect word units: \n\nE_CFM(T, Y) = f(y_c - y_hi) = f(d) = (1 - d)^2 \n\nThe philosophy here is not to \"touch\" any output unit not directly related to correct classification. We found it even useful to apply error backpropagation only in the case of a wrong or too narrow classification, i.e. if y_c - y_hi < d_safety_margin. \n\n2 IMPROVING CONTINUOUS RECOGNITION \n\n2.1 TRAINING ACROSS WORD BOUNDARIES \n\nA proper treatment of word(1) boundaries is especially important for a short word vocabulary, since most phones are at word boundaries. While the phoneme boundaries within a word are freely aligned by the DTW during \"word level training\", the word boundaries are fixed and might be error prone or suboptimal. By extending the alignment one phoneme to the left (last phoneme of the previous word) and to the right (first phoneme of the next word), the word boundaries can be optimally adjusted in the same way as the phoneme boundaries within a word. Figure 2(a) shows an example in which the word to recognize is surrounded by a silence and a 'B'; thus the left and right context (for all words to be recognized) is the phoneme 'sil' and 'b', respectively. The gray shaded area indicates the extension necessary to the DTW alignment. The diagram shows how a new boundary for the beginning of the word 'A' is found. As indicated in figure 3, this technique improves continuous recognition significantly, but it does not help for excerpted words. \n\n(1) In our context, a \"word\" consists of one spelled letter, and a \"sentence\" is a continuously spelled string of letters. \n\nFigure 2: Various techniques to improve sentence level recognition performance: (a) alignment across word boundaries (word boundary found by free alignment), (b) duration dependent word penalties (prob(d) as a function of duration), (c) training on the sentence level. \n\n2.2 WORD DURATION DEPENDENT PENALIZING OF INSERTION AND DELETION ERRORS \n\nIn \"continuous testing mode\", instead of looking at word units, the well-known \"One Stage DTW\" algorithm [Ney84] is used to find an optimal path through an unspecified sequence of words. The short and confusable English letters cause many word insertion and deletion errors, such as \"T E\" vs. \"T\" or \"O\" vs. \"O O\"; therefore proper duration modeling is essential. \n\nAs suggested in [HW92], minimum phoneme duration can be enforced by \"state duplication\". 
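The state duplication idea can be sketched as follows; the helper name and the three-frame minimum are illustrative assumptions, not values from the paper. Because a DTW alignment path must spend at least one frame in every state of the word template, repeating each phoneme state enforces a minimum duration:

```python
def duplicate_states(word_phonemes, min_frames=3):
    # Repeat each phoneme state min_frames times in the word template.
    # Since the alignment path must visit every state for at least one
    # frame, the phoneme then cannot last fewer than min_frames frames.
    return [p for p in word_phonemes for _ in range(min_frames)]

# the word 'B' modeled as the phoneme sequence b, iy
print(duplicate_states(['b', 'iy']))   # ['b', 'b', 'b', 'iy', 'iy', 'iy']
```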
In addition, we are modeling a duration and word dependent penalty Pen_w(d) = log(k + prob_w(d)), where the pdf prob_w(d) is approximated from the training data and k is a small constant to avoid zero probabilities. Pen_w(d) is added to the accumulated score AS of the search path, AS = AS + lambda_w * Pen_w(d), whenever it crosses the boundary of a word w in Ney's \"One Stage DTW\" algorithm, as indicated in figure 2(b). The weight lambda_w, which determines the degree of influence of the duration penalty, is another important degree of freedom. There is no straightforward, mathematically exact way to compute the effect of a change of lambda_w on the insertion and deletion rate. Our approach is a (pseudo) gradient descent, which changes lambda_w proportionally to E(w) = (#ins_w - #del_w)/#w, i.e. we are trying to balance insertion and deletion errors relative to each other. \n\n2.3 ERROR BACKPROPAGATION AT THE SENTENCE LEVEL \n\nUsually the MS-TDNN is trained to classify excerpted words, but evaluated on continuously spoken sentences. We propose a simple but effective method to extend training to the sentence level. Figure 2(c) shows the alignment path of the sentence \"C A B\", in which a typical error, the insertion of an 'A', occurred. In a forced alignment mode (i.e. the correct sequence of words is enforced), positive training is applied along the correct path, while the units along the incorrect path receive negative training. Note that the effect of positive and negative training is neutralized where the paths are the same; only differing parts receive non-zero error backpropagation. \n\n2.4 LEARNING CURVES \n\nFigure 3 demonstrates the effect of the various training phases. The system is bootstrapped (a) during iterations 1 to 130. 
Word level training starts (b) at iteration 110. Word level training with additional \"training across word boundaries\" (c) is started at iteration 260. Excerpted word performance is not improved after (c), but continuous recognition becomes significantly better; compare (d) and (e). In (d), sentence level training is started directly after iteration 260, while in (e) sentence level training is started after additional \"across boundaries (word level) training\". \n\nFigure 3: Learning curves (a = bootstrapping, b, c = word level (excerpted words), d, e = sentence level training (continuous speech)) on the training (o), crossvalidation (-) and test set (x) for the speaker-independent RM Spell-Mode data. (Word accuracy, roughly 85 to 95%, plotted over training iterations 0 to 420.) \n\n3 GENDER SPECIFIC SUBNETS \n\nA straightforward approach to building a more specialized system is simply to train two entirely separate networks for male and female speakers. During training, the gender of a speaker is known; during testing it is determined by an additional \"gender identification network\", which is simply another MS-TDNN with two output units representing male and female speakers. Given a sentence as input, this network classifies the speaker's gender approx. 99% correctly. 
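This gender-based dispatch can be sketched as follows; the (frames x 2) score matrix, the frame-averaging rule and all names are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def classify_gender(gender_scores):
    # gender_scores: (n_frames, 2) outputs of the gender identification
    # net, column 0 = male, column 1 = female; average over the sentence.
    mean = gender_scores.mean(axis=0)
    return 'male' if mean[0] >= mean[1] else 'female'

def recognize(sentence, gender_net, male_net, female_net):
    # route the spelled sentence to the recognizer for the detected gender
    net = male_net if classify_gender(gender_net(sentence)) == 'male' else female_net
    return net(sentence)
```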
The overall modularized network improved the word accuracy from 90.8% (for the \"pooled\" net, see table 1) to 91.3%. However, a hybrid approach with specialized gender-specific connections at the lower, input level and shared connections for the remaining net worked even better. As depicted in figure 4, in this architecture the gender identification network selects one of the two gender-specific bundles of connections between the input and hidden layer. This technique improved the word accuracy to 92.0%. More experiments with speaker-specific subnetworks are reported in [HW93]. \n\nFigure 4: A network architecture with gender-specific and shared connections. Only the front-end TDNN (Input Layer with FFT coefficients, Hidden Layer, Phoneme Layer) is shown. \n\n4 EXPERIMENTAL RESULTS \n\nOur MS-TDNN achieved excellent performance on both speaker dependent and independent tasks. For speaker dependent testing, we used the \"CMU Alph-Data\", with 1000 sentences (i.e. continuously spelled strings of letters) from each of 3 male and 3 female speakers. 500, 100, and 400 sentences were used as training, cross-validation and test set, respectively. The DARPA Resource Management Spell-Mode Data were used for speaker independent testing. This database contains about 1700 sentences, spelled by 85 male and 35 female speakers. The speech of 7 male and 4 female speakers was set aside for the test set; one sentence from all 109 and all sentences from 6 training speakers were used for crossvalidation. Table 1 summarizes our results. With the help of the training techniques described above we were able to outperform previously reported [HFW91] speaker dependent results as well as the HMM-based SPHINX system. 
\n\n5  SUMMARY AND  FUTURE WORK \n\nWe  have  presented  a  connectionist  speech  recognition  system  for  high  accuracy \nconnected  letter recognition.  New  training techniques aimed at improving sentence \nlevel  recognition enabled our MS-TDNN to outperform previous systems of its own \nkind as well as a state-of-the art HMM-based system (SPHINX).  Beyond the gender \nspecific subnets,  we  are experimenting with an  MS-TDNN  which  maintains several \n\"internal speaker models\"  for  a  more sophisticated speaker-independent system.  In \nthe future  we  will  also experiment  with  context  dependent  phoneme models. \n\nAcknowledgements \n\nThe  authors  gratefully  acknowledge  support  by  the  National  Science  Foundation \nand  DARPA.  We  wish  to  thank  Joe  Tebelskis  for  insightful  discussions,  Arthur \nMcNair for  keeping our machines running,  and especially  Patrick Haffner.  Many  of \nthe ideas presented have  been  developed  in collaboration with him. \n\nReferences \n\n[CFGJ91] \n\nR.  A.  Cole,  M.  Fanty,  Gopalakrishnan,  and  R.  D.T.  Janssen.  Speaker(cid:173)\nIndependent Name Retrival from Spellings  Using a Database of 50,000 Names. \n\n\fConnected Letter  Recognition  with  a Multi-State Time Delay  Neural  Network \n\n719 \n\nSpeaker Dependent  (eMU Alph Data) \n\n500/2500 train,  100/500 crossvalidation, 400/2000 test  sentences/words \n\n96.0 \n83.9 \n\nSPHINX[HFW91]  MS-TDNN[HFW91] \n\nspeaker \n98.5 \nrnjrnt \n91.1 \nrndbs \n94.6 \nrnaern \n98.8 \nfcaw \n86.9 \nflgt \n91.0 \nfee \nSpeaker Independent  (Resource Management  Spell-Mode) \n\n-\n-\n-\n-\n\n97.5 \n89.7 \n\n-\n-\n-\n-\n\nour \n\nMC:_'T'nNN \n\n109  (ca.  11000)  train,  11  (ca.  900)  test  speaker  (words). \nour  MS-TDNN \nSPHINX[HH92] \n\n+ Sen one \n\n88.7 \n\n90.4 \n\n90.8 \n\ngender  specific \n\n92.0 \n\nTable  1:  Word  accuracy  (in  % on  the  test  sets)  on speaker  dependent  and speaker \nindependent connected  letter  tasks. 
\n\n[FC90] \n\n[Haf92] \n\n[HFW91] \n\n[HH92] \n\n[HW90] \n\n[HW92] \n\n[HW93] \n\n[JFC90] \n\n[Ney84] \n\nIn  Proceedings  of the  International Conference  on Acoustics,  Speech  and Sig(cid:173)\nnal Processing, Toronto,  Ontario,  Canada,  May  1991.  IEEE. \nM.  Fanty and R.  Cole.  Spoken letter recognition.  In  Proceedings  of the  Neural \nInformation  Processing Systems  Conference  NIPS,  Denver,  November  1990. \nP.  Haffner.  Connectionist  Word-Level  Classification  in  Speech  Recognition. \nIn  Proc.  IEEE  International  Conference  on  Acoustics,  Speech,  and  Signal \nProcessing.  IEEE,  1992. \nP.  Haffner,  M.  Franzini,  and  A.  Waibel.  Integrating  Time  Alignment  and \nNeural  Networks  for  High  Performance  Continuous  Speech  Recognition.  In \nProc.  Int.  Conf.  on Acoustics,  Speech,  and Signal Processing.  IEEE,  1991. \nM.-Y.  Hwang  and  X.  Huang.  Subphonetic  Modeling  with  Markov  States -\nSenone.  In  Proc.  IEEE  International  Conference  on  Acoustics,  Speech,  and \nSignal Processing, pages 133  - 137.  IEEE,  1992. \nJ.  Hampshire  and  A.  Waibel.  A  Novel  Objective  Function  for  Improved \nPhoneme  Recognition  Using  Time  Delay  Neural  Networks.  IEEE  Transac(cid:173)\ntions  on  Neural  Networks,  June 1990. \nP.  Haffner  and  A.  Waibel.  Multi-state Time Delay  Neural  Networks for  Con(cid:173)\ntinuous  Speech  Recognition.  In  NIPS(4). Morgan  Kaufman,  1992. \nH.  Hild  and A.  Waibel.  Multi-SpeakerjSpeaker-Independent Architectures for \nthe  Multi-State  Time  Delay  Neural  Network.  In  Proc.  IEEE  International \nConference  on  Acoustics,  Speech,  and Signal Processing.  IEEE,  1993. \nR.D.T.  Jansen,  M.  Fanty,  and  R.  A.  Cole.  Speaker-independent  Phonetic \nClassification  in  Continuous  English  Letters.  In  Proceedings  of the  JJCNN \n90,  Washington D.C.,  July  1990. \nH.  Ney.  
The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, pages 263-271. IEEE, April 1984. \n\n[WHH+89] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech and Signal Processing, March 1989. \n", "award": [], "sourceid": 620, "authors": [{"given_name": "Hermann", "family_name": "Hild", "institution": null}, {"given_name": "Alex", "family_name": "Waibel", "institution": null}]}