{"title": "The Tempo 2 Algorithm: Adjusting Time-Delays By Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 155, "page_last": 161, "abstract": null, "full_text": "The Tempo 2 Algorithm: Adjusting Time-Delays By Supervised Learning \n\nUlrich Bodenhausen and Alex Waibel \n\nSchool of Computer Science \nCarnegie Mellon University \nPittsburgh, PA 15213 \n\nAbstract \n\nIn this work we describe a new method that automatically adjusts time-delays and the widths of time-windows in artificial neural networks. The input of each unit is weighted by a gaussian input-window over time, which allows the learning rules for the delays and widths to be derived in the same way as those for the weights. Our results on a phoneme classification task compare well with results obtained with the TDNN by Waibel et al., which was manually optimized for the same task. \n\n1 INTRODUCTION \n\nThe processing of pattern-sequences has been investigated with several neural network architectures. One approach to processing temporal context with neural networks is to implement time-delays. This approach is neurophysiologically plausible, because real axons have a limited conduction speed (which depends on the diameter of the axon and on whether it is myelinated). Additionally, the length of most axons is much greater than the euclidean distance between the connected neurons. This leads to a great variety of different time-delays in the brain. Artificial networks that make use of time-delays have been suggested [10, 11, 12, 8, 2, 3]. \n\nIn the TDNN [11, 12] and most other artificial neural networks with time-delays the delays are implemented as hat-shaped input-windows over time. A unit j that is connected with unit i by a connection with delay n only receives information about the activity of unit i n time-steps ago. 
A set of connections with consecutive time-delays is used to let each unit gather a certain amount of temporal context. In these networks, weights are automatically trained but the architecture of the network (time-delays, number of connections and number of units) has to be predetermined by laborious experiments [8, 6]. \n\nIn this work we describe a new algorithm that adjusts time-delays and the width of the input-window automatically. The learning rules require input-windows over time that can be described by a smooth function. With these input-windows it is possible to derive learning rules for adjusting the center and the width of the window. During training, new connections are added if they are needed by splitting already existing connections and training them independently. \n\nAdaptive time-delays in neural networks could have significant advantages for the processing of pattern-sequences, especially if the relevant information is distributed across non-consecutive patterns. A typical example of this kind of pattern sequence is a rhythm (relevant in music and speech). In a rhythm, there are many events but also many gaps between these events. Another example is speech, where some parts of an utterance are more important for understanding than others (example: 'hat', 'fat', 'cat', ...). Therefore a network that allocates existing and new resources to the parts of the input sequence that are most helpful for the task could be more compact and efficient for various tasks. \n\n2 THE TEMPO 2 NETWORK \n\nThe Tempo 2 network is an artificial neural network with adaptive weights, adaptive time-delays and adaptive widths of gaussian input windows over time. It is a generalization of the Back-Propagation network proposed by Rumelhart, Hinton and Williams [9]. 
The network is based on some ideas that were tested with the Tempo 1 network [2, 3]. \n\nThe Tempo 2 network is designed to learn about the relevant temporal context during training. A unit in the network is activated by input from a gaussian-shaped input-window centered around (t-d) with standard deviation $\\sigma$, where d (the time-delay) and $\\sigma$ (the width of the input-window) are to be learned (see footnote 1 and Fig. 1 and 2). This means that the center and the width of each input-window can be adjusted by learning rules. The adaptive time-delays allow the processing of temporal context that is distributed across several non-consecutive patterns of the sequence. The adaptive width of the window enables the receiving unit to monitor a variable sequence of consecutive activations over time of each sending unit. New connections can be added if they are needed (see section 2.1). The input of unit j at time t, $x(t)_j$, is \n\n$$x(t)_j = \\sum_{\\tau=0}^{t} \\sum_{k} y_k(\\tau)\\, G(\\tau, t, d_{jk}, \\sigma_{jk})\\, w_{jk}$$ \n\nwith $G(\\tau, t, d_{jk}, \\sigma_{jk})$ representing the gaussian input window given by \n\n$$G(\\tau, t, d_{jk}, \\sigma_{jk}) = \\frac{1}{\\sqrt{2\\pi}\\,\\sigma_{jk}}\\, e^{-(\\tau - t + d_{jk})^2 / 2\\sigma_{jk}^2}$$ \n\nwhere $y_k$ is the output of the previous sending unit and $w_{jk}$, $d_{jk}$ and $\\sigma_{jk}$ are the weights, delays and widths on its connections, respectively. \n\nThis approach is partly motivated by neurophysiology and mathematics. In the brain, a spike that is sent by a neuron via an axon is not received as a spike by the receiving cell. \n\nFootnote 1: Other windows are possible. The function describing the shape of the window has to be smooth. \n\nFigure 1: The input to one unit in the Tempo 2 network. The boxes represent the activations of the sending units; a tall box represents a high activity. 
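\n\nAs a worked illustration of the gaussian input window (the values d = 3 and $\\sigma$ = 1 or 2 are assumptions chosen for illustration and do not come from the experiments), write $u = \\tau - t + d$ for the offset of an activation from the window center and take the window weight to be $G(u) = \\frac{1}{\\sqrt{2\\pi}\\,\\sigma} e^{-u^2 / 2\\sigma^2}$. A sending unit that is active only at the window center, $\\tau = t - 3$, and the same activation arriving one time-step late, $\\tau = t - 2$, are then weighted as follows: \n\n$$\\sigma = 1: \\quad G(0) = \\frac{1}{\\sqrt{2\\pi}} \\approx 0.399, \\qquad G(1) = \\frac{e^{-1/2}}{\\sqrt{2\\pi}} \\approx 0.242$$ \n\n$$\\sigma = 2: \\quad G(0) = \\frac{1}{2\\sqrt{2\\pi}} \\approx 0.199, \\qquad G(1) = \\frac{e^{-1/8}}{2\\sqrt{2\\pi}} \\approx 0.176$$ \n\nIn this sketch a one-step shift keeps about 61% of the centered contribution for the narrow window and about 88% for the wide one, while the wide window weights its center only half as strongly. 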
\n\nRather, the postsynaptic potential has a short rise and a long tail. Let us assume a situation with two neurons. Neuron A fires at time t-d, where d is the time that the signal needs to travel along the connection and to activate neuron B. Neuron B is activated mostly at time t, but the postsynaptic potential will decrease slowly and neuron B will get some input at time t+1, some smaller input at time t+2 and so on. Functionally, a spike is smeared over time and this provides some \"local memory\". \n\nIn our simulations we model this behavior by allowing the receiving unit to be activated by the weighted sum of activations around an input centered at time t-d. If the sending unit (\"neuron A\") was activated at time t-d, then the receiving unit (\"neuron B\") will be activated mostly at time t, will be less activated at time t+1, and so on. In our case, the input-window function also allows the receiving unit to be (less) activated at times t-1, t-2, etc. This enables us to formulate a learning rule that can both increase and decrease time-delays. \n\nThe gaussian input-window has the advantage that it provides some robustness against temporally misaligned input tokens. Fig. 2 shows that small misalignments of the input signal do not change the input of the receiving unit significantly. The robustness depends on the width of the window. A wide window would therefore make the input of the receiving unit more robust against signals shifted in time, but would also reduce the time-resolution of the unit. This suggests the implementation of a learning rule that adjusts the width of the input-window of each connection. \n\nWith this gaussian input-window, it is possible to compute how the input of unit j would change if the delay of a connection or the width of the input window were changed. 
The formalism is the same as for the derivation of the learning rules for the weights in a standard Back-Propagation network. The change of a delay is proportional to the derivative of the output error with respect to the delay. The change of a width is proportional to the derivative of the error with respect to the width. The error at the output is propagated back to the hidden layer. The learning rules for weights $w_{ji}$, delays $d_{ji}$ and widths $\\sigma_{ji}$ were derived from \n\n$$\\Delta w_{ji} = -\\epsilon_1 \\frac{\\partial E}{\\partial w_{ji}}, \\qquad \\Delta d_{ji} = -\\epsilon_2 \\frac{\\partial E}{\\partial d_{ji}}, \\qquad \\Delta \\sigma_{ji} = -\\epsilon_3 \\frac{\\partial E}{\\partial \\sigma_{ji}}$$ \n\nwhere $\\epsilon_1$, $\\epsilon_2$ and $\\epsilon_3$ are the learning rates and E is the error. As in the derivation of the standard Back-Propagation learning rules, the chain rule is applied (z = w, d, $\\sigma$): \n\n$$\\frac{\\partial E}{\\partial z_{ji}} = \\frac{\\partial E}{\\partial x(t)_j} \\, \\frac{\\partial x(t)_j}{\\partial z_{ji}}$$ \n\nwhere $\\partial E / \\partial x(t)_j$ is the same in the learning rules for weights, delays and widths. The partial derivatives of the input with respect to the parameters of the connections are obtained by differentiating the input $x(t)_j$ with respect to $w_{ji}$, $d_{ji}$ and $\\sigma_{ji}$. \n\n[Figure 2, panels: Adjusting the delays (derivative positive -> increase delay -> move window left); Adjusting the width of the windows (derivative with respect to $\\sigma$, areas A and B)] \n\nFigure 2: A graphical explanation of the learning rules for delays and widths. The derivative of the gaussian input-window with respect to time is used for adjusting the time-delays. The derivative with respect to $\\sigma$ (dotted line) is used for adjusting the width of the window. A majority of activation in area A will cause the window to grow. A majority of activity in area B will cause the window to shrink. \n\nFigure 3: Splitting of a connection. 
The dotted line represents the \"old\" window and the solid lines represent the two windows after splitting. \n\n2.1 ADDING NEW CONNECTIONS \n\nLearning algorithms for neural networks that add hidden units have recently been proposed [4, 5]. In our network, connections are added to the already existing ones in a way similar to the one used by Hanson for adding units [5]. At the start of learning, the network has one connection between two units. Depending on the task this may be insufficient, and it would be desirable to add new connections where more connections are needed. New connections are added by splitting already existing connections and afterwards training them independently (see Fig. 3). The rule for splitting a connection is motivated by observations during training runs. It was observed that input-windows started moving backwards and forwards (that is, the time-delays changed) after a certain level of performance was reached. This can be interpreted as inconsistent time-delays, which might be caused by temporal variability of certain features in the samples of speech. During training we compute the standard deviations of all delay changes and compare them with a threshold: \n\n$$\\text{if} \\quad \\sum_{\\text{all tokens}} \\left( \\Delta d_{ji}(\\text{token}) - \\frac{\\sum_{\\text{all tokens}} |\\Delta d_{ji}|}{\\#\\text{tokens}} \\right)^2 > \\text{threshold}$$ \n\nthen split connection ji. \n\n3 SIMULATIONS \n\nThe Tempo 2 network was initially tested with rhythm classification. The results were encouraging and evaluation was carried out on a phoneme classification task. In this application, adaptive delays can help to find important cues in a sample of speech. Units should not accumulate information from irrelevant parts of the phonemes. Rather, they should look at parts within the phonemes that provide the most important information for the kind of feature extraction that is needed for the classification task. 
The network was trained on the phonemes /b/, /d/ and /g/ from a single speaker. 783 tokens were used for training and 759 tokens were used for testing. \n\nadaptive parameters | constant parameters | Training Set | Testing Set \nweights | delays, widths | 93.2% | 89.3% \ndelays | weights, widths | 64.0% | 63.0% \nwidths | weights, delays | 63.5% | 61.8% \ndelays, widths | weights | 70.0% | 68.6% \nweights, delays | widths | 98.3% | 97.8% \nweights, delays, widths | - | 98.8% | 98.0% \n\nTable 1: /b/, /d/ and /g/ classification performance with 8 hidden units in one hidden layer. The network is initialized with random weights and constant widths. \n\nIn order to evaluate the usefulness of each adaptive parameter, the network was trained and tested with a variety of combinations of constant and adaptive parameters (see Table 1). In all cases the network was initialized with random weights and delays and constant widths $\\sigma$ of the input windows. All results were obtained with 8 hidden units in one hidden layer. \n\n4 DISCUSSION \n\nThe TDNN has been shown to be a very powerful approach to phoneme recognition. The fixed time-delays and the kind of time-window were chosen partly because they were motivated by results from earlier studies [1, 7] and because they were successful from an engineering point of view. The architecture was optimized for the recognition of the phonemes /b/, /d/ and /g/ and could be applied to other phonemes without significant changes. In this study we explored the performance of an artificial neural network that can automatically learn its own architecture by learning time-delays and widths of the gaussian input windows. The learning rules for the time-delays and the width of the windows were derived in the same way that has been shown successful for the derivation of learning rules for weights. 
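\n\nThe derivation mentioned above can be sketched explicitly. This is a reconstruction under stated assumptions rather than a quotation of the original equations: it assumes the input $x(t)_j = \\sum_{\\tau} \\sum_i y_i(\\tau)\\, G_{ji}(\\tau)\\, w_{ji}$ with the normalized gaussian window $G_{ji}(\\tau) = \\frac{1}{\\sqrt{2\\pi}\\,\\sigma_{ji}} e^{-u^2 / 2\\sigma_{ji}^2}$ and the shorthand $u = \\tau - t + d_{ji}$, which is introduced here only for brevity. Differentiating with respect to each connection parameter gives \n\n$$\\frac{\\partial x(t)_j}{\\partial w_{ji}} = \\sum_{\\tau} y_i(\\tau)\\, G_{ji}(\\tau)$$ \n\n$$\\frac{\\partial x(t)_j}{\\partial d_{ji}} = -\\, w_{ji} \\sum_{\\tau} y_i(\\tau)\\, \\frac{u}{\\sigma_{ji}^2}\\, G_{ji}(\\tau)$$ \n\n$$\\frac{\\partial x(t)_j}{\\partial \\sigma_{ji}} = w_{ji} \\sum_{\\tau} y_i(\\tau)\\, \\frac{u^2 - \\sigma_{ji}^2}{\\sigma_{ji}^3}\\, G_{ji}(\\tau)$$ \n\nUnder these assumptions the width derivative of the window changes sign at $|u| = \\sigma_{ji}$: for a positive weight and positive activations, activity further than one standard deviation from the window center makes the input grow with the width, and activity closer to the center makes it shrink. 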
\n\nOur results show that time-delays in artificial neural networks can be learned automatically. The learning rule proposed in this study is able to improve performance significantly compared to fixed delays if the network is initialized with random delays. \n\nThe width of an input window determines how much local temporal context is captured by a single connection. Additionally, a large window means increased robustness against temporal misalignments of the input tokens. A large window also means that the connection transmits with a low temporal resolution. The learning rule for the widths of the windows has to compromise between increased robustness against misaligned tokens and decreased time-resolution. This is done by a gradient descent method. \n\nIf the network is initialized with the same widths that are used for the training runs with constant widths, 70 - 80% of the windows in the network get smaller during training. Our simulations show that it is possible to let a learning rule adjust parameters that determine the temporal resolution of the network. \n\nThe comparison of the performances with one adaptive parameter set (either weights, delays or widths) shows that the main parameters in the network are the weights. Delays and widths seem to be of lesser importance, but in combination with the weights the delays can improve the performance, especially generalization. A Tempo 2 network with trained delays and widths and random weights can classify 70% of the phonemes correctly. This suggests that learning temporal parameters is effective. \n\nThe network achieves results comparable to a similar network with a hand-tuned architecture. 
This suggests that this kind of learning rule could be helpful in applying time-delay neural networks to problems where no knowledge about optimal time windows is available. At higher levels of processing such adaptive networks could be used to learn rhythmic (prosodic) relationships in fluent speech and other tasks. \n\nAcknowledgements \n\nThe authors gratefully acknowledge the support by the McDonnell-Pew Foundation (Cognitive Neuroscience Program) and ATR Interpreting Telephony Research Laboratories. \n\nReferences \n\n[1] S. E. Blumstein and K. N. Stevens. Perceptual Invariance And Onset Spectra For Stop Consonants In Different Vowel Environments. Journal of the Acoustical Society of America, 67:648-662, 1980. \n\n[2] U. Bodenhausen. The Tempo Algorithm: Learning In A Neural Network With Adaptive Time-Delays. In Proceedings of the IJCNN 90, Washington D.C., January 1990. \n\n[3] U. Bodenhausen. Learning Internal Representations Of Pattern Sequences In A Neural Network With Adaptive Time-Delays. In Proceedings of the IJCNN 90, San Diego, June 1990. \n\n[4] S. Fahlman and C. Lebiere. The Cascade-Correlation Learning Architecture. In Advances in Neural Information Processing Systems. Morgan Kaufmann, 1990. \n\n[5] S. J. Hanson. Meiosis Networks. In Advances in Neural Information Processing Systems. Morgan Kaufmann, 1990. \n\n[6] C. E. Kamm. Effects Of Neural Network Input Span On Phoneme Classification. In Proceedings of the International Joint Conference on Neural Networks, June 1990. \n\n[7] D. Kewley-Port. Time Varying Features As Correlates Of Place Of Articulation In Stop Consonants. Journal of the Acoustical Society of America, 73:322-335, 1983. \n\n[8] K. J. Lang, G. E. Hinton, and A. H. Waibel. A Time-Delay Neural Network Architecture For Speech Recognition. Neural Networks Journal, 1990. \n\n[9] D. E. Rumelhart, G. E. Hinton, and R. J. 
Williams. Learning Internal Representations By Error Propagation. In J. L. McClelland and D. E. Rumelhart, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, chapter 8, pages 318-362. MIT Press, Cambridge, MA, 1986. \n\n[10] D. W. Tank and J. J. Hopfield. Neural Computation By Concentrating Information In Time. In Proceedings National Academy of Sciences, pages 1896-1900, April 1987. \n\n[11] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Acoustics, Speech and Signal Processing, March 1989. \n\n[12] A. Waibel. Modular Construction Of Time-Delay Neural Networks For Speech Recognition. Neural Computation, MIT Press, March 1989. \n", "award": [], "sourceid": 423, "authors": [{"given_name": "Ulrich", "family_name": "Bodenhausen", "institution": null}, {"given_name": "Alex", "family_name": "Waibel", "institution": null}]}