{"title": "Temporal Representations in a Connectionist Speech System", "book": "Advances in Neural Information Processing Systems", "page_first": 240, "page_last": 247, "abstract": null, "full_text": "240 \n\nTEMPORAL REPRESENTATIONS  IN A \nCONNECTIONIST SPEECH  SYSTEM \n\nErich  J.  Smythe \n\n207  Greenmanville  Ave,  #6 \n\nMystic,  CT  06355 \n\nABSTRACT \n\nSYREN  is  a  connectionist model  that uses  temporal  information \nin  a  speech signal  for  syllable  recognition.  It classifies  the  rates \nand directions of formant center transitions,  and uses an adaptive \nmethod  to  associate  transition  events  with  each  syllable.  The \nsystem  uses  explicit  spatial  temporal  representations through  de(cid:173)\nlay  lines.  SYREN  uses  implicit  parametric  temporal  representa(cid:173)\ntions  in  formant  transition  classification  through  node  activation \nonset,  decay,  and transition delays  in sub-networks analogous to \nvisual  motion detector cells.  SYREN  recognizes 79% of six repe(cid:173)\ntitions  of  24  consonant-vowel  syllables  when  tested  on  unseen \ndata,  and  recognizes  100%  of  its  training  syllables. \n\nINTRODUCTION \n\nLiving  organisms exist in a  dynamic  environment.  Problem solving systems,  both \nnatural  and  synthetic,  must  relate  and  interpret  events  that  occur  over  time. \nAlthough connectionist models  are based on metaphors from  the brain,  few  have \nbeen designed  to  capture temporal  and sequential  information  common  to even \nthe most primitive nervous systems.  Yet some of the most popular areas of appli(cid:173)\ncation  of these  models,  including  speech  recognition,  vision,  and motor  control, \nrequire  some  form  of temporal  processing. \n\nThe  variation  in  a  speech  signal  contains  considerable  information.  
Changes in formant location or other acoustic parameters (Delattre, et al., 1955; Pols and Schouten, 1982) can determine the identity of constituents of speech, even when segmentation information is obscure. Speech recognition systems have shown good results when they incorporate some temporal information (Waibel, et al., 1988; Anderson, et al., 1988). Successful speech systems must incorporate temporal processing. \n\nNatural organisms have sensory organs that are continuously updated and can do only limited buffering of input stimuli. Synthetic implementations can buffer their input, transforming time into space. Often the size and complexity of the input representations place limits on the amount of input that can be buffered, especially when data is coming from hundreds or thousands of sensors, and other methods must be found to integrate temporal information. \n\nThis paper describes SYREN (SYllable REcognition Network), a connectionist network that incorporates various temporal representations for consonant-vowel (CV) syllable recognition by the classification of formant center transitions. Input is presented sequentially, one time slice at a time. The network is described, including the temporal processing used in formant transition classification, learning, and syllable recognition. The results of syllable recognition experiments are discussed in the final section. \n\nTEMPORAL REPRESENTATIONS \n\nVarious types of temporal representations may be used to incorporate time in connectionist models. They range from explicit spatial representations, where time is converted into space, to implicit parametric representations, where time is incorporated using network computational parameters.
Spatiotemporal representations are a middle ground combining the two extremes. The categories represent a continuum rather than absolute distinctions. Several of these types are found in SYREN. \n\nEXPLICIT SPATIAL REPRESENTATIONS \n\nIn a purely spatial representation temporal information is preserved by spreading time steps over space through the network topology. These representations include input buffers, delay lines, and recurrent networks. \n\nFixed input buffers allow interaction between time slices of input. Parts of the network are copied to represent states at particular time slices. Other methods use sliding input buffers in the form of a queue. Tapped delay lines and delay filters are means of spreading network node activations over time. Composed of chains of network nodes or delay functions, they can preserve the sequential structure of information. A value on a connection from a delay line represents events that have occurred in the past. Delay lines and filters have been used in speech recognition systems by Waibel, et al. (1988), and Tank and Hopfield (1987). \n\nRecurrent networks are similar to delay lines in that information is preserved by propagating activation through the network. They can store information indefinitely or generate potentially infinite sequences of behaviors through feedback cycles, whereas delay lines without cycles are limited by their fixed length. Recurrent networks pose problems for learning, although researchers are working on recurrent back propagation networks (Jordan, 1986). \n\nSpatial representations are good for explicitly preserving sequences of events, and can simplify the learning of temporal patterns.
Resource constraints place a limit on the size of fixed length buffers and delay lines, however. Input data from thousands of sensors place limits on the length of time represented in the buffer, and the buffer may not be able to retain information long enough to be of use. Fixed input buffers may introduce edge effects. Interaction is lost between the edges of the buffer and data from preceding and succeeding buffers unless the input is properly segmented. Long delay lines may be computationally expensive as well. \n\nSPATIOTEMPORAL REPRESENTATIONS \n\nImplicit parametric methods represent time in connectionist models by the behavior of network nodes. State information stored in individual nodes allows more complex activation functions and the accumulation of statistical information. This method may be used to regulate the flow of activation in the network, provide a trace of previous activation, and learn from data separated in time. \n\nAdjusting the parameters of functions such as the interactive activation equation of McClelland and Rumelhart (1982) can control the strength of input, affecting the rate at which activation reaches saturation. This leads to pulse trains used in synchronization. Variations in decay parameters control the duration of an activation trace. \n\nState and statistical information is useful in learning. Eligibility traces from classical conditioning models provide a decaying memory of past connection activation. Temporally weighted averages may be used for weight computations. \n\nSpatiotemporal representations combine implicit parametric representations with explicit spatial representations. These include the regulation of propagation time and pulse trains through parameter adjustment.
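The decay-parameter idea just described can be sketched in a few lines. This is an illustrative example only; the specific decay values are assumptions, not taken from the paper:

```python
def trace(decay, n_steps, initial=1.0):
    # Activation remaining after n_steps of pure exponential decay
    # (no further input), as in an implicit parametric activation trace.
    a = initial
    for _ in range(n_steps):
        a *= (1.0 - decay)
    return a

# A small decay constant yields a long-lived trace; a large one, a short trace.
slow = trace(decay=0.05, n_steps=20)
fast = trace(decay=0.5, n_steps=20)
```

Varying the decay constant is all that distinguishes a node that remembers its input for many time slices from one that forgets it almost immediately.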
Gating behavior that controls the flow of activation through a network is another spatiotemporal method. \n\nSYREN DESCRIPTION \n\nSYREN is a connectionist model that incorporates temporal processing in isolated syllable recognition using formant center transitions. Formant center tracks are presented in 5 ms time slices. Input nodes are updated once per time slice. The network classifies the rates and directions of formant transitions. Transition data are used by an adaptive network to associate transition patterns with syllables. A recognition network uses output of the adaptive network to identify a syllable. Complete details of the system may be found in Smythe (1988). \n\nDATA CORPUS \n\nInput data consist of formant centers from five repetitions of twenty-four consonant-vowel syllables (the stop consonants /b, d, g/ paired with the vowels /i, ey, ih, eh, ae, ah, ou, uu/), and an averaged set of each of the five repetitions, from work performed by Kewley-Port (1982). Each repetition is presented as a binary matrix with a row representing frequency in 20 Hz units and a column representing time in 5 ms slices. The matrix is given to the input units one column at a time. A '1' in a cell of a matrix represents a formant center at a particular frequency during a particular time slice. \n\nFORMANT TRANSITION CLASSIFICATION \n\nIn the first stage of processing SYREN determines the rate and direction of formant center transitions. Formant transition detectors are subnetworks designed to respond to transitions of one of six rates in either rising or falling directions, and also to steady-state events.
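The binary input encoding described in the Data Corpus section can be sketched as follows. The matrix dimensions and the rising formant track are assumptions chosen for illustration; the paper fixes only the units (rows of 20 Hz, columns of 5 ms):

```python
import numpy as np

# Illustrative input matrix: rows are 20 Hz frequency units,
# columns are 5 ms time slices.
n_freq_units = 200                       # frequency rows (assumed count)
n_slices = 40                            # 40 columns = 200 ms of signal

syllable = np.zeros((n_freq_units, n_slices), dtype=np.uint8)

# A hypothetical rising formant track: one '1' per column marks the
# formant center frequency at that time slice.
for t in range(n_slices):
    syllable[min(50 + t, n_freq_units - 1), t] = 1

# The matrix is presented to the input units one column (one slice) at a time.
columns = [syllable[:, t] for t in range(n_slices)]
```

Presenting one column per time slice is what makes the input sequential rather than buffered: the network never sees the whole matrix at once.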
The method used is motivated by a mechanism for visual motion detection in the retina that combines interactions between subunits of a dendritic tree and shunting, veto inhibition (Koch et al., 1982). Formant motion is analogous to visual motion, and formant transitions are treated as a one dimensional case of visual motion. \n\nFigure 1. Formant transition detector subnetwork and its preferred descending transition type. The vertical axis is frequency (one row for each input unit) and the horizontal axis is time in 5 ms slices. \n\nA detector subnetwork for a slow transition is shown in figure 1, along with its preferred transition. Branch nodes are analogous to dendritic subunits, and serve as activation transmission lines. Their activation is computed by the equation: \n\na_i^{t+1} = a_i^t (1 - θ) + net_i^t (1 - a_i^t) \n\nwhere a_i^t is the activation of unit i at time t, net_i^t is the weighted input, t is an update cycle (there are 7 updates per time slice), and θ is a decay constant. Input to a branch node drives the activation to a maximum value, at a rate determined by the strength of the input. In the absence of input the activation decays to 0. \n\nFor the preferred direction, input nodes are activated for two time slices (10 ms) in order from top to bottom. An input node causes the activation of the most distal branch node to rise to a maximum value. This in turn causes the next node to activate, slightly delayed with respect to the first, and so on for the rest of the branch. This results in a pulse of activation flowing along the branch with a transmission delay of roughly one time slice (7 update cycles) from the distal to the proximal end.
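A minimal numerical sketch of this pulse propagation follows. Only the update rule and the 7 update cycles per 5 ms slice come from the text; the decay constant, branch weight, chain length, and drive strength are assumed values:

```python
import numpy as np

THETA = 0.1                # decay constant (assumption)
W = 0.5                    # branch-to-branch weight (assumption)
UPDATES_PER_SLICE = 7      # 7 update cycles per 5 ms time slice

def step(acts, drive):
    # One synchronous update cycle for a distal-to-proximal chain of
    # branch nodes: a' = a(1 - theta) + net(1 - a).
    new = acts.copy()
    for i in range(len(acts)):
        net = drive if i == 0 else W * acts[i - 1]
        new[i] = acts[i] * (1 - THETA) + net * (1 - acts[i])
    return new

acts = np.zeros(4)         # acts[0] is most distal, acts[-1] most proximal
for cycle in range(4 * UPDATES_PER_SLICE):
    # Input node active for the first two time slices (10 ms), then silent.
    drive = 0.8 if cycle < 2 * UPDATES_PER_SLICE else 0.0
    acts = step(acts, drive)

# After the input ends, the distal node has largely decayed while the
# proximal end of the branch still holds the delayed pulse of activation.
```

The same code tuned with a smaller THETA and stronger W would respond to slower transitions, which is the parametric rate-tuning the text describes.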
The most proximal branch node also has a connection to the input node. This connection serves to prime the node for slower transitions. Activation from an input node that lasts for only one time slice will decay in the proximal branch node before the activation from the distal region arrives. If input is present for two time steps the extra activation from the input connection primes the node, quickly driving it to a maximal value when the distal activation arrives. \n\nAn S-node provides the output of the detector. It computes a sigmoid squash function and fires (a sudden increase in activation) when sufficient activation is in the proximal branch nodes. For this particular detector, if the transition is too fast (i.e. one time step for each input unit) the proximal nodes will not attain a high enough activation; if the transition is too slow (i.e. three time steps for each input unit) activation on proximal branch nodes from earlier time steps will have decayed before the transition is complete. This architecture is tuned to a slower transition by increasing the transmission time on the branches by varying the connection weights, and by reducing the decay rate by lowering the decay constant. This illustrates the use of parametric manipulations to control temporal behavior for rate sensitivity. \n\nVeto inhibition is used in this detector for direction selectivity. Veto nodes provide inhibition and are activated by input nodes, and use the interactive activation equation for a decaying memory. Had the transition in figure 1
been in the opposite direction, activation from previous time slices on a veto connection would prevent the input node from activating its distal branch node, preventing the flow of activation and the firing of the S-node. Here a veto connection acts as a gate, serving to select input for processing. \n\nDetectors are constructed for faster transitions by shortening the transmission lines and by using veto connections for rate sensitivity. A transition detector for a faster transition is shown in figure 2. Here the receptive field is larger, and veto connections are used to select transitions that skip one input unit at each time slice. Veto connections are still used for direction selectivity. Detectors for even faster transitions are created by widening the receptive field and increasing the number of veto connections for rate sensitivity. \n\nDetectors are designed to respond to a specific transition type and not to respond to the transitions of other detectors. They will respond to transitions with rates between their own and the next type of detector. For slower transitions the firing of two detectors indicates an intermediate rate. For faster transitions special detectors are designed to fire for only one precise rate by eliminating some of the branches. Different firing patterns of precise and more general detectors distinguish rates. This gives a very fine rate sensitivity throughout the range of transitions. \n\nDetector networks are copied to span the entire frequency range with overlapping receptive fields. This yields an array of S-nodes for each transition type, giving excellent spatial resolution of the frequency range. There are 200 S-nodes for each detector type.
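One way to read the veto mechanism described above is as a multiplicative, shunting gate on the input-to-branch connection. This reading, and the threshold value, are assumptions made for illustration, not the exact network:

```python
def gated_input(input_act, veto_trace, threshold=0.5):
    # Shunting veto inhibition: block the input entirely when the veto
    # node's decaying trace from earlier time slices is strong enough.
    return 0.0 if veto_trace > threshold else input_act

# Preferred direction: no recent activity on the veto connection,
# so the input reaches the distal branch node.
passed = gated_input(0.8, veto_trace=0.1)

# Non-preferred direction: the veto trace left by the previous time
# slice blocks the input, and the S-node never fires.
blocked = gated_input(0.8, veto_trace=0.9)
```

The gate selects which inputs are processed at all, rather than merely reducing their strength, which is what distinguishes veto inhibition from ordinary subtractive inhibition.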
Each S-node signals a transition that starts and ends at a particular frequency unit. \n\nADAPTIVE NETWORK \n\nThe adaptive network learns to associate patterns of formant transitions with specific syllables. To do this it must be able to store at least part of the unfolding patterns or else it is forced to respond to information from only one time slice. \n\nFigure 2. Formant transition detector subnetwork for a faster transition. Only the veto connections used for rate sensitivity are shown. \n\nThe learning algorithm must also deal with past activation histories of connections or else it can only learn from one time slice. The network accomplishes this through tapped delay lines and decaying eligibility traces. \n\nThere are twenty-four nodes in the adaptive network, each assigned to one syllable. It is a single layer network, trained using a hybrid supervised learning algorithm that merges Widrow-Hoff type learning with a classical conditioning model (Sutton and Barto, 1987). \n\nStorage of temporal patterns \n\nTapped delay lines are used to briefly store sequences of formant transition patterns. S-nodes from each transition detector are connected to a tapped delay line of five nodes. Each delay node simply passes on its S-node's activation value once per 5 ms time slice, allowing the delay matrix to store 25 ms (five time slices) of transition patterns. \n\nThe delay matrix consists of delay lines for each transition detector at each receptive field. Adaptive nodes are connected to every node in the delay matrix.
The delay lines do not perform input buffering; information in the delay matrix has been subject to one level of processing. The amount of information stored (the length of the delay line) is limited by efficiency considerations. \n\nAdaptive Algorithm \n\nNodes in the adaptive network compute their activation using a sigmoid squash function and adjust their weights according to the equation: \n\nw_ij^{t+1} = w_ij^t + α(z_i^t - s_i^t) e_j^t \n\nwhere w_ij^t is the weight on the connection from node j to node i at time t, α is a learning constant, z_i is the expected value of node i, s_i is the weighted sum of the connections of node i, and e_j is the exponentially decaying canonical eligibility of connection j. The eligibility constant gives some variation in the exact timing of transition patterns, allowing limited time warping between training and testing. \n\nFINAL RECOGNITION NETWORK \n\nThe adaptive network is not perfect and produces a number of false alarm errors. Many of these are eliminated by using firing patterns of other adaptive nodes. For example, a node that consistently misfires on one syllable could be blocked by the firing of the correct node for that syllable. Adaptive nodes are connected to a veto recognition network. Since an adaptive node may fire at any time (and at different times) throughout input presentation, delay lines are used to preserve patterns of adaptive node behavior, and veto inhibition is used to block false alarms. Connections in the veto network are enabled or disabled after training. Clearly this is an ad hoc solution, but it suggests the use of representations that are distributed both spatially and temporally.
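The weight-update rule from the Adaptive Algorithm section above can be sketched with a single adaptive node and one static input pattern. The learning constant, eligibility decay, and toy pattern are assumptions; in the paper the eligibility is driven each time slice by the delay matrix rather than by a fixed pattern:

```python
import numpy as np

ALPHA = 0.01       # learning constant (assumption)
LAMBDA = 0.8       # eligibility decay per time slice (assumption)

rng = np.random.default_rng(0)
n_inputs = 10
w = np.zeros(n_inputs)          # weights into one adaptive node
elig = np.zeros(n_inputs)       # decaying eligibility trace per connection

# Toy binary transition pattern standing in for delay-matrix activity.
x = (rng.random(n_inputs) > 0.5).astype(float)

for _ in range(200):
    elig = LAMBDA * elig + x    # decaying memory of connection activity
    s = float(w @ x)            # weighted sum of the node's connections
    z = 1.0                     # expected value: the node should fire
    w += ALPHA * (z - s) * elig # Widrow-Hoff error times eligibility

s_final = float(w @ x)          # approaches the expected value z
```

Because the eligibility trace decays over several slices rather than vanishing each step, connections active shortly before the teaching signal still receive credit, which is what allows the limited time warping between training and testing mentioned above.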
\n\nRESULTS AND DISCUSSION \n\nIn each experiment syllable repetitions were divided into mutually exclusive training and testing sets. A training cycle consisted of one presentation of each member of the training set. In both experiments the networks were trained until adequate performance was achieved, usually after four to ten training cycles. \n\nIn the first experiment the network was trained on the five raw repetitions and tested on the averaged set. It achieved 92% recognition on the testing set and 100% recognition on the training set. The network had two miss errors on the testing set. \n\nIn the second experiment, the network was trained on four of the raw repetitions and tested on the fifth. Five separate training runs were performed to test the network on each repetition. The network achieved 76% recognition on the testing set over all training runs, and 100% recognition on the training set. \n\nIn all experiments most of the adaptive nodes responded when there was transition information in the delay matrix. Many responded when both transition and steady-state information was present, using clues from both the consonant and the vowel. This situation occurs only briefly for each formant, since the delay matrix holds information for 5 time slices, and it takes four time slices to signal a steady-state event. Transition information will be at the end of the delay matrix while steady-state is at the beginning. Many nodes were strongly inhibited in the absence of transition information even for their correct syllable, although they had fired earlier in the data presentation.
\n\nCONCLUSIONS \n\nWe have shown how different temporal representations and processing methods are used in a connectionist model for syllable recognition. Hybrid connectionist architectures with only slightly more elaborate processing methods can classify acoustic motion and associate sequences of transition events with syllables. The system is not designed as a general speech recognition system, especially since the accurate measurement of formant center frequencies is impractical. Other signal processing techniques, such as spectral peak estimation, can be used without changes in the architecture. This could provide information to a larger speech recognition system. \n\nSYREN was influenced by a neurophysiological model for visual motion detection, and shows how knowledge from one processing modality is applied to other problems. The merging of ideas from real nervous systems with existing techniques can add to the connectionist tool kit, resulting in more powerful processing systems. \n\nAcknowledgments \n\nThis research was performed at the Indiana University Computer Science Department as part of the author's Ph.D. thesis. The author would like to thank committee members John Barnden and Robert Port for their help and direction, and Donald Lee and Peter Brodeur for their assistance in preparing the manuscript. \n\nReferences \n\nDelattre, P. C., Liberman, A. M., Cooper, F. S., 1955, \"Acoustic loci and transitional cues for stop consonants,\" J. Acous. Soc. Am., 27, 769-773. \n\nJordan, M. I., 1986, \"Serial order: A parallel distributed processing approach,\" ICS Report 8604, UCSD, San Diego.
\n\nKewley-Port, D., 1982, \"Measurement of formant transitions in naturally produced consonant-vowel syllables,\" J. Acous. Soc. Am., 72, 379-389. \n\nKoch, C., Poggio, T., Torre, V., 1982, \"Retinal ganglion cells: A functional interpretation of dendritic morphology,\" Phil. Trans. R. Soc. Lon.: Series B, 298, 227-264. \n\nMcClelland, J. L., Rumelhart, D. E., 1982, \"An interactive activation model of context effects in letter perception,\" Psychological Review, 88, 375-407. \n\nPols, L. C. W., Schouten, M. F. H., 1982, \"Perceptual relevance of coarticulation,\" in: Carlson, R., and Granström, B., The Representation of Speech in the Peripheral Auditory System, Elsevier, 203-208. \n\nAnderson, S., Merrill, J. W. L., Port, R., 1988, \"Dynamic speech characterization with recurrent networks,\" Indiana University Dept. of Computer Science TR No. 258, Bloomington, IN. \n\nSmythe, E. J., 1988, \"Temporal computation in connectionist models,\" Indiana University Dept. of Computer Science TR No. 251, Bloomington, IN. \n\nSutton, R. S., Barto, A. G., 1987, \"A temporal difference model of classical conditioning,\" GTE TR87-509.2. \n\nTank, D. W., Hopfield, J. J., 1987, \"Concentrating information in time,\" Proceedings of the IEEE Conference on Neural Networks, San Diego, IV-455-468. \n\nWaibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K., 1988, \"Phoneme recognition: Neural networks vs. Hidden Markov Models,\" Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 107-110."}, "award": [], "sourceid": 143, "authors": [{"given_name": "Erich", "family_name": "Smythe", "institution": null}]}