{"title": "Physiologically Based Speech Synthesis", "book": "Advances in Neural Information Processing Systems", "page_first": 658, "page_last": 665, "abstract": null, "full_text": "Physiologically Based Speech Synthesis\n\nMakoto Hirayama\n\n\u2020ATR Human Information Processing Research Laboratories\n2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02 Japan\n\n\u2021ATR Auditory and Visual Perception Research Laboratories\n\nEric Vatikiotis-Bateson\n\nKiyoshi Honda\u2021\n\nYasuharu Koike\u2020\n\nMitsuo Kawato\u2020*\n\nAbstract\n\nThis study demonstrates a paradigm for modeling speech production based on neural networks. Using physiological data from speech utterances, a neural network learns the forward dynamics relating motor commands to muscles and the ensuing articulator behavior that allows articulator trajectories to be generated from motor commands constrained by phoneme input strings and global performance parameters. From these movement trajectories, a second neural network generates PARCOR parameters that are then used to synthesize the speech acoustics.\n\n1 INTRODUCTION\n\nOur group has attempted to model speech production computationally as a process in which linguistic intentions are realized as speech through a causal succession of patterned behavior. Our aim is to gain insight into the cognitive and neurophysiological mechanisms governing this complex skilled behavior as well as to provide plausible models of speech synthesis and possibly recognition based on the physiology of speech production. 
It is the use of physiological data (EMG) representing motor commands to muscles that distinguishes our modeling effort from those of others who use neural networks for articulation-based synthesis and/or inference of the dynamical constraints on speech motor control (Jordan, 1986; Jordan, 1990; Bailly, Laboissiere, and Schwarz, 1992; Saltzman, 1986; Bengio, Houde, and Jordan, 1992). This paper reports two areas in which implementation of the speech production scheme shown in Figure 1 has progressed. Initially, we concentrated on modeling the dynamics underlying articulation so that phoneme strings can specify motor commands to muscles, which then specify phoneme-specific articulator behavior (Hirayama, Vatikiotis-Bateson, Kawato, and Jordan, 1992). A neural network learned the forward dynamics relating motor commands to muscles and the ensuing articulator behavior associated with prosodically intact, but phonemically simplified, reiterant speech utterances. Then, a cascade neural network (Kawato, Maeda, Uno, and Suzuki, 1990) containing the forward dynamics model along with a suitable smoothness criterion (Uno, Kawato, and Suzuki, 1989) was used to produce continuous motor commands from a sequence of discrete articulatory targets corresponding to the phoneme input string. From this sequence of motor commands, appropriate articulator trajectories were then generated.\n\n* Also, Laboratory of Parallel Distributed Processing, Research Institute for Electronic Science, Hokkaido University, Sapporo, Hokkaido 060, Japan
\n\n[Figure 1 diagram. Recoverable labels: Intention to Speak; Intended Phoneme Sequence; Global Performance Parameters; Articulator Movement.]\n\nFigure 1: Conceptual scheme of speech production\n\nAlthough the results of this early work were encouraging, there were two technical limitations obstructing our effort to model real speech. First, using optoelectronic transduction techniques, only simple speech samples whose primary articulators were the lips and jaw could be recorded, hence the use of reiterant ba. Without dynamic tongue data, real speech could not be modeled. Also, the reiterant paradigm introduced a degree of rhythmical movement behavior not observed in real speech. The second limitation was that activity of only four muscles and generally only one dimension of articulator motion could be recorded simultaneously. Thus, agonist-antagonist muscle activity was not represented even for this limited set of articulators. Technical improvements in data acquisition and their consequences for the subsequent dynamical modeling of real speech are presented in the next two sections. The second area of progress has been to implement the transform from model-generated articulator trajectories to acoustic output. A neural network is used to acquire the mapping between articulation and acoustics in terms of PARCOR parameters (Itakura and Saito, 1969), which are correlated with vocal tract area functions. Speech signals are then generated using a PARCOR synthesizer from articulator input and appropriate glottal sources (currently, the residual of the PARCOR analysis). The results of this modeling for real and reiterant speech are reported in the final section of the paper. 
\n\n2 EMPIRICAL DEVELOPMENTS\n\nIn order to acquire data more suitable for real speech modeling, two additional experiments were run in which articulator position, EMG, and acoustic data were recorded while the same subject produced real and reiterant speech utterances 5-8 seconds long at different speaking rates and styles (e.g., casual vs. precise). In the first of these, a sophisticated optoelectronic device, OPTOTRAK (Northern Digital, Inc.), was used because it permitted simultaneous recording of numerous 3D articulator positions for the lips, jaw, and head, ten EMG channels, the speech acoustics, and even dynamic tongue-palate contact patterns. These data were used for modeling of the forward dynamics (see Figure 2) and the forward acoustics. Real speech utterances collected with this system were heavily loaded with labial stops, /p,b,m/, and labiodental fricatives, /f,v/, as well as many low vowels, /a, ae/. Since surface EMG was used, it was difficult to obtain reliable recordings of jaw opening (anterior belly of the digastric) and closing (medial pterygoid) muscles. More recently, an electromagnetic position tracking system, EMMA (Perkell, Cohen, Svirsky, Matthies, Garabieta, and Jackson, 1992), was used to transduce midsagittal motions of the tongue tip and tongue blade as well as the lips, jaw, and head. Data were collected for the same speech utterances used in the OPTOTRAK and original experiments as well as more natural utterances. Reiterant speech was also recorded for ta. For this experiment, surface and hooked-wire EMG techniques were combined, which enabled nine orofacial and extrinsic tongue muscles to be recorded for jaw opening and closing, lip opening and closing, and tongue raising and lowering. 
The most important aspects of the signal processing for modeling the forward dynamics concern the numerical differentiation of articulator position to obtain velocity and acceleration, and the severe low-pass filtering (including rectification and integration) of the EMG from 2000 Hz to 20-40 Hz. Both of these introduce spatiotemporal distortions, whose effects on the forward dynamics model are currently being examined.\n\n3 MODELING THE FORWARD DYNAMICS\n\nThe forward dynamics model was obtained using a 3-layer perceptron with back propagation (Rumelhart, Hinton, and Williams, 1986). Inputs to the network were instantaneous position and velocity for each dimension of articulator motion, and the EMG signals of 9-10 related muscles, which serve as the record of motor commands to muscles; outputs were accelerations for each dimension of motion. Figure 2 shows an example of predicting lip and jaw accelerations from 10 orofacial muscles for the 'natural' test utterance, \"Pam put the bobbin in the frying pan and added more puppy parts to the boiling potato soup.\" As shown by the generalization results in Figure 2, the acquired model produced appropriate acceleration trajectories for real speech utterances, suggesting that utterance complexity is not a limiting factor in this approach.\n\n[Figure 2 plot. Legend: Network Output, Experimental Data. Traces: Upper Lip, Lower Lip, Jaw. Horizontal axis ticks: 0 to 1000.]\n\nFigure 2: Estimated acceleration over time (5 ms samples) for vertical motion of the three articulators is compared to that of the test sentence: \"Pam put the bobbin in the frying pan and added more puppy parts to the boiling potato soup\".\n\n[Figure 3 diagram. Recoverable labels: Motor Commands (EMG); Acceleration Predictor (Forward Dynamics) with VEL and POS inputs and ACC output; Movement Trajectory.]\n\nFigure 3: The musculo-skeletal forward dynamics model for producing articulator movement trajectories is implemented as a recurrent network. Continuous motor command (EMG) input drives the network, which uses estimated acceleration at time t_n to predict new velocity (integration) and position (double integration) values at the next time step t_{n+1}. D is a one-sample delay unit. The network is initialized with position and velocity values taken from the test utterance at t_0.\n\nNetwork training resulted in a one-step look-ahead predictor of the articulator dynamics, which was then connected recurrently as shown in Figure 3. Using only initial values of articulator position and velocity for the first sample and continuous EMG input, estimated acceleration is looped back and summed with the velocities and positions of the input layer to predict their values for each time step. This is perhaps an overly stringent test of the acquired model because errors are cumulative over the entire 5-8 second utterance. Yet the network outputs appropriate articulator trajectories for the entire utterance. Figure 4 shows the generated trajectory for vertical motion of the jaw during reiterant production of ba (recorded with the electromagnetometer). While the trajectory generated by the network tends to underestimate movement amplitude and introduce a small DC offset, it preserves the temporal properties of the test utterance very well everywhere except before a phrasal pause. Although good results have been obtained for the analysis of real speech using the larger sets of articulator and muscle inputs, network complexity has greatly increased. Performance of the full network has been poorer than before in modeling simple reiterant speech, which suggests some form of modularity should be introduced. Also, the addition of tongue data has increased the number of apparent many-to-one mappings between muscle activity and articulator motion. We are now incorporating as a boundary constraint the midsagittal profile of the hard palate and alveolar ridge, against which tongue-tip articulations are made.\n\n[Figure 4 plot. Legend: Network Output, Experimental Data. Trace: Jaw (Vertical). Horizontal axis: Time (s), 0 to 8.]\n\nFigure 4: Jaw trajectories generated by the forward dynamics network are compared with experimental data.\n\n4 MODELING THE FORWARD ACOUSTICS\n\n[Figure 5 diagram. Recoverable labels: Articulator Positions; Glottal Source; Acoustic Wave.]\n\nFigure 5: Forward acoustics network.\n\n[Figure 6 plot. Legend: Network Output, Experimental Data. Traces: k1 to k6. Horizontal axis: Time (s), 0 to 8.]\n\nFigure 6: PARCOR parameter values (16-order, 30 ms Hanning window at 200 Hz) for reiterant ba are predicted by the network. Only the first six PARCOR parameters are shown. The range of each parameter is -1 to 1 (small tick beside each wave label indicates 0). The value of k1 is about 1 during vowels, and network output generally matches the desired wave almost perfectly.\n\nThe final stage of our speech production model entails using a neural network to acquire a model of the relation between articulator motion and the ensuing acoustics. As shown in Figure 5, a 3-layer perceptron, using articulator position as input, was used to learn PARCOR analysis and generate appropriate 16-order PARCOR parameters for subsequent speech synthesis (Itakura and Saito, 1969). We chose PARCOR parameters, rather than more commonly used formant values, because the parameters have some relation to specific cross-sections of the vocal tract; e.g., the first PARCOR parameter corresponds to the cross-sectional area closest to the lips (Wakita, 1973). Also, PARCOR estimation errors do not have the radical consequences that formant estimation errors show. Finally, there is a unique mapping from PARCOR to formant values, but not the reverse (Itakura and Saito, 1969). Figure 6 shows the performance of the PARCOR estimation network for the first 6 of the 16 parameters. Using the learned PARCOR coefficients and a sound source, acoustic signals can be synthesized. Currently, we are investigating various models for controlling the sound source as well as prosodic characteristics. However, for this preliminary test of the network's ability to learn PARCOR parameters, the residual signal of PARCOR analysis served as the source waveform. Figure 7 shows an example of the network-learned PARCOR synthesis for reiterant ba. In this case, the training result is good, as can be seen in the waveform (and frequency spectrum) or by listening to the synthesized sound. However, the results have not been as good, so far, for real speech utterances containing a lot of abrupt changes and variability in vocal tract shape. 
One reason for this may be that learning has not yet converged, because the number of articulator input channels is still too limited. So far, we have only two markers on the tongue, which is not enough to recover the full vocal tract shape. This situation, hopefully, will improve as data for more tongue positions, or perhaps more functionally motivated placements, are collected. Another reason may be the inherent weakness of PARCOR analysis for modeling dynamic changes in vocal tract shape.\n\n[Figure 7 plots. Panels: Experimental, Source (Residual), Synthesized, each shown at two time scales. Horizontal axes: Time (s), 0 to 8 (top) and 3.00 to 3.10 (bottom).]\n\nFigure 7: Speech acoustics are synthesized by driving network-learned PARCOR parameters with a glottal source (the residual). The test sentence is reiterant speech using ba. Top and bottom graphs differ only in time scale.\n\n5 SUMMARY\n\nThis paper outlines two areas of progress in our effort to develop a computational model of speech production. First, we extended our data acquisition to include more muscles and dimensions of motion for more articulators, especially the tongue, so that we could begin modeling the articulatory dynamics of real speech. As hoped, increasing the scope of the data demonstrated the applicability of our network approach to real speech. However, this also increases the size of the network, which has introduced some interesting problems for modeling simple speech samples. 
We are now considering modifications to the network architecture that will enable adaptive modeling of speech samples, whose complexity (e.g., number of physiological/articulatory components) may vary. Second, we have employed a simple neural network for modeling the articulatory-to-acoustic transform based on PARCOR analysis, whose parameters are correlated with vocal tract shape. Although PARCOR can be used to synthesize speech, its main use for us is as a tool for assessing empirical issues associated with the articulatory-acoustic interface.\n\nAcknowledgments\n\nWe thank Haskins Laboratories for use of their facilities (NIH grant DC-00121), Vincent Gracco and Kiyoshi Ohsima for muscle insertions, M. I. Jordan for insightful discussion, and Yoh'ichi Toh'kura for continuous encouragement. Further support was provided by HFSP grants to M. Kawato.\n\nReferences\n\n[1] Bailly, G., Laboissiere, R., and Schwarz, J. L. (1992) Formant trajectories as audible gestures: an alternative for speech synthesis. Journal of Phonetics, 19, 9-23.\n\n[2] Bengio, Y., Houde, J., and Jordan, M. I. (1992) Representations based on articulatory dynamics for speech recognition. Presented at Neural Networks for Computing, Snowbird, Utah.\n\n[3] Hirayama, M., Vatikiotis-Bateson, E., Kawato, M., and Jordan, M. I. (1992) Forward dynamics modeling of speech motor control using physiological data. In Moody, J. E., Hanson, S. J., and Lippmann, R. P. (eds.) Advances in neural information processing systems 4. San Mateo, CA: Morgan Kaufmann Publishers.\n\n[4] Itakura, F., and Saito, S. (1969) Speech analysis and synthesis by partial correlation parameters. Proceeding of Japan Acoust. Soc., 2-2-6.\n\n[5] Jordan, M. I. 
(1986) Serial order: a parallel distributed processing approach. ICS Report, 8604.\n\n[6] Jordan, M. I. (1990) Motor learning and the degrees of freedom problem. In M. Jeannerod (ed.) Attention and performance XIII, 796-836, Hillsdale, NJ: Erlbaum.\n\n[7] Kawato, M., Maeda, M., Uno, Y., and Suzuki, R. (1990) Trajectory formation of arm movement by cascade neural-network model based on minimum torque-change criterion. Biol. Cybern., 62, 275-288.\n\n[8] Perkell, J., Cohen, M., Svirsky, M., Matthies, M., Garabieta, I., and Jackson, M. (1992) Electromagnetic midsagittal articulometer systems for transducing speech articulatory movements. J. Acoust. Soc. Am., 92, 3078-3096.\n\n[9] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986) Learning representations by back-propagating errors. Nature, 323, 533-536.\n\n[10] Saltzman, E. L. (1986) Task dynamic coordination of the speech articulators: A preliminary model. In H. Heuer and C. Fromm (eds.) Generation and modulation of action patterns, Berlin: Springer-Verlag.\n\n[11] Uno, Y., Kawato, M., and Suzuki, R. (1989) Formation and control of optimal trajectory in human multijoint arm movement - minimum torque-change model. Biol. Cybern., 61, 89-101.\n\n[12] Wakita, H. (1973) Direct estimation of the vocal tract shape by inverse filtering of acoustic speech waveforms. IEEE Trans. Audio Electroacoust., AU-21, 417-427.\n", "award": [], "sourceid": 678, "authors": [{"given_name": "Makoto", "family_name": "Hirayama", "institution": null}, {"given_name": "Eric", "family_name": "Vatikiotis-Bateson", "institution": null}, {"given_name": "Kiyoshi", "family_name": "Honda", "institution": null}, {"given_name": "Yasuharu", "family_name": "Koike", "institution": null}, {"given_name": "Mitsuo", "family_name": "Kawato", "institution": null}]}