{"title": "Incremental Parsing by Modular Recurrent Connectionist Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 364, "page_last": 371, "abstract": null, "full_text": "364 \n\nJain and Waibel \n\nIncremental Parsing by Modular Recurrent \n\nConnectionist Networks \n\nAjay N. Jain  Alex  H. Waibel \n\nSchool of Computer Science \nCarnegie Mellon  University \n\nPittsburgh, PA  15213 \n\nABSTRACT \n\nWe  present a novel,  modular, recurrent connectionist network architec(cid:173)\nture  which  learns  to robustly  perform  incremental  parsing of complex \nsentences.  From  sequential  input,  one  word  at  a  time,  our  networks \nlearn  to  do  semantic  role  assignment,  noun  phrase  attachment,  and \nclause structure recognition for sentences with passive constructions and \ncenter embedded  clauses.  The  networks  make  syntactic  and  semantic \npredictions  at every point in time, and previous  predictions are revised \nas  expectations are affirmed or violated with the arrival of new informa(cid:173)\ntion.  Our networks  induce  their own \"grammar rules\" for dynamically \ntransforming  an  input  sequence of words  into a  syntactic/semantic  in(cid:173)\nterpretation.  These networks  generalize  and  display  tolerance  to  input \nwhich  has  been  corrupted  in ways common in spoken  language. \n\nINTRODUCTION \n\n1 \nPreviously, we have reported on experiments using connectionist models for a small pars(cid:173)\ning task using a new network formalism  which extends back-propagation to  better fit  the \nneeds of sequential symbolic domains  such as  parsing (Jain, 1989).  We showed that con(cid:173)\nnectionist networks  could  learn  the  complex  dynamic  behavior needed  in  parsing.  The \ntask  included  passive sentences  which  require dynamic  incorporation  of previously  un(cid:173)\nseen right context information into  partially built syntactic/semantic interpretations.  The \ntrained parsing network exhibited predictive behavior and was  able to  modify or confirm \n\n\fIncremental Parsing by Modular Recurrent Connectionist Networks \n\n365 \n\nI Interclause Units I \nII Clause RolfS  Units II \n\nClause  1 \n\nClause M \n\nr-I Phra-s-e 1\"\"\"\"11  .. . r-I Phn-s-e -'1 I \n\nr-I Phra-s-e 1-'1  .. . r-I Phn-,-e J-\" \n\nII Clause Structure Units II \n\nI Phrase Level Gating Unltslt-------1 \n\nWord Level \n\n, Word Units I \n\nFigure 1:  High-level Parsing Architecture. \n\nhypotheses  as  sentences  were  sequentially processed.  It was also able to generalize well \nand  tolerate iII-formed input \nIn  this  paper, we  describe  work  on extending  our parsing architecture  to  grammatically \ncomplex  sentences. 1  The  paper  is  organized  as  follows.  First,  we  briefly  outline  the \nnetwork formalism  and the general architecture.  Second, the parsing task is  defined and \nthe  procedure  for  constructing  and  training  the  parser is  presented.  Then  the  dynamic \nbehavior of the  parser is  illustrated, and the performance is  characterized. \n\n2  NETWORK ARCHITECTURE \nWe  have  developed  an  extension  to  back-propagation  networks  which  is  specifically \ndesigned  to  perform  tasks  in  sequential  domains  requiring  symbol  manipulation  (Jain, \n1989).  It is  substantially  different  from  other  connectionist  approaches  to  sequential \nproblems  (e.g.  Elman,  1988;  Jordan,  1986;  Waibel  et al.,  1989).  There are  four  major \nfeatures  of this  formalism.  
[Figure 1: High-level Parsing Architecture. A diagram of five hierarchical levels, bottom up: Word Units and feature units at the Word level; phrase blocks (Phrase 1 ... Phrase N) with Phrase Level Gating Units; Clause Structure Units; Clause Roles Units for Clause 1 ... Clause M; and Interclause Units at the top.]

Figure 1 shows a high-level diagram of the general parsing architecture. It is organized into five hierarchical levels: Word, Phrase, Clause Structure, Clause Roles, and Interclause. The description will proceed bottom up. A word is presented to the network by stimulating its associated word unit for a short time. This produces a pattern of activation across the feature units which represents the meaning of the word. The connections from the word units to the feature units, which encode semantic and syntactic information about words, are compiled into the network and are fixed.[2] The Phrase level uses the sequence of word representations from the Word level to build contiguous phrases. Connections from the Word level to the Phrase level are modulated by gating units which learn the required conditional assignment behavior. The Clause Structure level maps phrases into the constituent clauses of the input sentence. The Clause Roles level describes the roles and relationships of the phrases in each clause of the sentence. The final level, Interclause, represents the interrelationships among the clauses. The following section defines a parsing task and gives a detailed description of the construction and training of a parsing network which performs the task.

[2] Connectionist networks have been used successfully for lexical acquisition (Miikkulainen and Dyer, 1989). However, in building large systems, it makes sense from an efficiency perspective to precompile as much lexical information as possible into a network. This is a pragmatic design choice.

3  INCREMENTAL PARSING

In parsing spoken language, it is desirable to process input one word at a time as words are produced by the speaker and to incrementally build an output representation. This allows tight bi-directional coupling of the parser to the underlying speech recognition system. In such a system, the parser processes information as soon as it is produced and provides predictive information to the recognition system based on a rich representation of the current context. As mentioned earlier, our previous work applying connectionist architectures to a parsing task was promising. The experiment described below extends our previous work to grammatically complex sentences requiring a significant scale increase.

3.1  Parsing Task

The domain for the experiment was sentences with up to three clauses, including non-trivial center-embedding and passive constructions.[3] Here are some example sentences:

•  Fido dug up a bone near the tree in the garden.
•  I know the man who John says Mary gave the book.
•  The dog who ate the snake was given a bone.

[3] The training set contained over 200 sentences. These are a subset of the sentences which form the example set of a parser based on a left associative grammar (Hausser, 1988). These sentences are grammatically interesting, but they do not reflect the statistical structure of common speech.

Given sequential input, one word at a time, the task is to incrementally build a representation of the input sentence which includes the following information: phrase structure, clause structure, semantic role assignment, and interclause relationships. Figure 2 shows a representation of the desired parse of the last sentence in the list above.

[Clause 1: [The dog RECIP] [was given ACTION] [a bone PATIENT]]
[Clause 2: [who AGENT] [ate ACTION] [the snake PATIENT]
           (RELATIVE to Clause 1, Phrase 1)]

Figure 2: Representation of an Example Sentence.
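As an aside not in the original paper, the information in Figure 2 maps naturally onto a small data structure; the class and field names below are hypothetical, chosen only to show what the network must produce:

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class Phrase:
        words: List[str]   # a phrase block captures up to 4 words
        role: str          # agent, patient, recipient, or action

    @dataclass
    class Clause:
        phrases: List[Phrase]
        relation: Optional[Tuple[str, int, int]] = None

    # the desired parse of "The dog who ate the snake was given a bone."
    parse = [
        Clause([Phrase(["The", "dog"], "RECIP"),
                Phrase(["was", "given"], "ACTION"),
                Phrase(["a", "bone"], "PATIENT")]),
        Clause([Phrase(["who"], "AGENT"),
                Phrase(["ate"], "ACTION"),
                Phrase(["the", "snake"], "PATIENT")],
               relation=("RELATIVE", 1, 1)),  # to Clause 1, Phrase 1
    ]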
3.2  Constructing the Parser

The architecture for the network follows that given in Figure 1. The following paragraphs describe the detailed network structure bottom up. The constraints on the numbers of objects and labels are fixed for a particular network, but the architecture itself is scalable. Wherever possible in the network construction, modularity and architectural constraints have been exploited to minimize training time and maximize generalization. A network was constructed from three separate recurrent subnetworks, each trained to perform a portion of the parsing task on the training sentences. The performance of the full network will be discussed in detail in the next section.

The Phrase level contains three types of units: phrase block units, gating units, and hidden units. There are 10 phrase blocks, each being able to capture up to 4 words forming a phrase. The phrase blocks contain sets of units (called slots) whose target activation patterns correspond to word feature patterns of words in phrases. Each slot has an associated gating unit which learns to conditionally assign an activation pattern from the feature units of the Word level to the slot. The gating units have input connections from the hidden units. The hidden units have input connections from the feature units, gating units, and phrase block units. The direct recurrence between the gating and hidden units allows the gating units to learn to inhibit and compete with one another. The indirect recurrence arising from the connections between the phrase blocks and the hidden units provides the context of the current input word. The target activation values for each gating unit are dynamically calculated during training; each gating unit must learn to become active at the proper time in order to perform the phrasal parsing. Each phrase block with its associated gating and hidden units has its weights slaved to the other phrase blocks in the Phrase level. Thus, if a particular phrase construction is only present in one position in the training set, all of the phrase blocks still learn to parse the construction.
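The weight slaving can be pictured as a single parameter set serving all ten phrase blocks. The sketch below is one reading of that arrangement, simplified (the real gating units also receive recurrent input from hidden units); the sizes, names, and pooled-gradient rule are assumptions:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    N_FEAT, N_BLOCKS, N_SLOTS = 40, 10, 4   # illustrative sizes

    # one weight set, shared ("slaved") across all phrase blocks
    W_gate = np.random.default_rng(0).normal(scale=0.1, size=(N_SLOTS, N_FEAT))
    slots = np.zeros((N_BLOCKS, N_SLOTS, N_FEAT))

    def assign_word(features, block):
        # every block computes its gates with the same W_gate, so a
        # construction seen in one position is learned for all positions
        gates = sigmoid(W_gate @ features)
        for s in range(N_SLOTS):
            slots[block, s] = gates[s] * features + (1 - gates[s]) * slots[block, s]

    def pooled_gradient(per_block_grads):
        # slaved copies stay identical because the blocks' gradients are
        # summed and applied once to the shared weight set
        return sum(per_block_grads)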
The Clause Roles level also has shared weights among separate clause modules. This level is trained by simulating the sequential building and mapping of clauses onto sets of units containing the phrase blocks for each clause (see Figure 1). There are two types of units in this level: labeling units and hidden units. The labeling units learn to label the phrases of the clauses with semantic roles and to attach phrases to other (within-clause) phrases. For each clause, there is a set of units which assigns role labels (agent, patient, recipient, action) to phrases. There is also a set of units indicating phrasal modification. The hidden units are recurrently connected to the labeling units to provide context and competition as with the Phrase level; they also have input connections from the phrase blocks of a single clause. During training, the targets for the labeling units are set at the beginning of the input presentation and remain static. In order to minimize global error across the training set, the units must learn to become active or inactive as soon as possible in the input. This forces the network to learn to be predictive.

The Clause Structure and Interclause levels are trained simultaneously as a single module. There are three types of units at this level: mapping, labeling, and hidden units. The mapping units assign phrase blocks to clauses. The labeling units indicate relative clause and subordinate clause relationships. The mapping and labeling units are recurrently connected to the hidden units, which also have input connections from the phrase blocks of the Phrase level. The behavior of the Phrase level is simulated during training of this module. This module utilizes no weight sharing techniques. As with the Clause Roles level, the targets for the labeling and mapping units are set at the beginning of input presentation, thus inducing the same type of predictive behavior.
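The static-target scheme is what makes the network predictive: because the correct labels are clamped from the first word onward, every time step of disagreement adds error, so total error is minimized by committing to the correct output as early as the input allows. A toy version of that objective (summed squared error is an assumption; the paper does not state its exact loss):

    import numpy as np

    def predictive_loss(outputs_over_time, static_target):
        # error is charged at every step against targets fixed at t = 0
        return sum(float(((y - static_target) ** 2).sum())
                   for y in outputs_over_time)

    # a trajectory that commits early scores better than one that waits
    target = np.array([1.0, 0.0])
    early = [np.array([0.9, 0.1])] * 5
    late = [np.array([0.5, 0.5])] * 4 + [np.array([0.9, 0.1])]
    assert predictive_loss(early, target) < predictive_loss(late, target)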
4  PARSING PERFORMANCE

The separately trained submodules described above were assembled into a single network which performs the full parsing task. No additional training was needed to fine-tune the full parsing network, despite significant differences between actual subnetwork performance and the simulated subnetwork performance used during training. The network successfully modeled the large, diverse training set. This section discusses three aspects of the parsing network's performance: dynamic behavior of the integrated network, generalization, and tolerance to noisy input.

4.1  Dynamic Behavior

The dynamic behavior of the network will be illustrated on the example sentence from Figure 2: "The dog who ate the snake was given a bone." This sentence was not in the training set. Due to space limitations, actual plots of network behavior will only be presented for a small portion of the network.

Initially, all of the units in the network are at their resting values. The units of the phrase blocks all have low activation. The word unit corresponding to "the" is stimulated, causing its word feature representation to become active across the feature units of the Word level. The gating unit associated with slot 1 of phrase block 1 becomes active, and the feature representation of "the" is assigned to the slot; the gate closes as the next word is presented. The remaining words of the sentence are processed similarly, resulting in the final Phrase level representation shown in Figure 2. While this is occurring, the higher levels of the network are processing the evolving Phrase level representation.

The behavior of some of the mapping units of the Clause Structure level is shown in Figure 3. Early in the presentation of the first word, the Clause Structure level hypothesizes that the first 4 phrase blocks will belong to the first clause, reflecting the dominance of single clause sentences in the training set. After "the" is assigned to the first phrase block, this hypothesis is revised. The network then believes that there is an embedded clause of 3 (possibly 4) phrases following the first phrase. This predictive behavior emerged spontaneously from the training procedure (a large majority of sentences in the training set beginning with a determiner had embedded clauses after the first phrase). The next two words ("dog who") confirm the network's expectation. The word "ate" allows the network to firmly decide on an embedded clause of 3 phrases within the main clause. This is the correct clausal structure of the sentence and is confirmed by the remainder of the input. The Interclause level indicates the appropriate relative clause relationship during the initial hypothesis of the embedded clause.

[Figure 3: Example of Clause Structure Dynamic Behavior. Activation traces, over the input "The dog who ate the snake was given ...", of the mapping units assigning phrase blocks to Clause 1 (left panel) and Clause 2 (right panel).]

The Clause Roles level processes the individual clauses as they get mapped through the Clause Structure level. The labeling units for clause 1 initially hypothesize an Agent/Action/Patient role structure, with some competition from a Rec/Act/Pat role structure (the Agent and Patient units' activation traces for clause 1, phrase 1 are shown in Figure 4). This prediction occurs because active constructions outnumbered passive ones during training. The final decision about role structure is postponed until just after the embedded clause is presented. The verb phrase "was given" immediately causes the Rec/Act/Pat role structure to dominate. Also, the network indicates that a fourth phrase (e.g. "by Mary") is expected to be the Agent. As with the first clause, an Ag/Act/Pat role structure is predicted for clause 2; this time the prediction is borne out.

[Figure 4: Example of Clause Roles Dynamic Behavior. Activation traces of the Agent and Patient labeling units for clause 1, phrase 1 ("The dog") over the input sentence.]

4.2  Generalization

One type of generalization is automatic. A detail of the word representation scheme was omitted from the previous discussion. The feature patterns have two parts: a syntactic/semantic part and an identification (ID) part. The representations of "John" and "Peter" differ only in their ID parts. Units in the network which learn do not have any input connections from the ID portions of the word units. Thus, when the network learns to parse "John gave the bone to the dog," it will know how to parse "Peter promised the mitt to the boy." This type of generalization is extremely useful, both for the addition of new words to the network and for processing many sentences not explicitly trained on.
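A sketch of this representational split (the sizes and names are assumptions): the learned weights simply have no components for the ID portion, so two words that differ only in ID drive the learned units identically.

    import numpy as np

    N_SYNSEM, N_ID = 32, 8   # illustrative split of the feature pattern

    def word_vector(syn_sem, word_id):
        # feature pattern = syntactic/semantic part plus ID part
        return np.concatenate([syn_sem, word_id])

    def learned_unit_net(word_vec, w):   # w has shape (N_SYNSEM,)
        # learning units have no input connections from the ID portion,
        # so "John" and "Peter" produce identical net input here
        return w @ word_vec[:N_SYNSEM]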
The network also generalizes to correctly process truly novel sentences, that is, sentences which are distinct (ignoring ID features) from those in the training set. The weight sharing techniques at the Phrase and Clause Structure levels have an impact here. While it is difficult to measure generalization quantitatively, some statements can be made about the types of novel sentences which can be correctly processed relative to the training sentences. Substitution of single words resulting in a meaningful sentence is tolerated almost without exception. Substitution of entire phrases by different phrases causes some errors in structural parsing on sentences which have few similar training exemplars. However, the network does quite well on sentences which can be formed by composition of familiar sentences (e.g. interchanging clauses).

4.3  Tolerance to Noise

Several types of noise tolerance are interesting to analyze: ungrammaticality, word deletions (especially of poorly articulated short function words), variance in word speed, inter-word silences, interjections, word/phrase repetitions, etc. The effects of noise were simulated by testing the parsing network on training sentences which had been corrupted in the ways listed above. Note that the parser was trained only on well-formed sentences.
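One of these corruption types, random deletion of determiners to mimic speech recognition errors (results reported below), can be simulated with a function like the following; the deletion probability and word list are assumptions, not the paper's protocol:

    import random

    def delete_determiners(sentence, p_drop=0.5, rng=random.Random(0)):
        # randomly drop short function words, as a recognizer might
        dets = {"a", "an", "the"}
        return [w for w in sentence if w not in dets or rng.random() > p_drop]

    print(delete_determiners("the dog who ate the snake was given a bone".split()))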
Sentences in which verbs were made ungrammatical were processed without difficulty (e.g. "We am happy."). Sentences in which verb phrases were badly corrupted produced reasonable interpretations. For example, the sentence "Peter was gave a bone to Fido" received an Ag/Act/Pat/Rec role structure, as if "was gave" was supposed to be either "gave" or "has given". Interpretation of corrupted verb phrases was context dependent.

Single clause sentences in which determiners were randomly deleted to simulate speech recognition errors were processed correctly 85 percent of the time. Multiple clause sentences degraded in a similar manner produced more parsing errors. There were fewer examples of multi-clause sentence types, and this hurt performance. Deletion of function words such as prepositions beginning prepositional phrases produced few errors, but deletions of critical function words such as "to" in infinitive constructions introducing subordinate clauses caused serious problems.

The network was somewhat sensitive to variations in word presentation speed (it was trained on a constant speed), but tolerated inter-word silences. Interjections of "ahh" and partial phrase repetitions were also tested. The network did not perform as well on these sentences as other networks trained for less complex parsing tasks. One possibility is that the weight sharing prevents the formation of strong attractors for the training sentences. There appears to be a tradeoff between generalization and noise tolerance.

5  CONCLUSION

We have presented a novel connectionist network architecture and its application to a non-trivial parsing task. A hierarchical, modular, recurrent connectionist network was constructed which successfully learned to parse grammatically complex sentences. The parser exhibited predictive behavior and was able to dynamically revise hypotheses. Techniques for maximizing generalization were also discussed. Network performance on novel sentences was impressive. Results of testing the parser's sensitivity to several types of noise were somewhat mixed, but the parser performed well on ungrammatical sentences and sentences with non-critical function word deletions.

Acknowledgments

This research was funded by grants from ATR Interpreting Telephony Research Laboratories and the National Science Foundation under grant number EET-8716324. We thank Dave Touretzky for helpful comments and discussions.

References

J. L. Elman. (1988) Finding Structure in Time. Tech. Rep. 8801, Center for Research in Language, University of California, San Diego.

R. Hausser. (1988) Computation of Language. Springer-Verlag.

A. N. Jain. (1989) A Connectionist Architecture for Sequential Symbolic Domains. Tech. Rep. CMU-CS-89-187, School of Computer Science, Carnegie Mellon University.

A. N. Jain and A. H. Waibel. (1990) Robust connectionist parsing of spoken language. In Proceedings of the 1990 IEEE International Conference on Acoustics, Speech, and Signal Processing.
M. I. Jordan. (1986) Serial Order: A Parallel Distributed Processing Approach. Tech. Rep. 8604, Institute for Cognitive Science, University of California, San Diego.

R. Miikkulainen and M. G. Dyer. (1989) Encoding input/output representations in connectionist cognitive systems. In D. Touretzky, G. Hinton, and T. Sejnowski (eds.), Proceedings of the 1988 Connectionist Models Summer School, pp. 347-356. Morgan Kaufmann.

A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. (1989) Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing 37(3):328-339.