{"title": "A Short-Term Memory Architecture for the Learning of Morphophonemic Rules", "book": "Advances in Neural Information Processing Systems", "page_first": 605, "page_last": 611, "abstract": null, "full_text": "A Short-Term Memory Architecture for the Learning of Morphophonemic Rules\n\nMichael Gasser and Chan-Do Lee\n\nComputer Science Department\n\nIndiana University\n\nBloomington, IN 47405\n\nAbstract\n\nDespite its successes, Rumelhart and McClelland's (1986) well-known approach to the learning of morphophonemic rules suffers from two deficiencies: (1) It performs the artificial task of associating forms with forms rather than perception or production. (2) It is not constrained in ways that human learners are. This paper describes a model which addresses both objections. Using a simple recurrent architecture which takes both forms and \"meanings\" as inputs, the model learns to generate verbs in one or another \"tense\", given arbitrary meanings, and to recognize the tenses of verbs. Furthermore, it fails to learn reversal processes unknown in human language.\n\n1 BACKGROUND\n\nIn the debate over the power of connectionist models to handle linguistic phenomena, considerable attention has been focused on the learning of simple morphological rules. It is a straightforward matter in a symbolic system to specify how the meanings of a stem and a bound morpheme combine to yield the meaning of a whole word and how the form of the bound morpheme depends on the shape of the stem. In a distributed connectionist system, however, where there may be no explicit morphemes, words, or rules, things are not so simple.\n\nThe most important work in this area has been that of Rumelhart and McClelland (1986), together with later extensions by Marchman and Plunkett (1989). 
The networks involved were trained to associate English verb stems with the corresponding past-tense forms, successfully generating both regular and irregular forms and generalizing to novel inputs. This work established that rule-like linguistic behavior could be achieved in a system with no explicit rules. However, it did have important limitations, among them the following:\n\n1. The representation of linguistic form was inadequate. This is clear, for example, from the fact that distinct lexical items may be associated with identical representations (Pinker & Prince, 1988).\n\n2. The model was trained on an artificial task, quite unlike the perception and production that real hearers and speakers engage in. Of course, because it has no semantics, the model also says nothing about the issue of compositionality.\n\nOne consequence of both of these shortcomings is that there are few constraints on the kinds of processes that can be learned.\n\nIn this paper we describe a model which addresses these objections to the earlier work on morphophonemic rule acquisition. The model learns to generate forms in one or another \"tense\", given arbitrary patterns representing \"meanings\", and to yield the appropriate tense, given forms. The network sees linguistic forms one segment at a time, saving the context in a short-term memory. This style of representation, together with the more realistic tasks that the network is faced with, results in constraints on what can be learned. In particular, the system experiences difficulty learning reversal processes which do not occur in human language and which were easily accommodated by the earlier models.
\n\n2 SHORT-TERM MEMORY AND PREDICTION\n\nLanguage takes place in time, and at some point, systems that learn and process language have to come to grips with this fact by accepting input in sequential form. Sequential models require some form of short-term memory (STM) because the decisions that are made depend on context. There are basically two options: window approaches, which make available stretches of input events all at once, and dynamic memory approaches (Port, 1990), which offer the possibility of a recoded version of past events. Networks with recurrent connections have the capacity for dynamic memory. We make use of a variant of a simple recurrent network (Elman, 1990), which is a pattern associator with recurrent connections on its hidden layer. Because the hidden layer receives input from itself as well as from the units representing the current event, it can function as a kind of STM for sequences of events.\n\nElman has shown how networks of this type can learn a great deal about the structure of the inputs when trained on the simple, unsupervised task of predicting the next input event. We are interested in what can be expected from such a network that is given a single phonological segment (hereafter referred to as a phone) at a time and trained to predict the next phone. If a system could learn to do this successfully, it would have a left-to-right version of what phonologists call phonotactics; that is, it would have knowledge of what phones tend to follow other phones in given contexts. Since word recognition and production apparently build on phonotactic knowledge of the language (Church, 1987), training on the prediction task might provide a way of integrating the two processes within a single network.
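The way a recurrent hidden layer can act as a short-term memory may be easier to see in a small sketch. The code below is not the authors' implementation; it is a minimal forward pass for an Elman-style simple recurrent network, written under several assumptions: the class name `SimpleRecurrentNet`, the helper `run_sequence`, the layer sizes, the sigmoid activation and the random weight range are all hypothetical choices made for illustration, and no training is shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_matrix(rows, cols, rng):
    """Small random weight matrix (illustrative initialization only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class SimpleRecurrentNet:
    """Hypothetical Elman-style network: the hidden layer feeds back into itself."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w_in = make_matrix(n_hidden, n_in, rng)       # input -> hidden
        self.w_rec = make_matrix(n_hidden, n_hidden, rng)  # hidden -> hidden
        self.w_out = make_matrix(n_out, n_hidden, rng)     # hidden -> output
        self.n_hidden = n_hidden
        self.reset()

    def reset(self):
        """Clear the short-term memory (hidden state) before a new sequence."""
        self.hidden = [0.0] * self.n_hidden

    def step(self, x):
        """Consume one input event; the new hidden state mixes it with the old state."""
        new_hidden = []
        for i in range(self.n_hidden):
            total = sum(w * v for w, v in zip(self.w_in[i], x))
            total += sum(w * h for w, h in zip(self.w_rec[i], self.hidden))
            new_hidden.append(sigmoid(total))
        self.hidden = new_hidden
        return [sigmoid(sum(w * h for w, h in zip(row, self.hidden)))
                for row in self.w_out]


def run_sequence(net, sequence):
    """Reset the net, feed the events one at a time, return the last output."""
    net.reset()
    output = None
    for event in sequence:
        output = net.step(event)
    return output
```

In this sketch, feeding the same final event after two different earlier events gives different outputs, which is the sense in which the hidden layer carries context forward. A trained network would use that context to predict the next phone.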
\n\n3 ARCHITECTURE\n\nThe type of network we work with is shown in Figure 1. Both its inputs and outputs include FORM, that is, an individual phone, and what we'll call MEANING, that is, a pattern representing the stem of the word to be recognized or produced and a single unit representing a grammatical feature such as PAST or PRESENT. In fact, the meaning patterns have no real semantics, but like real meanings, they are arbitrarily assigned to the various morphemes and thus convey nothing about the phonological realization of the stem and grammatical feature. The network is trained both to auto-associate the current phone and predict the next phone.\n\nFigure 1: Network Architecture (labels: hidden layer/STM; FORM (current); MEANING (stem | tense))\n\nThe word recognition task corresponds to being given phone inputs (together with a default pattern on the meaning side) and generating meaning outputs. The meaning outputs are copied to the input meaning layer on each time step. While networks trained in this way can learn to recognize the words they are trained on, we have not been able to get them to generalize well. Networks which are expected only to output the grammatical feature, however, do generalize, as we shall see.\n\nThe word production task corresponds to being given a constant meaning input and generating form output. Following an initial default phone pattern, the phone input is what was predicted on the last time step. Again, however, though such a network does fine on the training set, it does not generalize well to novel inputs. We have had more success with a version using \"teacher forcing\". Here the correct current phone is provided on the input at each time step.
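The difference between the two production regimes described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `teacher_forced_steps` and `free_running_production`, the list-based patterns, the `predict_next` callable and the `max_len` cut-off are all hypothetical, and the word boundary is assumed to be an all-zero phone pattern, as in the stimuli described later in the paper.

```python
def teacher_forced_steps(phones, meaning, boundary):
    """Build (input, target) pairs for production with teacher forcing.

    On every step the input is the CORRECT current phone plus the constant
    meaning pattern; the targets are the current phone (auto-association)
    and the next phone (prediction).
    """
    sequence = [boundary] + list(phones) + [boundary]
    steps = []
    for current, nxt in zip(sequence, sequence[1:]):
        net_input = list(current) + list(meaning)
        target = {"current": list(current), "next": list(nxt)}
        steps.append((net_input, target))
    return steps


def free_running_production(predict_next, meaning, boundary, max_len):
    """Production without teacher forcing.

    `predict_next` is a hypothetical stand-in for a trained network: it maps
    an input vector (phone + meaning) to the predicted next phone. After the
    initial default (boundary) phone, each input is the previous prediction.
    """
    phone = list(boundary)
    produced = []
    for _ in range(max_len):
        phone = list(predict_next(list(phone) + list(meaning)))
        if phone == list(boundary):
            break
        produced.append(phone)
    return produced
```

With teacher forcing, an early prediction error cannot propagate, because the next input is taken from the correct word rather than from the network's own output. That is one plausible reading of why it generalized better here.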
\n\n4 SIMULATIONS\n\n4.1 STIMULI\n\nWe conducted a set of experiments to test the effectiveness of this architecture for the learning of morphophonemic rules. Input words were composed of sequences of phones in an artificial language. Each of the 15 possible phones was represented by a pattern over a set of 8 phonetic features. For each simulation, a set of 20 words was generated randomly from the set of possible words. Twelve of these were designated \"training\" words, 8 \"test\" words.\n\nFor each of these basic words, there was an associated inflected form. For each simulation, one of a set of 9 rules was used to generate the inflected form: (1) suffix (+ assimilation) (gip→gips, gib→gibz), (2) prefix (+ assimilation) (gip→zgip, kip→skip), (3) gemination (iga→igga), (4) initial deletion (gip→ip), (5) medial deletion (ipka→ipa), (6) final deletion (gip→gi), (7) tone change (glp→glp), (8) Pig Latin (gip→ipge), and (9) reversal (gip→pig).\n\nIn the two assimilation cases, the suffix or prefix agreed with the preceding or following phone on the voice feature. In the suffixing example, p is followed by s because it is voiceless, b by z because it is voiced. In the prefixing example, g is preceded by z because it is voiced, k by s because it is voiceless. Because the network is trained on prediction, these two rules are not symmetric. It would not be surprising if such a network could learn to generate a final phone which agrees in voicing with the phone preceding it. But in the prefixing case, the network must choose the correct prefix before it has seen the phone with which it is to agree in voicing. 
We thought this would still be possible, however, because the network also receives meaning input representing the stem of the word to be produced.\n\nWe hoped that the network would succeed on rule types which are common in natural languages and fail on those which are rare or non-existent. Types 1-4 are relatively common, types 5-7 infrequent or rare, type 8 apparently known only in language games, and type 9 apparently non-occurring.\n\nFor convenience, we will refer to the uninflected form of a word as the \"present\" and the inflected form as the \"past tense\" of the word in question. Each input word consisted of a present or past tense form preceded and followed by a word boundary pattern composed of zeroes. Meaning patterns consisted of an arbitrary pattern across a set of 6 \"stem\" units, representing the meaning of the \"stem\" of one of the 20 input words, plus a single bit representing the \"tense\" of the input word, that is, present or past.\n\n4.2 TRAINING\n\nDuring training each of the training words was presented in both present and past forms, while the test words appeared in the present form only. Each of the resulting 32 separate words (24 training forms plus 8 test forms) was trained in both the recognition and production directions.\n\nFor recognition training, the words were presented, one phone at a time, on the form input units. The appropriate pattern was also provided on the stem meaning units. Targets specified the current phone, next phone, and complete meaning. Thus the network was actually being trained to generate only the tense portion of the meaning for each word. The activation on the tense output unit was copied to the tense input unit following each time step.
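The recognition procedure just described, in which the tense output is fed back to the tense input after every phone, can be sketched as a simple loop. This is a hedged illustration rather than the authors' code: `recognize_tense`, `toy_step` and the starting value `default_tense=0.5` are assumptions, and `toy_step` is a deliberately trivial stand-in for a trained network.

```python
def recognize_tense(step_fn, phones, stem_pattern, default_tense=0.5):
    """Present a word one phone at a time and track the tense estimate.

    `step_fn` is a hypothetical stand-in for the network: it takes the full
    input vector (phone + stem units + tense unit) and returns the activation
    of the tense output unit. That activation is copied to the tense input
    unit before the next phone is presented.
    """
    tense_in = default_tense
    for phone in phones:
        net_input = list(phone) + list(stem_pattern) + [tense_in]
        tense_out = step_fn(net_input)
        tense_in = tense_out  # feedback: output tense becomes next input tense
    return tense_in


def toy_step(net_input):
    """Toy stand-in, NOT a trained network: raise the tense estimate whenever
    the first phone feature is on, and cap the result at 1.0."""
    return min(1.0, net_input[-1] + 0.25 * net_input[0])
```

For example, `recognize_tense(toy_step, [[1, 0], [0, 1], [1, 1]], [0, 1, 0])` starts from the default estimate and revises it phone by phone. A real network would replace `toy_step` with a forward pass through the recurrent hidden layer.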
\n\nFor production training, the stem and grammatical feature were presented on the lexical input layer and held constant throughout the word. The phones making up the word were presented one at a time beginning with the initial word boundary, and the network was expected to predict the next phone in each case.\n\nThere were 10 separate simulations for each of the 9 inflectional rules. Pilot runs were used to find estimates of the best hidden layer size. This varied between 16 and 26. Training continued until the mean sum-of-squares error was less than 0.05. This normally required between 50 and 100 epochs. Then the connection weights were frozen, and the network was tested in both the recognition and production directions on the past tense forms of the test words.\n\n4.3 RESULTS\n\nIn all cases, the network learned the training set quite successfully (at least 95% of the phones for production and 96% of the tenses for recognition). Results for the recognition and production of past-tense forms of test words are shown in Table 1. For recognition, chance is 37.5%. For production, the network's output on a given time step was considered to be that phone which was closest to the pattern on the phone output units.\n\nTable 1: Results of Recognition and Production Tests\n\nRule          RECOGNITION        PRODUCTION           PRODUCTION\n              % tenses correct   % segments correct   % affixes correct\nSuffix        79                 82                   83\nPrefix        76                 62                   76\nTone change   99                 98                   -\nGemination    90                 74                   42\nDeletion      67                 31                   -\nPig Latin     61                 27                   -\nReversal      13                 23                   -\n\n5 DISCUSSION\n\n5.1 AFFIXATION AND ASSIMILATION\n\nThe model shows clear evidence of having learned morphophonemic rules which it uses in both the production and perception directions. 
And the degree of mastery of the rules, at least for production, mirrors the extent to which the types of rules occur in natural languages. Significantly, the net is able to generate appropriate forms even in the prefix case when a \"right-to-left\" (anticipatory) rule is involved. That is, the fact that the network is trained only on prediction does not limit it to left-to-right (perseverative) rules because it has access to a \"meaning\" which permits the required \"lookahead\" to the relevant feature on the phone following the prefix. What makes this interesting is the fact that the meaning patterns bear no relation to the phonology of the stems. The connections between the stem meaning input units and the hidden layer are being trained to encode the voicing feature even when, in the case of the test words, this was never required during training.\n\nIn any case, it is clear that right-to-left assimilation in a network such as this is more difficult to acquire than left-to-right assimilation, all else being equal. We are unaware of any evidence that would support this, though the fact that prefixes are less common than suffixes in the world's languages (Hawkins & Cutler, 1988) means that there are at least fewer opportunities for the right-to-left process.\n\n5.2 REVERSAL\n\nWhat is it that makes the reversal rule, apparently difficult for human language learners, so difficult for the network? Consider what the network does when it is faced with the past-tense form of a verb trained only in the present. If the novel item took the form of a set rather than a sequence, it would be identical to the familiar present-tense form. What the network sees, however, is a sequence of phones, and its task is to predict the next. 
There is thus no sharing at all between the present and past forms and no basis for generalizing from the present to the past. Presented with the novel past form, it is more likely to base its response on similarity with a word containing a similar sequence of phones (e.g., gip and gif) than it is with the correct mirror-image sequence.\n\nIt is important to note, however, that difficulty with the reversal process does not necessarily presuppose the type of representations that result from training a simple recurrent net on prediction. Rather, this depends more on the fact that the network is trained to map meaning to form and form to meaning, rather than form to form, as in the case of the Rumelhart and McClelland (1986) model. Any network of the former type which represents linguistic form in such a way that the contexts of the phones are preserved is likely to exhibit this behavior. 1\n\n6 LIMITATIONS AND EXTENSIONS\n\nDespite its successes, this model is far from an adequate account of the recognition and production of words in natural language. First, although networks of the type studied here are capable of yielding complete meanings given words and complete words given meanings, they have difficulty when expected to respond to novel forms or combinations of known meanings. In the simulations, we asked the network to recognize only the grammatical morpheme in a novel word, and in production we kept it on track by giving it the correct input phone on each time step. It will be important to discover ways to make the system robust enough to respond appropriately to novel forms and combinations of meanings.\n\nEqually important is the ability of the model to handle more complex phonological processes. 
Recently Lakoff (1988) and Touretzky and Wheeler (1990) have developed connectionist models to deal with complicated interacting phonological rules. While these models demonstrate that connectionism offers distinct advantages over conventional serial approaches to phonology, they do not learn phonology (at least not in a connectionist way), and they do not yet accommodate perception.\n\nWe believe that the performance of the model will be significantly improved by the capacity to make reference directly to units larger than the phone. We are currently investigating an architecture consisting of a hierarchy of networks of the type described here, each trained on the prediction task at a different time scale.\n\n1 We are indebted to Dave Touretzky for helping to clarify this issue.\n\n7 CONCLUSIONS\n\nIt is by now clear that a connectionist system can be trained to exhibit rule-like behavior. What is not so clear is whether networks can discover how to map elements of form onto elements of meaning and to use this knowledge to interpret and generate novel forms. It has been argued (Fodor & Pylyshyn, 1988) that this behavior requires the kind of constituency which is not available to networks making use of distributed representations.\n\nThe present study is one attempt to demonstrate that networks are not limited in this way. We have shown that, given \"meanings\" and temporally distributed representations of words, a network can learn to isolate stems and the realizations of grammatical features, associate them with their meanings, and, in a somewhat limited sense, use this knowledge to produce and recognize novel forms. 
In addition, the nature of the training task constrains the system in such a way that rules which are rare or non-occurring in natural language are not learned.\n\nReferences\n\nChurch, K. W. (1987). Phonological parsing and lexical retrieval. Cognition, 25, 53-69.\n\nElman, J. (1990). Finding structure in time. Cognitive Science, 14, 179-211.\n\nFodor, J., & Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3-71.\n\nHawkins, J. A., & Cutler, A. (1988). Psychological factors in morphological asymmetry. In J. A. Hawkins (Ed.), Explaining language universals (pp. 280-317). Oxford: Basil Blackwell.\n\nLakoff, G. (1988). Cognitive phonology. Paper presented at the Annual Meeting of the Linguistics Society of America.\n\nMarchman, V., & Plunkett, K. (1989). Token frequency and phonological predictability in a pattern association network: Implications for child language acquisition. Proceedings of the Annual Conference of the Cognitive Science Society, 11, 179-187.\n\nPinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193.\n\nPort, R. (1990). Representation and recognition of temporal patterns. Connection Science, 2, 151-176.\n\nRumelhart, D., & McClelland, J. (1986). On learning the past tense of English verbs. In J. L. McClelland & D. E. Rumelhart (Eds.), Parallel Distributed Processing, Vol. 2 (pp. 216-271). Cambridge, MA: MIT Press.\n\nTouretzky, D., & Wheeler, D. (1990). A computational basis for phonology. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2, San Mateo, CA: Morgan Kaufmann.
\n\n\f", "award": [], "sourceid": 417, "authors": [{"given_name": "Michael", "family_name": "Gasser", "institution": null}, {"given_name": "Chan-Do", "family_name": "Lee", "institution": null}]}