{"title": "Grammatical Bigrams", "book": "Advances in Neural Information Processing Systems", "page_first": 91, "page_last": 97, "abstract": null, "full_text": "Grammatical  Bigrams \n\nComputer Science  Division \n\nUniversity of California,  Berkeley \n\nMark  A.  Paskin \n\nBerkeley,  CA  94720 \n\npaskin@cs.berkeley.edu \n\nAbstract \n\nUnsupervised learning algorithms have been derived for several sta(cid:173)\ntistical  models  of English grammar, but their computational com(cid:173)\nplexity  makes  applying  them  to large  data  sets  intractable.  This \npaper  presents  a  probabilistic  model  of English  grammar  that  is \nmuch  simpler than conventional models,  but which  admits an effi(cid:173)\ncient  EM training algorithm.  The model  is  based upon  grammat(cid:173)\nical  bigrams,  i.e. ,  syntactic  relationships  between  pairs  of  words. \nWe  present the results  of experiments  that  quantify  the represen(cid:173)\ntational adequacy of the grammatical bigram model, its ability to \ngeneralize  from  labelled  data,  and  its  ability  to  induce  syntactic \nstructure from  large amounts of raw text. \n\n1 \n\nIntroduction \n\nOne of the most significant  challenges in learning grammars from  raw text is  keep(cid:173)\ning  the  computational  complexity  manageable.  For  example,  the  EM  algorithm \nfor  the  unsupervised  training of Probabilistic  Context-Free  Grammars- known  as \nthe  Inside-Outside  algorithm- has  been found  in  practice to  be  \"computationally \nintractable for  realistic problems\"  [1].  Unsupervised learning algorithms have been \ndesigned for other grammar models  (e.g. , [2, 3]).  However, to the best of our knowl(cid:173)\nedge,  no large-scale experiments have been carried out to test the efficacy  of these \nalgorithms; the most likely reason is  that their computational complexity,  like that \nof the Inside-Outside algorithm, is  impractical. \n\nOne  way  to improve the complexity of inference  and learning in  statistical models \nis  to introduce independence assumptions;  however,  doing so  increases the model's \nbias.  It is  natural  to  wonder  how  a  simpler  grammar  model  (that  can  be  trained \nefficiently  from  raw  text)  would  compare  with  conventional  models  (which  make \nfewer  independence  assumptions,  but  which  must  be  trained  from  labelled  data) . \nSuch a  model would be a  useful  tool in domains where partial accuracy is  valuable \nand  large  amounts  of  unlabelled  data  are  available  (e.g.,  Information  Retrieval, \nInformation Extraction, etc.) . \n\nIn this  paper, we  present a  probabilistic model  of syntax that is  based  upon  gram(cid:173)\nmatical  bigrams,  i.e.,  syntactic relationships between pairs of words.  We  show how \nthis  model  results  from  introducing  independence  assumptions  into  more  conven-\n\n\fthe  quick  brown  fox \n\njumps  over  the  lazy  dog \n\nFigure  1:  An  example  parse;  arrows  are  drawn  from  head  words  to  their  depen(cid:173)\ndents.  The root word is  jumps; brown is  a  predependent  (adjunct)  of fox;  dog is  a \npostdependent  (complement)  of over. \n\ntional  models;  as  a  result,  grammatical  bigram  models  can  be  trained  efficiently \nfrom raw text using an O(n3 )  EM algorithm.  
We present the results of experiments that quantify the representational adequacy of the grammatical bigram model, its ability to generalize from labelled data, and its ability to induce syntactic structure from large amounts of raw text.

2 The Grammatical Bigram Model

We first provide a brief introduction to the Dependency Grammar formalism used by the grammatical bigram model; then, we present the probability model and relate it to conventional models; finally, we sketch the EM algorithm for training the model. Details regarding the parsing and learning algorithms can be found in a companion technical report [4].

Dependency Grammar Formalism.¹ The primary unit of syntactic structure in dependency grammars is the dependency relationship, or link: a binary relation between a pair of words in the sentence. In each link, one word is designated the head, and the other is its dependent. (Typically, different types of dependency are distinguished, e.g., subject, complement, adjunct, etc.; in our simple model, no such distinction is made.) Dependents that precede their head are called predependents, and dependents that follow their heads are called postdependents.

A dependency parse consists of a set of links that, when viewed as a directed graph over word tokens, form an ordered tree. This implies three important properties:

1. Every word except one (the root) is dependent to exactly one head.
2. The links are acyclic; no word is, through a sequence of links, dependent to itself.
3. When drawn as a graph above the sentence, no two dependency relations cross; this property is known as projectivity or planarity.

The planarity constraint ensures that a head word and its (direct or indirect) dependents form a contiguous subsequence of the sentence; this sequence is the head word's constituent. See Figure 1 for an example dependency parse.

Figure 1: An example parse of the sentence "the quick brown fox jumps over the lazy dog"; arrows are drawn from head words to their dependents. The root word is jumps; brown is a predependent (adjunct) of fox; dog is a postdependent (complement) of over.

In order to formalize our dependency grammar model, we will view sentences as sequences of word tokens drawn from some set of word types. Let V = {t_1, t_2, ..., t_M} be our vocabulary of M word types. A sentence with n words is therefore represented as a sequence S = (w_1, w_2, ..., w_n), where each word token w_i is a variable that ranges over V. For 1 ≤ i, j ≤ n, we use the notation (i, j) ∈ L to express that w_j is a dependent of w_i in the parse L.

¹The Dependency Grammar formalism described here (which is the same used in [5, 6]) is impoverished compared to the sophisticated models used in Linguistics; refer to [7] for a comprehensive treatment of English syntax in a dependency framework.

Because it simplifies the structure of our model, we will make the following three assumptions about S and L (without loss of generality): (1) the first word w_1 of S is a special symbol ROOT ∈ V; (2) the root of L is w_1; and (3) w_1 has only one dependent. These assumptions are merely syntactic sugar: they allow us to treat all words in the true sentence (i.e., (w_2, ..., w_n)) as dependent to one word. (The true root of the sentence is the sole child of w_1.)
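To make the representation concrete, the following is a minimal sketch of our own (it is not code from the paper or its companion report): a parse is stored as a set of (head, dependent) index pairs, and the function checks the three ordered-tree properties above. Indices are 0-based here, so ROOT sits at position 0 rather than the paper's w_1, and the link set shown for Figure 1 is simply our reading of that figure.

```python
# A parse over tokens w_0..w_{n-1} (with w_0 = ROOT) is a set of (i, j) pairs
# meaning "w_j is a dependent of w_i". The checks below cover single
# attachment, acyclicity, and projectivity (no crossing links).

def is_valid_parse(n, links):
    heads = {}
    for i, j in links:
        if j in heads:                      # property 1: at most one head per word
            return False
        heads[j] = i
    if set(heads) != set(range(1, n)):      # every word except the root is attached
        return False
    for j in range(1, n):                   # property 2: no word depends on itself
        seen, i = set(), j
        while i in heads:
            if i in seen:
                return False
            seen.add(i)
            i = heads[i]
    for i1, j1 in links:                    # property 3: projectivity (planarity)
        for i2, j2 in links:
            a1, b1 = sorted((i1, j1))
            a2, b2 = sorted((i2, j2))
            if a1 < a2 < b1 < b2:           # strictly interleaved endpoints = crossing
                return False
    return True

# Our reading of Figure 1, with ROOT attached to "jumps":
# 0:<root> 1:the 2:quick 3:brown 4:fox 5:jumps 6:over 7:the 8:lazy 9:dog
figure1 = {(0, 5), (5, 4), (4, 1), (4, 2), (4, 3), (5, 6), (6, 9), (9, 7), (9, 8)}
print(is_valid_parse(10, figure1))  # True
```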
Probability Model. A probabilistic dependency grammar is a probability distribution P(S, L) where S = (w_1, w_2, ..., w_n) is a sentence, L is a parse of S, and the words w_2, ..., w_n are random variables ranging over V. Of course, S and L exist in high-dimensional spaces; therefore, tractable representations of this distribution make use of independence assumptions.

Conventional probabilistic dependency grammar models make use of what may be called the head word hypothesis: that a head word is the sole (or primary) determinant of how its constituent combines with other constituents. The head word hypothesis constitutes an independence assumption; it implies that the distribution can be safely factored into a product over constituents:

P(S, L) = ∏_{i=1}^{n} P((w_j : (i, j) ∈ L) is the dependent sequence | w_i is the head)

For example, the probability of a particular sequence can be governed by a fixed set of probabilistic phrase-structure rules, as in [6]; alternatively, the predependent and postdependent subsequences can be modeled separately by Markov chains that are specific to the head word, as in [8].

Consider a much stronger independence assumption: that all the dependents of a head word are independent of one another and their relative order. This is clearly an approximation; in general, there will be strong correlations between the dependents of a head word. More importantly, this assumption prevents the model from representing important argument structure constraints. (For example: many words require dependents, e.g., prepositions; some verbs can have optional objects, whereas others require or forbid them.) However, this assumption relieves the parser of having to maintain internal state for each constituent it constructs, and therefore reduces the computational complexity of parsing and learning.

We can express this independence assumption in the following way: first, we forego modeling the length of the sentence, n, since in parsing applications it is always known; then, we expand P(S, L | n) into P(S | L) P(L | n) and choose P(L | n) as uniform; finally, we select

P(S | L) = ∏_{(i,j) ∈ L} P(w_j is a [pre/post]dependent | w_i is the head)

This distribution factors into a product of terms over syntactically related word pairs; therefore, we call this model the "grammatical bigram" model.

The parameters of the model are

γ^≺_{xy} ≜ P(predependent is t_y | head is t_x)
γ^≻_{xy} ≜ P(postdependent is t_y | head is t_x)

We can make the parameterization explicit by introducing the indicator variable w_i^x, whose value is 1 if w_i = t_x and 0 otherwise. Then we can express P(S | L) as

P(S | L) = ∏_{(i,j) ∈ L, j < i} ∏_{x=1}^{M} ∏_{y=1}^{M} (γ^≺_{xy})^{w_i^x w_j^y} · ∏_{(i,j) ∈ L, i < j} ∏_{x=1}^{M} ∏_{y=1}^{M} (γ^≻_{xy})^{w_i^x w_j^y}
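To illustrate this factorization, here is a small sketch of our own (not code from the paper): the log-likelihood of a sentence under a fixed parse is a sum of log-gamma terms over the links, using the predependent table when the dependent precedes its head and the postdependent table otherwise. The dictionary-based parameter tables and the function name are ours.

```python
import math

# gamma_pre[x][y]  stands in for  P(predependent is y  | head is x)
# gamma_post[x][y] stands in for  P(postdependent is y | head is x)

def log_prob_sentence_given_parse(words, links, gamma_pre, gamma_post):
    """words: token list with words[0] == '<root>';
    links: set of (i, j) pairs, meaning w_j is a dependent of w_i."""
    logp = 0.0
    for i, j in links:
        table = gamma_pre if j < i else gamma_post   # dependent before or after its head
        logp += math.log(table[words[i]][words[j]])
    return logp
```

Because P(L | n) is taken to be uniform, maximizing this quantity over parses is exactly the parsing problem described next.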
Parsing. Parsing a sentence S consists of computing

L* = argmax_L P(L | S, n) = argmax_L P(L, S | n) = argmax_L P(S | L)

Yuret has shown that there are exponentially many parses of a sentence with n words [9], so exhaustive search for L* is intractable. Fortunately, our grammar model falls into the class of "Bilexical Grammars", for which efficient parsing algorithms have been developed. Our parsing algorithm (described in the tech report [4]) is derived from Eisner's span-based chart-parsing algorithm [5], and can find L* in O(n^3) time.

Learning. Suppose we have a labelled data set

D = {(S_1, L_1), ..., (S_K, L_K)}

where S_k = (w_{1,k}, w_{2,k}, ..., w_{n_k,k}) and L_k is a parse over S_k. The maximum-likelihood values for our parameters given the training data are

γ̂^≺_{xy} = ( Σ_{k=1}^{K} Σ_{j<i} e^k_{ij} w^x_{i,k} w^y_{j,k} ) / ( Σ_{k=1}^{K} Σ_{j<i} e^k_{ij} w^x_{i,k} )

and similarly for γ̂^≻_{xy}, with the inner sums restricted to i < j, where the indicator variable e^k_{ij} is equal to 1 if (i, j) ∈ L_k and 0 otherwise. As one would expect, the maximum-likelihood value of γ^≺_{xy} (resp. γ^≻_{xy}) is simply the fraction of t_x's predependents (resp. postdependents) that were t_y.

In the unsupervised acquisition problem, our data set has no parses; our approach is to treat the L_k as hidden variables and to employ the EM algorithm to learn (locally) optimal values of the parameters γ. As we have shown above, the e^k_{ij} are sufficient statistics for our model; the companion tech report [4] gives an adaptation of the Inside-Outside algorithm which computes their conditional expectation in O(n^3) time. This algorithm effectively examines every possible parse of every sentence in the training set and calculates the expected number of times each pair of words was related syntactically.
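For the supervised case, the counting behind these maximum-likelihood estimates is simple; the sketch below is ours (not the paper's code) and assumes the same (words, links) format as the earlier sketches. It tallies, for each head type, how often each word type appeared as its predependent or postdependent and then normalizes. The EM procedure described above replaces these observed counts with expected counts computed by the adapted Inside-Outside algorithm.

```python
from collections import defaultdict

def estimate_gammas(labelled_data):
    """labelled_data: iterable of (words, links) pairs.
    Returns (gamma_pre, gamma_post) as nested dictionaries."""
    pre_counts = defaultdict(lambda: defaultdict(float))
    post_counts = defaultdict(lambda: defaultdict(float))
    for words, links in labelled_data:
        for i, j in links:
            counts = pre_counts if j < i else post_counts
            counts[words[i]][words[j]] += 1.0      # one observed head -> dependent link

    def normalize(counts):
        # gamma[x][y] = fraction of x's dependents (on that side) that were y
        return {x: {y: c / sum(deps.values()) for y, c in deps.items()}
                for x, deps in counts.items()}

    return normalize(pre_counts), normalize(post_counts)
```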
3 Evaluation

This section presents three experiments that attempt to quantify the representational adequacy and learnability of grammatical bigram models.

Corpora. Our experiments make use of two corpora; one is labelled with parses, and the other is not. The labelled corpus was generated automatically from the phrase-structure trees in the Wall Street Journal portion of the Penn Treebank-III [10].² The resultant corpus, which we call L, consists of 49,207 sentences (1,037,374 word tokens). This corpus is split into two pieces: 90% of the sentences comprise corpus L_train (44,286 sentences, 934,659 word tokens), and the remaining 10% comprise L_test (4,921 sentences, 102,715 word tokens).

The unlabelled corpus consists of the 1987-1992 Wall Street Journal articles in the TREC Text Research Collection Volumes 1 and 2. These articles were segmented on sentence boundaries using the technique of [11], and the sentences were post-processed to have a format similar to corpus L. The resultant corpus consists of 3,347,516 sentences (66,777,856 word tokens). We will call this corpus U.

²This involved selecting a head word for each constituent, for which the head-word extraction heuristics described in [6] were employed. Additionally, punctuation was removed, all words were down-cased, and all numbers were mapped to a special <#> symbol.

The model's vocabulary is the same for all experiments; it consists of the 10,000 most frequent word types in corpus U. This vocabulary covers 94.0% of word instances in corpus U and 93.9% of word instances in corpus L. Words encountered during testing and training that are outside the vocabulary are mapped to the <unk> type.

Performance metric. The performance metric we report is the link precision of the grammatical bigram model: the fraction of links hypothesized by the model that are present in the test corpus L_test. (In a scenario where the model is not required to output a complete parse, e.g., a shallow parsing task, we could similarly define a notion of link recall; but in our current setting, these metrics are identical.) Link precision is measured without regard for link orientation; this amounts to ignoring the model's choice of root, since this choice induces a directionality on all of the edges.
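Concretely, undirected link precision can be computed as in the following sketch (ours, with a hypothetical function name), which treats each link as an unordered pair of token positions so that the head/dependent orientation is ignored. Because both the model and the labelled corpus assign exactly one link to every non-root word, this quantity coincides with recall, as noted above.

```python
def link_precision(hypothesized, gold):
    """hypothesized, gold: collections of (head, dependent) index pairs for the
    same sentence; orientation is ignored. Corpus-level precision is the same
    ratio with links pooled over all test sentences."""
    gold_undirected = {frozenset(pair) for pair in gold}
    hits = sum(1 for pair in hypothesized if frozenset(pair) in gold_undirected)
    return hits / len(hypothesized)
```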
Experiments. We report on the results of three experiments:

I. Retention. This experiment represents a best-case scenario: the model is trained on corpora L_train and L_test and then tested on L_test. The model's link precision in this setting is 80.6%.

II. Generalization. In this experiment, we measure the model's ability to generalize from labelled data. The model is trained on L_train and then tested on L_test. The model's link precision in this setting is 61.8%.

III. Induction. In this experiment, we measure the model's ability to induce grammatical structure from unlabelled data. The model is trained on U and then tested on L_test. The model's link precision in this setting is 39.7%.

Analysis. The results of Experiment I give some measure of the grammatical bigram model's representational adequacy. A model that memorizes every parse would perform perfectly in this setting, but the grammatical bigram model is only able to recover four out of every five links. To see why, we can examine an example parse. Figure 2 shows how the models trained in Experiments I, II, and III parse the same test sentence. In the top parse, syndrome is incorrectly selected as a postdependent of the first on token rather than the second. This error can be attributed directly to the grammatical bigram independence assumption: because argument structure is not modeled, there is no reason to prefer the correct parse, in which both on tokens have a single dependent, over the chosen parse, in which the first has two dependents and the second has none.³

³Although the model's parse of acquired immune deficiency syndrome agrees with the labelled corpus, this particular parse reflects a failure of the head-word extraction heuristics; acquired and immune should be predependents of deficiency, and deficiency should be a predependent of syndrome.

Figure 2: The same test sentence ("<root> she has also served on several task forces on acquired immune deficiency syndrome"), parsed by the models trained in each of the three experiments. Links are labelled with the mutual information of the linked words; dotted edges are default attachments.

Experiment II measures the generalization ability of the grammatical bigram model; in this setting, the model can recover three out of every five links. To see why the performance drops so drastically, we again turn to an example parse: the middle parse in Figure 2. Because the forces → on link was never observed in the training data, served has been made the head of both on tokens; ironically, this corrects the error made in the top parse because the planarity constraint rules out the incorrect link from the first on token to syndrome. Another error in the middle parse is a failure to select several as a predependent of forces; this error also arises because the combination never occurs in the training data. Thus, we can attribute this drop in performance to sparseness in the training data.

We can compare the grammatical bigram model's parsing performance with the results reported by Eisner [8]. In that investigation, several different probability models are ascribed to the simple dependency grammar described above and are compared on a task similar to Experiment II.⁴ Eisner reports that the best-performing dependency grammar model (Model D) achieves a (direction-sensitive) link precision of 90.0%, and the Collins parser [6] achieves a (direction-sensitive) link precision of 92.6%. The superior performance of these models can be attributed to two factors: first, they include sophisticated models of argument structure; and second, they both make use of part-of-speech taggers, and can "back off" to non-lexical distributions when statistics are not available.

⁴The labelled corpus used in that investigation is also based upon a transformed version of Treebank-III, but the head-word extraction heuristics were slightly different, and sentences with conjunctions were completely eliminated. However, the setup is sufficiently similar that we think the comparison we draw is informative.

Finally, Experiment III shows that when trained on unlabelled data, the grammatical bigram model is able to recover two out of every five links. This performance is rather poor, and is only slightly better than chance; a model that chooses parses uniformly at random achieves 31.3% precision on L_test. To get an intuition for why this performance is so poor, we can examine the last parse, which was induced from unlabelled data. Because Wall Street Journal articles often report corporate news, the frequent co-occurrence of has → acquired has led to a parse consistent with the interpretation that the subject she suffers from AIDS, rather than serving on a task force to study it. We also see that a flat parse structure has been selected for acquired immune deficiency syndrome; this is because while this particular noun phrase occurs in the training data, its constituent nouns do not occur independently with any frequency, and so their relative co-occurrence frequencies cannot be assessed.

4 Discussion

Future work. As one would expect, our experiments indicate that the parsing performance of the grammatical bigram model is not as good as that of state-of-the-art parsers; however, its performance in Experiment II suggests that it may be useful in domains where partial accuracy is valuable and large amounts of unlabelled data are available.
However, to realize that potential, the model must be improved so that its performance in Experiment III is closer to that of Experiment II.

To that end, we can see two obvious avenues of improvement. The first involves increasing the model's capacity for generalization and preventing overfitting. The model presented in this paper is sensitive only to pairwise relationships between words; however, it could make good use of the fact that words can have similar syntactic behavior. We are currently investigating whether word clustering techniques can improve performance in supervised and unsupervised learning. Another way to improve the model is to directly address the primary source of parsing error: the lack of argument structure modeling. We are also investigating approximation algorithms that reintroduce argument structure constraints without making the computational complexity unmanageable.

Related work. A recent proposal by Yuret presents a "lexical attraction" model with similarities to the grammatical bigram model [9]; however, unlike the present proposal, that model is trained using a heuristic algorithm. The grammatical bigram model also bears resemblance to several proposals to extend finite-state methods to model long-distance dependencies (e.g., [12, 13]), although these models are not based upon an underlying theory of syntax.

References

[1] K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the Inside-Outside algorithm. Computer Speech and Language, 4:35-56, 1990.

[2] John Lafferty, Daniel Sleator, and Davy Temperley. Grammatical trigrams: A probabilistic model of link grammar. In Proceedings of the AAAI Conference on Probabilistic Approaches to Natural Language, October 1992.

[3] Yves Schabes. Stochastic lexicalized tree-adjoining grammars. In Proceedings of the Fourteenth International Conference on Computational Linguistics, pages 426-432, Nantes, France, 1992.

[4] Mark A. Paskin. Cubic-time parsing and learning algorithms for grammatical bigram models. Technical Report CSD-01-1148, University of California, Berkeley, 2001.

[5] Jason Eisner. Bilexical grammars and their cubic-time parsing algorithms. In Harry Bunt and Anton Nijholt, editors, Advances in Probabilistic and Other Parsing Technologies, chapter 1. Kluwer Academic Publishers, October 2000.

[6] Michael Collins. Head-driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, Philadelphia, Pennsylvania, 1999.

[7] Richard A. Hudson. English Word Grammar. B. Blackwell, Oxford, UK, 1990.

[8] Jason M. Eisner. An empirical comparison of probability models for dependency grammars. Technical Report IRCS-96-11, CIS Department, University of Pennsylvania, 220 S. 33rd St., Philadelphia, PA 19104-6389, 1996.
[9] Deniz Yuret. Discovery of Linguistic Relations Using Lexical Attraction. PhD thesis, Massachusetts Institute of Technology, May 1998.

[10] M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313-330, 1993.

[11] Jeffrey C. Reynar and Adwait Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31-April 3, 1997.

[12] S. Della Pietra, V. Della Pietra, J. Gillett, J. Lafferty, H. Printz, and L. Ures. Inference and estimation of a long-range trigram model. In Proceedings of the Second International Colloquium on Grammatical Inference and Applications, number 862 in Lecture Notes in Artificial Intelligence, pages 78-92. Springer-Verlag, 1994.

[13] Ronald Rosenfeld. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD thesis, Carnegie Mellon University, 1994.