{"title": "Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models", "book": "Advances in Neural Information Processing Systems", "page_first": 641, "page_last": 648, "abstract": "", "full_text": "Adaptor Grammars: A Framework for Specifying\nCompositional Nonparametric Bayesian Models\n\nMark Johnson\n\nMicrosoft Research / Brown University\n\nThomas L. Grif\ufb01ths\n\nUniversity of California, Berkeley\n\nMark Johnson@Brown.edu\n\nTom Griffiths@Berkeley.edu\n\nSharon Goldwater\nStanford University\n\nsgwater@gmail.com\n\nAbstract\n\nThis paper introduces adaptor grammars, a class of probabilistic models of lan-\nguage that generalize probabilistic context-free grammars (PCFGs). Adaptor\ngrammars augment the probabilistic rules of PCFGs with \u201cadaptors\u201d that can in-\nduce dependencies among successive uses. With a particular choice of adaptor,\nbased on the Pitman-Yor process, nonparametric Bayesian models of language\nusing Dirichlet processes and hierarchical Dirichlet processes can be written as\nsimple grammars. We present a general-purpose inference algorithm for adaptor\ngrammars, making it easy to de\ufb01ne and use such models, and illustrate how several\nexisting nonparametric Bayesian models can be expressed within this framework.\n\n1 Introduction\n\nProbabilistic models of language make two kinds of substantive assumptions: assumptions about\nthe structures that underlie language, and assumptions about the probabilistic dependencies in the\nprocess by which those structures are generated. Typically, these assumptions are tightly coupled.\nFor example, in probabilistic context-free grammars (PCFGs), structures are built up by applying a\nsequence of context-free rewrite rules, where each rule in the sequence is selected independently at\nrandom. In this paper, we introduce a class of probabilistic models that weaken the independence\nassumptions made in PCFGs, which we call adaptor grammars. 
Adaptor grammars insert addi-\ntional stochastic processes called adaptors into the procedure for generating structures, allowing the\nexpansion of a symbol to depend on the way in which that symbol has been rewritten in the past.\nIntroducing dependencies among the applications of rewrite rules extends the set of distributions\nover linguistic structures that can be characterized by a simple grammar.\n\nAdaptor grammars provide a simple framework for de\ufb01ning nonparametric Bayesian models of\nlanguage. With a particular choice of adaptor, based on the Pitman-Yor process [1, 2, 3], simple\ncontext-free grammars specify distributions commonly used in nonparametric Bayesian statistics,\nsuch as Dirichlet processes [4] and hierarchical Dirichlet processes [5]. As a consequence, many\nnonparametric Bayesian models that have been used in computational linguistics, such as models of\nmorphology [6] and word segmentation [7], can be expressed as adaptor grammars. We introduce a\ngeneral-purpose inference algorithm for adaptor grammars, which makes it easy to de\ufb01ne nonpara-\nmetric Bayesian models that generate different linguistic structures and perform inference in those\nmodels.\n\nThe rest of this paper is structured as follows. Section 2 introduces the key technical ideas we\nwill use. Section 3 de\ufb01nes adaptor grammars, while Section 4 presents some examples. Section 5\ndescribes the Markov chain Monte Carlo algorithm we have developed to sample from the posterior\n\n\fdistribution over structures generated by an adaptor grammar. Software implementing this algorithm\nis available from http://cog.brown.edu/\u02dcmj/Software.htm.\n\n2 Background\n\nIn this section, we introduce the two technical ideas that are combined in the adaptor grammars\ndiscussed here: probabilistic context-free grammars, and the Pitman-Yor process. 
We adopt a non-standard formulation of PCFGs in order to emphasize that they are a kind of recursive mixture, and to establish the formal devices we use to specify adaptor grammars.\n\n2.1 Probabilistic context-free grammars\n\nA context-free grammar (CFG) is a quadruple (N, W, R, S) where N is a finite set of nonterminal symbols, W is a finite set of terminal symbols disjoint from N, R is a finite set of productions or rules of the form A \u2192 \u03b2 where A \u2208 N and \u03b2 \u2208 (N \u222a W)\u22c6 (the Kleene closure of the terminal and nonterminal symbols), and S \u2208 N is a distinguished nonterminal called the start symbol. A CFG associates with each symbol A \u2208 N \u222a W a set TA of finite, labeled, ordered trees. If A is a terminal symbol then TA is the singleton set consisting of a unit tree (i.e., containing a single node) labeled A. The sets of trees associated with nonterminals are defined recursively as follows:\n\nTA = \u222a_{A\u2192B1...Bn \u2208 RA} TREEA(TB1, . . . , TBn)\n\nwhere RA is the subset of productions in R with left-hand side A, and TREEA(TB1, . . . , TBn) is the set of all trees whose root node is labeled A, that have n immediate subtrees, and where the ith subtree is a member of TBi. The set of trees generated by the CFG is TS, and the language generated by the CFG is the set {YIELD(t) : t \u2208 TS} of terminal strings or yields of the trees TS.\n\nA probabilistic context-free grammar (PCFG) is a quintuple (N, W, R, S, \u03b8), where (N, W, R, S) is a CFG and \u03b8 is a vector of non-negative real numbers indexed by productions R such that\n\n\u03a3_{A\u2192\u03b2 \u2208 RA} \u03b8_{A\u2192\u03b2} = 1.\n\nInformally, \u03b8_{A\u2192\u03b2} is the probability of expanding the nonterminal A using the production A \u2192 \u03b2. \u03b8 is used to define a distribution GA over the trees TA for each symbol A. 
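This recursive-mixture view of a PCFG can be simulated directly. A minimal sketch, where the toy grammar, its probabilities, and the function names are our own illustration, not from the paper:

```python
import random

# A toy PCFG: nonterminal -> list of (right-hand side, probability) pairs.
# Symbols with no rules are terminals. Grammar and weights are made up.
RULES = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("dog",), 0.5), (("cat",), 0.5)],
    "VP": [(("barks",), 0.7), (("sleeps",), 0.3)],
}

def sample_tree(symbol, rules, rng):
    """Sample from G_A: expand each nonterminal by choosing a production
    independently with probability theta, then recurse on its children."""
    if symbol not in rules:                      # terminal: a unit tree
        return symbol
    rhss = [rhs for rhs, _ in rules[symbol]]
    probs = [p for _, p in rules[symbol]]
    rhs = rng.choices(rhss, weights=probs)[0]
    return (symbol, [sample_tree(child, rules, rng) for child in rhs])

def tree_yield(tree):
    """YIELD(t): the string of terminals at the leaves of t."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in tree_yield(child)]
```

Sampling repeatedly from `sample_tree("S", ...)` draws trees from G_S; the probability of a string is then the total probability of the trees yielding it.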
If A is a terminal symbol, then GA is the distribution that puts all of its mass on the unit tree labeled A. The distributions GA for nonterminal symbols are defined recursively over TA as follows:\n\nGA = \u03a3_{A\u2192B1...Bn \u2208 RA} \u03b8_{A\u2192B1...Bn} TREEDISTA(GB1, . . . , GBn)   (1)\n\nwhere TREEDISTA(GB1, . . . , GBn) is the distribution over TREEA(TB1, . . . , TBn) satisfying:\n\nTREEDISTA(G1, . . . , Gn)(t) = \u03a0_{i=1}^{n} Gi(ti)\n\nwhere t is the tree whose root node is labeled A and whose ith immediate subtree is ti. That is, TREEDISTA(G1, . . . , Gn) is a distribution over trees where the root node is labeled A and each subtree ti is generated independently from Gi; it is this assumption that adaptor grammars relax. The distribution over trees generated by the PCFG is GS, and the probability of a string is the sum of the probabilities of all trees with that string as their yield.\n\n2.2 The Pitman-Yor process\n\nThe Pitman-Yor process [1, 2, 3] is a stochastic process that generates partitions of integers. It is most intuitively described using the metaphor of seating customers at a restaurant. Assume we have a numbered sequence of tables, and zi indicates the number of the table at which the ith customer is seated. Customers enter the restaurant sequentially. The first customer sits at the first table, z1 = 1, and the n + 1st customer chooses a table from the distribution\n\nzn+1 | z1, . . . , zn \u223c ((ma + b) / (n + b)) \u03b4_{m+1} + \u03a3_{k=1}^{m} ((nk \u2212 a) / (n + b)) \u03b4_k   (2)\n\nwhere m is the number of different indices appearing in the sequence z = (z1, . . . , zn), nk is the number of times k appears in z, and \u03b4k is the Kronecker delta function, i.e., the distribution that puts all of its mass on k. The process is specified by two real-valued parameters, a \u2208 [0, 1] and b \u2265 0. The probability of a particular sequence of assignments, z, with a corresponding vector of table counts n = (n1, . . . 
, nm) is\n\nP(z) = PY(n | a, b) = ( \u03a0_{k=1}^{m} (a(k \u2212 1) + b) \u03a0_{j=1}^{nk \u2212 1} (j \u2212 a) ) / \u03a0_{i=0}^{n \u2212 1} (i + b)   (3)\n\nFrom this it is easy to see that the distribution produced by the Pitman-Yor process is exchangeable, with the probability of z being unaffected by permutation of the indices of the zi.\n\nEquation 2 instantiates a kind of \u201crich get richer\u201d dynamics, with customers being more likely to sit at more popular tables. We can use the Pitman-Yor process to define distributions with this character on any desired domain. Assume that every table in our restaurant has a value xj placed on it, with those values being generated from an exchangeable distribution G, which we will refer to as the generator. Then, we can sample a sequence of variables y = (y1, . . . , yn) by using the Pitman-Yor process to produce z and setting yi = xzi. Intuitively, this corresponds to customers entering the restaurant, and emitting the values of the tables they choose. The distribution defined on y by this process will be exchangeable, and has two interesting special cases that depend on the parameters of the Pitman-Yor process. When a = 1, every customer is assigned to a new table, and the yi are drawn from G. When a = 0, the distribution on the yi is that induced by the Dirichlet process [4], a stochastic process that is commonly used in nonparametric Bayesian statistics, with concentration parameter b and base distribution G.\n\nWe can also identify another scheme that generates the distribution outlined in the previous paragraph. Let H be a discrete distribution produced by generating a set of atoms x from G and weights on those atoms from the two-parameter Poisson-Dirichlet distribution [2]. We could then generate a sequence of samples y from H. 
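The seating scheme of Equation 2 is straightforward to simulate; the function name and interface below are our own:

```python
import random

def pitman_yor_seating(n, a, b, rng):
    """Simulate Equation 2: seat n customers, returning the table index
    z_i for each customer (1-based) and the table counts n_k.
    Requires 0 <= a <= 1 and b >= 0."""
    z, counts = [], []
    for _ in range(n):
        m = len(counts)
        if m == 0:
            counts.append(1)                 # first customer: table 1
            z.append(1)
            continue
        # existing table k has weight n_k - a; a new table has m*a + b
        weights = [nk - a for nk in counts] + [m * a + b]
        k = rng.choices(range(m + 1), weights=weights)[0]
        if k == m:
            counts.append(1)                 # open table m + 1
        else:
            counts[k] += 1
        z.append(k + 1)
    return z, counts
```

Placing a generator value x_j on each table and emitting y_i = x_{z_i} then gives the exchangeable distribution described above: a = 1 yields draws from the generator alone, and a = 0 yields a Dirichlet process with concentration parameter b.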
If we integrate over values of H, the distribution on y is the same as that obtained via the Pitman-Yor process [2, 3].\n\n3 Adaptor grammars\n\nIn this section, we use the ideas introduced in the previous section to give a formal definition of adaptor grammars. We first state this definition in full generality, allowing any choice of adaptor, and then consider the case where the adaptor is based on the Pitman-Yor process in more detail.\n\n3.1 A general definition of adaptor grammars\n\nAdaptor grammars extend PCFGs by inserting an additional component called an adaptor into the PCFG recursion (Equation 1). An adaptor C is a function from a distribution G to a distribution over distributions with the same support as G. An adaptor grammar is a sextuple (N, W, R, S, \u03b8, C) where (N, W, R, S, \u03b8) is a PCFG and the adaptor vector C is a vector of (parameters specifying) adaptors indexed by N. That is, CA maps a distribution over trees TA to another distribution over TA, for each A \u2208 N. An adaptor grammar associates each symbol with two distributions GA and HA over TA. If A is a terminal symbol then GA and HA are distributions that put all their mass on the unit tree labeled A, while GA and HA for nonterminal symbols are defined as follows:1\n\nGA = \u03a3_{A\u2192B1...Bn \u2208 RA} \u03b8_{A\u2192B1...Bn} TREEDISTA(HB1, . . . , HBn)\nHA \u223c CA(GA)   (4)\n\nThe intuition here is that GA instantiates the PCFG recursion, while the introduction of HA makes it possible to modify the independence assumptions behind the resulting distribution through the choice of the adaptor, CA. If the adaptor is the identity function, with HA = GA, the result is just a PCFG. However, other distributions over trees can be defined by choosing other adaptors. 
In\npractice, we integrate over HA, to de\ufb01ne a single distribution on trees for any choice of adaptors C.\n1This de\ufb01nition allows an adaptor grammar to include self-recursive or mutually recursive CFG productions\n(e.g., X \u2192 X Y or X \u2192 Y Z, Y \u2192 X W ). Such recursion complicates inference, so we restrict ourselves\nto grammars where the adapted nonterminals are not recursive.\n\n\f3.2 Pitman-Yor adaptor grammars\n\nThe de\ufb01nition given above allows the adaptors to be any appropriate process, but our focus in the\nremainder of the paper will be on the case where the adaptor is based on the Pitman-Yor process.\nPitman-Yor processes can cache, i.e., increase the probability of, frequently occurring trees. The ca-\npacity to replace the independent selection of rewrite rules with an exchangeable stochastic process\nenables adaptor grammars based on the Pitman-Yor process to de\ufb01ne probability distributions over\ntrees that cannot be expressed using PCFGs.\n\nA Pitman-Yor adaptor grammar (PYAG) is an adaptor grammar where the adaptors C are based on\nthe Pitman-Yor process. A Pitman-Yor adaptor CA(GA) is the distribution obtained by generating a\nset of atoms from the distribution GA and weights on those atoms from the two-parameter Poisson-\nDirichlet distribution. A PYAG has an adaptor CA with parameters aA and bA for each non-terminal\nA \u2208 N. As noted above, if aA = 1 then the Pitman-Yor process is the identity function, so A is\nexpanded in the standard manner for a PCFG. Each adaptor CA will also be associated with two\nvectors, xA and nA, that are needed to compute the probability distribution over trees. xA is the\nsequence of previously generated subtrees with root nodes labeled A. Having been \u201ccached\u201d by the\ngrammar, these now have higher probability than other subtrees. nA lists the counts associated with\nthe subtrees in xA. 
The adaptor state can thus be summarized as CA = (aA, bA, xA, nA).\nA Pitman-Yor adaptor grammar analysis u = (t, \u2113) is a pair consisting of a parse tree t \u2208 TS\ntogether with an index function \u2113(\u00b7). If q is a nonterminal node in t labeled A, then \u2113(q) gives the\nindex of the entry in xA for the subtree t\u2032 of t rooted at q, i.e., such that xA\u2113(q) = t\u2032. The sequence\nof analyses u = (u1, . . . , un) generated by an adaptor grammar contains suf\ufb01cient information to\ncompute the adaptor state C(u) after generating u: the elements of xA are the distinctly indexed\nsubtrees of u with root label A, and their frequencies nA can be found by performing a top-down\ntraversal of each analysis in turn, only visiting the children of a node q when the subanalysis rooted\nat q is encountered for the \ufb01rst time (i.e., when it is added to xA).\n\n4 Examples of Pitman-Yor adaptor grammars\n\nPitman-Yor adaptor grammars provide a framework in which it is easy to de\ufb01ne compositional non-\nparametric Bayesian models. The use of adaptors based on the Pitman-Yor process allows us to\nspecify grammars that correspond to Dirichlet processes [4] and hierarchical Dirichlet processes\n[5]. Once expressed in this framework, a general-purpose inference algorithm can be used to calcu-\nlate the posterior distribution over analyses produced by a model. In this section, we illustrate how\nexisting nonparametric Bayesian models used for word segmentation [7] and morphological anal-\nysis [6] can be expressed as adaptor grammars, and describe the results of applying our inference\nalgorithm in these models. We postpone the presentation of the algorithm itself until Section 5.\n\n4.1 Dirichlet processes and word segmentation\n\nAdaptor grammars can be used to de\ufb01ne Dirichlet processes with discrete base distributions. 
It is straightforward to write down an adaptor grammar that defines a Dirichlet process over all strings:\n\nWord \u2192 Chars\nChars \u2192 Char\nChars \u2192 Chars Char   (5)\n\nThe productions expanding Char to all possible characters are omitted to save space. The start symbol for this grammar is Word. The parameters aChar and aChars are set to 1, so the adaptors for Char and Chars are the identity function and HChars = GChars is the distribution over words produced by sampling each character independently (i.e., a \u201cmonkeys at typewriters\u201d model). Finally, aWord is set to 0, so the adaptor for Word is a Dirichlet process with concentration parameter bWord. This grammar generates all possible strings of characters and assigns them simple right-branching structures of no particular interest, but the Word adaptor changes their distribution to one that reflects the frequencies of previously generated words. Initially, the Word adaptor is empty (i.e., xWord is empty), so the first word s1 generated by the grammar is distributed according to GChars. However, the second word can be generated in two ways: either it is retrieved from the adaptor\u2019s cache (and hence is s1) with probability 1/(1 + bWord), or else with probability bWord/(1 + bWord) it is a new word generated by GChars. After n words have been emitted, Word puts mass n/(n + bWord) on those words and reserves mass bWord/(n + bWord) for new words (i.e., generated by Chars).\n\nWe can extend this grammar to a simple unigram word segmentation model by adding the following productions, changing the start label to Words and setting aWords = 1.\n\nWords \u2192 Word\nWords \u2192 Word Words\n\nThis grammar generates sequences of Word subtrees, so it implicitly segments strings of terminals into a sequence of words, and in fact implements the word segmentation model of [7]. 
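The behaviour of the Word adaptor just described (aWord = 0, base distribution GChars) can be sketched in a few lines; the character model, its stopping probability, and the names below are illustrative assumptions:

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def sample_chars(rng, stop_prob=0.3):
    """A 'monkeys at typewriters' base distribution over strings: emit at
    least one character, stopping after each one with fixed probability."""
    word = rng.choice(ALPHABET)
    while rng.random() >= stop_prob:
        word += rng.choice(ALPHABET)
    return word

def sample_words(n, b_word, rng):
    """Draw n words from the Word adaptor with a_Word = 0: after i words,
    reserve mass b_word / (i + b_word) for a fresh draw from the base and
    give each previously emitted token mass 1 / (i + b_word)."""
    words = []
    for i in range(n):
        if rng.random() < b_word / (i + b_word):
            words.append(sample_chars(rng))   # new word from G_Chars
        else:
            words.append(rng.choice(words))   # cached: rich get richer
    return words
```

The reuse branch picks a past token uniformly, so a word cached k times is re-emitted with probability k/(i + b_word), matching the mass assignment described above.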
We applied the\ngrammar above with the algorithm described in Section 5 to a corpus of unsegmented child-directed\nspeech [8]. The input strings are sequences of phonemes such as WAtIzIt. A typical parse might\nconsist of Words dominating three Word subtrees, each in turn dominating the phoneme sequences\nWat, Iz and It respectively. Using the sampling procedure described in Section 5 with bWord =\n30, we obtained a segmentation which identi\ufb01ed words in unsegmented input with 0.64 precision,\n0.51 recall, and 0.56 f-score, which is consistent with the results presented for the unigram model\nof [7] on the same data.\n\n4.2 Hierarchical Dirichlet processes and morphological analysis\n\nAn adaptor grammar with more than one adapted nonterminal can implement a hierarchical Dirichlet\nprocess. A hierarchical Dirichlet process that uses the Word process as a generator can be de\ufb01ned\nby adding the production Word1 \u2192 Word to (5) and making Word1 the start symbol. Informally,\nWord1 generates words either from its own cache xWord1 or from the Word distribution. Word\nitself generates words either from xWord or from the \u201cmonkeys at typewriters\u201d model Chars.\nA slightly more elaborate grammar can implement the morphological analysis described in [6].\nWords are analysed into stem and suf\ufb01x substrings; e.g., the word jumping is analysed as a stem\njump and a suf\ufb01x ing. As [6] notes, one of the dif\ufb01culties in constructing a probabilistic account\nof such suf\ufb01xation is that the relative frequencies of suf\ufb01xes varies dramatically depending on the\nstem. That paper used a Pitman-Yor process to effectively dampen this frequency variation, and\nthe adaptor grammar described here does exactly the same thing. 
The productions of the adaptor grammar are as follows, where Chars is \u201cmonkeys at typewriters\u201d once again:\n\nWord \u2192 Stem Suffix\nWord \u2192 Stem\nStem \u2192 Chars\nSuffix \u2192 Chars\n\nWe now give an informal description of how samples might be generated by this grammar. The nonterminals Word, Stem and Suffix are associated with Pitman-Yor adaptors. Stems and suffixes that occur in many words are associated with highly probable cache entries, and so have much higher probability than under the Chars PCFG subgrammar.\n\nFigure 1 depicts a possible state of the adaptors in this adaptor grammar after generating the three words walking, jumping and walked. Such a state could be generated as follows. Before any strings are generated all of the adaptors are empty. To generate the first word we must sample from HWord, as there are no entries in the Word adaptor. Sampling from HWord requires sampling from GStem and perhaps also GSuffix, and eventually from the Chars distributions. Supposing that these return walk and ing as Stem and Suffix strings respectively, the adaptor entries after generating the first word walking consist of the first entries for Word, Stem and Suffix.\n\nIn order to generate another Word we first decide whether to select an existing word from the adaptor, or whether to generate the word using GWord. Suppose we choose the latter. Then we must sample from HStem and perhaps also from HSuffix. Suppose we choose to generate the new stem jump from GStem (resulting in the second entry in the Stem adaptor) but choose to reuse the existing Suffix adaptor entry, resulting in the word jumping. 
The third word walked is generated in a similar fashion: this time the stem is the first entry in the Stem adaptor, but the suffix ed is generated from GSuffix and becomes the second entry in the Suffix adaptor.\n\n[Figure 1 appears here.]\n\nFigure 1: A depiction of a possible state of the Pitman-Yor adaptors in the adaptor grammar of Section 4.2 after generating walking, jumping and walked.\n\nThe model described in [6] is more complex than the one just described because it uses a hidden \u201cmorphological class\u201d variable that determines which stem-suffix pair is selected. The morphological class variable is intended to capture morphological variation; e.g., the present continuous form skipping is formed by suffixing ping instead of the ing form used in walking and jumping. This can be expressed using an adaptor grammar with productions that instantiate the following schema:\n\nWord \u2192 Wordc\nWordc \u2192 Stemc Suffixc\nWordc \u2192 Stemc\nStemc \u2192 Chars\nSuffixc \u2192 Chars\n\nHere c ranges over the hidden morphological classes, and the productions expanding Chars and Char are as before. We set the adaptor parameter aWord = 1 for the start nonterminal symbol Word, so we adapt the Wordc, Stemc and Suffixc nonterminals for each hidden class c.\n\nFollowing [6], we used this grammar with six hidden classes c to segment 170,015 orthographic verb tokens from the Penn Wall Street Journal corpus, and set a = 0 and b = 500 for the adapted nonterminals. 
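The adaptor bookkeeping CA = (aA, bA, xA, nA) used in the walkthrough above can be sketched as a small class. The class name, the toy base distribution, and the composition of Word from Stem and Suffix draws are our own illustration (shown with a = 0, b = 500 as in the experiment):

```python
import random

class PYAdaptor:
    """State (a, b, x, n) of a Pitman-Yor adaptor: cached values x with
    counts n, reused with 'rich get richer' weights."""
    def __init__(self, a, b, base, rng):
        self.a, self.b, self.base, self.rng = a, b, base, rng
        self.x, self.n = [], []

    def sample(self):
        m, total = len(self.x), sum(self.n)
        # a new value is drawn with probability (m*a + b) / (total + b)
        if m == 0 or self.rng.random() < (m * self.a + self.b) / (total + self.b):
            self.x.append(self.base())
            self.n.append(1)
            return self.x[-1]
        # otherwise reuse cache entry k with weight n_k - a
        k = self.rng.choices(range(m), weights=[c - self.a for c in self.n])[0]
        self.n[k] += 1
        return self.x[k]

rng = random.Random(0)
chars = lambda: "".join(rng.choice("aeiou") for _ in range(3))   # toy G_Chars
stem = PYAdaptor(0.0, 500.0, chars, rng)
suffix = PYAdaptor(0.0, 500.0, chars, rng)
word = PYAdaptor(0.0, 500.0, lambda: stem.sample() + suffix.sample(), rng)
words = [word.sample() for _ in range(10)]
```

Shared stems and suffixes accumulate counts in their adaptors, so frequent substrings become cheap to reuse, which is exactly the dampening of suffix-frequency variation described above.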
Although we trained on all verbs in the corpus, we evaluated the segmentation produced by the inference procedure described below on just the verbs whose infinitival stems were a prefix of the verb itself (i.e., we evaluated skipping but ignored wrote, since its stem write is not a prefix). Of the 116,129 tokens we evaluated, 70% were correctly segmented, and of the 7,170 verb types, 66% were correctly segmented. Many of the errors were in fact linguistically plausible: e.g., eased was analysed as a stem eas followed by a suffix ed, permitting the grammar to also generate easing as eas plus ing.\n\n5 Bayesian inference for Pitman-Yor adaptor grammars\n\nThe results presented in the previous section were obtained by using a Markov chain Monte Carlo (MCMC) algorithm to sample from the posterior distribution over PYAG analyses u = (u1, . . . , un) given strings s = (s1, . . . , sn), where si \u2208 W\u22c6 and ui is the analysis of si. We assume we are given a CFG (N, W, R, S), vectors of Pitman-Yor adaptor parameters a and b, and a Dirichlet prior with hyperparameters \u03b1 over production probabilities \u03b8, i.e.:\n\nP(\u03b8 | \u03b1) = \u03a0_{A \u2208 N} (1 / B(\u03b1A)) \u03a0_{A\u2192\u03b2 \u2208 RA} \u03b8_{A\u2192\u03b2}^{\u03b1_{A\u2192\u03b2} \u2212 1}, where B(\u03b1A) = \u03a0_{A\u2192\u03b2 \u2208 RA} \u0393(\u03b1_{A\u2192\u03b2}) / \u0393(\u03a3_{A\u2192\u03b2 \u2208 RA} \u03b1_{A\u2192\u03b2})\n\nwith \u0393(x) being the generalized factorial function, and \u03b1A is the subsequence of \u03b1 indexed by RA (i.e., corresponding to productions that expand A). 
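The Pitman-Yor factor PY(n | a, b) of Equation 3, which reappears in the joint probability below, is convenient to compute in log space. A sketch (the function name is ours; it assumes b > 0):

```python
import math

def log_py_prob(counts, a, b):
    """log PY(n | a, b) from Equation 3 for table counts n = (n_1, ..., n_m):
    sum of log(a(k-1) + b) over tables k, plus log(j - a) for j = 1..n_k - 1
    at each table, minus log(i + b) for i = 0..n - 1. Assumes b > 0."""
    n, m = sum(counts), len(counts)
    log_p = sum(math.log(a * (k - 1) + b) for k in range(1, m + 1))
    log_p += sum(math.log(j - a) for nk in counts for j in range(1, nk))
    log_p -= sum(math.log(i + b) for i in range(n))
    return log_p
```

Because the result depends only on the multiset of counts, the exchangeability noted after Equation 3 is visible directly in the code.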
The joint probability of u under this PYAG, integrating over the distributions HA generated from the two-parameter Poisson-Dirichlet distribution associated with each adaptor, is\n\nP(u | \u03b1, a, b) = \u03a0_{A \u2208 N} (B(\u03b1A + fA(xA)) / B(\u03b1A)) PY(nA(u) | a, b)   (6)\n\nwhere fA\u2192\u03b2(xA) is the number of times the root node of a tree in xA is expanded by production A \u2192 \u03b2, and fA(xA) is the sequence of such counts (indexed by r \u2208 RA). Informally, the first term in (6) is the probability of generating the topmost node in each analysis in adaptor CA (the rest of the tree is generated by another adaptor), while the second term (from Equation 3) is the probability of generating a Pitman-Yor adaptor with counts nA.\n\nThe posterior distribution over analyses u given strings s is obtained by normalizing P(u | \u03b1, a, b) over all analyses u that have s as their yield. Unfortunately, computing this distribution is intractable. Instead, we draw samples from this distribution using a component-wise Metropolis-Hastings sampler, proposing changes to the analysis ui for each string si in turn. The proposal distribution is constructed to approximate the conditional distribution over ui given si and the analyses of all other strings u\u2212i, P(ui | si, u\u2212i). Since there does not seem to be an efficient (dynamic programming) algorithm for directly sampling from P(ui | si, u\u2212i),2 we construct a PCFG G\u2032(u\u2212i) on the fly whose parse trees can be transformed into PYAG analyses, and use this as our proposal distribution.\n\n5.1 The PCFG approximation G\u2032(u\u2212i)\n\nA PYAG can be viewed as a special kind of PCFG which adapts its production probabilities depending on its history. The PCFG approximation G\u2032(u\u2212i) = (N, W, R\u2032, S, \u03b8\u2032) is a static snapshot of the adaptor grammar given the sentences s\u2212i (i.e., all of the sentences in s except si). 
Given an adaptor grammar H = (N, W, R, S, C), let:\n\nR\u2032 = R \u222a \u222a_{A \u2208 N} {A \u2192 YIELD(x) : x \u2208 xA}\n\n\u03b8\u2032_{A\u2192\u03b2} = ((mA aA + bA) / (nA + bA)) \u00b7 ((fA\u2192\u03b2(xA) + \u03b1A\u2192\u03b2) / (mA + \u03a3_{A\u2192\u03b2 \u2208 RA} \u03b1A\u2192\u03b2)) + \u03a3_{k : YIELD(xAk) = \u03b2} ((nAk \u2212 aA) / (nA + bA))\n\nwhere YIELD(x) is the terminal string or yield of the tree x and mA is the length of xA. R\u2032 contains all of the productions R, together with productions representing the adaptor entries xA for each A \u2208 N. These additional productions rewrite directly to strings of terminal symbols, and their probability is the probability of the adaptor CA generating the corresponding value xAk.\n\nThe two terms to the left of the summation specify the probability of selecting a production from the original productions R. The first term is the probability of adaptor CA generating a new value, and the second term is the MAP estimate of the production\u2019s probability, estimated from the root expansions of the trees xA.\n\nIt is straightforward to map parses of a string s produced by G\u2032 to corresponding adaptor analyses for the adaptor grammar H (it is possible for a single production of R\u2032 to correspond to several adaptor entries so this mapping may be non-deterministic). This means that we can use the PCFG G\u2032 with an efficient PCFG sampling procedure [9] to generate possible adaptor grammar analyses for ui.\n\n5.2 A Metropolis-Hastings algorithm\n\nThe previous section described how to sample adaptor analyses u for a string s from a PCFG approximation G\u2032 to an adaptor grammar H. We use this as our proposal distribution in a Metropolis-Hastings algorithm.\n\n2The independence assumptions of PCFGs play an important role in making dynamic programming possible. In PYAGs, the probability of a subtree adapts dynamically depending on the other subtrees in u, including those in ui. 
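The snapshot probabilities \u03b8\u2032 of Section 5.1 can be assembled mechanically from the adaptor state. In this sketch (the data layout and names are our own), cached entries are kept as separate productions rather than folded into original rules sharing the same yield:

```python
def approx_probs(rule_counts, alpha, cache, a, b):
    """Production probabilities of G'(u_-i) for one nonterminal A.
    rule_counts: rhs -> f_{A->rhs}(x_A), root-expansion counts of the cache;
    alpha: rhs -> Dirichlet pseudocount; cache: list of (yield, n_Ak) pairs;
    a, b: the adaptor parameters a_A, b_A."""
    m = len(cache)
    total = sum(count for _, count in cache)     # n_A
    alpha_sum = sum(alpha.values())
    theta = {}
    for rhs, f in rule_counts.items():
        # P(new adaptor draw) times the MAP estimate of the rule probability
        theta[rhs] = (m * a + b) / (total + b) * (f + alpha[rhs]) / (m + alpha_sum)
    for k, (y, n_k) in enumerate(cache):
        # cached subtree k rewrites A directly to its terminal yield y
        theta[("cache", k, y)] = (n_k - a) / (total + b)
    return theta
```

Since the root-expansion counts sum to mA, these probabilities sum to one, and a parse drawn from G\u2032 can then be mapped back to an adaptor analysis as described above.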
If ui is the current analysis of si and u\u2032i \u2260 ui is a proposal analysis sampled from P(Ui | si, G\u2032(u\u2212i)), we accept the proposal u\u2032i with probability A(ui, u\u2032i), where:\n\nA(ui, u\u2032i) = min{ 1, (P(u\u2032 | \u03b1, a, b) P(ui | si, G\u2032(u\u2212i))) / (P(u | \u03b1, a, b) P(u\u2032i | si, G\u2032(u\u2212i))) }\n\nwhere u\u2032 is the same as u except that u\u2032i replaces ui. Except when the number of training strings s is very small, we find that only a tiny fraction (less than 1%) of proposals are rejected, presumably because the probability of an adaptor analysis does not change significantly within a single string.\n\nOur inference procedure is as follows. Given a set of training strings s we choose an initial set of analyses for them at random. At each iteration we pick a string si from s at random, and sample a parse for si from the PCFG approximation G\u2032(u\u2212i), updating u when the Metropolis-Hastings procedure accepts the proposed analysis. At convergence the u produced by this procedure are samples from the posterior distribution over analyses given s, and samples from the posterior distribution over adaptor states C(u) and production probabilities \u03b8 can be computed from them.\n\n6 Conclusion\n\nThe strong independence assumptions of probabilistic context-free grammars tightly couple compositional structure with the probabilistic generative process that produces that structure. Adaptor grammars relax that coupling by inserting an additional stochastic component into the generative process. Pitman-Yor adaptor grammars use adaptors based on the Pitman-Yor process. This choice makes it possible to express Dirichlet process and hierarchical Dirichlet process models over discrete domains as simple context-free grammars. 
We have proposed a general-purpose inference\nalgorithm for adaptor grammars, which can be used to sample from the posterior distribution over\nanalyses produced by any adaptor grammar. While our focus here has been on demonstrating that\nthis algorithm can be used to produce equivalent results to existing nonparametric Bayesian models\nused for word segmentation and morphological analysis, the great promise of this framework lies in\nits simpli\ufb01cation of specifying and using such models, providing a basic toolbox that will facilitate\nthe construction of more sophisticated models.\n\nAcknowledgments\n\nThis work was performed while all authors were at the Cognitive and Linguistic Sciences Depart-\nment at Brown University and supported by the following grants: NIH R01-MH60922 and RO1-\nDC000314, NSF 9870676, 0631518 and 0631667, the DARPA CALO project and DARPA GALE\ncontract HR0011-06-2-0001.\n\nReferences\n[1] J. Pitman. Exchangeable and partially exchangeable random partitions. Probability Theory and Related\n\nFields, 102:145\u2013158, 1995.\n\n[2] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator.\n\nAnnals of Probability, 25:855\u2013900, 1997.\n\n[3] H. Ishwaran and L. F. James. Generalized weighted Chinese restaurant processes for species sampling\n\nmixture models. Statistica Sinica, 13:1211\u20131235, 2003.\n\n[4] T. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209\u2013230,\n\n1973.\n\n[5] Y. W. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American\n\nStatistical Association, to appear.\n\n[6] S. Goldwater, T. L. Grif\ufb01ths, and M. Johnson. Interpolating between types and tokens by estimating power-\n\nlaw generators. In Advances in Neural Information Processing Systems 18, 2006.\n\n[7] S. Goldwater, T. L. Grif\ufb01ths, and M. Johnson. Contextual dependencies in unsupervised word segmenta-\n\ntion. 
In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, 2006.\n\n[8] M. Brent. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34:71\u2013105, 1999.\n\n[9] J. Goodman. Parsing inside-out. PhD thesis, Harvard University, 1998. Available from http://research.microsoft.com/\u02dcjoshuago/.\n", "award": [], "sourceid": 3101, "authors": [{"given_name": "Mark", "family_name": "Johnson", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}, {"given_name": "Sharon", "family_name": "Goldwater", "institution": null}]}