{"title": "Interpolating between types and tokens by estimating power-law generators", "book": "Advances in Neural Information Processing Systems", "page_first": 459, "page_last": 466, "abstract": null, "full_text": "Interpolating Between Types and Tokens by Estimating Power-Law Generators\u2217\n\nSharon Goldwater, Thomas L. Griffiths, Mark Johnson\n\nDepartment of Cognitive and Linguistic Sciences, Brown University, Providence RI 02912, USA\n\n{sharon goldwater,tom griffiths,mark johnson}@brown.edu\n\nAbstract\n\nStandard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process \u2013 the Pitman-Yor process \u2013 as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.\n\n1 Introduction\n\nIn general it is important for models used in unsupervised learning to be able to describe the gross statistical properties of the data they are intended to learn from; otherwise these properties may distort inferences about the parameters of the model. One of the most striking statistical properties of natural languages is that the distribution of word frequencies is closely approximated by a power-law. That is, the probability that a word w will occur with frequency n_w in a sufficiently large corpus is proportional to n_w^{-g}. This observation, which is usually attributed to Zipf [1] but enjoys a long and detailed history [2], stimulated intense research in the 1950s (e.g., [3]) but has largely been ignored in modern computational linguistics. 
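The power-law claim above is easy to probe empirically. The following sketch (ours, not from the paper; the toy corpus and function name are invented for illustration) estimates the exponent g by a least-squares fit on the log-log frequency-of-frequencies plot:

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Estimate g in P(frequency = n) proportional to n^(-g) by a
    least-squares fit on the log-log frequency-of-frequencies plot."""
    counts = Counter(tokens)                  # word -> token frequency
    freq_of_freq = Counter(counts.values())   # n -> number of types with frequency n
    xs = [math.log(n) for n in freq_of_freq]
    ys = [math.log(c) for c in freq_of_freq.values()]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope                             # g is the negated log-log slope

corpus = "the cat sat on the mat and the dog sat near the cat".split()
g = zipf_exponent(corpus)   # a crude estimate on a toy corpus
```

On a real corpus one would fit over many more frequency values; this sketch only illustrates the quantity being estimated.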
By developing models that generically exhibit power-laws, it may be possible to improve methods for unsupervised learning of linguistic structure.\n\nIn this paper, we introduce a framework for developing generative models for language that produce power-law distributions. Our framework is based upon the idea of specifying language models in terms of two components: a generator, an underlying generative model for words which need not (and usually does not) produce a power-law distribution, and an adaptor, which transforms the stream of words produced by the generator into one whose frequencies obey a power-law distribution. This framework is extremely general: any generative model for language can be used as a generator, with the power-law distribution being produced as the result of making an appropriate choice for the adaptor.\n\nIn our framework, estimation of the parameters of the generator will be affected by assumptions about the form of the adaptor. We show that use of a particular adaptor, the Pitman-Yor process [4, 5, 6], sheds light on a tension exhibited by formal approaches to natural language: whether explanations should be based upon the types of words that languages exhibit, or the frequencies with which tokens of those words occur.\n\n\u2217This work was partially supported by NSF awards IGERT 9870676 and ITR 0085940 and NIMH award 1R0-IMH60922-01A2.\n\nOne place where this tension manifests is in accounts of morphology, where formal linguists develop accounts of why particular words appear in the lexicon (e.g., [7]), while computational linguists focus on statistical models of the frequencies of tokens of those words (e.g., [8]). The tension between types and tokens also appears within computational linguistics. 
For example, one of the most successful forms of smoothing used in statistical language models, Kneser-Ney smoothing, explicitly interpolates between type and token frequencies [9, 10, 11].\n\nThe plan of the paper is as follows. Section 2 discusses stochastic processes that can produce power-law distributions, including the Pitman-Yor process. Section 3 specifies a two-stage language model that uses the Pitman-Yor process as an adaptor, and examines some properties of this model: Section 3.1 shows that estimation based on type and token frequencies are special cases of this two-stage language model, and Section 3.2 uses these results to provide a novel justification for the use of Kneser-Ney smoothing. Section 4 describes a model for unsupervised learning of the morphological structure of words that uses our framework, and demonstrates that its performance improves as we move from estimation based upon tokens to types. Section 5 concludes the paper.\n\n2 Producing power-law distributions\n\nAssume we want to generate a sequence of N outcomes, z = {z_1, \ldots, z_N}, with each outcome z_i being drawn from a set of (possibly unbounded) size Z. Many of the stochastic processes that produce power-laws are based upon the principle of preferential attachment, where the probability that the ith outcome, z_i, takes on a particular value k depends upon the frequency of k in z_{-i} = {z_1, \ldots, z_{i-1}} [2]. For example, one of the earliest and most widely used preferential attachment schemes [3] chooses z_i according to the distribution\n\nP(z_i = k \mid z_{-i}) = a \frac{1}{Z} + (1 - a) \frac{n_k^{(z_{-i})}}{i - 1}    (1)\n\nwhere n_k^{(z_{-i})} is the number of times k occurs in z_{-i}. This \u201crich-get-richer\u201d process means that a few outcomes appear with very high frequency in z \u2013 the key attribute of a power-law distribution. 
In this case, the power-law has parameter g = 1/(1 - a).\n\nOne problem with these classical models is that they assume a fixed ordering on the outcomes z. While this may be appropriate for some settings, the assumption of a temporal ordering restricts the contexts in which such models can be applied. In particular, it is much more restrictive than the assumption of independent sampling that underlies most statistical language models. Consequently, we will focus on a different preferential attachment scheme, based upon the two-parameter species sampling model [4, 5] known as the Pitman-Yor process [6]. Under this scheme outcomes follow a power-law distribution, but remain exchangeable: the probability of a set of outcomes is not affected by their ordering.\n\nThe Pitman-Yor process can be viewed as a generalization of the Chinese restaurant process [6]. Assume that N customers enter a restaurant with infinitely many tables, each with infinite seating capacity. Let z_i denote the table chosen by the ith customer. The first customer sits at the first table, z_1 = 1. The ith customer chooses table k with probability\n\nP(z_i = k \mid z_{-i}) = \begin{cases} \frac{n_k^{(z_{-i})} - a}{i - 1 + b} & k \le K(z_{-i}) \\ \frac{K(z_{-i})a + b}{i - 1 + b} & k = K(z_{-i}) + 1 \end{cases}    (2)\n\nwhere a and b are the two parameters of the process and K(z_{-i}) is the number of tables that are currently occupied.\n\nThe Pitman-Yor process satisfies our need for a process that produces power-laws while retaining exchangeability. Equation 2 is clearly a preferential attachment scheme. When\n\nFigure 1: Graphical models showing dependencies among variables in (a) the simple two-stage model, and (b) the morphology model. 
Shading of the node containing w reflects the fact that this variable is observed. Dotted lines delimit the generator and adaptor.\n\na = 0 and b > 0, it reduces to the standard Chinese restaurant process [12, 4] used in Dirichlet process mixture models [13]. When 0 < a < 1, the number of people seated at each table follows a power-law distribution with g = 1 + a [5]. It is straightforward to show that the customers are exchangeable: the probability of a partition of customers into sets seated at different tables is unaffected by the order in which the customers were seated.\n\n3 A two-stage language model\n\nWe can use the Pitman-Yor process as the foundation for a language model that generically produces power-law distributions. We will define a two-stage model by extending the restaurant metaphor introduced above. Imagine that each table k is labelled with a word \u2113_k from a vocabulary of (possibly unbounded) size W. The first stage is to generate these labels, sampling \u2113_k from a generative model for words that we will refer to as the generator. For example, we could choose to draw the labels from a multinomial distribution \u03b8. The second stage is to generate the actual sequence of words itself. This is done by allowing a sequence of customers to enter the restaurant. Each customer chooses a table, producing a seating arrangement, z, and says the word used to label that table, producing a sequence of words, w. The process by which customers choose tables, which we will refer to as the adaptor, defines a probability distribution over the sequence of words w produced by the customers, determining the frequency with which tokens of the different types occur. The statistical dependencies among the variables in one such model are shown in Figure 1 (a). Given the discussion in the previous section, the Pitman-Yor process is a natural choice for an adaptor. 
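The two-stage restaurant metaphor can be made concrete with a small simulation (an illustrative sketch of ours, not the authors' code): customers choose tables by the Pitman-Yor rule of Equation 2, and each newly opened table is labelled by a draw from the generator.

```python
import random

def two_stage_words(n_words, a, b, generator, rng):
    """Sample words from the two-stage model: the Pitman-Yor adaptor
    (Equation 2) seats customers at tables; the generator labels tables."""
    counts, labels, words = [], [], []
    for i in range(n_words):
        # Occupied table k has weight n_k - a; a new table has weight K*a + b.
        # The weights sum to i + b, the denominator in Equation 2.
        weights = [n - a for n in counts] + [len(counts) * a + b]
        r = rng.random() * sum(weights)
        k = 0
        while r > weights[k]:
            r -= weights[k]
            k += 1
        if k == len(counts):        # open a new table and label it
            counts.append(0)
            labels.append(generator())
        counts[k] += 1
        words.append(labels[k])
    return words

rng = random.Random(0)
vocab = ["walk", "walks", "walked", "walking"]   # toy generator support
words = two_stage_words(200, a=0.5, b=0.0,
                        generator=lambda: rng.choice(vocab), rng=rng)
```

With a uniform generator over a small vocabulary, the token frequencies are already heavily skewed toward a few tables, the rich-get-richer behaviour the adaptor is meant to contribute.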
The result is technically a Pitman-Yor mixture model, with z_i indicating the \u201cclass\u201d responsible for generating the ith word, and \u2113_k determining the multinomial distribution over words associated with class k, with P(w_i = w \mid z_i = k, \u2113_k) = 1 if \u2113_k = w, and 0 otherwise. Under this model the probability that the ith customer produces word w given previously produced words w_{-i} and current seating arrangement z_{-i} is\n\nP(w_i = w \mid w_{-i}, z_{-i}, \theta) = \sum_k \sum_{\ell_k} P(w_i = w \mid z_i = k, \ell_k) P(\ell_k \mid w_{-i}, z_{-i}, \theta) P(z_i = k \mid z_{-i})\n= \sum_{k=1}^{K(z_{-i})} \frac{n_k^{(z_{-i})} - a}{i - 1 + b} I(\ell_k = w) + \frac{K(z_{-i})a + b}{i - 1 + b} \theta_w    (3)\n\nwhere I(\u00b7) is an indicator function, being 1 when its argument is true and 0 otherwise. If \u03b8 is uniform over all W words, then the distribution over w reduces to the Pitman-Yor process as W \u2192 \u221e. Otherwise, multiple tables can receive the same label, increasing the frequency of the corresponding word and producing a distribution with g < 1 + a. Again, it is straightforward to show that words are exchangeable under this distribution.\n\n3.1 Types and tokens\n\nThe use of the Pitman-Yor process as an adaptor provides a justification for the role of word types in formal analyses of natural language. This can be seen by considering the question of how to estimate the parameters of the multinomial distribution used as a generator, \u03b8.^1 In general, the parameters of generators can be estimated using Markov chain Monte Carlo methods, as we demonstrate in Section 4. In this section, we will show that estimation schemes based upon type and token frequencies are special cases of our language model, corresponding to the extreme values of the parameter a. 
Values of a between these extremes identify estimation methods that interpolate between types and tokens.\n\nTaking a multinomial distribution with parameters \u03b8 as a generator and the Pitman-Yor process as an adaptor, the probability of a sequence of words w given \u03b8 is\n\nP(w \mid \theta) = \sum_{z,\ell} P(w, z, \ell \mid \theta) = \sum_{z,\ell} \frac{\Gamma(b)}{\Gamma(N + b)} \prod_{k=1}^{K(z)} \theta_{\ell_k} \left( (k-1)a + b \right) \frac{\Gamma(n_k^{(z)} - a)}{\Gamma(1 - a)}\n\nwhere in the last sum z and \ell are constrained such that \ell_{z_i} = w_i for all i. In the case where b = 0, this simplifies to\n\nP(w \mid \theta) = \sum_{z,\ell} \left( \prod_{k=1}^{K(z)} \theta_{\ell_k} \right) \cdot \frac{\Gamma(K(z))}{\Gamma(N)} \cdot a^{K(z)-1} \cdot \left( \prod_{k=1}^{K(z)} \frac{\Gamma(n_k^{(z)} - a)}{\Gamma(1 - a)} \right)    (4)\n\nThe distribution P(w \mid \theta) determines how the data w influence estimates of \u03b8, so we will consider how P(w \mid \theta) changes under different limits of a.\n\nIn the limit as a approaches 1, estimation of \u03b8 is based upon word tokens. When a \u2192 1, \frac{\Gamma(n_k^{(z)} - a)}{\Gamma(1 - a)} is 1 for n_k^{(z)} = 1 but approaches 0 for n_k^{(z)} > 1. Consequently, all terms in the sum over (z, \ell) go to zero, except that in which every word token has its own table. In this case, K(z) = N and \ell_k = w_k. It follows that \lim_{a \to 1} P(w \mid \theta) = \prod_{k=1}^{N} \theta_{w_k}. Any form of estimation using P(w \mid \theta) will thus be based upon the frequencies of word tokens in w.\n\nIn the limit as a approaches 0, estimation of \u03b8 is based upon word types. The appearance of a^{K(z)-1} in Equation 4 means that as a \u2192 0, the sum over z is dominated by the seating arrangement that minimizes the total number of tables. Under the constraint that \ell_{z_i} = w_i for all i, this minimal configuration is the one in which every word type receives a single table. 
Consequently, \lim_{a \to 0} P(w \mid \theta) is dominated by a term in which there is a single instance of \theta_w for each word w that appears in w.^2 Any form of estimation using P(w \mid \theta) will thus be based upon a single instance of each word type in w.\n\n3.2 Predictions and smoothing\n\nIn addition to providing a justification for the role of types in formal analyses of language in general, use of the Pitman-Yor process as an adaptor can be used to explain the assumptions behind a specific scheme for combining token and type frequencies: Kneser-Ney smoothing. Smoothing methods are schemes for regularizing empirical estimates of the probabilities of words, with the goal of improving the predictive performance of language models. The Kneser-Ney smoother estimates the probability of a word by combining type and token frequencies, and has proven particularly effective for n-gram models [9, 10, 11].\n\n^1 Under the interpretation of this model as a Pitman-Yor process mixture model, this is analogous to estimating the base measure G_0 in a Dirichlet process mixture model (e.g. [13]).\n\n^2 Despite the fact that P(w \mid \theta) approaches 0 in this limit, a^{K(z)-1} will be constant across all choices of \u03b8. Consequently, estimation schemes that depend only on the non-constant terms in P(w \mid \theta), such as maximum-likelihood or Bayesian inference, will remain well defined.\n\nTo use an n-gram language model, we need to estimate the probability distribution over words given their history, i.e. the n preceding words. Assume we are given a vector of N words w that all share a common history, and want to predict the next word, w_{N+1}, that will occur with that history. Assume that we also have vectors of words from H other histories, w^{(1)}, \ldots, w^{(H)}. 
The interpolated Kneser-Ney smoother [11] makes the prediction\n\nP(w_{N+1} = w \mid w) = \frac{n_w^{(w)} - I(n_w^{(w)} > D)\,D}{N} + \frac{\sum_{w'} I(n_{w'}^{(w)} > D)\,D}{N} \cdot \frac{\sum_h I(w \in w^{(h)})}{\sum_{w'} \sum_h I(w' \in w^{(h)})}    (5)\n\nwhere we have suppressed the dependence on w^{(1)}, \ldots, w^{(H)}, D is a \u201cdiscount factor\u201d specified as a parameter of the model, and the sum over h includes w.\n\nWe can define a two-stage model appropriate for this setting by assuming that the sets of words for all histories are produced by the same adaptor and generator. Under this model, the probability of word w_{N+1} given w and \u03b8 is\n\nP(w_{N+1} = w \mid w, \theta) = \sum_z P(w_{N+1} = w \mid w, z, \theta) P(z \mid w, \theta)\n\nwhere P(w_{N+1} = w \mid w, z, \theta) is given by Equation 3. Assuming b = 0, this becomes\n\nP(w_{N+1} = w \mid w, \theta) = \frac{n_w^{(w)} - E_z[K_w(z)]\,a}{N} + \frac{\sum_{w'} E_z[K_{w'}(z)]\,a}{N} \theta_w    (6)\n\nwhere E_z[K_w(z)] = \sum_z K_w(z) P(z \mid w, \theta), and K_w(z) is the number of tables with label w under the seating assignment z. The other histories enter into this expression via \u03b8. Since the words associated with each history are assumed to be produced from a single set of parameters \u03b8, the maximum-likelihood estimate of \u03b8_w will approach\n\n\theta_w = \frac{\sum_h I(w \in w^{(h)})}{\sum_{w'} \sum_h I(w' \in w^{(h)})}\n\nas a approaches 0, since only a single instance of each word type in each context will contribute to the estimate of \u03b8. Substituting this value of \u03b8_w into Equation 6 reveals the correspondence to the Kneser-Ney smoother (Equation 5). The only difference is that the constant discount factor D is replaced by a\,E_z[K_w(z)], which will increase slowly as n_w increases. 
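Equation 5 can be transcribed directly (a sketch under our own variable names and toy data, not the authors' code): the first term is the discounted token frequency of w in the current history, and the second spreads the reserved discount mass according to type counts pooled across all histories.

```python
from collections import Counter

def interpolated_kneser_ney(w, history, all_histories, D=0.75):
    """Interpolated Kneser-Ney prediction (Equation 5): discounted token
    frequency in the current history, plus the reserved mass distributed by
    the number of histories in which each word type appears."""
    counts = Counter(history)
    N = len(history)
    n_w = counts[w]
    discounted = n_w - (D if n_w > D else 0.0)       # numerator of first term
    reserved = sum(D for c in counts.values() if c > D)
    type_num = sum(1 for h in all_histories if w in h)
    type_den = sum(len(set(h)) for h in all_histories)
    return discounted / N + (reserved / N) * (type_num / type_den)

# Toy data; the first vector plays the role of the current history w.
histories = [["the", "cat", "the"], ["the", "dog"]]
p = interpolated_kneser_ney("dog", histories[0], histories)
```

Note that the predictions sum to one over the union vocabulary: the discounting removes exactly the mass that the type-frequency term redistributes.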
This difference might actually lead to an improved smoother: the Kneser-Ney smoother seems to produce better performance when D increases as a function of n_w [11].\n\n4 Types and tokens in modeling morphology\n\nOur attempt to develop statistical models of language that generically produce power-law distributions was motivated by the possibility that models that account for this statistical regularity might be able to learn linguistic information better than those that do not. Our two-stage language modeling framework allows us to create exactly these sorts of models, with the generator producing individual lexical items, and the adaptor producing the power-law distribution over words. In this section, we show that taking a generative model for morphology as the generator and varying the parameters of the adaptor results in an improvement in unsupervised learning of the morphological structure of English.\n\n4.1 A generative model for morphology\n\nMany languages contain words built up of smaller units of meaning, or morphemes. These units can contain lexical information (as stems) or grammatical information (as affixes). For example, the English word walked can be parsed into the stem walk and the past-tense suffix ed. Knowledge of morphological structure enables language learners to understand and produce novel wordforms, and facilitates tasks such as stemming (e.g., [14]).\n\nAs a basic model of morphology, we assume that each word consists of a single stem and suffix, and belongs to some inflectional class. Each class is associated with a stem distribution and a suffix distribution. We assume that stems and suffixes are independent given the class, so we have\n\nP(\ell_k = w) = \sum_{c,t,f} I(w = t.f) P(c_k = c) P(t_k = t \mid c_k = c) P(f_k = f \mid c_k = c)    (7)\n\nwhere c_k, t_k, and f_k are the class, stem, and suffix associated with \ell_k, and t.f indicates the concatenation of t and f. 
In other words, we generate a label by \ufb01rst drawing a class,\nthen drawing a stem and a suf\ufb01x conditioned on the class. Each of these draws is from a\nmultinomial distribution, and we will assume that these multinomials are in turn generated\nfrom symmetric Dirichlet priors, with parameters \u03ba, \u03c4, and \u03c6 respectively. The resulting\ngenerative model can be used as the generator in a two-stage language model, providing a\nmore structured replacement for the multinomial distribution, \u03b8. As before, we will use the\nPitman-Yor process as an adaptor, setting b = 0. Figure 1 (b) illustrates the dependencies\nbetween the variables in this model.\n\nOur morphology model is similar to that used by Goldsmith in his unsupervised morpho-\nlogical learning system [8], with two important differences. First, Goldsmith\u2019s model is\nrecursive, i.e. a word stem can be further split into a smaller stem plus suf\ufb01x. Second,\nGoldsmith\u2019s model assumes that all occurrences of each word type have the same analysis,\nwhereas our model allows different tokens of the same type to have different analyses.\n\n4.2 Inference by Gibbs sampling\n\nOur goal in de\ufb01ning this morphology model is to be able to automatically infer the morpho-\nlogical structure of a language. This can be done using Gibbs sampling, a standard Markov\nchain Monte Carlo (MCMC) method [15]. In MCMC, variables in the model are repeatedly\nsampled, with each sample conditioned on the current values of all other variables in the\nmodel. 
This process defines a Markov chain whose stationary distribution is the posterior distribution over model variables given the input data.\n\nRather than sampling all the variables in our two-stage model simultaneously, our Gibbs sampler alternates between sampling the variables in the generator and those in the adaptor. Fixing the assignment of words to tables, we sample c_k, t_k, and f_k for each table from\n\nP(c_k = c, t_k = t, f_k = f \mid c_{-k}, t_{-k}, f_{-k}, \ell) \propto I(\ell_k = t.f) P(c_k = c \mid c_{-k}) P(t_k = t \mid t_{-k}, c) P(f_k = f \mid f_{-k}, c)\n= I(\ell_k = t.f) \cdot \frac{n_c + \kappa}{K(z) - 1 + \kappa C} \cdot \frac{n_{c,t} + \tau}{n_c + \tau T} \cdot \frac{n_{c,f} + \phi}{n_c + \phi F}    (8)\n\nwhere n_c is the number of other labels assigned to class c, n_{c,t} and n_{c,f} are the number of other labels in class c with stem t and suffix f, respectively, and C, T, and F are the total number of possible classes, stems, and suffixes, which are fixed. We use the notation c_{-k} here to indicate all members of c except for c_k. 
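The conditional in Equation 8 amounts to a categorical draw over (class, stem, suffix) triples. The sketch below is illustrative only: the count dictionaries, the `splits` helper, and all parameter values are our own inventions, not data structures from the paper.

```python
import random

def sample_analysis(label, splits, class_counts, stem_counts, suffix_counts,
                    K, kappa, tau, phi, C, T, F, rng):
    """Sample (class, stem, suffix) for one table label as in Equation 8:
    score every split of the label into stem.suffix under every class,
    then draw a triple proportionally to those scores."""
    options, weights = [], []
    for c in range(C):
        n_c = class_counts.get(c, 0)   # other labels in class c
        for (t, f) in splits(label):
            w = ((n_c + kappa) / (K - 1 + kappa * C)
                 * (stem_counts.get((c, t), 0) + tau) / (n_c + tau * T)
                 * (suffix_counts.get((c, f), 0) + phi) / (n_c + phi * F))
            options.append((c, t, f))
            weights.append(w)
    return rng.choices(options, weights=weights)[0]

def splits(word):
    # every cut into a non-empty stem and a (possibly empty) suffix
    return [(word[:i], word[i:]) for i in range(1, len(word) + 1)]

choice = sample_analysis("walked", splits, {}, {}, {}, K=1, kappa=0.5,
                         tau=0.5, phi=0.5, C=2, T=100, F=10,
                         rng=random.Random(0))
```

With all counts empty, the draw is governed purely by the Dirichlet hyperparameters; in the sampler proper, the counts come from the current analyses of the other table labels.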
Equation 8 is obtained by integrating over the multinomial distributions specified in Equation 7, exploiting the conjugacy between multinomial and Dirichlet distributions.\n\nFixing the morphological analysis (c, t, f), we sample the table z_i for each word token from\n\nP(z_i = k \mid z_{-i}, w, c, t, f) \propto \begin{cases} I(\ell_k = w_i)\,(n_k^{(z_{-i})} - a) & n_k^{(z_{-i})} > 0 \\ P(\ell_k = w_i)\,(K(z_{-i})a + b) & n_k^{(z_{-i})} = 0 \end{cases}    (9)\n\nwhere P(\ell_k = w_i) is found using Equation 7, with P(c), P(t), and P(f) replaced with the corresponding conditional distributions from Equation 8.\n\nFigure 2: (a) Results for the morphology model, varying a. (b) Confusion matrices for the morphology model with a = 0. The area of a square at location (i, j) is proportional to the number of word types (top) or tokens (bottom) with true suffix i and found suffix j. [The panels in (a) plot, for each value of a, the proportion of types and of tokens assigned each suffix (NULL, e, ed, d, ing, s, es, n, en, other), alongside the true distribution.]\n\n4.3 Experiments\n\nWe applied our model to a data set consisting of all the verbs in the training section of the Penn Wall Street Journal treebank (137,997 tokens belonging to 7,761 types). This simple test case using only a single part of speech makes our results easy to analyze. 
We\ndetermined the true suf\ufb01x of each word using simple heuristics based on the part-of-speech\ntag and spelling of the word.3 We then ran a Gibbs sampler using 6 classes, and compared\nthe results of our learning algorithm to the true suf\ufb01xes found in the corpus.\n\nAs noted above, the Gibbs sampler does not converge to a single analysis of the data, but\nrather to a distribution over analyses. For evaluation, we used a single sample taken after\n1000 iterations. Figure 2 (a) shows the distribution of suf\ufb01xes found by the model for\nvarious values of a, as well as the true distribution. We analyzed the results in two ways:\nby counting each suf\ufb01x once for each word type it was associated with, and by counting\nonce for each word token (thus giving more weight to the results for frequent words).\n\nThe most salient aspect of our results is that, regardless of whether we evaluate on types or\ntokens, it is clear that low values of a are far more effective for learning morphology than\nhigher values. With higher values of a, the system has too strong a preference for empty\nsuf\ufb01xes. This observation seems to support the linguists\u2019 view of type-based generalization.\n\nIt is also worth explaining why our morphological learner \ufb01nds so many e and es suf\ufb01xes.\nThis problem is common to other morphological learning systems with similar models (e.g.\n[8]) and is due to the spelling rule in English that deletes stem-\ufb01nal e before certain suf\ufb01xes.\nSince the system has no knowledge of spelling rules, it tends to hypothesize analyses such\nas {stat.e, stat.ing, stat.ed, stat.es}, where the e and es suf\ufb01xes take the place of NULL\nand s. This effect can be seen clearly in the confusion matrices shown in Figure 2 (b). The\nremaining errors seen in the confusion matrices are those where the system hypothesized an\nempty suf\ufb01x when in fact a non-empty suf\ufb01x was present. 
Analysis of our results showed that these cases were mostly words where no other form with the same stem was present in the corpus. There was therefore no reason for the system to prefer a non-empty suffix.\n\n^3 The part-of-speech tags distinguish between past tense, past participle, progressive, 3rd person present singular, and infinitive/unmarked verbs, and therefore roughly correlate with actual suffixes.\n\n5 Conclusion\n\nWe have shown that statistical language models that exhibit one of the most striking properties of natural languages \u2013 power-law distributions \u2013 can be defined by breaking the process of generating words into two stages, with a generator producing a set of words, and an adaptor determining their frequencies. Our morphology model and the Pitman-Yor process are particular choices for a generator and an adaptor. These choices produce empirical and theoretical results that justify the role of word types in formal analyses of natural language. However, the greatest strength of this framework lies in its generality: we anticipate that other choices of generators and adaptors will yield similarly interesting results.\n\nReferences\n\n[1] G. Zipf. Selective Studies and the Principle of Relative Frequency in Language. Harvard University Press, Cambridge, MA, 1932.\n\n[2] M. Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1(2):226\u2013251, 2003.\n\n[3] H. A. Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425\u2013440, 1955.\n\n[4] J. Pitman. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102:145\u2013158, 1995.\n\n[5] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855\u2013900, 1997.\n\n[6] H. Ishwaran and L. F. James. 
Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13:1211\u20131235, 2003.\n\n[7] J. B. Pierrehumbert. Probabilistic phonology: discrimination and robustness. In R. Bod, J. Hay, and S. Jannedy, editors, Probabilistic Linguistics. MIT Press, Cambridge, MA, 2003.\n\n[8] J. Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27:153\u2013198, 2001.\n\n[9] H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer, Speech, and Language, 8:1\u201338, 1994.\n\n[10] R. Kneser and H. Ney. Improved backing-off for n-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1995.\n\n[11] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University, 1998.\n\n[12] D. Aldous. Exchangeability and related topics. In \u00c9cole d\u2019\u00e9t\u00e9 de probabilit\u00e9s de Saint-Flour, XIII\u20141983, pages 1\u2013198. Springer, Berlin, 1985.\n\n[13] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249\u2013265, 2000.\n\n[14] L. Larkey, L. Ballesteros, and M. Connell. Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of the 25th International Conference on Research and Development in Information Retrieval (SIGIR), 2002.\n\n[15] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. 
Chapman and Hall, Suffolk, 1996.\n\n\f", "award": [], "sourceid": 2941, "authors": [{"given_name": "Sharon", "family_name": "Goldwater", "institution": null}, {"given_name": "Mark", "family_name": "Johnson", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}]}