{"title": "A Bayesian LDA-based model for semi-supervised part-of-speech tagging", "book": "Advances in Neural Information Processing Systems", "page_first": 1521, "page_last": 1528, "abstract": null, "full_text": "A Bayesian LDA-based model for semi-supervised\n\npart-of-speech tagging\n\nKristina Toutanova\nMicrosoft Research\n\nRedmond, WA\n\nkristout@microsoft.com\n\nMark Johnson\nBrown University\n\nProvidence, RI\n\nMark Johnson@brown.edu\n\nAbstract\n\nWe present a novel Bayesian model for semi-supervised part-of-speech tagging.\nOur model extends the Latent Dirichlet Allocation model and incorporates the\nintuition that words\u2019 distributions over tags, p(t|w), are sparse. In addition we in-\ntroduce a model for determining the set of possible tags of a word which captures\nimportant dependencies in the ambiguity classes of words. Our model outper-\nforms the best previously proposed model for this task on a standard dataset.\n\n1 Introduction\n\nPart-of-speech tagging is a basic problem in natural language processing and a building block for\nmany components. Even though supervised part-of-speech taggers have reached performance of\nover 97% on in-domain data [1, 2], the performance on unknown in-domain words is below 90%\nand the performance on unknown out-of-domain words can be below 70% [3]. Additionally, few\nlanguages have a large amount of data labeled for part-of-speech. Thus it is important to develop\nmethods that can use unlabeled data to learn part-of-speech. Research on unsupervised or partially\nsupervised part-of-speech tagging has a long history [4, 5]. Recent work includes [6, 7, 8, 9, 10].\n\nAs in most previous work on partially supervised part-of-speech tagging, our model takes as input\na (possibly incomplete) tagging dictionary, specifying, for some words, all of their possible parts\nof speech, as well as a corpus of unlabeled text. 
Our model departs from recent work on semi-supervised\npart-of-speech induction using sequence HMM-based models, and uses solely observed\ncontext features to predict the tags of words. We show that using this representation of context gives\nour model a substantial advantage over standard HMM-based models.\n\nThere are two main innovations of our approach. The first is that we incorporate a sparse prior on the\ndistribution over tags for each word, p(t|w), and employ a Bayesian approach that maintains a\ndistribution over parameters, rather than committing to a single parameter value. Previous approaches\nto part-of-speech tagging ([9, 10]) also use sparse priors and Bayesian inference, but do not\nincorporate sparse priors directly on the p(t|w) distribution. Our results demonstrate that encoding this\nsparse prior and employing a Bayesian approach contribute significantly to performance.\n\nThe second innovation of our approach is that we explicitly model the ambiguity class (the set of\npart-of-speech tags a word type can appear with). We show that this also results in a substantial performance\nimprovement. Our model outperforms the best-performing previously proposed model for this task\n[7], with an error reduction of up to 57% when the amount of supervision is small.\n\nMore formally, the task setting is as follows. Assume we are given a finite set of possible\npart-of-speech tags (labels) T = {t_1, t_2, . . . , t_{n_T}}. The set of part-of-speech tags for English we experiment\nwith has the 17 tags defined by Smith & Eisner [7], and is a coarse-grained version of the 45-tag set\nin the English Penn Treebank. We are also given a dictionary which specifies the ambiguity classes\ns \u2286 T for a subset of the word types w. The ambiguity class of a word type is the set of all of its\npossible tags. For example, the dictionary might specify that walks has the ambiguity class {N, V},\nwhich means that walks can never have a tag which is not an N or a V. Additionally, we are given a\nlarge amount of unlabeled natural language text. The task is to label each word token with its correct\npart-of-speech tag in the corresponding context.\n\n\\begin{align*}\ns_i \\mid \\xi &\\sim \\mathrm{MULTI}(\\xi), & i &= 1, \\ldots, L \\\\\nu_i \\mid s_i &\\sim \\mathrm{UNIFORM}(s_i), & i &= 1, \\ldots, L \\\\\nm_{j,i} \\mid u_i, \\psi_j &\\sim \\mathrm{MULTI}(\\psi_{j,u_i}), & i &= 1, \\ldots, L,\\; j = 1, \\ldots, 4 \\\\\nw_i \\mid m_i, \\omega &\\sim \\mathrm{MULTI}(\\omega_{m_i}), & i &= 1, \\ldots, L \\\\\n\\beta_i \\mid \\alpha, s_i &= \\mathrm{SUBSET}(\\alpha, s_i), & i &= 1, \\ldots, L \\\\\n\\theta_i \\mid \\beta_i &\\sim \\mathrm{DIR}(\\beta_i), & i &= 1, \\ldots, L \\\\\nt_{i,j} \\mid \\theta_i &\\sim \\mathrm{MULTI}(\\theta_i), & i &= 1, \\ldots, L,\\; j = 1, \\ldots, W_i \\\\\n\\phi_{k,\\ell} \\mid \\gamma &\\sim \\mathrm{DIR}(\\gamma), & k &= 1, \\ldots, 4,\\; \\ell = 1, \\ldots, T \\\\\nc_{k,i,j} \\mid t_{i,j}, \\phi_k &\\sim \\mathrm{MULTI}(\\phi_{k,t_{i,j}}), & i &= 1, \\ldots, L,\\; j = 1, \\ldots, W_i,\\; k = 1, \\ldots, 4\n\\end{align*}\n\nFigure 1: A graphical model for the tagging model. In this model, each word type w is associated\nwith a set s of possible parts-of-speech (ambiguity class), and each of its tokens is associated with\na part-of-speech tag t, which generates the context words c surrounding that token. The ambiguity\nclass s also generates the morphological features m of the word type w via a hidden tag u \u2208 s.\nThe dotted line divides the model into the ambiguity class model (on the left) and the word context\nmodel (on the right).\n\nThis task formulation corresponds to a problem in computational linguistics that frequently arises in\npractice, because the only available resources for many languages consist of a manually constructed\ndictionary and a text corpus. 
Note that it differs from the standard semi-supervised learning setting,\nwhere we are given a small amount of labeled data and a large amount of unlabeled data. In the\nsetting we study, we are never given labeled data, but are instead given constraints on the possible tags\nof some words (in the form of a dictionary).1\n\n1For some words, the dictionary specifies only one possible tag, e.g. information \u2192 {N}, in which case all\ninstances of information can be assumed labeled with the tag N. However, these constraints are not sufficient to\nresult in fully labeled sentences.\n\n2 Graphical model\n\nOur model is shown in Figure 1. In the figure, T is the set of part-of-speech tags, L is the set of word\ntypes (i.e., the set of different orthographic forms), W is the set of tokens (i.e., occurrences) of the\nword type w, and M^4 is the set of four-element morphological feature vectors described below.\nThis is a generative model for a sequence of word tokens in a text corpus along with part-of-speech\ntags for all tokens, ambiguity classes for word types and other hidden variables. To generate the\ntext corpus, the model generates the instances of every word type together with their contexts in\nturn. The generation of a word type and all of its occurrences can be decomposed into two steps,\ncorresponding to the left and right parts of the model: the ambiguity class model, and the word\ncontext model (separated by a dotted line in the figure).\nFor every word type wi \u2208 L (plate L in the figure), in the first step the model generates an ambiguity\nclass si \u2286 T of possible parts of speech. The ambiguity class si is the set of parts-of-speech that\ntokens of wi can be labeled with. Our dictionary specifies si for some but not all word types wi.\nThe ambiguity class si is generated by a multinomial over 2^T with parameters \u03be, with support on\nthe different values for s observed in the dictionary. 
The ambiguity class si for wi generates four\ndifferent morphological features m1,i, . . . , m4,i of wi representing the suf\ufb01xes, capitalization, etc.,\nof the orthographic form of wi. These are generated by multinomials with parameters \u03c81,u, . . . , \u03c84,u\nrespectively, where u \u2208 s is a hidden variable generated by a uniform distribution over the members\nof s. For completeness we generate the full surface form of the word type wi from a multinomial\ndistribution selected by its morphology features m1,i, . . . , m4,i. But since the morphology features\nare always observed (they are determined by wi\u2019s orthographic form), we ignore this part of the\nmodel. We discuss the ambiguity class model in detail in Section 3.1.\nIn the second step the word context model generates all instances wi,j of wi together with their\npart-of-speech tags ti,j and context words (plate W in the \ufb01gure). This is done by \ufb01rst choosing a\nmultinomial distribution \u03b8i over the tags in the set si, which is drawn from a Dirichlet with param-\neters \u03b2i and support si, where \u03b2i,t = \u03b1t for t \u2208 s. That is, si identi\ufb01es the subset of T to receive\nsupport in \u03b2i, but the value of \u03b2i,t for t \u2208 si is speci\ufb01ed by \u03b1t. Given these variables, all tokens\nwi,j of the word wi together with their contexts are generated by \ufb01rst choosing a part-of-speech tag\nti,j from \u03b8i and then choosing context words ck,i,j preceding and following the word token wi,j\naccording to tag-speci\ufb01c (depending on ti,j) multinomial distributions. The context of a word to-\nken c1,i,j . . . , c4,i,j consists of the two preceding and two following words. For example, for the\nsentence He often walks to school, the context words of that instance of walks are c1=He, c2=often,\nc3=to, and c4=school. This representation of the context has been used previously by unsupervised\nmodels for part-of-speech tagging in different ways [4, 8]. 
Each context word ck,i,j is generated\nby a multinomial with parameters \u03d5k,ti,j , where each \u03d5k,t is in turn generated by a Dirichlet with\nparameters \u03b3. The parameters \u03d5k,t are generated once for the whole corpus as indicated in the \ufb01gure.\nA sparse Dirichlet prior on \u03b8i with parameter \u03b1 < 1 allows us to exploit the fact that most words\nhave a very frequent predominant tag, and their distribution over tags p(t|w) is sparse. To verify\nthis, we examined the distribution of the 17-label tag set in the WSJ Penn Treebank. A classi\ufb01er\nthat always chooses the most frequent tag for every word type, without looking at context, is 90.9%\naccurate on ambiguous words, indicating that the distribution is heavily skewed.\n\nOur model builds upon the Latent Dirichlet Allocation (LDA) model [11] by extending it in several\nways.\nIf we assume that the only possible ambiguity class s for all words is the set of all tags\n(and thus remove the ambiguity class model because it becomes irrelevant), and if we simplify our\nword context model to generate only one context word (say the word in position \u22121), we would\nend up with the LDA model. In this simpli\ufb01ed model, we could say that for every word type wi\nwe have a document consisting of all word tokens that occur in position \u22121 of the word type wi\nin the corpus. Each context word ci,j in wi\u2019s document is generated by \ufb01rst choosing a tag (topic)\nfrom a word (document) speci\ufb01c distribution \u03b8i and then generating the word ci,j from a tag (topic)\nspeci\ufb01c multinomial. The LDA model incorporates the same kind of Dirichlet priors on \u03b8 and \u03d5 that\nour model uses. 
The additional power of our model stems from the model of ambiguity classes si\nwhich can take advantage of the information provided by the dictionary, and from the incorporation\nof multiple context features.\n\nFinally, we note that our model is de\ufb01cient, because the same word token in the corpus is indepen-\ndently generated multiple times (e.g., each token will appear in the context of four other words and\nwill be generated four times). Even though this is a theoretical drawback of the model, it remains\nto be seen whether correcting for this de\ufb01ciency (e.g., by re-normalization) would improve tagging\nperformance. Models with similar de\ufb01ciencies have been successful in other applications (e.g. the\nmodel described in [12], which achieved substantial improvements over the previous state-of-the-art\nin unsupervised parsing).\n\n3\n\n\f3 Parameter estimation and tag prediction\n\nHere we discuss our method of estimating the parameters of our model and making predictions,\ngiven an (incomplete) tagging dictionary and a set of natural language sentences.\n\nWe train the parameters of the ambiguity class model, \u03be, \u03c8, and \u03c9, separately from the parameters of\nthe word context model: \u03b1,\u03b8,\u03b3, and \u03d5. This is because the two parts of the model are connected only\nvia the variables si (the ambiguity classes of words), and when these ambiguity classes are given the\ntwo sets of parameters are completely decoupled. The dictionary gives us labeled training examples\nfor the ambiguity class model, and we train the parameters of the ambiguity class model only from\nthis data (i.e., the word types in the dictionary). After training the ambiguity class model from the\ndictionary we \ufb01x its parameters and estimate the word context model given these parameters.\n\n3.1 Ambiguity class model: details and parameter estimation\n\nOur ambiguity class model captures the strong regularities governing the possible tags of a word\ntype. 
Empirically we observe that the number of occurring ambiguity classes is very small relative\nto the number of possible ambiguity classes. For example, in the WSJ Penn Treebank data, the\n49, 206 word types belong to 118 ambiguity classes. Modeling these (rather than POS tags directly)\nconstrains the model to avoid assignments of tags to word tokens which would result in improbable\nambiguity classes for word types. A related intuition has been used in other contexts before, e.g.\n[13, 14], but without directly modeling ambiguity classes. The ambiguity class model contributes\nto biasing p(t|w) toward sparse distributions as well, because most ambiguity classes have very\nfew elements. For example, the top ten most frequent ambiguity classes in the complete dictionary\nconsist of one or two elements.\n\nThe ambiguity class of a word type can be predicted from its surface morphological features. For\nexample the suf\ufb01x -s of walks indicates that an ambiguity class of {N, V } is likely for this word.\nThe four morphological features which we used for the ambiguity class model were: a binary feature\nindicating whether the word is capitalized, a binary feature indicating whether the word contains a\nhyphen, a binary feature indicating whether the word contains a digit character, and a nominal\nfeature indicating the suf\ufb01x of a word. We de\ufb01ne the suf\ufb01x of a word to be the longest character\nsuf\ufb01x (up to three letters) which occurs as a suf\ufb01x of suf\ufb01ciently many word types.2\nWe train the ambiguity class model on the set of word types present in the dictionary. We set the\nmultinomial parameters \u03c8k,l and \u03be to maximize the joint likelihood of these word types and their\nmorphological features. 
Maximum likelihood estimation for \u03c8 is complicated by the hidden variable\nui, which selects a tag from the ambiguity class with uniform distribution, so the joint probability sums over u:\n\\[ P(s, m_1, m_2, m_3, m_4 \\mid \\psi, \\xi) = P(s \\mid \\xi) \\sum_{u \\in s} P(u \\mid s) \\prod_{j=1}^{4} P(m_j \\mid \\psi_{j,u}) \\]\nWe fix P(u|s) = 1/|s|, the uniform distribution over the tags in s. We estimate the \u03be\nparameters using maximum likelihood estimation with add-1 (Laplace) smoothing and we train the\n\u03c8 parameters using EM (with add-1 smoothing in the M-step).\n\n2A suffix occurs with sufficiently many word types if its type-frequency rank is below 100.\n\n3.2 Parameter estimation for the word context model and prediction given complete dictionary\n\nWe restrict our attention at first to the setting where a complete tagging dictionary is given. The\nincomplete dictionary generalization is discussed in Section 3.3. When every word is in the\ndictionary, the ambiguity class si for each word type wi is specified by the tagging dictionary, and the\nambiguity class model becomes irrelevant. The relevant parameters of the model in this setting are\n\u03b1, \u03b8, \u03b3, and \u03d5. The contexts of word instances ck,i,j and the ambiguity classes si are observed.\nWe integrate over all hidden variables except the uniform Dirichlet parameters \u03b1 and \u03b3. We set\n\u03b3 = 1 and we use Empirical Bayes to estimate \u03b1 by maximizing the likelihood of the observed data\ngiven \u03b1 and the ambiguity classes si. Note that if the ambiguity classes si and \u03b1 are given, \u03b2i is\nfixed. Below we use c to denote the vector of all contexts of all word instances, and s the vector of\nambiguity classes for all word types. We use \u03d5 to denote the vector of all multinomials \u03d5k,l, \u03b8 to\ndenote the vector of all \u03b8i and t to denote the vector of all tag sequences ti for word types wi. 
The\nlikelihood we would like to maximize is:\n\\[ L(c \\mid s, \\alpha, \\gamma) = \\int P(\\phi \\mid \\gamma) \\prod_{i=1}^{L} \\int P(\\theta_i \\mid \\beta_i) \\prod_{j=1}^{W_i} \\sum_{l=1}^{T} \\Big( \\theta_{i,l} \\prod_{k=1}^{4} P(c_{k,i,j} \\mid \\phi_{k,l}) \\Big) \\, d\\theta_i \\, d\\phi \\]\nwhere\n\\[ P(\\phi \\mid \\gamma) = \\prod_{k=1}^{4} \\prod_{l=1}^{T} \\mathrm{DIR}(\\phi_{k,l} \\mid \\gamma) \\]\nSince exact inference is intractable, we use a variational approximation to the posterior distribution\nof the hidden variables given the data and maximize, instead of the exact log-likelihood, a lower\nbound given by the variational approximation. This variational approximation is also used for\nfinding the most likely assignment of the part-of-speech tag variables to instances of words.\n\nMore specifically, the variational approximation has a form analogous to the approximation used for\nthe LDA model [11]. It depends on variational parameters \u03bbk,l, \u03b7i, and \u03c5i,j:\n\\[ Q(\\phi, \\theta, t \\mid \\lambda, \\eta, \\upsilon) = \\prod_{k=1}^{4} \\prod_{l=1}^{T} \\mathrm{DIR}(\\phi_{k,l} \\mid \\lambda_{k,l}) \\prod_{i=1}^{L} \\Big[ \\mathrm{DIR}(\\theta_i \\mid \\eta_i) \\prod_{j=1}^{W_i} P(t_{i,j} \\mid \\upsilon_{i,j}) \\Big] \\]\nThis distribution is an approximation to the posterior distribution of the hidden variables,\nP(\u03d5, \u03b8, t | c, s, \u03b1, \u03b3). As we can see, according to the Q distribution, the variables \u03d5, \u03b8, and t are\nindependent. Each \u03d5k,l is distributed according to a Dirichlet distribution with variational\nparameters \u03bbk,l, each \u03b8i is also Dirichlet with parameters \u03b7i and each tag ti,j is distributed according to a\nmultinomial \u03c5i,j. We obtain the variational parameters by maximizing the following lower bound\non the log-likelihood of the data (the dependence of Q on the variational parameters is not shown\nbelow for simplicity):\n\\[ E_Q[\\log P(\\phi, \\theta, t, c \\mid s, \\alpha, \\gamma)] - E_Q[\\log Q(\\phi, \\theta, t)] \\]\nWe use an iterative maximization algorithm for finding the values of the variational parameters. We\ndo not describe it here due to space limitations, but it is analogous to the one used in [11]. 
Given\nfixed variational parameters \u03bbk,l we maximize with respect to the variational parameters \u03b7i and\n\u03c5i,j corresponding to word types and their instances. Then, keeping the latter parameters fixed, we\nmaximize with respect to \u03bbk,l. We repeat until the change in the variational bound falls below a\nthreshold. On our dataset, about 100 iterations of the outer loop for maximizing with respect to \u03bbk,l\nwere necessary. Given a variational distribution Q we can maximize the lower bound on the\nlog-likelihood with respect to \u03b1. Since \u03b1 is determined by a single real-valued parameter, we maximized\nwith respect to \u03b1 using a simple grid search.\nFor predicting the tags ti,j of word tokens we use the same approximate posterior distribution Q.\nAccording to Q, all tags ti,j are independent given the variational parameters,\n\\[ Q(t_i \\mid \\upsilon_i) = \\prod_{j=1}^{W_i} P(t_{i,j} \\mid \\upsilon_{i,j}) \\]\nso finding the most likely assignment is straightforward.\n\n3.3 Parameter estimation for the word context model and prediction with incomplete dictionary\n\nSo far we have described the training of the parameters of the word context model in the setting\nwhere for all words, the ambiguity classes si are known and these variables are observed. When\nthe ambiguity classes si are unknown for some words in the dataset, they become additional hidden\nvariables, and the hidden variables in the word context model become dependent on the\nmorphological features mi and the parameters of the ambiguity class model. Denote the vector of ambiguity\nclasses for the known (in the dictionary) word types by sd and the ambiguity classes for the\nunknown word types by su. 
The posterior distribution over the hidden variables of interest given the\nobservations becomes P(\u03d5, \u03b8, t, su | sd, mu, c, \u03b1, \u03b3), where mu are the morphological features of\nthe unknown word types.\n\nTo perform inference in this setting we extend the variational approximation to account for the\nadditional hidden variables. Before, we had, for every word type, a variational distribution over the\nhidden variables corresponding to that word type:\n\\[ Q(\\theta_i, t_i \\mid \\eta_i, \\upsilon_{i,j}) = \\mathrm{DIR}(\\theta_i \\mid \\eta_i) \\prod_{j=1}^{W_i} P(t_{i,j} \\mid \\upsilon_{i,j}) \\]\nWe now introduce a variational distribution that includes the new hidden variables si for unknown words:\n\\[ Q(\\theta_i, t_i, s_i \\mid m_i, \\eta_{i,s}, \\upsilon_{i,j,s}) = P(s_i \\mid m_i) \\, \\mathrm{DIR}(\\theta_i \\mid \\eta_{i,s_i}) \\prod_{j=1}^{W_i} P(t_{i,j} \\mid \\upsilon_{i,j,s_i}) \\]\nThat is, for each possible ambiguity class si of an unknown word wi we introduce variational\nparameters specific to that ambiguity class. Instead of single variational parameters \u03b7i and \u03c5i,j for a\nword with known si, we now have variational parameters {\u03b7i,s} and {\u03c5i,j,s} for all possible values s\nof si. For simplicity, we use the probability P(si|mi) = P(si|mi, \u03be, \u03c8) from the morphology-based\nambiguity class model in the approximating distribution rather than introducing new variational\nparameters and learning this distribution.3 We adapt the algorithm to estimate the variational\nparameters. 
The derivation is slightly complicated by the fact that si and \u03b8i are not independent according\nto Q (this makes sense because si determines the dimensionality of \u03b8i), but the derived iterative\nalgorithm is essentially the same as for our basic model, if we imagine that an unknown word type\nwi occurs with each of its possible ambiguity classes si a fractional number of times, p(si|mi).\nFor predicting tag assignments for words according to this extended model, we use the same\nalgorithm as described in Section 3.2 for word types whose ambiguity classes si are known. For words\nwith unknown ambiguity classes, we need to maximize over ambiguity classes as well as tag\nassignments. We use the following algorithm to obtain a slightly better approximation than the one given\nby the variational distribution Q. For each possible tag set si, we find the most likely assignment\nof tags t\u2217(si) given that ambiguity class, using the variational distribution as in the case of known\nambiguity classes. We then choose an ambiguity class and an assignment of tags according to:\n\\[ s^* = \\arg\\max_{s_i} P(s_i \\mid m_i, \\psi, \\xi) \\, P(t^*(s_i), c_i \\mid s_i, D, \\alpha, \\gamma), \\qquad t = t^*(s^*) \\]\nWe compute P(t\u2217(si), ci | si, D, \u03b1, \u03b3) by integrating with respect to the word context distributions\n\u03d5, whose approximate posterior given the data is Dirichlet with parameters \u03bbk,l, and by integrating\nwith respect to \u03b8i, which are Dirichlet with parameters \u03b1 and dimensionality given by si.\n\n4 Experimental Evaluation\n\nWe evaluate the performance of our model in comparison with other related models. We train and\nevaluate the model in three different settings. 
In the \ufb01rst setting, a complete tagging dictionary is\navailable, and in the other two settings the coverage of the dictionary is greatly reduced.\n\nThe tagging dictionary was constructed by collecting for each word type, the set of parts-of-speech\nwith which it occurs in the annotated WSJ Penn Treebank, including the test set. This method of\nconstructing a tag dictionary is arguably unrealistic but has been used in previous research [7, 9, 6]\nand provides a reproducible framework for comparing different models. In the complete dictionary\nsetting, we use the ambiguity class information for all words, and in the second and third setting we\nremove from the dictionary all word types that have occurred with frequency less than 2 and less\nthan 3, respectively, in the test set of 1,005 sentences. The complete tagging dictionary contains\nentries for 49, 206 words. The dictionary obtained with cutoff of 2 contains 2,141 words, and the\none with cutoff of 3 contains 1,249 words. We train the model on the whole (unlabeled) WSJ Penn\nTreebank, consisting of 49,208 sentences. We evaluate performance on a set of 1,005 sentences,\nwhich is a subset of the training data and is the same test set used by [7, 9].\n\nTo see how much removing information from the dictionary impacts the hardness of the problem we\ncan look at the accuracy of a classi\ufb01er choosing a tag at random from the possible tags of words,\nshown in the column Random of Table 1. Results for the three settings are shown in the three rows\nof Table 1. In addition to the Random baseline, we include the results of a frequency baseline, Freq,\nin which for each word, we choose the most frequent tag from its set of possible tags.4 This baseline\nuses the same amount of partial supervision as our models. 
If labeled corpus data were available, a\nmodel which assigns the most frequent tag to each word by using p\u0302(t|w) would do much better.\n\n3We also limit the number of possible ambiguity classes per word to the three most likely ones and\nre-normalize the probability mass among them.\n\n4Frequency of tags is unigram frequency of tags p\u0302(t) by token in the unlabeled data. Since the tokens in the\ncorpus are not actually labeled we compute the frequency by giving fractional counts to each possible tag of\nwords in the dictionary. Only the words present in the dictionary were used for computing p\u0302(t).\n\nDictionary coverage | LDA | LDA+AC | PLSA | PLSA+AC | CE+spelling (S&E) | Bayesian HMM (G&G) | ML HMM (G&G) | Random | Freq\ncomplete | 93.4 | 93.4 | 89.7 | 89.7 | 88.7 (91.9) | 87.3 | 83.2 | 69.5 | 64.8\ncount \u2265 2 | 87.4 | 91.2 | 83.4 | 87.8 | 79.5 (90.3) | 79.6 | 70.6 | 56.6 | 64.8\ncount \u2265 3 | 85.0 | 89.7 | 80.2 | 85.9 | 78.4 (89.5) | 71.0 | 65.5 | 51.0 | 62.9\n\nTable 1: Results from minimally supervised POS-tagging models. Entries are accuracies (%); for CE+spelling, oracle accuracy is given in parentheses.\n\nThe models in the table are:\nLDA is the model proposed in this paper, excluding the ambiguity class model. The ambiguity class\nmodel is irrelevant when a complete dictionary is available because all si are observed. In the other\ntwo settings for the LDA model we assume that si is the complete ambiguity class (all 17 tags)\nfor words which are not in the dictionary and do not attempt to predict a more specific ambiguity\nclass. The estimated parameter \u03b1 for the tag prior was 0.5 for the complete dictionary setting, and\n0.2 for the other two settings, encouraging sparse distributions. For this model we estimate the\nvariational parameters \u03bbk,l and the Dirichlet parameter \u03b1 to maximize the variational bound on the\nlog-likelihood of the word types which are in the dictionary only. 
We found that including unknown\nword types was detrimental to performance.\nLDA+AC is our full model including the model of ambiguity classes of words given their mor-\nphological features. As mentioned above, this augmented model differs from LDA only when the\ndictionary is incomplete. We trained this model on all word types as discussed in Section 3.3. The\nestimated \u03b1 parameters for this model in the three dictionary settings were 0.5, 0.1, and 0.1, respec-\ntively.\nPLSA is the model analogous to LDA, which has the same structure as our word context model, but\nexcludes the Bayesian components. We include this model in the comparison in order to evaluate\nthe effect on performance of the sparse prior and the integration over model parameters. This model\nis similar to the PLSA model for text documents [15]. The PLSA model does not have a prior on\nthe word-speci\ufb01c distributions over tags \u03b8i = p(t|wi) and it does not have a prior distribution on\nthe topic-speci\ufb01c multinomials for context words \u03d5k,l. For this model we \ufb01nd maximum likelihood\nestimates for these parameters by applying an EM algorithm. We do add-1 smoothing for \u03d5k,l in the\nM step, because even though this is not theoretically justi\ufb01ed for this mixture model, it is frequently\nused in practice and helps prevent probabilities of zero for possible events. PLSA does not include\nthe ambiguity class model for si and as in the LDA model, word types not in the dictionary were\nassumed to have ambiguity classes containing all 17 tags. PLSA+AC extends the PLSA model by\nthe inclusion of the ambiguity class model.\nCE+spelling (S&E) is the sequence model for semi-supervised part-of-speech tagging proposed in\n[7], based on an HMM-structured model estimated using contrastive estimation. This is the state-\nof-the-art model for semi-supervised tagging using an incomplete dictionary. 
In the table we show\nactual performance and oracle performance for this model (oracle performance is in parentheses). The\noracle is obtained by testing models with different values of a smoothing hyper-parameter on the\ntest set and choosing the model with the best accuracy. Even though there is only one real-valued\nhyper-parameter, the accuracies of models using different values can vary by nearly ten accuracy\npoints, and it is thus fairer to compare our results to the non-oracle result until a better criterion\nfor setting the hyper-parameters using only the partial supervision is found. The results shown in\nthe table are for a model which incorporates morphological features.\nBayesian HMM (G&G) is a fully Bayesian HMM model for semi-supervised part-of-speech\ntagging proposed in [9], which incorporates sparse Dirichlet priors on p(w|t) of word tokens given\npart-of-speech tags and p(ti|ti\u22121, ti\u22122) of transition probabilities in the HMM. We include this model\nin the comparison because, like our LDA model, it uses sparse priors and Bayesian inference, but\nwith a different model structure.\n[9] showed that this model significantly outperforms a\nnon-Bayesian HMM model, whose results we show as well.\nML HMM (G&G) is the maximum likelihood version of a trigram HMM for semi-supervised\npart-of-speech tagging. Results for this model have been reported by other researchers as well [7, 6]. We\nuse the performance numbers reported in [9] because they have used the same data sets for testing.\n\nThe last two models do not use spelling (morphological) features. 
We should note that even though\nthe same amount of supervision in the form of a tagging dictionary is used by all compared models,\nthe HMM and CE models whose results are shown in Table 1 have been trained on less\nunlabeled natural language text: they have been trained using only the test set of 1,005 sentences.\nHowever, there is no reason one should limit the amount of unlabeled data used and, in addition,\nother results reported in [7] and [9] show that accuracy does not seem to improve when more\nunlabeled data are used with these models.\n\nThere are several points to note about the experimental results. First, the fact that PLSA substantially\noutperforms ML HMM (and even the Bayesian HMM) models shows that predicting the tags of\nwords from a window of neighboring word tokens and modeling the P(t|w) distribution directly\nresults in an advantage over HMMs with maximum likelihood or Bayesian estimation. This is\nconsistent with the success of other models that used word context for part-of-speech prediction in\ndifferent ways [4, 8]. Second, the Bayesian and sparse-prior components of our model do indeed\ncontribute substantially to performance, as illustrated by the performance of LDA compared to that\nof PLSA. LDA achieves an error reduction of up to 36% over PLSA. Third, our ambiguity class\nmodel results in a significant improvement as well; LDA+AC reduces the error of LDA by up to\n31%. PLSA+AC similarly reduces the error of PLSA. Finally, our complete model outperforms the\nstate-of-the-art model CE+spelling. It reduces the error of the non-oracle models by up to 57% and\nalso outperforms the oracle models.\n\nWe compared the performance of our model to that of state-of-the-art models applied in the same\nsetting. It will also be interesting to compare our model to the one proposed in [8], which was\napplied in a different partial supervision setting. 
In their setting a small set of example word types\n(which they call prototypes) is provided for each possible tag (only three prototypes per tag were\nspecified). Their model achieved an accuracy of 82.2% on a similar dataset. We cannot directly\ncompare the performance of our model to theirs, because our model would need prototypes for\nevery ambiguity class rather than for every tag. In future work we will explore whether a very small\nset of prototypical ambiguity classes and corresponding word types can achieve the performance\nwe obtained with an incomplete tagging dictionary. Another interesting direction for future work\nis applying our model to other NLP disambiguation tasks, such as named entity recognition and\ninduction of deeper syntactic or semantic structure, which could benefit from both our ambiguity\nclass model and our word context model.\n\nReferences\n\n[1] Kristina Toutanova, Dan Klein, and Christopher D. Manning. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL 03, 2003.\n[2] Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, 2002.\n[3] John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In EMNLP, 2006.\n[4] Hinrich Sch\u00fctze. Distributional part-of-speech tagging. In EACL, 1995.\n[5] Bernard Merialdo. Tagging English text with a probabilistic model. In ICASSP, 1991.\n[6] Michele Banko and Robert C. Moore. Part of Speech tagging in context. In COLING, 2004.\n[7] Noah A. Smith and Jason Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In ACL, 2005.\n[8] Aria Haghighi and Dan Klein. Prototype-driven learning for sequence models. In HLT-NAACL, 2006.\n[9] Sharon Goldwater and Thomas L. Griffiths. A fully Bayesian approach to unsupervised Part-of-Speech tagging. 
In ACL, 2007.\n[10] Mark Johnson. Why doesn\u2019t EM find good HMM POS-taggers? In EMNLP, 2007.\n[11] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n[12] Dan Klein and Christopher D. Manning. Natural language grammar induction using a constituent-context model. In NIPS 14, 2002.\n[13] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, 2005.\n[14] Tetsuji Nakagawa and Yuji Matsumoto. Guessing parts-of-speech of unknown words using global information. In ACL, 2006.\n[15] Thomas Hofmann. Probabilistic latent semantic analysis. In UAI, 1999.", "award": [], "sourceid": 964, "authors": [{"given_name": "Kristina", "family_name": "Toutanova", "institution": null}, {"given_name": "Mark", "family_name": "Johnson", "institution": null}]}