{"title": "Posterior vs Parameter Sparsity in Latent Variable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 664, "page_last": 672, "abstract": "In this paper we explore the problem of biasing unsupervised models to favor sparsity. We extend the posterior regularization framework [8] to encourage the model to achieve posterior sparsity on the unlabeled training data. We apply this new method to learn \ufb01rst-order HMMs for unsupervised part-of-speech (POS) tagging, and show that HMMs learned this way consistently and signi\ufb01cantly out-performs both EM-trained HMMs, and HMMs with a sparsity-inducing Dirichlet prior trained by variational EM. We evaluate these HMMs on three languages \u2014 English, Bulgarian and Portuguese \u2014 under four conditions. We \ufb01nd that our method always improves performance with respect to both baselines, while variational Bayes actually degrades performance in most cases. We increase accuracy with respect to EM by 2.5%-8.7% absolute and we see improvements even in a semisupervised condition where a limited dictionary is provided.", "full_text": "Posterior vs. Parameter Sparsity in Latent Variable\n\nModels\n\nJo\u00e3o V. Gra\u00e7a\nL2F INESC-ID\nLisboa, Portugal\n\nKuzman Ganchev\n\nBen Taskar\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA, USA\n\nFernando Pereira\nGoogle Research\n\nMountain View, CA, USA\n\nAbstract\n\nWe address the problem of learning structured unsupervised models with moment\nsparsity typical in many natural language induction tasks. For example, in unsu-\npervised part-of-speech (POS) induction using hidden Markov models, we intro-\nduce a bias for words to be labeled by a small number of tags. In order to express\nthis bias of posterior sparsity as opposed to parametric sparsity, we extend the pos-\nterior regularization framework [7]. 
We evaluate our methods on three languages\n\u2014 English, Bulgarian and Portuguese \u2014 showing consistent and signi\ufb01cant accu-\nracy improvement over EM-trained HMMs, and HMMs with sparsity-inducing\nDirichlet priors trained by variational EM. We increase accuracy with respect\nto EM by 2.3%-6.5% in a purely unsupervised setting as well as in a weakly-\nsupervised setting where the closed-class words are provided. Finally, we show\nimprovements using our method when using the induced clusters as features of a\ndiscriminative model in a semi-supervised setting.\n\n1\n\nIntroduction\n\nLatent variable generative models are widely used in inducing meaningful representations from un-\nlabeled data. Maximum likelihood estimation is a standard method for \ufb01tting such models, but in\nmost cases we are not so interested in the likelihood of the data as in the distribution of the latent\nvariables, which we hope will capture regularities of interest without direct supervision. In this pa-\nper we explore the problem of biasing such unsupervised models to favor a novel kind of sparsity\nthat expresses our expectations about the role of the latent variables. Many important language pro-\ncessing tasks (tagging, parsing, named-entity classi\ufb01cation) involve classifying events into a large\nnumber of possible classes, where each event type can have just a few classes. We extend the poste-\nrior regularization framework [7] to achieve that kind of posterior sparsity on the unlabeled training\ndata. In unsupervised part-of-speech (POS) tagging, a well studied yet challenging problem, the new\nmethod consistently and signi\ufb01cantly improves performance over a non-sparse baseline and over a\nvariational Bayes baseline with a Dirichlet prior used to encourage sparsity [9, 4].\nA common approach to unsupervised POS tagging is to train a hidden Markov model where the\nhidden states are the possible tags and the observations are word sequences. 
The model is typi-\ncally trained with the expectation-maximization (EM) algorithm to maximize the likelihood of the\nobserved sentences. Unfortunately, while supervised training of HMMs achieves relatively high\naccuracy, the unsupervised models tend to perform poorly. One well-known reason for this is that\nEM tends to allow each word to be generated by most hidden states some of the time. In reality,\nwe would like most words to have a small number of possible tags. To solve this problem, several\nstudies [14, 17, 6] investigated weakly-supervised approaches where the model is given the list of\npossible tags for each word. The task is then to disambiguate among the possible tags for each word\ntype. Recent work has made use of smaller dictionaries, trying to model the set of possible tags for\neach word [18, 5], or use a small number of \u201cprototypes\u201d for each tag [8]. All these approaches\ninitialize the model in a way that encourages sparsity by zeroing out impossible tags. Although this\nhas worked extremely well for the weakly-supervised case, we are interested in the setting where\nwe have only high-level information about the model: we know that the distribution over the la-\ntent variables (such as POS tags) should be sparse. This has been explored in a Bayesian setting,\nwhere a prior is used to encourage sparsity in the model parameters [4, 9, 6]. This sparse prior,\nwhich prefers each tag to have few word types associated with it, indirectly achieves sparsity over\nthe posteriors, meaning each word type should have few possible tags. Our method differs in that\nit encourages sparsity in the model posteriors, more directly encoding the desiderata. Additionally\nour method can be applied to log-linear models where sparsity in the parameters leads to dense\nposteriors. 
Sparsity at this level has already been suggested before under a very different model[18].\nWe use a \ufb01rst-order HMM as our model to compare the different training conditions: classical\nexpectation-maximization (EM) training without modi\ufb01cations to encourage sparsity, the sparse\nprior used by [9] with variational Bayes EM (VEM), and our sparse posterior regularization (Sparse).\nWe evaluate these methods on three languages, English, Bulgarian and Portuguese. We \ufb01nd that our\nmethod consistently improves performance with respect to both baselines in a completely unsuper-\nvised scenario, as well as in a weakly-supervised scenario where the tags of closed-class words are\nsupplied. Interestingly, while VEM achieves a state size distribution (number of words assigned\nto hidden states) that is closer to the empirical tag distribution than EM and Sparse its state-token\ndistribution is a worse match to the empirical tag-token distribution than the competing methods.\nFinally, we show that states assigned by the model are useful as features for a supervised POS\ntagger.\n\n2 Posterior Regularization\n\nIn order to express the desired preference for posterior sparsity, we use the posterior regularization\n(PR) framework [7], which incorporates side information into parameter estimation in the form of\nlinear constraints on posterior expectations. This allows tractable learning and inference even when\nthe constraints would be intractable to encode directly in the model, for instance to enforce that\neach hidden state in an HMM is used only once in expectation. Moreover, PR can represent prior\nknowledge that cannot be easily expressed as priors over model parameters, like the constraint used\nin this paper. 
PR can be seen as a penalty on the standard marginal likelihood objective, which we\nde\ufb01ne \ufb01rst:\n\nMarginal Likelihood: L(\u03b8) = \u00ca[\u2212 log p\u03b8(x)] = \u00ca[\u2212 log \u2211_z p\u03b8(z, x)]\n\nover the parameters \u03b8, where \u00ca is the empirical expectation over the unlabeled sample x, and z are\nthe hidden states. This standard objective may be regularized with a parameter prior \u2212 log p(\u03b8) =\nC(\u03b8), for example a Dirichlet.\nPosterior information in PR is speci\ufb01ed with sets Qx of distributions over the hidden variables z\nde\ufb01ned by linear constraints on feature expectations:\n\nQx = {q(z | x) : Eq[f(x, z)] \u2264 b}.   (1)\n\nThe marginal log-likelihood of a model is then penalized with the KL-divergence between the de-\nsired distributions Qx and the model, KL(Qx \u2016 p\u03b8(z|x)) = min_{q\u2208Qx} KL(q(z) \u2016 p\u03b8(z|x)). The\nrevised learning objective minimizes:\n\nPR Objective: L(\u03b8) + C(\u03b8) + \u00ca[KL(Qx \u2016 p\u03b8(z|x))].   (2)\n\nSince the objective above is not convex in \u03b8, PR estimation relies on an EM-like lower-bounding\nscheme for model \ufb01tting, where the E step computes a distribution q(z|x) over the latent variables\nand the M step minimizes negative marginal likelihood under q(z|x) plus parameter regularization:\n\nM-Step: min_\u03b8 \u00ca[Eq[\u2212 log p\u03b8(x, z)]] + C(\u03b8)   (3)\n\nIn a standard E step, q is the posterior over the model hidden variables given current \u03b8: q(z|x) =\np\u03b8(z|x). However, in PR, q is a projection of the posteriors onto the constraint set Qx for each\nexample x:\n\narg min_q KL(q(z|x) \u2016 p\u03b8(z|x))  s.t.  Eq[f(x, z)] \u2264 b.   (4)\n\n[Figure 1 panels, left to right: p\u03b8; \u03bb; qti \u221d p\u03b8 e^{\u2212\u03bbti}]\n\nFigure 1: An illustration of \u21131/\u2113\u221e regularization. Left panel: initial tag distributions (columns)\nfor 15 instances of a word. 
Middle panel: optimal regularization parameters \u03bb, each row sums to\n\u03c3 = 20. Right panel: q concentrates the posteriors for all instances on the NN tag, reducing the\n\u21131/\u2113\u221e norm from just under 4 to a little over 1.\n\nThe new posteriors q(z|x) are used to compute suf\ufb01cient statistics for this instance and hence to\nupdate the model\u2019s parameters in the M step. The optimization problem in Equation 4 can be solved\nef\ufb01ciently in dual form:\n\narg min_{\u03bb\u22650} b\u22a4\u03bb + log \u2211_z p\u03b8(z|x) exp{\u2212\u03bb\u22a4f(x, z)}.   (5)\n\nGiven \u03bb, the primal solution is q(z|x) = p\u03b8(z|x) exp{\u2212\u03bb\u22a4f(x, z)}/Z, where Z is a normalization\nconstant. There is one dual variable per expectation constraint, which can be optimized by projected\ngradient descent where the gradient for \u03bb is b \u2212 Eq[f(x, z)]. Gradient computation involves an expec-\ntation under q(z|x) that can be computed ef\ufb01ciently if the features f(x, z) factor in the same way as\nthe model p\u03b8(z|x) [7].\n\n3 Relaxing Posterior Regularization\nIn this work, we modify PR so that instead of hard constraints on q(z | x), it allows the constraints\nto be relaxed at a cost speci\ufb01ed by a penalty. This relaxation can allow combining multiple con-\nstraints without having to explicitly ensure that the constraint set remains non-empty. Additionally,\nit will be useful in dealing with the \u21131/\u2113\u221e constraints we need. If those were incorporated as hard\nconstraints, the dual objective would become non-differentiable, making the optimization (some-\nwhat) more complicated. Using soft constraints, the non-differentiable portion of the dual objective\nturns into simplex constraints on the dual variables, allowing us to use an ef\ufb01cient projected gradient\nmethod. For soft constraints, Equation 4 is replaced by\n\narg min_{q,b} KL(q \u2016 p) + R(b)  s.t.  Eq[f(x, z)] \u2264 b   (6)\n\nwhere b is the constraint vector, and R(b) penalizes overly lax constraints. For POS tagging, we\nwill design R(b) to encourage each word type to be observed with a small number of POS tags in\nthe projected posteriors q. The overall objective minimized can be shown to be:\n\nSoft PR Objective: arg min_{\u03b8,q,b} L(\u03b8) + C(\u03b8) + \u00ca[KL(q \u2016 p\u03b8) + R(b)]  s.t.  Eq[f(x, z)] \u2264 b.   (7)\n\n3.1 \u21131/\u2113\u221e regularization\n\nWe now choose the posterior constraint regularizer R(b) to encourage each word to be associated\nwith only a few parts of speech. Let feature fwti have value 1 whenever the ith occurrence of word\nw has part of speech tag t. For every word w, we would like there to be only a few POS tags t\nsuch that there are occurrences i where t has nonzero probability. This can be achieved if it \u201ccosts\u201d\na lot to allow an occurrence of a word to take a tag, but once that happens, it should be \u201cfree\u201d for\nother occurrences of the word to receive that same tag. More precisely, we would like the sum (\u21131\nnorm) over tags t and word types w of the maxima (\u2113\u221e norm) of the expectation of taking tag t\nover all occurrences of w to be small. Table 1 shows the value of the \u21131/\u2113\u221e sparsity measure for\nthree different corpora, comparing fully supervised HMM and fully unsupervised HMM learned\nwith standard EM, with standard EM having 3-4 times larger value of \u21131/\u2113\u221e than the supervised.\nThis discrepancy is what our PR objective is attempting to eliminate.\nFormally, the E-step of our approach is expressed by the objective:\n\nmin_{q,cwt} KL(q \u2016 p\u03b8) + \u03c3 \u2211_{wt} cwt  s.t.  Eq[fwti] \u2264 cwt   (8)\n\nwhere \u03c3 is the strength of the regularization. Note that by setting \u03c3 = 0 we are back to normal EM\nwhere q is the model posterior distribution. As \u03c3 \u2192 \u221e, the constraints force each occurrence of\na word type to have the same posterior distribution, effectively reducing the model to a 0th-order\nMarkov chain in the E step.\nThe dual of this objective has a very simple form (see supplementary material for derivation):\n\nmax_{\u03bb\u22650} \u2212 log(\u2211_z p\u03b8(z) exp(\u2212\u03bb \u00b7 f(z)))  s.t.  \u2211_i \u03bbwti \u2264 \u03c3   (9)\n\nwhere z ranges over assignments to the hidden tag variables for all of the occurrences in the training\ndata, f(z) is the vector of fwti feature values for assignment z, \u03bb is the vector of dual parame-\nters \u03bbwti, and the primal parameters are q(z) \u221d p\u03b8(z) exp (\u2212\u03bb \u00b7 f(z)). This can be computed by\nprojected gradient, as described by Bertsekas [3].\nFigure 1 illustrates how the \u21131/\u2113\u221e norm operates on a toy example. For simplicity suppose we are\nonly regularizing one word and our model p\u03b8 is just a product distribution over 15 instances of the\nword. The left panel in Figure 1 shows the posteriors under p\u03b8. We would like to concentrate the\nposteriors on a small subset of rows. The center panel of the \ufb01gure shows the \u03bb values determined\nby Equation 9, and the right panel shows the projected distribution q, which concentrates most of the\nposterior on the bottom row. Note that we are not requiring the posteriors to be sparse, which would\nbe equivalent to preferring that the distribution is peaked; rather, we want a word to concentrate its\ntag posterior on a few tags across all instances of the word. Indeed, most of the instances (columns)\nbecome less peaked than in the original posterior to allow posterior mass to be redistributed away\nfrom the outlier tags. 
Since they are more numerous than the outliers, they moved less. This also\njusti\ufb01es only regularizing relatively frequent events in our model.\n\n4 Bayesian Estimators\n\nRecent advances in inference methods for sparsifying Bayesian estimation have been applied to\nunsupervised POS tagging [4, 9, 6]. In the Bayesian setting, preference for sparsity is expressed\nas a prior distribution over model structures and parameters, rather than as constraints on feature\nposteriors. To compare these two approaches, in Section 5 we compare our method to a Bayesian\napproach proposed by Johnson [9], which relies on a Dirichlet prior to encourage sparsity in a \ufb01rst-\norder HMM for POS tagging. The complete description of the model is:\n\n\u03b8i \u223c Dir (\u03b1i)\nP (ti|tt\u22121 = tag) \u223c Multi(\u03b8i)\n\n\u03c6i \u223c Dir (\u03bbi)\nP (wi|ti = tag) \u223c Multi(\u03c6i)\n\nHere, \u03b1i controls sparsity over the state transition matrix and \u03bbi controls the sparsity of state emis-\nsion probabilities. Johnson [9] notes that \u03b1i does not in\ufb02uence the model that much. In contrast, as\n\u03bbi approaches zero, it encourages the model to have highly skewed P (wi|ti = tag) distributions,\nthat is, each tag is encouraged to generate a few words with high probability, and the rest with very\nlow probability. This is not exactly the constraint we would like to enforce: there are some POS\ntags that generate many different words with relatively high probability (for example, nouns and\nverbs), while each word is associated with a small number of tags. This difference is one possible\nexplanation for the relatively worse performance of this prior compared to our method.\nJohnson [9] describes two approaches to learn the model parameters: a component-wise Gibbs\nsampling scheme (GS) and a variational Bayes (VB) approximation using a mean \ufb01eld. Since John-\nson [9] found VB worked much better than GS, we use VB in our experiments. 
Additionally, VB is\nparticularly simple to implement, consisting only of a small modi\ufb01cation to the M-Step of the EM al-\ngorithm. The Dirichlet prior hyper-parameters are added to the expected counts and passed through\na squashing function (exponential of the Digamma function) before being normalized. We refer\nthe reader to the original paper for more detail (see also http://www.cog.brown.edu/~mj/Publications.htm\nfor a bug \ufb01x in the Digamma function implementation).\n\n5 Experiments\n\nWe now compare \ufb01rst-order HMMs trained using the three methods described earlier: the classi-\ncal EM algorithm (EM), our \u21131/\u2113\u221e posterior regularization based method (Sparse), and the model\npresented in Section 4 (VEM). Models were trained and tested on all available data of three cor-\npora: the Wall Street Journal portion of the Penn treebank [13] using the reduced tag set of 17 tags\n[17] (PTB17); the Bosque subset of the Portuguese Floresta Sinta(c)tica Treebank [1] used for the\nCoNLL-X shared task on dependency parsing (PT-CoNLL); and the Bulgarian BulTreeBank [16]\n(BulTree) with the 12 coarse tags. We also report results on the full Penn treebank tag set in the\nsupplementary materials. All words that occurred only once were replaced by the token \u201cunk\u201d. To\nmeasure model sparsity, we compute the average \u21131/\u2113\u221e norm over words occurring more than 10\ntimes (denoted \u2018L1LMax\u2019 in our \ufb01gures). Table 1 gives statistics for each corpus as well as the\nsparsity for a \ufb01rst-order HMM trained using the labeled data and using standard EM with unlabeled\ndata.\n\nCorpus      Types   Tokens   Unk    Tags   Sup. \u21131/\u2113\u221e   EM \u21131/\u2113\u221e\nPT-Conll    11293   206678   8.5%   22     1.14         4.57\nBulTree     12177   174160   10%    12     1.04         3.51\nPTB17       23768   950028   2%     17     1.23         3.97\n\nTable 1: Corpus statistics. 
All words with only one occurrence were replaced by the \u2018unk\u2019 token.\nThe third column shows the percentage of tokens replaced. Sup. \u21131/\u2113\u221e is the value of the sparsity\nmeasure for a fully supervised HMM trained on all available data and EM \u21131/\u2113\u221e is the value of the\nsparsity measure for a fully unsupervised HMM trained using standard EM on all available data.\n\nFollowing Gao and Johnson [4], the parameters were initialized with a \u201cpseudo E step\u201d as follows:\nwe \ufb01lled the expected count matrices with numbers 1 + X \u00d7 U(0, 1), where U(0, 1) is a random\nnumber between 0 and 1 and X is a parameter. These matrices are then fed to the M step; the re-\nsulting \u201crandom\u201d transition and emission probabilities are used for the \ufb01rst real E step. For VEM,\nX was set to 0.0001 (almost uniform) since this showed a signi\ufb01cant improvement in performance.\nOn the other hand, EM showed less sensitivity to initialization, and we used X = 1, which gave\nthe best results. The models were trained for 200 iterations as longer runs did not signi\ufb01cantly\nchange the results (models converge before 100 iterations). For VEM we tested 4 different prior\ncombinations (all combinations of 10^\u22121 and 10^\u22123 for emission prior and transition prior), based\non Johnson\u2019s results [9]. As previously noted, changing the transition priors does not affect the
As previously noted, changing the transition priors does not affect the\n\nEstimator\n\nEM\nVEM(10\u22121)\nVEM(10\u22124)\nSparse (10)\nSparse (32)\nSparse (100)\n\nPT-Conll\n\n1-1\n\n1-Many\n40.4(3.0)\n64.0(1.2)\n51.1(2.3)\n60.4(0.6)\n63.2(1.0)* 48.1(2.2)\n43.3(2.2)\n68.5(1.3)\n69.2(0.9)\n43.2(2.9)\n68.3(2.1)\n44.5(2.4)\n\nBG\n\nPTB17\n\n1-1\n\n1-Many\n59.4(2.2)\n54.9(3.1)\n56.1(2.8)\n65.1(1.0)\n66.0(1.8)\n65.9(1.6)\n\n1-1\n\n1-Many\n46.4(2.6)\n67.5(1.3)\n42.0(3.0)\n68.2(0.8)* 52.8(3.5)\n46.4(3.0)\n43.3(1.7)* 67.3(0.8)* 49.6(4.3)\n50.0(3.5)\n48.0(3.3)\n49.5(2.0)\n48.7(2.2)\n48.9(2.8)\n47.8(1.5)*\n\n69.5(1.6)\n70.2(2.2)\n68.7(1.1)\n\nTable 2: Average accuracy (standard deviation in parentheses) over 10 different runs (random seeds\nidentical across models) for 200 iterations. 1-Many and 1-1 are the two hidden-state to POS map-\npings described in the text. All models are \ufb01rst order HMMs: EM trained using expectation maxi-\nmization, VEM trained using variational EM observation priors shown in parentheses, Sparse trained\nusing PR with the constraint strength (\u03c3) in parentheses. Bold indicates the best value for each col-\numn. All results except those starred are signi\ufb01cant (p=0.005) on a paired t-test against the EM\nmodel.\n\n5\n\n\f(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 2: Detailed visualizations of the results on the PT-Conll corpus. (a) 1-many accuracy vs\n(cid:96)1/(cid:96)\u221e, (b) 1-1 accuracy vs (cid:96)1/(cid:96)\u221e, (c) tens of thousands of tokens assigned to hidden state vs rank,\n(d) mutual information in bits between gold tag distribution and hidden state distribution.\n\nresults, so we only report results for different emission priors. Later work [4] considered a wider\nrange of values but did not identify de\ufb01nitely better choices. Sparse was initialized with the pa-\nrameters obtained by running EM for 30 iterations, followed by 170 iterations of the new training\nprocedure. 
Predictions were obtained using posterior decoding since this consistently showed small\nimprovements over Viterbi decoding.\nWe evaluate the accuracy of the models using two established mappings between hidden states and\nPOS tags: 1-Many maps each hidden state to the tag with which it co-occurs the most; 1-1 [8]\ngreedily picks a tag for each state under the constraint of never using the same tag twice. This\nresults in an approximation of the optimal 1-1 mapping. If the numbers of hidden states and tags\nare not the same, some hidden states will be unassigned (and hence always wrong) or some tags not\nused. In all our experiments the number of hidden states is the same as the number of POS tags.\nTable 2 shows the accuracy of the different methods averaged over 15 different random parameter\ninitializations. Comparing the methods for each of the initialization points individually, our \u21131/\u2113\u221e\nregularization always outperforms the EM baseline model on both metrics, and always outperforms\nVEM using the 1-Many mapping, while for the 1-1 mapping our method outperforms VEM roughly\nhalf the time. The improvements are consistent for different constraint strength values.\nFigure 2 shows detailed visualizations of the behavior of the different methods on the PT-Conll cor-\npus. The results for the other corpora are qualitatively similar, and are reported in the supplemental\nmaterial. The left two plots show scatter graphs of accuracy with respect to \u21131/\u2113\u221e value, where\naccuracy is measured with either the 1-many mapping (left) or 1-1 mapping (center). We see that\nSparse is much better using the 1-many mapping and worse using the 1-1 mapping than VEM, even\nthough they achieve similar \u21131/\u2113\u221e. The third plot shows the number of tokens assigned to each\nhidden state at decoding time, in frequency rank order. 
While both EM and Sparse exhibit a fast de-\ncrease in the size of the states, VEM more closely matches the power law-like distribution achieved\nby the gold labels. This difference explains the improvement on the 1-1 mapping, where VEM is\nassigning larger size states to the most frequent tags. However, VEM achieves this power law distri-\nbution at the expense of the mutual information with the gold labels as we see in the rightmost plot.\nFrom all methods, VEM has the lowest mutual information, while Sparse has the highest.\n\n5.1 Closed-class words\n\nWe now consider the case where some supervision has been given in the form of a list of the closed-\nclass words for the language, along with POS tags. Example closed classes are punctuation, pro-\nnouns, possessive markers, while open classes would include nouns, verbs, and adjectives. (See the\nsupplemental materials for details.) We assume that we are given the POS tags of closed classes\nalong with the words in each closed class. In the models, we set the emission probability from a\nclosed-class tag to any word not in its class to zero. Also, any word appearing in a closed class is as-\nsumed to have zero probability of being generated by an open-class tag. This improves performance\nsigni\ufb01cantly for all languages, but our sparse training procedure is still able to outperform EM train-\ning signi\ufb01cantly as shown in Table 3. 
Note, for these experiments we do not use an unknown word,\nsince doing so for closed-class words would allow closed class tags to generate unknown words.\n\nEstimator      PT-Conll                BulTree                 PTB-17\n               1-Many      1-1         1-Many      1-1         1-Many      1-1\nEM             72.5(1.7)   52.6(4.2)   77.9(1.7)   65.4(2.8)   76.7(0.9)   61.1(1.8)\nSparse (32)    75.3(1.2)   57.5(5.0)   82.4(1.2)   69.5(1.3)   78.0(1.6)   62.2(2.0)\n\nTable 3: Results with given closed-class tags, using posterior decoding, and projection at test time.\n\n[Figure 3 panels, left to right: PT-Conll; BulTree; PTB-17]\n\nFigure 3: Accuracy of a supervised classi\ufb01er when trained using the output of various unsupervised\nmodels as features. Vertical axis: accuracy, Horizontal axis: number of labeled sentences.\n\n5.2 Supervised POS tagging\n\nAs a further comparison of the models trained using the different methods, we use them to generate\nfeatures for a supervised POS tagger. The basic supervised model has features for the identity of the\ncurrent token as well as suf\ufb01xes of length 2 and 3. We augment these features with the state identity\nfor the current token, based on the automatically generated models. We train the supervised model\nusing the averaged perceptron for 20 iterations.\nFor each unsupervised training procedure (EM, Sparse, VEM) we train 10 models using different\nrandom initializations and obtain 10 state identities per training method for each token. We then add\nthese cluster identities as features to the supervised model. Figure 3 shows the average accuracy of\nthe supervised model as we vary the type of unsupervised features. 
The average is taken over 10\nrandom samples for the training set at each training set size. We can see from Figure 3 that using\nour method or EM always improves performance relative to the baseline features (labeled \u201cnone\u201d\nin the \ufb01gure). VEM always underperforms EM and, for larger amounts of training data, the VEM\nfeatures appear not to be useful. This should not be surprising given that VEM has very low mutual\ninformation with the gold labeling.\n\n6 Related Work\n\nOur learning method is very closely related to the work of Mann and McCallum [11, 12], who\nconcurrently developed the idea of using penalties based on posterior expectations of features to\nguide learning. They call their method generalized expectation (GE) constraints or alternatively\nexpectation regularization. In the original GE framework, the posteriors of the model are regularized\ndirectly. For equality constraints, our objective would become:\n\narg max_\u03b8 L(\u03b8) \u2212 E_D[R(E\u03b8[f])].   (10)\n\nNotice that there is no intermediate distribution q. For some kinds of constraints this objective is\ndif\ufb01cult to optimize in \u03b8 and in order to improve ef\ufb01ciency Bellare et al. [2] propose interpreting\nthe PR framework as an approximation to the GE objective in Equation 10. They compare the\ntwo frameworks on several datasets and \ufb01nd that performance is similar, and we suspect that this\nwould be true for the sparsity constraints also. Liang et al. [10] cast the problem of incorporating\npartial information about latent variables into a Bayesian framework using \u201cmeasurements,\u201d and\nthey propose active learning for acquiring measurements to reduce uncertainty.\nRecently, Ravi et al. 
[15] show promising results in weakly-supervised POS tagging, where a tag\ndictionary is provided. This method \ufb01rst searches, using integer programming, for the smallest\ngrammar (in terms of unique transitions between tags) that explains the data. This sparse grammar\nand the dictionary are provided as input for training an unsupervised HMM. Results show that using\na sparse grammar, and hence enforcing sparsity over possible tag transitions, leads to better results.\nThis method is different from ours in the sense that our method focuses on learning the sparsity\npattern that their method uses as input.\n\n7 Conclusion\n\nWe presented a new regularization method for unsupervised training of probabilistic models that\nfavors a kind of sparsity that is pervasive in natural language processing. In the case of part-of-\nspeech induction, the preference can be summarized as \u201ceach word occurs as only a few different\nparts-of-speech,\u201d but the approach is more general and could be applied to other tasks. For example,\nin grammar induction, we could favor models where only a small number of production rules have\nnon-zero probability for each child non-terminal.\nOur method uses the posterior regularization framework to specify preferences about model poste-\nriors directly, without having to say how these should be encoded in model parameters. This means\nthat the sparse regularization penalty could be used for a log-linear model, where sparse parameters\ndo not correspond to posterior sparsity.\nWe evaluated the new regularization method on the task of unsupervised POS tagging, encoding\nthe prior knowledge that each word should have a small set of tags as a mixed-norm penalty. We\ncompared our method to a previously proposed Bayesian method (VEM) for encouraging sparsity of\nmodel parameters [9] and found that ours performs better in practice. 
We explain this advantage by\nnoting that VEM encodes a preference that each POS tag should generate a few words, which goes\nin the wrong direction. In reality, in POS tagging (as in several other language processing tasks), a\nfew event types (tags) (such as NN for POS tagging) generate the bulk of the word occurrences,\nbut each word is only associated with a few tags. Even when some supervision was provided\nthrough closed-class lists, our regularizer still improved performance over the other methods.\nAn analysis of sparsity shows that both VEM and Sparse achieve a similar posterior sparsity as\nmeasured by the \u21131/\u2113\u221e metric. While VEM better models the empirical sizes of states (tags), the\nstates it assigns have lower mutual information to the true tags, suggesting that parameter sparsity is\nnot as good at generating good tag assignments. In contrast, Sparse\u2019s sparsity seems to help build a\nmodel that contains more information about the correct tag assignments.\nFinally, we evaluated the worth of states assigned by unsupervised learning as features for supervised\ntagger training with small training sets. These features are shown to be useful in most conditions,\nespecially those created by Sparse. The exceptions are some of the annotations provided by VEM\nwhich actually hinder the performance, con\ufb01rming that its lower mutual information states are not\nso informative.\nIn future work, we would like to evaluate the usefulness of these sparser annotations for down-\nstream tasks, for example determining whether Sparse POS tags are better for unsupervised parsing.\nFinally, we would like to apply the \u21131/\u2113\u221e posterior regularizer to other applications such as unsu-\npervised grammar induction where we would like sparsity in production rules. Similarly, it would\nbe interesting to use this to regularize a log-linear model, where parameter sparsity does not achieve\nthe same goal.\n\nAcknowledgments\n\nJ. V. 
Gra\u00e7a was supported by a fellowship from Funda\u00e7\u00e3o para a Ci\u00eancia e Tecnologia\n(SFRH/BD/27528/2006). K. Ganchev was supported by ARO MURI SUBTLE W911NF-07-1-0216. The\nauthors would like to thank Mark Johnson and Jianfeng Gao for their help in reproducing the VEM\nresults.\n\nReferences\n\n[1] S. Afonso, E. Bick, R. Haber, and D. Santos. Floresta Sinta(c)tica: a treebank for Portuguese. In Proc. LREC, pages 1698\u20131703, 2002.\n\n[2] K. Bellare, G. Druck, and A. McCallum. Alternating projections for learning with expectation constraints. In Proc. UAI, 2009.\n\n[3] D.P. Bertsekas, M.L. Homer, D.A. Logan, and S.D. Patek. Nonlinear programming. Athena Scientific, 1995.\n\n[4] Jianfeng Gao and Mark Johnson. A comparison of Bayesian estimators for unsupervised Hidden Markov Model POS taggers. In Proc. EMNLP, pages 344\u2013352, Honolulu, Hawaii, October 2008. ACL.\n\n[5] Y. Goldberg, M. Adler, and M. Elhadad. EM can find pretty good HMM POS-taggers (when given a good start). In Proc. ACL, pages 746\u2013754, 2008.\n\n[6] S. Goldwater and T. Griffiths. A fully Bayesian approach to unsupervised part-of-speech tagging. In Proc. ACL, volume 45, page 744, 2007.\n\n[7] J. Gra\u00e7a, K. Ganchev, and B. Taskar. Expectation maximization and posterior constraints. In Proc. NIPS. MIT Press, 2008.\n\n[8] A. Haghighi and D. Klein. Prototype-driven learning for sequence models. In Proc. NAACL, pages 320\u2013327, 2006.\n\n[9] M. Johnson. Why doesn\u2019t EM find good HMM POS-taggers? In Proc. EMNLP-CoNLL, 2007.\n\n[10] P. Liang, M. I. Jordan, and D. Klein. Learning from measurements in exponential families. In Proc. ICML, 2009.\n\n[11] G. Mann and A. McCallum. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proc. ICML, 2007.\n\n[12] G. Mann and A. McCallum. 
Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proc. ACL, pages 870\u2013878, 2008.\n\n[13] M.P. Marcus, M.A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313\u2013330, 1993.\n\n[14] B. Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155\u2013171, 1994.\n\n[15] Sujith Ravi and Kevin Knight. Minimized models for unsupervised part-of-speech tagging. In Proc. ACL, 2009.\n\n[16] Kiril Simov, Petya Osenova, Milena Slavcheva, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff, Krassimira Ivanova, Alexander Simov, Er Simov, and Milen Kouylekov. Building a linguistically interpreted corpus of Bulgarian: the BulTreeBank. In Proc. LREC, 2002.\n\n[17] N.A. Smith and J. Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In Proc. ACL, pages 354\u2013362, 2005.\n\n[18] K. Toutanova and M. Johnson. A Bayesian LDA-based model for semi-supervised part-of-speech tagging. In Proc. NIPS, 20, 2007.", "award": [], "sourceid": 1109, "authors": [{"given_name": "Kuzman", "family_name": "Ganchev", "institution": null}, {"given_name": "Ben", "family_name": "Taskar", "institution": null}, {"given_name": "Fernando", "family_name": "Pereira", "institution": null}, {"given_name": "Jo\u00e3o", "family_name": "Gama", "institution": null}]}