{"title": "Syntactic Topic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 185, "page_last": 192, "abstract": "We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents.", "full_text": "Syntactic Topic Models\n\nJordan Boyd-Graber\nDepartment of Computer Science\n35 Olden Street\nPrinceton University\nPrinceton, NJ 08540\njbg@cs.princeton.edu\n\nDavid Blei\nDepartment of Computer Science\n35 Olden Street\nPrinceton University\nPrinceton, NJ 08540\nblei@cs.princeton.edu\n\nAbstract\n\nWe develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. 
We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents.\n\n1 Introduction\n\nProbabilistic topic models provide a suite of algorithms for finding low-dimensional structure in a corpus of documents. When fit to a corpus, the underlying representation often corresponds to the \u201ctopics\u201d or \u201cthemes\u201d that run through it. Topic models have improved information retrieval [1] and word sense disambiguation [2], and have also been applied to non-text data such as computer vision and collaborative filtering [3, 4].\nTopic models are widely applied to text despite a willful ignorance of the underlying linguistic structures that exist in natural language. In a topic model, the words of each document are assumed to be exchangeable; their probability is invariant to permutation. This simplification has proved useful for deriving efficient inference techniques and quickly analyzing very large corpora [5].\nHowever, exchangeable word models are limited. While useful for classification or information retrieval, where a coarse statistical footprint of the themes of a document is sufficient for success, exchangeable word models are ill-equipped for problems relying on more fine-grained qualities of language. For instance, although a topic model can suggest documents relevant to a query, it cannot find particularly relevant phrases for question answering. 
Similarly, while a topic model might discover a pattern such as \u201ceat\u201d occurring with \u201ccheesecake,\u201d it lacks the representation to describe selectional preferences, the process where certain words restrict the choice of the words that follow.\nIt is in this spirit that we develop the syntactic topic model, a nonparametric Bayesian topic model that can infer both syntactically and thematically coherent topics. Rather than treating words as the exchangeable unit within a document, the words of the sentences must conform to the structure of a parse tree. In the generative process, the words arise from a distribution that has both a document-specific thematic component and a parse-tree-specific syntactic component.\nWe illustrate this idea with a concrete example. Consider a travel brochure with the sentence \u201cIn the near future, you could find yourself in ____.\u201d Both the low-level syntactic context of a word and its document context constrain the possibilities of the word that can appear next. Syntactically, it\n\n(a) Overall Graphical Model (b) Sentence Graphical Model\n\nFigure 1: In the graphical model of the STM, a document is made up of a number of sentences, represented by a tree of latent topics z which in turn generate words w. These words\u2019 topics are chosen by the topic of their parent (as encoded by the tree), the topic weights for a document θ, and the node\u2019s parent\u2019s successor weights π. (For clarity, not all dependencies of sentence nodes are shown.) 
The structure of variables for sentences within the document plate is on the right, as demonstrated by an automatic parse of the sentence \u201cSome phrases laid in his mind for years.\u201d The STM assumes that the tree structure and words are given, but the latent topics z are not.\n\nis going to be a noun, consistent with its role as the object of the preposition \u201cof.\u201d Thematically, because it is in a travel brochure, we would expect to see words such as \u201cAcapulco,\u201d \u201cCosta Rica,\u201d or \u201cAustralia\u201d more than \u201ckitchen,\u201d \u201cdebt,\u201d or \u201cpocket.\u201d Our model can capture these kinds of regularities and exploit them in predictive problems.\nPrevious efforts to capture local syntactic context include semantic space models [6] and similarity functions derived from dependency parses [7]. These methods successfully determine words that share similar contexts, but do not account for thematic consistency. They have difficulty with polysemous words such as \u201cfly,\u201d which can be either an insect or a term from baseball. With a sense of document context, i.e., a representation of whether a document is about sports or animals, the meaning of such terms can be distinguished.\nOther techniques have attempted to combine local context with document coherence using linear sequence models [8, 9]. While these models are powerful, ordering words sequentially removes the important connections that are preserved in a syntactic parse. Moreover, these models generate words from either the syntactic or the thematic context. In the syntactic topic model, words are constrained to be consistent with both.\nThe remainder of this paper is organized as follows. We describe the syntactic topic model and develop an approximate posterior inference technique based on variational methods. We study its performance both on synthetic data and hand-parsed data [10]. 
We show that the STM captures relationships missed by other models and achieves lower held-out perplexity.\n\n2 The syntactic topic model\n\nWe describe the syntactic topic model (STM), a document model that combines observed syntactic structure and latent thematic structure. To motivate this model, we return to the travel brochure sentence \u201cIn the near future, you could find yourself in ____.\u201d The word that fills in the blank is constrained by its syntactic context and its document context. The syntactic context tells us that it is an object of a preposition, and the document context tells us that it is a travel-related word.\nThe STM attempts to capture these joint influences on words. It models a document corpus as exchangeable collections of sentences, each of which is associated with a tree structure such as a parse tree (Figure 1(b)). The words of each sentence are assumed to be generated from a distribution influenced both by their observed role in that tree and by the latent topics inherent in the document.\nThe latent variables that comprise the model are topics, topic transition vectors, topic weights, topic assignments, and top-level weights. Topics are distributions over a fixed vocabulary (τk in Figure 1). Each is further associated with a topic transition vector (πk), which weights changes in topics between parent and child nodes. Topic weights (θd) are per-document vectors indicating the degree to which each document is \u201cabout\u201d each topic. Topic assignments (zn, associated with each internal node of Figure 1(b)) are per-word indicator variables that refer to the topic from which the corresponding word is assumed to be drawn. The STM is a nonparametric Bayesian model. 
The number of topics is not fixed and, indeed, can grow with the observed data.\nThe STM assumes the following generative process of a document collection.\n\n1. Choose global topic weights β ∼ GEM(α)\n2. For each topic index k = {1, . . .}:\n(a) Choose topic τk ∼ Dir(σ)\n(b) Choose topic transition distribution πk ∼ DP(αT, β)\n3. For each document d = {1, . . . , M}:\n(a) Choose topic weights θd ∼ DP(αD, β)\n(b) For each sentence in the document:\ni. Choose topic assignment z0 ∝ θd πstart\nii. Choose root word w0 ∼ mult(1, τz0)\niii. For each additional word wn and parent pn, n ∈ {1, . . . , dn}:\n• Choose topic assignment zn ∝ θd πzp(n)\n• Choose word wn ∼ mult(1, τzn)\n\nThe distinguishing feature of the STM is that the topic assignment is drawn from a distribution that combines two vectors: the per-document topic weights and the transition probabilities of the topic assignment from its parent node in the parse tree. By merging these vectors, the STM models both the local syntactic context and corpus-level semantics of the words in the documents. Because they depend on their parents, the topic assignments and words are generated by traversing the tree.\nA natural alternative model would be to traverse the tree and choose the topic assignment from either the parental topic transition πzp(n) or the document topic weights θd, based on a binary selector variable. This would be an extension of [8] to parse trees, but it does not enforce words to be both syntactically consistent with their parent nodes and thematically consistent with a topic of the document; only one of the two conditions must be true. Rather, our approach draws on the idea behind the product of experts [11], multiplying two vectors and renormalizing to obtain a new distribution. 
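The product-of-experts topic draw can be illustrated with a minimal sketch (our own illustration, not the authors' code); the 3-topic vectors below are toy values:

```python
import numpy as np

def topic_draw_probs(theta, pi_parent):
    """Point-wise product of the document topic weights (theta_d) and the
    parent node's topic-transition vector (pi_{z_p(n)}), renormalized so the
    result is a proper distribution over topics (a product of experts)."""
    unnorm = theta * pi_parent
    return unnorm / unnorm.sum()

# Toy example with 3 topics: the document favors topic 0, but the parent's
# transition vector favors topic 1; the product lets both constrain the draw.
theta = np.array([0.7, 0.2, 0.1])
pi_parent = np.array([0.1, 0.8, 0.1])
probs = topic_draw_probs(theta, pi_parent)
```

A topic whose weight is near zero in either vector receives near-zero probability in the product, so only topics plausible under both the document and the syntactic context survive.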
Taking the point-wise product can be thought of as viewing one distribution through the \u201clens\u201d of another, effectively choosing only words whose appearance can be explained by both.\nThe STM is closely related to the hierarchical Dirichlet process (HDP). The HDP is an extension of Dirichlet process mixtures to grouped data [12]. Applied to text, the HDP is a probabilistic topic model that allows each document to exhibit multiple topics. It can be thought of as the \u201cinfinite\u201d topic version of latent Dirichlet allocation (LDA) [13]. The difference between the STM and the HDP is in how the per-word topic assignment is drawn. In the HDP, this topic assignment is drawn directly from the topic weights and, thus, the HDP assumes that words within a document are exchangeable. In the STM, the words are generated conditioned on their parents in the parse tree. The exchangeable unit is a sentence.\nThe STM is also closely related to the infinite tree with independent children [14]. The infinite tree models syntax by basing the latent syntactic category of children on the syntactic category of the parent. The STM reduces to the infinite tree when θd is fixed to a vector of ones.\n\n3 Approximate posterior inference\n\nThe central computational problem in topic modeling is to compute the posterior distribution of the latent structure conditioned on an observed collection of documents. Specifically, our goal is to compute the posterior topics, topic transitions, per-document topic weights, per-word topic assignments, and top-level weights conditioned on a set of documents, each of which is a collection of parse trees.\nThis posterior distribution is intractable to compute. In typical topic modeling applications, it is approximated with either variational inference or collapsed Gibbs sampling. 
Fast Gibbs sampling relies on the conjugacy between the topic assignment and the prior over the distribution that generates it. The syntactic topic model does not enjoy such conjugacy because the topic assignment is drawn from a multiplicative combination of two Dirichlet-distributed vectors. We appeal to variational inference.\nIn variational inference, the posterior is approximated by positing a simpler family of distributions, indexed by free variational parameters. The variational parameters are fit to be close in relative entropy to the true posterior. This is equivalent to maximizing Jensen\u2019s lower bound on the marginal probability of the observed data [15].\nWe use a fully factorized variational distribution,\n\nq(β, z, θ, π, τ | β*, φ, γ, ν) = q(β|β*) ∏d q(θd|γd) ∏k q(πk|νk) ∏n q(zn|φn). (1)\n\nFollowing [16], q(β|β*) is not a full distribution, but is a degenerate point estimate truncated so that all weights whose index is greater than K are zero in the variational distribution. The variational parameters γd and νk index Dirichlet distributions, and φn is a topic multinomial for the nth word.\nFrom this distribution, Jensen\u2019s lower bound on the log probability of the corpus is\n\nL(γ, ν, φ; β, θ, π, τ) = Eq[log p(β|α) + log p(θ|αD, β) + log p(π|αT, β) + log p(z|θ, π) + log p(w|z, τ) + log p(τ|σ)] - Eq[log q(θ) + log q(π) + log q(z)]. (2)\n\nExpanding Eq[log p(z|θ, π)] is difficult, so we add an additional slack parameter, ωn, to approximate the expression. This derivation and the complete likelihood bound are given in the supplement. 
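As a concrete picture of the bookkeeping behind the factorized family in Equation 1, here is a minimal sketch (ours, with made-up sizes, not the authors' implementation) of the free variational parameters for a truncated model:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N = 16, 5, 12   # truncation level, number of documents, words in one tree

# Free parameters of the fully factorized variational distribution:
beta_star = np.full(K, 1.0 / K)             # degenerate point estimate for beta
gamma = rng.gamma(1.0, 1.0, size=(D, K))    # Dirichlet parameters for each theta_d
nu = rng.gamma(1.0, 1.0, size=(K, K))       # Dirichlet parameters for each pi_k
phi = rng.dirichlet(np.ones(K), size=N)     # topic multinomial for each word
```

Coordinate ascent then cycles through these blocks, updating each while holding the others fixed.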
We use coordinate ascent to optimize the variational parameters to be close to the true posterior.\n\nPer-word variational updates The variational update for the topic assignment of the nth word is\n\nφn,i ∝ exp{ Ψ(γi) - Ψ(Σ_{j=1}^K γj) + Σ_{j=1}^K φp(n),j (Ψ(νj,i) - Ψ(Σ_{k=1}^K νj,k)) + Σ_{c∈c(n)} Σ_{j=1}^K φc,j (Ψ(νi,j) - Ψ(Σ_{k=1}^K νi,k)) - Σ_{c∈c(n)} ωc^{-1} Σ_j (γj νi,j) / (Σ_k γk Σ_k νi,k) + log τi,wn }. (3)\n\nThe influences on estimating the posterior of a topic assignment are: the document\u2019s topic weights γ, the topic of the node\u2019s parent p(n), the topics of the node\u2019s children c(n), the expected transitions between topics ν, and the probability of the word within a topic τi,wn.\nMost terms in Equation 3 are familiar from variational inference for probabilistic topic models, as the digamma functions appear in the expectations of multinomial distributions. The second-to-last term is new, however, because we cannot assume that the point-wise product of πk and θd will sum to one. We approximate the normalizer for their product by introducing ω; its update is\n\nωn = Σ_{i=1}^K Σ_{j=1}^K φp(n),j (γi νj,i) / (Σ_{k=1}^K γk Σ_{k=1}^K νj,k).\n\nVariational Dirichlet distributions and topic composition This normalizer term also appears in the derivative of the likelihood function for γ and ν (the parameters to the variational distributions on θ and π, respectively), which cannot be solved in closed form. We use conjugate gradient optimization to determine the appropriate updates for these parameters [17].\nTop-level weights Finally, we consider the top-level weights. 
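The slack-normalizer update for ωn given above can be written directly as code; this is our own numpy sketch with hypothetical variable names, not the authors' implementation:

```python
import numpy as np

def omega_update(gamma_d, nu, phi_parent):
    """omega_n = sum_i sum_j phi_{p(n),j} * gamma_i * nu_{j,i}
                 / (sum_k gamma_k * sum_k nu_{j,k}).
    Equivalently: sum_j phi_{p(n),j} * <normalized gamma, normalized nu_j>."""
    gamma_bar = gamma_d / gamma_d.sum()
    total = 0.0
    for j in range(len(phi_parent)):
        nu_bar_j = nu[j] / nu[j].sum()          # normalized row j of transitions
        total += phi_parent[j] * float(gamma_bar @ nu_bar_j)
    return total

# With uniform gamma and nu over K = 4 topics, every inner product is 1/4,
# so omega reduces to 1/4 regardless of the parent's topic multinomial.
omega = omega_update(np.ones(4), np.ones((4, 4)), np.array([0.4, 0.3, 0.2, 0.1]))
```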
The first K - 1 stick-breaking proportions are drawn from a Beta distribution with parameters (1, α), but we assume that the final stick-breaking proportion is unity (thus implying that β* is non-zero only for indices 1 . . . K). Thus, we only optimize the first K - 1 positions and implicitly take β*K = 1 - Σ_{i=1}^{K-1} β*i. This constrained optimization is performed using the barrier method [17].\n\n4 Empirical results\n\nBefore considering real-world data, we demonstrate the STM on synthetic natural language data. We generated synthetic sentences composed of verbs, nouns, prepositions, adjectives, and determiners. Verbs were only in the head position; prepositions could appear below nouns or verbs; nouns only appeared below verbs; prepositions or determiners and adjectives could appear below nouns. Each of the parts of speech except for prepositions and determiners was sub-grouped into themes, and a document contains a single theme for each part of speech. For example, a document can only contain nouns from a single \u201ceconomic,\u201d \u201cacademic,\u201d or \u201clivestock\u201d theme.\nUsing a truncation level of 16, we fit three different nonparametric Bayesian language models to the synthetic data (Figure 2).1 The infinite tree model is aware of the tree structure but not documents [14]. It is able to separate parts of speech successfully except for adjectives and determiners (Figure 2(a)). However, it ignored the thematic distinctions that actually divided the terms between documents. The HDP is aware of document groupings and treats the words exchangeably within them [12]. 
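The truncated stick-breaking construction for β* can be sketched as follows (our illustration; note the paper optimizes β* as a point estimate rather than sampling it):

```python
import numpy as np

def truncated_gem(alpha, K, rng):
    """Draw K-1 stick-breaking proportions from Beta(1, alpha) and set the
    final proportion to one, so beta*_K = 1 - sum_{i<K} beta*_i and every
    weight past index K is zero."""
    v = rng.beta(1.0, alpha, size=K - 1)
    beta = np.empty(K)
    remaining = 1.0                 # length of stick not yet broken off
    for i in range(K - 1):
        beta[i] = v[i] * remaining
        remaining *= 1.0 - v[i]
    beta[K - 1] = remaining         # final proportion takes the whole remainder
    return beta

beta_star = truncated_gem(alpha=1.0, K=16, rng=np.random.default_rng(0))
```

Because the last weight absorbs the leftover stick, the K weights always sum to one, which is what makes optimizing only the first K - 1 positions a constrained problem.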
It is able to recover the thematic topics, but it misses the connections between the parts of speech and conflates multiple parts of speech (Figure 2(b)).\nThe STM is able to capture the topical themes and recover parts of speech (with the exception of prepositions, which were placed in the same topic as nouns with a self-loop). Moreover, it was able to identify the same interconnections between latent classes that were apparent from the infinite tree: nouns are dominated by verbs and prepositions, and verbs are the root (head) of sentences.\n\nQualitative description of topics learned from hand-annotated data The same general properties, but with greater variation, are exhibited in real data. We converted the Penn Treebank [10], a corpus of manually curated parse trees, into a dependency parse [18]. The vocabulary was pruned to terms that appeared in at least ten documents.\nFigure 3 shows a subset of topics learned by the STM with truncation level 32. Many of the resulting topics illustrate both syntactic and thematic consistency. A few nonspecific function topics emerged (pronoun, possessive pronoun, general verbs, etc.). Many of the noun categories were more specialized. For instance, Figure 3 shows clusters of nouns relating to media, individuals associated with companies (\u201cmr,\u201d \u201cpresident,\u201d \u201cchairman\u201d), and abstract nouns related to stock prices (\u201cshares,\u201d \u201cquarter,\u201d \u201cearnings,\u201d \u201cinterest\u201d), all of which feed into a topic that modifies nouns (\u201chis,\u201d \u201ctheir,\u201d \u201cother,\u201d \u201clast\u201d). Thematically related topics are separated by both function and theme.\nThis division between functional and topical uses for the latent classes can also be seen in the values for the per-document multinomial over topics. 
A number of topics in Figure 3(b), such as 17, 15, 10, and 3, appear to some degree in nearly every document, while other topics are used more sparingly to denote specialized content. With α = 0.1, this plot also shows that the nonparametric Bayesian framework is ignoring many later topics.\n\nPerplexity To study the performance of the STM on new data, we estimated the held-out probability of previously unseen documents with an STM trained on a portion of the Penn Treebank. For each position in the parse trees, we estimate the probability of the observed word. We compute the perplexity as the exponential of the negative per-word average log probability. The lower the perplexity, the better the model has captured the patterns in the data. We also computed perplexity for individual parts of speech to study the differences in predictive power between content words, such as nouns and verbs, and function words, such as prepositions and determiners. This illustrates how different algorithms better capture aspects of context. We expect function words to be dominated by local context and content words to be determined more by the themes of the document.\nThis trend is seen not only in the synthetic data (Figure 4(a)), where parsing models better predict functional categories like prepositions and document-only models fail to account for patterns of verbs and determiners, but also in real data. Figure 4(b) shows that the HDP and STM both perform better than parsing models in capturing the patterns behind nouns, while both the STM and the infinite tree have lower perplexity for verbs. Like parsing models, our model was better able to\n\n1 In Figure 2 and Figure 3, we mark topics which represent a single part of speech and are essentially the lone representative of that part of speech in the model. 
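The perplexity computation described above is standard: exponentiate the negative per-word average log probability. A minimal sketch (ours):

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(-(1/N) * sum of per-word log probabilities).
    Lower is better: it is the model's effective branching factor."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 1/8 to each of four held-out words has
# perplexity exactly 8.
lp = [math.log(1.0 / 8.0)] * 4
```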
This is a subjective determination by the authors; it does not reflect any specialization or special treatment of topics by the model and is done merely for didactic purposes.\n\n[Figure 2 diagram: topic graphs learned from the synthetic data, showing the top five words of each topic and transition-weight labels, for (a) parse transition only, (b) document multinomial only, and (c) the combination of parse transition and document multinomial.]\n\nFigure 2: Three models were fit to the synthetic data described in Section 4. Each box illustrates the top five words of a topic; boxes that represent homogeneous parts of speech have rounded edges and are shaded. Edges between topics are labeled with estimates of their transition weight π. 
While the infinite tree model (a) is able to reconstruct the parts of speech used to generate the data, it lumps all themes into the same categories. Although the HDP (b) can discover themes of recurring words, it cannot determine the interactions between topics or separate out ubiquitous words that occur in all documents. The STM (c) is able to recover the structure.\n\npredict the appearance of prepositions, but also remained competitive with the HDP on content words. On the whole, the STM had lower perplexity than the HDP and the infinite tree.\n\n5 Discussion\n\nWe have introduced and evaluated the syntactic topic model, a nonparametric Bayesian model of parsed documents. The STM achieves better perplexity than the infinite tree or the hierarchical Dirichlet process and uncovers patterns in text that are both syntactically and thematically consistent.\nThis dual relevance is useful for work in natural language processing. For example, recent work [19, 20] in the domain of word sense disambiguation has attempted to combine syntactic similarity with topical information in an ad hoc manner to improve the predominant sense algorithm [21]. The syntactic topic model offers a principled way to learn both simultaneously rather than combining two heterogeneous methods.\nThe STM is not a full parsing model, but it could be used as a means of integrating document context into parsing models. This work\u2019s central premise is consistent with the direction of recent improvements in parsing technology in that it provides a method for refining the parts of speech present in a corpus. For example, lexicalized parsers [22] create rules specific to individual terms, and grammar refinement [23] divides general roles into multiple, specialized ones. 
The syntactic topic model offers an alternative method of finding more specific rules by grouping together words that appear in similar documents, and it could be extended to a full parser.\n\n[Figure 3 diagram: selected topics learned from the Treebank at truncation level 32, showing (a) sinks and sources among linked topics, with top words such as \u201csays, could, can, did\u201d and \u201cmr, inc, co, president,\u201d and (b) topic usage across documents.]\n\nFigure 3: Selected topics (along with strong links) after a run of the syntactic topic model with a truncation level of 32. As in Figure 2, parts of speech that aren\u2019t subdivided across themes are indicated. In the Treebank corpus (left), head words (verbs) are shared, but the nouns split off into many separate specialized categories before feeding into pronoun sinks. The specialization of topics is also visible in plots of the variational parameter γ normalized for the first 300 documents of the Treebank (right), where three topic columns have been identified. 
Many topics are used to some extent in every document, showing that they are performing a functional role, while others are used more sparingly for semantic content.\n\n[Figure 4 plots: held-out perplexity of the HDP, the infinite tree with independent children, and the STM, broken down by word class (ALL, NOUN, VERB, ADJ, DET, PREP) for (a) the synthetic data and (b) the Treebank.]\n\nFigure 4: After fitting three models on synthetic data, the syntactic topic model has better (lower) perplexity on all word classes except for adjectives. The HDP is better able to capture document-level patterns of adjectives. The infinite tree captures prepositions best, which have no cross-document variation. On real data (Figure 4(b)), the syntactic topic model was able to combine the strengths of the infinite tree on functional categories like prepositions with the strengths of the HDP on content categories like nouns to attain lower overall perplexity.\n\nWhile traditional topic models reveal groups of words that are used in similar documents, the STM uncovers groups of words that are used the same way in similar documents. This decomposition is useful for tasks that require a more fine-grained representation of language than the bag of words can offer or for tasks that require a broader context than parsing models.\n\nReferences\n\n[1] Wei, X., B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. 2006.\n[2] Cai, J. F., W. S. Lee, Y. W. Teh. NUS-ML: Improving word sense disambiguation using topic features. In Proceedings of SemEval-2007. Association for Computational Linguistics, 2007.\n[3] Fei-Fei Li, P. Perona. A Bayesian hierarchical model for learning natural scene categories. 
In CVPR \u201905 - Volume 2, pages 524\u2013531. IEEE Computer Society, Washington, DC, USA, 2005.\n[4] Marlin, B. Modeling user rating profiles for collaborative filtering. In S. Thrun, L. Saul, B. Schölkopf, eds., Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA, 2004.\n[5] Griffiths, T., M. Steyvers. Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, W. Kintsch, eds., Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2006.\n[6] Padó, S., M. Lapata. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161\u2013199, 2007.\n[7] Lin, D. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, pages 296\u2013304. 1998.\n[8] Griffiths, T. L., M. Steyvers, D. M. Blei, et al. Integrating topics and syntax. In L. K. Saul, Y. Weiss, L. Bottou, eds., Advances in Neural Information Processing Systems, pages 537\u2013544. MIT Press, Cambridge, MA, 2005.\n[9] Gruber, A., M. Rosen-Zvi, Y. Weiss. Hidden topic Markov models. In Proceedings of Artificial Intelligence and Statistics. San Juan, Puerto Rico, 2007.\n[10] Marcus, M. P., B. Santorini, M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313\u2013330, 1994.\n[11] Hinton, G. Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks, pages 1\u20136. IEEE, Edinburgh, Scotland, 1999.\n[12] Teh, Y. W., M. I. Jordan, M. J. Beal, et al. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566\u20131581, 2006.\n[13] Blei, D., A. Ng, M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993\u20131022, 2003.\n[14] Finkel, J. R., T. Grenager, C. D. Manning. The infinite tree. 
In Proceedings of the Association for Computational Linguistics, pages 272\u2013279. Association for Computational Linguistics, Prague, Czech Republic, 2007.\n[15] Jordan, M., Z. Ghahramani, T. S. Jaakkola, et al. An introduction to variational methods for graphical models. Machine Learning, 37(2):183\u2013233, 1999.\n[16] Liang, P., S. Petrov, M. Jordan, et al. The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of Empirical Methods in Natural Language Processing, pages 688\u2013697. 2007.\n[17] Boyd, S., L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n[18] Johansson, R., P. Nugues. Extended constituent-to-dependency conversion for English. In NODALIDA. 2007.\n[19] Koeling, R., D. McCarthy. Sussx: WSD using automatically acquired predominant senses. In Proceedings of SemEval-2007. Association for Computational Linguistics, 2007.\n[20] Boyd-Graber, J., D. Blei. PUTOP: Turning predominant senses into a topic model for WSD. In Proceedings of SemEval-2007. Association for Computational Linguistics, 2007.\n[21] McCarthy, D., R. Koeling, J. Weeds, et al. Finding predominant word senses in untagged text. In Proceedings of the Association for Computational Linguistics, pages 280\u2013287. Association for Computational Linguistics, 2004.\n[22] Collins, M. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589\u2013637, 2003.\n[23] Klein, D., C. Manning. Accurate unlexicalized parsing. In Proceedings of the Association for Computational Linguistics, pages 423\u2013430. Association for Computational Linguistics, 2003.\n", "award": [], "sourceid": 319, "authors": [{"given_name": "Jordan", "family_name": "Boyd-Graber", "institution": null}, {"given_name": "David", "family_name": "Blei", "institution": null}]}