{"title": "Word Features for Latent Dirichlet Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 1921, "page_last": 1929, "abstract": "We extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the encoding of side information in the distribution over words. This results in a variety of new capabilities, such as improved estimates for infrequently occurring words, as well as the ability to leverage thesauri and dictionaries in order to boost topic cohesion within and across languages. We present experiments on multi-language topic synchronisation where dictionary information is used to bias corresponding words towards similar topics. Results indicate that our model substantially improves topic cohesion when compared to the standard LDA model.", "full_text": "Word Features for Latent Dirichlet Allocation\n\nJames Petterson1, Alex Smola2, Tiberio Caetano1, Wray Buntine1, Shravan Narayanamurthy3\n\n1NICTA and ANU, Canberra, ACT, Australia\n\n2Yahoo! Research, Santa Clara, CA, USA\n\n3Yahoo! Research, Bangalore, India\n\nAbstract\n\nWe extend Latent Dirichlet Allocation (LDA) by explicitly allowing for the\nencoding of side information in the distribution over words. This results in a variety\nof new capabilities, such as improved estimates for infrequently occurring words,\nas well as the ability to leverage thesauri and dictionaries in order to boost topic\ncohesion within and across languages. We present experiments on multi-language\ntopic synchronisation where dictionary information is used to bias\ncorresponding words towards similar topics. Results indicate that our model substantially\nimproves topic cohesion when compared to the standard LDA model.\n\n1 Introduction\n\nLatent Dirichlet Allocation [4] assigns topics to documents and generates topic distributions over\nwords given a collection of texts. In doing so, it ignores any side information about the similarity\nbetween words. 
Nonetheless, it achieves a surprisingly high quality of coherence within topics.\nThe inability to deal with word features makes LDA fall short on several aspects. The most obvious\none is perhaps that the topics estimated for infrequently occurring words are usually unreliable.\nIdeally, for example, we would like the topics associated with synonyms to have a prior tendency of\nbeing similar, so that in case one of the words is rare but the other is common, the topic estimates\nfor the rare one can be improved. There are other examples. For instance, it is quite plausible that\n\u2019Germany\u2019 and \u2019German\u2019, or \u2019politics\u2019, \u2019politician\u2019, and \u2019political\u2019 should, by default, belong to\nthe same topic. Similarly, we would like to be able to leverage dictionaries in order to boost topic\ncohesion across languages, a problem that has been researched but is far from being fully solved,\nespecially for non-aligned corpora [6]. For example, we know that \u2018democracy\u2019 and \u2018democracia\u2019\nare different words, but it is clear that not leveraging the fact they actually mean the same thing (and\ntherefore should have aligned topics) reduces the statistical strength of a model.\nA possible solution, which we propose in this paper, is to treat word information as features rather\nthan as explicit constraints and to adjust a smoothing prior over topic distributions for words such\nthat correlation is emphasised. In the parlance of LDA we do not pick a globally constant \u03b2 smoother\nover the word multinomials but rather we adjust it according to word similarity. In this way we are\ncapable of learning the prior probability of how words are distributed over various topics based on\nhow similar they are, e.g. 
in the context of dictionaries, synonym collections, thesauri, edit distances,\nor distributional word similarity features.\nUnfortunately, in performing such model extension we lose full tractability of the setting by means\nof a collapsed Gibbs sampler. Instead, we use a hybrid approach where we perform smooth\noptimisation over the word smoothing coefficients, while retaining a collapsed Gibbs sampler to assign\ntopics for a fixed choice of smoothing coefficients. The advantage of this setting is that it is entirely\nmodular and can be added to existing Gibbs samplers without modification.\nWe present experimental results on multi-language topic synchronisation which clearly evidence the\nability of the model to incorporate dictionary information successfully. Using several different\nmeasures of topic alignment, we consistently observe that the proposed model improves substantially on\nstandard LDA, which is unable to leverage this type of information.\n\nFigure 1: LDA: The topic distribution for each word (\u03c8v) has as smoother the Dirichlet distribution with a parameter \u03b2 (independent of the word).\n\nFigure 2: Our Extension: Assume we observe side information \u03c6v (i.e. features) for each word v. The word-specific smoothing parameters \u03b2kv are governed by \u03c6v and a common parameter choice y.\n\n1.1 Related work\n\nLoosely related works that use logistic models to induce structure in generative models are [17],\nwhich proposed a shared logistic normal distribution as a Bayesian prior over probabilistic grammar\nweights, and [10], which incorporated features into unsupervised models using locally normalized\nmodels. More related to our work is [5], which encodes correlations between synonyms, and [1]\nwhich encodes more general correlations. In fact, our proposed model can be seen as a generalisation\nof [1], where we can encode the strength of the links between each pair of words.\nPrevious work on multilingual topic models requires parallelism at either the sentence level ([20])\nor document level ([9], [15]). More recent work [13] relaxes that, but still requires that a significant\nfraction (at least 25%) of the documents are paired up.\nMultilingual topic alignment without parallelism was recently proposed by [6]. Their model requires\na list of matched word pairs m (where each pair has one word in each language) and corresponding\nmatching priors \u03c0 that encode the prior knowledge on how likely the match is to occur. The topics\nare defined as distributions over word pairs, while the unmatched words come from a unigram\ndistribution specific to each language. 
Although their model could be in principle extended to more\nthan two languages their experimental section was focused on the bilingual case.\nOne of the key differences between [6] and our method is that we do not hardcode word\ninformation, but we use it only as a prior \u2013 this way our method becomes less sensitive to errors in the word\nfeatures. Furthermore, our model automatically extends to multiple languages without any\nmodification, aligning topics even for language pairs for which we have no information, as we show in the\nexperimental section for the Portuguese/French pair. Finally, our model is conceptually simpler and\ncan be incorporated as a module in existing LDA implementations.\n\n2 The Model\n\nWe begin by briefly reviewing the LDA model of [4] as captured in Figure 1. It assumes that\n\n$$\\theta_m \\sim \\mathrm{Dir}(\\alpha) \\tag{1a}$$\n$$z_{mn} \\sim \\mathrm{Mult}(\\theta_m) \\tag{1b}$$\n$$\\psi_k \\sim \\mathrm{Dir}(\\beta) \\tag{1c}$$\n$$w_{mn} \\sim \\mathrm{Mult}(\\psi_{z_{mn}}) \\tag{1d}$$\n\nNonparametric extensions in terms of the number of topics can be obtained using Dirichlet\nprocess models [2] regarding the generation of topics. Our extension deals with the word\nsmoother \u03b2. Instead of treating it as a constant for all words we attempt to infer its\nvalues for different words and topics respectively. That is, we assume that (1c) is replaced by\n\n$$\\psi_k \\sim \\mathrm{Dir}(\\beta_k \\mid \\phi, y) \\tag{2a}$$\n$$\\beta \\sim \\mathrm{Logistic}(y; \\phi). \\tag{2b}$$\n\nWe refer to this setting as downstream conditioning, in analogy to the upstream conditioning of [14]\n(which dealt with topical side information over documents). The corresponding graphical model\nis given in Figure 2. The above dependency allows us to incorporate features of words as side\ninformation. For instance, if two words (e.g. \u2019politics\u2019 and \u2019politician\u2019) are very similar then it is\nplausible to assume that their topic distributions should also be quite similar. 
This can be achieved\nby choosing similar \u03b2k,politics and \u03b2k,politician. For instance, both of those coefficients might have\ngreat affinity to \u03b2k,scandal and we might estimate y such that this is achieved.\n\n2.1 Detailed Description\n\nWe now discuss the directed graphical model from Figure 2 in detail. Whenever needed we use the\ncollapsed representation of the model [8], that is we integrate out the parameters \u03b8m and \u03c8kv such\nthat we only need to update \u03b1 and \u03b2 (or indirectly y). We define the standard quantities\n\n$$n^{KV}_{kv} = \\sum_{m,n} \\{z_{mn} = k \\text{ and } w_{mn} = v\\}, \\qquad n^{KM}_{km} = \\sum_{n} \\{z_{mn} = k\\},$$\n\nas well as:\n\n$$n^{K}_{k} = \\sum_{m} n^{KM}_{km}, \\qquad n^{M}_{m} = \\sum_{k} n^{KM}_{km}, \\qquad n^{V}_{v} = \\sum_{k} n^{KV}_{kv}.$$\n\nTopic distribution p(zmn|\u03b8m): We assume that this is a multinomial distribution specific to\ndocument m, that is p(zmn|\u03b8m) = \u03b8m,zmn.\nConjugate distribution p(\u03b8m|\u03b1): This is a Dirichlet distribution with parameters \u03b1, where \u03b1k\ndenotes the smoother for topic k.\nCollapsed distribution p(zm|\u03b1): Integrating out \u03b8m and using conjugacy yields\n\n$$p(z_m \\mid \\alpha) = \\frac{\\Gamma(\\|\\alpha\\|_1)}{\\Gamma(n^{M}_{m} + \\|\\alpha\\|_1)} \\prod_{k=1}^{K} \\frac{\\Gamma(n^{KM}_{km} + \\alpha_k)}{\\Gamma(\\alpha_k)},$$\n\nwhere \u0393 is the gamma function: $\\Gamma(x) = \\int_0^\\infty t^{x-1} e^{-t} \\, dt$.\nWord distribution p(wmn|zmn, \u03c8): We assume that given a topic zmn the word wmn is drawn\nfrom a multinomial distribution \u03c8wmn,zmn. That is p(wmn|zmn, \u03c8) = \u03c8wmn,zmn. This is entirely\nstandard as per the basic LDA model.\n\nConjugate distribution p(\u03c8k|\u03b2k): As by default, we assume that \u03c8k is distributed according to a\nDirichlet distribution with parameters \u03b2k. The key difference is that here we do not assume that all\ncoordinates of \u03b2k are identical. 
Nor do we assume that all \u03b2k are the same.\nCollapsed distribution p(w|z, \u03b2): Integrating out \u03c8k for all topics k yields the following\n\n$$p(w \\mid z, \\beta) = \\prod_{k=1}^{K} \\frac{\\Gamma(\\|\\beta_k\\|_1)}{\\Gamma(n^{K}_{k} + \\|\\beta_k\\|_1)} \\prod_{v=1}^{V} \\frac{\\Gamma(n^{KV}_{kv} + \\beta_{kv})}{\\Gamma(\\beta_{kv})}$$\n\n2.2 Priors\n\nIn order to better control the capacity of our model, we impose a prior on naturally related words,\ne.g. the (\u2019Toyota\u2019, \u2019Kia\u2019) and the (\u2019Bush\u2019, \u2019Cheney\u2019) tuples, rather than generally related words.\nFor this purpose we design a similarity graph G(V, E) with words represented as vertices V and\nsimilarity edge weights \u03c6uv between vertices u, v \u2208 V whenever u is related to v. In particular, the\nmagnitude of \u03c6uv can denote the similarity between words u and v.\nIn the following we denote by ykv the topic dependent smoothing coefficients for a given word v\nand topic k. We impose the smoother\n\n$$\\log \\beta_{kv} = y_{kv} + y_v \\quad \\text{and} \\quad \\log p(\\beta) = -\\frac{1}{2\\lambda^2} \\left[ \\sum_{v,v',k} \\phi_{v,v'} (y_{kv} - y_{kv'})^2 + \\sum_{v} y_v^2 \\right]$$\n\nwhere log p(\u03b2) is given up to an additive constant and yv allows for multiplicative topic-unspecific\ncorrections. A similar model was used by [3] to capture temporal dependence between topic\nmodels computed at different time instances, e.g. when dealing with topic drift over several years in a\nscientific journal. There the vertices are words at a given time and the edges are between smoothers\ninstantiated at subsequent years.\n\n3 Inference\n\nIn analogy to the collapsed sampler of [8] we also represent the model in a collapsed fashion. 
That\nis, we integrate out the random variables \u03b8m (the document topic distributions) and \u03c8kv (the topic\nword distributions), which leads to a joint likelihood in terms of the actual words wmn, the side\ninformation \u03c6 about words, the latent variable y, the smoothing hyperprior \u03b2kv, and finally, the\ntopic assignments zmn.\n\n3.1 Document Likelihood\n\nThe likelihood contains two terms: a word-dependent term which can be computed on the fly while\nresampling data1, and a model-dependent term involving the topic counts and the word-topic counts\nwhich can be computed by one pass through the aggregate tables respectively. Let us first write out\nthe uncollapsed likelihood in terms of z, \u03b8, \u03c8, \u03b1, \u03b2. We have\n\n$$p(w, z, \\theta, \\psi \\mid \\alpha, \\beta) = \\prod_{k=1}^{K} p(\\psi_k \\mid \\beta) \\prod_{m=1}^{M} p(\\theta_m \\mid \\alpha) \\prod_{m=1}^{M} \\prod_{n=1}^{N_m} p(w_{mn} \\mid z_{mn}, \\psi) \\, p(z_{mn} \\mid \\theta_m)$$\n\nDefine $\\bar{\\alpha} := \\|\\alpha\\|_1$ and $\\bar{\\beta}_k := \\|\\beta_k\\|_1$. Integrating out \u03b8 and \u03c8 yields\n\n$$p(w, z \\mid \\alpha, \\beta) = \\prod_{m=1}^{M} \\left[ \\frac{\\Gamma(\\bar{\\alpha})}{\\Gamma(\\bar{\\alpha} + n^{M}_{m})} \\prod_{k : n^{KM}_{km} \\neq 0} \\frac{\\Gamma(\\alpha_k + n^{KM}_{km})}{\\Gamma(\\alpha_k)} \\right] \\prod_{k=1}^{K} \\left[ \\frac{\\Gamma(\\bar{\\beta}_k)}{\\Gamma(\\bar{\\beta}_k + n^{K}_{k})} \\prod_{v : n^{KV}_{kv} \\neq 0} \\frac{\\Gamma(\\beta_{kv} + n^{KV}_{kv})}{\\Gamma(\\beta_{kv})} \\right]$$\n\nThe above product is obtained simply by canceling out terms in denominator and numerator where\nthe counts vanish. This is computationally significant, since it allows us to evaluate the normalization\nfor sparse count tables with cost linear in the number of nonzero coefficients rather than cost in the\ndense count table.\n\n3.2 Collapsed Sampler\nIn order to perform inference we need two components: a sampler which is able to draw from\np(zi = k|w, z\u00aci, \u03b1, \u03b2)2, and an estimation procedure for (\u03b2, y). The sampler is essentially the same\nas in standard LDA. 
For the count variables nKM, nKV, nK and nM we denote by the subscript \u2018\u2212\u2019\ntheir values after the word wmn and associated topic zmn have been removed from the statistics.\nStandard calculations yield the following topic probability for resampling, where v = wmn:\n\n$$p(z_{mn} = k \\mid \\text{rest}) \\propto \\frac{\\beta_{kv} + n^{KV}_{kv,-}}{n^{K}_{k,-} + \\bar{\\beta}_k} \\left( n^{KM}_{km,-} + \\alpha_k \\right) \\tag{6}$$\n\nIn the appendix we detail how to adapt the sampler of [19] to obtain faster sampling.\n\n3.3 Topic Smoother for \u03b2\nOptimizing over y is considerably hard since the log-likelihood does not decompose efficiently. This\nis due to the dependence of \u00af\u03b2k on all words in the dictionary. The data-dependent contribution to\nthe negative log-likelihood is\n\n$$L_\\beta = \\sum_{k=1}^{K} \\left[ \\log \\Gamma(\\bar{\\beta}_k + n^{K}_{k}) - \\log \\Gamma(\\bar{\\beta}_k) \\right] + \\sum_{k=1}^{K} \\sum_{v : n^{KV}_{kv} \\neq 0} \\left[ \\log \\Gamma(\\beta_{kv}) - \\log \\Gamma(\\beta_{kv} + n^{KV}_{kv}) \\right]$$\n\nwith gradients given by the appropriate derivatives of the \u0393 function. We use the prior from section\n2.2, which smooths between closely related words only. After choosing edges \u03c6uv according to\nthese matching words, we obtain an optimisation problem directly in terms of the variables ykv and\nyv. Denote by N(v) the neighbours for word v in G(V, E), and \u03a5(x) := \u2202x log \u0393(x) the Digamma\nfunction. We have\n\n$$\\partial_{y_{kv}} \\left[ L_\\beta - \\log p(\\beta) \\right] = \\frac{1}{\\lambda^2} \\sum_{v' \\in N(v)} \\phi_{v,v'} \\left[ y_{kv} - y_{kv'} \\right] + \\beta_{kv} \\left( \\Upsilon(\\bar{\\beta}_k + n^{K}_{k}) - \\Upsilon(\\bar{\\beta}_k) + \\{ n^{KV}_{kv} > 0 \\} \\left[ \\Upsilon(\\beta_{kv}) - \\Upsilon(\\beta_{kv} + n^{KV}_{kv}) \\right] \\right).$$\n\nThe gradient with respect to yv is analogous.\n\n1Note that this is not entirely correct: the model changes slightly during one resampling pass, hence the\nlog-likelihood that we compute is effectively the averaged log-likelihood due to an ongoing sampler. 
For a\ncorrect computation we would need to perform one pass through the data without resampling. Since this is\nwasteful, we choose the approximation instead.\n\n2Here zi denotes the topic of word i, and z\u00aci the topics of all words in the corpus except for i.\n\n4\n\n\f4 Experiments\n\nTo demonstrate the usefulness of our model we applied it to a multi-lingual document collection,\nwhere we can show a substantial improvement over the standard LDA model on the coordination\nbetween topics of different languages.\n\n4.1 Dataset\n\nSince our goal is to compare topic distributions on different languages we used a parallel corpus\n[11] with the proceedings of the European Parliament in 11 languages. We focused on two language\npairs: English/French and English/Portuguese.\nNote that a parallel corpus is not necessary for the application of the proposed model \u2013 it is being\nused here only because it allows us to properly evaluate the effectiveness of our model.3\nWe treated the transcript of each speaker in each session as a document, since different speakers\nusually talk about different topics. We randomly sampled 1000 documents from each language,\nremoved infrequent4 and frequent5 words and kept only the documents with at least 20 words. Fi-\nnally, we removed all documents that lost their corresponding translations in this process. After this\npreprocessing we were left with 2415 documents, 805 in each language, and a vocabulary size of\n23883 words.\n\n4.2 Baselines\n\nWe compared our model to standard LDA, learning \u03b1 and \u03b2, both asymmetric6.\n\n4.3 Prior\n\nWe imposed the graph based prior mentioned in Section 2.2. To build our similarity graph we used\nthe English-French and English-Portuguese dictionaries from http://wiki.webz.cz/dict/,\naugmented with translations from Google Translate for the most frequent words in our dataset. 
As\ndescribed earlier, each word corresponds to a vertex, with an edge7 whenever two words match in\nthe dictionary.\nIn our model \u03b2 = exp(ykv + yv), so we want to keep both ykv and yv reasonably low to avoid\nnumerical problems, as a large value of either would lead to overflows. We ensure that by fixing \u03bb,\nthe standard deviation of their prior, to one in all experiments. We did the same for the standard\nLDA model, where to learn an asymmetric beta we simply removed ykv to get \u03b2 = exp(yv).\n\n4.4 Methodology\n\nIn our experiments we used all the English documents and a subset of the French and Portuguese\nones \u2013 this is what we have in a real application, when we try to learn a topic model from web pages:\nthe number of pages in English is far greater than in any other language.\nWe compared three approaches. First, we run the standard LDA model with all documents mixed\ntogether \u2013 this is one of our baselines, which we call STD1.\nNext we run our proposed model, but with a slight modification to the setup: in the first half of the\niterations of the Gibbs sampler we include only English documents; in the second half we add the\nFrench and Portuguese ones to the mix.8\n\n3To emphasise this point, later in this section we show experiments with non-parallel corpora, in which case\nwe have to rely on visual inspection to assess the outcomes.\n\n4Words that occurred less than 3 times in the corpus.\n5Words that occurred more than M/10 times in the corpus, where M is the total number of documents.\n6That is, we don\u2019t assume all coordinates of \u03b1 and \u03b2 are identical.\n7All edges have a fixed weight of one in this case.\n8We need to start with only one language so that an initial topic-word distribution is built; once that is done\nthe priors are learned and can be used to guide the topic-word distributions in other languages.\n\nFinally, as a control experiment we run the standard LDA model 
in this same setting: first English\ndocuments, then all languages mixed. We call this STD2.\nIn all experiments we run the Gibbs sampler for a total of 3000 iterations, with the number of topics\nfixed to 20, and keep the last sample. After a burn-in of 500 iterations, the optimisation over the word\nsmoothing coefficients is done every 100 iterations, using an off-the-shelf L-BFGS [12] optimizer.9\nWe repeat every experiment 5 times with different randomisations.\n\n4.5 Evaluation\n\nEvaluation of topic models is an open problem \u2013 recent work [7] suggests that popular measures\nbased on held-out likelihood, such as perplexity, do not capture whether topics are coherent or\nnot. Furthermore, we need a set of measures that can assess whether or not we improved over\nthe standard LDA model w.r.t. our goal \u2013 to synchronize topics across different languages \u2013 and\nthere\u2019s no reason to believe that likelihood measures would assess that: a model where topics are\nsynchronized across languages is not necessarily more likely than a model that is not synchronized.\nTherefore, to evaluate our model we compare the topic distributions of each English document with\nits corresponding French pair (and analogously for the other combinations: English/Portuguese and\nFrench/Portuguese), with these metrics:\n\nMean \u21132 distance:\n\n$$\\frac{1}{|L_1|} \\sum_{d_1 \\in L_1, d_2 = F(d_1)} \\left( \\sum_{k=1}^{K} \\left( \\theta^{d_1}_k - \\theta^{d_2}_k \\right)^2 \\right)^{1/2}$$\n\nwhere L1 denotes the set of documents in the first language, F a mapping from a document\nin the first language to its corresponding translation in the second language and \u03b8d the topic\ndistribution of document d.\n\nMean Hellinger distance:\n\n$$\\frac{1}{|L_1|} \\sum_{d_1 \\in L_1, d_2 = F(d_1)} \\sum_{k=1}^{K} \\left( \\sqrt{\\theta^{d_1}_k} - \\sqrt{\\theta^{d_2}_k} \\right)^2$$\n\nAgreements on first topic:\n\n$$\\frac{1}{|L_1|} \\sum_{d_1 \\in L_1, d_2 = F(d_1)} I\\left( \\arg\\max_k \\theta^{d_1}_k, \\arg\\max_k \\theta^{d_2}_k \\right)$$\n\nwhere I is the indicator function \u2013 that is, the proportion of document pairs 
where the most\nlikely topic is the same for both languages.\n\nMean number of agreements in top 5 topics:\n\n$$\\frac{1}{|L_1|} \\sum_{d_1 \\in L_1, d_2 = F(d_1)} \\mathrm{agreements}(d_1, d_2)$$\n\nwhere agreements(d1, d2) is the cardinality of the intersection of the 5 most likely topics\nof d1 and d2.\n\n4.6 Results\n\nIn Figure 3 we compare our method (DC) to the standard LDA model (STD1 and STD2, see section\n4.4), for the English-French pair10. In all metrics our proposed model shows a substantial\nimprovement over the standard LDA model.\nIn Figures 4 and 5 we do the same for the English-Portuguese and Portuguese-French pairs,\nrespectively, with similar results. Note that we did not use a Portuguese-French dictionary in any\nexperiment.\nIn Figure 6 we plot the word smoothing prior for the English word democracy and its French and\nPortuguese translations, d\u00e9mocratie and democracia, for both the standard LDA model (STD1) and\nour model (DC), with 20% of the French and Portuguese documents used in training. In STD1\nwe don\u2019t have topic-specific priors (hence the horizontal line) and the word democracy has a much\nhigher prior, because it happens more often in the dataset (since we have all English documents and\nonly 20% of the French and Portuguese ones). In DC, however, the priors are topic-specific and\nquite similar, as this is enforced by the similarity graph.\n\n9http://www.chokkan.org/software/liblbfgs\n10See the Appendix for run times.\n\nTo emphasize that we do not need a parallel corpus we ran a second experiment where we selected\nthe same number of documents of each language, but ensuring that for each document its\ncorresponding translations are not in the dataset, and trained our model (DC) with 100 topics. This could\nbe done with any multilingual corpus, since no parallelization is required. 
In this case, however, we\ncannot compute the distance metrics as before, since we have no information on the actual topic\ndistributions of the documents. The best we can hope to do is to visually inspect the most likely words\nfor the learned topics. This is shown in Table 1, for some selected topics, where the synchronization\namongst the different languages is clear.\n\nFigure 3: Comparison of topic distributions in English and French documents. See text for details. [Four panels: mean l2-distance, mean Hellinger distance, % agreements on first topic, mean no. agreements in top 5 topics; x-axis: % of French documents; curves: STD1, STD2, DC.]\n\nFigure 4: Comparison of topic distributions in English and Portuguese documents. See text. [Same four panels; x-axis: % of Portuguese documents; curves: STD1, STD2, DC.]\n\nFigure 5: Comparison of topic distributions in Portuguese and French documents. See text. [Same four panels; x-axis: % of Portuguese/French documents; curves: STD1, STD2, DC.]\n\nFigure 6: Word smoothing prior for two words in the standard LDA and in our model. The x-axis is\nthe index to the topic. See text for details. [Two panels: STD1 and DC; y-axis: log(\u03b2); curves: democracy, democracia, d\u00e9mocratie.]\n\n5 Extensions: Other Features\nAlthough we have implemented a specific type of feature encoding for the words, our model admits\na large range of applications through a suitable choice of features. In the following we discuss a\nnumber of them in greater detail.\n\nTable 1: Top 10 words for some of the learned topics (from top to bottom, respectively, topics 8, 17,\n20, 32, 49). Words are colored according to their language \u2013 English, Portuguese or French \u2013 except\nwhen ambiguous (e.g., information is a word in both French and English). 
See text for details.\n\nTopic 8: amendments, altera\u00e7\u00f5es, amendment, amendements, altera\u00e7\u00e3o, use, substances, r\u00e8glement, l\u2019amendement, accept\nTopic 17: \u00e9lections, electoral, elections, d\u00e9put\u00e9s, elei\u00e7\u00f5es, partis, proportional, eleitoral, transnational, scrutin\nTopic 20: informa\u00e7\u00e3o, information, regi\u00f5es, soci\u00e9t\u00e9, l\u2019information, acesso, aeroplanes, prix, r\u00e9gions, comunica\u00e7\u00e3o\nTopic 32: stability, coordination, estabilidade, central, coordena\u00e7\u00e3o, plans, objectivo, stabilit\u00e9, ue, list\nTopic 49: monnaie, consumers, consumidores, consommateurs, l\u2019euro, crois, s\u2019agit, moeda, pouvoir, currency\n\n5.1 Single Language\n\nDistributional Similarity: The basic idea is that words are similar if they occur in a similar context\n[16]. Hence, one could build a graph as outlined in Section 2.2 with edges only between words\nwhich exceed a level of proximity.\nLexical Similarity: For interpolation between words one could use a distribution over substrings\nof a word as the feature map. This is essentially what is proposed by [18]. Such lexical similarity\nmakes the sampler less sensitive to issues such as stemming: after all, two words which reduce to\nthe same stem will also have a high lexical similarity score, hence the estimated \u03b2kv will yield very\nsimilar topic assignments.\nSynonyms and Thesauri: Given a list of synonyms it is reasonable to assume that they belong\nto related topics. This can be achieved by adding edges between a word and all of its synonyms.\nSince in our framework we only use this information to shape a prior, errors in the synonym list and\nmultiple meanings of a word will not prove fatal.\n\n5.2 Multiple Languages\n\nLexical Similarity: Similar considerations apply for inter-lingual topic models. It is reasonable to\nassume that lexical similarity generally points to similarity in meaning. 
Using such features should\nallow one to synchronise topics even in the absence of dictionaries. However, it is important that\nsimilarities are not hardcoded but only imposed as a prior on the topic distribution (e.g., \u2019gift\u2019 has\ndifferent meanings in English and German).\n\n6 Discussion\n\nIn this paper we described a simple yet general formalism for incorporating word features into LDA,\nwhich among other things allows us to synchronise topics across different languages. We performed\na number of experiments in the multiple-language setting, in which the goal was to show that our\nmodel is able to incorporate dictionary information in order to improve topic alignment across dif-\nferent languages. Our experimental results reveal substantial improvement over the LDA model in\nthe quality of topic alignment, as measured by several metrics, and in particular we obtain much\nimproved topic alignment even across languages for which a dictionary is not used (as described in\nthe Portuguese/French plots, see Figure 5). We also showed that the algorithm is quite effective even\nin the absence of documents that are explicitly denoted as being aligned (see Table 1). This sets it\napart from [13], which requires that a signi\ufb01cant fraction (at least 25%) of documents are paired up.\nAlso, the model is not limited to lexical features. Instead, we could for instance also exploit syn-\ntactical information such as parse trees. For instance, noun / verb disambiguation or named entity\nrecognition are all useful in determining the meaning of words and therefore it is quite likely that\nthey will also aid in obtaining an improved topical mixture model.\n\nAcknowledgements\n\nNICTA is funded by the Australian Government as represented by the Department of Broadband,\nCommunications and the Digital Economy and the Australian Research Council through the ICT\nCentre of Excellence program.\n\n8\n\n\fReferences\n[1] David Andrzejewski, Xiaojin Zhu, and Mark Craven. 
Incorporating domain knowledge into\ntopic modeling via Dirichlet Forest priors. In ICML, pages 1\u20138. ACM Press, 2009.\n\n[2] C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric\nproblems. Annals of Statistics, 2:1152\u20131174, 1974.\n\n[3] David M. Blei and John D. Lafferty. Dynamic topic models. In W. W. Cohen and A. Moore,\neditors, ICML, volume 148, pages 113\u2013120. ACM, 2006.\n\n[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of\nMachine Learning Research, 3:993\u20131022, January 2003.\n\n[5] Jordan Boyd-Graber, David Blei, and Xiaojin Zhu. A Topic Model for Word Sense\nDisambiguation. In EMNLP-CoNLL, pages 1024\u20131033, 2007.\n\n[6] Jordan Boyd-Graber and David M. Blei. Multilingual topic models for unaligned text. In\nProceedings of the 25th Conference in Uncertainty in Artificial Intelligence (UAI), 2009.\n\n[7] Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David Blei. Reading\ntea leaves: How humans interpret topic models. In Y. Bengio, D. Schuurmans, J. Lafferty,\nC. K. I. Williams, and A. Culotta, editors, NIPS, pages 288\u2013296. 2009.\n\n[8] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National\nAcademy of Sciences, 101:5228\u20135235, 2004.\n\n[9] Woosung Kim and Sanjeev Khudanpur. Lexical triggers and latent semantic analysis for\ncrosslingual language model adaptation. ACM Transactions on Asian Language Information\nProcessing, 3, 2004.\n\n[10] T.B. Kirkpatrick, A.B. C\u00f4t\u00e9, J. DeNero, and Dan Klein. Painless Unsupervised Learning with\nFeatures. In Human Language Technologies: The 2010 Annual Conference of the North\nAmerican Chapter of the Association for Computational Linguistics, 2010.\n\n[11] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Machine\nTranslation Summit X, pages 79\u201386, 2005.\n\n[12] Dong C. 
Liu and Jorge Nocedal. On the limited memory BFGS method for large scale\noptimization. Mathematical Programming, 45(3):503\u2013528, 1989.\n\n[13] David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew\nMcCallum. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods\nin Natural Language Processing, pages 880\u2013889, Singapore, August 2009. ACL.\n\n[14] David M. Mimno and Andrew McCallum. Topic models conditioned on arbitrary features\nwith Dirichlet-multinomial regression. In D. A. McAllester and P. Myllym\u00e4ki, editors, UAI,\nProceedings of the 24th Conference in Uncertainty in Artificial Intelligence, pages 411\u2013418.\nAUAI Press, 2008.\n\n[15] Xiaochuan Ni, Jian-Tao Sun, Jian Hu, and Zheng Chen. Mining multilingual topics from\nWikipedia. In 18th International World Wide Web Conference, pages 1155\u20131155, April 2009.\n\n[16] Patrick Pantel and Dekang Lin. Discovering word senses from text. In David Hand, Daniel\nKeim, and Raymond Ng, editors, Proceedings of the Eighth ACM SIGKDD International\nConference on Knowledge Discovery and Data Mining, pages 613\u2013619, New York, July 2002.\nACM Press.\n\n[17] Noah A. Smith and Shay B. Cohen. The Shared Logistic Normal Distribution for Grammar\nInduction. In NIPS Workshop on Speech and Language: Unsupervised Latent-Variable Models,\npages 1\u20134, 2008.\n\n[18] S. V. N. Vishwanathan and A. J. Smola. Fast kernels for string and tree matching. In S. Becker,\nS. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15,\npages 569\u2013576. MIT Press, Cambridge, MA, 2003.\n\n[19] Limin Yao, David Mimno, and Andrew McCallum. Efficient methods for topic model\ninference on streaming document collections. In KDD\u201909, 2009.\n\n[20] Bing Zhao and Eric P. Xing. 
BiTAM: Bilingual Topic AdMixture Models for Word Alignment.\nIn Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics\n(ACL\u201906), 2006.", "award": [], "sourceid": 566, "authors": [{"given_name": "James", "family_name": "Petterson", "institution": null}, {"given_name": "Wray", "family_name": "Buntine", "institution": null}, {"given_name": "Shravan", "family_name": "Narayanamurthy", "institution": null}, {"given_name": "Tib\u00e9rio", "family_name": "Caetano", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}]}