{"title": "Symmetric Correspondence Topic Models for Multilingual Text Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1286, "page_last": 1294, "abstract": "Topic modeling is a widely used approach to analyzing large text collections. A small number of multilingual topic models have recently been explored to discover latent topics among parallel or comparable documents, such as in Wikipedia. Other topic models that were originally proposed for structured data are also applicable to multilingual documents. Correspondence Latent Dirichlet Allocation (CorrLDA) is one such model; however, it requires a pivot language to be specified in advance. We propose a new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden variable to control a pivot language, in an extension of CorrLDA. We experimented with two multilingual comparable datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more effective than some other existing multilingual topic models.", "full_text": "Symmetric Correspondence Topic Models for\n\nMultilingual Text Analysis\n\n\u2020\nKosuke Fukumasu\n\n\u2020\nKoji Eguchi\n\nEric P. Xing\n\n\u2021\n\nGraduate School of System Informatics, Kobe University, Kobe 657-8501, Japan\n\n\u2020\n\nSchool of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA\n\n\u2021\n\nfukumasu@cs25.scitec.kobe-u.ac.jp, eguchi@port.kobe-u.ac.jp, epxing@cs.cmu.edu\n\nAbstract\n\nTopic modeling is a widely used approach to analyzing large text collections. A\nsmall number of multilingual topic models have recently been explored to dis-\ncover latent topics among parallel or comparable documents, such as in Wikipedia.\nOther topic models that were originally proposed for structured data are also ap-\nplicable to multilingual documents. Correspondence Latent Dirichlet Allocation\n(CorrLDA) is one such model; however, it requires a pivot language to be speci-\n\ufb01ed in advance. 
We propose a new topic model, Symmetric Correspondence LDA\n(SymCorrLDA), that incorporates a hidden variable to control a pivot language,\nin an extension of CorrLDA. We experimented with two multilingual compara-\nble datasets extracted from Wikipedia and demonstrate that SymCorrLDA is more\ne\ufb00ective than some other existing multilingual topic models.\n\n1 Introduction\n\nTopic models (also known as mixed-membership models) are a useful method for analyzing large\ntext collections [1, 2]. In topic modeling, each document is represented as a mixture of topics, where\neach topic is represented as a word distribution. Latent Dirichlet Allocation (LDA) [2] is one of the\nwell-known topic models. Most topic models assume that texts are monolingual; however, some\ncan capture statistical dependencies between multiple classes of representations and can be used for\nmultilingual parallel or comparable documents. Here, a parallel document is a merged document\nconsisting of multiple language parts that are translations from one language to another, sometimes\nincluding sentence-to-sentence or word-to-word alignments. A comparable document is a merged\ndocument consisting of multiple language parts that are not translations of each other but instead\ndescribe similar concepts and events. Recently published multilingual topic models [3, 4], which\nare the equivalent of Conditionally Independent LDA (CI-LDA) [5, 6], can discover latent topics\namong parallel or comparable documents. SwitchLDA [6] was modeled by extending CI-LDA. It\ncan control the proportions of languages in each multilingual topic. However, both CI-LDA and\nSwitchLDA preserve dependencies between languages only by sharing per-document multinomial\ndistributions over latent topics, and accordingly the resulting dependencies are relatively weak.\n\nCorrespondence LDA (CorrLDA) [7] is another type of topic model for structured data represented\nin multiple classes. 
It was originally proposed for annotated image data to simultaneously model\nwords and visual features, and it can also be applied to parallel or comparable documents. In the\nmodeling, it first generates topics for visual features in an annotated image. Then only the topics\nassociated with the visual features in the image are used to generate words. In this sense, visual\nfeatures can be said to be the pivot in modeling annotated image data. However, when CorrLDA\nis applied to multilingual documents, a language that plays the role of the pivot (a pivot language1)\nmust be specified in advance. The quality of the multilingual topics estimated with CorrLDA is\nsensitive to the pivot language selected. For example, a translation of a Japanese book into English\nwould presumably have Japanese as its pivot, but a set of international news stories would\nhave pivots that differ based on the country an article is about. It is often difficult to appropriately\nselect the pivot language. To address this problem, which we call the pivot problem, we propose\na new topic model, Symmetric Correspondence LDA (SymCorrLDA), that incorporates a hidden\nvariable to control the pivot language, in an extension of CorrLDA. SymCorrLDA addresses this\nproblem of CorrLDA by inferring an appropriate pivot language from the data.\n\n1 Note that the term \u2018pivot language\u2019 does not have exactly the same meaning as that commonly used in the\nmachine translation community, where it means an intermediary language for translation between more than\nthree languages.\n\nWe evaluate various multilingual topic models, i.e., CI-LDA, SwitchLDA, CorrLDA, and our\nSymCorrLDA, as well as LDA, using comparable articles in different languages (English, Japanese, and\nSpanish) extracted from Wikipedia. 
We \ufb01rst demonstrate through experiments that CorrLDA outper-\nforms the other existing multilingual topic models mentioned, and then show that our SymCorrLDA\nworks more e\ufb00ectively than CorrLDA in any case of selecting a pivot language.\n\n2 Multilingual Topic Models with Multilingual Comparable Documents\n\nBilingual topic models for bilingual parallel documents that have word-to-word alignments have\nbeen developed, such as those by [8]. Their models are directed towards machine translation, where\nword-to-word alignments are involved in the generative process. In contrast, we focus on analyzing\ndependencies among languages by modeling multilingual comparable documents, each of which\nconsists of multiple language parts that are not translations of each other but instead describe similar\nconcepts and events. The target documents can be parallel documents, but word-to-word alignments\nare not taken into account in the topic modeling. Some other researchers explored di\ufb00erent types\nof multilingual topic models that are based on the premise of using multilingual dictionaries or\nWordNet [9, 10, 11]. In contrast, CI-LDA and SwitchLDA only require multilingual comparable\ndocuments that can be easily obtained, such as from Wikipedia, when we use those models for\nmultilingual text analysis. This is more similar to the motivation of this paper. Below, we introduce\nLDA-style topic models that handle multiple classes and can be applied to multilingual comparable\ndocuments for the above-mentioned purposes.\n\n2.1 Conditionally Independent LDA (CI-LDA)\n\nCI-LDA [5, 6] is an extension of the LDA model to handle multiple classes, such as words and\ncitations in scienti\ufb01c articles. The CI-LDA framework was used to model multilingual parallel or\ncomparable documents by [3] and [4]. Figure 1(b) shows a graphical model representation of CI-\nLDA for documents in L languages, and Figure 1(a) shows that of LDA for reference. 
D, T, and\nN_d respectively indicate the number of documents, number of topics, and number of word tokens\nthat appear in a specific language part in a document d. The superscript \u2018(\u00b7)\u2019 indicates the variables\ncorresponding to a specific language part in a document d. For better understanding, we show below\nthe process of generating a document according to the graphical model of the CI-LDA model.\n\n1. For all D documents, sample \u03b8_d \u223c Dirichlet(\u03b1)\n2. For all T topics and for all L languages, sample \u03d5_t^{(\u2113)} \u223c Dirichlet(\u03b2^{(\u2113)})\n3. For each of the N_d^{(\u2113)} words w_i^{(\u2113)} in language \u2113 (\u2113 \u2208 {1,\u00b7\u00b7\u00b7,L}) of document d:\n   a. Sample a topic z_i^{(\u2113)} \u223c Multinomial(\u03b8_d)\n   b. Sample a word w_i^{(\u2113)} \u223c Multinomial(\u03d5^{(\u2113)}_{z_i^{(\u2113)}})\n\nFor example, when we deal with Japanese and English bilingual data, w^{(1)} and w^{(2)} are a Japanese\nand an English word, respectively. CI-LDA preserves dependencies between languages only by\nsharing the multinomial distributions with parameters \u03b8_d. Accordingly, there are substantial chances\nthat some topics are assigned only to a specific language part in each document, and the resulting\ndependencies are relatively weak.\n\n2.2 SwitchLDA\n\nSimilarly to CI-LDA, SwitchLDA [6] can be applied to multilingual comparable documents.\nHowever, unlike CI-LDA, SwitchLDA can adjust the proportions of multiple different languages\nfor each topic, according to a binomial distribution for bilingual data or a multinomial\ndistribution for data of three or more languages. Figure 1(c) depicts a graphical model representation of\nSwitchLDA for documents in L languages. 
The generative process is described below.\n\n1. For all D documents, sample \u03b8_d \u223c Dirichlet(\u03b1)\n2. For all T topics:\n   a. For all L languages, sample \u03d5_t^{(\u2113)} \u223c Dirichlet(\u03b2^{(\u2113)})\n   b. Sample \u03c8_t \u223c Dirichlet(\u03b7)\n3. For each of the N_d words w_i in document d:\n   a. Sample a topic z_i \u223c Multinomial(\u03b8_d)\n   b. Sample a language label s_i \u223c Multinomial(\u03c8_{z_i})\n   c. Sample a word w_i \u223c Multinomial(\u03d5^{(s_i)}_{z_i})\n\nHere, \u03c8_t indicates a multinomial parameter to adjust the proportions of L different languages for\ntopic t. If all components of hyperparameter vector \u03b7 are large enough, SwitchLDA becomes\nequivalent to CI-LDA. SwitchLDA is an extension of CI-LDA to give emphasis or de-emphasis to specific\nlanguages for each topic. Therefore, SwitchLDA may represent multilingual topics more flexibly;\nhowever, it still has the drawback that the dependencies between languages are relatively weak.\n\nFigure 1: Graphical model representations of (a) LDA, (b) CI-LDA, and (c) SwitchLDA\n\n2.3 Correspondence LDA (CorrLDA)\n\nCorrLDA [7] can also be applied to multilingual comparable documents. In the multilingual setting,\nthis model first generates topics for one language part of a document. We refer to this language as a\npivot language. For the other languages, the model then uses the topics that were already generated\nin the pivot language. Figure 2(a) shows a graphical model representation of CorrLDA assuming L\nlanguages, when p is the pivot language that is specified in advance. Here, N_d^{(\u2113)} (\u2113 \u2208 {p, 2,\u00b7\u00b7\u00b7,L})\ndenotes the number of words in language \u2113 in document d. 
The generative process is shown below:\n\n1. For all D documents\u2019 pivot language parts, sample \u03b8_d^{(p)} \u223c Dirichlet(\u03b1^{(p)})\n2. For all T topics and for all L languages (including the pivot language), sample \u03d5_t^{(\u2113)} \u223c Dirichlet(\u03b2^{(\u2113)})\n3. For each of the N_d^{(p)} words w_i^{(p)} in the pivot language p of document d:\n   a. Sample a topic z_i^{(p)} \u223c Multinomial(\u03b8_d^{(p)})\n   b. Sample a word w_i^{(p)} \u223c Multinomial(\u03d5^{(p)}_{z_i^{(p)}})\n4. For each of the N_d^{(\u2113)} words w_i^{(\u2113)} in language \u2113 (\u2113 \u2208 {2,\u00b7\u00b7\u00b7,L}) of document d:\n   a. Sample a topic y_i^{(\u2113)} \u223c Uniform(z_1^{(p)},\u00b7\u00b7\u00b7, z_{N_d^{(p)}}^{(p)})\n   b. Sample a word w_i^{(\u2113)} \u223c Multinomial(\u03d5^{(\u2113)}_{y_i^{(\u2113)}})\n\nThis model can capture more direct dependencies between languages, due to the constraints that\ntopics have to be selected from the topics selected in the pivot language parts. However, when CorrLDA\nis applied to multilingual documents, a pivot language must be specified in advance. Moreover, the\nquality of the multilingual topics estimated with CorrLDA is sensitive to the pivot language selected.\n\nFigure 2: Graphical model representations of (a) CorrLDA, (b) SymCorrLDA, and (c) its variant\n\n3 Symmetric Correspondence Topic Models\n\nWhen CorrLDA is applied to parallel or comparable documents, this model first generates topics\nfor one language part of a document; we refer to this language as the pivot language. For the\nother languages, the model then uses the topics that were already generated in the pivot language.\nCorrLDA has the great advantage that it can capture more direct dependency between languages;\nhowever, it has a disadvantage that it requires a pivot language to be specified in advance. 
Since\nthe pivot language may differ based on the subject, such as the country a document is about, it is\noften difficult to appropriately select the pivot language.\nTo address this problem, we propose\nSymmetric Correspondence LDA (SymCorrLDA). This model generates a flag that specifies a pivot\nlanguage for each word, adjusting the probability of being pivot languages in each language part\nof a document according to a binomial distribution for bilingual data or a multinomial distribution\nfor data of three or more languages. In other words, SymCorrLDA estimates from the data the\nbest pivot language at the word level in each document. The pivot language flags may be assigned\nto the words in the originally written portions in each language, since the original portions may be\ndescribed confidently and with rich vocabulary. Figure 2(b) shows a graphical model representation\nof SymCorrLDA. SymCorrLDA\u2019s generative process is shown as follows, assuming L languages:\n\n1. For all D documents:\n   a. For all L languages, sample \u03b8_d^{(\u2113)} \u223c Dirichlet(\u03b1^{(\u2113)})\n   b. Sample \u03c0_d \u223c Dirichlet(\u03b3)\n2. For all T topics and for all L languages, sample \u03d5_t^{(\u2113)} \u223c Dirichlet(\u03b2^{(\u2113)})\n3. For each of the N_d^{(\u2113)} words w_i^{(\u2113)} in language \u2113 (\u2113 \u2208 {1,\u00b7\u00b7\u00b7,L}) of document d:\n   a. Sample a pivot language flag x_i^{(\u2113)} \u223c Multinomial(\u03c0_d)\n   b. If (x_i^{(\u2113)} = \u2113), sample a topic z_i^{(\u2113)} \u223c Multinomial(\u03b8_d^{(\u2113)})\n   c. If (x_i^{(\u2113)} = m \u2260 \u2113), sample a topic y_i^{(\u2113)} \u223c Uniform(z_1^{(m)},\u00b7\u00b7\u00b7, z_{M_d^{(m)}}^{(m)})\n   d. Sample a word w_i^{(\u2113)} \u223c Multinomial(\u03b4_{x_i^{(\u2113)}=\u2113} \u03d5^{(\u2113)}_{z_i^{(\u2113)}} + (1 \u2212 \u03b4_{x_i^{(\u2113)}=\u2113}) \u03d5^{(\u2113)}_{y_i^{(\u2113)}})\n\nThe pivot language flag x_i^{(\u2113)} = \u2113 for an arbitrary language \u2113 indicates that the pivot language for the\nword w_i^{(\u2113)} is its own language \u2113, and x_i^{(\u2113)} = m indicates that the pivot language for w_i^{(\u2113)} is another\nlanguage m different from its own language \u2113. 
The indicator function \u03b4 takes the value 1 when the\ndesignated event occurs and 0 otherwise. Unlike CorrLDA, the uniform distribution at Step 3-c is\nnot based on the topics that are generated for all N_d^{(m)} words with the pivot language flags, but based\nonly on the topics that are already generated for M_d^{(m)} (M_d^{(m)} \u2264 N_d^{(m)}) words with the pivot language\nflags at each step of the generative process.2 The full conditional probability for collapsed\nGibbs sampling of this model is given by the following equations, assuming symmetric Dirichlet\npriors parameterized by \u03b1^{(\u2113)}, \u03b2^{(\u2113)} (\u2113 \u2208 {1,\u00b7\u00b7\u00b7,L}), and \u03b3:\n\nP(z_i^{(\u2113)} = t, x_i^{(\u2113)} = \u2113 | w_i^{(\u2113)} = w^{(\u2113)}, z_{\u2212i}^{(\u2113)}, w_{\u2212i}^{(\u2113)}, x_{\u2212i}, \u03b1^{(\u2113)}, \u03b2^{(\u2113)}, \u03b3) \u221d\n    (n_{d\u2113,\u2212i} + \u03b3) / (n_{d\u2113,\u2212i} + \u2211_{j\u2260\u2113} n_{dj} + L\u03b3)\n  \u00b7 (C^{TD(\u2113)}_{td,\u2212i} + \u03b1^{(\u2113)}) / (\u2211_{t\u2032} C^{TD(\u2113)}_{t\u2032d,\u2212i} + T\u03b1^{(\u2113)})\n  \u00b7 (C^{W(\u2113)T}_{w^{(\u2113)}t,\u2212i} + \u03b2^{(\u2113)}) / (\u2211_{w\u2032^{(\u2113)}} C^{W(\u2113)T}_{w\u2032^{(\u2113)}t,\u2212i} + W^{(\u2113)}\u03b2^{(\u2113)})    (1)\n\nP(y_i^{(\u2113)} = t, x_i^{(\u2113)} = m | w_i^{(\u2113)} = w^{(\u2113)}, y_{\u2212i}^{(\u2113)}, z^{(m)}, w_{\u2212i}^{(\u2113)}, x_{\u2212i}, \u03b2^{(\u2113)}, \u03b3) \u221d\n    (n_{dm,\u2212i} + \u03b3) / (n_{dm,\u2212i} + \u2211_{j\u2260m} n_{dj} + L\u03b3)\n  \u00b7 C^{TD(m)}_{td} / N_d^{(m)}\n  \u00b7 (C^{W(\u2113)T}_{w^{(\u2113)}t,\u2212i} + \u03b2^{(\u2113)}) / (\u2211_{w\u2032^{(\u2113)}} C^{W(\u2113)T}_{w\u2032^{(\u2113)}t,\u2212i} + W^{(\u2113)}\u03b2^{(\u2113)})    (2)\n\nwhere w^{(\u00b7)} = {w_i^{(\u00b7)}}, z^{(\u00b7)} = {z_i^{(\u00b7)}}, and x^{(\u00b7)} = {x_i^{(\u00b7)}}. W^{(\u00b7)} and N_d^{(\u00b7)} respectively indicate the total number of\nvocabulary words (word types) in the specified language, and the number of word tokens that appear\nin the specified language part of document d. n_{d\u2113} and n_{dm} are the number of times, for an arbitrary\nword i \u2208 {1,\u00b7\u00b7\u00b7, N_d^{(j)}} in an arbitrary language j \u2208 {1,\u00b7\u00b7\u00b7,L} of document d, the flags x_i^{(j)} = \u2113 and\nx_i^{(j)} = m respectively are allocated to document d. C^{TD(\u00b7)}_{td} indicates the (t, d) element of a T \u00d7 D\ntopic-document count matrix, meaning the number of times topic t is allocated to the document d\u2019s\nlanguage part specified in parentheses. 
C^{W(\u00b7)T}_{wt} indicates the (w, t) element of a W^{(\u00b7)} \u00d7 T word-topic\ncount matrix, meaning the number of times topic t is allocated to word w in the language specified\nin parentheses. The subscript \u2018\u2212i\u2019 indicates that w_i is excluded from the counts.\n\n2 The size of M_d^{(m)} may indeed differ at the step of generating each word in the generative process.\nHowever, this is not problematic for inference, such as by collapsed Gibbs sampling, where any topic is first\nrandomly assigned to every word, and a more appropriate topic is then re-assigned to each word, based on the\ntopics previously assigned to all N_d^{(m)} words, not M_d^{(m)} words, with the pivot language flags.\n\nNow we slightly modify SymCorrLDA by replacing Step 3-c in its generative process by:\n\n   3-c. If (x_i^{(\u2113)} = m \u2260 \u2113), sample a topic y_i^{(\u2113)} \u223c Multinomial(\u03b8_d^{(m)})\n\nFigure 2(c) shows a graphical model representation of this alternative SymCorrLDA. In this model,\nnon-pivot topics are dependent on the distribution behind the pivot topics, not dependent directly on\nthe pivot topics as in the original SymCorrLDA. By this modification, the generative process is more\nnaturally described. Accordingly, Eq. (2) of the full conditional probability is replaced by:\n\nP(y_i^{(\u2113)} = t, x_i^{(\u2113)} = m | w_i^{(\u2113)} = w^{(\u2113)}, y_{\u2212i}^{(\u2113)}, z^{(m)}, w_{\u2212i}^{(\u2113)}, x_{\u2212i}, \u03b2^{(\u2113)}, \u03b3) \u221d\n    (n_{dm,\u2212i} + \u03b3) / (n_{dm,\u2212i} + \u2211_{j\u2260m} n_{dj} + L\u03b3)\n  \u00b7 (C^{TD(m)}_{td} + \u03b1^{(m)}) / (\u2211_{t\u2032} C^{TD(m)}_{t\u2032d} + T\u03b1^{(m)})\n  \u00b7 (C^{W(\u2113)T}_{w^{(\u2113)}t,\u2212i} + \u03b2^{(\u2113)}) / (\u2211_{w\u2032^{(\u2113)}} C^{W(\u2113)T}_{w\u2032^{(\u2113)}t,\u2212i} + W^{(\u2113)}\u03b2^{(\u2113)})    (3)\n\nAs the second term of the right-hand side above shows, the constraints are relaxed by this\nmodification so that topics do not always have to be selected from the topics selected for the words\nwith the pivot language flags, unlike in Eq. (2). We will show through experiments\nhow the modification affects the quality of the estimated multilingual topics, in the following section.\n\n4 Experiments\n\nTable 1: Summary of bilingual data\n\n                            Japanese     English\nNo. of documents            229,855 (common to both languages)\nNo. of word types (vocab)   124,046      173,157\nNo. of word tokens          61,187,469   80,096,333\n\nTable 2: Summary of trilingual data\n\n                            Japanese     English      Spanish\nNo. of documents            90,602 (common to all three languages)\nNo. of word types (vocab)   70,902       98,474       96,191\nNo. of word tokens          25,952,978   33,999,988   25,701,830\n\nIn this section, we demonstrate some examples with SymCorrLDA, and then we compare\nmultilingual topic models using various evaluation methods. 
For the evaluation, we use held-out\nlog-likelihood using two datasets, the task of finding an English article that is on the same topic as that\nof a Japanese article, and a task with the languages reversed.\n\n4.1 Settings\n\nThe datasets used in this work are two collections of Wikipedia articles: one is in English and\nJapanese, the other is in English, Japanese, and Spanish, and articles in each collection are connected\nacross languages via inter-language links, as of November 2, 2009. We extracted text content from\nthe original Wikipedia articles, removing link information and revision history information. We used\nWP2TXT3 for this purpose. For English articles, we removed 418 types of standard stop words [12].\nFor Spanish articles, we removed 351 types of standard stop words [13]. As for Japanese articles,\nwe removed function words, such as symbols, conjunctions and particles, using part-of-speech tags\nannotated by MeCab4. The statistics of the datasets after preprocessing are shown in Tables 1 and\n2. We assumed each set of Wikipedia articles connected via inter-language links between two (or\nthree) languages as a comparable document that consists of two (or three) language parts. To carry\nout the evaluation in the task of finding counterpart articles that we will describe later, we randomly\ndivided the Wikipedia document collection at the document level into 80% training documents and\n20% test documents. 
Furthermore, to compute held-out log-likelihood, we randomly divided each\nof the training documents at the word level into 90% training set and 10% held-out set.\n\n3 http://wp2txt.rubyforge.org/\n4 http://mecab.sourceforge.net/\n\nWe first estimated CI-LDA, SwitchLDA, CorrLDA, and SymCorrLDA and its alternative version\n(\u2018SymCorrLDA-alt\u2019) as well as LDA for a baseline, using collapsed Gibbs sampling with the training\nset. In addition, we estimated a special implementation of SymCorrLDA, setting \u03c0_d in a simple way\nfor comparison, where the pivot language flag for each word is randomly selected according to the\nproportion of the length of each language part (\u2018SymCorrLDA-rand\u2019).\nFor all the models, we assumed symmetric Dirichlet hyperparameters \u03b1 = 50/T and \u03b2 = 0.01, which\nhave often been used in prior work [14]. We imposed the convergence condition of collapsed Gibbs\nsampling, such that the percentage change of held-out log-likelihood is less than 0.1%. For\nSymCorrLDA, we assumed symmetric Dirichlet hyperparameters \u03b3 = 1. For SwitchLDA, we assumed\nsymmetric Dirichlet hyperparameters \u03b7 = 1. We investigated the effect of \u03b3 in SymCorrLDA and\n\u03b7 in SwitchLDA; however, the held-out log-likelihood was almost constant when varying these\nhyperparameters. LDA does not distinguish languages, so for a baseline we assumed all the language\nparts connected via inter-language links to be mixed together as a single document.\n\nFigure 3: Change of frequency distribution of \u03c0_{d,1} according to number of iterations\n\nFigure 4: Document titles and corresponding \u03c0_d. (a) Examples with bilingual data. (b) Examples with trilingual data.\n\nFigure 5: Topic examples and corresponding proportion of pivots assigned to Japanese. An English\ntranslation for each Japanese word follows in parentheses, except for Japanese proper nouns.\n\n4.2 Pivot assignments\n\nFigure 3 demonstrates how the frequency distribution of the pivot language-flag (binomial)\nparameter \u03c0_{d,1} for the Japanese language with the bilingual dataset5 in SymCorrLDA changes over the\niterations of collapsed Gibbs sampling. This figure shows that the pivot language flag is randomly\nassigned at the initial state, and then it converges to an appropriate bias for each document as the\niterations proceed. 
We next demonstrate how the pivot language flags are assigned to each document.\nFigure 4(a) shows the titles of eight documents and the corresponding \u03c0_d when using the bilingual\ndata (T = 500). If \u03c0_{d,1} is close to 1, the article can be considered to be more related to a subject on\nJapanese or Japan. In contrast, if \u03c0_{d,1} is close to 0 and therefore \u03c0_{d,2} = 1 \u2212 \u03c0_{d,1} is close to 1, the\narticle can be considered to be more related to a subject on English or English-speaking countries.\nTherefore, a pivot is assigned considering the language biases of the articles. Figure 4(b) shows\nthe titles of six documents and the corresponding \u03c0_d = (\u03c0_{d,1}, \u03c0_{d,2}, \u03c0_{d,3}) when using the trilingual\ndata (T = 500). Here, \u03c0_{d,1}, \u03c0_{d,2}, and \u03c0_{d,3} respectively indicate the pivot language-flag (multinomial)\nparameters corresponding to Japanese, English, and Spanish parts in each document. We further\ndemonstrate the proportions of pivot assignments at the topic level. Figure 5 shows the content of\n6 topics through 10 words with the highest probability for each language and for each topic when\nusing the bilingual data (T = 500), some of which are biased to Japanese (Topics 13 and 59) or\nEnglish (Topics 201 and 251), while the others have almost no bias. It can be seen that the pivot bias\nto specific languages can be interpreted.\n\n5 The parameter for English was \u03c0_{d,2} = 1 \u2212 \u03c0_{d,1} in this case.\n\n[Figure 3 plot: frequency (0 to 80,000) against \u03c0_{d,1} (0 to 1), with curves for the 0th, 5th, 20th, and 50th iterations.]\n[Figure 4(a) document titles, placed on a \u03c0_{d,1} axis from 0.0 to 1.0: Japanese Language, Europe, Austria, Physics, Hory\u016b-ji (Horyu Temple), Personal computer, Western art history, Shogi (Japanese chess).]\n[Figure 4(b) document titles, placed between the corners \u03c0_d = (1,0,0), (0,1,0), and (0,0,1): Europe, Sony, Mount Fuji, Bullfighting, NFL, Mobile phone.]\n[Figure 5 topics, placed on a \u2018proportion of Japanese pivot\u2019 axis from 0.0 to 1.0:\nTopic 201: ireland, irish, scotland, scottish, dublin / airurando (Ireland), sukottorando (Scotland), nen (year), daburin (Dublin), kitaairurando (Northern Ireland)\nTopic 13: japan, osaka, kyoto, hughes, japanese / osaka, kyoto, shi (city), nen (year), kobe\nTopic 251: united, cup, manchester, manager, league / nen (year), ingurando (England), daihy\u014d (representative), rigu (league), sizun (season)\nTopic 269: species, insects, eggs, body, larvae / rui (species), shu (species), karada (body), konch\u016b (insect), dobutsu (animal)\nTopic 59: castle, battle, oda, hideyoshi, nobunaga / nobunaga, shiro (castle), hideyoshi, shi (surname), oda\nTopic 426: car, vehicle, vehicles, cars, truck / kuruma (car), jidosha (automobile), shary\u014d (vehicle), unten (driving), torakku (truck)]\n\n4.3 Held-out log-likelihood\n\nTable 3: Per-word held-out log-likelihood with bilingual data. 
The best result in each column (boldface in the original layout) is marked with *.\n\n                  T=500               T=1000\nModel             Japanese  English   Japanese  English\nLDA               -8.127    -8.633    -7.992    -8.530\nCI-LDA            -8.136    -8.644    -8.008    -8.549\nSwitchLDA         -8.139    -8.641    -8.012    -8.549\nCorrLDA1          -7.463    -8.403    -7.345    -8.346\nCorrLDA2          -7.777    -8.197    -7.663    -8.109\nSymCorrLDA        -7.433*   -8.175*   -7.317*   -8.084*\nSymCorrLDA-alt    -7.476    -8.206    -7.358    -8.116\nSymCorrLDA-rand   -7.483    -8.222    -7.373    -8.137\n\nTable 4: Per-word held-out log-likelihood with trilingual data. The best result in each column (boldface in the original layout) is marked with *.\n\n                  T=500                         T=1000\nModel             Japanese  English   Spanish   Japanese  English   Spanish\nCorrLDA1          -7.408    -8.512    -8.667    -7.305    -8.393    -8.545\nCorrLDA2          -7.655    -8.198    -8.467    -7.572    -8.122    -8.401\nCorrLDA3          -7.794    -8.460    -8.338    -7.700    -8.383    -8.274\nSymCorrLDA        -7.394*   -8.178*   -8.289*   -7.287*   -8.093*   -8.215*\nSymCorrLDA-alt    -7.440    -8.209    -8.330    -7.330    -8.120    -8.254\n\nBy measuring the held-out log-likelihood, we can evaluate the quality of each topic model. 
The\nhigher the held-out log-likelihood, the greater the predictive ability of the model.\nIn this work,\nwe estimated multilingual topic models with the training set and computed the log-likelihood of\ngenerating the held-out set that was mentioned in Section 4.1.\n\nTable 3 shows the held-out log-likelihood of each multilingual topic model estimated with the bilin-\ngual dataset when T = 500 and 1000. Note that the held-out log-likelihood (i.e., the micro-average\nper-word log-likelihood of the 10% held-out set) is shown for each language in this table, while\nthe model estimation was performed over the 90% training set in all the languages. Hereafter, Cor-\nrLDA1 refers to the CorrLDA model that was estimated when Japanese was the pivot language. As\ndescribed in Section 2.3, the CorrLDA model \ufb01rst generates topics for the pivot language part of a\ndocument, and for the other language parts of the document, the model then uses the topics that were\nalready generated in the pivot language. CorrLDA2 refers to the CorrLDA model when English was\nthe pivot language. As the results in Table 3 show, the held-out log-likelihoods of CorrLDA1 and\nCorrLDA2 are much higher than those of the other prior models: CI-LDA, SwitchLDA, and LDA,\nin both cases. This is because CorrLDA can capture direct dependencies between languages, due to\nthe constraints that topics have to be selected from the topics selected in the pivot language parts.\nOn the other hand, CI-LDA and SwitchLDA are too poorly constrained to e\ufb00ectively capture the\ndependencies between languages, as mentioned in Sections 2.1 and 2.2. In particular, CorrLDA1\nhas the highest held-out log-likelihood among all the prior models for Japanese, while CorrLDA2\nis the best among all the prior models for English. 
This is probably due to the fact that CorrLDA\ncan estimate topics from the pivot language parts (Japanese in the case of CorrLDA1) without any\nspecific constraints; however, great constraints (topics having to be selected from the topics selected\nin the pivot language parts) are imposed for the other language parts. In SymCorrLDA, the held-out\nlog-likelihood for Japanese is larger than that of CorrLDA1 (and the other models), and the held-out\nlog-likelihood for English is larger than that of CorrLDA2. This is probably because SymCorrLDA\nestimates the pivot language appropriately adjusted for each word in each document. Next, we\ncompare SymCorrLDA and its alternative version (SymCorrLDA-alt). We observed in Table 3 that the\nheld-out log-likelihood of SymCorrLDA-alt is smaller than that of the original SymCorrLDA, and\ncomparable to CorrLDA\u2019s best. This is because the constraints in SymCorrLDA-alt are relaxed so\nthat topics do not always have to be selected from the topics selected for the words with the pivot\nlanguage flags.\n\nFor further consideration, let us examine the results of the simplified implementation:\nSymCorrLDA-rand, which we defined in Section 4.1. SymCorrLDA-rand\u2019s held-out log-likelihood\nlies even below CorrLDA\u2019s best. These results reflect the fact that the performance of\nSymCorrLDA in its full form is inherently affected by the nature of the language biases in the multilingual\ncomparable documents, rather than merely being affected by the language part length.\n\nTable 4 shows the held-out log-likelihood with the trilingual data when T = 500 and 1000. 
Here,\nCorrLDA3 refers to the CorrLDA model that was estimated when Spanish was the pivot language.\nAs this table shows, SymCorrLDA\u2019s held-out log-likelihood is larger than CorrLDA\u2019s best.\nSymCorrLDA can estimate the pivot language appropriately adjusted for each word in each\ndocument in the trilingual data, as with the bilingual data. SymCorrLDA-alt behaves similarly as with\nthe bilingual data.\n\nFor both the bilingual and trilingual data, the improvements with SymCorrLDA were statistically\nsignificant, compared to each of the other models, according to the Wilcoxon signed-rank test at the\n5% level in terms of the word-by-word held-out log-likelihood. As for scalability, SymCorrLDA\nis as scalable as CorrLDA because the time complexity of SymCorrLDA is of the same order as that of\nCorrLDA: the number of topics times the sum of the vocabulary sizes of the languages. In wall-clock time,\nSymCorrLDA does incur some overhead for allocating the pivot language flags: around 40% of the\nrunning time of CorrLDA in the case of the bilingual data.\n\n4.4 Finding counterpart articles\n\nGiven an article, we can find its unseen counterpart articles in other languages using a multilingual\ntopic model. To evaluate this task, we experimented with the bilingual dataset. We estimated\ndocument-topic distributions of test documents for each language, using the topic-word distributions\nthat were estimated by each multilingual topic model with training documents. We then evaluated\nthe performance of finding English counterpart articles using Japanese articles as queries, and vice\nversa. For estimating the document-topic distributions of test documents, we used re-sampling of\nLDA using the topic-word distribution estimated beforehand by each multilingual topic model [15].\nWe then computed the Jensen-Shannon (JS) divergence [16] between a document-topic distribution\nof Japanese and that of English for each test document. 
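As a minimal sketch (not the authors' implementation), the JS divergence between two document-topic distributions can be computed as follows. The function names, the use of the natural logarithm, and the toy distributions are assumptions made purely for illustration; the choice of logarithm base does not change the ranking of candidate articles.

```python
import math


def kl_divergence(p, q):
    # Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    # Jensen-Shannon divergence: mean KL divergence of p and q to their midpoint m.
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical document-topic distributions over T = 3 topics (illustrative values only).
theta_ja = [0.7, 0.2, 0.1]
theta_en = [0.6, 0.3, 0.1]
theta_other = [0.1, 0.1, 0.8]

close = js_divergence(theta_ja, theta_en)    # similar topic mixtures
far = js_divergence(theta_ja, theta_other)   # dissimilar topic mixtures
```

In this sketch the divergence is symmetric, equals zero for identical distributions, and is bounded above by log 2, so a smaller value indicates a closer match between the two language parts.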
Each held-out English-Japanese article pair\nconnected via an inter-language link is considered to be on the same topic; therefore, the JS divergence\nof such an article pair is expected to be small if the latent topic estimation is accurate. We first\nassumed each held-out Japanese article to be a query and the corresponding English article to be\nrelevant, and evaluated the ranking of all the English test articles in ascending order of the JS\ndivergence; then we conducted the task with the languages reversed.\n\nTable 5 shows the results of mean reciprocal rank (MRR), when T = 500 and 1000. The\nreciprocal rank is defined as the multiplicative inverse of the rank of the counterpart article\ncorresponding to each query article, and the mean reciprocal rank is its average over all the\nquery articles. CorrLDA works much more effectively than the other prior models: CI-LDA,\nSwitchLDA, and LDA, and overall, SymCorrLDA works the most effectively. We observed that the\nimprovements with SymCorrLDA were statistically significant according to the Wilcoxon signed-rank\ntest at the 5% level, compared with each of the other models. Therefore, it is clear that SymCorrLDA\nestimates multilingual topics the most successfully in this experiment.\n\nTable 5: MRR in counterpart article finding task.\nBoldface indicates the best result in each column.\n\n             Japanese to English    English to Japanese\n             T=500    T=1000        T=500    T=1000\nLDA          0.0743   0.1027        0.0870   0.1262\nCI-LDA       0.1464   0.1426        0.1818   0.1697\nSwitchLDA    0.1357   0.1347        0.1668   0.1653\nCorrLDA1     0.2987   0.3281        0.2863   0.3111\nCorrLDA2     0.2829   0.3063        0.3161   0.3464\nSymCorrLDA   0.3256   0.3592        0.3348   0.3685\n\n5 Conclusions\n\nIn this paper, we compared the performance of various topic models that can be applied to multilingual\ndocuments, not using multilingual dictionaries, in terms of held-out log-likelihood and in the\ntask of cross-lingual link detection. 
We demonstrated through experiments that CorrLDA works significantly\nmore effectively than CI-LDA, which was used in prior work on multilingual topic models.\nFurthermore, we proposed a new topic model, SymCorrLDA, that incorporates a hidden variable to\ncontrol a pivot language, in an extension of CorrLDA. SymCorrLDA has an advantage in that it does\nnot require a pivot language to be specified in advance, while CorrLDA does. We demonstrated that\nSymCorrLDA is more effective than CorrLDA and the other topic models, through experiments\nwith Wikipedia datasets using held-out log-likelihood and in the task of finding counterpart articles\nin other languages. SymCorrLDA can be applied to other kinds of data that have multiple classes of\nrepresentations, such as annotated image data. We plan to investigate this in future work.\n\nAcknowledgments We thank Sinead Williamson, Manami Matsuura, and the anonymous\nreviewers for valuable discussions and comments. This work was supported in part by the Grant-in-Aid for\nScientific Research (#23300039) from JSPS, Japan.\n\nReferences\n\n[1] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual\nInternational ACM SIGIR Conference on Research and Development in Information Retrieval,\npages 50\u201357, Berkeley, California, USA, 1999.\n\n[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of\nMachine Learning Research, 3:993\u20131022, 2003.\n\n[3] David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum.\nPolylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods\nin Natural Language Processing, pages 880\u2013889, Stroudsburg, Pennsylvania, USA, 2009.\n\n[4] Xiaochuan Ni, Jian-Tao Sun, Jian Hu, and Zheng Chen. Mining multilingual topics from\nWikipedia. 
In Proceedings of the 18th International Conference on World Wide Web, pages\n1155\u20131156, Madrid, Spain, 2009.\n\n[5] Elena Erosheva, Stephen Fienberg, and John Lafferty. Mixed-membership models of scientific\npublications. Proceedings of the National Academy of Sciences of the United States of America,\n101:5220\u20135227, 2004.\n\n[6] David Newman, Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. Statistical\nentity-topic models. In Proceedings of the 12th ACM SIGKDD International Conference on\nKnowledge Discovery and Data Mining, pages 680\u2013686, Philadelphia, Pennsylvania, USA,\n2006.\n\n[7] David M. Blei and Michael I. Jordan. Modeling annotated data. In Proceedings of the 26th\nAnnual International ACM SIGIR Conference on Research and Development in Information\nRetrieval, pages 127\u2013134, Toronto, Canada, 2003.\n\n[8] Bing Zhao and Eric P. Xing. BiTAM: Bilingual topic admixture models for word alignment.\nIn Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics,\npages 969\u2013976, Sydney, Australia, 2006.\n\n[9] Jordan Boyd-Graber and David M. Blei. Multilingual topic models for unaligned text. In\nProceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 75\u201382,\nMontreal, Canada, 2009.\n\n[10] Jagadeesh Jagarlamudi and Hal Daume. Extracting multilingual topics from unaligned\ncomparable corpora. In Advances in Information Retrieval, volume 5993 of Lecture Notes in\nComputer Science, pages 1\u201312. Springer, 2010.\n\n[11] Duo Zhang, Qiaozhu Mei, and ChengXiang Zhai. Cross-lingual latent topic extraction. In\nProceedings of the 48th Annual Meeting of the Association for Computational Linguistics,\npages 1128\u20131137, Uppsala, Sweden, 2010.\n\n[12] James P. Callan, W. Bruce Croft, and Stephen M. Harding. 
The INQUERY retrieval system.\nIn Proceedings of the 3rd International Conference on Database and Expert Systems\nApplications, pages 78\u201383, Valencia, Spain, 1992.\n\n[13] Jacques Savoy. Report on CLEF-2002 experiments: Combining multiple sources of evidence.\nIn Advances in Cross-Language Information Retrieval, volume 2785 of Lecture Notes in\nComputer Science, pages 66\u201390. Springer, 2003.\n\n[14] Mark Steyvers and Tom Griffiths. Handbook of Latent Semantic Analysis, chapter 21:\nProbabilistic Topic Models. Lawrence Erlbaum Associates, Mahwah, New Jersey and London, 2007.\n\n[15] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation\nmethods for topic models. In Proceedings of the 26th International Conference on Machine\nLearning, pages 1105\u20131112, Montreal, Canada, 2009.\n\n[16] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on\nInformation Theory, 37(1):145\u2013151, 1991.\n", "award": [], "sourceid": 634, "authors": [{"given_name": "Kosuke", "family_name": "Fukumasu", "institution": null}, {"given_name": "Koji", "family_name": "Eguchi", "institution": null}, {"given_name": "Eric", "family_name": "Xing", "institution": null}]}