{"title": "Factorial LDA: Sparse Multi-Dimensional Text Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2582, "page_last": 2590, "abstract": "Multi-dimensional latent variable models can capture the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional latent variable model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (e.g. methods vs. applications.) Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors.", "full_text": "Factorial LDA:\n\nSparse Multi-Dimensional Text Models\n\nMichael J. Paul and Mark Dredze\n\nHuman Language Technology Center of Excellence (HLTCOE)\n\nCenter for Language and Speech Processing (CLSP)\n\nJohns Hopkins University\n\nBaltimore, MD 21218\n\n{mpaul,mdredze}@cs.jhu.edu\n\nAbstract\n\nLatent variable models can be enriched with a multi-dimensional structure to\nconsider the many latent factors in a text corpus, such as topic, author perspective\nand sentiment. We introduce factorial LDA, a multi-dimensional model in which a\ndocument is in\ufb02uenced by K different factors, and each word token depends on a\nK-dimensional vector of latent variables. Our model incorporates structured word\npriors and learns a sparse product of factors. Experiments on research abstracts\nshow that our model can learn latent factors such as research topic, scienti\ufb01c disci-\npline, and focus (methods vs. applications). 
Our modeling improvements reduce\ntest perplexity and improve human interpretability of the discovered factors.\n\n1\n\nIntroduction\n\nThere are many factors that contribute to a document\u2019s word choice: topic, syntax, sentiment, author\nperspective, and others. Latent variable \u201ctopic models\u201d such as latent Dirichlet allocation (LDA)\nimplicitly model a single factor of topical content [1]. More in-depth analyses of corpora call for\nmodels that are explicitly aware of additional factors beyond topic. Some topic models have been\nused to model speci\ufb01c factors like sentiment [2], and more general models\u2014like the topic aspect\nmodel [3] and sparse additive generative models (SAGE) [4]\u2014have jointly considered both topic\nand another factor, such as perspective. Most prior work has only considered two factors at once.1\nThis paper presents factorial LDA, a general framework for multi-dimensional text models that cap-\nture an arbitrary number of factors. While standard topic models associate each word token with\na single latent topic variable, a multi-dimensional model associates each token with a vector of\nmultiple factors, such as (topic, political ideology) or (product type, sentiment, author age).\nScaling to an arbitrary number of factors poses challenges that cannot be addressed with exist-\ning two-dimensional models. First, we must ensure consistency across different word distribu-\ntions which have the same components. For example, the word distributions associated with the\n(topic, perspective) pairs (ECONOMICS,LIBERAL) and (ECONOMICS,CONSERVATIVE) should both\ngive high probability to words about economics. Additionally, increasing the number of factors re-\nsults in a multiplicative increase in the number of possible tuples that can be formed, and not all\ntuples will be well-supported by the data. 
We address these two issues by adding additional struc-\nture to our model: we impose structured word priors that link tuples with common components, and\nwe place a sparse prior over the space of possible tuples. We demonstrate that both of these model\nstructures lead to improvements in model performance.\nIn the next section, we introduce our model, where our main contributions are to:\n\n1A recent variant of SAGE modeled three factors in historic documents: topic, time, and location [5].\n\n1\n\n\f\u2022 introduce a general model that can accommodate K different factors (dimensions) of language,\n\u2022 design structured priors over the word distributions that tie together common factors,\n\u2022 enforce a sparsity pattern which excludes unsupported combinations of components (tuples).\n\nWe then discuss our inference procedure (\u00a74) and share experimental results (\u00a75).\n\n2 Factorial LDA: A Multi-Dimensional Generative Model\n\nLatent Dirichlet allocation (LDA) [1] assumes we have a set of Z latent components (usually called\n\u201ctopics\u201d in the context of text modeling), and each data point (a document) has a discrete distribution\n\u03b8 over these topics. The set of topics can be thought as a vector of length Z, where each cell is a\npointer into a discrete distribution over words, parameterized by \u03c6z. 
Under LDA, a document is\ngenerated by choosing the topic distribution \u03b8 from a Dirichlet prior, then for each token we sample\na latent topic t from this distribution before sampling a word w from the tth word distribution \u03c6t.\nWithout additional structure, LDA tends to learn distributions which correspond to semantic topics\n(such as SPORTS or ECONOMICS) [6] which dominate the choice of words in a document, rather\nthan syntax, perspective, or other aspects of document content.\nImagine that instead of a one-dimensional vector of Z topics, we have a two-dimensional matrix of\nZ1 components along one dimension (rows) and Z2 components along the other (columns). This\nstructure makes sense if a corpus is composed of two different factors, and the two dimensions might\ncorrespond to factors such as news topic and political perspective (if we are modeling newspaper edi-\ntorials), or research topic and discipline (if we are modeling scienti\ufb01c papers). Individual cells of the\nmatrix would represent pairs such as (ECONOMICS,CONSERVATIVE) or (GRAMMAR,LINGUISTICS)\nand each is associated with a word distribution \u03c6(cid:126)z. Conceptually, this is the idea behind the two-\ndimensional models of TAM [3] and SAGE [4].\nLet us expand this idea further by assuming K factors modeled with a K-dimensional array, where\neach cell of the array has a pointer to a word distribution corresponding to that particular K-tuple.\nFor example, in addition to topic and perspective, we might want to model a third factor of the au-\nthor\u2019s gender in newspaper editorials, yielding triples such as (ECONOMICS,CONSERVATIVE,MALE).\nConceptually, each K-tuple (cid:126)t functions as a topic in LDA (with an associated word distribution \u03c6(cid:126)t)\nexcept that K-tuples imply a structure, e.g.\nthe pairs (ECONOMICS,CONSERVATIVE) and (ECO-\nNOMICS,LIBERAL) are related. 
This is the idea behind factorial LDA (f-LDA).\nAt its core, our model follows the basic template of LDA, but each word token is associated with\na K-tuple rather than a single topic value. Under f-LDA, each document has a distribution over\ntuples, and each tuple indexes into a distribution over words. Of course, without additional structure,\nthis would simply be equivalent to LDA with $\\prod_k^K Z_k$ topics. In f-LDA, we induce a factorial\nstructure by creating priors which tie together tuples that share components: distributions involving\nthe pair (ECONOMICS,CONSERVATIVE) should have commonalities with distributions for\n(ECONOMICS,LIBERAL). The key ingredients of our new model are:\n\u2022 We model the intuition that tuples which share components should share other properties.\nFor example, we expect the word distributions for (ECONOMICS,CONSERVATIVE) and\n(ECONOMICS,LIBERAL) to both give high probability to words about economics, while the pairs\n(ECONOMICS,LIBERAL) and (ENVIRONMENT,LIBERAL) should both reflect words about liberalism.\nSimilarly, we want each document\u2019s distribution over tuples to reflect the same type of consistency.\nIf a document is written from a liberal perspective, then we believe that pairs of the form\n(*,LIBERAL) are more likely to have high probability than pairs with CONSERVATIVE as the second\ncomponent. This consistency across factors is encouraged by sharing parameters across\nthe word and topic prior distributions in the model: this encodes our a priori assumption that\ndistributions which share components should be similar.\n\u2022 Additionally, we allow for sparsity across the set of tuples. As the dimensionality of the array\nincreases, we are going to encounter problems of overparameterization, because the model will\nlikely contain more tuples than are observed in the data. We handle this by having an auxiliary\nmulti-dimensional array which encodes a sparsity pattern over tuples. 
The priors over tuples are\naugmented with this sparsity pattern. These priors model the belief that the Cartesian product of\nfactors should be sparse; the posterior may \u201copt out\u201d of some tuples.\n\nFigure 1: (a) Factorial LDA as a graphical model. (b) An illustration of word distributions in f-LDA\nwith two factors. When applying f-LDA to a collection of scientific articles from various disciplines,\nwe learn weights \u03c9 corresponding to a topic we call WORDS and the discipline EDUCATION as well\nas background words. These weights are combined to form the Dirichlet prior, and the distribution\nfor (WORDS,EDUCATION) is drawn from this prior: this distribution describes writing education.\n\nThe generative story (we\u2019ll describe the individual pieces below) is as follows.\n\n1. Draw the various hyperparameters \u03b1 and \u03c9 from $\\mathcal{N}(0, I\\sigma^2)$\n2. For each tuple $\\vec{t} = (t_1, t_2, \\ldots, t_K)$:\n   (a) Sample word distribution $\\phi_{\\vec{t}} \\sim \\mathrm{Dir}(\\hat{\\omega}^{(\\vec{t})})$\n   (b) Sample sparsity \u201cbit\u201d $b_{\\vec{t}} \\sim \\mathrm{Beta}(\\gamma_0, \\gamma_1)$\n3. For each document $d \\in D$:\n   (a) Draw document component weights $\\alpha^{(d,k)} \\sim \\mathcal{N}(0, I\\sigma^2)$ for each factor k\n   (b) Sample distribution over tuples $\\theta^{(d)} \\sim \\mathrm{Dir}(B \\cdot \\hat{\\alpha}^{(d)})$\n   (c) For each token:\n      i. Sample component tuple $\\vec{z} \\sim \\theta^{(d)}$\n      ii. Sample word $w \\sim \\phi_{\\vec{z}}$\n\nwhere the Dirichlet vectors $\\hat{\\omega}$ and $\\hat{\\alpha}$ are defined as:\n\n$$\\hat{\\omega}^{(\\vec{t})}_w \\triangleq \\exp\\Big( \\omega^{(B)} + \\omega^{(0)}_w + \\sum_k \\omega^{(k)}_{t_k w} \\Big), \\qquad \\hat{\\alpha}^{(d)}_{\\vec{t}} \\triangleq \\exp\\Big( \\alpha^{(B)} + \\sum_k \\big( \\alpha^{(D,k)}_{t_k} + \\alpha^{(d,k)}_{t_k} \\big) \\Big) \\qquad (1)$$\n\nSee Figure 1a for the graphical model, and Figure 1b for an illustration of how the weight vectors\n$\\omega^{(0)}$ and $\\omega^{(k)}$ are combined to form $\\hat{\\omega}$ for a\nparticular tuple that was inferred by our model.\nThe words shown have the highest weight after\nrunning our inference procedure (see \u00a75 for experimental details).\n\nAs discussed above, the only difference between f-LDA and LDA is that structure has been added\nto the Dirichlet priors for the word and topic distributions. We use a form of Dirichlet-multinomial\nregression [7] to formulate the priors for \u03c6 and \u03b8 in terms of the log-linear functions in Eq. 1. We\nwill now describe these priors in more detail.\nPrior over \u03c6: We formulate the priors of \u03c6 to encourage word distributions to be consistent across\ncomponents of each factor. For example, tuples that reflect the same topic should share words. To\nachieve this goal, we link the priors for tuples that share common components by utilizing a log-linear\nparameterization of the Dirichlet prior of \u03c6 (Eq. 1). Formally, we place a prior Dirichlet($\\hat{\\omega}^{(\\vec{t})}$)\nover $\\phi_{\\vec{t}}$, the word distribution for tuple $\\vec{t} = (t_1, t_2, \\ldots, t_K)$. The Dirichlet vector $\\hat{\\omega}^{(\\vec{t})}$ controls the\nprecision and focus of the prior. It is a function of three types of hyperparameters. First, a single\ncorpus-wide bias scalar $\\omega^{(B)}$, and second, a vector over the vocabulary $\\omega^{(0)}$, which reflects the\nrelative likelihood of different words. These respectively increase the overall precision of words\nand the default likelihood of each word. Finally, $\\omega^{(k)}_{t_k w}$ introduces bias parameters for each word w\nfor component $t_k$ of the kth factor. 
By increasing the weight of a particular $\\omega^{(k)}_{t_k w}$, we increase the\nexpected relative log-probabilities of word w in $\\phi_{\\vec{z}}$ for all $\\vec{z}$ that contain component $t_k$, thereby tying\nthese priors together.\nPrior over \u03b8: We use a similar formulation for the prior over \u03b8. Recall that we want documents to\nnaturally favor tuples that share components, i.e. favoring both (ECONOMICS,CONSERVATIVE) and\n(EDUCATION,CONSERVATIVE) if the document favors CONSERVATIVE in general. To address this,\nwe let $\\theta^{(d)}$ be drawn from Dirichlet($\\hat{\\alpha}^{(d)}$), where instead of a corpus-wide prior, each document has a\nvector $\\hat{\\alpha}^{(d)}$ which reflects the independent contributions of the factors via a log-linear function. This\nfunction contains three types of hyperparameters. First, $\\alpha^{(B)}$ is the corpus-wide precision parameter\n(the bias); this is shared across all documents and tuples. Second, $\\alpha^{(D,k)}_{t_k}$ indicates the bias for the\nkth factor\u2019s component $t_k$ across the entire corpus D, which enables the model to favor certain\ncomponents a priori. Finally, $\\alpha^{(d,k)}_{t_k}$ is the bias for the kth factor\u2019s component $t_k$ specifically in\ndocument d. This allows documents to favor certain components over others, such as the perspective\nCONSERVATIVE in a specific document. 
We assume all \u03c9s and \u03b1s are independent and normally\ndistributed around 0, which gives us L2 regularization during optimization.\nSparsity over tuples: Finally, we describe the generation of the sparsity pattern over tuples in the\ncorpus. We assume a K-dimensional binary array B, where an entry $b_{\\vec{t}}$ corresponds to tuple $\\vec{t}$.\nIf $b_{\\vec{t}} = 1$, then $\\vec{t}$ is active: that is, we are allowed to choose $\\vec{t}$ to generate a token and we learn\n$\\phi_{\\vec{t}}$; otherwise we do not. We modify the prior over \u03b8 to include a binary mask of the tuples:\n$\\theta^{(d)} \\sim \\mathrm{Dirichlet}(B \\cdot \\hat{\\alpha}^{(d)})$, where $\\cdot$ is the Hadamard (cell-wise) product. \u03b8 will not include tuples\nfor which $b_{\\vec{t}} = 0$; otherwise the prior will remain unchanged.\nWe would ideally model B so that its values are in {0,1}. While we could use a Beta-Bernoulli\nmodel (a finite Indian Buffet Process [8]) to generate a finite binary matrix (array), this model is typically\nlearned over continuous data; learning over discrete observations (tuples) can be exceedingly\ndifficult since forcing the model to change a bit can yield large changes to the observations, which\nmakes mixing very slow.2 To aid learning, we relax the constraint that B must be binary and instead\nallow $b_{\\vec{t}}$ to be real-valued in (0, 1). This is a common approximation used in other models, such as\nartificial neural networks and deep belief networks. To encourage sparsity, we place a \u201cU-shaped\u201d\nBeta($\\gamma_0, \\gamma_1$) prior over $b_{\\vec{t}}$, where $\\gamma_0, \\gamma_1 < 1$, which yields a density function that is concentrated around\nthe edges 0 and 1. Empirically, we will show that this effectively learns a sparse binary B. 
The\neffect is that the prior assigns tiny probabilities to some tuples instead of strictly 0.\n\n3 Related Work\n\nPrevious work on multi-dimensional modeling includes the topic aspect model (TAM) [3], multi-view\nLDA (mv-LDA) [10], cross-collection LDA [11] and sparse additive generative models\n(SAGE) [4], which jointly consider both topic and another factor. Other work has jointly modeled\ntopic and sentiment [2]. Zhang et al. [12] apply PLSA [13] to multi-dimensional OLAP data,\nbut not with a joint model. Our work is the first to jointly model an arbitrary number of factors. A\nrather different approach considered different dimensions of clustering using spectral methods [14],\nin which K different clusterings are obtained by considering K different eigenvectors. For example,\nproduct reviews can be clustered not only by topic, but also by sentiment and author attributes.\nWe contrast this body of work with probabilistic matrix and tensor factorization models [15, 16]\nwhich model data that has already been organized in multiple dimensions \u2013 for example, topic-like\nmodels have been used to model the movie ratings within a matrix of users and movies. f-LDA and\nthe models described above, however, operate over flat input (text documents), and it is only the\nlatent structure that is assumed to be organized along multiple dimensions.\nAn important contribution of f-LDA is the use of priors to tie together word distributions with the\nsame components. Previous work with two-dimensional models, such as TAM and mv-LDA, assumes\nconditional independence among all \u03c6, and there is no explicit encouragement of correlation. An\nalternative approach would be to strictly enforce consistency, such as through a \u201cproduct of experts\u201d\nmodel [17], in which each factor has independent word distributions that are multiplied together\nand renormalized to form the distribution for a particular tuple, i.e. $\\phi_{\\vec{t}} \\propto \\prod_k \\phi_{t_k}$. Syntactic topic\nmodels [18] and shared components topic models [19] follow this approach. Our structured word\nprior generalizes both of these approaches. By setting all $\\omega^{(k)}$ to 0, the factors have no influence on\nthe prior and we obtain the distribution independence of TAM. If instead we have large \u03c9 values,\nthen the model behaves like a product of experts; as precision increases, the posterior converges to\nthe prior. By learning \u03c9 our model can determine the optimal amount of coherence among the \u03c6.\n\n2 One approach is to (approximately) collapse out the sparsity array [9], but this is difficult when working\nover the entire corpus of tokens. Experiments with Metropolis-Hastings samplers, split-merge based samplers,\nand alternative prior structures all suffered from mixing problems.\n\nAnother key part of f-LDA is the inclusion of a sparsity pattern. There have been several recent approaches\nthat enforce sparsity in topic models. Various applications of sparsity can be organized into\nthree categories. First, one could enforce sparsity over the topic-specific word distributions, forcing\neach topic to select a subset of relevant words. This is the idea behind sparse topic models [20],\nwhich restrict topics to a subset of the vocabulary, and SAGE [4], which applies L1 regularization to\nword weights. A second approach is to enforce sparsity in the document-specific topic distributions,\nfocusing each document on a subset of relevant topics. This is the idea in focused topic models\n[9]. The third category, which is our contribution, is to impose sparsity among the set of topics (or K-tuples) that are\navailable to the model. Among sparsity-inducing regularizers, one that closely relates to our goals\nis the group lasso [21]. 
While the standard lasso will drive vector elements to 0, the group lasso will\ndrive entire vectors to 0.\n\n4 Inference and Optimization\n\nf-LDA turns out to be fairly similar to LDA in terms of inference.\nIn both models, words are\ngenerated by first sampling a latent variable (in our case, a latent tuple) from a distribution \u03b8, then\nsampling the word from \u03c6 conditioned on the latent variable. The differences between LDA and\nf-LDA lie in the parameters of the Dirichlet priors. The presentation of our optimization procedure\nfocuses on these parameters.\nWe follow the common approach of alternating between sampling the latent variables and direct\noptimization of the Bayesian hyperparameters [22]. We use a Gibbs sampler to estimate $E[\\vec{z}]$, and\ngiven the current estimate of this expectation, we optimize the parameters \u03b1, \u03c9 and B. These two\nsteps form a Monte Carlo EM (MCEM) routine.\n\n4.1 Latent Variable Sampling\n\nThe latent variables $\\vec{z}$ are sampled using the standard collapsed Gibbs sampler for LDA [23], with\nthe exception that the basic Dirichlet priors have been replaced with our structured priors for \u03b8 and\n\u03c6. The sampling equation for $\\vec{z}$ for token i, given all other latent variable assignments $\\vec{z}$, the corpus\nw and the parameters (\u03b1, \u03c9, and B) becomes:\n\n$$p(\\vec{z}_i = \\vec{t} \\mid \\vec{z} \\setminus \\{\\vec{z}_i\\}, w, \\alpha, \\omega, B) \\propto \\Big( n^d_{\\vec{t}} + b_{\\vec{t}} \\hat{\\alpha}^{(d)}_{\\vec{t}} \\Big) \\left( \\frac{n^{\\vec{t}}_w + \\hat{\\omega}^{(\\vec{t})}_w}{\\sum_{w'} n^{\\vec{t}}_{w'} + \\hat{\\omega}^{(\\vec{t})}_{w'}} \\right) \\qquad (2)$$\n\nwhere $n^b_a$ denotes the number of times a occurs in b.\n\n4.2 Optimizing the Sparsity Array and Hyperparameters\n\nFor mathematical convenience, we reparameterize B in terms of the logistic function \u03c3, such that\n$b_{\\vec{t}} \\equiv \\sigma(\\beta_{\\vec{t}})$. We optimize $\\beta \\in \\mathbb{R}$ to obtain $b \\in (0, 1)$. The derivative of $\\sigma(x)$ has the simple form\n$\\sigma(x)\\sigma(-x)$. For a tuple $\\vec{t}$, the gradient of the corpus log likelihood L with respect to $\\beta_{\\vec{t}}$ is:\n\n$$\\begin{aligned} \\frac{\\partial L}{\\partial \\beta_{\\vec{t}}} = {} & (\\gamma_0 - 1)\\sigma(-\\beta_{\\vec{t}}) + (\\gamma_1 - 1)\\big(-\\sigma(\\beta_{\\vec{t}})\\big) \\\\\\\\ & + \\sum_{d \\in D} \\Big[ \\sigma(\\beta_{\\vec{t}})\\sigma(-\\beta_{\\vec{t}}) \\hat{\\alpha}^{(d)}_{\\vec{t}} \\times \\Big( \\Psi\\big(n^d_{\\vec{t}} + \\sigma(\\beta_{\\vec{t}})\\hat{\\alpha}^{(d)}_{\\vec{t}}\\big) - \\Psi\\big(\\sigma(\\beta_{\\vec{t}})\\hat{\\alpha}^{(d)}_{\\vec{t}}\\big) + \\Psi\\big(\\textstyle\\sum_{\\vec{u}} \\sigma(\\beta_{\\vec{u}})\\hat{\\alpha}^{(d)}_{\\vec{u}}\\big) - \\Psi\\big(\\textstyle\\sum_{\\vec{u}} n^d_{\\vec{u}} + \\sigma(\\beta_{\\vec{u}})\\hat{\\alpha}^{(d)}_{\\vec{u}}\\big) \\Big) \\Big] \\end{aligned} \\qquad (3)$$\n\nwhere the \u03b3 values are the Beta parameters. The top terms are a result of the Beta prior over $b_{\\vec{t}}$,\nwhile the summation over documents reflects the gradient of the Dirichlet-multinomial compound.\nStandard non-convex optimization methods can be used on this gradient. To avoid shallow local\nminima, we optimize this gradually by taking small gradient steps, performing a single iteration of\ngradient ascent after each Gibbs sampling iteration (see \u00a75 for more details).\nThe gradients for the \u03b1 and \u03c9 variables have a similar form to (3); the main difference with \u03c9 is\nthat the gradient involves a sum over components rather than over documents. We similarly update\nthese values through gradient ascent.\n\n5 Experiments\n\nWe experiment with two data sets that could contain multiple factors. The first is a collection of 5000\ncomputational linguistics abstracts from the ACL Anthology (ACL). 
The second combines these\nabstracts (C) with several journals in the \ufb01elds of linguistics (L), education (E), and psychology (P).\nWe use 1000 articles from each discipline (CLEP). For both corpora, we keep an additional 1000\ndocuments for development and 1000 for test (uniformly representative of the 4 CLEP disciplines).\nWe used (cid:126)Z = (\u2217, 2, 2) for ACL and (cid:126)Z = (\u2217, 4) for CLEP for various numbers of \u201ctopics\u201d Z1 \u2208\n{5, . . . , 50}. While we cannot say in advance what each factor will represent, we observed that\nwhen Zk is large, components along this factor correspond to topics. Therefore, we set Z1 > Zk>1\nand assume the \ufb01rst factor is topic. While our model presentation assumed latent factors, we could\nobserve factors, such as knowing the journal of each article in CLEP. However, our experiments\nstrictly focus on the unsupervised setting to measure what the model can infer on its own.\nWe will compare our complete model against simpler models by ablating parts of f-LDA. If we\nremove the structured word priors and array sparsity, we are left with a basic multi-dimensional\nmodel (base). We will compare against models where we add back in the structured word priors (W)\nand array sparsity (S), and \ufb01nally the full f-LDA model (SW). All variants are identical except that\nwe \ufb01x all \u03c9(k) = 0 to remove structured word priors and \ufb01x B = 1 to remove sparsity.\nWe also compare against the topic aspect model (TAM) [3], a two-dimensional model, using the\npublic implementation.3 TAM is similar to the \u201cbase\u201d two-factor f-LDA model except that f-LDA\nhas a single \u03b8 per document with priors that are independently weighted by each factor, whereas\nTAM has K independent \u03b8s, with a different \u03b8k for each factor. If the Dirichlet precision in f-LDA\nis very high, then it should exhibit similar behavior as having separate \u03b8s. 
TAM only models two\ndimensions so we are restricted to running it on the two-dimensional CLEP data set.\nFor hyperparameters, we set $\\gamma_0 = \\gamma_1 = 0.1$ in the Beta prior over $b_{\\vec{t}}$, and we set $\\sigma^2 = 10$ for \u03b1 and\n1 for \u03c9 in the Gaussian prior over weights. Bias parameters ($\\alpha^{(B)}$, $\\omega^{(B)}$) are initialized to \u22125 for\nweak initial priors. Our sampling algorithm alternates between a full pass over tokens and a single\ngradient step on the parameters (step size of $10^{-2}$ for \u03b1; $10^{-3}$ for \u03c9 and \u03b2). Results are averaged or\npooled from five trials of randomly initialized chains, which are each run for 10,000 iterations.\nPerplexity Following standard practice we measure perplexity on held-out data by fixing all parameters\nduring training except document-specific parameters ($\\alpha^{(d,k)}$, $\\theta^{(d)}$), which are computed\nfrom the test document. We use the \u201cdocument completion\u201d method: we infer parameters from half\na document and measure perplexity on the remaining half [24]. Monte Carlo EM is run on test data\nfor 200 iterations. Average perplexity comes from another 10 iterations.\nFigure 2a shows that the structured word priors yield lower perplexity, while results for sparse models\nare mixed. On ACL, sparsity consistently improves perplexity once the number of topics exceeds\n20, while on CLEP sparsity does worse. Experiments with varying K yielded similar orderings, suggesting\nthat differences are data dependent and not dependent on K. On CLEP, we find that TAM\nperforms worse than f-LDA with a lower number of topics (which is what we find to work best qualitatively),\nbut catches up as the number of topics increases. (Beyond 50 topics, we find that TAM\u2019s\nperplexity stays about the same, and then begins to increase again once Z \u2265 75.) 
Thus, in addition\nto scaling to more factors, f-LDA is more predictive than simpler multi-dimensional models.\nQualitative Results To illustrate model behavior we include a sample of output on ACL (Figure\n3). We consider the component-specific weights for each factor $\\vec{\\omega}^{(k)}_{t_k}$, which present an \u201coverview\u201d\nof each component, as well as the tuple-specific word distributions $\\phi_{\\vec{t}}$. Upon examination, we determined\nthat the first factor ($Z_1 = 20$) corresponds to topic, the second ($Z_2 = 2$) to approach (empirical\nvs. theoretical), and the third ($Z_3 = 2$) to focus (methods vs. applications). The top row shows\nwords common across all components for each factor. The bottom row shows specific $\\phi_{\\vec{t}}$. Consider\nthe topic SPEECH: the triple (SPEECH,METHODS,THEORETICAL) emphasizes the linguistic side of\nspeech processing (phonological, prosodic, etc.) while (SPEECH,APPLICATIONS,EMPIRICAL) is\npredominantly about dialogue systems and speech interfaces. We also see tuple sparsity (shaded\n\n3 Most other two-dimensional models, including SAGE [4] and multi-view LDA [10], assume that the second\nfactor is fixed and observed. Our focus in this paper is fully unsupervised models.\n\nFigure 2: (a) The document completion perplexity on two data sets. Models with \u201cW\u201d use structured\nword priors, and those with \u201cS\u201d use sparsity. Error bars indicate 90% confidence intervals. When\npooling results across all numbers of topics \u226520, we find that S is significantly better than Base with\n$p = 1.4 \\times 10^{-4}$ and SW is better than W with $p = 5 \\times 10^{-5}$ on the ACL corpus. 
(b) The distribution\nof sparsity values induced on the ACL corpus with (cid:126)Z = (20, 2, 2).\nACL\n\nCLEP\n\nCLEP\nACL\nIntrusion Accuracy\n\nTAM\nBaseline\nSparsity (S)\nWord Priors (W)\nCombined (SW)\n\nn/a\n39%\n51%\n76%\n73%\n\nn/a\n\nRelatedness Score (1\u20135)\n2.29 \u00b1 0.26\n46%\n2.35 \u00b1 0.31\n2.55 \u00b1 0.37\n38%\n2.53 \u00b1 0.48\n2.61 \u00b1 0.37\n43%\n3.56 \u00b1 0.36\n2.59 \u00b1 0.33\n45%\n2.67 \u00b1 0.55\n67% 3.90 \u00b1 0.37\n\nTable 1: Results from human judgments. The best scoring model for each data set is in bold. 90%\ncon\ufb01dence intervals are indicated for scores; scores were more varied on the CLEP corpus.\n\ntuples, in which b(cid:126)t \u2264 0.5) for poor tuples. For example, under the topic of DATA, a mostly empirical\ntopic, tuples along the THEORETICAL component are inactive.\nHuman Judgments Perplexity may not correlate with human judgments [6], which are important\nfor f-LDA since structured word priors and array sparsity are motivated in part by semantic coher-\nence. We measured interpretability based on the notion of relatedness: among components that are\ninferred to belong to the same factor, how many actually make sense together? Seven annotators\nprovided judgments for two related tasks. First, we presented annotators with two word lists (ten\nmost frequent words assigned to each tuple4) that are assigned to the same topic, along with a word\nlist randomly selected from another topic. Annotators are asked to choose the word list that does\nnot belong, i.e. an intrusion test [6]. If the two tuples from the same topic are strongly related, the\nrandom list should be easy to identify. Second, annotators are presented with pairs of word lists\nfrom the same topic and asked to judge the degree of relation using a 5-point Likert scale.\nWe ran these experiments on both corpora with 20 topics. 
For the two models without the struc-\ntured word priors, we use a symmetric prior (by optimizing only \u03c9(B) and \ufb01xing \u03c9(0) = 0), since\nsymmetric word priors can lead to better interpretability [22].5 We exclude tuples with b(cid:126)t \u2264 0.5.\nAcross all data sets and models, annotators labeled 362 triples in the intrusion experiment and 333\npairs in the scoring experiment. The results (Table 1) differ slightly from the perplexity results. The\nword priors help in all cases, but much more so on ACL. The models with sparsity are generally\nbetter than those without, even on CLEP, in contrast to perplexity where sparse models did worse.\nThis suggests that removing tuples with small b(cid:126)t values removes nonsensical tuples. Overall, the\njudgments are worse for the CLEP corpus; this appears to be a dif\ufb01cult corpus to model due to\nhigh topic diversity and low overlap across disciplines. TAM is judged to be worse than all f-LDA\nvariants when directly scored by annotators. The intrusion performance with TAM is better than or\ncomparable to the ablated versions of f-LDA, but worse than the full model. 
It thus appears that both\nthe structured priors and sparsity yield more interpretable word clusters.\n\nFootnote 4: We use frequency instead of the actual posterior because including the learned priors (which share many\nwords) could make the task unfairly easy.\n\nFootnote 5: We used an asymmetric prior for the perplexity experiments, which gave slightly better results.\n\n[Figure 2 plots. (a) Held-out perplexity (nats) versus number of \u201ctopics\u201d for ACL (K=3) and CLEP (K=2); legend: Base, S, W, SW, TAM. (b) Distribution of sparsity values: number of instances versus b, with the best-fit prior.]\n\n\u201cTopic\u201d\n\u201cSPEECH\u201d: speech, spoken, recognition, state, vocabulary, recognizer, utterances, synthesis\n\u201cI.R.\u201d: document, retrieval, documents, question, web, answering, query, answer\n\u201cM.T.\u201d: translation, machine, source, mt, parallel, french, bilingual, transfer\n. . .\n\n\u201cApproach\u201d, \u201cFocus\u201d\n\u201cEMPIRICAL\u201d, \u201cTHEORETICAL\u201d, \u201cMETHODS\u201d, \u201cAPPLICATIONS\u201d\nword, algorithm, method, accuracy, best, sentence, statistical, previously\nuser, research, project, technology, processing, science, natural, development\ntask, tasks, performance, improve, accuracy, learning, demonstrate, using\ntheory, description, formal, forms, treatment, linguistics, syntax, ed\n\nSPEECH\n  EMPIRICAL, METHODS (b=0.20)\n  EMPIRICAL, APPL. (b=1.00): dialogue, spoken, speech, dialogues, understanding, task, recognition\n  THEORETICAL, METHODS (b=0.99): speech, words, recognition, prosodic, written, phonological, spoken\n  THEORETICAL, APPL. (b=0.00)\nDATA\n  EMPIRICAL, METHODS (b=1.00): corpus, data, training, model, tagging, annotated, test\n  EMPIRICAL, APPL. (b=1.00): data, corpus, annotation, annotated, corpora, collection, xml\n  THEORETICAL, METHODS (b=0.07)\n  THEORETICAL, APPL. (b=0.02)\nMODELING\n  EMPIRICAL, METHODS (b=1.00): models, model, approach, shown, error, errors, statistical\n  EMPIRICAL, APPL. (b=0.50)\n  THEORETICAL, METHODS (b=1.00): rules, rule, model, shown, models, right, left\n  THEORETICAL, APPL. (b=0.01)\nGRAMMAR\n  EMPIRICAL, METHODS (b=1.00): parsing, parser, syntactic, tree, parse, dependency, treebank\n  EMPIRICAL, APPL. (b=0.57): grammar, parsing, based, robust, component, processing, linguistic\n  THEORETICAL, METHODS (b=1.00): grammar, parsing, grammars, structures, paper, formalism, based\n  THEORETICAL, APPL. (b=1.00): grammar, grammars, formalism, parsing, based, efficient, unification\n\nFigure 3: Example output from the ACL corpus with Z\u20d7 = (20, 2, 2). Above: The top words (based\non their \u03c9 values) for a few components from three factors. Below: A three-dimensional table\nshowing a sample of four topics (i.e. components of the first factor) with their top words (based on\ntheir \u03c6 values) as they appear in all combinations of factors. The components in the top table are\ncombined to create 3-tuples in the bottom table. Shaded cells (b \u2264 0.5) are inactive. 
The names of\nfactors and their components in quotes are manually assigned through post-hoc analysis.\n\nSparsity Patterns. Finally, we examine the learned sparsity patterns: how much of B is close to 0\nor 1? Figure 2b shows a histogram of b_t\u20d7 values (ACL with 20 topics, 3 factors) pooled across five\nsampling chains. The majority of values are close to 0 or 1, effectively capturing a sparse binary\narray. The higher variance near 0 relative to 1 suggests that the model prefers to keep bits \u201con\u201d\n(and give tuples tiny probability) rather than \u201coff.\u201d This suggests that a model with a hard constraint\nmight struggle to \u201cturn off\u201d bits during inference.\nWhile we fixed the Beta parameters in our experiments, these can be tuned to control sparsity. The\nmodel will favor more \u201con\u201d than \u201coff\u201d bits by setting \u03b3_1 > \u03b3_0, or vice versa. When \u03b3 > 1, the Beta\ndistribution no longer favors sparsity; we confirmed empirically that this leads to b_t\u20d7 values that are\ncloser to 0.8 or 0.9 rather than 1. In contrast, setting \u03b3 \u226a 0.1 yields more extreme values near 0\nand 1 than with \u03b3 = 0.1 (e.g. 0.9999 instead of 0.991), but this does not greatly affect the number of\nnon-binary values. Thus, a sparse prior alone cannot fully satisfy our preference that B is binary.\n\nComparison to LDA. The runtimes of samplers for LDA and f-LDA are on the same order (but\nwe have not investigated differences in mixing time). Our f-LDA implementation is one to two\ntimes slower per iteration than our own comparable LDA implementation (with hyperparameter\noptimization using the methods in [25]). We did not observe a consistent pattern regarding the\nperplexity of the two models. Averaged across all numbers of topics, the perplexity of LDA was\n97% of the perplexity of f-LDA on ACL and 104% on CLEP. 
Note that our experiments always use a\ncomparable number of word distributions; thus Z\u20d7 = (20, 2, 2) corresponds to Z = 80 topics in LDA (20 \u00d7 2 \u00d7 2 = 80).\n\n6 Conclusion\n\nWe have presented factorial LDA, a multi-dimensional text model that can incorporate an arbitrary\nnumber of factors. To encourage the model to learn the desired patterns, we developed two new\ntypes of priors: word priors that share features across factors, and a sparsity prior that restricts the\nset of active tuples. We have shown both qualitatively and quantitatively that f-LDA is capable of\ndiscovering interpretable patterns even in multi-dimensional spaces.\n\nAcknowledgements\n\nWe are grateful to Jason Eisner, Matthew Gormley, Nicholas Andrews, David Mimno, and the\nanonymous reviewers for helpful discussions and feedback. This work was supported in part by\na National Science Foundation Graduate Research Fellowship under Grant No. DGE-0707427.\n\nReferences\n[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 2003.\n[2] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. In WWW, 2007.\n[3] M. Paul and R. Girju. A two-dimensional topic-aspect model for discovering multi-faceted topics. In AAAI, 2010.\n[4] J. Eisenstein, A. Ahmed, and E. P. Xing. Sparse additive generative models of text. In ICML, 2011.\n[5] W. Y. Wang, E. Mayfield, S. Naidu, and J. Dittmar. Historical analysis of legal opinions with a sparse mixed-effects latent variable model. In ACL, pages 740\u2013749, July 2012.\n[6] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei. Reading tea leaves: How humans interpret topic models. In NIPS, 2009.\n[7] D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI, 2008.\n[8] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In NIPS, 2006.\n[9] S. 
Williamson, C. Wang, K. Heller, and D. Blei. The IBP-compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.\n[10] A. Ahmed and E. P. Xing. Staying informed: supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP, pages 1140\u20131150, 2010.\n[11] M. Paul and R. Girju. Cross-cultural analysis of blogs and forums with mixed-collection topic models. In EMNLP, pages 1408\u20131417, August 2009.\n[12] D. Zhang, C. Zhai, J. Han, A. Srivastava, and N. Oza. Topic modeling for OLAP on multidimensional text databases: topic cube and its applications. Statistical Analysis and Data Mining, 2, 2009.\n[13] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, 1999.\n[14] S. Dasgupta and V. Ng. Mining clustering dimensions. In ICML, 2010.\n[15] I. Porteous, E. Bart, and M. Welling. Multi-HDP: a non parametric Bayesian model for tensor factorization. In AAAI, pages 1487\u20131490, 2008.\n[16] L. Mackey, D. Weiss, and M. I. Jordan. Mixed membership matrix factorization. In ICML, 2010.\n[17] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Comput., 14:1771\u20131800, August 2002.\n[18] J. Boyd-Graber and D. Blei. Syntactic topic models. In NIPS, 2008.\n[19] M. R. Gormley, M. Dredze, B. Van Durme, and J. Eisner. Shared components topic models. In NAACL, 2010.\n[20] C. Wang and D. Blei. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In NIPS, 2009.\n[21] L. Meier, S. van de Geer, and P. B\u00fchlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society Series B, 70(1):53\u201371, 2008.\n[22] H. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In NIPS, 2009.\n[23] T. Griffiths and M. Steyvers. Finding scientific topics. In Proceedings of the National Academy of Sciences of the United States of America, 2004.\n[24] M. 
Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI, 2004.\n[25] M. J. Paul. Mixed membership Markov models for unsupervised conversation modeling. In EMNLP-CoNLL, 2012.\n", "award": [], "sourceid": 1224, "authors": [{"given_name": "Michael", "family_name": "Paul", "institution": null}, {"given_name": "Mark", "family_name": "Dredze", "institution": null}]}