{"title": "Capturing Semantically Meaningful Word Dependencies with an Admixture of Poisson MRFs", "book": "Advances in Neural Information Processing Systems", "page_first": 3158, "page_last": 3166, "abstract": "We develop a fast algorithm for the Admixture of Poisson MRFs (APM) topic model and propose a novel metric to directly evaluate this model. The APM topic model recently introduced by Inouye et al. (2014) is the first topic model that allows for word dependencies within each topic, unlike previous topic models such as LDA that assume independence between words within a topic. Research in both the semantic coherence of topic models (Mimno et al. 2011, Newman et al. 2010) and measures of model fitness (Mimno & Blei 2011) provides strong support that explicitly modeling word dependencies---as in APM---could be both semantically meaningful and essential for appropriately modeling real text data. Though APM shows significant promise for providing a better topic model, APM has a high computational complexity because $O(p^2)$ parameters must be estimated where $p$ is the number of words (Inouye et al. could only provide results for datasets with $p = 200$). In light of this, we develop a parallel alternating Newton-like algorithm for training the APM model that can handle $p = 10^4$ as an important step towards scaling to large datasets. In addition, Inouye et al. only provided tentative and inconclusive results on the utility of APM. Thus, motivated by simple intuitions and previous evaluations of topic models, we propose a novel evaluation metric based on human evocation scores between word pairs (i.e. how much one word \"brings to mind\" another word (Boyd-Graber et al. 2006)). We provide compelling quantitative and qualitative results on the BNC corpus that demonstrate the superiority of APM over previous topic models for identifying semantically meaningful word dependencies. 
(MATLAB code available at: http://bigdata.ices.utexas.edu/software/apm/)", "full_text": "Capturing Semantically Meaningful Word Dependencies with an Admixture of Poisson MRFs\n\nDavid I. Inouye\n\nPradeep Ravikumar\n\nInderjit S. Dhillon\n\nDepartment of Computer Science\n\n{dinouye,pradeepr,inderjit}@cs.utexas.edu\n\nUniversity of Texas at Austin\n\nAbstract\n\nWe develop a fast algorithm for the Admixture of Poisson MRFs (APM) topic model [1] and propose a novel metric to directly evaluate this model. The APM topic model recently introduced by Inouye et al. [1] is the first topic model that allows for word dependencies within each topic, unlike previous topic models such as LDA that assume independence between words within a topic. Research in both the semantic coherence of topic models [2, 3, 4, 5] and measures of model fitness [6] provides strong support that explicitly modeling word dependencies\u2014as in APM\u2014could be both semantically meaningful and essential for appropriately modeling real text data. Though APM shows significant promise for providing a better topic model, APM has a high computational complexity because O(p^2) parameters must be estimated where p is the number of words ([1] could only provide results for datasets with p = 200). In light of this, we develop a parallel alternating Newton-like algorithm for training the APM model that can handle p = 10^4 as an important step towards scaling to large datasets. In addition, Inouye et al. [1] only provided tentative and inconclusive results on the utility of APM. Thus, motivated by simple intuitions and previous evaluations of topic models, we propose a novel evaluation metric based on human evocation scores between word pairs (i.e. 
how much one word \u201cbrings to mind\u201d another word [7]). We provide compelling quantitative and qualitative results on the BNC corpus that demonstrate the superiority of APM over previous topic models for identifying semantically meaningful word dependencies. (MATLAB code available at: http://bigdata.ices.utexas.edu/software/apm/)\n\n1 Introduction and Related Work\n\nIn standard topic models such as LDA [8, 9], the primary representation for each topic is simply a list of top 10 or 15 words. To understand a topic, a person must manually consider many of the possible (10 choose 2) = 45 pairwise relationships as well as possibly larger m-wise relationships and attempt to infer abstract meaning from this list of words. Of all the (10 choose 2) pairwise relationships, probably only a very small number of them are direct relationships. For example, a topic with the list of words \u201cmoney\u201d, \u201cfund\u201d, \u201cexchange\u201d and \u201ccompany\u201d can be understood as referring to investment, but this can only be inferred from a very high-level human abstraction of meaning. This problem has given rise to research on automatically labeling topics with a topic word or phrase that summarizes the topic [10, 11, 12]. [13] propose to evaluate topic models by randomly replacing a topic word with a random word and evaluating whether a human can identify the intruding word. The intuition for this metric is that the top words of a good topic will be related, and therefore, a person will be able to easily identify the word that does not have any relationship to the other words. [2, 3, 5] compute statistics related to Pointwise Mutual Information for all pairs of top words in a topic and attempt to correlate this with human judgments. 
All of these metrics suggest that capturing semantically meaningful relationships between pairs of words is fundamental to the interpretability and usefulness of topic models as a document summarization and exploration tool.\nIn light of these metrics, [1] recently proposed a topic model called Admixture of Poisson MRFs (APM) that relaxes the independence assumption for the topic distributions and explicitly models word dependencies. This can be motivated in part by [6] who investigated whether the Multinomial (i.e. independent) assumption of word-topic distributions actually fits real-world text data. Somewhat unsurprisingly, [6] found that the Multinomial assumption was often violated, and this gives evidence that models with word dependencies\u2014such as APM\u2014may be a fundamentally more appropriate model for text data.\nPrevious research in topic modeling has implicitly uncovered this issue with model misfit by finding that models with 50, 100 or even 500 topics tend to perform better on semantic coherence experiments than smaller models with only 10 or 20 topics [4]. Though using more topics may allow topic models to sidestep the issue of word dependencies, adding topics does not keep improving coherence, as suggested by [4] who found that using 100 or 500 topics did not significantly improve the coherence results over 50 topics. Intuitively, a topic model with a much smaller number of topics (e.g. 5 or 10) is easier to comprehend. For instance, if training on newspaper text, the number of topics could roughly correspond to the number of sections in a newspaper such as news, weather and sports. Or, if modeling an encyclopedia, the top-level topics could be art, history, science, and society. 
Thus, rather than using more topics, APM opens the way for a promising topic model that can overcome this model misfit issue while only using a small number of topics.\nEven though APM shows promise for being a significantly more powerful and more realistic topic model than previous models, the original paper acknowledged the significant computational complexity. Instead of needing to fit O(k(n + p)) parameters, APM needs to estimate O(k(n + p^2)) parameters. [1] suggested that by using a sparsity prior (i.e. \u21131 regularization of the likelihood), this computational complexity could be reduced. However, [1] could only produce some quantitative results on a very small dataset with only 200 words. In addition, the quantitative results from [1] were tentative and inconclusive on whether APM could actually perform better than LDA in coherence experiments.\nTherefore, in this paper, we seek to answer two major open questions regarding APM: 1) Is there an algorithm that can overcome the computational complexity of APM and handle real-world datasets? 2) Does the APM model actually capture more semantically interesting concepts that were not possible with previous topic models? We answer the first question by developing a parallel alternating algorithm whose independent subproblems are solved using a Newton-like algorithm similar to the algorithms developed for sparse inverse covariance estimation [14]. As in [14], this new APM algorithm exploits the sparsity of the solution to significantly reduce the computational time for computing the approximate Newton direction. However, unlike [14], the APM model solves for k Poisson MRFs simultaneously whereas [14] only solves for a single Gaussian MRF. 
Another\ndifference from [14] is that the whole algorithm can be easily parallelized up to min(n, p).\nFor the second question about the semantic utility of APM, we develop a novel evaluation metric that\nmore directly evaluates the APM model against human judgments of semantic relatedness\u2014a notion\ncalled evocation introduced by [7]. Intuitively, the idea is that humans seek to understand traditional\ntopic models by looking at the list of top words. They will implicitly attempt to \ufb01nd how these\nwords are related and extract some more abstract meaning that generalizes the set of words. Thus,\nthis evaluation metric attempts to explicitly score how well pairs of words capture some semantically\nmeaningful word dependency. Previous research has evaluated topic models using word similarity\nmeasures [4]. However, our work is different from [4] in three signi\ufb01cant ways: 1) our metrics use\nevocation rather than similarity (e.g. antonyms should have high evocation but low similarity), 2) we\nevaluate top individual word pairs instead of rough aggregate statistics, and 3) we evaluate a topic\nmodel that directly captures word dependencies (i.e. APM). We demonstrate that APM substantially\noutperforms other topic models in both quantitative and qualitative ways.\n\n2 Background on Admixture of Poisson MRFs (APM)\n\nAdmixtures The general notion of admixtures introduced by [1] generalizes many previous topic\nmodels including PLSA [15], LDA [8], and the Spherical Admixture Model (SAM) [16]. 
Admixtures have also been known as mixed membership models or, more precisely, partial membership models (see [17] for an excellent overview and discussion of mixed and partial membership models). In contrast to mixture distributions which assume that each observation is drawn from 1 of k component distributions, admixture distributions assume that each observation is drawn from an admixed distribution whose parameters are a mixture of component parameters. As examples of admixtures, PLSA and LDA are admixtures of Multinomials whereas SAM is an admixture of von Mises-Fisher distributions. In addition, because of the connections between Poissons and Multinomials, PLSA and LDA can be seen as admixtures of independent Poisson distributions [1].\n\nPoisson MRFs (PMRF) Yang et al. [18] introduced a multivariate generalization of the Poisson that assumes that the conditional distributions are univariate Poisson, which is similar to a Gaussian MRF whose conditionals are Gaussian (unlike a Gaussian MRF, however, the marginals are not univariate Poisson). A PMRF can be parameterized by a node vector \u03b8 and an edge matrix \u0398 whose non-zeros encode the direct dependencies between words: Pr_PMRF(x | \u03b8, \u0398) = exp(\u03b8^T x + x^T \u0398 x \u2212 \u2211_{s=1}^p ln(x_s!) \u2212 A(\u03b8, \u0398)), where A(\u03b8, \u0398) is the log partition function needed for normalization. This formulation needs to be slightly modified to allow for positive edges using the ideas from [19]. The log partition function can be approximated by using the pseudo log-likelihood instead of the true likelihood, which means that A(\u03b8, \u0398) \u2248 \u2211_{s=1}^p exp(\u03b8_s + x^T \u0398_s). The reader should note that because this is an MRF distribution, all the properties of MRFs apply to PMRFs including that a word is independent of all other words given the value of its neighbors. 
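As an aside, the pseudo-likelihood above can be written down in a few lines. The following is an illustrative pure-Python sketch under our own naming (the paper's released code is MATLAB), for a PMRF with node vector theta and zero-diagonal edge matrix Theta:

```python
import math

def pmrf_pseudo_loglik(theta, Theta, x):
    # Pseudo log-likelihood of a Poisson MRF: each count x[s] is treated as
    # conditionally Poisson with log-rate eta = theta[s] + sum_t Theta[s][t]*x[t]
    # over its neighbors (Theta has a zero diagonal, so a word never
    # conditions on itself).
    p = len(x)
    ll = 0.0
    for s in range(p):
        eta = theta[s] + sum(Theta[s][t] * x[t] for t in range(p) if t != s)
        ll += x[s] * eta - math.exp(eta) - math.lgamma(x[s] + 1)
    return ll
```

With Theta set to all zeros this reduces to a sum of independent Poisson log-pmfs, consistent with the remark that PLSA and LDA are admixtures of independent Poissons.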
For example, in a chain graph, all the variables are correlated with each other but they have a much simpler dependency structure that can be encoded with O(p) parameters. Therefore, PMRFs more directly and succinctly capture the dependencies between words as opposed to other simple statistics such as covariance or pointwise mutual information.\n\nAdmixture of Poisson MRFs (APM) Inouye et al. [1] essentially constructed a new admixture model by using Poisson MRFs as the topic-word distributions instead of the usual Multinomial as in LDA. This allows for word dependencies within each topic. For example, if the word \u201cclassification\u201d appears in a document, \u201csupervised\u201d is more likely to appear than in general documents. Given the admixture weights vector w for a document, the likelihood of a document is simply: Pr_APM(x | w, \u03b8^{1...k}, \u0398^{1...k}) = Pr_PMRF(x | \u03b8 = \u2211_{j=1}^k w_j \u03b8^j, \u0398 = \u2211_{j=1}^k w_j \u0398^j) (please see Appendix A for notational conventions used throughout the paper). Inouye et al. [1] define a Dirichlet(\u03b1) prior on the admixture weights and a conjugate prior with hyperparameter \u03b2 on the PMRF parameters which can be easily incorporated as pseudo counts. For our experiments as described in Sec. 4.1, we set \u03b1 = 1 (i.e. a uniform prior on admixture weights) and \u03b2 = {0, 1}.\n\n3 Parallel Alternating Newton-like Algorithm for APM\n\nIn the original APM paper [1], parameters were estimated by maximizing the joint approximate posterior over all variables.1 Instead of maximizing jointly over all parameters, we split the problem into alternating convex optimization problems. Let us denote the likelihood part (i.e. the smooth part) of the optimization function as g(W, \u03b8^{1...k}, \u0398^{1...k}) and the non-smooth \u21131 regularization term as h, where the full negative posterior is defined as f = g + h. 
The smooth part of the approximate posterior can be written as:\n\ng = \u2212(1/n) \u2211_{i=1}^n \u2211_{s=1}^p [ \u2211_{j=1}^k w_ij x_is (\u03b8^j_s + x_i^T \u0398^j_s) \u2212 exp( \u2211_{j=1}^k w_ij (\u03b8^j_s + x_i^T \u0398^j_s) ) ], (1)\n\nwhere x_i is the word-count vector for the ith document, w_i is the admixture weight vector for the ith document, and \u03b8^j and \u0398^j are the PMRF parameters for the jth component (see Appendix B for derivation). By writing g in this form, it is straightforward to see that even though the whole optimization problem is not convex because of the interaction between the admixture weights W and the PMRF parameters, the problem is convex if either the admixture weights W or the component parameters \u03b8^{1...k}, \u0398^{1...k} are held fixed.\n\n1This posterior approximation was based on the pseudo-likelihood while ignoring the symmetry constraint so that nodewise regression parameters are independent. This leads to an overcomplete parameterization for APM. For an overview of composite likelihood methods, see [20]. For a comparison of pseudo-likelihood versus nodewise regressions, see [21]. 
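For concreteness, Eq. (1) can be transcribed directly. The function below is an illustrative sketch (our own naming, plain Python lists rather than the paper's MATLAB matrices):

```python
import math

def smooth_objective(X, W, thetas, Thetas):
    # g from Eq. (1): X is an n x p count matrix, W an n x k matrix of
    # admixture weights, thetas[j] the node vector and Thetas[j] the edge
    # matrix of topic j (Thetas[j][t][s] is the edge weight between words t, s).
    n, p, k = len(X), len(X[0]), len(W[0])
    total = 0.0
    for i in range(n):
        for s in range(p):
            # admixed canonical parameter for word s in document i:
            # sum_j w_ij * (theta^j_s + x_i^T Theta^j_s)
            eta = sum(W[i][j] * (thetas[j][s] +
                                 sum(Thetas[j][t][s] * X[i][t]
                                     for t in range(p)))
                      for j in range(k))
            total += X[i][s] * eta - math.exp(eta)
    return -total / n
```

Holding W fixed, eta is linear in the topic parameters (and vice versa), which is exactly why each alternating subproblem is convex.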
To simplify the notation in the following sections, we combine the node (which is analogous to an intercept term in regression) and edge parameters by defining z_i = [1 x_i^T]^T, \u03c6^j_s = [\u03b8^j_s (\u0398^j_s)^T]^T and \u03a6_s = [\u03c6^1_s \u00b7\u00b7\u00b7 \u03c6^k_s]. Thus, we can alternate between optimizing two similar optimization problems where one has a non-smooth \u21131 regularization and the other has the constraint that w_i must lie on the simplex \u0394^k:\n\narg min_{\u03a6_1, \u03a6_2, \u00b7\u00b7\u00b7, \u03a6_p} \u2212(1/n) \u2211_{s=1}^p [ tr(\u03a8_s \u03a6_s) \u2212 \u2211_{i=1}^n exp(z_i^T \u03a6_s w_i) ] + \u2211_{s=1}^p \u03bb ||vec(\u03a6_s)\\1||_1, (2)\n\narg min_{w_1, w_2, \u00b7\u00b7\u00b7, w_n \u2208 \u0394^k} \u2212(1/n) \u2211_{i=1}^n [ \u03c8_i^T w_i \u2212 \u2211_{s=1}^p exp(z_i^T \u03a6_s w_i) ], (3)\n\nwhere \u03c8_i and \u03a8_s are constants in the optimization that can be computed from the data matrix X and the other parameters that are being held fixed (see Alg. 2 in Appendix D for computation of \u03a8_s). This alternating scheme is analogous to Alternating Least Squares (ALS) for Non-negative Matrix Factorization (NMF) [22] and EM-like algorithms such as k-means. By writing the optimization as in Eq. 2 and Eq. 3, we also expose the simple independence between the subproblems because they are simple summations. Thus, we can easily parallelize both optimization problems up to min(n, p) with little overhead and simple changes to the code\u2014in our MATLAB implementation, we only changed a for loop to a parfor loop.\n\n3.1 Newton-like Algorithms for Subproblems\n\nFor each of the subproblems, we develop Newton-like optimization algorithms. For the component PMRFs, we borrow several important ideas from [14] including fixed and free sets of variables for the \u21131 regularized optimization problem. 
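The for-to-parfor change mentioned above has a direct analogue in other languages. The sketch below is a hypothetical Python stand-in, where solve_node_subproblem is a placeholder for the per-word problem of Eq. 2, not the paper's actual solver:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_node_subproblem(s):
    # Placeholder for the l1-regularized subproblem of Eq. (2) for word s;
    # a real implementation would run the Newton-like solver here.
    return s * s

def solve_all_nodes(p, workers=4):
    # The p node subproblems share no state, so they can be dispatched to a
    # worker pool, mirroring the for -> parfor change in the MATLAB code.
    # (For CPU-bound work in Python, a process pool would replace threads.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(solve_node_subproblem, range(p)))
```

The same pattern applies to the n document subproblems of Eq. 3, which is why the speedup is bounded by min(n, p).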
The overall idea is to construct a quadratic approximation around the current solution and approximately optimize this simpler function to find a step direction. Usually, finding the Newton direction requires computing the Hessian for all the optimization variables, but because of the \u21131 regularization, we only need to focus on variables that might be non-zero. This set of free variables, denoted F, can be simply determined from the gradient and current iterate [14]. Since usually there is only a small number of free variables compared to fixed variables (i.e. \u03bb is large enough), we can simply run coordinate descent on these free variables and only implicitly calculate Hessian information as needed in each coordinate descent step. After finding an approximate Newton direction, we find a step size that satisfies the Armijo rule and then update the iterate (see Alg. 2 in Appendix D).\nWe also employed a similar Newton-like algorithm for estimating the admixture weights. Instead of the \u21131 regularization term, however, this subproblem has the constraint that the admixture weights w_i must lie on the simplex so that each document can be properly interpreted as a convex mixture over topic parameters. For this constraint, we used a dual-coordinate descent algorithm to find the approximate Newton direction as in [23].\nFinally, we put both subproblem algorithms together and alternate between the two (see Alg. 1 in Appendix D). For tracing through different \u03bb parameters, \u03bb is initially set to \u221e so that the model trains an independent APM model first. Then, the initial \u03bb = \u03bbmax is found by computing the largest gradient of the final independent iteration. 
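The free-set selection and Armijo backtracking steps described above can be sketched as follows. This is an illustrative simplification (our own naming): the real solver applies the Armijo rule to the full regularized objective f = g + h, while this sketch shows the smooth-case condition:

```python
def free_set(x, grad, lam):
    # QUIC-style rule [14]: a coordinate is free (allowed to move) if it is
    # already nonzero or its gradient magnitude exceeds the l1 threshold lam;
    # all other coordinates stay fixed at zero for this Newton step.
    return [i for i in range(len(x))
            if x[i] != 0.0 or abs(grad[i]) > lam]

def armijo_step(f, x, d, g_dot_d, beta=0.5, sigma=0.25, max_iter=30):
    # Backtracking line search: shrink alpha until the Armijo condition
    # f(x + alpha*d) <= f(x) + sigma * alpha * grad(f)^T d holds.
    fx = f(x)
    alpha = 1.0
    for _ in range(max_iter):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if f(trial) <= fx + sigma * alpha * g_dot_d:
            return alpha
        alpha *= beta
    return alpha
```

When lam is large, free_set returns only a handful of coordinates, which is what makes the coordinate-descent Newton approximation cheap.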
Every time the alternating algorithm converges, the value of \u03bb is decreased so that a set of models is trained for decreasing values of \u03bb.\n\n3.2 Timing Results\n\nWe conducted two main timing experiments to show that the algorithm can be efficiently parallelized and that the algorithm can scale to reasonably large datasets. For the parallel timing experiment, we used the BNC corpus described in Sec. 4.1 (n = 4049, p = 1646) and fixed k = 5, \u03bb = 8 and a total of 30 alternating iterations. For the large data experiment, we used a Wikipedia dataset formed from a recent Wikipedia dump by choosing the top 10k words neglecting stop words and then selecting the longest documents. We ran several main iterations of the algorithm with this dataset while fixing the parameters k = 5 and \u03bb = 0.5. All timing experiments were conducted on the TACC Maverick system with Intel Xeon E5-2680 v2 Ivy Bridge CPUs (2.80 GHz), 20 CPUs per node, and 12.8 GB memory per CPU (https://www.tacc.utexas.edu/).\nThe parallel timing results can be seen in Fig. 1 (left), which shows that the algorithm has almost linear speedup when parallelizing across multiple workers. Though we only had access to a single computer with 20 processors, substantially more speedup could be obtained by using more processors on a distributed computing system. This simple parallelism makes this algorithm viable for much larger datasets. The timing results for the Wikipedia dataset can be seen in Fig. 1 (right). These results give an approximate computational complexity of O(np^2), which shows that the proposed algorithm has the potential to scale to datasets where n is O(10^5) and p is O(10^4). The O(p^2) comes from the fact that there are p subproblems and each subproblem needs to calculate the gradient, which is O(p), as well as approximate the Newton direction for a subset of the variables. 
The first iteration takes longer because the initial parameter values are naively set to 0 whereas future iterations start from a reasonable initial value.\n\nFigure 1: (left) The speedup on the BNC dataset shows that the algorithm scales approximately linearly with the number of workers because the subproblems are all independent. (right) The timing results on the Wikipedia dataset show that the algorithm scales to larger datasets and has a computational complexity of approximately O(np^2).\n\n4 Evocation Metric\n\nBoyd-Graber et al. [7] introduced the notion of evocation which denotes the idea of which words \u201cevoke\u201d or \u201cbring to mind\u201d other words. There can be many types of evocation including the following examples from [7]: [rose - flower] (example), [brave - noble] (kind), [yell - talk] (manner), [eggs - bacon] (co-occurrence), [snore - sleep] (setting), [wet - desert] (antonymy), [work - lazy] (exclusivity), and [banana - kiwi] (likeness). This is distinctive from word similarity or synonymy since two words can have very different meanings but still \u201cbring to mind\u201d the other word (e.g. antonyms). This notion of word relatedness is much simpler but potentially more semantically meaningful and interpretable than word similarity. For instance, \u201cwork\u201d and \u201clazy\u201d do not have similar meanings but are related through the semantic meanings of the words. Another difference is that\u2014unlike word semantic similarity\u2014words that generally appear in very different contexts yet mean the same thing would probably not have a high evocation score. 
For example, \u201cnetworks\u201d and \u201cgraphs\u201d both have a definition that means a set of nodes and edges, yet usually one word is chosen in a particular context.\nRecent work in evaluating topic models [2, 3, 4, 5] formulates automated metrics based on automatically scoring all pairs of top words and noticing that they correlate with human judgment of overall topic coherence. All of these metrics are based on the common assumption that a person should be able to understand a topic by understanding the abstract semantic connections between the word pairs. Thus, evocation is a reasonable notion for evaluating topic modeling because it directly evaluates the level of semantic connection between word pairs. In addition, this new evocation metric provides a way to explicitly evaluate the edge matrices of APM, which would be ignored in previous metrics because explicit word dependencies are not modeled in other topic models.\nWe now formally define our evocation metric. Given human-evaluated scores for a subset of word pairs H and the corresponding weights given by a topic model for this subset of word pairs M, let us define \u03c0_M(j) to be an ordering of the word pairs induced by M such that M_{\u03c0(1)} \u2265 M_{\u03c0(2)} \u2265 \u00b7\u00b7\u00b7 \u2265 M_{\u03c0(|H|)}. Then, the top-m evocation metric is simply:\n\nEvoc_m(M, H) = \u2211_{j=1}^m H_{\u03c0_M(j)} . (4)\n\nNote that the scaling of M is inconsequential because M is only needed to define an ordering or ranking of H. For example, \u02c6M = \u03b1 exp(M) would yield the same evocation score for all scalar 
values \u03b1 > 0 because the ordering would be maintained. Essentially, M merely induces an ordering of the word pairs and the evocation score is the sum of the human scores for these top m word pairs.\nFor APM, the word pair weights come primarily from the PMRF edge matrices \u0398^{1...k}\u2014the PMRF node vectors are only used to provide an ordering if there are not enough non-zeros in the edge matrices. For the other Multinomial-based topic models, which do not have parameters explicitly associated with word pairs, we can compute the most likely word pairs in a topic by multiplying their corresponding marginal probabilities. This weighting corresponds to the probability that two independent draws from the topic distribution produce the word pair and thus is the most obvious choice for Multinomial-based topic models.\nSince this metric only gives a way to evaluate one topic, we consider two ways of determining the overall evocation score for the whole topic model: Evoc-1 = (1/k) \u2211_{j=1}^k Evoc_m(M^j, H) and Evoc-2 = Evoc_m((1/k) \u2211_{j=1}^k M^j, H). In words, these are \u201caverage evocation of topics\u201d and \u201cevocation of average topic\u201d respectively. Evoc-1 measures whether all or at least most topics capture meaningful word associations since it can be affected by uninteresting topics. Evoc-2 is reasonable for measuring whether the topic model as a whole is capturing word semantics even if some of the topics are not capturing interesting word associations. This second measure has some relation to the word similarity measure of topic coherence in [4]. However, [4] uses similarity rather than evocation, does not directly evaluate top individual word pairs and does not evaluate any models with word dependencies such as APM.\n\n4.1 Experimental Setup\n\nHuman-Scored Evocation Dataset The original human-scored evocation dataset was produced by a set of trained undergraduates in which 1,000 words were hand selected primarily based on their frequency and usage in the British National Corpus (BNC) [7]. From the possible pairwise evaluations, approximately 10% of the word pairs were randomly selected to be manually scored by a set of trained undergraduates. The second dataset was constructed by predicting the pairs of words that were likely to have a high evocation using a standard machine learning classifier. This new set of pairs was scored using Amazon MTurk (mturk.com) by using the original dataset as a control [24]. Though these scores are between synsets\u2014which are a word, part-of-speech and sense triplet\u2014we mapped all of the synsets to word, part-of-speech pairs since that is the only information we have for the BNC corpus. This led to a total of 1646 words. In addition, though the evocation dataset has scores for directed relationships (i.e. word1 \u2192 word2 could have a different score than word2 \u2192 word1), we averaged these two scores because the directionality of the relationship is not modeled by APM or any other topic model.\n\nBNC Corpus Because the evocation dataset was based on the BNC corpus, we used the BNC corpus for our experiments. We processed the BNC corpus by lemmatizing each word using the WordNetLemmatizer included in the nltk package (nltk.org) and then attaching the part-of-speech, which is already included in the BNC corpus. 
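For reference, the evocation scores defined in the previous section are simple to compute once word-pair weights are available. A minimal Python sketch (our own naming; M and H are dictionaries keyed by word pair):

```python
def evoc_m(M, H, m):
    # Eq. (4): rank the scored word pairs by the model weight M (descending)
    # and sum the human evocation scores H of the top-m pairs.
    ranked = sorted(H, key=lambda pair: -M[pair])
    return sum(H[pair] for pair in ranked[:m])

def evoc_1(topic_Ms, H, m):
    # Evoc-1: average evocation of topics.
    return sum(evoc_m(Mj, H, m) for Mj in topic_Ms) / len(topic_Ms)

def evoc_2(topic_Ms, H, m):
    # Evoc-2: evocation of the average topic. The 1/k factor is dropped
    # because Evoc_m only uses M for ranking, which is scale-invariant.
    avg = {pair: sum(Mj[pair] for Mj in topic_Ms) for pair in H}
    return evoc_m(avg, H, m)
```

Ties in M are broken arbitrarily here; the scale-invariance noted in the text is visible in evoc_2, where summing instead of averaging leaves the ranking unchanged.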
We only retained the counts for the 1646 words that occurred in the human-scored datasets but processed all 4049 documents in the corpus.\n\nAPM Model Parameters We trained APM on the BNC corpus with several different parameter settings including various \u03bb and \u03b2 parameter settings. We also trained two particular APM models denoted APM-LowReg and APM-HeldOut. APM-LowReg uses a very small regularization parameter so that almost all edges are non-zero. APM-HeldOut automatically selects a reasonable value for \u03bb based on the likelihood of a held-out set of the documents. Thus, the APM-HeldOut model does not require a user-specified \u03bb parameter but\u2014as seen in the following sections\u2014still performs reasonably well even compared to the APM model in which many different parameter settings are attempted. In addition, APM-HeldOut can stop the training early when the model begins to overfit the data rather than tracing through all the \u03bb parameters\u2014this could lead to a significant gain in model training time. The authors suggest that APM-HeldOut is a simple baseline model for future comparison if a user does not want to specify \u03bb.\n\nOther Models For comparison, we trained five other models: Correlated Topic Models (CTM), Hierarchical Dirichlet Process (HDP), Latent Dirichlet Allocation (LDA), Replicated Softmax (RSM), and a naive random baseline (RND). CTM models correlations between topics [25]. HDP is a non-parametric Bayesian model that selects the number of topics based on the input data and hyperparameters [26]. The standard topic model LDA was trained using MALLET [27]. LDA was trained for at least 5,000 iterations and HDP was trained for at least 300 iterations since HDP is computationally expensive. RSM is an undirected topic model based on Restricted Boltzmann Machines (RBM) [28]. 
The random model is merely the expected evocation score if edges are ranked at random. We ran a full factorial experimental setting where all the combinations of a set of parameter values were trained to give a fair comparison between models (see Appendix C for a summary of parameter values). All these comparison models only indirectly model dependencies between words through the latent variables since the topic distributions are Multinomials, whereas APM can directly model the dependencies between words since the topic distributions are Poisson MRFs.\n\nSelecting Best Parameters We randomly split the human scores into a 50% tuning split and a 50% testing split. Note that we have a tuning split rather than a training split because the model training algorithms are unsupervised (i.e. they never see the human scores) so the only supervision occurs in selecting the final model parameters (i.e. during the tuning phase). Therefore, we selected the final parameters based on the tuning split and computed the final evocation scores on the test split. Thus, even when selecting the parameter settings, the modeling process never sees the test data.\n\n4.2 Main Results\n\nThe Evoc-1 and Evoc-2 scores with m = 50 for all models can be seen in Fig. 2.\u00b2 For Evoc-1, the APM models significantly outperform all other models for a small number of topics and even capture many semantically meaningful word pairs with a single topic. For higher numbers of topics, the APM models seem to perform only competitively with previous topic models. It seems that APM-LowReg performs better with a small number of topics whereas APM-HeldOut\u2014which generally chooses a relatively high \u03bb\u2014seems more robust for a large number of topics. 
These trends likely arise because the number of documents is relatively small (n = 4049), so APM-LowReg begins to significantly overfit the data as the number of topics increases, whereas APM-HeldOut does not seem to overfit as much. For all the APM models, the degradation in performance as the number of topics increases is most likely explained by the fact that a Poisson MRF with O(p²) parameters is a much more flexible distribution than a Multinomial, so fewer topics are needed to appropriately model the data. These results also give some evidence that APM can succinctly model the data with a much smaller number of topics than independent topic models need; this succinctness could be particularly helpful for the interpretability and intuitions of topic models.

Figure 2: Both Evoc-1 scores (left) and Evoc-2 scores (right) demonstrate that APM usually significantly outperforms other topic models in capturing meaningful word pairs. [Panels: Evoc-1 (Avg. Evoc. of Topics) and Evoc-2 (Evoc. of Avg. Topic); x-axis: k = 1, 3, 5, 10, 25, 50 topics; evocation computed with m = 50; models: APM, APM-LowReg, APM-HeldOut, CTM, HDP, LDA, RSM, RND.]

For the Evoc-2 score, the APM models, including the APM-HeldOut model which automatically determines λ from the data, significantly outperform previous topic models even for a large number of topics. This supports the idea that APM needs only a small number of topics to capture many of the semantically meaningful word dependencies. Thus, when increasing the number of topics beyond 5, the performance does not decrease as it does for Evoc-1. This discrepancy is likely caused by many of the edges being concentrated in a small number of topics even when the number of topics is 10 or 25. As expected from previous research on topic models, most other topic models perform slightly better with a larger number of topics. Though it is possible that using 100 or 500 topics for these topic models might give an evocation score better than APM with 5 topics, this would only reinforce the idea that APM can perform better than, or at least competitively with, previous topic models while using a comparatively small number of topics.

²For simplicity and comparability, we grouped HDP into the topic number closest to its discovered number of topics because HDP can select a variable number of topics.

Qualitative Analysis of Top 20 Word Pairs for Best LDA and APM Models To validate the intuition of using evocation as a human-grounded evaluation metric, we present the top 20 word pairs for the best standard topic model (in this case LDA) and the best APM model under the Evoc-2 metric, as seen in Table 1. The best-performing LDA model was trained with 50 topics, α = 1, and β = 0.0001. The best APM model was the APM-LowReg model trained with only 5 topics and a small regularization parameter λ = 0.05. It is important to note that the best LDA model has 50 topics while the best APM model has only 5 topics. As before, this reinforces the theme that APM can capture more semantically meaningful word pairs with a smaller number of topics than previous topic models.

Table 1: Top 20 word pairs for LDA (left) and APM (right)

As seen in Table 1, APM captures many more word pairs with a human score greater than 50, whereas LDA only captures a few. One interesting example is that LDA finds two word pairs [woman.n - wife.n] and [wife.n - man.n] that capture some semantic notion of marriage. However, APM directly captures this semantic meaning with [husband.n - wife.n]. APM also discovers several other familial relationships such as [aunt.n - uncle.n] and [mother.n - baby.n].
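Going by the panel labels in Fig. 2 ("Avg. Evoc. of Topics" versus "Evoc. of Avg. Topic"), the two metrics can plausibly be sketched as below; the exact definitions may differ from the paper's, and the helper names are ours. Each topic is represented by a symmetric word-by-word edge-weight matrix, as in a Poisson MRF.

```python
def top_pairs(W, m):
    """Return the m word-index pairs (i, j), i < j, with the largest
    weights in the symmetric edge-weight matrix W (a list of lists)."""
    p = len(W)
    edges = [(W[i][j], i, j) for i in range(p) for j in range(i + 1, p)]
    edges.sort(key=lambda e: e[0], reverse=True)
    return [(i, j) for _, i, j in edges[:m]]

def evoc_of(pairs, human):
    """Sum human evocation scores over the given pairs (0 if unscored)."""
    return sum(human.get(pair, 0) for pair in pairs)

def evoc1(topics, human, m):
    """Evoc-1: average, over topics, of the evocation of each topic's
    top-m word pairs."""
    return sum(evoc_of(top_pairs(W, m), human) for W in topics) / len(topics)

def evoc2(topics, human, m):
    """Evoc-2: evocation of the top-m word pairs of the averaged topic."""
    p = len(topics[0])
    avg = [[sum(W[i][j] for W in topics) / len(topics) for j in range(p)]
           for i in range(p)]
    return evoc_of(top_pairs(avg, m), human)
```

Ranking upper-triangle edges by weight in this way is also how top word-pair tables such as Table 1 could be produced from a trained model.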
In addition, APM identifies multiple semantically coherent yet high-level word pairs such as [residential.a - home.n], [steel.n - iron.n], [job.n - employment.n] and [question.n - answer.n], whereas LDA finds several low-level word pairs such as [member.n - give.v], [west.n - state.n] and [year.n - day.n]. These overall trends become even more evident when looking at the top 50 word pairs, which can be found in Appendix E. Both the quantitative evaluation metrics (i.e. Evoc-1 and Evoc-2) and a qualitative exploration of the top word pairs give strong evidence that APM can succinctly capture both more interesting and higher-level semantic concepts through word dependencies than previous independent topic models.

5 Conclusion and Future Work

We motivated the need for more expressive topic models that consider word dependencies, such as APM, by considering previous work on topic model evaluation metrics. We overcame the significant computational barrier of APM by providing a fast alternating Newton-like algorithm which can be easily parallelized. We proposed a new evaluation metric based on human evocation scores that seeks to measure whether a model is capturing semantically meaningful word pairs. Finally, we presented compelling quantitative and qualitative measures showing the superiority of APM in capturing semantically meaningful word pairs. In addition, this metric suggests new evaluations of topic models based on top word pairs rather than top words. One drawback of the current human-scored data is that only a small portion of the word pairs have been scored. Thus, one extension is to dynamically collect more human scores as needed for evaluation.
This work also opens the door for exciting new word-semantic applications of APM, such as Word Sense Induction using topic models [29], keyword expansion or suggestion, document summarization, and document visualization, because APM captures semantically meaningful relationships between words.

LDA                                           APM
Human Score  Word Pair                        Human Score  Word Pair
100          run.v ↔ car.n                    100          telephone.n ↔ call.n
 82          teach.v ↔ school.n                97          husband.n ↔ wife.n
 69          school.n ↔ class.n                82          residential.a ↔ home.n
 63          van.n ↔ car.n                     76          politics.n ↔ political.a
 51          hour.n ↔ day.n                    75          steel.n ↔ iron.n
 50          teach.v ↔ student.n               75          job.n ↔ employment.n
 44          house.n ↔ government.n            75          room.n ↔ bedroom.n
 44          week.n ↔ day.n                    72          aunt.n ↔ uncle.n
 38          university.n ↔ institution.n      72          printer.n ↔ print.v
 38          state.n ↔ government.n            60          love.v ↔ love.n
 38          woman.n ↔ man.n                   57          question.n ↔ answer.n
 38          give.v ↔ church.n                 57          prison.n ↔ cell.n
 38          wife.n ↔ man.n                    51          mother.n ↔ baby.n
 38          engine.n ↔ car.n                  50          sun.n ↔ earth.n
 35          publish.v ↔ book.n                50          west.n ↔ east.n
 32          west.n ↔ state.n                  44          weekend.n ↔ sunday.n
 32          year.n ↔ day.n                    41          wine.n ↔ drink.v
 25          member.n ↔ give.v                 38          south.n ↔ north.n
 25          dog.n ↔ animal.n                  38          morning.n ↔ afternoon.n
 25          seat.n ↔ car.n                    38          engine.n ↔ car.n

Acknowledgments

D. Inouye was supported by the NSF Graduate Research Fellowship via DGE-1110007. P. Ravikumar acknowledges support from ARO via W911NF-12-1-0390 and NSF via IIS-1149803, IIS-1447574, and DMS-1264033. I. Dhillon acknowledges support from NSF via CCF-1117055.

References

[1] D. I. Inouye, P. Ravikumar, and I. S. Dhillon, “Admixture of Poisson MRFs: A topic model with word dependencies,” in International Conference on Machine Learning (ICML), 2014.

[2] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum, “Optimizing semantic coherence in topic models,” in EMNLP, pp. 262–272, 2011.

[3] D. Newman, Y. Noh, E. Talley, S.
Karimi, and T. Baldwin, “Evaluating topic models for digital libraries,” in ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 215–224, 2010.

[4] K. Stevens and P. Kegelmeyer, “Exploring topic coherence over many models and many topics,” in EMNLP-CoNLL, pp. 952–961, 2012.

[5] N. Aletras and M. Stevenson, “Evaluating topic coherence using distributional semantics,” in International Conference on Computational Semantics (IWCS 2013) – Long Papers, pp. 13–22, 2013.

[6] D. Mimno and D. Blei, “Bayesian checking for topic models,” in EMNLP, pp. 227–237, 2011.

[7] J. Boyd-Graber, C. Fellbaum, D. Osherson, and R. Schapire, “Adding dense, weighted connections to WordNet,” in Proceedings of the Global WordNet Conference, 2006.

[8] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” JMLR, vol. 3, pp. 993–1022, 2003.

[9] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, pp. 5228–5235, Apr. 2004.

[10] J. H. Lau, K. Grieser, D. Newman, and T. Baldwin, “Automatic labelling of topic models,” in NAACL HLT, pp. 1536–1545, 2011.

[11] D. Magatti, S. Calegari, D. Ciucci, and F. Stella, “Automatic labeling of topics,” in ISDA, 2009.

[12] X.-L. Mao, Z.-Y. Ming, Z.-J. Zha, T.-S. Chua, H. Yan, and X. Li, “Automatic labeling hierarchical topics,” in CIKM, pp. 2383–2386, 2012.

[13] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei, “Reading tea leaves: How humans interpret topic models,” in NIPS, 2009.

[14] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar, “Sparse inverse covariance matrix estimation using quadratic approximation,” in NIPS, 2011.

[15] T.
Hofmann, “Probabilistic latent semantic analysis,” in Uncertainty in Artificial Intelligence (UAI), pp. 289–296, Morgan Kaufmann Publishers Inc., 1999.

[16] J. Reisinger, A. Waters, B. Silverthorn, and R. J. Mooney, “Spherical topic models,” in ICML, pp. 903–910, 2010.

[17] E. M. Airoldi, D. Blei, E. A. Erosheva, and S. E. Fienberg, eds., Handbook of Mixed Membership Models and Their Applications. Chapman and Hall/CRC, 2014.

[18] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu, “Graphical models via generalized linear models,” in NIPS, pp. 1367–1375, 2012.

[19] E. Yang, P. Ravikumar, G. Allen, and Z. Liu, “On Poisson graphical models,” in NIPS, pp. 1718–1726, 2013.

[20] C. Varin, N. Reid, and D. Firth, “An overview of composite likelihood methods,” Statistica Sinica, vol. 21, pp. 5–42, 2011.

[21] J. D. Lee and T. J. Hastie, “Structure learning of mixed graphical models,” in AISTATS, vol. 31, pp. 388–396, 2013.

[22] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in NIPS, pp. 556–562, 2000.

[23] H.-F. Yu, F.-L. Huang, and C.-J. Lin, “Dual coordinate descent methods for logistic regression and maximum entropy models,” Machine Learning, vol. 85, pp. 41–75, 2011.

[24] S. Nikolova, J. Boyd-Graber, C. Fellbaum, and P. Cook, “Better vocabularies for assistive communication aids: Connecting terms using semantic networks and untrained annotators,” in ACM Conference on Computers and Accessibility, pp. 171–178, 2009.

[25] D. M. Blei and J. D. Lafferty, “Correlated topic models,” in NIPS, pp. 147–154, 2005.

[26] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Hierarchical Dirichlet processes,” Journal of the American Statistical Association, vol. 101, pp. 1566–1581, Dec. 2006.

[27] A. K.
McCallum, “MALLET: A Machine Learning for Language Toolkit,” 2002.

[28] G. Hinton and R. Salakhutdinov, “Replicated softmax: An undirected topic model,” in NIPS, 2009.

[29] J. H. Lau, P. Cook, D. McCarthy, D. Newman, and T. Baldwin, “Word sense induction for novel sense detection,” in EACL, pp. 591–601, 2012.