{"title": "The Doubly Correlated Nonparametric Topic Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1980, "page_last": 1988, "abstract": "Topic models are learned via a statistical model of variation within document collections, but designed to extract meaningful semantic structure. Desirable traits include the ability to incorporate annotations or metadata associated with documents; the discovery of correlated patterns of topic usage; and the avoidance of parametric assumptions, such as manual specification of the number of topics. We propose a doubly correlated nonparametric topic (DCNT) model, the first model to simultaneously capture all three of these properties. The DCNT models metadata via a flexible, Gaussian regression on arbitrary input features; correlations via a scalable square-root covariance representation; and nonparametric selection from an unbounded series of potential topics via a stick-breaking construction. We validate the semantic structure and predictive performance of the DCNT using a corpus of NIPS documents annotated by various metadata.", "full_text": "The Doubly Correlated Nonparametric Topic Model\n\nDae Il Kim and Erik B. Sudderth\nDepartment of Computer Science\n\nBrown University, Providence, RI 02906\n\ndaeil@cs.brown.edu, sudderth@cs.brown.edu\n\nAbstract\n\nTopic models are learned via a statistical model of variation within document col-\nlections, but designed to extract meaningful semantic structure. Desirable traits\ninclude the ability to incorporate annotations or metadata associated with docu-\nments; the discovery of correlated patterns of topic usage; and the avoidance of\nparametric assumptions, such as manual speci\ufb01cation of the number of topics. We\npropose a doubly correlated nonparametric topic (DCNT) model, the \ufb01rst model\nto simultaneously capture all three of these properties. 
The DCNT models meta-\ndata via a \ufb02exible, Gaussian regression on arbitrary input features; correlations\nvia a scalable square-root covariance representation; and nonparametric selection\nfrom an unbounded series of potential topics via a stick-breaking construction.\nWe validate the semantic structure and predictive performance of the DCNT using\na corpus of NIPS documents annotated by various metadata.\n\n1\n\nIntroduction\n\nThe contemporary problem of exploring huge collections of discrete data, from biological sequences\nto text documents, has prompted the development of increasingly sophisticated statistical models.\nProbabilistic topic models represent documents via a mixture of topics, which are themselves dis-\ntributions on the discrete vocabulary of the corpus. Latent Dirichlet allocation (LDA) [3] was the\n\ufb01rst hierarchical Bayesian topic model, and remains in\ufb02uential and widely used. However, it suffers\nfrom three key limitations which are jointly addressed by our proposed model.\nThe \ufb01rst assumption springs from LDA\u2019s Dirichlet prior, which implicitly neglects correlations1 in\ndocument-speci\ufb01c topic usage. In diverse corpora, true semantic topics may exhibit strong (positive\nor negative) correlations; neglecting these dependencies may distort the inferred topic structure. The\ncorrelated topic model (CTM) [2] uses a logistic-normal prior to express correlations via a latent\nGaussian distribution. However, its usage of a \u201csoft-max\u201d (multinomial logistic) transformation\nrequires a global normalization, which in turn presumes a \ufb01xed, \ufb01nite number of topics.\nThe second assumption is that each document is represented solely by an unordered \u201cbag of words\u201d.\nHowever, text data is often accompanied by a rich set of metadata such as author names, publi-\ncation dates, relevant keywords, etc. Topics that are consistent with such metadata may also be\nmore semantically relevant. 
The Dirichlet multinomial regression (DMR) [11] model conditions LDA's Dirichlet parameters on feature-dependent linear regressions; this allows metadata-specific topic frequencies but retains other limitations of the Dirichlet. Recently, the Gaussian process topic model [1] incorporated correlations at the topic level via a topic covariance, and at the document level via an appropriate GP kernel function. This model remains parametric in its treatment of the number of topics, and computational scaling to large datasets is challenging since learning scales super-linearly with the number of documents.\n\n¹One can exactly sample from a Dirichlet distribution by drawing a vector of independent gamma random variables, and normalizing so they sum to one. This normalization induces slight negative correlations.\n\nThe third assumption is the a priori choice of the number of topics. The most direct nonparametric extension of LDA is the hierarchical Dirichlet process (HDP) [17]. The HDP allows an unbounded set of topics via a latent stochastic process, but nevertheless imposes a Dirichlet distribution on any finite subset of these topics. Alternatively, the nonparametric Bayes pachinko allocation [9] model captures correlations within an unbounded topic collection via an inferred, directed acyclic graph. More recently, the discrete infinite logistic normal [13] (DILN) model of topic correlations used an exponentiated Gaussian process (GP) to rescale the HDP. This construction is based on the gamma process representation of the DP [5]. While our goals are similar, we propose a rather different model based on the stick-breaking representation of the DP [16].
This choice leads to arguably\nsimpler learning algorithms, and also facilitates our modeling of document metadata.\nIn this paper, we develop a doubly correlated nonparametric topic (DCNT) model which captures\nbetween-topic correlations, as well as between-document correlations induced by metadata, for an\nunbounded set of potential topics. As described in Sec. 2, the global soft-max transformation of\nthe DMR and CTM is replaced by a stick-breaking transformation, with inputs determined via both\nmetadata-dependent linear regressions and a square-root covariance representation. Together, these\nchoices lead to a well-posed nonparametric model which allows tractable MCMC learning and in-\nference (Sec. 3). In Sec. 4, we validate the model using a toy dataset, as well as a corpus of NIPS\ndocuments annotated by author and year of publication.\n\n2 A Doubly Correlated Nonparametric Topic Model\n\nThe DCNT is a hierarchical, Bayesian nonparametric generalization of LDA. Here we give an\noverview of the model structure (see Fig. 1), focusing on our three key innovations.\n\n2.1 Document Metadata\nConsider a collection of D documents. Let \u03c6d \u2208 RF denote a feature vector capturing the metadata\nassociated with document d, and \u03c6 an F \u00d7 D matrix of corpus metadata. When metadata is unavail-\nable, we assume \u03c6d = 1. For each of an unbounded sequence of topics k, let \u03b7f k \u2208 R denote an\nassociated signi\ufb01cance weight for feature f, and \u03b7:k \u2208 RF a vector of these weights.2\nWe place a Gaussian prior \u03b7:k \u223c N (\u00b5, \u039b\u22121) on each topic\u2019s weights, where \u00b5 \u2208 RF is a vector of\nmean feature responses, and \u039b is an F \u00d7 F diagonal precision matrix. In a hierarchical Bayesian\nfashion [6], these parameters have priors \u00b5f \u223c N (0, \u03b3\u00b5), \u03bbf \u223c Gam(af , bf ). 
Appropriate values for the hyperparameters γµ, af, and bf are discussed later.\n\nGiven η and φd, the document-specific “score” for topic k is sampled as ukd ∼ N(η:kᵀφd, 1). These real-valued scores are mapped to document-specific topic frequencies πkd in subsequent sections.\n\n2.2 Topic Correlations\n\nFor topic k in the ordered sequence of topics, we define a sequence of k linear transformation weights Akℓ, ℓ = 1, . . . , k. We then sample a variable vkd as follows:\n\nvkd ∼ N(Σℓ=1..k Akℓ uℓd, λv⁻¹)   (1)\n\nLet A denote a lower triangular matrix containing these values Akℓ, padded by zeros. Slightly abusing notation, we can then compactly write this transformation as v:d ∼ N(Au:d, L⁻¹), where L = λvI is an infinite diagonal precision matrix. Critically, note that the distribution of vkd depends only on the first k entries of u:d, not the infinite tail of scores for subsequent topics. Marginalizing u:d, the covariance of v:d equals Cov[v:d] = AAᵀ + L⁻¹ ≜ Σ. As in the classical factor analysis model, A encodes a square-root representation of an output covariance matrix. Our integration of input metadata has close connections to the semiparametric latent factor model [18], but we replace their kernel-based GP covariance representation with a feature-based regression.\n\n²For any matrix η, we let η:k denote a column vector indexed by k, and ηf: a row vector indexed by f.\n\nFigure 1: Directed graphical representation of a DCNT model for D documents containing N words. Each of the unbounded set of topics has a word distribution Ωk. The topic assignment zdn for word wdn depends on document-specific topic frequencies πd, which have a correlated dependence on the metadata φd produced by A and η.
The Gaussian latent variables ud and vd implement this mapping, and simplify MCMC methods.\n\nGiven similar lower triangular representations of factorized covariance matrices, conventional Bayesian factor analysis models place a symmetric Gaussian prior Akℓ ∼ N(0, λA⁻¹). Under this prior, however, E[Σkk] = kλA⁻¹ grows linearly with k. This can produce artifacts for standard factor analysis [10], and is disastrous for the DCNT where k is unbounded. We instead propose an alternative prior Akℓ ∼ N(0, (kλA)⁻¹), so that the variance of entries in the kth row is reduced by a factor of k. This shrinkage is carefully chosen so that E[Σkk] = λA⁻¹ + λv⁻¹ remains constant.\n\nIf we constrain A to be a diagonal matrix, with Akk ∼ N(0, λA⁻¹) and Akℓ = 0 for k ≠ ℓ, we recover a simplified singly correlated nonparametric topic (SCNT) model which captures metadata but not topic correlations. For either model, the precision parameters are assigned conjugate gamma priors λv ∼ Gam(av, bv), λA ∼ Gam(aA, bA).\n\n2.3 Logistic Mapping to Stick-Breaking Topic Frequencies\n\nStick breaking representations are widely used in applications of nonparametric Bayesian models, and lead to convenient sampling algorithms [8]. Let πkd be the probability of choosing topic k in document d, where Σk=1..∞ πkd = 1. The DCNT constructs these probabilities as follows:\n\nπkd = ψ(vkd) Πℓ=1..k−1 ψ(−vℓd),   ψ(vkd) = 1 / (1 + exp(−vkd)).   (2)\n\nHere, 0 < ψ(vkd) < 1 is the classic logistic function, which satisfies ψ(−vℓd) = 1 − ψ(vℓd).
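The stick-breaking mapping of Eq. (2) is straightforward to implement under the finite truncation used later in Sec. 3; a minimal sketch in Python (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def logistic(x):
    """The classic logistic function psi from Eq. (2)."""
    return 1.0 / (1.0 + np.exp(-x))

def stick_breaking_probs(v):
    """Map real activations v (one per topic) to topic probabilities
    via Eq. (2): pi_k = psi(v_k) * prod_{l<k} psi(-v_l)."""
    psi = logistic(v)
    # stick mass remaining before topic k: prod_{l<k} (1 - psi(v_l))
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - psi[:-1])))
    return psi * remaining

v = np.array([0.5, -1.0, 2.0])
pi = stick_breaking_probs(v)
# the probabilities plus the unbroken remainder of the stick sum to one
assert abs(pi.sum() + np.prod(1.0 - logistic(v)) - 1.0) < 1e-12
```

Because each πkd reuses the running product of 1 − ψ(vℓd), the full truncated vector costs O(K).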
This same transformation is part of the so-called logistic stick-breaking process [14], but that model is motivated by different applications, and thus employs a very different prior distribution for vkd.\n\nGiven the distribution π:d, the topic assignment indicator for word n in document d is drawn according to zdn ∼ Mult(π:d). Finally, wdn ∼ Mult(Ωzdn), where Ωk ∼ Dir(β) is the word distribution for topic k, sampled from a Dirichlet prior with symmetric hyperparameters β.\n\n3 Monte Carlo Learning and Inference\n\nWe use a Markov chain Monte Carlo (MCMC) method to approximately sample from the posterior distribution of the DCNT. For most parameters, our choice of conditionally conjugate priors leads to closed form Gibbs sampling updates. Due to the logistic stick-breaking transformation, closed form resampling of v is intractable; we instead use a Metropolis independence sampler [6].\n\nOur sampler is based on a finite truncation of the full DCNT model, which has proven useful with other stick-breaking priors [8, 14, 15]. Let K be the maximum number of topics. As our experiments demonstrate, K is not the number of topics that will be utilized by the learned model, but rather a (possibly loose) upper bound on that number. For notational convenience, let K̄ = K − 1.\n\nUnder the truncated model, η is an F × K̄ matrix of regression coefficients, and u is a K̄ × D matrix satisfying u:d ∼ N(ηᵀφd, IK̄). Similarly, A is a K̄ × K̄ lower triangular matrix, and v:d ∼ N(Au:d, λv⁻¹IK̄). The probabilities πkd for the first K̄ topics are set as in eq.
(2), with the final topic set so that a valid distribution is ensured: πKd = 1 − Σk=1..K−1 πkd = Πk=1..K−1 ψ(−vkd).\n\n3.1 Gibbs Updates for Topic Assignments, Correlation Parameters, and Hyperparameters\n\nThe precision parameter λf controls the variability of the feature weights associated with each topic. As in many regression models, the gamma prior is conjugate so that\n\np(λf | η, af, bf) ∝ Gam(λf | af, bf) Πk=1..K̄ N(ηfk | µf, λf⁻¹) ∝ Gam(λf | K̄/2 + af, ½ Σk=1..K̄ (ηfk − µf)² + bf).   (3)\n\nSimilarly, the precision parameter λv has a gamma prior and posterior:\n\np(λv | v, av, bv) ∝ Gam(λv | av, bv) Πd=1..D N(v:d | Au:d, L⁻¹) ∝ Gam(λv | K̄D/2 + av, ½ Σd=1..D (v:d − Au:d)ᵀ(v:d − Au:d) + bv).   (4)\n\nEntries of the regression matrix A have a rescaled Gaussian prior Akℓ ∼ N(0, (kλA)⁻¹).
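The gamma updates in Eqs. (3)-(4) follow the standard normal-gamma conjugacy pattern; a sketch of the λf update of Eq. (3), with hypothetical variable names:

```python
import numpy as np

def lambda_f_posterior(eta_f, mu_f, a_f, b_f):
    """Gamma posterior for a feature's precision lambda_f given its
    K-bar topic weights eta_f (Eq. 3): shape a' = K_bar/2 + a_f,
    rate b' = 0.5 * sum_k (eta_fk - mu_f)^2 + b_f."""
    K_bar = len(eta_f)
    shape = 0.5 * K_bar + a_f
    rate = 0.5 * np.sum((eta_f - mu_f) ** 2) + b_f
    return shape, rate

# three topic weights for one feature, zero prior mean, weak Gam(0.01, 0.01) prior
shape, rate = lambda_f_posterior(np.array([1.0, -1.0, 2.0]), 0.0,
                                 a_f=0.01, b_f=0.01)
assert abs(shape - 1.51) < 1e-9   # 3/2 + 0.01
assert abs(rate - 3.01) < 1e-9    # (1 + 1 + 4)/2 + 0.01
```

A new λf is then drawn from Gam(shape, rate); the λv and λA updates of Eqs. (4)-(5) have the same two-line form.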
With a gamma prior, the precision parameter λA nevertheless has the following gamma posterior:\n\np(λA | A, aA, bA) ∝ Gam(λA | aA, bA) Πk=1..K̄ Πℓ=1..k N(Akℓ | 0, (kλA)⁻¹) ∝ Gam(λA | ½K̄(K̄ − 1) + aA, ½ Σk=1..K̄ Σℓ=1..k kAkℓ² + bA).   (5)\n\nConditioning on the feature regression weights η, the mean weight µf in our hierarchical prior for each feature f has a Gaussian posterior:\n\np(µf | η) ∝ N(µf | 0, γµ) Πk=1..K̄ N(ηfk | µf, λf⁻¹) ∝ N(µf | γµ/(K̄γµ + λf⁻¹) Σk=1..K̄ ηfk, (γµ⁻¹ + K̄λf)⁻¹).   (6)\n\nTo sample η:k, the linear function relating metadata to topic k, we condition on all documents uk: as well as φ, µ, and Λ. Columns of η are conditionally independent, with Gaussian posteriors:\n\np(η:k | u, φ, µ, Λ) ∝ N(η:k | µ, Λ⁻¹) N(uk:ᵀ | φᵀη:k, ID) ∝ N(η:k | (Λ + φφᵀ)⁻¹(φuk:ᵀ + Λµ), (Λ + φφᵀ)⁻¹).   (7)\n\nSimilarly, the scores u:d for each document are conditionally independent with Gaussian posteriors:\n\np(u:d | v:d, η, φd, L) ∝ N(u:d | ηᵀφd, IK̄) N(v:d | Au:d, L⁻¹) ∝ N(u:d | (IK̄ + AᵀLA)⁻¹(AᵀLv:d + ηᵀφd), (IK̄ + AᵀLA)⁻¹).   (8)\n\nTo resample A, we note that its rows are conditionally independent.
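The Gaussian updates above are standard information-form computations; a sketch of the u:d update of Eq. (8) under the finite truncation (names and test dimensions are illustrative only):

```python
import numpy as np

def u_posterior(v_d, A, eta, phi_d, lam_v):
    """Posterior of u_:d from Eq. (8): covariance (I + A^T L A)^{-1},
    mean cov @ (A^T L v_:d + eta^T phi_d), with L = lam_v * I."""
    K_bar = A.shape[0]
    L = lam_v * np.eye(K_bar)
    prec = np.eye(K_bar) + A.T @ L @ A
    cov = np.linalg.inv(prec)
    mean = cov @ (A.T @ L @ v_d + eta.T @ phi_d)
    return mean, cov

rng = np.random.default_rng(0)
K_bar, F = 4, 2
A = np.tril(rng.normal(size=(K_bar, K_bar)))  # lower triangular, as in the model
eta = rng.normal(size=(F, K_bar))
phi_d = np.ones(F)                            # no-metadata case: phi_d = 1
v_d = rng.normal(size=K_bar)
mean, cov = u_posterior(v_d, A, eta, phi_d, lam_v=2.0)
# the posterior covariance must be symmetric positive definite
assert np.allclose(cov, cov.T) and np.all(np.linalg.eigvalsh(cov) > 0)
```

In practice one would sample u:d ∼ N(mean, cov) via a Cholesky factor rather than forming the explicit inverse.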
The posterior of the k entries Ak: in row k depends on vk: and Ûk ≜ u1:k,:, the first k entries of u:d for each document d:\n\np(Ak:ᵀ | vk:, Ûk, λA, λv) ∝ Πj=1..k N(Akj | 0, (kλA)⁻¹) N(vk:ᵀ | Ûkᵀ Ak:ᵀ, λv⁻¹ID) ∝ N(Ak:ᵀ | (kλAλv⁻¹Ik + ÛkÛkᵀ)⁻¹ Ûk vk:ᵀ, (kλAIk + λvÛkÛkᵀ)⁻¹).   (9)\n\nFor the SCNT model, there is a related but simpler update (see supplemental material).\n\nAs in collapsed sampling algorithms for LDA [7], we can analytically marginalize the word distribution Ωk for each topic. Let Mkw\dn denote the number of instances of word w assigned to topic k, excluding token n in document d, and Mk.\dn the number of total tokens assigned to topic k. For a vocabulary with W unique word types, the posterior distribution of topic indicator zdn is then\n\np(zdn = k | π:d, z\dn) ∝ πkd (Mkw\dn + β) / (Mk.\dn + Wβ).   (10)\n\nRecall that the topic probabilities π:d are determined from v:d via Equation (2).\n\n3.2 Metropolis Independence Sampler Updates for Topic Activations\n\nThe posterior distribution of v:d does not have a closed analytical form due to the logistic nonlinearity underlying our stick-breaking construction. We instead employ a Metropolis-Hastings independence sampler, where proposals q(v*:d | v:d, A, u:d, λv) = N(v*:d | Au:d, λv⁻¹IK̄) are drawn from the prior.
Combining this with the likelihood of the Nd word tokens, the proposal is accepted with probability min(A(v*:d, v:d), 1), where\n\nA(v*:d, v:d) = [p(v*:d | A, u:d, λv) Πn=1..Nd p(zdn | v*:d) q(v:d | v*:d, A, u:d, λv)] / [p(v:d | A, u:d, λv) Πn=1..Nd p(zdn | v:d) q(v*:d | v:d, A, u:d, λv)] = Πn=1..Nd p(zdn | v*:d) / p(zdn | v:d) = Πk=1..K (π*kd / πkd)^(Σn=1..Nd δ(zdn, k))   (11)\n\nBecause the proposal cancels with the prior distribution in the acceptance ratio A(v*:d, v:d), the final probability depends only on a ratio of likelihood functions, which can be easily evaluated from counts of the number of words assigned to each topic by zd.\n\n4 Experimental Results\n\n4.1 Toy Bars Dataset\n\nFollowing related validations of the LDA model [7], we ran experiments on a toy corpus of “images” designed to validate the features of the DCNT. The dataset consisted of 1,500 images (documents), each containing a vocabulary of 25 pixels (word types) arranged in a 5x5 grid. Documents can be visualized by displaying pixels with intensity proportional to the number of corresponding words (see Figure 2). Each training document contained 300 word tokens.\n\nTen topics were defined, corresponding to all possible horizontal and vertical 5-pixel “bars”. We consider two toy datasets. In the first, a random number of topics is chosen for each document, and then a corresponding subset of the bars is picked uniformly at random. In the second, we induce topic correlations by generating documents that contain a combination of either only horizontal (topics 1-5) or only vertical (topics 6-10) bars.
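A minimal generator for such a correlated bars corpus (the sampling details here are assumptions consistent with the description above, not the authors' exact script):

```python
import numpy as np

def make_bar_topics():
    """Ten topics on a 5x5 grid: topics 0-4 are horizontal bars,
    topics 5-9 vertical bars, each uniform over its 5 pixels."""
    topics = np.zeros((10, 25))
    for i in range(5):
        topics[i].reshape(5, 5)[i, :] = 0.2      # horizontal bar
        topics[5 + i].reshape(5, 5)[:, i] = 0.2  # vertical bar
    return topics

def sample_correlated_doc(topics, rng, n_tokens=300):
    """Pick only-horizontal or only-vertical bars, choose a random
    subset of them, mix uniformly, and draw word counts."""
    group = np.arange(5) if rng.random() < 0.5 else np.arange(5, 10)
    n_bars = rng.integers(1, 6)
    chosen = rng.choice(group, size=n_bars, replace=False)
    mix = topics[chosen].mean(axis=0)
    return rng.multinomial(n_tokens, mix)

rng = np.random.default_rng(0)
topics = make_bar_topics()
docs = [sample_correlated_doc(topics, rng) for _ in range(1500)]
assert len(docs) == 1500 and all(d.sum() == 300 for d in docs)
```

Each document is a 25-dimensional count vector that can be displayed as a 5x5 intensity image, as in Figure 2.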
For these datasets, there was no associated metadata, so the input features were simply set as φd = 1.\n\nUsing these toy datasets, we compared the LDA model to several versions of the DCNT. For LDA, we set the number of topics to the true value of K = 10. Similar to previous toy experiments [7], we set the parameters of its Dirichlet prior over topic distributions to α = 50/K, and the topic smoothing parameter to β = 0.01. For the DCNT model, we set γµ = 10⁶, and all gamma prior hyperparameters as a = b = 0.01, corresponding to a mean of 1 and a variance of 100. To initialize the sampler, we set the precision parameters to their prior mean of 1, and sample all other variables from their prior. We compared three variants of the DCNT model: the singly correlated SCNT (A constrained to be diagonal) with K = 10, the DCNT with K = 10, and the DCNT with K = 20. The final case explores whether our stick-breaking prior can successfully infer the number of topics.\n\nFor the toy dataset with correlated topics, the results of running all sampling algorithms for 10,000 iterations are illustrated in Figure 2. On this relatively clean data, all models limited to K = 10 topics recover the correct topics.\n\nFigure 2: A dataset of correlated toy bars (example document images in bottom left). Top: From left to right, the true counts of words generated by each topic, and the recovered counts for LDA (K = 10), SCNT (K = 10), DCNT (K = 10), and DCNT (K = 20). Note that the true topic order is not identifiable. Bottom: Inferred topic covariance matrices for the four corresponding models. Note that LDA assumes all topics have a slight negative correlation, while the DCNT infers more pronounced positive correlations. With K = 20 potential DCNT topics, several are inferred to be unused with high probability, and thus have low variance.
With K = 20 topics, the DCNT recovers the true topics, as well as\na redundant copy of one of the bars. This is typical behavior for sampling runs of this length; more\nextended runs usually merge such redundant bars. The development of more rapidly mixing MCMC\nmethods is an interesting area for future research.\nTo determine the topic correlations corresponding to a set of learned model parameters, we use\na Monte Carlo estimate (details in the supplemental material). To make these matrices easier to\nvisualize, the Hungarian algorithm was used to reorder topic labels for best alignment with the\nground truth topic assignments. Note the signi\ufb01cant blocks of positive correlations recovered by the\nDCNT, re\ufb02ecting the true correlations used to create this toy data.\n\n4.2 NIPS Corpus\n\nThe NIPS corpus that we used consisted of publications from previous NIPS conferences 0-12\n(1987-1999), including various metadata (year of publication, authors, and section categories). We\ncompared four variants of the DCNT model: a model which ignored metadata, a model with in-\ndicator features for the year of publication, a model with indicator features for year of publication\nand the presence of highly proli\ufb01c authors (those with more than 10 publications), and a model with\nfeatures for year of publication and additional authors (those with more than 5 publications). In all\ncases, the feature matrix \u03c6 is binary. All models were truncated to use at most K = 50 topics, and\nthe sampler initialized as in Sec. 4.1.\n\n4.2.1 Conditioning on Metadata\n\nA learned DCNT model provides predictions for how topic frequencies change given particular\nmetadata associated with a document. In Figure 3, we show how predicted topic frequencies change\nover time, conditioning also on one of three authors (Michael Jordan, Geoffrey Hinton, or Terrence\nSejnowski). 
For each, words from a relevant topic illustrate how conditioning on a particular author can change the predicted document content. For example, the visualization associated with Michael Jordan shows that the frequency of the topic associated with probabilistic models gradually increases over the years, while the topic associated with neural networks decreases. Conditioning on Geoffrey Hinton puts larger mass on a topic which focuses on models developed by his research group. Finally, conditioning on Terrence Sejnowski dramatically increases the probability of topics related to neuroscience.\n\n4.2.2 Correlations between Topics\n\nThe DCNT model can also capture correlations between topics. In Fig. 4, we visualize this using a diagram where the size of a colored grid is proportional to the magnitude of the correlation coefficients between two topics. The results displayed in this figure are for a model trained without metadata.\n\nFigure 3: The DCNT predicts topic frequencies over the years (1987-1999) for documents with (a) none of the most prolific authors, (b) the Michael Jordan feature, (c) the Geoffrey Hinton feature, and (d) the Terrence Sejnowski feature. The stick-breaking distribution at the top shows the frequencies of each topic, averaging over all years; note some are unused. The middle row illustrates the word distributions for the topics highlighted by red dots in their respective columns. Larger words are more probable.\n\nFigure 4: A Hinton diagram of correlations between all pairs of topics, where the size of each square indicates the magnitude of dependence, and red and blue squares indicate positive and negative correlations, respectively. To the right are the top six words from three strongly correlated topic pairs. This visualization, along with others in this paper, is interactive and can be downloaded from this page: http://www.cs.brown.edu/~daeil.
We can see that the model learned strong positive correlations between function and learning topics, which have strong semantic similarities but are not identical. Another positive correlation that the model discovered was between the topics visual and neuron; of course there are many papers at NIPS which study the brain's visual cortex. A strong negative correlation was found between the network and model topics, which might reflect an ideological separation between papers studying neural networks and probabilistic models.\n\n4.3 Predictive Likelihood\n\nIn order to quantitatively measure the generalization power of our DCNT model, we tested several variants on two versions of the toy bars dataset (correlated & uncorrelated). We also compared models on the NIPS corpus, to explore more realistic data where metadata is available. The test data for the toy dataset consisted of 500 documents generated by the same process as the training data, while the NIPS corpus was split into training and test subsets containing 80% and 20% of the full corpus, respectively. Over the years 1988-1999, there were a total of 328 test documents.\n\nFigure 5: Perplexity scores (lower is better) computed via Chib-style estimators for several topic models. Left: Test performance for the toy datasets with uncorrelated bars (-A) and correlated bars (-B). Right: Test performance on the NIPS corpus with various metadata: no features (-noF), year features (-Y), year and prolific author features (over 10 publications, -YA1), and year and additional author features (over 5 publications, -YA2).\n\nWe calculated predictive likelihood estimates using a Chib-style estimator [12]; for details see the supplemental material. In a previous comparison [19], the Chib-style estimator was found to be far more accurate than alternatives like the harmonic mean estimator.
Note that there is some subtlety\nin correctly implementing the Chib-style estimator for our DCNT model, due to the possibility of\nrejection of our Metropolis-Hastings proposals.\nPredictive negative log-likelihood estimates were normalized by word counts to determine perplexity\nscores [3]. We tested several models, including the SCNT and DCNT, LDA with \u03b1 = 1 and \u03b2 =\n0.01, and the HDP with full resampling of its concentration parameters. For the toy bars data, we\nset the number of topics to K = 10 for all models except the HDP, which learned K = 15. For the\nNIPS corpus, we set K = 50 for all models except the HDP, which learned K = 86.\nFor the toy datasets, the LDA and HDP models perform similarly. The SCNT and DCNT are both\nsuperior, apparently due to their ability to capture non-Dirichlet distributions on topic occurrence\npatterns. For the NIPS data, all of the DCNT models are substantially more accurate than LDA and\nthe HDP. Including metadata encoding the year of publication, and possibly also the most proli\ufb01c\nauthors, provides slight additional improvements in DCNT accuracy. Interestingly, when a larger\nset of author features is included, accuracy becomes slightly worse. This appears to be an over\ufb01tting\nissue: there are 125 authors with over 5 publications, and only a handful of training examples for\neach one.\nWhile it is pleasing that the DCNT and SCNT models seem to provide improved predictive like-\nlihoods, a recent study on the human interpretability of topic models showed that such scores do\nnot necessarily correlate with more meaningful semantic structures [4]. In many ways, the interac-\ntive visualizations illustrated in Sec. 
4.2 provide more assurance that the DCNT can capture useful properties of real corpora.\n\n5 Discussion\n\nThe doubly correlated nonparametric topic model flexibly allows the incorporation of arbitrary features associated with documents, captures correlations that might exist within a dataset's latent topics, and can learn an unbounded set of topics. The model uses a set of efficient MCMC techniques for learning and inference, and is supported by a set of web-based tools that allow users to visualize the inferred semantic structure.\n\nAcknowledgments\n\nThis research was supported in part by IARPA under AFRL contract number FA8650-10-C-7059. Dae Il Kim was supported in part by an NSF Graduate Fellowship. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.\n\nReferences\n\n[1] A. Agovic and A. Banerjee. Gaussian process topic models. In UAI, 2010.\n[2] D. M. Blei and J. D. Lafferty. A correlated topic model of science. Annals of Applied Statistics, 1(1):17–35, 2007.\n[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.\n[4] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In NIPS, 2009.\n[5] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230, 1973.\n[6] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, 2004.\n[7] T. L. Griffiths and M.
Steyvers. Finding scientific topics. PNAS, 2004.\n[8] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173, March 2001.\n[9] W. Li, D. Blei, and A. McCallum. Nonparametric Bayes pachinko allocation. In UAI, 2008.\n[10] H. F. Lopes and M. West. Bayesian model assessment in factor analysis. Statistica Sinica, 14:41–67, 2004.\n[11] D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI, 2008.\n[12] I. Murray and R. Salakhutdinov. Evaluating probabilities under high-dimensional latent variable models. In NIPS 21, pages 1137–1144, 2009.\n[13] J. Paisley, C. Wang, and D. Blei. The discrete infinite logistic normal distribution for mixed-membership modeling. In AISTATS, 2011.\n[14] L. Ren, L. Du, L. Carin, and D. B. Dunson. Logistic stick-breaking process. JMLR, 12, 2011.\n[15] A. Rodriguez and D. B. Dunson. Nonparametric Bayesian models through probit stick-breaking processes. J. Bayesian Analysis, 2011.\n[16] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.\n[17] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.\n[18] Y. W. Teh, M. Seeger, and M. I. Jordan. Semiparametric latent factor models. In AISTATS 10, 2005.\n[19] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In ICML, 2009.", "award": [], "sourceid": 1117, "authors": [{"given_name": "Dae", "family_name": "Kim", "institution": null}, {"given_name": "Erik", "family_name": "Sudderth", "institution": null}]}