{"title": "Latent Dirichlet Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 601, "page_last": 608, "abstract": null, "full_text": "Latent Dirichlet Allocation \n\nDavid M. Blei, Andrew Y. Ng and Michael I. Jordan \n\nUniversity of California, Berkeley \n\nBerkeley, CA 94720 \n\nAbstract \n\nWe propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification. \n\n1 Introduction \n\nRecent years have seen the development and successful application of several latent factor models for discrete data. One notable example, Hofmann's pLSI/aspect model [3], has received the attention of many researchers, and applications have emerged in text modeling [3], collaborative filtering [7], and link analysis [1]. In the context of text modeling, pLSI is a \"bag-of-words\" model in that it ignores the ordering of the words in a document. It performs dimensionality reduction, relating each document to a position in a low-dimensional \"topic\" space. In this sense, it is analogous to PCA, except that it is explicitly designed for and works on discrete data. 
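[Editorial illustration, not part of the paper: the "bag-of-words" representation mentioned above can be made concrete in a few lines. The vocabulary and document below are invented toy data.]

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    # Count occurrences of each vocabulary word; token order is
    # discarded, which is exactly what 'bag-of-words' means.
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vocab = ['music', 'school', 'budget', 'opera']
doc = ['music', 'opera', 'music', 'school']
print(bag_of_words(doc, vocab))  # [2, 1, 0, 1]
```

Every model in this paper (unigram, mixture of unigrams, pLSI, LDA) operates on such count vectors rather than on word sequences.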
\n\nA sometimes poorly-understood subtlety of pLSI is that, even though it is typically described as a generative model, its documents have no generative probabilistic semantics and are treated simply as a set of labels for the specific documents seen in the training set. Thus there is no natural way to pose questions such as \"what is the probability of this previously unseen document?\". Moreover, since each training document is treated as a separate entity, the pLSI model has a large number of parameters and heuristic \"tempering\" methods are needed to prevent overfitting. \n\nIn this paper we describe a new model for collections of discrete data that provides full generative probabilistic semantics for documents. Documents are modeled via a hidden Dirichlet random variable that specifies a probability distribution on a latent, low-dimensional topic space. The distribution over words of an unseen document is a continuous mixture over document space and a discrete mixture over all possible topics. \n\n2 Generative models for text \n\n2.1 Latent Dirichlet Allocation (LDA) model \n\nTo simplify our discussion, we will use text modeling as a running example throughout this section, though it should be clear that the model is broadly applicable to general collections of discrete data. \n\nIn LDA, we assume that there are k underlying latent topics according to which documents are generated, and that each topic is represented as a multinomial distribution over the |V| words in the vocabulary. A document is generated by sampling a mixture of these topics and then sampling words from that mixture. \n\nMore precisely, a document of N words w = (w_1, ..., w_N) is generated by the following process. First, θ is sampled from a Dirichlet(α_1, ..., α_k) distribution. This means that θ lies in the (k-1)-dimensional simplex: θ_i ≥ 0, Σ_i θ_i = 1. Then, for each of the N words, a topic z_n ∈ {1, ..., k} is sampled from a Mult(θ) distribution, p(z_n = i | θ) = θ_i. Finally, each word w_n is sampled, conditioned on the z_n-th topic, from the multinomial distribution p(w | z_n). Intuitively, θ_i can be thought of as the degree to which topic i is referred to in the document. Written out in full, the probability of a document is therefore the following mixture: \n\np(w) = ∫_θ (Π_{n=1}^{N} Σ_{z_n} p(w_n | z_n; β) p(z_n | θ)) p(θ; α) dθ,   (1) \n\nwhere p(θ; α) is Dirichlet, p(z_n | θ) is a multinomial parameterized by θ, and p(w_n | z_n; β) is a multinomial over the words. This model is parameterized by the k-dimensional Dirichlet parameters α = (α_1, ..., α_k) and a k × |V| matrix β, which are parameters controlling the k multinomial distributions over words. The graphical model representation of LDA is shown in Figure 1. \n\nAs Figure 1 makes clear, this model is not a simple Dirichlet-multinomial clustering model. In such a model the innermost plate would contain only w_n; the topic node would be sampled only once for each document; and the Dirichlet would be sampled only once for the whole collection. In LDA, the Dirichlet is sampled for each document, and the multinomial topic node is sampled repeatedly within the document. The Dirichlet is thus a component in the probability model rather than a prior distribution over the model parameters. \n\nWe see from Eq. (1) that there is a second interpretation of LDA. Having sampled θ, words are drawn iid from the multinomial/unigram model given by p(w|θ) = Σ_{z=1}^{k} p(w|z) p(z|θ). Thus, LDA is a mixture model where the unigram models p(w|θ) are the mixture components, and p(θ; α) gives the mixture weights. Note that unlike a traditional mixture of unigrams model, this distribution has an infinite number of continuously-varying mixture components indexed by θ. The example in Figure 2 illustrates this interpretation of LDA as defining a random distribution over unigram models p(w|θ). \n\nFigure 1: Graphical model representation of LDA. The boxes are plates representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. \n\nFigure 2: An example distribution on unigram models p(w|θ) under LDA for three words and four topics. The triangle embedded in the x-y plane is the 2-D simplex over all possible multinomial distributions over three words. (E.g., each of the vertices of the triangle corresponds to a deterministic distribution that assigns one of the words probability 1; the midpoint of an edge gives two of the words 0.5 probability each; and the centroid of the triangle is the uniform distribution over all 3 words.) The four points marked with an x are the locations of the multinomial distributions p(w|z) for each of the four topics, and the surface shown on top of the simplex is an example of a resulting density over multinomial distributions given by LDA. \n\n2.2 Related models \n\nThe mixture of unigrams model [6] posits that every document is generated by a single randomly chosen topic: \n\np(w) = Σ_{z=1}^{k} p(z) Π_{n=1}^{N} p(w_n | z).   (2) \n\nThis model allows for different documents to come from different topics, but fails to capture the possibility that a document may express multiple topics. LDA captures this possibility, and does so with an increase in the parameter count of only one parameter: rather than having k - 1 free parameters for the multinomial p(z) over the k topics, we have k free parameters for the Dirichlet. \n\nA second related model is Hofmann's probabilistic latent semantic indexing (pLSI) [3], which posits that a document label d and a word w are conditionally independent given the hidden topic z: \n\np(d, w) = Σ_{z=1}^{k} p(w|z) p(z|d) p(d).   (3) \n\nThis model does capture the possibility that a document may contain multiple topics since p(z|d) serves as the mixture weights of the topics. However, a subtlety of pLSI, and the crucial difference between it and LDA, is that d is a dummy index into the list of documents in the training set. Thus, d is a multinomial random variable with as many possible values as there are training documents, and the model learns the topic mixtures p(z|d) only for those documents on which it is trained. For this reason, pLSI is not a fully generative model and there is no clean way to use it to assign probability to a previously unseen document. Furthermore, the number of parameters in pLSI is on the order of k|V| + k|D|, where |D| is the number of documents in the training set. Linear growth in the number of parameters with the size of the training set suggests that overfitting is likely to be a problem and indeed, in practice, a \"tempering\" heuristic is used to smooth the parameters of the model. \n\n3 Inference and learning \n\nLet us begin our description of inference and learning problems for LDA by examining the contribution to the likelihood made by a single document. To simplify our notation, let w_n^j = 1 iff w_n is the jth word in the vocabulary and z_n^i = 1 iff z_n is the ith topic. Let β_ij denote p(w^j = 1 | z^i = 1), and w = (w_1, ..., w_N), z = (z_1, ..., z_N). Expanding Eq. (1), we have: \n\np(w; α, β) = (Γ(Σ_{i=1}^{k} α_i) / Π_{i=1}^{k} Γ(α_i)) ∫ (Π_{i=1}^{k} θ_i^{α_i - 1}) (Π_{n=1}^{N} Σ_{i=1}^{k} Π_{j=1}^{|V|} (θ_i β_ij)^{w_n^j}) dθ.   (4) \n\nThis is a hypergeometric function that is infeasible to compute exactly [4]. Large text collections require fast inference and learning algorithms and thus we have utilized a variational approach [5] to approximate the likelihood in Eq. (4). 
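[Editorial illustration, not part of the paper: the generative process of Section 2.1 can be simulated directly. The vocabulary, α, and β values below are invented toy parameters; the sketch uses only the standard library, drawing the Dirichlet sample by normalizing independent Gamma draws.]

```python
import random

def sample_document(alpha, beta, vocab, n_words, rng):
    # theta ~ Dirichlet(alpha), via normalized independent Gamma samples.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    doc = []
    for _ in range(n_words):
        # z_n ~ Mult(theta): choose a topic, then w_n ~ Mult(beta[z_n]).
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        doc.append(rng.choices(vocab, weights=beta[z])[0])
    return doc

rng = random.Random(0)
vocab = ['music', 'school', 'budget']
beta = [[0.8, 0.1, 0.1],  # topic 0 favors 'music'
        [0.1, 0.8, 0.1]]  # topic 1 favors 'school'
print(sample_document([0.5, 0.5], beta, vocab, 6, rng))
```

Because θ is redrawn per document while β is shared, different documents mix the same topics in different proportions, which is the distinction from Dirichlet-multinomial clustering drawn above.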
\nWe use the following variational approximation to the log likelihood: \n\nlog p(w; α, β) = log ∫_θ Σ_z p(w|z; β) p(z|θ) p(θ; α) (q(θ, z; γ, φ) / q(θ, z; γ, φ)) dθ ≥ E_q[log p(w|z; β) + log p(z|θ) + log p(θ; α) - log q(θ, z; γ, φ)],   (5) \n\nwhere we choose a fully factorized variational distribution q(θ, z; γ, φ) = q(θ; γ) Π_n q(z_n; φ_n) parameterized by γ and φ_n, so that q(θ; γ) is Dirichlet(γ), and q(z_n; φ_n) is Mult(φ_n). Under this distribution, the terms in the variational lower bound are computable and differentiable, and we can maximize the bound with respect to γ and φ to obtain the best approximation to p(w; α, β). \n\nNote that the third and fourth terms in the variational bound are not straightforward to compute since they involve the entropy of a Dirichlet distribution, a (k-1)-dimensional integral over θ which is expensive to compute numerically. In the full version of this paper, we present a sequence of reductions on these terms which use the log Γ function and its derivatives. This allows us to compute the integral using well-known numerical routines. \n\nVariational inference is coordinate ascent in the bound on the probability of a single document. In particular, we alternate between the following two equations until the objective converges: \n\nγ_i = α_i + Σ_{n=1}^{N} φ_ni,   φ_ni ∝ β_{i w_n} exp(Ψ(γ_i)),   (6) \n\nwhere Ψ is the first derivative of the log Γ function. Note that the resulting variational parameters can also be used and interpreted as an approximation of the parameters of the true posterior. \n\nIn the current paper we focus on maximum likelihood methods for parameter estimation. Given a collection of documents D = {w_1, ..., w_M}, we utilize the EM algorithm with a variational E step, maximizing a lower bound on the log likelihood: \n\nlog p(D) ≥ Σ_{m=1}^{M} E_{q_m}[log p(θ, z, w_m)] - E_{q_m}[log q_m(θ, z)].   (7) \n\nThe E step refits q_m for each document by running the inference step described above. The M step optimizes Eq. (7) with respect to the model parameters α and β. For the multinomial parameters β_ij we have the following M step update equation: \n\nβ_ij ∝ Σ_{m=1}^{M} Σ_{n=1}^{|w_m|} φ_mni w_mn^j.   (8) \n\nThe Dirichlet parameters α_i are not independent of each other and we apply Newton-Raphson to optimize them. The variational EM algorithm alternates between maximizing Eq. (7) with respect to q_m and with respect to (α, β) until convergence. \n\n4 Experiments and Examples \n\nWe first tested LDA on two text corpora (see footnote 1). The first was drawn from the TREC AP corpus, and consisted of 2500 news articles, with a vocabulary size of |V| = 37,871 words. The second was the CRAN corpus, consisting of 1400 technical abstracts, with |V| = 7747 words. \n\nWe begin with an example showing how LDA can capture multiple-topic phenomena in documents. By examining the (variational) posterior distribution on the topic mixture q(θ; γ), we can identify the topics which were most likely to have contributed to many words in a given document; specifically, these are the topics i with the largest γ_i. Examining the most likely words in the corresponding multinomials can then further tell us what these topics might be about. The following is an article from the TREC collection. \n\nThe William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. \n\"Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants an act every bit as important as our traditional areas of support in health, medical research, education and the social services,\" Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants. 
\nLincoln Center's share will be $200,000 for its new building, which will house young artists and provide new public facilities. The Metropolitan Opera Co. and New York Philharmonic will receive $400,000 each. The Juilliard School, where music and the performing arts are taught, will get $250,000. \nThe Hearst Foundation, a leading supporter of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000 donation, too. \n\nFigure 3 shows the Dirichlet parameters of the corresponding variational distribution for those topics where γ_i > 1 (k = 100), and also lists the top 15 words (in order) from these topics. This document is mostly a combination of words about school policy (topic 4) and music (topic 5). The less prominent topics reflect other words about education (topic 1), finance (topic 2), and health (topic 3). \n\nFootnote 1: To enable repeated large scale comparison of various models on large corpora, we implemented our variational inference algorithm on a parallel computing cluster. The (bottleneck) E step is distributed across nodes so that the q_m for different documents are calculated in parallel. \n\nFigure 3: The Dirichlet parameters where γ_i > 1 (k = 100), and the top 15 words from the corresponding topics, for the document discussed in the text. [The accompanying table of per-topic word lists is not reproduced here; its column layout was lost in extraction.] \n\nFigure 4: Perplexity results on the CRAN and AP corpora for LDA, pLSI (with and without tempering), mixture of unigrams, and the unigram model, plotted against k (number of topics). \n\n4.1 Formal evaluation: Perplexity \n\nTo compare the generalization performance of LDA with other models, we computed the perplexity of a test set for the AP and CRAN corpora. The perplexity, used by convention in language modeling, is monotonically decreasing in the likelihood of the test data, and can be thought of as the inverse of the per-word likelihood. More formally, for a test set of M documents, perplexity(D_test) = exp(-Σ_m log p(w_m) / Σ_m |w_m|). \n\nWe compared LDA to both the mixture of unigrams and pLSI described in Section 2.2. We trained the pLSI model with and without tempering to reduce overfitting. When tempering, we used part of the test set as the hold-out data, thereby giving it a slight unfair advantage. As mentioned previously, pLSI does not readily generate or assign probabilities to previously unseen documents; in our experiments, we assigned probability to a new document w by marginalizing out the dummy training set indices (see footnote 2): p(w) = Σ_d (Π_{n=1}^{N} Σ_z p(w_n|z) p(z|d)) p(d). \n\nFootnote 2: A second natural method, marginalizing out d and z to form a unigram model using the resulting p(w)'s, did not perform well (its performance was similar to the standard unigram model). 
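[Editorial illustration, not part of the paper: the perplexity defined above is a one-line computation once per-document log likelihoods log p(w_m) are available. The numeric values below are invented, since the models' likelihoods are not reproduced here.]

```python
import math

def perplexity(log_probs, doc_lengths):
    # perplexity(D_test) = exp(-sum_m log p(w_m) / sum_m |w_m|);
    # monotonically decreasing in test likelihood, so lower is better.
    return math.exp(-sum(log_probs) / sum(doc_lengths))

# Two hypothetical test documents of 5 words each.
print(perplexity([-10.0, -20.0], [5, 5]))  # exp(3.0), about 20.09
```

Note that the normalization is by total word count, not by document count, so perplexity is interpretable as an inverse per-word likelihood.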
\n\nFigure 5: Results for classification (left) and collaborative filtering (right), plotted against k (number of topics). \n\nFigure 4 shows the perplexity for each model and both corpora for different values of k. The latent variable models generally do better than the simple unigram model. The pLSI model severely overfits when not tempered (the values beyond k = 10 are off the graph) but manages to outperform mixture of unigrams when tempered. LDA consistently does better than the other models. To our knowledge, these are by far the best text perplexity results obtained by a bag-of-words model. \n\n4.2 Classification \n\nWe also tested LDA on a text classification task. For each class c, we learn a separate model p(w|c) of the documents in that class. An unseen document is classified by picking argmax_c p(c|w) = argmax_c p(w|c) p(c). Note that using a simple unigram distribution for p(w|c) recovers the traditional naive Bayes classification model. Using the same (standard) subset of the WebKB dataset as used in [6], we obtained classification error rates illustrated in Figure 5 (left). In all cases, the difference between LDA and the other algorithms' performance is statistically significant (p < 0.05). \n\n4.3 Collaborative filtering \n\nOur final experiment utilized the EachMovie collaborative filtering dataset. In this dataset a collection of users indicates their preferred movie choices. A user and the movies he chose are analogous to a document and the words in the document (respectively). \nThe collaborative filtering task is as follows. We train the model on a fully observed set of users. Then, for each test user, we are shown all but one of the movies that she liked and are asked to predict what the held-out movie is. The different algorithms are evaluated according to the likelihood they assign to the held-out movie. 
More precisely, define the predictive perplexity on M test users to be exp(-Σ_{m=1}^{M} log p(w_{m,N_d} | w_{m,1}, ..., w_{m,N_d - 1}) / M). With 5000 training users, 3500 testing users, and a vocabulary of 1600 movies, we find predictive perplexities illustrated in Figure 5 (right). \n\n5 Conclusions \n\nWe have presented a generative probabilistic framework for modeling the topical structure of documents and other collections of discrete data. Topics are represented explicitly via a multinomial variable z_n that is repeatedly selected, once for each word, in a given document. In this sense, the model generates an allocation of the words in a document to topics. When computing the probability of a new document, this unknown allocation induces a mixture distribution across the words in the vocabulary. There is a many-to-many relationship between topics and words as well as a many-to-many relationship between documents and topics. \n\nWhile Dirichlet distributions are often used as conjugate priors for multinomials in Bayesian modeling, it is preferable to instead think of the Dirichlet in our model as a component of the likelihood. The Dirichlet random variable θ is a latent variable that gives generative probabilistic semantics to the notion of a \"document\" in the sense that it allows us to put a distribution on the space of possible documents. The words that are actually obtained are viewed as a continuous mixture over this space, as well as being a discrete mixture over topics (see footnote 3). \n\nThe generative nature of LDA makes it easy to use as a module in more complex architectures and to extend it in various directions. We have already seen that collections of LDA models can be used in a classification setting. If the classification variable is treated as a latent variable we obtain a mixture of LDA models, a useful model for situations in which documents cluster not only according to their topic overlap, but along other dimensions as well. 
Another extension arises from generalizing LDA to consider Dirichlet/multinomial mixtures of bigram or trigram models, rather than the simple unigram models that we have considered here. Finally, we can readily fuse LDA models which have different vocabularies (e.g., words and images); these models interact via a common abstract topic variable and can elegantly use both vocabularies in determining the topic mixture of a given document. \n\nAcknowledgments \n\nA. Ng is supported by a Microsoft Research fellowship. This work was also supported by a grant from Intel Corporation, NSF grant IIS-9988642, and ONR MURI N00014-00-1-0637. \n\nReferences \n\n[1] D. Cohn and T. Hofmann. The missing link: A probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, 2001. \n\n[2] P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Technical report, University of Bristol, 1998. \n\n[3] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999. \n\n[4] T. J. Jiang, J. B. Kadane, and J. M. Dickey. Computation of Carlson's multiple hypergeometric functions R for Bayesian applications. Journal of Computational and Graphical Statistics, 1:231-251, 1992. \n\n[5] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999. \n\n[6] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103-134, 2000. \n\n[7] A. Popescul, L. H. Ungar, D. M. Pennock, and S. Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Uncertainty in Artificial Intelligence, Proceedings of the Seventeenth Conference, 2001. 
\n\nFootnote 3: These remarks also distinguish our model from the Bayesian Dirichlet/multinomial allocation model (DMA) of [2], which is a finite alternative to the Dirichlet process. The DMA places a mixture of Dirichlet priors on p(w|z) and sets α_i = α_0 for all i. ", "award": [], "sourceid": 2070, "authors": [{"given_name": "David", "family_name": "Blei", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}