{"title": "Latent Dirichlet Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 601, "page_last": 608, "abstract": null, "full_text": "Latent  Dirichlet  Allocation \n\nDavid M.  Blei,  Andrew Y.  Ng and  Michael I. Jordan \n\nUniversity of California,  Berkeley \n\nBerkeley,  CA  94720 \n\nAbstract \n\nWe propose a generative model for text and other collections of dis(cid:173)\ncrete data that generalizes or improves on several previous models \nincluding naive Bayes/unigram, mixture of unigrams  [6],  and Hof(cid:173)\nmann's  aspect  model,  also  known  as  probabilistic latent  semantic \nindexing  (pLSI)  [3].  In  the  context  of text  modeling,  our  model \nposits  that  each  document  is  generated  as  a  mixture  of  topics, \nwhere  the  continuous-valued  mixture  proportions  are  distributed \nas  a  latent  Dirichlet  random  variable.  Inference  and  learning  are \ncarried out efficiently  via variational  algorithms.  We  present  em(cid:173)\npirical  results  on  applications  of this  model  to  problems  in  text \nmodeling,  collaborative filtering,  and text classification. \n\n1 \n\nIntroduction \n\nRecent years have seen the development and successful application of several latent \nfactor  models  for  discrete  data.  One  notable  example,  Hofmann's  pLSI/aspect \nmodel  [3],  has  received  the  attention  of many  researchers,  and  applications  have \nemerged  in  text  modeling  [3],  collaborative  filtering  [7],  and  link  analysis  [1].  In \nthe context of text modeling,  pLSI is  a  \"bag-of-words\"  model in that it ignores the \nordering of the words in a  document.  It performs dimensionality reduction, relating \neach document  to  a  position  in  low-dimensional  \"topic\"  space.  In this  sense,  it  is \nanalogous  to  PCA,  except  that  it  is  explicitly  designed  for  and  works  on  discrete \ndata. \n\nA sometimes poorly-understood subtlety of pLSI is  that, even though it is  typically \ndescribed  as  a  generative  model,  its  documents  have  no  generative  probabilistic \nsemantics and are treated simply as  a  set of labels  for  the specific  documents seen \nin the training set.  Thus there is  no natural way to pose questions such as  \"what is \nthe probability of this previously unseen document?\".  Moreover, since each training \ndocument  is  treated  as  a  separate  entity,  the  pLSI  model  has  a  large  number  of \nparameters and heuristic  \"tempering\"  methods are needed to prevent overfitting. \n\nIn this paper we  describe a  new model for  collections of discrete data that provides \nfull  generative probabilistic semantics for  documents.  Documents are modeled via a \nhidden Dirichlet random variable that specifies a probability distribution on a latent, \nlow-dimensional topic space.  The distribution over words of an unseen document is \na  continuous mixture over document space and a  discrete mixture over all possible \ntopics. \n\n\f2  Generative  models for  text \n\n2.1  Latent  Dirichlet  Allocation  (LDA)  model \n\nTo simplify our discussion, we  will use text modeling as a running example through(cid:173)\nout  this  section,  though it should  be  clear that the model is  broadly applicable to \ngeneral collections of discrete data. 
\n\nIn LDA, we assume that there are k underlying latent topics according to which documents are generated, and that each topic is represented as a multinomial distribution over the |V| words in the vocabulary. A document is generated by sampling a mixture of these topics and then sampling words from that mixture. \n\nMore precisely, a document of N words w = (w_1, ..., w_N) is generated by the following process. First, θ is sampled from a Dirichlet(α_1, ..., α_k) distribution. This means that θ lies in the (k - 1)-dimensional simplex: θ_i ≥ 0, ∑_i θ_i = 1. Then, for each of the N words, a topic z_n ∈ {1, ..., k} is sampled from a Mult(θ) distribution, p(z_n = i | θ) = θ_i. Finally, each word w_n is sampled, conditioned on the z_n-th topic, from the multinomial distribution p(w | z_n). Intuitively, θ_i can be thought of as the degree to which topic i is referred to in the document. Written out in full, the probability of a document is therefore the following mixture: \n\np(w) = ∫ ( ∏_{n=1}^N ∑_{z_n} p(w_n | z_n; β) p(z_n | θ) ) p(θ; α) dθ,   (1) \n\nwhere p(θ; α) is Dirichlet, p(z_n | θ) is a multinomial parameterized by θ, and p(w_n | z_n; β) is a multinomial over the words. This model is parameterized by the k-dimensional Dirichlet parameters α = (α_1, ..., α_k) and a k x |V| matrix β, which are parameters controlling the k multinomial distributions over words. The graphical model representation of LDA is shown in Figure 1. \n\nFigure 1: Graphical model representation of LDA. The boxes are plates representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. \n\nAs Figure 1 makes clear, this model is not a simple Dirichlet-multinomial clustering model. In such a model the innermost plate would contain only w_n; the topic node would be sampled only once for each document; and the Dirichlet would be sampled only once for the whole collection. In LDA, the Dirichlet is sampled for each document, and the multinomial topic node is sampled repeatedly within the document. The Dirichlet is thus a component in the probability model rather than a prior distribution over the model parameters. \n\nWe see from Eq. (1) that there is a second interpretation of LDA. Having sampled θ, words are drawn iid from the multinomial/unigram model given by p(w | θ) = ∑_{z=1}^k p(w | z) p(z | θ). Thus, LDA is a mixture model where the unigram models p(w | θ) are the mixture components, and p(θ; α) gives the mixture weights. Note that unlike a traditional mixture of unigrams model, this distribution has an infinite number of continuously-varying mixture components indexed by θ. The example in Figure 2 illustrates this interpretation of LDA as defining a random distribution over unigram models p(w | θ). \n\nFigure 2: An example distribution on unigram models p(w | θ) under LDA for three words and four topics. The triangle embedded in the x-y plane is the 2-D simplex over all possible multinomial distributions over three words. (E.g., each of the vertices of the triangle corresponds to a deterministic distribution that assigns one of the words probability 1; the midpoint of an edge gives two of the words 0.5 probability each; and the centroid of the triangle is the uniform distribution over all 3 words.) The four points marked with an x are the locations of the multinomial distributions p(w | z) for each of the four topics, and the surface shown on top of the simplex is an example of a resulting density over multinomial distributions given by LDA. 
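\n\nAs an illustrative sketch (not part of the model specification above), the generative process can be simulated in a few lines of Python/NumPy; the sizes k, |V| and N and the parameters α and β below are arbitrary toy values, with β itself drawn at random just for the example: \n\nimport numpy as np \n\nrng = np.random.default_rng(0) \nk, V, N = 4, 3, 10                               # toy sizes: topics, vocabulary, words per document \nalpha = np.ones(k)                               # Dirichlet parameters (toy choice) \nbeta = rng.dirichlet(np.ones(V), size=k)         # k x |V| matrix of topic-word multinomials (toy choice) \n\ntheta = rng.dirichlet(alpha)                     # 1. sample the document's topic proportions \nz = rng.choice(k, size=N, p=theta)               # 2. sample a topic for each word position \nw = np.array([rng.choice(V, p=beta[zn]) for zn in z])   # 3. sample each word from its topic's multinomial \n\nEach draw of θ yields a different unigram model ∑_i θ_i p(w | z = i), which is the sense in which Figure 2 depicts a density over the word simplex. 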
\n\n2.2 Related models \n\nThe mixture of unigrams model [6] posits that every document is generated by a single randomly chosen topic: \n\np(w) = ∑_z p(z) ∏_{n=1}^N p(w_n | z).   (2) \n\nThis model allows for different documents to come from different topics, but fails to capture the possibility that a document may express multiple topics. LDA captures this possibility, and does so with an increase in the parameter count of only one parameter: rather than having k - 1 free parameters for the multinomial p(z) over the k topics, we have k free parameters for the Dirichlet. \n\nA second related model is Hofmann's probabilistic latent semantic indexing (pLSI) [3], which posits that a document label d and a word w are conditionally independent given the hidden topic z: \n\np(d, w) = ∑_{z=1}^k p(w | z) p(z | d) p(d).   (3) \n\nThis model does capture the possibility that a document may contain multiple topics since the p(z | d) serve as the mixture weights of the topics. However, a subtlety of pLSI -- and the crucial difference between it and LDA -- is that d is a dummy index into the list of documents in the training set. Thus, d is a multinomial random variable with as many possible values as there are training documents, and the model learns the topic mixtures p(z | d) only for those documents on which it is trained. For this reason, pLSI is not a fully generative model and there is no clean way to use it to assign probability to a previously unseen document. Furthermore, the number of parameters in pLSI is on the order of k|V| + k|D|, where |D| is the number of documents in the training set. Linear growth in the number of parameters with the size of the training set suggests that overfitting is likely to be a problem and indeed, in practice, a \"tempering\" heuristic is used to smooth the parameters of the model. 
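\n\nThe contrast between Eq. (2) and Eq. (1) can be made concrete with a small NumPy sketch (toy parameters, not from the paper): the mixture-of-unigrams likelihood is a finite sum and can be computed exactly, while the LDA likelihood integrates over θ and is approximated here by simple Monte Carlo; the lack of a closed form for that integral is what motivates the variational treatment of the next section. \n\nimport numpy as np \n\nrng = np.random.default_rng(0) \nk, V = 4, 3 \npz = np.ones(k) / k                              # p(z), topic proportions for Eq. (2) (toy choice) \nalpha = np.ones(k)                               # Dirichlet parameters for Eq. (1) (toy choice) \nbeta = rng.dirichlet(np.ones(V), size=k)         # k x |V| topic-word probabilities (toy choice) \nw = np.array([0, 2, 1, 0])                       # a toy document given as word indices \n\n# Eq. (2): a single topic generates the whole document \np_mixture = np.sum(pz * np.prod(beta[:, w], axis=1)) \n\n# Eq. (1): average over sampled topic proportions theta \nthetas = rng.dirichlet(alpha, size=20000)        # draws from p(theta; alpha) \nper_word = thetas @ beta[:, w]                   # p(w_n | theta) for every sample and word \np_lda = np.mean(np.prod(per_word, axis=1))       # Monte Carlo estimate of the integral 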
\n\n3 Inference and learning \n\nLet us begin our description of inference and learning problems for LDA by examining the contribution to the likelihood made by a single document. To simplify our notation, let w_n^j = 1 iff w_n is the jth word in the vocabulary and z_n^i = 1 iff z_n is the ith topic. Let β_ij denote p(w^j = 1 | z^i = 1), and w = (w_1, ..., w_N), z = (z_1, ..., z_N). Expanding Eq. (1), we have: \n\np(w; α, β) = ( Γ(∑_i α_i) / ∏_i Γ(α_i) ) ∫ ( ∏_{i=1}^k θ_i^{α_i - 1} ) ( ∏_{n=1}^N ∑_{i=1}^k ∏_{j=1}^{|V|} (θ_i β_ij)^{w_n^j} ) dθ.   (4) \n\nThis is a hypergeometric function that is infeasible to compute exactly [4]. \n\nLarge text collections require fast inference and learning algorithms and thus we have utilized a variational approach [5] to approximate the likelihood in Eq. (4). We use the following variational approximation to the log likelihood: \n\nlog p(w; α, β) = log ∫ ∑_z p(w | z; β) p(z | θ) p(θ; α) ( q(θ, z; γ, φ) / q(θ, z; γ, φ) ) dθ \n≥ E_q[ log p(w | z; β) + log p(z | θ) + log p(θ; α) - log q(θ, z; γ, φ) ], \n\nwhere we choose a fully factorized variational distribution q(θ, z; γ, φ) = q(θ; γ) ∏_n q(z_n; φ_n) parameterized by γ and φ_n, so that q(θ; γ) is Dirichlet(γ) and q(z_n; φ_n) is Mult(φ_n). Under this distribution, the terms in the variational lower bound are computable and differentiable, and we can maximize the bound with respect to γ and φ to obtain the best approximation to p(w; α, β). \n\nNote that the third and fourth terms in the variational bound are not straightforward to compute since they involve the entropy of a Dirichlet distribution, a (k - 1)-dimensional integral over θ which is expensive to compute numerically. In the full version of this paper, we present a sequence of reductions on these terms which use the log Γ function and its derivatives. This allows us to compute the integral using well-known numerical routines. \n\nVariational inference is coordinate ascent in the bound on the probability of a single document. In particular, we alternate between the following two equations until the objective converges: \n\nφ_ni ∝ β_{i w_n} exp(Ψ(γ_i)),   (5) \n\nγ_i = α_i + ∑_{n=1}^N φ_ni,   (6) \n\nwhere β_{i w_n} denotes p(w_n | z^i = 1; β) and Ψ is the first derivative of the log Γ function. Note that the resulting variational parameters can also be used and interpreted as an approximation of the parameters of the true posterior. \n\nIn the current paper we focus on maximum likelihood methods for parameter estimation. Given a collection of documents D = {w_1, ..., w_M}, we utilize the EM algorithm with a variational E step, maximizing a lower bound on the log likelihood: \n\nlog p(D) ≥ ∑_{m=1}^M ( E_{q_m}[ log p(θ, z, w_m) ] - E_{q_m}[ log q_m(θ, z) ] ).   (7) \n\nThe E step refits q_m for each document by running the inference step described above. The M step optimizes Eq. (7) with respect to the model parameters α and β. For the multinomial parameters β_ij we have the following M step update equation: \n\nβ_ij ∝ ∑_{m=1}^M ∑_{n=1}^{|w_m|} φ_mni w_mn^j.   (8) \n\nThe Dirichlet parameters α_i are not independent of each other, and we apply Newton-Raphson to the bound in Eq. (7) to optimize them jointly. \n\nThe variational EM algorithm alternates between maximizing Eq. (7) with respect to q_m and with respect to (α, β) until convergence. 
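\n\nAs a minimal illustrative sketch (assuming NumPy and SciPy; the variable names, initialization, and stopping tolerance are arbitrary choices, and the Newton-Raphson update for α is omitted), the variational E step of Eqs. (5)-(6) and the M step of Eq. (8) can be written as follows. \n\nimport numpy as np \nfrom scipy.special import digamma            # Psi, the first derivative of the log Gamma function \n\ndef e_step(doc, alpha, beta, iters=100, tol=1e-6): \n    # Coordinate ascent on (gamma, phi) for a single document. \n    # doc: array of word indices; beta: k x |V| matrix of topic-word probabilities. \n    k, N = len(alpha), len(doc) \n    gamma = alpha + N / k                    # initialize q(theta; gamma) \n    phi = np.full((N, k), 1.0 / k)           # initialize q(z_n; phi_n), refined in the first iteration \n    for _ in range(iters): \n        # Eq. (5): phi_ni proportional to beta_{i, w_n} exp(Psi(gamma_i)); \n        # a Psi(sum_j gamma_j) term would be constant in i and cancels in the normalization. \n        phi = beta[:, doc].T * np.exp(digamma(gamma)) \n        phi /= phi.sum(axis=1, keepdims=True) \n        # Eq. (6): gamma_i = alpha_i + sum_n phi_ni \n        new_gamma = alpha + phi.sum(axis=0) \n        if np.max(np.abs(new_gamma - gamma)) < tol: \n            gamma = new_gamma \n            break \n        gamma = new_gamma \n    return gamma, phi \n\ndef m_step_beta(docs, phis, V): \n    # Eq. (8): beta_ij proportional to the expected count of word j under topic i. \n    k = phis[0].shape[1] \n    beta = np.zeros((k, V)) \n    for doc, phi in zip(docs, phis): \n        np.add.at(beta.T, doc, phi)          # beta[:, doc[n]] += phi[n], accumulating repeated words \n    return beta / beta.sum(axis=1, keepdims=True) \n\nA full variational EM pass would alternate e_step over all documents (the E step) with m_step_beta and a Newton-Raphson update of α (the M step), as described above. 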
\n\n4 Experiments and Examples \n\nWe first tested LDA on two text corpora.^1 The first was drawn from the TREC AP corpus, and consisted of 2500 news articles, with a vocabulary size of |V| = 37,871 words. The second was the CRAN corpus, consisting of 1400 technical abstracts, with |V| = 7747 words. \n\nWe begin with an example showing how LDA can capture multiple-topic phenomena in documents. By examining the (variational) posterior distribution on the topic mixture q(θ; γ), we can identify the topics which were most likely to have contributed to many words in a given document; specifically, these are the topics i with the largest γ_i. Examining the most likely words in the corresponding multinomials can then further tell us what these topics might be about. The following is an article from the TREC collection. \n\nThe William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. \n\"Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants -- an act every bit as important as our traditional areas of support in health, medical research, education and the social services,\" Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants. \nLincoln Center's share will be $200,000 for its new building, which will house young artists and provide new public facilities. The Metropolitan Opera Co. and New York Philharmonic will receive $400,000 each. The Juilliard School, where music and the performing arts are taught, will get $250,000. \nThe Hearst Foundation, a leading supporter of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000 donation, too. \n\nFigure 3 shows the Dirichlet parameters of the corresponding variational distribution for those topics where γ_i > 1 (k = 100), and also lists the top 15 words (in order) from these topics. This document is mostly a combination of words about school policy (topic 4) and music (topic 5). The less prominent topics reflect other words about education (topic 1), finance (topic 2), and health (topic 3). \n\nFigure 3: The Dirichlet parameters where γ_i > 1 (k = 100), and the top 15 words from the corresponding topics, for the document discussed in the text. \n\nFigure 4: Perplexity results on the CRAN and AP corpora for LDA, pLSI, mixture of unigrams, and the unigram model, plotted against k (the number of topics). \n\n^1 To enable repeated large scale comparison of various models on large corpora, we implemented our variational inference algorithm on a parallel computing cluster. The (bottleneck) E step is distributed across nodes so that the q_m for different documents are calculated in parallel. \n\n4.1 Formal evaluation: Perplexity \n\nTo compare the generalization performance of LDA with other models, we computed the perplexity of a test set for the AP and CRAN corpora. The perplexity, used by convention in language modeling, is monotonically decreasing in the likelihood of the test data, and can be thought of as the inverse of the per-word likelihood. More formally, for a test set of M documents, perplexity(D_test) = exp( -∑_m log p(w_m) / ∑_m |w_m| ). 
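\n\nAs a small sketch of this formula (log_p_w is a stand-in for whatever per-document log likelihood the model under evaluation provides; it is not a function defined in this paper): \n\nimport numpy as np \n\ndef perplexity(test_docs, log_p_w): \n    # test_docs: list of word-index arrays; log_p_w(doc) returns log p(w_m) under the model \n    total_log_lik = sum(log_p_w(doc) for doc in test_docs) \n    total_words = sum(len(doc) for doc in test_docs) \n    return np.exp(-total_log_lik / total_words) 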
\n\nWe compared LDA to both the mixture of unigrams and pLSI described in Section 2.2. We trained the pLSI model with and without tempering to reduce overfitting. When tempering, we used part of the test set as the hold-out data, thereby giving it a slight unfair advantage. As mentioned previously, pLSI does not readily generate or assign probabilities to previously unseen documents; in our experiments, we assigned probability to a new document by marginalizing out the dummy training set indices^2: p(w) = ∑_d ( ∏_{n=1}^N ∑_z p(w_n | z) p(z | d) ) p(d). \n\n^2 A second natural method, marginalizing out d and z to form a unigram model using the resulting p(w)'s, did not perform well (its performance was similar to the standard unigram model). \n\nFigure 5: Results for classification (left) and collaborative filtering (right), plotted against k (the number of topics). \n\nFigure 4 shows the perplexity for each model and both corpora for different values of k. The latent variable models generally do better than the simple unigram model. The pLSI model severely overfits when not tempered (the values beyond k = 10 are off the graph) but manages to outperform the mixture of unigrams when tempered. LDA consistently does better than the other models. To our knowledge, these are by far the best text perplexity results obtained by a bag-of-words model. \n\n4.2 Classification \n\nWe also tested LDA on a text classification task. For each class c, we learn a separate model p(w | c) of the documents in that class. An unseen document is classified by picking argmax_c p(c | w) = argmax_c p(w | c) p(c). Note that using a simple unigram distribution for p(w | c) recovers the traditional naive Bayes classification model. \nUsing the same (standard) subset of the WebKB dataset as used in [6], we obtained classification error rates illustrated in Figure 5 (left). In all cases, the difference between LDA and the other algorithms' performance is statistically significant (p < 0.05). \n\n4.3 Collaborative filtering \n\nOur final experiment utilized the EachMovie collaborative filtering dataset. In this dataset a collection of users indicates their preferred movie choices. A user and the movies he chose are analogous to a document and the words in the document (respectively). \n\nThe collaborative filtering task is as follows. We train the model on a fully observed set of users. Then, for each test user, we are shown all but one of the movies that she liked and are asked to predict what the held-out movie is. The different algorithms are evaluated according to the likelihood they assign to the held-out movie. More precisely, define the predictive perplexity on M test users to be exp( -∑_{m=1}^M log p(w_{m,N_d} | w_{m,1}, ..., w_{m,N_d-1}) / M ). With 5000 training users, 3500 testing users, and a vocabulary of 1600 movies, we find predictive perplexities illustrated in Figure 5 (right). 
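\n\nOne illustrative way to approximate this predictive probability under LDA (reusing the e_step sketch from Section 3 and taking the mean of the variational posterior q(θ; γ) as a point estimate of θ -- a simplifying assumption for the sketch, not necessarily the exact procedure used in the experiments): \n\nimport numpy as np \n\ndef predictive_prob(observed, held_out, alpha, beta): \n    # approximate p(held-out item | observed items) under LDA \n    gamma, _ = e_step(np.asarray(observed), alpha, beta)   # fit q(theta; gamma) to the observed items \n    theta_hat = gamma / gamma.sum()                        # posterior mean of theta under q \n    return float(theta_hat @ beta[:, held_out])            # sum_i theta_hat_i * beta[i, held_out] \n\ndef predictive_perplexity(users, alpha, beta): \n    # users: list of (observed_items, held_out_item) pairs \n    logs = [np.log(predictive_prob(obs, held, alpha, beta)) for obs, held in users] \n    return np.exp(-np.mean(logs)) 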
\n\n5 Conclusions \n\nWe have presented a generative probabilistic framework for modeling the topical structure of documents and other collections of discrete data. Topics are represented explicitly via a multinomial variable z_n that is repeatedly selected, once for each word, in a given document. In this sense, the model generates an allocation of the words in a document to topics. When computing the probability of a new document, this unknown allocation induces a mixture distribution across the words in the vocabulary. There is a many-to-many relationship between topics and words as well as a many-to-many relationship between documents and topics. \n\nWhile Dirichlet distributions are often used as conjugate priors for multinomials in Bayesian modeling, it is preferable to instead think of the Dirichlet in our model as a component of the likelihood. The Dirichlet random variable θ is a latent variable that gives generative probabilistic semantics to the notion of a \"document\" in the sense that it allows us to put a distribution on the space of possible documents. The words that are actually obtained are viewed as a continuous mixture over this space, as well as being a discrete mixture over topics.^3 \n\nThe generative nature of LDA makes it easy to use as a module in more complex architectures and to extend it in various directions. We have already seen that collections of LDA models can be used in a classification setting. If the classification variable is treated as a latent variable we obtain a mixture of LDA models, a useful model for situations in which documents cluster not only according to their topic overlap, but along other dimensions as well. Another extension arises from generalizing LDA to consider Dirichlet/multinomial mixtures of bigram or trigram models, rather than the simple unigram models that we have considered here. Finally, we can readily fuse LDA models which have different vocabularies (e.g., words and images); these models interact via a common abstract topic variable and can elegantly use both vocabularies in determining the topic mixture of a given document. \n\nAcknowledgments \n\nA. Ng is supported by a Microsoft Research fellowship. This work was also supported by a grant from Intel Corporation, NSF grant IIS-9988642, and ONR MURI N00014-00-1-0637. \n\nReferences \n\n[1] D. Cohn and T. Hofmann. The missing link -- a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, 2001. \n\n[2] P. J. Green and S. Richardson. Modelling heterogeneity with and without the Dirichlet process. Technical report, University of Bristol, 1998. \n\n[3] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999. \n\n[4] T. J. Jiang, J. B. Kadane, and J. M. Dickey. Computation of Carlson's multiple hypergeometric function R for Bayesian applications. Journal of Computational and Graphical Statistics, 1:231-251, 1992. \n\n[5] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999. 
\n\n[6] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103-134, 2000. \n\n[7] A. Popescul, L. H. Ungar, D. M. Pennock, and S. Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Uncertainty in Artificial Intelligence, Proceedings of the Seventeenth Conference, 2001. \n\n^3 These remarks also distinguish our model from the Bayesian Dirichlet/multinomial allocation (DMA) model of [2], which is a finite alternative to the Dirichlet process. The DMA places a mixture of Dirichlet priors on p(w | z) and sets α_i = α_0 for all i. \n", "award": [], "sourceid": 2070, "authors": [{"given_name": "David", "family_name": "Blei", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}