Part of Advances in Neural Information Processing Systems 14 (NIPS 2001)
David Blei, Andrew Ng, Michael Jordan
We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.
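The generative process the abstract describes can be sketched as follows: draw a document's topic proportions from a Dirichlet, then draw each word by first sampling a topic from those proportions and then sampling a word from that topic's word distribution. This is a minimal illustrative sketch, not the paper's implementation; the parameter names `alpha` (Dirichlet parameter) and `beta` (per-topic word distributions) and the toy values are assumptions for the example.

```python
import numpy as np

def generate_document(alpha, beta, doc_length, seed=0):
    """Sample one document from the topic-mixture generative process.

    alpha: Dirichlet parameter vector, shape (num_topics,)      [assumed name]
    beta:  per-topic word distributions, shape (num_topics, V)  [assumed name]
    """
    rng = np.random.default_rng(seed)
    # Continuous-valued mixture proportions, drawn from a latent Dirichlet.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(doc_length):
        z = rng.choice(len(alpha), p=theta)       # pick a topic for this word
        w = rng.choice(beta.shape[1], p=beta[z])  # pick a word from that topic
        words.append(int(w))
    return theta, words

# Toy example: 2 topics over a 4-word vocabulary.
alpha = np.array([0.5, 0.5])
beta = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.1, 0.1, 0.7]])
theta, words = generate_document(alpha, beta, doc_length=10)
```

Each document thus gets its own `theta`, so documents can mix topics in different proportions, which is what distinguishes this model from a single-topic mixture of unigrams.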