Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)
Jian Zhang, Zoubin Ghahramani, Yiming Yang
In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior of a general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate hyperparameters based on a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.
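To make the ingredients named in the abstract concrete, the following is a minimal sketch of online clustering under a Dirichlet process prior with a Dirichlet-multinomial cluster likelihood. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the inference procedure, so the greedy assignment of each document to its highest-scoring cluster is an assumption made here. The names `log_dirmult`, `cluster_online`, `alpha` (concentration), `beta` (prior strength), and `base` (a stand-in for the general English language model) are all hypothetical, and the hyperparameters are fixed by hand rather than estimated by empirical Bayes.

```python
from math import lgamma, log


def log_dirmult(x, a):
    """Log Dirichlet-multinomial likelihood of word counts x given
    Dirichlet parameters a. The multinomial coefficient is omitted
    because it is the same for every cluster being compared."""
    total = lgamma(sum(a)) - lgamma(sum(a) + sum(x))
    for xv, av in zip(x, a):
        if xv:
            total += lgamma(av + xv) - lgamma(av)
    return total


def cluster_online(docs, base, alpha=1.0, beta=1.0):
    """Process documents one at a time (assumed greedy scheme).

    docs  : list of word-count vectors over a fixed vocabulary
    base  : base word distribution (hypothetical stand-in for a
            general English language model)
    alpha : Dirichlet process concentration parameter
    beta  : strength of the Dirichlet prior built from `base`
    Returns (labels, novel), where novel[i] is True when document i
    opened a new cluster.
    """
    prior = [beta * p for p in base]
    counts = []  # accumulated word counts per cluster
    sizes = []   # number of documents per cluster
    labels, novel = [], []
    for x in docs:
        # Existing cluster k: prior weight proportional to its size,
        # likelihood from the Dirichlet posterior given its counts.
        scores = [
            log(n) + log_dirmult(x, [p + c for p, c in zip(prior, ck)])
            for n, ck in zip(sizes, counts)
        ]
        # New cluster: prior weight proportional to alpha,
        # likelihood under the base distribution alone.
        scores.append(log(alpha) + log_dirmult(x, prior))
        k = max(range(len(scores)), key=scores.__getitem__)
        is_new = k == len(sizes)
        if is_new:
            counts.append([0] * len(base))
            sizes.append(0)
        counts[k] = [c + xv for c, xv in zip(counts[k], x)]
        sizes[k] += 1
        labels.append(k)
        novel.append(is_new)
    return labels, novel


# Toy usage: a 4-word vocabulary with a uniform base distribution.
docs = [[5, 0, 0, 0], [4, 1, 0, 0], [0, 0, 5, 0], [0, 0, 4, 1]]
labels, novel = cluster_online(docs, base=[0.25] * 4)
```

In this sketch a document is flagged as novel exactly when the new-cluster score beats every existing cluster, which is one simple way to connect the clustering model to the novelty detection task. A larger `alpha` makes new clusters more likely.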