{"title": "A Probabilistic Model for Online Document Clustering with Application to Novelty Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 1617, "page_last": 1624, "abstract": null, "full_text": " A Probabilistic Model for Online Document\nClustering with Application to Novelty Detection\n\n\n\n Jian Zhang Zoubin Ghahramani\n School of Computer Science Gatsby Computational Neuroscience Unit\n Cargenie Mellon University University College London\n Pittsburgh, PA 15213 London WC1N 3AR, UK\n jian.zhang@cs.cmu.edu zoubin@gatsby.ucl.ac.uk\n\n\n Yiming Yang\n School of Computer Science\n Cargenie Mellon University\n Pittsburgh, PA 15213\n yiming@cs.cmu.edu\n\n\n\n\n Abstract\n\n In this paper we propose a probabilistic model for online document clus-\n tering. We use non-parametric Dirichlet process prior to model the grow-\n ing number of clusters, and use a prior of general English language\n model as the base distribution to handle the generation of novel clusters.\n Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-\n multinomial distribution. We use empirical Bayes method to estimate\n hyperparameters based on a historical dataset. Our probabilistic model\n is applied to the novelty detection task in Topic Detection and Tracking\n (TDT) and compared with existing approaches in the literature.\n\n\n\n1 Introduction\n\nThe task of online document clustering is to group documents into clusters as long as\nthey arrive in a temporal sequence. Generally speaking, it is difficult for several reasons:\nFirst, it is unsupervised learning and the learning has to be done in an online fashion,\nwhich imposes constraints on both strategy and efficiency. Second, similar to other learning\nproblems in text, we have to deal with a high-dimensional space with tens of thousands of\nfeatures. 
And finally, the number of clusters can be as large as several thousand in newswire data.

The objective of novelty detection is to identify the novel objects in a sequence of data,
where "novel" is usually defined as dissimilar to previously seen instances. Here we are
interested in novelty detection in the text domain, where we want to identify the earliest
report of every new event in a sequence of news stories. Online document clustering can be
applied to the novelty detection task directly, by marking the first seed of every cluster
as novel and all of the cluster's remaining documents as non-novel. The most obvious
application of novelty detection is that, by detecting novel events, systems can
automatically alert people when new events happen.

In this paper we apply a Dirichlet process prior to model the growing number of clusters,
and propose to use a general English language model as the basis of newly generated
clusters. In particular, new clusters are generated according to the prior and a background
general English model, and each document cluster is modeled using a Bayesian Dirichlet-
multinomial language model. Bayesian inference can be carried out easily due to conjugacy,
and model hyperparameters are estimated from a historical dataset by the empirical Bayes
method. We evaluate our online clustering algorithm (as well as its variants) on the
novelty detection task in TDT, which has been regarded as the hardest task in that
literature [2].

The rest of this paper is organized as follows. We first introduce our probabilistic model
in Section 2, and in Section 3 we give detailed information on how to estimate model
hyperparameters. We describe the experiments in Section 4, and related work in Section 5.
We conclude and discuss future work in Section 6.


2 A Probabilistic Model for Online Document Clustering

In this section we describe the generative probabilistic model for online document
clustering. 
We use x = (n_1^(x), n_2^(x), ..., n_V^(x)) to represent a document vector, where each
element n_v^(x) denotes the term frequency of the vth vocabulary word in the document x,
and V is the total size of the vocabulary.


2.1 Dirichlet-Multinomial Model

The multinomial distribution has been one of the most frequently used language models for
modeling documents in information retrieval. It assumes that, given the set of parameters
θ = (θ_1, θ_2, ..., θ_V), a document x is generated with the following probability:

    p(x|θ) = [ (Σ_{v=1}^V n_v^(x))! / Π_{v=1}^V n_v^(x)! ] Π_{v=1}^V θ_v^{n_v^(x)}.

From the formula we can see the so-called naive assumption: words are assumed to be in-
dependent of each other. Given a collection of documents generated from the same model,
the parameter θ can be estimated with Maximum Likelihood Estimation (MLE).

In a Bayesian approach we would like to put a Dirichlet prior over the parameter (θ ~
Dir(α)) such that the probability of generating a document is obtained by integrating over
the parameter space: p(x) = ∫ p(θ|α) p(x|θ) dθ. This integral can be easily written down
due to the conjugacy between the Dirichlet and multinomial distributions. The key dif-
ference between the Bayesian approach and the MLE is that the former uses a distribution
to model the uncertainty of the parameter θ, while the latter gives only a point estimate.


2.2 Online Document Clustering with Dirichlet Process Mixture Model

In our system documents are grouped into clusters in an online fashion. Each cluster is
modeled with a multinomial distribution whose parameter θ follows a Dirichlet prior. 
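By conjugacy, the integral above reduces to a ratio of Gamma functions. The sketch below (an illustration of this standard closed form, not code from the paper; the function name is ours) computes the log marginal likelihood, omitting the multinomial coefficient since it does not depend on the prior:

```python
from math import lgamma, exp

def log_marginal_likelihood(counts, alpha):
    """Log of the Dirichlet-multinomial marginal
        p(x|alpha) = Gamma(a0)/Gamma(a0+n) * prod_v Gamma(a_v+n_v)/Gamma(a_v)
    with a0 = sum(alpha) and n = sum(counts).  The multinomial coefficient
    is omitted: it is the same for every cluster being compared."""
    a0, n = sum(alpha), sum(counts)
    out = lgamma(a0) - lgamma(a0 + n)
    for nv, av in zip(counts, alpha):
        out += lgamma(av + nv) - lgamma(av)
    return out
```

For instance, exp(log_marginal_likelihood([1, 0], [1.0, 1.0])) is exactly 0.5: under a uniform Dirichlet prior over two words, a one-word document is equally likely to be either word.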
First,
a cluster is chosen based on a Dirichlet process prior (it can be either a new or an
existing cluster), and then a document is drawn from that cluster.

We use a Dirichlet Process (DP) to model the prior distribution of the θ's, and our
hierarchical model is as follows:

    x_i | c_i, θ^(c_i) ~ Mult(· | θ^(c_i))
    θ_i | G ~ G  (i.i.d.)                                                    (1)
    G ~ DP(α, G_0)

where c_i is the cluster indicator variable, θ_i is the multinomial parameter^1 for each
document, and θ^(c_i) is the unique θ for the cluster c_i. G is a random distribution
generated from the Dirichlet process DP(α, G_0) [4], which has a precision parameter α and
a base distribution G_0. Here our base distribution G_0 is a Dirichlet distribution
Dir(λβ_1, λβ_2, ..., λβ_V) with Σ_{t=1}^V β_t = 1, which reflects our expected knowledge
about G. Intuitively, our G_0 distribution can be treated as the prior over general English
word frequencies, which has been used in the information retrieval literature [6] to model
general English documents.

The exact cluster-document generation process can be described as follows:

 1. Let x_i be the current document under processing (the ith document in the input
    sequence), and C_1, C_2, ..., C_m the already generated clusters.

 2. Draw a cluster c_i based on the following Dirichlet process prior [4]:

        p(c_i = C_j) = |C_j| / (α + Σ_{j'=1}^m |C_{j'}|)    (j = 1, 2, ..., m)
                                                                             (2)
        p(c_i = C_{m+1}) = α / (α + Σ_{j'=1}^m |C_{j'}|)

    where |C_j| stands for the cardinality of cluster j, with Σ_{j=1}^m |C_j| = i - 1,
    so with a certain probability a new cluster C_{m+1} will be generated.

 3. Draw the document x_i from the cluster c_i.


2.3 Model Updating

Our model for each cluster needs to be updated based on incoming documents. We can
write down the probability that the current document x_i is generated by any cluster as

    p(x_i|C_j) = ∫ p(θ^(C_j)|C_j) p(x_i|θ^(C_j)) dθ^(C_j)    (j = 1, 2, ..., m, m+1)

where p(θ^(C_j)|C_j) is the posterior distribution of the parameters of the jth cluster (j =
1, 2, . . .
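Equation (2) is the standard Chinese-restaurant-process form of the DP prior, and can be sketched directly (an illustrative helper of our own, not code from the paper):

```python
def crp_prior(cluster_sizes, alpha):
    """Dirichlet-process prior of equation (2): the probability of each
    existing cluster, plus the probability of a brand-new cluster appended
    at the end of the returned list."""
    total = alpha + sum(cluster_sizes)          # alpha + (i - 1)
    probs = [size / total for size in cluster_sizes]
    probs.append(alpha / total)                 # p(c_i = C_{m+1})
    return probs
```

For example, crp_prior([3, 1], 1.0) returns [0.6, 0.2, 0.2]: with α = 1 and four documents already seen, the fifth document joins the size-3 cluster with probability 3/5 and opens a new cluster with probability 1/5.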
, m), and for convenience we use p(θ^(C_{m+1})|C_{m+1}) = p(θ^(C_{m+1})) to represent the
prior distribution of the parameters of the new cluster. Although the dimensionality of θ
is high (V ≈ 10^5 in our case), a closed-form solution can be obtained under our Dirichlet-
multinomial assumption. Once the conditional probabilities p(x_i|C_j) are computed, the
probabilities p(C_j|x_i) can be easily calculated using Bayes' rule:

    p(C_j|x_i) = p(C_j) p(x_i|C_j) / Σ_{j'=1}^{m+1} p(C_{j'}) p(x_i|C_{j'})

where the prior probability of each cluster is calculated using equation (2).

Now there are several choices to consider for updating the cluster models. The first
choice, which is correct but obviously intractable, is to fork m + 1 children of the
current system, where the jth child is updated with document x_i assigned to cluster j,
and the final system is a probabilistic combination of those children with the
corresponding probabilities p(C_j|x_i). The second choice is to make a hard decision by
assigning the current document x_i to the cluster with the maximum posterior probability:

    c_i = arg max_{C_j} p(C_j|x_i)
        = arg max_{C_j} p(C_j) p(x_i|C_j) / Σ_{j'=1}^{m+1} p(C_{j'}) p(x_i|C_{j'}).

 ^1 For θ we use θ_v to denote the vth element in the vector, θ_i to denote the parameter
vector that generates the ith document, and θ^(j) to denote the parameter vector for the
jth cluster.

The third choice is to use a soft probabilistic update, which is similar in spirit to
Assumed Density Filtering (ADF) [7] in the literature. That is, each cluster is updated by
exponentiating the likelihood function with the assignment probabilities:

    p(θ^(C_j)|x_i, C_j) ∝ p(x_i|θ^(C_j))^{p(C_j|x_i)} p(θ^(C_j)|C_j).

However, we have to treat the new cluster specially, since we cannot afford, either time-
wise or space-wise, to generate a new cluster for each incoming document. Instead, we
update all existing clusters as above, and a new cluster is generated only if c_i =
C_{m+1}. 
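The hard-decision variant can be sketched end-to-end: score each candidate cluster by log p(C_j) + log p(x_i|C_j), using the closed-form Dirichlet-multinomial predictive for the likelihood term, then assign hard. This is an illustrative simplification of our own (dense count vectors and invented function names; a real system would use sparse vectors over V ≈ 10^5 terms):

```python
from math import lgamma, log

def dm_log_pred(cluster_counts, x, lam_beta):
    """Closed-form log p(x | cluster): Dirichlet-multinomial predictive with
    posterior pseudo-counts lam_beta[v] + cluster_counts[v] per word v."""
    post = [a + c for a, c in zip(lam_beta, cluster_counts)]
    a0, n = sum(post), sum(x)
    out = lgamma(a0) - lgamma(a0 + n)
    for av, nv in zip(post, x):
        out += lgamma(av + nv) - lgamma(av)
    return out

def hd_step(clusters, sizes, x, lam_beta, alpha):
    """One hard-decision (HD) step: pick arg max_j p(C_j) p(x|C_j) over the
    existing clusters plus a potential new one, and update counts in place.
    Returns the chosen index; index len(clusters) means a new cluster."""
    total = alpha + sum(sizes)
    scores = [log(s / total) + dm_log_pred(c, x, lam_beta)
              for c, s in zip(clusters, sizes)]
    # candidate new cluster: DP prior mass alpha, empty count vector
    scores.append(log(alpha / total) + dm_log_pred([0] * len(x), x, lam_beta))
    j = max(range(len(scores)), key=scores.__getitem__)
    if j == len(clusters):              # c_i = C_{m+1}: open a new cluster
        clusters.append(list(x))
        sizes.append(1)
    else:                               # absorb x into the winning cluster
        clusters[j] = [a + b for a, b in zip(clusters[j], x)]
        sizes[j] += 1
    return j
```

With this sketch, a document repeating an existing cluster's vocabulary joins that cluster, while a document over disjoint vocabulary scores higher under the fresh base-distribution candidate and opens a new one.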
We will use HD and PD (hard decision and probabilistic decision) to denote the last two
candidates in our experiments.


3 Learning Model Parameters

In the above probabilistic model several hyperparameters are still unspecified, namely
the λ and the β's in the base distribution G_0 = Dir(λβ_1, λβ_2, ..., λβ_V), and the
precision parameter α in DP(α, G_0). Since we can obtain a partially labeled historical
dataset^2, we now discuss how to estimate those parameters.

We mainly use the empirical Bayes method [5] to estimate those parameters instead of
taking a full Bayesian approach, since it is easier to compute and generally reliable
when the number of data points is relatively large compared to the number of parameters.
Because the θ_i's are i.i.d. samples from the random distribution G, by integrating out G
we get

    θ_i | θ_1, ..., θ_{i-1} ~ (α / (α + i - 1)) G_0 + (1 / (α + i - 1)) Σ_{j=1}^{i-1} δ_{θ_j}.


4 Experiments

We apply the above online clustering model to the novelty detection task in Topic
Detection and Tracking (TDT). TDT, a research initiative that aims at techniques to
automatically process news documents in terms of events, has been an active research
community since its 1997 pilot study. There are several tasks defined in TDT, and among
them Novelty Detection (a.k.a. First Story Detection or New Event Detection) has been
regarded as the hardest task in this area [2]. The objective of the novelty detection task
is to detect the earliest report for each event as soon as that report arrives in the
temporal sequence of news stories.


4.1 Dataset

We use the TDT2 corpus as our historical dataset for estimating parameters, and use the
TDT3 corpus to evaluate our model^5. Notice that we have a subset of documents in the
historical dataset (TDT2) for which event labels are given. The TDT2 corpus used for the
novelty detection task consists of 62,962 documents; among them 8,401 documents are
labeled in 96 clusters. 
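The sequential draw of θ_i given its predecessors (the Pólya urn view of the DP) can be simulated directly; under it the expected number of distinct clusters after n draws is Σ_{i=1}^n α/(α+i-1), which is what ties the number of labeled events observed in the historical data to an estimate of α. A hedged sketch of the simulation (our own code, not the paper's estimator):

```python
import random

def polya_urn_clusters(n, alpha, rng):
    """Simulate cluster assignments for theta_1..theta_n from the urn:
    draw a fresh value from G_0 (a new cluster) with probability
    alpha/(alpha + i - 1), otherwise repeat a previous draw, i.e. join an
    existing cluster with probability proportional to its size.
    Returns the list of resulting cluster sizes."""
    sizes = []
    for i in range(1, n + 1):
        if rng.random() < alpha / (alpha + i - 1):
            sizes.append(1)                 # new cluster from G_0
        else:
            r = rng.randrange(i - 1)        # pick one of the i-1 old draws
            for j, s in enumerate(sizes):
                r -= s
                if r < 0:
                    sizes[j] += 1
                    break
    return sizes
```

The first draw always opens a cluster (the new-cluster probability is α/α = 1), and the number of clusters then grows roughly logarithmically in n, consistent with clusters accumulating slowly as a news stream lengthens.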
Stopwords are removed and words are stemmed; after that, there are on average 180 words
per document. The total number of features (unique words) is around 100,000.


4.2 Evaluation Measure

In our experiments we use the standard TDT evaluation measure [1] to evaluate our results.
The performance is characterized in terms of the probabilities of two types of errors:
miss and false alarm (P_Miss and P_FA). These two error probabilities are then combined
into a single detection cost, C_det, by assigning costs to miss and false alarm errors:

    C_det = C_Miss · P_Miss · P_target + C_FA · P_FA · P_non-target

where

 1. C_Miss and C_FA are the costs of a miss and a false alarm, respectively,

 2. P_Miss and P_FA are the conditional probabilities of a miss and a false alarm,
    respectively, and

 3. P_target and P_non-target are the prior target probabilities
    (P_target = 1 - P_non-target).

It is the following normalized cost that is actually used in evaluating the various TDT
systems:

    (C_det)_norm = C_det / min(C_Miss · P_target, C_FA · P_non-target)

where the denominator is the minimum cost of the two trivial systems. Besides, two types
of evaluations are used in TDT, namely macro-averaged (topic-weighted) and micro-averaged
(story-weighted) evaluations. In macro-averaged evaluation, the cost is computed for every
event, and then the average is taken. In micro-averaged evaluation the cost is averaged
over all documents' decisions generated by the system, so larger events have a bigger
impact on the overall performance. Note that macro-averaged evaluation is used as the
primary evaluation measure in TDT. In addition to the binary decision "novel" or
"non-novel", each system is required to generate a confidence score for each test
document. The higher the score, the more likely the document is novel.

 ^5 Strictly speaking, we only used the subsets of TDT2 and TDT3 that are designated for
the novelty detection task.

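As a concrete illustration of the cost formulas, here is a small helper (our own, not the official TDT evaluation software). The default settings C_Miss = 1, C_FA = 0.1, P_target = 0.02 are the values commonly used in TDT evaluations, but treat them as assumptions here:

```python
def norm_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized TDT detection cost (C_det)_norm as defined above:
    the raw cost divided by the better of the two trivial systems."""
    p_non = 1.0 - p_target
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * p_non
    return c_det / min(c_miss * p_target, c_fa * p_non)
```

Under these settings the trivial system that flags nothing as novel (P_Miss = 1, P_FA = 0) scores exactly 1.0, while flagging everything as novel scores 0.098 / 0.02 = 4.9; a useful system should therefore land below 1.0.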
Here we mainly use the minimum cost over all threshold settings to evaluate systems,
since it is independent of any particular choice of threshold.


4.3 Methods

One simple but effective method is the "GAC-INCR" clustering method [9] with cosine
similarity metric and TFIDF term weighting, which has remained the top-performing system
in the TDT 2002 & 2003 official evaluations. For this method the novelty confidence score
we use is one minus the similarity score between the current cluster x_i and its nearest
neighbor cluster: s(x_i) = 1.0 - max_j sim(x_i, C_j)