{"title": "Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model", "book": "Advances in Neural Information Processing Systems", "page_first": 241, "page_last": 248, "abstract": null, "full_text": "Modeling General and Speci\ufb01c Aspects of Documents\n\nwith a Probabilistic Topic Model\n\nChaitanya Chemudugunta, Padhraic Smyth\n\nDepartment of Computer Science\nUniversity of California, Irvine\nIrvine, CA 92697-3435, USA\n\nMark Steyvers\n\nDepartment of Cognitive Sciences\n\nUniversity of California, Irvine\nIrvine, CA 92697-5100, USA\n\n{chandra,smyth}@ics.uci.edu\n\nmsteyver@uci.edu\n\nAbstract\n\nTechniques such as probabilistic topic models and latent-semantic indexing have\nbeen shown to be broadly useful at automatically extracting the topical or seman-\ntic content of documents, or more generally for dimension-reduction of sparse\ncount data. These types of models and algorithms can be viewed as generating an\nabstraction from the words in a document to a lower-dimensional latent variable\nrepresentation that captures what the document is generally about beyond the spe-\nci\ufb01c words it contains. In this paper we propose a new probabilistic model that\ntempers this approach by representing each document as a combination of (a) a\nbackground distribution over common words, (b) a mixture distribution over gen-\neral topics, and (c) a distribution over words that are treated as being speci\ufb01c to\nthat document. 
We illustrate how this model can be used for information retrieval\nby matching documents both at a general topic level and at a specific word level,\nproviding an advantage over techniques that only match documents at a general\nlevel (such as topic models or latent-semantic indexing) or that only match\ndocuments at the specific word level (such as TF-IDF).\n\n1 Introduction and Motivation\n\nReducing high-dimensional data vectors to robust and interpretable lower-dimensional\nrepresentations has a long and successful history in data analysis, including recent innovations such as latent\nsemantic indexing (LSI) (Deerwester et al., 1990) and latent Dirichlet allocation (LDA) (Blei, Ng,\nand Jordan, 2003). These types of techniques have found broad application in modeling of sparse\nhigh-dimensional count data such as the \u201cbag of words\u201d representations for documents or transaction\ndata for Web and retail applications.\n\nApproaches such as LSI and LDA have both been shown to be useful for \u201cobject matching\u201d in their\nrespective latent spaces. In information retrieval for example, both a query and a set of documents\ncan be represented in the LSI or topic latent spaces, and the documents can be ranked in terms of\nhow well they match the query based on distance or similarity in the latent space. The mapping to\nlatent space represents a generalization or abstraction away from the sparse set of observed words, to\na \u201chigher-level\u201d semantic representation in the latent space. These abstractions in principle lead to\nbetter generalization on new data compared to inferences carried out directly in the original sparse\nhigh-dimensional space. 
The capability of these models to provide improved generalization has\nbeen demonstrated empirically in a number of studies (e.g., Deerwester et al., 1990; Hofmann, 1999;\nCanny, 2004; Buntine et al., 2005).\n\nHowever, while this type of generalization is broadly useful in terms of inference and prediction,\nthere are situations where one can over-generalize. Consider trying to match the following query\nto a historical archive of news articles: election + campaign + Camejo. The query is intended to\nfind documents that are about US presidential campaigns and also about Peter Camejo (who ran as\nvice-presidential candidate alongside independent Ralph Nader in 2004). LSI and topic models are\nlikely to highly rank articles that are related to presidential elections (even if they don\u2019t necessarily\ncontain the words election or campaign).\n\nHowever, a potential problem is that the documents that are highly ranked by LSI or topic models\nneed not include any mention of the name Camejo. The reason is that the combination of words\nin this query is likely to activate one or more latent variables related to the concept of presidential\ncampaigns. However, once this generalization is made the model has \u201clost\u201d the information about\nthe specific word Camejo and it will only show up in highly ranked documents if this word happens\nto frequently occur in these topics (unlikely in this case given that this candidate received relatively\nlittle media coverage compared to the coverage given to the candidates from the two main parties).\nBut from the viewpoint of the original query, our preference would be to get documents that are\nabout the general topic of US presidential elections with the specific constraint that they mention\nPeter Camejo.\n\nWord-based retrieval techniques, such as the widely-used term-frequency inverse-document-frequency\n(TF-IDF) method, have the opposite problem in general. 
They tend to be overly speci\ufb01c\nin terms of matching words in the query to documents.\n\nIn general of course one would like to have a balance between generality and speci\ufb01city. One ad hoc\napproach is to combine scores from a general method such as LSI with those from a more speci\ufb01c\nmethod such as TF-IDF in some manner, and indeed this technique has been proposed in information\nretrieval (Vogt and Cottrell, 1999). Similarly, in the ad hoc LDA approach (Wei and Croft, 2006), the\nLDA model is linearly combined with document-speci\ufb01c word distributions to capture both general\nas well as speci\ufb01c information in documents. However, neither method is entirely satisfactory since\nit is not clear how to trade-off generality and speci\ufb01city in a principled way.\n\nThe contribution of this paper is a new graphical model based on latent topics that handles the trade-\noff between generality and speci\ufb01city in a fully probabilistic and automated manner. The model,\nwhich we call the special words with background (SWB) model, is an extension of the LDA model.\nThe new model allows words in documents to be modeled as either originating from general topics,\nor from document-speci\ufb01c \u201cspecial\u201d word distributions, or from a corpus-wide background distribu-\ntion. The idea is that words in a document such as election and campaign are likely to come from\na general topic on presidential elections, whereas a name such as Camejo is much more likely to\nbe treated as \u201cnon-topical\u201d and speci\ufb01c to that document. Words in queries are automatically inter-\npreted (in a probabilistic manner) as either being topical or special, in the context of each document,\nallowing for a data-driven document-speci\ufb01c trade-off between the bene\ufb01ts of topic-based abstrac-\ntion and speci\ufb01c word matching. 
Daum\u00e9 and Marcu (2006) independently proposed a probabilistic\nmodel using similar concepts for handling different training and test distributions in classification\nproblems.\n\nAlthough we have focused primarily on documents in information retrieval in the discussion above,\nthe model we propose can in principle be used on any large sparse matrix of count data. For example,\ntransaction data sets where rows are individuals and columns correspond to items purchased or Web\nsites visited are ideally suited to this approach. The latent topics can capture broad patterns of\npopulation behavior and the \u201cspecial word distributions\u201d can capture the idiosyncrasies of specific\nindividuals.\n\nSection 2 reviews the basic principles of the LDA model and introduces the new SWB model.\nSection 3 illustrates how the model works in practice using examples from New York Times news\narticles. In Section 4 we describe a number of experiments with 4 different document sets,\nincluding perplexity experiments and information retrieval experiments, illustrating the trade-offs between\ngeneralization and specificity for different models. Section 5 contains a brief discussion and\nconcluding comments.\n\n2 A Topic Model for Special Words\n\nFigure 1(a) shows the graphical model for what we will refer to as the \u201cstandard topic model\u201d\nor LDA. There are D documents and document d has Nd words. \u03b1 and \u03b2 are fixed parameters of\nsymmetric Dirichlet priors for the D document-topic multinomials represented by \u03b8 and the T\ntopic-word multinomials represented by \u03c6. 
In the generative model, for each document d, the Nd words\nare generated by drawing a topic t from the document-topic distribution p(z|\u03b8d) and then drawing\na word w from the topic-word distribution p(w|z = t, \u03c6t). As shown in Griffiths and Steyvers\n(2004) the topic assignments z for each word token in the corpus can be efficiently sampled via\nGibbs sampling (after marginalizing over \u03b8 and \u03c6). Point estimates for the \u03b8 and \u03c6 distributions\ncan be computed conditioned on a particular sample, and predictive distributions can be obtained by\naveraging over multiple samples.\n\nFigure 1: Graphical models for (a) the standard LDA topic model (left) and (b) the proposed special\nwords topic model with a background distribution (SWB) (right).\n\nWe will refer to the proposed model as the special words topic model with background distribution\n(SWB) (Figure 1(b)). SWB has a similar general structure to the LDA model (Figure 1(a)) but with\nadditional machinery to handle special words and background words. In particular, associated with\neach word token is a latent random variable x, taking value x = 0 if the word w is generated via\nthe topic route, value x = 1 if the word is generated as a special word (for that document) and\nvalue x = 2 if the word is generated from a background distribution specific to the corpus. The\nvariable x acts as a switch: if x = 0, the previously described standard topic mechanism is used\nto generate the word, whereas if x = 1 or x = 2, words are sampled from a document-specific\nmultinomial \u03a8 or a corpus-specific multinomial \u2126 (with symmetric Dirichlet priors parametrized by\n\u03b21 and \u03b22) respectively. x is sampled from a document-specific multinomial \u03bb, which in turn has\na symmetric Dirichlet prior, \u03b3. 
One could also use a hierarchical Bayesian approach to introduce\nanother level of uncertainty about the Dirichlet priors (e.g., see Blei, Ng, and Jordan, 2003); we\nhave not investigated this option, primarily for computational reasons. In all our experiments, we\nset \u03b1 = 0.1, \u03b20 = \u03b22 = 0.01, \u03b21 = 0.0001 and \u03b3 = 0.3, all weak symmetric priors.\n\nThe conditional probability of a word w given a document d can be written as:\n\n$$p(w|d) = p(x = 0|d) \\sum_{t=1}^{T} p(w|z = t) p(z = t|d) + p(x = 1|d) p'(w|d) + p(x = 2|d) p''(w)$$\n\nwhere p\u2032(w|d) is the special word distribution for document d, and p\u2032\u2032(w) is the background word\ndistribution for the corpus. Note that when compared to the standard topic model the SWB model\ncan explain words in three different ways: via topics, via a special word distribution, or via a\nbackground word distribution. Given the graphical model above, it is relatively straightforward to derive\nGibbs sampling equations that allow joint sampling of the zi and xi latent variables for each word\ntoken wi, for xi = 0:\n\n$$p(x_i = 0, z_i = t \\mid \\mathbf{w}, \\mathbf{x}_{-i}, \\mathbf{z}_{-i}, \\alpha, \\beta_0, \\gamma) \\propto \\frac{N_{d0,-i} + \\gamma}{N_{d,-i} + 3\\gamma} \\times \\frac{C^{TD}_{td,-i} + \\alpha}{\\sum_{t'} C^{TD}_{t'd,-i} + T\\alpha} \\times \\frac{C^{WT}_{wt,-i} + \\beta_0}{\\sum_{w'} C^{WT}_{w't,-i} + W\\beta_0}$$\n\nand for xi = 1:\n\n$$p(x_i = 1 \\mid \\mathbf{w}, \\mathbf{x}_{-i}, \\mathbf{z}_{-i}, \\beta_1, \\gamma) \\propto \\frac{N_{d1,-i} + \\gamma}{N_{d,-i} + 3\\gamma} \\times \\frac{C^{WD}_{wd,-i} + \\beta_1}{\\sum_{w'} C^{WD}_{w'd,-i} + W\\beta_1}$$\n\ne mail krugman nytimes com memo to critics of the media s liberal bias the pinkos you really should be going after are those business reporters even i was startled by \nthe tone of the jan 21 issue of investment news which describes itself as the weekly newspaper for financial advisers the headline was paul o neill s sweet deal the \nblurb was irs backs 
off closing loophole averting tax liability for execs and treasury chief it s not really news that the bush administration likes tax breaks for \nbusinessmen but two weeks later i learned from the wall street journal that this loophole is more than a tax break for businessmen it s a gift to biznesmen and it may be \npart of a larger pattern confused in the former soviet union the term biznesmen pronounced beeznessmen refers to the class of sudden new rich who emerged after the \nfall of communism and who generally got rich by using their connections to strip away the assets of public enterprises what we ve learned from enron and other \nplayers to be named later is that america has its own biznesmen and that we need to watch out for policies that make it easier for them to ply their trade it turns out that \nthe sweet deal investment news was referring to the use of split premium life insurance policies to give executives largely tax free compensation you don t want to \nknow the details is an even sweeter deal for executives of companies that go belly up it shields their wealth from creditors and even from lawsuits sure enough reports \nthe wall street journal former enron c e o s kenneth lay and jeffrey skilling both had large split premium policies so what other pro biznes policies have been \npromulgated lately last year both houses of \u2026 \n \n \njohn w snow was paid more than 50 million in salary bonus and stock in his nearly 12 years as chairman of the csx corporation the railroad company during that \nperiod the company s profits fell and its stock rose a bit more than half as much as that of the average big company mr snow s compensation amid csx s uneven \nperformance has drawn criticism from union officials and some corporate governance specialists in 2000 for example after the stock had plunged csx decided to \nreverse a 25 million loan to him the move is likely to get more scrutiny after yesterday s announcement that mr snow has been chosen by 
president bush to replace \npaul o neill as the treasury secretary like mr o neill mr snow is an outsider on wall street but an insider in corporate america with long experience running an industrial \ncompany some wall street analysts who follow csx said yesterday that mr snow had ably led the company through a difficult period in the railroad industry and would \nmake a good treasury secretary it s an excellent nomination said jill evans an analyst at j p morgan who has a neutral rating on csx stock i think john s a great person \nfor the administration he as the c e o of a railroad has probably touched every sector of the economy union officials are less complimentary of mr snow s performance \nat csx last year the a f l c i o criticized him and csx for the company s decision to reverse the loan allowing him to return stock he had purchased with the borrowed \nmoney at a time when independent directors are in demand a corporate governance specialist said recently that mr snow had more business relationships with \nmembers of his own board than any other chief executive in addition mr snow is the third highest paid of 37 chief executives of transportation companies said ric \nmarshall chief executive of the corporate library which provides specialized investment research into corporate boards his own compensation levels have been pretty \nhigh mr marshall said he could afford to take a public service job a csx program in 1996 allowed mr snow and other top csx executives to buy\u2026 \n\nFigure 2: Examples of two news articles with special words (as inferred by the model) shaded in\ngray. 
(a) upper, email article with several colloquialisms, (b) lower, article about CSX corporation.\n\nand for xi = 2:\n\n$$p(x_i = 2 \\mid \\mathbf{w}, \\mathbf{x}_{-i}, \\mathbf{z}_{-i}, \\beta_2, \\gamma) \\propto \\frac{N_{d2,-i} + \\gamma}{N_{d,-i} + 3\\gamma} \\times \\frac{C^{W}_{w,-i} + \\beta_2}{\\sum_{w'} C^{W}_{w',-i} + W\\beta_2}$$\n\nwhere the subscript \u2212i indicates that the count for word token i is removed, Nd is the number of\nwords in document d and Nd0, Nd1 and Nd2 are the number of words in document d assigned to the\nlatent topics, special words and background component, respectively, $C^{WT}_{wt}$, $C^{WD}_{wd}$ and $C^{W}_{w}$ are the\nnumber of times word w is assigned to topic t, to the special-words distribution of document d, and\nto the background distribution, respectively, and W is the number of unique words in the corpus.\nNote that when there is not strong supporting evidence for xi = 0 (i.e., the conditional probability\nof this event is low), then the probability of the word being generated by the special words route,\nxi = 1, or background route, xi = 2, increases.\n\nOne iteration of the Gibbs sampler corresponds to a sampling pass through all word tokens in the\ncorpus. In practice we have found that around 500 iterations are often sufficient for the in-sample\nperplexity (or log-likelihood) and the topic distributions to stabilize.\n\nWe also pursued a variant of SWB, the special words (SW) model that excludes the background\ndistribution \u2126 and has a symmetric Beta prior, \u03b3, on \u03bb (which in SW is a document-specific Bernoulli\ndistribution). In all our SW model runs, we set \u03b3 = 0.5, resulting in a weak symmetric prior that is\nequivalent to adding one pseudo-word to each document. 
Experimental results (not shown) indicate\nthat the \ufb01nal word-topic assignments are not sensitive to either the value of the prior or the initial\nassignments to the latent variables, x and z.\n\n3 Illustrative Examples\n\nWe illustrate the operation of the SW model with a data set consisting of 3104 articles from the\nNew York Times (NYT) with a total of 1,399,488 word tokens. This small set of NYT articles was\nformed by selecting all NYT articles that mention the word \u201cEnron.\u201d The SW topic model was run\nwith T = 100 topics. In total, 10 Gibbs samples were collected from the model. Figure 2 shows\ntwo short fragments of articles from this NYT dataset. The background color of words indicates the\nprobability of assigning words to the special words topic\u2014darker colors are associated with higher\nprobability that over the 10 Gibbs samples a word was assigned to the special topic. The words\nwith gray foreground colors were treated as stopwords and were not included in the analysis. 
\n\nCollection   # of Docs   Total # of Word Tokens   Median Doc Length   Mean Doc Length   # of Queries\nNIPS         1740        2,301,375                1310                1322.6            N/A\nPATENTS      6711        15,014,099               1858                2237.2            N/A\nAP           10000       2,426,181                235.5               242.6             142\nFR           2500        6,332,681                516                 2533.1            30\n\nTable 1: General characteristics of document data sets used in experiments.\n\nNIPS              PATENTS           AP                FR\nset       .0206   fig       .0647   tagnum    .0416   nation    .0147\nnumber    .0167   end       .0372   itag      .0412   sai       .0129\nresults   .0153   extend    .0267   requir    .0381   presid    .0118\ncase      .0123   invent    .0246   includ    .0207   polici    .0108\nproblem   .0118   view      .0214   section   .0189   issu      .0096\nfunction  .0108   shown     .0191   determin  .0134   call      .0094\nvalues    .0102   claim     .0189   part      .0112   support   .0085\npaper     .0088   side      .0177   inform    .0105   need      .0079\napproach  .0080   posit     .0153   addit     .0096   govern    .0070\nlarge     .0079   form      .0128   applic    .0086   effort    .0068\n\nFigure 3: Examples of background distributions (10 most likely words) learned by the SWB model\nfor 4 different document corpora.\n\nFigure 2(a) shows how intentionally misspelled words such as \u201cbiznesmen\u201d and \u201cbeeznessmen\u201d and rare\nwords such as \u201cpinkos\u201d are likely to be assigned to the special words topic. Figure 2(b) shows how\na last name such as \u201cSnow\u201d and the corporation name \u201cCSX\u201d that are specific to the document are\nlikely to be assigned to the special topic. The words \u201cSnow\u201d and \u201cCSX\u201d do not occur often in other\ndocuments but are mentioned several times in the example document. 
This combination of low\ndocument-frequency and high term-frequency within the document is one factor that makes these\nwords more likely to be treated as \u201cspecial\u201d words.\n\n4 Experimental Results: Perplexity and Precision\n\nWe use 4 different document sets in our experiments, as summarized in Table 1. The NIPS and\nPATENTS document sets are used for perplexity experiments and the AP and FR data sets for re-\ntrieval experiments. The NIPS data set is available online1 and PATENTS, AP, and FR consist of\ndocuments from the U.S. Patents collection (TREC Vol-3), Associated Press news articles from 1998\n(TREC Vol-2), and articles from the Federal Register (TREC Vol-1, 2) respectively. To create the\nsampled AP and FR data sets, all documents relevant to queries were included \ufb01rst and the rest of\nthe documents were chosen randomly. In the results below all LDA/SWB/SW models were \ufb01t using\nT = 200 topics.\nFigure 3 demonstrates the background component learned by the SWB model on the 4 different doc-\nument data sets. The background distributions learned for each set of documents are quite intuitive,\nwith words that are commonly used across a broad range of documents within each corpus. The ratio\nof words assigned to the special words distribution and the background distribution are (respectively\nfor each data set), 25%:10% (NIPS), 58%:5% (PATENTS), 11%:6% (AP), 50%:11% (FR). Of note\nis the fact that a much larger fraction of words are treated as special in collections containing long\ndocuments (NIPS, PATENTS, and FR) than in short \u201cabstract-like\u201d collections (such as AP)\u2014this\nmakes sense since short documents are more likely to contain general summary information while\nlonger documents will have more speci\ufb01c details.\n\n4.1 Perplexity Comparisons\n\nThe NIPS and PATENTS documents sets do not have queries and relevance judgments, but nonethe-\nless are useful for evaluating perplexity. 
We compare the predictive performance of the SW and\nSWB topic models with the standard topic model by computing the perplexity of unseen words in\ntest documents. Perplexity of a test set under a model is defined as follows:\n\n$$\\mathrm{Perplexity}(\\mathbf{w}_{\\mathrm{test}} \\mid D^{\\mathrm{train}}) = \\exp\\left( - \\frac{\\sum_{d=1}^{D^{\\mathrm{test}}} \\log p(\\mathbf{w}_d \\mid D^{\\mathrm{train}})}{\\sum_{d=1}^{D^{\\mathrm{test}}} N_d} \\right)$$\n\nwhere wtest is a vector of words in the test data set, wd is a vector of words in document d of the test\nset, and Dtrain is the training set.\n\n1 From http://www.cs.toronto.edu/~roweis/data.html\n\nFigure 4: Average perplexity of the two special words models and the standard topics model as a\nfunction of the percentage of words observed in test documents on the NIPS data set (left) and the\nPATENTS data set (right). [Plots omitted; axes: perplexity versus percentage of words observed, with curves for LDA, SW and SWB.]\n\nFor the SWB model, we approximate p(wd|Dtrain) as follows:\n\n$$p(\\mathbf{w}_d \\mid D^{\\mathrm{train}}) \\approx \\frac{1}{S} \\sum_{s=1}^{S} p(\\mathbf{w}_d \\mid \\{\\Theta^s, \\Phi^s, \\Psi^s, \\Omega^s, \\lambda^s\\})$$\n\nwhere \u0398s, \u03a6s, \u03a8s, \u2126s and \u03bbs are point estimates from s = 1:S different Gibbs sampling runs.\nThe probability of the words wd in a test document d, given its parameters, can be computed as\nfollows:\n\n$$p(\\mathbf{w}_d \\mid \\{\\Theta^s, \\Phi^s, \\Psi^s, \\Omega^s, \\lambda^s\\}) = \\prod_{i=1}^{N_d} \\left[ \\lambda^s_{1d} \\sum_{t=1}^{T} \\phi^s_{w_i t} \\theta^s_{td} + \\lambda^s_{2d} \\psi^s_{w_i d} + \\lambda^s_{3d} \\Omega^s_{w_i} \\right]$$\n\nwhere Nd is the number of words in test document d and wi is the ith word being predicted in the\ntest document. 
$\\theta^s_{td}$, $\\phi^s_{w_i t}$, $\\psi^s_{w_i d}$, $\\Omega^s_{w_i}$ and $\\lambda^s_d$ are point estimates from sample s.\n\nWhen a fraction of words of a test document d is observed, a Gibbs sampler is run on the observed\nwords to update the document-specific parameters, \u03b8d, \u03c8d and \u03bbd and these updated parameters are\nused in the computation of perplexity. For the NIPS data set, documents from the last year of the\ndata set were held out to compute perplexity (Dtest = 150), and for the PATENTS data set 500\ndocuments were randomly selected as test documents.\n\nFrom the perplexity figures, it can be seen that once a small fraction of the test document words\nis observed (20% for NIPS and 10% for PATENTS), the SW and SWB models have significantly\nlower perplexity values than LDA, indicating that the SW and SWB models are using the special\nwords \u201croute\u201d to better learn predictive models for individual documents.\n\n4.2 Information Retrieval Results\n\nReturning to the point of capturing both specific and general aspects of documents as discussed in\nthe introduction of the paper, we generated 500 queries of length 3-5 using randomly selected\nlow-frequency words from the NIPS corpus and then ranked documents relative to these queries using\nseveral different methods. Table 2 shows for the top k-ranked documents (k = 1, 10, 50, 100) how\nmany of the retrieved documents contained at least one of the words in the query. 
\n\nMethod   1 Ret Doc   10 Ret Docs   50 Ret Docs   100 Ret Docs\nTF-IDF   100.0       100.0         100.0         100.0\nLSI      97.6        82.7          64.6          54.3\nLDA      90.0        80.6          67.0          58.7\nSW       99.2        97.1          79.1          67.3\nSWB      99.4        96.6          78.7          67.2\n\nTable 2: Percentage of retrieved documents containing at least one query word (NIPS corpus).\n\nAP, MAP\nMethod   Title   Desc    Concepts\nTF-IDF   .353    .358    .498\nLSI      .286    .387    .459\nLDA      .424    .394    .498\nSW       .466*   .430*   .550*\nSWB      .460*   .417    .549*\n\nFR, MAP\nMethod   Title   Desc    Concepts\nTF-IDF   .268    .272    .391\nLSI      .329    .295    .399\nLDA      .344    .271    .396\nSW       .371    .323*   .448*\nSWB      .373    .328*   .435\n\nAP, Pr@10d\nMethod   Title   Desc    Concepts\nTF-IDF   .406    .434    .549\nLSI      .455    .469    .523\nLDA      .478    .463    .556\nSW       .524*   .509*   .599*\nSWB      .513*   .495    .603*\n\nFR, Pr@10d\nMethod   Title   Desc    Concepts\nTF-IDF   .300    .287    .483\nLSI      .366    .327    .487\nLDA      .428    .340    .487\nSW       .469    .407*   .550*\nSWB      .462    .423*   .523\n\n* = significant difference with respect to LDA\n\nFigure 5: Information retrieval experimental results.\n\nNote that we are\nnot assessing relevance here in a traditional information retrieval sense, but instead are assessing how\noften specific query words occur in retrieved documents. TF-IDF has 100% matches, as one would\nexpect, and the techniques that generalize (such as LSI and LDA) have far fewer exact matches.\nThe SWB and SW models have more specific matches than either LDA or LSI, indicating that they\nhave the ability to match at the level of specific words. 
Of course this is not of much utility unless\nthe SWB and SW models can also perform well in terms of retrieving relevant documents (not just\ndocuments containing the query words), which we investigate next.\n\nFor the AP and FR document sets, 3 types of query sets were constructed from TREC Topics 1-150,\nbased on the Title (short), Desc (sentence-length) and Concepts (long list of keywords) fields.\nQueries that have no relevance judgments for a collection were removed from the query set for that\ncollection.\n\nThe score for a document d relative to a query q for the SW and standard topic models can be\ncomputed as the probability of q given d (known as the query-likelihood model in the IR community).\nFor the SWB topic model, we have\n\n$$p(q|d) \\approx \\prod_{w \\in q} \\left[ p(x = 0|d) \\sum_{t=1}^{T} p(w|z = t) p(z = t|d) + p(x = 1|d) p'(w|d) + p(x = 2|d) p''(w) \\right]$$\n\nWe compare SW and SWB models with the standard topic model (LDA), LSI and TF-IDF. The TF-IDF\nscore for a word w in a document d is computed as\n\n$$\\text{TF-IDF}(w, d) = \\frac{C^{WD}_{wd}}{N_d} \\times \\log_2 \\frac{D}{D_w}$$\n\nFor LSI, the TF-IDF weight matrix is reduced to a K-dimensional latent space using SVD, K = 200. A\ngiven query is first mapped into the LSI latent space or the TF-IDF space (known as query folding),\nand documents are scored based on their cosine distances to the mapped queries.\n\nTo measure the performance of each algorithm we used 2 metrics that are widely used in IR research:\nthe mean average precision (MAP) and the precision for the top 10 documents retrieved (Pr@10d).\nThe main difference between the AP and FR documents is that the latter documents are considerably\nlonger on average and there are fewer queries for the FR data set. Figure 5 summarizes the results,\nbroken down by algorithm, query type, document set, and metric. 
The maximum score for each\nquery experiment is shown in bold: in all cases (query-type/data set/metric) the SW or SWB model\nproduced the highest scores.\n\n\fTo determine statistical signi\ufb01cance, we performed a t-test at the 0.05 level between the scores of\neach of the SW and SWB models, and the scores of the LDA model (as LDA has the best scores\noverall among TF-IDF, LSI and LDA). Differences between SW and SWB are not signi\ufb01cant. In\n\ufb01gure 5, we use the symbol * to indicate scores where the SW and SWB models showed a statis-\ntically signi\ufb01cant difference (always an improvement) relative to the LDA model. The differences\nfor the \u201cnon-starred\u201d query and metric scores of SW and SWB are not statistically signi\ufb01cant but\nnonetheless always favor SW and SWB over LDA.\n\n5 Discussion and Conclusions\n\nWei and Croft (2006) have recently proposed an ad hoc LDA approach that models p(q|d) as a\nweighted combination of a multinomial over the entire corpus (the background model), a multino-\nmial over the document, and an LDA model. Wei and Croft showed that this combination provides\nexcellent retrieval performance compared to other state-of-the-art IR methods. In a number of exper-\niments (not shown) comparing the SWB and ad hoc LDA models we found that the two techniques\nproduced comparable precision performance, with small but systematic performance gains being\nachieved by an ad hoc combination where the standard LDA model in ad hoc LDA was replaced\nwith the SWB model. An interesting direction for future work is to investigate fully generative\nmodels that can achieve the performance of ad hoc approaches.\n\nIn conclusion, we have proposed a new probabilistic model that accounts for both general and spe-\nci\ufb01c aspects of documents or individual behavior. 
The model extends existing latent variable\nprobabilistic approaches such as LDA by allowing these models to take into account specific aspects of\ndocuments (or individuals) that are exceptions to the broader structure of the data. This allows, for\nexample, documents to be modeled as a mixture of words generated by general topics and words\ngenerated in a manner specific to that document. Experimental results on information retrieval tasks\nindicate that the SWB topic model does not suffer from the weakness of techniques such as LSI\nand LDA when faced with very specific query words, nor does it suffer the limitations of TF-IDF in\nterms of its ability to generalize.\n\nAcknowledgements\n\nWe thank Tom Griffiths for useful initial discussions about the special words model. This material\nis based upon work supported by the National Science Foundation under grant IIS-0083489. We\nacknowledge use of the computer clusters supported by NIH grant LM-07443-01 and NSF grant\nEIA-0321390 to Pierre Baldi and the Institute of Genomics and Bioinformatics.\n\nReferences\n\nBlei, D. M., Ng, A. Y., and Jordan, M. I. (2003) Latent Dirichlet allocation, Journal of Machine Learning\nResearch 3: 993-1022.\n\nBuntine, W., L\u00f6fstr\u00f6m, J., Perttu, S. and Valtonen, K. (2005) Topic-specific scoring of documents for relevant\nretrieval. Workshop on Learning in Web Search: 22nd International Conference on Machine Learning,\npp. 34-41. Bonn, Germany.\n\nCanny, J. (2004) GaP: a factor model for discrete data. Proceedings of the 27th Annual SIGIR Conference,\npp. 122-129.\n\nDaum\u00e9 III, H., and Marcu, D. (2006) Domain Adaptation for Statistical classifiers. Journal of Artificial\nIntelligence Research, 26: 101-126.\n\nDeerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990) Indexing by latent\nsemantic analysis. Journal of the American Society for Information Science, 41(6): 391-407.\n\nGriffiths, T. 
L., and Steyvers, M. (2004) Finding scienti\ufb01c topics, Proceedings of the National Academy of\n\nSciences, pp. 5228-5235.\n\nHofmann, T. (1999) Probabilistic latent semantic indexing, Proceedings of the 22nd Annual SIGIR Confer-\n\nence, pp. 50-57.\n\nVogt, C. and Cottrell, G. (1999) Fusion via a linear combination of scores. Information Retrieval, 1(3): 151-\n\n173.\n\nWei, X. and Croft, W.B. (2006) LDA-based document models for ad-hoc retrieval, Proceedings of the 29th\n\nSIGIR Conference, pp. 178-185.\n\n\f", "award": [], "sourceid": 2994, "authors": [{"given_name": "Chaitanya", "family_name": "Chemudugunta", "institution": null}, {"given_name": "Padhraic", "family_name": "Smyth", "institution": null}, {"given_name": "Mark", "family_name": "Steyvers", "institution": null}]}