Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers
Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-semantic indexing) or that only match documents at the specific word level (such as TF-IDF).
1 Introduction and Motivation
Reducing high-dimensional data vectors to robust and interpretable lower-dimensional representations has a long and successful history in data analysis, including recent innovations such as latent semantic indexing (LSI) (Deerwester et al., 1994) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). These types of techniques have found broad application in modeling of sparse high-dimensional count data such as the “bag of words” representations for documents or transaction data for Web and retail applications.
Approaches such as LSI and LDA have both been shown to be useful for “object matching” in their respective latent spaces. In information retrieval, for example, both a query and a set of documents can be represented in the LSI or topic latent spaces, and the documents can be ranked in terms of how well they match the query based on distance or similarity in the latent space. The mapping to latent space represents a generalization or abstraction away from the sparse set of observed words, to a “higher-level” semantic representation in the latent space. These abstractions in principle lead to better generalization on new data compared to inferences carried out directly in the original sparse high-dimensional space. The capability of these models to provide improved generalization has been demonstrated empirically in a number of studies (e.g., Deerwester et al., 1994; Hofmann, 1999; Canny, 2004; Buntine et al., 2005).
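The ranking step described above can be sketched as follows. This is a minimal illustration which assumes the query and documents have already been projected into a latent space by some trained model (e.g., LSI or a topic model); the function name and the choice of cosine similarity are ours, not prescribed by the methods cited.

```python
import numpy as np

def rank_by_latent_similarity(query_vec, doc_vecs):
    """Rank documents by cosine similarity to a query in a latent space.

    query_vec: 1-D array, the query's latent (e.g., LSI or topic) representation.
    doc_vecs:  2-D array, one row per document, same latent dimensionality.
    Returns document indices sorted from best match to worst.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                 # cosine similarity of each document to the query
    return np.argsort(-sims)     # descending order of similarity
```

Because matching happens in the latent space, a document can rank highly even if it shares no literal words with the query, which is exactly the generalization behavior discussed above.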
However, while this type of generalization is broadly useful in terms of inference and prediction, there are situations where one can over-generalize. Consider trying to match the following query to a historical archive of news articles: election + campaign + Camejo. The query is intended to find documents that are about US presidential campaigns and also about Peter Camejo (who ran as
vice-presidential candidate alongside independent Ralph Nader in 2004). LSI and topic models are likely to highly rank articles that are related to presidential elections (even if they don’t necessarily contain the words election or campaign).
However, a potential problem is that the documents that are highly ranked by LSI or topic models need not include any mention of the name Camejo. The reason is that the combination of words in this query is likely to activate one or more latent variables related to the concept of presidential campaigns. However, once this generalization is made the model has “lost” the information about the specific word Camejo, and it will only show up in highly ranked documents if this word happens to occur frequently in these topics (unlikely in this case given that this candidate received relatively little media coverage compared to the coverage given to the candidates from the two main parties). But from the viewpoint of the original query, our preference would be to get documents that are about the general topic of US presidential elections with the specific constraint that they mention Peter Camejo.
Word-based retrieval techniques, such as the widely-used term-frequency inverse-document-frequency (TF-IDF) method, have the opposite problem in general. They tend to be overly specific in terms of matching words in the query to documents.
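For reference, a minimal TF-IDF scorer is sketched below. Many weighting variants exist; this one uses raw term frequency and logarithmic inverse document frequency, which is one common choice rather than the specific scheme any cited system uses.

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Score each document against a query with a simple TF-IDF scheme.

    query_terms: list of query word tokens.
    docs: list of documents, each a list of word tokens.
    Returns one score per document.
    """
    n_docs = len(docs)
    df = Counter()                       # document frequency of each word
    for doc in docs:
        for w in set(doc):
            df[w] += 1
    scores = []
    for doc in docs:
        tf = Counter(doc)                # term frequency within this document
        score = 0.0
        for w in query_terms:
            if df[w] > 0:
                score += tf[w] * math.log(n_docs / df[w])
        scores.append(score)
    return scores
```

Note that only documents containing the exact query words receive any credit: a document about presidential elections that never uses the literal word election scores zero, which illustrates the over-specificity described above.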
In general, of course, one would like to have a balance between generality and specificity. One ad hoc approach is to combine scores from a general method such as LSI with those from a more specific method such as TF-IDF in some manner, and indeed this technique has been proposed in information retrieval (Vogt and Cottrell, 1999). Similarly, in the ad hoc LDA approach (Wei and Croft, 2006), the LDA model is linearly combined with document-specific word distributions to capture both general as well as specific information in documents. However, neither method is entirely satisfactory, since it is not clear how to trade off generality and specificity in a principled way.
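A minimal sketch of such a score combination is given below. The min-max normalization and the mixing weight `lam` are our illustrative choices, not details taken from Vogt and Cottrell (1999) or Wei and Croft (2006); the point is that `lam` must be set by hand, which is precisely the unprincipled trade-off noted above.

```python
def blend_scores(latent_scores, word_scores, lam=0.5):
    """Linearly blend per-document latent-space scores (e.g., from LSI or LDA)
    with word-level scores (e.g., TF-IDF).

    Each score list is min-max normalized first so the two scales are
    comparable; lam in [0, 1] weights the latent-space component.
    """
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    a, b = norm(latent_scores), norm(word_scores)
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]
```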
The contribution of this paper is a new graphical model based on latent topics that handles the trade-off between generality and specificity in a fully probabilistic and automated manner. The model, which we call the special words with background (SWB) model, is an extension of the LDA model. The new model allows words in documents to be modeled as either originating from general topics, or from document-specific “special” word distributions, or from a corpus-wide background distribution. The idea is that words in a document such as election and campaign are likely to come from a general topic on presidential elections, whereas a name such as Camejo is much more likely to be treated as “non-topical” and specific to that document. Words in queries are automatically interpreted (in a probabilistic manner) as either being topical or special, in the context of each document, allowing for a data-driven document-specific trade-off between the benefits of topic-based abstraction and specific word matching. Daumé and Marcu (2006) independently proposed a probabilistic model using similar concepts for handling different training and test distributions in classification problems.
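The three word sources described above can be illustrated with the following generative sketch. This is our simplified reading of an SWB-style process, with variable names and a per-word three-way switch chosen for clarity; the paper's exact notation, priors, and parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_swb_document(n_words, lam, theta, phi, psi_d, omega, vocab):
    """Sample one document from a simplified SWB-style generative process.

    lam:   length-3 switch distribution for this document:
           P(topic route), P(special route), P(background route).
    theta: this document's distribution over T general topics.
    phi:   T x V array of topic-word distributions.
    psi_d: this document's special-word distribution over the vocabulary.
    omega: corpus-wide background word distribution.
    vocab: list of V word strings.
    """
    words = []
    for _ in range(n_words):
        route = rng.choice(3, p=lam)
        if route == 0:                              # general-topic route
            z = rng.choice(len(theta), p=theta)     # pick a topic
            w = rng.choice(len(vocab), p=phi[z])    # word from that topic
        elif route == 1:                            # document-specific route
            w = rng.choice(len(vocab), p=psi_d)
        else:                                       # background route
            w = rng.choice(len(vocab), p=omega)
        words.append(vocab[w])
    return words
```

Under this sketch, common function words would tend to be explained by the background route, election and campaign by a topic, and a rare name like Camejo by the document's special-word distribution.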
Although we have focused primarily on documents in information retrieval in the discussion above, the model we propose can in principle be used on any large sparse matrix of count data. For example, transaction data sets where rows are individuals and columns correspond to items purchased or Web sites visited are ideally suited to this approach. The latent topics can capture broad patterns of population behavior and the “special word distributions” can capture the idiosyncrasies of specific individuals.
Section 2 reviews the basic principles of the LDA model and introduces the new SWB model. Section 3 illustrates how the model works in practice using examples from New York Times news articles. In Section 4 we describe a number of experiments with 4 different document sets, including perplexity experiments and information retrieval experiments, illustrating the trade-offs between generalization and specificity for different models. Section 5 contains a brief discussion and concluding comments.
2 A Topic Model for Special Words
Figure 1(a) shows the graphical model for what we will refer to as the “standard topic model” or LDA. There are D documents and document d has Nd words. α and β are fixed parameters of symmetric Dirichlet priors for the D document-topic multinomials represented by θ and the T topic-word multinomials represented by φ. In the generative model, for each document d, the Nd words