Charles Isbell, Paul Viola
The task in text retrieval is to find the subset of a collection of documents relevant to a user's information request, usually expressed as a set of words. Classically, documents and queries are represented as vectors of word counts. In its simplest form, relevance is defined to be the dot product between a document and a query vector-a measure of the number of common terms. A central difficulty in text retrieval is that the presence or absence of a word is not sufficient to determine relevance to a query. Linear dimensionality reduction has been proposed as a tech(cid:173) nique for extracting underlying structure from the document collection. In some domains (such as vision) dimensionality reduction reduces computational com(cid:173) plexity. In text retrieval it is more often used to improve retrieval performance. We propose an alternative and novel technique that produces sparse represen(cid:173) tations constructed from sets of highly-related words. Documents and queries are represented by their distance to these sets, and relevance is measured by the number of common clusters. This technique significantly improves retrieval per(cid:173) formance, is efficient to compute and shares properties with the optimal linear projection operator and the independent components of documents.