{"title": "Restructuring Sparse High Dimensional Data for Effective Retrieval", "book": "Advances in Neural Information Processing Systems", "page_first": 480, "page_last": 486, "abstract": null, "full_text": "Restructuring Sparse High Dimensional Data for \n\nEffective Retrieval \n\nCharles Lee Isbell, Jr. \n\nAT&T Labs \n\n180 Park Avenue Room A255 \nFlorham Park, NJ 07932-0971 \n\nPaul Viola \n\nArtificial Intelligence Laboratory \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nAbstract \n\nThe task in text retrieval is to find the subset of a collection of documents relevant \nto a user's information request, usually expressed as a set of words. Classically, \ndocuments and queries are represented as vectors of word counts. In its simplest \nform, relevance is defined to be the dot product between a document and a query \nvector-a measure of the number of common terms. A central difficulty in text \nretrieval is that the presence or absence of a word is not sufficient to determine \nrelevance to a query. Linear dimensionality reduction has been proposed as a tech(cid:173)\nnique for extracting underlying structure from the document collection. In some \ndomains (such as vision) dimensionality reduction reduces computational com(cid:173)\nplexity. In text retrieval it is more often used to improve retrieval performance. \nWe propose an alternative and novel technique that produces sparse represen(cid:173)\ntations constructed from sets of highly-related words. Documents and queries \nare represented by their distance to these sets, and relevance is measured by the \nnumber of common clusters. This technique significantly improves retrieval per(cid:173)\nformance, is efficient to compute and shares properties with the optimal linear \nprojection operator and the independent components of documents. 
\n\n1 \n\nIntroduction \n\nThe task in text retrieval is to find the subset of a collection of documents relevant to a user's infor(cid:173)\nmation request, usually expressed as a set of words. Naturally, we would like to apply techniques \nfrom natural language understanding to this problem. Unfortunately, the sheer size of the data to be \nrepresented makes this difficult. We wish to process tens or hundreds of thousands of documents, \neach of which may contain hundreds of thousands of different words. It is clear that any useful \napproach must be time and space efficient. \n\nFollowing (Salton, 1971), we adopt a modified Vector Space Model (VSM) for document represen(cid:173)\ntation. A document is a vector where each dimension is a count of occurrences for a different word1 . \n\nlIn practice. suffixes are removed and counts are re-weighted by some function of their natural frequency \n\n\fRestructuring Sparse High Dimensional Data/or Effective Retrieval \n\n481 \n\nAfrica \n\nnational \n\nfootball \n\nMandala \n\nSouth \n\nleague \n\ncollege \n\nFigure 1: A Model of Word Generation. Independent topics give rise to specific words words \naccording an unknown probability distribution (Line thickness indicates the likelihood of generating \na word). \n\nA collection of documents is a matrix, D, where each column is a document vector di . Queries are \nsimilarly represented. \n\nWe propose a topic based model for the generation of words in documents. Each document is \ngenerated by the interaction of a set of independent hidden random variables called topics. When a \ntopic is active it causes words to appear in documents. Some words are very likely to be generated \nby a topic and others less so. Different topics may give rise to some of the same words. The final set \nof observed words results from a linear combination of topics. See Figure 1 for an example. \n\nIn this view of word generation, individual words are only weak indicators of underlying topics. 
\nOur task is to discover from data those collections of words that best predict the (unknown) under(cid:173)\nlying topics. The assumption that words are neither independent of one another or conditionally \nindependent of topics motivates our belief that this is possible. \n\nOur approach is to construct a set of linear operators which extract the independent topic structure of \ndocuments. We have explored different algorithms for discovering these operators include indepen(cid:173)\ndent components analysis (Bell and Sejnowski, 1995). The inferred topics are then used to represent \nand compare documents. \n\nBelow we describe our approach and contrast it with Latent Semantic Indexing (LSI), a technique \n. that also attempts to linearly transform the documents from \"word space\" into one more appropriate \nfor comparison (Hull, 1994; Deerwester et a\\., 1990). We show that the LSI transformation has very \ndifferent properties than the optimal linear transformation. We characterize some of these properties \nand derive an unsupervised method that searches for them. Finally, we present experiments demon(cid:173)\nstrating the robustness of this method and describe several computational and space advantages. \n\n2 The Vector Space Model and Latent Semantic Indexing \n\nThe similarity between two documents using the VSM model is their inner product, dT dj . Queries \nare just short documents, so the relevance of documents to a query, q, is DT q. There are several \nadvantages to this approach beyond its mathematical simplicity. Above all, it is efficient to compute \nand store the word counts. While the word-document matrix has a very large number of potential \nentries, most documents do not contain very many of the possible words, so it is sparsely populated. 
\nThus, algorithms for manipulating the matrix only require space and time proportional to the average \nnumber of different words that appear in a document, a number likely to be much smaller than the \nfull dimensionality of the document matrix (in practice, non-zero elements represent about 2% of \nthe total number of elements). Nevertheless, VSM makes an important tradeoff by sacrificing a great \ndeal of document structure, losing context that may disambiguate meaning. \n\nAny text retrieval system must overcome the fundamental difficulty that the presence or absence \nof a word is insufficient to determine relevance. This is due to two intrinsic problems of natural \n\n(Frakes and Baeza-Yates, 1992). We incorporate these methods; however, such details are unimportant for this \ndiscussion. \n\n\f482 \n\nC. L. Isbell and P. Viola \n\nlanguage: synonymy and polysemy. Synonymy refers to the fact that a single underlying concept \ncan be represented by many different words (e.g. \"car\" and \"automobile\" refer to the same class \nof objects). Polysemy refers to the fact that a single word can refer to more than one underlying \nconcept (e.g. \"apple\" is both a fruit and a computer company). Synonymy results in false negatives \nand polysemy results in false positives. \n\nLatent semantic indexing is one proposal for addressing this problem. LSI constructs a smaller \ndocument matrix that retains only the most important information from the original, by using the \nSingular Value Decomposition (SVD). Briefly, the SVD of a matrix Dis: U SV T where U and V \ncontain orthogonal vectors and S is diagonal (see (Golub and Loan, 1993) for further properties and \nalgorithms). Note that the co-occurrence matrix, DDT, can be written as U S2UT ; U contains the \neigenvectors of the co-occurrence matrix while the diagonal elements of S (referred to as singular \nvalues) contain the square roots of their corresponding eigenvalues. 
The eigenvectors with the largest \neigenvalues capture the axes of largest variation in the data. \nIn LSI, each document is projected into a lower dimensional space b = SkI (If D where Sk and Uk \nwhich contain only the largest k singular values and the corresponding eigenvectors, respectively. \nThe resulting document matrix is of smaller size but still provably represents the most variation in the \noriginal matrix. Thus, LSI represents documents as linear combinations of orthogonal features. It is \nhoped that these features represent meaningful underlying \"topics\" present in the collection. Queries \nare also projected into this space, so the relevance of documents to a query is DTUkSk2UI q. \n\nThis type of dimensionality reduction is very similar to principal components analysis (peA), which \nhas been used in other domains, including visual object recognition (Turk and Pentland, 1991). In \npractice, there is some evidence to suggest that LSI can improve retrieval performance; however, it \nis often the case that LSI improves text retrieval performance by only a small amount or not at all \n(see (Hull, 1994) and (Deerwester et aI., 1990) for a discussion). \n\n3 Do Optimal Projections for Retrieval Exist? \n\nHypotheses abound for the success of LSI, including: i) LSI removes noise from the document \nset; ii) LSI finds words that are synonyms; iii) LSI finds clusters of documents. Whatever it does, \nLSI operates without knowledge of the queries that will be presented to the system. We could \ninstead attempt a supervised approach, searching for a matrix P such that DT P pT q results in large \nvalues for documents in D that are known to be relevant for a particular query, q. The choice for \nthe structure of P embodies assumptions about the structure of D and q and what it means for \ndocuments and queries to be related. \nFor example, imagine that we are given a collection of documents, D, and queries, Q. 
For each query we are told which documents are relevant. We can use this information to construct an optimal P such that D^T P P^T Q ≈ R, where R_ij equals 1 if document i is relevant to query j, and 0 otherwise. We find P in two steps. First we find an X minimizing ||D^T X Q - R||_F, where || . ||_F denotes the Frobenius norm of a matrix2 . Second, we find P by decomposing X into P P^T. Unfortunately, this may not be simple. The matrix P P^T has properties that are not necessarily shared by X. In particular, while P P^T is symmetric, there is no guarantee that X will be (in our experiments X is far from symmetric). We can however take the SVD of X = U_x S_x V_x^T, using the matrix U_x to project the documents and V_x to project the queries. \n\nWe can now compare LSI's projection axes, U, with the optimal U_x computed as above. One measure of comparison is the distribution of documents as projected onto these axes. Figure 2a shows the distribution of Medline documents3 projected onto the first axis of U_x. Notice that there is a large spike near zero, and a well-separated outlier spike. The kurtosis of this distribution is 44. Subsequent axes of U_x result in similar distributions. \n\n2 First find M that minimizes ||D^T M - R||_F; X is the matrix that minimizes ||X Q - M||_F. \n3 Medline is a small test collection, consisting of 1033 documents and about 8500 distinct words. We have found similar results for other, larger collections. \n\nFigure 2: (A). The distribution of Medline documents projected onto one of the \"optimal\" axes. The kurtosis of this distribution is 44. (B). The distribution of Medline documents projected onto one of the LSI axes. The kurtosis of this distribution is 6.9. (C). The distribution of Medline documents projected onto one of the ICA axes. The kurtosis of this distribution is 60. 
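The two least-squares steps and the SVD split can be sketched as follows. The matrices here are small random stand-ins (the real D, Q, and R would come from a collection such as Medline), so only the shapes and the procedure carry over:

```python
import numpy as np

rng = np.random.default_rng(1)
n_words, n_docs, n_queries = 40, 25, 6

# Random stand-ins for word-by-document counts, word-by-query counts,
# and the 0/1 document-query relevance matrix R.
D = rng.poisson(0.3, (n_words, n_docs)).astype(float)
Q = rng.poisson(0.3, (n_words, n_queries)).astype(float)
R = (rng.random((n_docs, n_queries)) < 0.1).astype(float)

# Step 1: find M minimizing ||D^T M - R||_F.
M, *_ = np.linalg.lstsq(D.T, R, rcond=None)
# Step 2: find X minimizing ||X Q - M||_F (solve the transposed system).
X = np.linalg.lstsq(Q.T, M.T, rcond=None)[0].T

# X is generally not symmetric, so split it with an SVD: X = Ux Sx Vx^T.
Ux, Sx, VxT = np.linalg.svd(X)

# Kurtosis of documents projected onto the first 'optimal' axis.
proj = D.T @ Ux[:, 0]
kurt = np.mean((proj - proj.mean()) ** 4) / proj.var() ** 2
```

With real relevance judgments, it is this kurtosis that comes out large (44 on Medline), reflecting a spike near zero plus a well-separated outlier spike.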
We might hope that these axes each represent a topic shared by a few documents. Figure 2b shows the distribution of documents projected onto the first LSI axis. This axis yields a distribution with a much lower kurtosis of 6.9 (a normal distribution has kurtosis 3). It induces a distribution that looks nothing like a cluster: there is a smooth continuum of values. Similar distributions result for many of the first 100 axes. \n\nThese results suggest that LSI-like approaches may well be searching for projections that are suboptimal. In the next section, we describe an algorithm designed to find projections that look more like those in Figure 2a than in Figure 2b. \n\n4 Topic Centered Representations \n\nThere are several problems with the \"optimal\" approach described in the previous section. Aside from its completely supervised nature, there may be a problem of over-fitting: the number of parameters in X (the number of words squared) can be large compared to the number of documents and queries. It is not clear how to move towards a solution that will likely have low generalization error, our ultimate goal. Further, computing X is expensive, involving several full-rank singular value decompositions. \n\nOn the other hand, while we may not be able to take advantage of supervision, it seems reasonable to search for projections like those in Figure 2a. There are several unsupervised techniques we might use. We begin with independent component analysis (Bell and Sejnowski, 1995), a technique that has recently gained popularity. Extensions such as (Amari, Cichocki and Yang, 1996) have made the algorithm more efficient and robust. \n\n4.1 What are the Independent Components of Documents? \n\nFigure 2c shows the distribution of Medline documents along one of the ICA axes (kurtosis 60). It is representative of other axes found for that collection, and for other, larger collections. 
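The experiment behind these ICA axes can be sketched on synthetic data. The paper uses the infomax ICA of Bell and Sejnowski with the Amari, Cichocki and Yang extension; scikit-learn's FastICA is used below purely as a convenient stand-in, and the planted word groups are invented:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(2)
n_words, n_docs, n_topics = 30, 200, 4

# Synthetic counts with planted co-occurring word groups; FastICA stands
# in for the infomax ICA algorithm used in the paper.
word_probs = rng.random((n_topics, n_words)) * (rng.random((n_topics, n_words)) < 0.2)
active = (rng.random((n_docs, n_topics)) < 0.3).astype(float)
D = (active @ word_probs).T + 0.01 * rng.random((n_words, n_docs))

ica = FastICA(n_components=n_topics, random_state=0, max_iter=1000)
doc_coords = ica.fit_transform(D.T)            # documents in ICA space
word_coords = ica.transform(np.eye(n_words))   # how each axis groups words

def kurtosis(x):
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2

axis_kurt = [kurtosis(doc_coords[:, i]) for i in range(n_topics)]
```

Projecting the identity matrix, as in `word_coords`, is the trick used below to ask how each axis distributes individual words rather than documents.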
\n\nLike the optimal axes found earlier, this axis also separates documents. This is desirable because \nit means that the axes are distinguishing groups of (presumably related) documents. Still, we can \nask a more interesting question; namely, how do these axes group words? Rather than project our \ndocuments onto the ICA space, we can project individual words (this amounts to projecting the \nidentity matrix onto that space) and observe how ICA redistributes them. \n\nFigure 3 shows a typical distribution of all the words along one of the axes found by ICA on the \n\n\f484 \n\nC. L. Isbell and P Viola \n\nanc \ntransition \nmandela \n\ncontinent \nelite \nethiopia \n\nafrica \napartheid \n\nsaharan \nL \n., --o=7S~~-Q.'-5 - -Q\u00b7~.2~5 -!----=-o.2=5-~O.'~~O.7S \n\nP~V.1uM \n\nFigure 3: The distribution of words with large magnitude along an leA axis from the White House \ncollection. \n\nleA induces a highly kurtotic distribution over the words. It is also \nWhite House collection.4 \nquite sparse: most words have a value very close to zero. The histogram shows only the words \nlarge values, both positive and negative. One group of words is made up of highly-related words; \nnamely, \"africa,\" \"apartheid,\" and \"man del a.\" The other is made up of words that have no obvious \nrelationship to one another. In fact, these words are not directly related, but each co-occurs with \ndifferent individual words in the first group. For example, \"saharan\" and \"africa\" occur together \nmany times, but not in the context of apartheid and South Africa; rather, in documents concerning \nUS policy toward Africa in general. As it so happens, \"saharan\" acts as a discriminating word for \nthese subtopics. \n\n4.2 Topic Centered Representations \n\nIt appears that leA is finding a set of words, S, that selects for related documents, H, along with \nanother set of words, T, whose elements do not select for H, but co-occur with elements of S. 
\nIntuitively, S selects for documents in a general subject area, and T removes a specific subset of \nthose documents, leaving a small set of highly related documents. This suggests a straightforward \nalgorithm to achieve the same goal directly: \n\nforeach topic, Ck , you wish to define: \n-Choose a source document de from D \n-Let b be the documents of D sorted by similarity to de \n-Divide b into into three groups: \n\nthose assumed to be relevant, \n\nthose assumed to be completely irrelevant, \nand those assumed to be weakly relevant. \n\n-Let Gk , Bk, and Afk be the centroid of each respective group \n-Let C k = f(G k - Bk) -\n\nf(Afk _ Gk) \n\nwhere f(x) = max(x,O). \n\nThe three groups of documents are used to drive the discovery of two sets of words. One set selects \nfor documents in a general topic area by finding the set of words that distinguish the relevant doc(cid:173)\numents from documents in general, a form of global clustering. The other set of words distinguish \nthe weakly-related documents from the relevant documents. Assigning them negative weight results \nin their removal. This leaves only a set of closely related documents. This local clustering approach \nis similar to an unsupervised version of Rocchio with Query Zoning (Singhal, 1997). \n\n4The White House collection contains transcripts of press releases and press conferences from 1993. There \n\nare 1585 documents and 18675 distinct words. \n\n\fRestructuring Sparse High Dimensional Data for Effective Retrieval \n\n485 \n\n0 7 \n\n06 \n\n05 \n\nI' \n\nc:: \n0 o. \n'Cij \n'0 \n~ 0 3 .... \\ \na.. \n\n0 2 \n\n01 1 \nI \n'. \n\n0 \n0 \n\n, , \n'\" \"'--\n\n\\ '-, \n--\n\n01 \n\n02 \n\n03 \n\nBaseline \nLSI \nDocuments as Clisters \nRelevanl Documents as Clusters \nICA \nTopIc Clustenng \n\n06 \n\n07 \n\n08 \n\nO. 
\n\n, \n\nO' \n\n------\n\n0 5 \n\nRecall \n\nFigure 4: A comparison of different algorithms on the Wall Street Journal \n\n5 Experiments \n\nIn this section, we show results of experiments with the Wall Street Journal collection. It con(cid:173)\ntains 42,652 documents and 89757 words. Following convention, we measure the success of a text \nretrieval system using precision-recall curves5 . Figure 4 illustrates the performance of several algo(cid:173)\nrithms: \n\n1. Baseline: the standard inner product measure, DT q. \n\n2. LSI: Latent Semantic Indexing. \n\n3. - Jeuments as Clusters: each document is a projection axis. This is equivalent to a modified \n\ninner product measure, DT DDT q. \n\n4. Relevant Documents as Clusters: In order to simulate psuedo-relevance feedback, we use \n\nthe centroid of the top few documents returned by the D T q similarity measure. \n\n5. ICA: Independent Component Analysis. \n\n6. Topic Clustering: The algorithm described in Section 4.2. \n\nIn this graph, we restrict queries to those that have at least fifty relevant documents. The topic \nclustering approach and ICA perform best, maintaining higher average precision over all ranges. \nUnlike smaller collections such as Medline, documents from this collection do not tend to cluster \naround the queries naturally. As a result, the baseline inner product measure performs poorly. Other \nclustering techniques that tend to work well on collections such as Medline perform even worse. \nFinally, LSI does not perform well. \n\nFigure 5 illustrates different approaches on subsets of Wall Street Journal queries. In general, as \neach query has more and more relevant documents, overall performance improves. In particular, \nthe simple clustering scheme using only relevant documents performs very well. Nonetheless, our \napproach improves upon this standard technique with minimal additional computation. 
\n\n5When asked to return n documents precision is the percentage of those which are rei avant. Recall is the \n\npercentage of the total relevant documents which are returned. \n\n\f486 \n\nC. L. Isbell and P Viola \n\nI ~ \n\nr,' \nO Tr, ;' \nOl~ \n\n\\. , \n\n~ , \n~ \n\n~'~ \n\n\"I \" \n~.~ o ~ \n\"r \n\n. \n\n\" \n\n0 T \n\nO. \n\nOlt \n\nRocall \n\nRoc \u2022 \u2022 \n\nFigure 5: (A). Performance of various clustering techniques for those queries with more than 75 \nrelevant documents . (B). Performance for those queries with more than 100 relevant documents. \n\n6 Discussion \n\nWe have described typical dimension reduction techniques used in text retrieval and shown that \nthese techniques make strong assumptions about the form of projection axes. We have character(cid:173)\nized another set of assumptions and derived an algorithm that enjoys significant computational and \nspace advantages. Further, we have described experiments that suggest that this approach is robust. \nFinally, much of what we have described here is not specific to text retrieval. Hopefully, similar \ncharacterizations will apply to other sparse high-dimensional domains. \n\nReferences \n\nAmari , S., Cichocki, A., and Yang, H. (1996). A new learning algorithm for blind source separation. In \n\nAdvances in Neural Information Processing Systems. \n\nBell, A. and Sejnowski, T. (1995). An information-maximizaton approach to blind source separation and blind \n\ndeconvolution. Neural Computation, 7: 1129-1159. \n\nDeerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. w., and Harshman, R. A. (1990). Indexing by latent \n\nsemantic analysis. Journal of the Society for Information Science, 41 (6):391-407. \n\nFrakes, W. B. and Baeza-Yates, R., editors (1992). Information Retrieval: Data Structures and Algorithms. \n\nPrentice-Hall. \n\nGolub, G. H. and Loan, C. F. V. (1993). Matrix Computations. The Johns Hopkins University Press. \nHull, D. (1994). 
Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th ACM SIGIR Conference, pages 282-290. \n\nKwok, K. L. (1996). A new method of weighting query terms for ad-hoc retrieval. In Proceedings of the 19th ACM SIGIR Conference, pages 187-195. \n\nO'Brien, G. W. (1994). Information management tools for updating an SVD-encoded indexing scheme. Technical Report UT-CS-94-259, University of Tennessee. \n\nSahami, M., Hearst, M., and Saund, E. (1996). Applying the multiple cause mixture model to text categorization. In Proceedings of the 13th International Machine Learning Conference. \n\nSalton, G., editor (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall. \n\nSinghal, A. (1997). Learning routing queries in a query zone. In Proceedings of the 20th International Conference on Research and Development in Information Retrieval. \n\nTurk, M. A. and Pentland, A. P. (1991). Face recognition using eigenfaces. In IEEE Conference on Computer Vision and Pattern Recognition, pages 586-591. \n", "award": [], "sourceid": 1597, "authors": [{"given_name": "Charles", "family_name": "Isbell", "institution": null}, {"given_name": "Paul", "family_name": "Viola", "institution": null}]}