{"title": "Euclidean Embedding of Co-Occurrence Data", "book": "Advances in Neural Information Processing Systems", "page_first": 497, "page_last": 504, "abstract": null, "full_text": " Euclidean Embedding of Co-occurrence Data\n\n\n\n Amir Globerson1 Gal Chechik2 Fernando Pereira3 Naftali Tishby1\n 1 School of computer Science and Engineering,\n Interdisciplinary Center for Neural Computation\n The Hebrew University Jerusalem, 91904, Israel\n 2 Computer Science Department, Stanford University, Stanford, CA 94305, USA\n 3 Department of Computer and Information Science,\n University of Pennsylvania, Philadelphia, PA 19104, USA\n\n\n\n Abstract\n\n Embedding algorithms search for low dimensional structure in complex\n data, but most algorithms only handle objects of a single type for which\n pairwise distances are specified. This paper describes a method for em-\n bedding objects of different types, such as images and text, into a single\n common Euclidean space based on their co-occurrence statistics. The\n joint distributions are modeled as exponentials of Euclidean distances in\n the low-dimensional embedding space, which links the problem to con-\n vex optimization over positive semidefinite matrices. The local struc-\n ture of our embedding corresponds to the statistical correlations via ran-\n dom walks in the Euclidean space. We quantify the performance of our\n method on two text datasets, and show that it consistently and signifi-\n cantly outperforms standard methods of statistical correspondence mod-\n eling, such as multidimensional scaling and correspondence analysis.\n\n\n\n1 Introduction\n\nEmbeddings of objects in a low-dimensional space are an important tool in unsupervised\nlearning and in preprocessing data for supervised learning algorithms. They are especially\nvaluable for exploratory data analysis and visualization by providing easily interpretable\nrepresentations of the relationships among objects. 
Most current embedding techniques build low-dimensional mappings that preserve certain relationships among objects, and differ in the relationships they choose to preserve, which range from pairwise distances in multidimensional scaling (MDS) [4] to neighborhood structure in locally linear embedding [12]. All these methods operate on objects of a single type endowed with a measure of similarity or dissimilarity.

However, real-world data often involve objects of several very different types without a natural measure of similarity. For example, typical web pages or scientific papers contain varied data types such as text, diagrams, images, and equations. A measure of similarity between words and pictures is difficult to define objectively. Defining a useful measure of similarity is difficult even for some homogeneous data types, such as pictures or sounds, where the physical properties (pitch and frequency in sounds, color and luminosity distribution in images) do not directly reflect the semantic properties we are interested in.

The current paper addresses this problem by creating embeddings from statistical associations. The idea is to find a low-dimensional Euclidean embedding that represents the empirical co-occurrence statistics of two variables. We focus on modeling the conditional probability of one variable given the other, since in the data we analyze (documents and words, authors and terms) there is a clear asymmetry which suggests a conditional model. Joint models based on similar principles can be devised in a similar fashion, and may be more appropriate for symmetric data. We name our method CODE, for Co-Occurrence Data Embedding.

Our cognitive notions are often built through statistical associations between different information sources. Here we assume that those associations can be represented in a low-dimensional space. 
For example, pictures which frequently appear with a given text are expected to have some common, locally low-dimensional characteristic that allows them to be mapped to adjacent points. We can thus rely on co-occurrences to embed different entity types, such as words and pictures, or genes and expression arrays, into the same subspace. Once this embedding is achieved, it also naturally defines a measure of similarity between entities of the same kind (such as images), induced by their other corresponding modality (such as text), providing a meaningful similarity measure between images.

Embedding of heterogeneous objects is performed in statistics using correspondence analysis (CA), a variant of canonical correlation analysis for count data [8]. These methods are related to Euclidean distances when the embeddings are constrained to be normalized. However, as we show below, removing this constraint has great benefits for real data. Statistical embedding of same-type objects was recently studied by Hinton and Roweis [9]. Their approach is similar to ours in that it assumes that distances induce probabilistic relations between objects. However, we do not assume that distances are given in advance; instead, we derive them from the empirical co-occurrence data. The Parametric Embedding method [11], which also appears in the current proceedings, is formally similar to our method but is used in the setting of supervised classification.

2 Problem Formulation

Let X and Y be two categorical variables with an empirical distribution p̄(x, y). No additional assumptions are made on the values of X and Y or their relationships. We wish to model the statistical dependence between X and Y through an intermediate Euclidean space ℝᵈ and mappings φ : X → ℝᵈ and ψ : Y → ℝᵈ. 
These mappings should reflect the dependence between X and Y in the sense that the distance between each φ(x) and ψ(y) determines their co-occurrence statistics.

We focus in this manuscript on modeling the conditional distribution p(y|x)¹, and define a model which relates conditional probabilities to distances by

    p(y|x) = (p̄(y) / Z(x)) exp(−d²(x,y))    ∀x ∈ X, y ∈ Y    (1)

where d²(x,y) ≡ ‖φ(x) − ψ(y)‖² = Σₖ (φₖ(x) − ψₖ(y))² is the squared Euclidean distance between φ(x) and ψ(y), and Z(x) is the partition function for each value of x. This partition function equals Z(x) = Σᵧ p̄(y) exp(−d²(x,y)) and is thus the empirical mean of the exponentiated distances from x (therefore Z(x) ≤ 1).

This model directly relates the ratio p(y|x)/p̄(y) to the distance between the embedded x and y. The ratio decays exponentially with the distance, so for any x, a closer y will have a higher interaction ratio. As a result of the fast decay, the closest objects dominate the distribution. The model of Eq. (1) can also be described as the result of a random walk in the low-dimensional space, illustrated in Figure 1. When y has a uniform marginal, the probability p(y|x) corresponds to a random walk from x to y, with transition probability inversely related to distance.

¹We have studied several other models of the joint rather than the conditional distribution. These differ in the way the marginals are modeled and will be described elsewhere.

Figure 1: Embedding of X and Y into the same d-dimensional space.

We now turn to the task of learning φ, ψ from an empirical distribution p̄(x, y). It is natural in this case to maximize the likelihood (up to constants depending on p̄(y))

    max_{φ,ψ} l(φ, ψ) = −Σ_{x,y} p̄(x, y) d²(x,y) − Σₓ p̄(x) log Z(x) ,    (2)

where p̄(x, y) denotes the empirical distribution over X, Y. As in other cases, maximizing the likelihood is also equivalent to minimizing the KL divergence between the empirical distribution and the model's distribution. 
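As a concrete illustration (not part of the original paper), the model of Eq. (1) and the likelihood of Eq. (2) can be sketched in NumPy; the toy distribution and embeddings below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, d = 5, 7, 2

# Toy empirical joint distribution p-bar(x, y) and its marginals
p_xy = rng.random((nx, ny))
p_xy /= p_xy.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# Random embeddings phi(x), psi(y) in R^d
phi = rng.normal(size=(nx, d))
psi = rng.normal(size=(ny, d))

# Squared distances d2[x, y] = |phi(x) - psi(y)|^2
d2 = ((phi[:, None, :] - psi[None, :, :]) ** 2).sum(-1)

# Eq. (1): p(y|x) = p-bar(y) exp(-d2) / Z(x), with Z(x) = sum_y p-bar(y) exp(-d2)
Z = (p_y[None, :] * np.exp(-d2)).sum(axis=1)
p_y_given_x = p_y[None, :] * np.exp(-d2) / Z[:, None]
assert np.allclose(p_y_given_x.sum(axis=1), 1.0)  # valid conditionals
assert np.all(Z <= 1.0)                           # Z(x) is a mean of exp(-d2) <= 1

# Eq. (2): log-likelihood, up to constants depending on p-bar(y)
loglik = -(p_xy * d2).sum() - (p_x * np.log(Z)).sum()
```

A full implementation would maximize `loglik` over `phi` and `psi` (e.g., by conjugate gradient ascent, as the paper does).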
The likelihood is composed of two terms. The first is (minus) the mean distance between x and y, and on its own would be maximized when all distances are zero. This trivial solution is avoided because of the regularization term Σₓ p̄(x) log Z(x), which acts to increase the distances between x and y points. The next section discusses the relation of this objective to that of canonical correlation analysis [10].

To characterize the maxima of the likelihood we differentiate it with respect to the embeddings of individual objects φ(x), ψ(y), and obtain the following gradients

    ∂l/∂φ(x) = 2 p̄(x) ( ⟨ψ(y)⟩_{p̄(y|x)} − ⟨ψ(y)⟩_{p(y|x)} )    (3)
    ∂l/∂ψ(y) = 2 p(y) ( ψ(y) − ⟨φ(x)⟩_{p(x|y)} ) − 2 p̄(y) ( ψ(y) − ⟨φ(x)⟩_{p̄(x|y)} ) ,

where p(y) = Σₓ p(y|x) p̄(x).

Equating these gradients to zero, the φ(x) gradient yields ⟨ψ(y)⟩_{p(y|x)} = ⟨ψ(y)⟩_{p̄(y|x)}. This characterization is similar to the one seen in maximum entropy learning. Since p(y|x) will have significant values for Y values such that ψ(y) is close to φ(x), this condition implies that the expected location of a neighbor of φ(x) is the same under the empirical and model distributions.

To find the optimal φ, ψ for a given embedding dimension d, we used a conjugate gradient ascent algorithm with random restarts. In Section 4 we describe a different approach to this optimization problem.

3 Relation to Other Methods

Embedding the rows and columns of a contingency table into a low-dimensional Euclidean space is related to statistical methods for the analysis of heterogeneous data. Fisher [6] described a method for mapping X and Y into φ(x), ψ(y) such that the correlation coefficient between φ(x) and ψ(y) is maximized. His method is in fact the discrete analogue of the more widely known canonical correlation analysis (CCA) [10]. 
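The correlation-distance identity of Eq. (4) below can be verified numerically: for one-dimensional embeddings that are centered and of unit variance under the empirical marginals, the correlation coefficient equals one minus half the mean squared distance. A toy NumPy check (all values illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny = 6, 8

# Toy empirical joint distribution and marginals
p_xy = rng.random((nx, ny))
p_xy /= p_xy.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# Random 1-d embeddings, centered and scaled to unit variance
# under the marginals p_x, p_y
phi = rng.normal(size=nx)
phi = phi - p_x @ phi
phi = phi / np.sqrt(p_x @ phi**2)
psi = rng.normal(size=ny)
psi = psi - p_y @ psi
psi = psi / np.sqrt(p_y @ psi**2)

# Correlation coefficient and mean squared distance
rho = (p_xy * np.outer(phi, psi)).sum()
d2 = (phi[:, None] - psi[None, :]) ** 2
mean_d2 = (p_xy * d2).sum()

# Eq. (4): rho = 1 - (1/2) * mean squared distance
assert np.isclose(rho, 1.0 - 0.5 * mean_d2)
```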
Another closely related method is correspondence analysis [8], which uses a different normalization scheme and aims to model χ² distances between the rows and columns of p̄(x, y).

The goal of all the above methods is to maximize the correlation coefficient between the embeddings of X and Y. We now discuss their relation to our distance-based method. First, note that the correlation coefficient is invariant under affine transformations, so we can focus on centered solutions with unit covariance (⟨φ(x)⟩ = ⟨ψ(y)⟩ = 0 and COV(φ(x)) = COV(ψ(y)) = I). In this case, the correlation coefficient is given by the following expression (we focus on d = 1 for simplicity)

    ρ(φ(x), ψ(y)) = Σ_{x,y} p̄(x, y) φ(x) ψ(y) = −(1/2) Σ_{x,y} p̄(x, y) d²(x,y) + 1 .    (4)

Maximizing the correlation is therefore equivalent to minimizing the mean distance across all pairs. This clarifies the relation between CCA and our method: both aim to minimize the average distance between the X and Y embeddings. However, CCA forces both embeddings to be centered with unit covariance, whereas our method introduces a global regularization term related to the partition function.

Our method is additionally related to exponential models of contingency tables, where the counts are approximated by a normalized exponent of a low-rank matrix [7]. The current approach can be understood as a constrained version of these models, where the expression in the exponent is constrained to have a geometric interpretation.

A well-known geometrically oriented embedding method is multidimensional scaling (MDS) [4], whose standard version applies to same-type objects with predefined distances. MDS embedding of heterogeneous entities was studied in the context of modeling ranking data (see [4], Section 7.3). 
These models, however, focus on specific properties of ordinal data and therefore lead to optimization principles and algorithms different from our probabilistic interpretation.

4 Semidefinite Representation

The optimal embeddings φ, ψ may be found using unconstrained optimization techniques. However, the Euclidean distances used in the embedding space also allow us to reformulate the problem as constrained convex optimization over the cone of positive semidefinite (PSD) matrices [14].

We start by showing that for embeddings of dimension d = |X| + |Y|, maximizing (2) is equivalent to minimizing a certain convex non-linear function over PSD matrices. Consider the matrix A whose columns are all the embedded vectors φ and ψ. The matrix G ≡ AᵀA is the Gram matrix of dot products between embedding vectors. It is thus a symmetric PSD matrix of rank ≤ d. The converse is also true: any PSD matrix of rank ≤ d can be factorized as AᵀA, where A is an embedding matrix of dimension d. The squared distance between two columns i, j of A is linearly related to the Gram matrix via d²ᵢⱼ = gᵢᵢ + gⱼⱼ − 2gᵢⱼ.

Since the likelihood function depends only on the distances between points in X and in Y, we can write the optimization problem in (2) as

    min_G  Σₓ p̄(x) log Σᵧ p̄(y) exp(−d²ₓᵧ) + Σ_{x,y} p̄(x, y) d²ₓᵧ    (5)
    subject to  G ⪰ 0 ,  rank(G) ≤ d ,  d²ₓᵧ = gₓₓ + gᵧᵧ − 2gₓᵧ

where gₓᵧ denotes the element of G corresponding to the specific values x, y.

Thus, our problem is equivalent to optimizing a nonlinear objective over the set of PSD matrices of constrained rank. The minimized function is convex, since it is the sum of a linear function of G and log-sum-exp functions of affine expressions in G, which are also convex (see the geometric programming section in [2]). Moreover, when G has full rank, the set of constraints is also convex. We conclude that when the embedding dimension is d = |X| + |Y|, the optimization problem of Eq. (5) is convex. 
Thus there are no local minima, and solutions can be found efficiently.

The PSD formulation also allows us to add non-trivial constraints. Consider, for example, constraining the p(y) marginal to its empirical values, i.e., Σₓ p(y|x) p̄(x) = p̄(y). To introduce this as a convex constraint we take two steps. First, we relax the constraint that the distributions normalize to one, and require only that they normalize to at most one. This is achieved by replacing log Z(x) with a free variable a(x) and writing the problem as follows (we omit the dependence of d²ₓᵧ on G for brevity)

    min_{G,a}  Σₓ p̄(x) a(x) + Σ_{x,y} p̄(x, y) d²ₓᵧ    (6)
    subject to  G ⪰ 0 ,  rank(G) ≤ d ,  log Σᵧ p̄(y) exp(−d²ₓᵧ − a(x)) ≤ 0  ∀x

It can be shown that the optimum of Eq. (6) is obtained for solutions normalized to one, and it thus coincides with the optimum of Eq. (5). The constraint Σₓ p(y|x) p̄(x) = p̄(y) can now be relaxed to the inequality Σₓ p̄(y) p̄(x) exp(−d²ₓᵧ − a(x)) ≥ p̄(y), which defines a convex set. Again, the optimum is obtained when the constraint is satisfied with equality.

Embedding into a low dimension requires constraining the rank, but this is difficult since the problem is then no longer convex in the general case. One approach to obtaining low-rank solutions is to optimize over a full-rank G and then project it to a lower dimension via spectral decomposition, as in [14] or classical MDS. However, in the current problem this was found to be ineffective. Instead, we penalize high-rank solutions by adding the trace of G [5], weighted by a positive factor, to the objective function in (5). Small values of Tr(G) are expected to correspond to sparse eigenvalue sets and thus penalize high-rank solutions. This approach was tested on subsets of the databases described in Section 5 and yielded results similar to those of the gradient-based algorithm. 
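The Gram-matrix relations underlying this formulation can be checked numerically. The sketch below (illustrative, not from the paper) verifies that d²ᵢⱼ = gᵢᵢ + gⱼⱼ − 2gᵢⱼ reproduces the squared Euclidean distances, that G is PSD, and that Tr(G) equals the sum of its eigenvalues, which is what the trace penalty acts on:

```python
import numpy as np

rng = np.random.default_rng(2)
nx, ny, d = 4, 5, 3

# Columns of A are all embedded vectors: phi(x) for x in X, then psi(y) for y in Y
A = rng.normal(size=(d, nx + ny))
G = A.T @ A  # Gram matrix: symmetric, PSD, rank <= d

# d2_ij = g_ii + g_jj - 2 g_ij recovers squared Euclidean distances
gdiag = np.diag(G)
D2 = gdiag[:, None] + gdiag[None, :] - 2 * G
D2_direct = ((A.T[:, None, :] - A.T[None, :, :]) ** 2).sum(-1)
assert np.allclose(D2, D2_direct)

# G is PSD: all eigenvalues nonnegative (up to numerical tolerance)
eigs = np.linalg.eigvalsh(G)
assert eigs.min() > -1e-10

# The rank penalty adds a multiple of Tr(G) = sum of eigenvalues,
# encouraging sparse eigenvalue sets and hence low-rank solutions
assert np.isclose(np.trace(G), eigs.sum())
```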
We believe that PSD algorithms may turn out to be more efficient in cases where relatively high-dimensional embeddings are sought. Furthermore, under the PSD formulation it is easy to introduce additional constraints, for example on distances between subsets of points (as in [14]), and on marginals of the distribution.

5 Applications

We tested our approach on a variety of applications. Here we present embeddings of words and documents, and of authors and documents. To provide a quantitative assessment of the performance of our method that goes beyond visual inspection, we apply it to problems where some underlying structures are known in advance. The known structures are only used for performance measurement and not during learning.

Figure 2: CODE embedding of 2483 documents and 2000 words from the NIPS database (the 2000 most frequent words, excluding the first 100, were used). The left panel (a) shows document embeddings for NIPS 15-17, with colors indicating the document topic. The other panels show embedded words and documents for the areas marked by rectangles in (a). Figure (b) shows the border region between algorithms and architecture (AA) and learning theory (LT) (bottom rectangle in (a)). Figure (c) shows the border region between neuroscience (NS) and biological vision (VB) (upper rectangle in (a)). 
Figure (d) shows mainly control and navigation (CN) documents (left rectangle in (a)).

Figure 3: CODE embedding of 2000 words and 250 authors from the NIPS database (the 250 authors with highest word counts were chosen; words were selected as in Figure 2). The left panel shows embeddings for authors (red crosses) and words (blue dots). The other panels show embedded authors (only the first 100 shown) and words for the areas marked by rectangles. They can be seen to correspond to learning theory, control, and neuroscience (from left to right).

5.1 NIPS Database

Embedding algorithms may be used to study the structure of document databases. Here we used the NIPS 0-12 database supplied by Roweis², augmented with data from NIPS volumes 13-17³. The last three volumes also contain an indicator of the document's topic (AA for algorithms and architecture, LT for learning theory, NS for neuroscience, etc.).

We first used CODE to embed documents and words into ℝ². The results are shown in Figure 2. It can be seen that documents with similar topics are mapped next to each other (e.g., AA near LT and NS near biological vision). Furthermore, words characterize the topics of their neighboring documents.

Next, we used the data to generate an authors-words matrix (as in the Roweis database). 
We could now embed authors and words into ℝ² by using CODE to model p(word|author). The results are shown in Figure 3. It can be seen that authors are indeed mapped next to terms relevant to their work, and that authors dealing with similar domains are also mapped together. This illustrates how co-occurrence of words and authors may be used to induce a metric on authors alone.

²See http://www.cs.toronto.edu/roweis/data.html
³Data available at http://robotics.stanford.edu/gal/

Figure 4: (a) Document purity measure for the embedding of newsgroups crypt, electronics, and med, as a function of neighborhood size. (b) The doc-doc measure averaged over 7 newsgroup sets; for each set, the maximum performance was normalized to one. Embedding dimension is 2. Sets are: atheism, graphics, crypt; ms-windows, graphics; ibm.pc.hw, ms-windows; crypt, electronics; crypt, electronics, med; crypt, electronics, med, space; politics.mideast, politics.misc. (c) The word-doc measure for the CODE and CA algorithms, for 7 newsgroup sets. Embedding dimension is 2.

5.2 Information Retrieval

To obtain a more quantitative estimate of performance, we applied CODE to the 20 newsgroups corpus, preprocessed as described in [3]. This corpus consists of 20 groups, each with 1000 documents. We first removed the 100 most frequent words, and then selected the next k most frequent words for different values of k (see below). The resulting words and documents were embedded with CODE, correspondence analysis (CA), SVD, IsoMap, and classical MDS⁴. CODE was used to model the distribution of words given documents, p(w|d). All methods were tested under several normalization schemes, including document sum normalization and TFIDF. 
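The doc-doc purity evaluation defined below can be sketched as follows. This is a hypothetical sketch (function name, signature, and the toy two-cluster data are illustrative, not from the paper): for each embedded document, it computes the fraction of same-newsgroup documents among its n nearest neighbors, averaged over neighborhood sizes and documents.

```python
import numpy as np

def doc_doc_purity(coords, labels, max_n=None):
    """Fraction of same-label documents among each document's n nearest
    embedded neighbors, averaged over n = 1..max_n and over documents."""
    coords, labels = np.asarray(coords, dtype=float), np.asarray(labels)
    m = len(labels)
    max_n = max_n or m - 1
    # Pairwise squared distances; exclude each document from its own neighbors
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    order = np.argsort(d2, axis=1)            # neighbors, nearest first
    same = labels[order] == labels[:, None]   # same-label indicator per neighbor
    # Purity at neighborhood size n, then average over n and documents
    per_n = same[:, :max_n].cumsum(axis=1) / np.arange(1, max_n + 1)
    return per_n.mean()

# Toy check: two well-separated clusters should give purity close to 1
rng = np.random.default_rng(3)
coords = np.vstack([np.zeros((5, 2)), 10 + np.zeros((5, 2))])
coords += rng.normal(scale=0.1, size=coords.shape)
labels = np.array([0] * 5 + [1] * 5)
assert doc_doc_purity(coords, labels, max_n=4) > 0.95
```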
Results were consistent over all normalization schemes.

An embedding of words and documents is expected to map documents with similar semantics together, and to map words close to documents related to the meaning of the word. We next test how well our embeddings perform with respect to these requirements. To represent the meaning of a document we use its corresponding newsgroup. Note that this information is used only for evaluation and not in constructing the embedding itself.

To measure how well similar documents are mapped together, we define a purity measure, denoted doc-doc. For each embedded document, we measure the fraction of its neighbors that are from the same newsgroup. This is repeated for all neighborhood sizes, and averaged over all sizes and documents.

To measure how documents are related to their neighboring words, we use a measure denoted word-doc. For each document d we look at its n nearest words and calculate their probability under the document's newsgroup, normalized by their prior. This is repeated for neighborhood sizes smaller than 100 and averaged over documents. The word-doc measure was only compared with CA, since this is the only other method that provides joint embeddings.

Figure 4 compares the performance of CODE with that of the other methods with respect to the doc-doc and word-doc measures. CODE can be seen to outperform all other methods on both measures.

⁴CA embedding followed the standard procedure described in [8]. IsoMap implementation was provided by the IsoMap authors [13]. We tested both an SVD of the count matrix and an SVD of the log of the count plus one; only the latter is described here because it performed better. 
For MDS, the distances between objects were calculated as the dot product between their count vectors (we also tested Euclidean distances).

6 Discussion

We presented a method for embedding objects of different types into the same low-dimensional Euclidean space. This embedding can be used to reveal low-dimensional structures when distance measures between objects are unknown. Furthermore, the embedding induces a meaningful metric between objects of the same type, which could be used, for example, to embed images based on accompanying text, and thereby derive a semantic distance between images.

Co-occurrence embedding need not be restricted to pairs of variables; it can be extended to multivariate joint distributions when these are available. It can also be augmented to use distances between same-type objects when these are known.

An important question in embedding objects is whether the embedding is unique: can there be two non-isometric embeddings that both attain the optimum of the problem? This question is related to the rigidity and uniqueness of embeddings of graphs, specifically complete bipartite graphs in our case. A theorem of Bolker and Roth [1] asserts that for such graphs with at least 5 vertices on each side, embeddings are rigid, i.e., they cannot be continuously transformed. This suggests that the CODE embeddings for |X|, |Y| ≥ 5 are unique (at least locally) for d ≤ 3.

We focused here on geometric models for conditional distributions. While such a modeling choice is natural in some cases, joint models may be more appropriate in others. In this context it will be interesting to consider models of the form p(x, y) ∝ p(x) p(y) exp(−d²(x,y)), where p(x), p(y) are the marginals of p(x, y). Maximum likelihood in these models is a non-trivial constrained optimization problem, and may be approached using the semidefinite representation outlined here.

References

[1] E.D. Bolker and B. Roth. 
When is a bipartite graph a rigid framework? Pacific J. Math., 90:27-44, 1980.

[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, 2004.

[3] G. Chechik and N. Tishby. Extracting relevant structures with side information. In S. Becker, S. Thrun, and K. Obermayer, editors, NIPS 15, 2002.

[4] T. Cox and M. Cox. Multidimensional Scaling. Chapman and Hall, London, 1984.

[5] M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proc. of the American Control Conference, 2001.

[6] R.A. Fisher. The precision of discriminant functions. Ann. Eugen. Lond., 10:422-429, 1940.

[7] A. Globerson and N. Tishby. Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307-1331, 2003.

[8] M.J. Greenacre. Theory and Applications of Correspondence Analysis. Academic Press, 1984.

[9] G. Hinton and S.T. Roweis. Stochastic neighbor embedding. In NIPS 15, 2002.

[10] H. Hotelling. The most predictable criterion. Journal of Educational Psychology, 26:139-142, 1935.

[11] T. Iwata, K. Saito, N. Ueda, S. Stromsten, T. Griffiths, and J. Tenenbaum. Parametric embedding for class visualization. In NIPS 18, 2004.

[12] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.

[13] J.B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319-2323, 2000.

[14] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. 
In CVPR, 2004.\n\n\f\n", "award": [], "sourceid": 2733, "authors": [{"given_name": "Amir", "family_name": "Globerson", "institution": null}, {"given_name": "Gal", "family_name": "Chechik", "institution": null}, {"given_name": "Fernando", "family_name": "Pereira", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}