Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

Advances in Neural Information Processing Systems, pp. 585-591

Mikhail Belkin and Partha Niyogi
Depts. of Mathematics and Computer Science
The University of Chicago
Hyde Park, Chicago, IL 60637
(misha@math.uchicago.edu, niyogi@cs.uchicago.edu)

Abstract

Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on a manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for constructing a representation for data sampled from a low-dimensional manifold embedded in a higher-dimensional space. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Several applications are considered.

In many areas of artificial intelligence, information retrieval, and data mining, one is often confronted with intrinsically low-dimensional data lying in a very high-dimensional space. For example, gray-scale $n \times n$ images of a fixed object taken with a moving camera yield data points in $\mathbb{R}^{n^2}$. However, the intrinsic dimensionality of the space of all images of the same object is the number of degrees of freedom of the camera; in fact, the space has the natural structure of a manifold embedded in $\mathbb{R}^{n^2}$. While there is a large body of work on dimensionality reduction in general, most existing approaches do not explicitly take into account the structure of the manifold on which the data may possibly reside.
Recently, there has been some interest (Tenenbaum et al., 2000; Roweis and Saul, 2000) in the problem of developing low-dimensional representations of data in this particular context. In this paper, we present a new algorithm and an accompanying framework of analysis for geometrically motivated dimensionality reduction.

The core algorithm is very simple, requiring only a few local computations and one sparse eigenvalue problem. The solution reflects the intrinsic geometric structure of the manifold. The justification comes from the role of the Laplacian operator in providing an optimal embedding. The Laplacian of the graph obtained from the data points may be viewed as an approximation to the Laplace-Beltrami operator defined on the manifold. The embedding maps for the data come from approximations to a natural map that is defined on the entire manifold. The framework of analysis presented here makes this connection explicit. While this connection is known to geometers and specialists in spectral graph theory (for example, see [1, 2]), to the best of our knowledge it has not yet been applied to data representation. The connection of the Laplacian to the heat kernel enables us to choose the weights of the graph in a principled manner.

The locality-preserving character of the Laplacian Eigenmap algorithm makes it relatively insensitive to outliers and noise. A byproduct of this is that the algorithm implicitly emphasizes the natural clusters in the data. Connections to spectral clustering algorithms developed in learning and computer vision (see Shi and Malik, 1997) become very clear. Following the discussion of Roweis and Saul (2000) and Tenenbaum et al. (2000), we note that the biological perceptual apparatus is confronted with high-dimensional stimuli from which it must recover low-dimensional structure.
One might argue that if the approach to recovering such low-dimensional structure is inherently local, then a natural clustering will emerge and thus might serve as the basis for the development of categories in biological perception.

1 The Algorithm

Given $k$ points $x_1, \ldots, x_k$ in $\mathbb{R}^l$, we construct a weighted graph with $k$ nodes, one for each point, and a set of edges connecting neighboring points to each other.

1. Step 1. [Constructing the Graph] We put an edge between nodes $i$ and $j$ if $x_i$ and $x_j$ are "close". There are two variations:

   (a) $\epsilon$-neighborhoods. [parameter $\epsilon \in \mathbb{R}$] Nodes $i$ and $j$ are connected by an edge if $\|x_i - x_j\|^2 < \epsilon$.
       Advantages: geometrically motivated; the relationship is naturally symmetric.
       Disadvantages: often leads to graphs with several connected components; difficult to choose $\epsilon$.

   (b) $n$ nearest neighbors. [parameter $n \in \mathbb{N}$] Nodes $i$ and $j$ are connected by an edge if $i$ is among the $n$ nearest neighbors of $j$ or $j$ is among the $n$ nearest neighbors of $i$.
       Advantages: simpler to choose; tends to lead to connected graphs.
       Disadvantages: less geometrically intuitive.

2. Step 2. [Choosing the Weights] Here as well we have two variations for weighting the edges:

   (a) Heat kernel. [parameter $t \in \mathbb{R}$] If nodes $i$ and $j$ are connected, put
       $$W_{ij} = e^{-\frac{\|x_i - x_j\|^2}{t}}$$
       The justification for this choice of weights will be provided later.

   (b) Simple-minded. [No parameters] $W_{ij} = 1$ if and only if vertices $i$ and $j$ are connected by an edge. A simplification which avoids the necessity of choosing $t$.

3. Step 3. [Eigenmaps] Assume the graph $G$, constructed above, is connected; otherwise proceed with Step 3 for each connected component.
Figure 1: The left panel shows a horizontal and a vertical bar. The middle panel is a two-dimensional representation of the set of all images using the Laplacian eigenmaps. The right panel shows the result of a principal components analysis using the first two principal directions to represent the data. Dots correspond to vertical bars and '+' signs correspond to horizontal bars.

Compute eigenvalues and eigenvectors for the generalized eigenvector problem:

$$L\mathbf{y} = \lambda D\mathbf{y} \quad (1)$$

where $D$ is the diagonal weight matrix whose entries are column (or row, since $W$ is symmetric) sums of $W$, $D_{ii} = \sum_j W_{ji}$, and $L = D - W$ is the Laplacian matrix. The Laplacian is a symmetric, positive semidefinite matrix which can be thought of as an operator on functions defined on the vertices of $G$. Let $\mathbf{y}_0, \ldots, \mathbf{y}_{k-1}$ be the solutions of equation (1), ordered according to their eigenvalues, with $\mathbf{y}_0$ having the smallest eigenvalue (in fact 0). The image of $x_i$ under the embedding into the lower-dimensional space $\mathbb{R}^m$ is given by $(\mathbf{y}_1(i), \ldots, \mathbf{y}_m(i))$.

2 Justification

Recall that given a data set we construct a weighted graph $G = (V, E)$ with edges connecting nearby points to each other. Consider the problem of mapping the weighted connected graph $G$ to a line so that connected points stay as close together as possible. We wish to choose $y_i \in \mathbb{R}$ to minimize

$$\sum_{i,j} (y_i - y_j)^2 W_{ij}$$

under appropriate constraints. Let $\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$ be the map from the graph to the real line. First, note that for any $\mathbf{y}$, we have

$$\sum_{i,j} (y_i - y_j)^2 W_{ij} = 2\mathbf{y}^T L \mathbf{y} \quad (2)$$

where, as before, $L = D - W$. To see this, notice that $W_{ij}$ is symmetric and $D_{ii} = \sum_j W_{ij}$. Thus $\sum_{i,j} (y_i - y_j)^2 W_{ij}$ can be written as

$$\sum_{i,j} (y_i^2 + y_j^2 - 2 y_i y_j) W_{ij} = \sum_i y_i^2 D_{ii} + \sum_j y_j^2 D_{jj} - 2 \sum_{i,j} y_i y_j W_{ij} = 2\mathbf{y}^T L \mathbf{y}$$
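The identity in equation (2) can be checked numerically; below is a minimal sketch (not from the paper) using a random symmetric weight matrix, with all variable names illustrative.

```python
import numpy as np

# Check: sum_{i,j} (y_i - y_j)^2 * W_ij  ==  2 * y^T L y
rng = np.random.default_rng(0)
n = 6
A = rng.random((n, n))
W = (A + A.T) / 2            # symmetric weight matrix
np.fill_diagonal(W, 0.0)     # no self-loops

D = np.diag(W.sum(axis=1))   # degree matrix, D_ii = sum_j W_ij
L = D - W                    # graph Laplacian

y = rng.standard_normal(n)
lhs = sum((y[i] - y[j]) ** 2 * W[i, j]
          for i in range(n) for j in range(n))
rhs = 2 * y @ L @ y
print(abs(lhs - rhs) < 1e-10)  # prints True
```

Expanding the square termwise, as in the derivation above, is exactly what the two-line computation of `rhs` encodes.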
Figure 3: Fragments labeled by arrows in Figure 2, from left to right. The first contains infinitives of verbs, the second contains prepositions, and the third mostly modal and auxiliary verbs. We see that syntactic structure is well preserved.

Therefore, the minimization problem reduces to finding

$$\operatorname{argmin}_{\mathbf{y}^T D \mathbf{y} = 1} \mathbf{y}^T L \mathbf{y}$$

The constraint $\mathbf{y}^T D \mathbf{y} = 1$ removes an arbitrary scaling factor in the embedding. The matrix $D$ provides a natural measure on the vertices of the graph. From eq. (2), we see that $L$ is a positive semidefinite matrix, and the vector $\mathbf{y}$ that minimizes the objective function is given by the minimum eigenvalue solution to the generalized eigenvalue problem $L\mathbf{y} = \lambda D \mathbf{y}$.

Figure 2: 300 most frequent words of the Brown corpus represented in the spectral domain.

Let $\mathbf{1}$ be the constant function taking the value 1 at each vertex. It is easy to see that $\mathbf{1}$ is an eigenvector with eigenvalue 0. If the graph is connected, $\mathbf{1}$ is the only eigenvector for $\lambda = 0$. To eliminate this trivial solution, which collapses all vertices of $G$ onto the real number 1, we put an additional constraint of orthogonality to obtain

$$\mathbf{y}_{opt} = \operatorname{argmin}_{\substack{\mathbf{y}^T D \mathbf{y} = 1 \\ \mathbf{y}^T D \mathbf{1} = 0}} \mathbf{y}^T L \mathbf{y}$$

Thus, the solution $\mathbf{y}_{opt}$ is now given by the eigenvector with the smallest non-zero eigenvalue. More generally, the embedding of the graph into $\mathbb{R}^m$ ($m > 1$) is given by the $n \times m$ matrix $Y = [\mathbf{y}_1 \mathbf{y}_2 \ldots \mathbf{y}_m]$, where the $i$th row, denoted by $Y_i$, provides the embedding coordinates of the $i$th vertex. Thus we need to minimize

$$\sum_{i,j} \|Y_i - Y_j\|^2 W_{ij} = \operatorname{tr}(Y^T L Y)$$

This now reduces to

$$Y_{opt} = \operatorname{argmin}_{Y^T D Y = I} \operatorname{tr}(Y^T L Y)$$

For the one-dimensional embedding problem, the constraint prevents collapse onto a point. For the $m$-dimensional embedding problem, the constraint presented above prevents collapse onto a subspace of dimension less than $m$.
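The three steps of the algorithm admit a compact numerical sketch. The following is not from the paper; it is a NumPy-based illustration (all names hypothetical) that solves $L\mathbf{y} = \lambda D\mathbf{y}$ through the standard reduction to the symmetric problem $D^{-1/2} L D^{-1/2} \mathbf{v} = \lambda \mathbf{v}$ with $\mathbf{y} = D^{-1/2}\mathbf{v}$.

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors=5, t=1.0, m=2):
    """Sketch of the three-step algorithm: nearest-neighbor graph,
    heat-kernel weights, generalized eigenproblem L y = lambda D y.
    Assumes the resulting graph is connected."""
    k = X.shape[0]
    # Step 1: symmetric n-nearest-neighbor graph.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    idx = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]   # skip self (column 0)
    adj = np.zeros((k, k), dtype=bool)
    adj[np.repeat(np.arange(k), n_neighbors), idx.ravel()] = True
    adj |= adj.T                                         # symmetrize
    # Step 2: heat-kernel weights on the edges.
    W = np.where(adj, np.exp(-d2 / t), 0.0)
    # Step 3: solve L y = lambda D y via the symmetric normalization.
    d = W.sum(axis=1)
    L = np.diag(d) - W
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    vals, vecs = np.linalg.eigh(Dinv_sqrt @ L @ Dinv_sqrt)
    Y = Dinv_sqrt @ vecs          # columns ordered by eigenvalue
    return Y[:, 1:m + 1]          # drop the trivial eigenvector

# Two nearby clusters in R^3; n_neighbors is chosen large enough
# that the graph is connected across the clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 3)), rng.normal(2, 0.1, (10, 3))])
Y = laplacian_eigenmaps(X, n_neighbors=12, m=1)
print(Y.shape)  # (20, 1)
```

In this example the single embedding coordinate (the generalized eigenvector with smallest non-zero eigenvalue) takes one sign on the first cluster and the opposite sign on the second, which is the clustering behavior discussed above.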
2.1 The Laplace-Beltrami Operator

The Laplacian of a graph is analogous to the Laplace-Beltrami operator on manifolds.

Consider a smooth $m$-dimensional manifold $M$ embedded in $\mathbb{R}^k$. The Riemannian structure (metric tensor) on the manifold is induced by the standard Riemannian structure on $\mathbb{R}^k$. Suppose we have a map $f : M \to \mathbb{R}$. The gradient $\nabla f(x)$ (which in local coordinates can be written as $\nabla f(x) = \sum_{i=1}^{m} \frac{\partial f}{\partial x_i} \frac{\partial}{\partial x_i}$) is a vector field on the manifold such that for small $\delta x$ (in a local coordinate chart)

$$|f(x + \delta x) - f(x)| \approx |\langle \nabla f(x), \delta x \rangle| \leq \|\nabla f\| \, \|\delta x\|$$

Figure 4: 685 speech data points plotted in the two-dimensional Laplacian spectral representation.

Thus we see that if $\|\nabla f\|$ is small, points near $x$ will be mapped to points near $f(x)$. We therefore look for a map that best preserves locality on average by trying to find

$$\operatorname{argmin} \int_M \|\nabla f(x)\|^2$$

Minimizing $\int_M \|\nabla f(x)\|^2$ corresponds directly to minimizing $Lf = \frac{1}{2} \sum_{i,j} (f_i - f_j)^2 W_{ij}$ on a graph. Minimizing the squared gradient reduces to finding eigenfunctions of the Laplace-Beltrami operator $\mathcal{L}$. Recall that $\mathcal{L} f \stackrel{\mathrm{def}}{=} -\operatorname{div} \nabla(f)$, where $\operatorname{div}$ is the divergence. It follows from Stokes' theorem that $-\operatorname{div}$ and $\nabla$ are formally adjoint operators, i.e., if $f$ is a function and $X$ is a vector field, then $\int_M \langle X, \nabla f \rangle = -\int_M \operatorname{div}(X) f$. Thus

$$\int_M \|\nabla f\|^2 = \int_M \mathcal{L}(f) f$$

We see that $\mathcal{L}$ is positive semidefinite, and the $f$ that minimizes $\int_M \|\nabla f\|^2$ has to be an eigenfunction of $\mathcal{L}$.

2.2 Heat Kernels and the Choice of Weight Matrix

The Laplace-Beltrami operator on differentiable functions on a manifold $M$ is intimately related to the heat flow. Let $f : M \to \mathbb{R}$ be the initial heat distribution and $u(x, t)$ be the heat distribution at time $t$ (so $u(x, 0) = f(x)$). The heat equation is the partial differential equation $\frac{\partial u}{\partial t} = -\mathcal{L} u$.
The solution is given by $u(x, t) = \int_M H_t(x, y) f(y)$, where $H_t$ is the heat kernel, the Green's function for this PDE. Therefore, $\mathcal{L} f$ can be recovered from the behavior of $u(x, t)$ for small $t$.

Locally, the heat kernel is approximately equal to the Gaussian,

$$H_t(x, y) \approx (4\pi t)^{-\frac{n}{2}} e^{-\frac{\|x - y\|^2}{4t}}$$

where $\|x - y\|$ ($x$ and $y$ in local coordinates) and $t$ are both sufficiently small and $n = \dim M$. Notice that as $t$ tends to 0, the heat kernel $H_t(x, y)$ becomes increasingly localized and tends to Dirac's $\delta$-function, i.e., $\lim_{t \to 0} \int_M H_t(x, y) f(y) = f(x)$. Therefore, for small $t$, from the definition of the derivative we have

$$\mathcal{L} f(x) \approx \frac{1}{t} \left[ f(x) - (4\pi t)^{-\frac{n}{2}} \int_M e^{-\frac{\|x - y\|^2}{4t}} f(y) \, dy \right]$$

If $x_1, \ldots, x_k$ are data points on $M$, the last expression can be approximated by a sum over the data points $x_j$ with $0 < \|x_j - x_i\|$
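The localization of the heat kernel as $t \to 0$ can be seen numerically. The sketch below (not from the paper) works in the simplest case $n = 1$, $M = \mathbb{R}$, approximating the integral $\int H_t(x_0, y) f(y)\,dy$ by a Riemann sum and watching it converge to $f(x_0)$; the test function and grid are arbitrary choices.

```python
import numpy as np

# As t -> 0, (4*pi*t)^(-1/2) * integral exp(-(x0-y)^2/(4t)) f(y) dy -> f(x0).
f = np.cos                        # a smooth test function
x0 = 0.3
y = np.linspace(-10, 10, 200001)  # quadrature grid on a truncated real line
dy = y[1] - y[0]

errs = []
for t in [1.0, 0.1, 0.01]:
    Ht = (4 * np.pi * t) ** -0.5 * np.exp(-(x0 - y) ** 2 / (4 * t))
    approx = (Ht * f(y)).sum() * dy   # Riemann-sum approximation
    errs.append(abs(approx - f(x0)))
print(errs)
```

The error shrinks roughly linearly in $t$, consistent with the $\frac{1}{t}[f(x) - \int H_t f]$ difference quotient used above to approximate $\mathcal{L} f$.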