{"title": "On Spectral Clustering: Analysis and an algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 849, "page_last": 856, "abstract": null, "full_text": "On Spectral Clustering: Analysis and an algorithm

Andrew Y. Ng                 Michael I. Jordan             Yair Weiss
CS Division                  CS Div. & Dept. of Stat.      School of CS & Engr.
U.C. Berkeley                U.C. Berkeley                 The Hebrew Univ.
ang@cs.berkeley.edu          jordan@cs.berkeley.edu        yweiss@cs.huji.ac.il

Abstract

Despite many empirical successes of spectral clustering methods (algorithms that cluster points using eigenvectors of matrices derived from the data), there are several unresolved issues. First, there is a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems.

1 Introduction

The task of finding good clusters has been the focus of considerable research in machine learning and pattern recognition. For clustering points in R^n (a main application focus of this paper), one standard approach is based on generative models, in which algorithms such as EM are used to learn a mixture density. These approaches suffer from several drawbacks. First, to use parametric density estimators, harsh simplifying assumptions usually need to be made (e.g., that the density of each cluster is Gaussian).
Second, the log likelihood can have many local maxima, and therefore multiple restarts of an iterative algorithm are required to find a good solution. Algorithms such as K-means have similar problems.

A promising alternative that has recently emerged in a number of fields is to use spectral methods for clustering. Here, one uses the top eigenvectors of a matrix derived from the distances between points. Such algorithms have been used successfully in many applications, including computer vision and VLSI design [5, 1]. But despite their empirical successes, different authors still disagree on exactly which eigenvectors to use and how to derive clusters from them (see [11] for a review). Also, the analysis of these algorithms, which we briefly review below, has tended to focus on simplified algorithms that use only one eigenvector at a time.

One line of analysis makes the link to spectral graph partitioning, in which the second eigenvector of a graph's Laplacian is used to define a semi-optimal cut. Here, the eigenvector is seen as solving a relaxation of an NP-hard discrete graph partitioning problem [3], and it can be shown that cuts based on the second eigenvector give a guaranteed approximation to the optimal cut [9, 3]. This analysis can be extended to clustering by building a weighted graph in which nodes correspond to datapoints and edge weights are related to the distances between the points. Since the majority of analyses in spectral graph partitioning appear to deal with partitioning the graph into exactly two parts, these methods are then typically applied recursively to find k clusters (e.g., [9]). Experimentally, it has been observed that using more eigenvectors and directly computing a k-way partitioning gives better results (e.g., [5, 1]). Here, we build upon the recent work of Weiss [11] and Meila and Shi [6], who analyzed algorithms that use k eigenvectors simultaneously in simple settings.
We propose a particular manner of using the k eigenvectors simultaneously, and give conditions under which the algorithm can be expected to do well.

2 Algorithm

Given a set of points S = {s_1, ..., s_n} in R^l that we want to cluster into k subsets:

1. Form the affinity matrix A ∈ R^{n×n} defined by A_{ij} = exp(−‖s_i − s_j‖^2 / 2σ^2) if i ≠ j, and A_{ii} = 0.

2. Define D to be the diagonal matrix whose (i, i)-element is the sum of A's i-th row, and construct the matrix L = D^{−1/2} A D^{−1/2}.^1

3. Find x_1, x_2, ..., x_k, the k largest eigenvectors of L (chosen to be orthogonal to each other in the case of repeated eigenvalues), and form the matrix X = [x_1 x_2 ... x_k] ∈ R^{n×k} by stacking the eigenvectors in columns.

4. Form the matrix Y from X by renormalizing each of X's rows to have unit length (i.e., Y_{ij} = X_{ij} / (Σ_j X_{ij}^2)^{1/2}).

5. Treating each row of Y as a point in R^k, cluster them into k clusters via K-means or any other algorithm (that attempts to minimize distortion).

6. Finally, assign the original point s_i to cluster j if and only if row i of the matrix Y was assigned to cluster j.

Here, the scaling parameter σ^2 controls how rapidly the affinity A_{ij} falls off with the distance between s_i and s_j, and we will later describe a method for choosing it automatically. We also note that this is only one of a large family of possible algorithms, and later discuss some related methods (e.g., [6]).

At first sight, this algorithm seems to make little sense. Since we run K-means in step 5, why not just apply K-means directly to the data? Figure 1e shows an example. The natural clusters in R^2 do not correspond to convex regions, and K-means run directly finds the unsatisfactory clustering in Figure 1i. But once we map the points to R^k (Y's rows), they form tight clusters (Figure 1h) from which our method obtains the good clustering shown in Figure 1e.
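The six steps above can be sketched in NumPy (the paper notes the algorithm fits in a few lines of Matlab). The function name `spectral_cluster`, the farthest-first seeding, and the fixed Lloyd iteration count below are our own illustrative choices, not the paper's reference implementation:

```python
import numpy as np

def spectral_cluster(S, k, sigma2, n_iter=100, seed=0):
    """Sketch of steps 1-6: affinity, normalized matrix L, top-k
    eigenvectors, row renormalization, K-means on the rows of Y."""
    n = S.shape[0]
    # Step 1: A_ij = exp(-||s_i - s_j||^2 / 2 sigma^2), with A_ii = 0.
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2.0 * sigma2))
    np.fill_diagonal(A, 0.0)
    # Step 2: L = D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # Step 3: k largest eigenvectors (eigh returns ascending eigenvalues).
    _, V = np.linalg.eigh(L)
    X = V[:, -k:]
    # Step 4: renormalize each row of X to unit length.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Step 5: K-means on Y's rows, seeded farthest-first (cf. footnote 3).
    rng = np.random.default_rng(seed)
    C = [Y[rng.integers(n)]]
    while len(C) < k:
        dist = ((Y[:, None, :] - np.array(C)[None]) ** 2).sum(-1).min(axis=1)
        C.append(Y[dist.argmax()])
    C = np.array(C)
    for _ in range(n_iter):
        labels = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        C = np.array([Y[labels == j].mean(axis=0) if (labels == j).any() else C[j]
                      for j in range(k)])
    # Step 6: point s_i inherits the cluster of row i of Y.
    return labels
```

On well-separated data this recovers the partition directly; on the paper's non-convex examples, it is the embedding of steps 1-4 that makes the final K-means succeed.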
We note that the clusters in Figure 1h lie at 90° to each other relative to the origin (cf. [8]).

^1 Readers familiar with spectral graph theory [3] may be more familiar with the Laplacian I − L. But as replacing L with I − L would complicate our later discussion, and only changes the eigenvalues (from λ_i to 1 − λ_i) and not the eigenvectors, we instead use L.

3 Analysis of algorithm

3.1 Informal discussion: The "ideal" case

To understand the algorithm, it is instructive to consider its behavior in the "ideal" case in which all points in different clusters are infinitely far apart. For the sake of discussion, suppose that k = 3, and that the three clusters of sizes n_1, n_2 and n_3 are S_1, S_2, and S_3 (S = S_1 ∪ S_2 ∪ S_3, n = n_1 + n_2 + n_3). To simplify our exposition, also assume that the points in S = {s_1, ..., s_n} are ordered according to which cluster they are in, so that the first n_1 points are in cluster S_1, the next n_2 in S_2, etc. We will also use "j ∈ S_i" as a shorthand for s_j ∈ S_i. Moving the clusters "infinitely" far apart corresponds to zeroing all the elements A_{ij} corresponding to points s_i and s_j in different clusters. More precisely, define Â_{ij} = 0 if s_i and s_j are in different clusters, and Â_{ij} = A_{ij} otherwise. Also let L̂, D̂, X̂ and Ŷ be defined as in the previous algorithm, but starting with Â instead of A. Note that Â and L̂ are therefore block-diagonal:

    Â = [ Â^(11)   0        0      ]        [ L̂^(11)   0        0      ]
        [ 0        Â^(22)   0      ] ;  L̂ = [ 0        L̂^(22)   0      ]    (1)
        [ 0        0        Â^(33) ]        [ 0        0        L̂^(33) ]

where we have adopted the convention of using parenthesized superscripts to index into subblocks of vectors/matrices, and L̂^(ii) = (D̂^(ii))^{−1/2} Â^(ii) (D̂^(ii))^{−1/2}. Here, Â^(ii) = A^(ii) ∈ R^{n_i × n_i} is the matrix of "intra-cluster" affinities for cluster i.
For future use, also define d^(i) ∈ R^{n_i} to be the vector containing D̂^(ii)'s diagonal elements, and d̂ ∈ R^n to contain D̂'s diagonal elements.

To construct X̂, we find L̂'s first k = 3 eigenvectors. Since L̂ is block diagonal, its eigenvalues and eigenvectors are the union of the eigenvalues and eigenvectors of its blocks (the latter padded appropriately with zeros). It is straightforward to show that L̂^(ii) has a strictly positive principal eigenvector x_1^(i) ∈ R^{n_i} with eigenvalue 1. Also, since Â^(ii)_{jk} > 0 (j ≠ k), the next eigenvalue is strictly less than 1. (See, e.g., [3].) Thus, stacking L̂'s eigenvectors in columns to obtain X̂, we have:

    X̂ = [ x_1^(1)   0         0       ]
        [ 0         x_1^(2)   0       ]  ∈ R^{n×3}.    (2)
        [ 0         0         x_1^(3) ]

Actually, a subtlety needs to be addressed here. Since 1 is a repeated eigenvalue of L̂, we could just as easily have picked any other 3 orthogonal vectors spanning the same subspace as X̂'s columns, and defined them to be our first 3 eigenvectors. That is, X̂ could have been replaced by X̂R for any orthogonal matrix R ∈ R^{3×3} (R^T R = R R^T = I). Note that this immediately suggests that one use considerable caution in attempting to interpret the individual eigenvectors of L, as the choice of X's columns is arbitrary up to a rotation, and can easily change due to small perturbations to A or even differences in the implementation of the eigensolvers. Instead, what we can reasonably hope to guarantee about the algorithm will be arrived at not by considering the (unstable) individual columns of X, but instead the subspace spanned by the columns of X, which can be considerably more stable. Next, when we renormalize each of X̂'s rows to have unit length, we obtain:

    Ŷ = [ Ŷ^(1) ]   [ 1^(1)   0       0     ]
        [ Ŷ^(2) ] = [ 0       1^(2)   0     ] R    (3)
        [ Ŷ^(3) ]   [ 0       0       1^(3) ]

where 1^(i) ∈ R^{n_i} denotes the all-ones column vector, and we have used Ŷ^(i) ∈ R^{n_i × k} to denote the i-th subblock of Ŷ.
Letting ŷ_j^(i) denote the j-th row of Ŷ^(i), we therefore see that ŷ_j^(i) is the i-th row of the orthogonal matrix R. This gives us the following proposition.

Proposition 1 Let Â's off-diagonal blocks Â^(ij), i ≠ j, be zero. Also assume that each cluster S_i is connected.^2 Then there exist k orthogonal vectors r_1, ..., r_k (r_i^T r_j = 1 if i = j, 0 otherwise) so that Ŷ's rows satisfy

    ŷ_j^(i) = r_i    (4)

for all i = 1, ..., k, j = 1, ..., n_i.

In other words, there are k mutually orthogonal points on the surface of the unit k-sphere around which Ŷ's rows will cluster. Moreover, these clusters correspond exactly to the true clustering of the original data.

3.2 The general case

In the general case, A's off-diagonal blocks are non-zero, but we still hope to recover guarantees similar to Proposition 1. Viewing E = A − Â as a perturbation to the "ideal" Â that results in A = Â + E, we ask: When can we expect the resulting rows of Y to cluster similarly to the rows of Ŷ? Specifically, when will the eigenvectors of L, which we now view as a perturbed version of L̂, be "close" to those of L̂? Matrix perturbation theory [10] indicates that the stability of the eigenvectors of a matrix is determined by the eigengap. More precisely, the subspace spanned by L̂'s first 3 eigenvectors will be stable to small changes to L̂ if and only if the eigengap δ = |λ_3 − λ_4|, the difference between the 3rd and 4th eigenvalues of L̂, is large. As discussed previously, the eigenvalues of L̂ are the union of the eigenvalues of L̂^(11), L̂^(22), and L̂^(33), and λ_3 = 1. Letting λ_j^(i) be the j-th largest eigenvalue of L̂^(ii), we therefore see that λ_4 = max_i λ_2^(i). Hence, the assumption that |λ_3 − λ_4| be large is exactly the assumption that max_i λ_2^(i) be bounded away from 1.

Assumption A1. There exists δ > 0 so that, for all i = 1, ..., k, λ_2^(i) ≤ 1 − δ.
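Assumption A1 can be checked numerically for a given affinity matrix and candidate clustering by computing λ_2^(i) for each intra-cluster block. A sketch (the helper name and its `labels` argument are our own; the paper states A1 abstractly):

```python
import numpy as np

def second_eigenvalues(A, labels):
    """For each cluster S_i, return the second-largest eigenvalue
    lambda_2^(i) of L^(ii) = (D^(ii))^{-1/2} A^(ii) (D^(ii))^{-1/2}.
    Assumption A1 holds with a given delta iff every value returned
    is at most 1 - delta (the largest eigenvalue is always 1)."""
    lam2 = []
    for c in np.unique(labels):
        Aii = A[np.ix_(labels == c, labels == c)]   # intra-cluster block
        d_inv_sqrt = 1.0 / np.sqrt(Aii.sum(axis=1))
        Lii = d_inv_sqrt[:, None] * Aii * d_inv_sqrt[None, :]
        w = np.linalg.eigvalsh(Lii)                 # ascending order
        lam2.append(w[-2])
    return np.array(lam2)
```

A tight cluster yields λ_2^(i) far below 1, while a cluster that is itself two weakly joined subclusters yields λ_2^(i) near 1, shrinking the eigengap exactly as the discussion below describes.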
Note that λ_2^(i) depends only on L̂^(ii), which in turn depends only on Â^(ii) = A^(ii), the matrix of intra-cluster similarities for cluster S_i. The assumption on λ_2^(i) has a very natural interpretation in the context of clustering. Informally, it captures the idea that if we want an algorithm to find the clusters S_1, S_2 and S_3, then we require that each of these sets S_i really look like a "tight" cluster. Consider an example in which S_1 = S_{1.1} ∪ S_{1.2}, where S_{1.1} and S_{1.2} are themselves two well separated clusters. Then S = S_{1.1} ∪ S_{1.2} ∪ S_2 ∪ S_3 looks like (at least) four clusters, and it would be unreasonable to expect an algorithm to correctly guess what partition of the four clusters into three subsets we had in mind.

This connection between the eigengap and the cohesiveness of the individual clusters can be formalized in a number of ways.

Assumption A1.1. Define the Cheeger constant [3] of the cluster S_i to be

    h(S_i) = min_I  ( Σ_{j∈I, k∉I} Â^(ii)_{jk} ) / min( Σ_{j∈I} d_j^(i), Σ_{k∉I} d_k^(i) )    (5)

where the outer minimum is over all index subsets I ⊆ {1, ..., n_i}. Assume that there exists δ > 0 so that (h(S_i))^2 / 2 ≥ δ for all i.

^2 This condition is satisfied by Â^(ii)_{jk} > 0 (j ≠ k), which is true in our case.

A standard result in spectral graph theory shows that Assumption A1.1 implies Assumption A1. Recall that d_j^(i) = Σ_k Â^(ii)_{jk} characterizes how "well connected" or how "similar" point j is to the other points in the same cluster. The term in the min_I{·} characterizes how well (I, I^c) partitions S_i into two subsets, and the minimum over I picks out the best such partition. Specifically, if there is a partition of S_i's points so that the weight of the edges across the partition is small, and so that each of the parts has moderately large "volume" (sum of d_j^(i)'s), then the Cheeger constant will be small.
Thus, the assumption that the Cheeger constants h(S_i) be large is exactly the assumption that the clusters S_i be hard to split into two subsets.

We can also relate the eigengap to the mixing time of a random walk (as in [6]) defined on the points of a cluster, in which the chance of transitioning from point i to j is proportional to A_{ij}, so that we tend to jump to nearby points. Assumption A1 is equivalent to assuming that, for such a walk defined on the points of any one of the clusters S_i, the corresponding transition matrix has second eigenvalue at most 1 − δ. The mixing time of a random walk is governed by the second eigenvalue; thus, this assumption is exactly that the walks mix rapidly. Intuitively, this will be true for tight (or at least fairly "well connected") clusters, and untrue if a cluster consists of two well-separated sets of points, so that the random walk takes a long time to transition from one half of the cluster to the other. Assumption A1 can also be related to the existence of multiple paths between any two points in the same cluster.

Assumption A2. There is some fixed ε_1 > 0, so that for every i_1, i_2 ∈ {1, ..., k}, i_1 ≠ i_2, we have that

    Σ_{j∈S_{i_1}} Σ_{k∈S_{i_2}} A_{jk}^2 / (d_j d_k) ≤ ε_1    (6)

To gain intuition about this, consider the case of two "dense" clusters i_1 and i_2 of size O(n) each. Since d_j measures how "connected" point j is to other points in the same cluster, it will be d_j = O(n) in this case, so the sum, which is over O(n^2) terms, is in turn divided by d_j d_k = O(n^2). Thus, as long as the individual A_{jk}'s are small, the sum will also be small, and the assumption will hold with small ε_1.

Whereas d_j measures how connected s_j ∈ S_i is to the rest of S_i, Σ_{k: s_k∉S_i} A_{jk} measures how connected s_j is to points in other clusters. The next assumption is that all points must be more connected to points in the same cluster than to points in other clusters; specifically, that the ratio between these two quantities be small.
Assumption A3. For some fixed ε_2 > 0, for every i = 1, ..., k, j ∈ S_i, we have:

    (7)

For intuition about this assumption, again consider the case of densely connected clusters (as we did previously). Here, the quantity in parentheses on the right hand side is O(1), so this becomes equivalent to demanding that the following ratio be small: (Σ_{k: s_k∉S_i} A_{jk}) / d_j = (Σ_{k: s_k∉S_i} A_{jk}) / (Σ_{k: s_k∈S_i} A_{jk}) = O(ε_2).

Assumption A4. There is some constant C > 0 so that for every i = 1, ..., k, j = 1, ..., n_i, we have d_j^(i) ≥ (Σ_{k=1}^{n_i} d_k^(i)) / (C n_i).

This last assumption is a fairly benign one: that no point in a cluster be "too much less" connected than the other points in the same cluster.

Theorem 2 Let Assumptions A1, A2, A3 and A4 hold. Set ε = √(k(k−1) ε_1 + k ε_2^2). If δ > (2 + √2) ε, then there exist k orthogonal vectors r_1, ..., r_k (r_i^T r_j = 1 if i = j, 0 otherwise) so that Y's rows satisfy

    (8)

Thus, the rows of Y will form tight clusters around k well-separated points (at 90° from each other) on the surface of the k-sphere according to their "true" cluster S_i.

4 Experiments

To test our algorithm, we applied it to seven clustering problems. Note that whereas σ^2 was previously described as a human-specified parameter, the analysis also suggests a particularly simple way of choosing it automatically: For the right σ^2, Theorem 2 predicts that the rows of Y will form k "tight" clusters on the surface of the k-sphere. Thus, we simply search over σ^2, and pick the value that, after clustering Y's rows, gives the tightest (smallest distortion) clusters. K-means in Step 5 of the algorithm was also inexpensively initialized using the prior knowledge that the clusters are about 90° apart.^3 The results of our algorithm are shown in Figure 1a-g.
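The automatic σ^2 search just described can be sketched as follows. The candidate grid, helper name, and plain Lloyd K-means are our own illustrative choices; the paper only says to search over σ^2 and keep the value giving the smallest distortion:

```python
import numpy as np

def best_sigma2(S, k, candidates, seed=0):
    """For each candidate sigma^2, embed the points (steps 1-4), run K-means
    on Y's rows, and keep the sigma^2 with the smallest distortion."""
    best_s2, best_dist = None, np.inf
    for s2 in candidates:
        # Steps 1-4: affinity, L = D^{-1/2} A D^{-1/2}, top-k eigenvectors,
        # rows renormalized to unit length.
        sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
        A = np.exp(-sq / (2.0 * s2))
        np.fill_diagonal(A, 0.0)
        d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
        L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
        X = np.linalg.eigh(L)[1][:, -k:]
        Y = X / np.linalg.norm(X, axis=1, keepdims=True)
        # Step 5: K-means; distortion = sum of squared distances to centroids.
        rng = np.random.default_rng(seed)
        C = Y[rng.choice(len(Y), k, replace=False)]
        for _ in range(50):
            lab = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(axis=1)
            C = np.array([Y[lab == j].mean(axis=0) if (lab == j).any() else C[j]
                          for j in range(k)])
        dist = ((Y - C[lab]) ** 2).sum()
        if dist < best_dist:
            best_s2, best_dist = s2, dist
    return best_s2
```

Because the distortion is computed on the rows of Y rather than on the raw points, this selection criterion directly measures the tightness that Theorem 2 predicts for a good σ^2.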
Giving the algorithm only the coordinates of the points and k, the different clusters found are shown in the Figure via the different symbols (and colors, where available). The results are surprisingly good: Even for clusters that do not form convex regions or that are not cleanly separated (such as in Figure 1g), the algorithm reliably finds clusterings consistent with what a human would have chosen.

We note that there are other, related algorithms that can give good results on a subset of these problems, but we are aware of no equally simple algorithm that can give results comparable to these. For example, we noted earlier how K-means easily fails when clusters do not correspond to convex regions (Figure 1i). Another alternative may be a simple "connected components" algorithm that, for a threshold τ, draws an edge between points s_i and s_j whenever ‖s_i − s_j‖_2 ≤ τ, and takes the resulting connected components to be the clusters. Here, τ is a parameter that can (say) be optimized to obtain the desired number of clusters k. The result of this algorithm on the threecircles-joined dataset with k = 3 is shown in Figure 1j. One of the "clusters" it found consists of a singleton point at (1.5, 2). It is clear that this method is very non-robust.

We also compare our method to the algorithm of Meila and Shi [6] (see Figure 1k). Their method is similar to ours, except for the seemingly cosmetic difference that they normalize A's rows to sum to 1 and use the eigenvectors of that matrix instead of L's, and do not renormalize the rows of X to unit length. A refinement of our analysis suggests that their method might be susceptible to bad clusterings when the degree to which different clusters are connected (Σ_j d_j^(i)) varies substantially across clusters.
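The connected-components baseline discussed above can be sketched as follows (the helper name and flood-fill implementation are our own); its fragility is visible immediately, since a single point slightly farther than τ from its neighbors becomes its own "cluster":

```python
import numpy as np

def connected_components_cluster(S, tau):
    """Draw an edge between s_i and s_j whenever ||s_i - s_j|| <= tau,
    and label each connected component as one cluster."""
    n = len(S)
    dist = np.sqrt(((S[:, None, :] - S[None, :, :]) ** 2).sum(-1))
    adj = dist <= tau
    labels = np.full(n, -1)
    c = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = c
        stack = [i]                      # flood fill of one component
        while stack:
            j = stack.pop()
            nbrs = np.where(adj[j] & (labels < 0))[0]
            labels[nbrs] = c
            stack.extend(nbrs.tolist())
        c += 1
    return labels
```

Note that the number of clusters is whatever the threshold happens to produce, which is why τ must be tuned per dataset and why an outlier can split off as a singleton, as on threecircles-joined.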
\n\n3 Briefiy, we let the first cluster centroid be a randomly chosen row of Y , and then \nrepeatedly choose as the next centroid the row of Y that is closest to being 90\u00b0 from \nall the centroids (formally, from the worst-case centroid) already picked. The resulting \nK-means was run only once (no restarts) to give the results presented. K-means with the \nmore conventional random initialization and a small number of restarts also gave identical \nresults. In contrast, our implementation of Meila and Shi's algorithm used 2000 restarts. \n\n\fflips,8clusten \n\no \no \n\n(a) \n\nsquigg les, 4 clusteNl \n\n(b) \n\n(c) \n\nth reeci~es-joiJ\\ed,2c1ust8fS \n\n(d) \n\n(e) \n\n(f) \n\nIhreecirdes_joined,3clusters \n\nRowsoJYOittered ,\n\nrarKlomlysubsa mpled) lorlW