{"title": "Limits of Spectral Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 857, "page_last": 864, "abstract": null, "full_text": " Limits of Spectral Clustering

 Ulrike von Luxburg and Olivier Bousquet
 Max Planck Institute for Biological Cybernetics
 Spemannstr. 38, 72076 Tübingen, Germany
 {ulrike.luxburg,olivier.bousquet}@tuebingen.mpg.de

 Mikhail Belkin
 The University of Chicago, Department of Computer Science
 1100 E 58th St., Chicago, USA
 misha@cs.uchicago.edu

 Abstract

An important aspect of clustering algorithms is whether the partitions constructed on finite samples converge to a useful clustering of the whole data space as the sample size increases. This paper investigates this question for normalized and unnormalized versions of the popular spectral clustering algorithm. Surprisingly, the convergence of unnormalized spectral clustering is more difficult to handle than the normalized case. Even though recently some first results on the convergence of normalized spectral clustering have been obtained, for the unnormalized case we have to develop a completely new approach combining tools from numerical integration, spectral and perturbation theory, and probability. It turns out that while in the normalized case, spectral clustering usually converges to a nice partition of the data space, in the unnormalized case the same only holds under strong additional assumptions which are not always satisfied. We conclude that our analysis gives strong evidence for the superiority of normalized spectral clustering. It also provides a basis for future exploration of other Laplacian-based methods.

1 Introduction

Clustering algorithms partition a given data set into several groups based on some notion of similarity between objects. The problem of clustering is inherently difficult and often lacks clear criteria of "goodness". 
Despite the difficulties in determining the quality of a given partition, it is still possible to study desirable properties of clustering algorithms from a theoretical point of view. In this paper we study the consistency of spectral clustering, which is an important property in the general framework of statistical pattern recognition. A clustering algorithm is consistent if it produces a well-defined (and, hopefully, sensible) partition, given sufficiently many data points. Consistency is a basic sanity check: an algorithm which is not consistent would change the partition indefinitely as we add points to the dataset, and, consequently, no reasonable small-sample performance could be expected at all. Surprisingly, relatively little research into the consistency of clustering algorithms has been done so far, the only exceptions being k-centers (Pollard, 1981) and linkage algorithms (Hartigan, 1985).

While finite-sample properties of spectral clustering have been studied from a theoretical point of view (Spielman and Teng, 1996; Guattery and Miller, 1998; Kannan et al., 2000; Ng et al., 2001; Meila and Shi, 2001), we focus on the limit behavior as the sample size tends to infinity. In this paper we develop a new strategy to prove convergence results for spectral clustering algorithms. Unlike our previous attempts, this strategy allows us to obtain results for both normalized and unnormalized spectral clustering. As a first result we recover the main theorem of von Luxburg et al. (2004), which had been proved with different and more restrictive methods, and which, in brief, states that normalized spectral clustering usually converges. We also extend that result to the case of multiple eigenvectors. Our second result concerns the case of unnormalized spectral clustering, for which no convergence properties had been known so far. 
This case is much more difficult to treat than the normalized case, as the limit operators have a more complicated form. We show that unnormalized spectral clustering also converges, but only under strong additional assumptions. In contrast to the normalized case, those assumptions are not always satisfied, as we show by constructing an example, and in this case there is no hope for convergence. Even worse, on a finite sample it is impossible to verify whether the assumptions hold or not. As a third result we prove statements about the form of the limit clustering. It turns out that in the case of convergence, the structure of the clustering constructed on finite samples is preserved in the limit process. From this we can conclude that if convergence takes place, then the limit clustering presents an intuitively appealing partition of the data space.

It is also interesting to note that several recent methods for semi-supervised and transductive learning are based on eigenvectors of similarity graphs (cf. Belkin and Niyogi, 2003; Chapelle et al., 2003; Zhu et al., 2003). Our theoretical framework can also be applied to investigate the consistency of those algorithms with respect to the unlabeled data.

There is an ongoing debate on the advantages of the normalized versus unnormalized graph Laplacians for spectral clustering. It has been found empirically that the normalized version performs as well as or better than the unnormalized version (e.g., Van Driessche and Roose, 1995; Weiss, 1999; in the context of semi-supervised learning see also Zhou et al., 2004). We are now able to provide additional evidence for this from a theoretical point of view. Normalized spectral clustering is a well-behaved algorithm which always converges to a sensible limit clustering. 
Unnormalized spectral clustering, on the other hand, should be treated with care, as consistency can only be asserted under strong assumptions which are not always satisfied and, moreover, are difficult to check in practice.

2 Graph Laplacians and spectral clustering on finite samples

In the following we denote by σ(T) the spectrum of a linear operator T, by C(X) the space of continuous functions on X with the infinity norm, and by rg(d) the range of a function d ∈ C(X). For given sample points X_1, ..., X_n drawn iid according to an (unknown) distribution P on some data space X, we denote the empirical distribution by P_n. For a non-negative, symmetric similarity function s : X × X → R we define the similarity matrix as K_n := (s(X_i, X_j))_{i,j=1,...,n}, set d_i := Σ_{j=1}^n s(X_i, X_j), and define the degree matrix D_n as the diagonal matrix with entries d_i. The unnormalized Laplacian matrix is defined as L_n := D_n − K_n, and two common ways of normalizing it are L'_n := D_n^{-1/2} L_n D_n^{-1/2} or L''_n := D_n^{-1} L_n. In the following we always arrange the eigenvalues of the Laplacian matrices in non-decreasing order 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n, respecting their multiplicities. In its simplest form, unnormalized (resp. normalized) spectral clustering partitions the sample points X_i into two groups according to whether the i-th coordinate of the second eigenvector is larger or smaller than a certain threshold b ∈ R. Often, instead of considering only the second eigenvector, one uses the first r eigenvectors (for some small number r) simultaneously to obtain a partition into several sets. For an overview of different spectral clustering algorithms see for example Weiss (1999).

3 Limit results

In this section we want to state and discuss our main results. 
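Before turning to the limit results, the finite-sample procedure of Section 2 can be made concrete with a minimal numerical sketch. The Gaussian similarity, the toy two-bump sample, and the threshold b = 0 below are illustrative choices not prescribed by the text, which only assumes s non-negative, symmetric, and continuous:

```python
import numpy as np

# Illustrative sample: two well-separated one-dimensional clusters.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 0.3, 20), rng.normal(4.0, 0.3, 20)])

K = np.exp(-(X[:, None] - X[None, :]) ** 2)   # similarity matrix K_n (Gaussian choice)
d = K.sum(axis=1)                             # degrees d_i
L = np.diag(d) - K                            # unnormalized Laplacian L_n = D_n - K_n
L_sym = L / np.sqrt(np.outer(d, d))           # normalized L'_n = D_n^{-1/2} L_n D_n^{-1/2}

# Eigenvalues in non-decreasing order; threshold the second eigenvector at b = 0.
eigvals, eigvecs = np.linalg.eigh(L_sym)
labels = (eigvecs[:, 1] > 0.0).astype(int)
```

Up to a global sign of the eigenvector, thresholding its coordinates splits the sample into the two bumps.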
The general assumptions in the following three theorems are that the data space X is a compact metric space from which the sample points (X_i)_{i∈N} are drawn independently according to an unknown probability distribution P. Moreover, we require the similarity function s : X × X → R to be continuous, and in the normalized case to be bounded away from 0, that is, s(x, y) > l > 0 for all x, y ∈ X and some l ∈ R. By d ∈ C(X) we will denote the "degree function", and U and U' will denote the "limit operators" of (1/n)L_n and L'_n for n → ∞. The exact definitions of these functions and operators, as well as all further mathematical details, definitions, and proofs, are postponed to Section 4.

Let us start with the first question raised in the introduction: does the spectral clustering constructed on a finite sample converge to a partition of the whole data space as the sample size increases? In the normalized case, convergence results have recently been obtained in von Luxburg et al. (2004). However, those methods were specifically designed for the normalized Laplacian and cannot be used in the unnormalized case. Here we state a convergence result for the normalized case in the form in which it can be obtained with our new methods. The theorem is formulated for the symmetric normalization L'_n, but it holds similarly for the normalization L''_n.

Theorem 1 (Convergence of normalized spectral clustering) Under the general assumptions, if the first r eigenvalues of the limit operator U' have multiplicity 1, then the same holds for the first r eigenvalues of L'_n for sufficiently large n. In this case, the first r eigenvalues of L'_n converge to the first r eigenvalues of U', and the corresponding eigenvectors converge almost surely. 
The partitions constructed by normalized spectral clustering from the first r eigenvectors on finite samples converge almost surely to a limit partition of the whole data space.

Our new result about convergence in the unnormalized case is the following:

Theorem 2 (Convergence of unnormalized spectral clustering) Under the general assumptions, if the first r eigenvalues of the limit operator U have multiplicity 1 and do not lie in rg(d), then the same holds for the first r eigenvalues of (1/n)L_n for sufficiently large n. In this case, the first r eigenvalues of (1/n)L_n converge to the first r eigenvalues of U, and the corresponding eigenvectors converge almost surely. The partitions constructed by unnormalized spectral clustering from the first r eigenvectors on finite samples converge almost surely to a limit partition of the whole data space.

At first glance, this theorem looks very similar to Theorem 1: if the general assumptions are satisfied and the first eigenvalues are "nice", then unnormalized spectral clustering converges. However, the difference between Theorems 1 and 2 lies in what it means for an eigenvalue to be "nice". In Theorem 1 we only require the eigenvalues to have multiplicity 1 (and in fact, if the multiplicity is larger than 1 we can still prove convergence of eigenspaces instead of eigenvectors). In Theorem 2, however, the condition λ ∉ rg(d) has to be satisfied. In the proof this is needed to ensure that the eigenvalue λ is isolated in the spectrum of U, which is a fundamental requirement for applying perturbation theory to the convergence of eigenvectors. If this condition is not satisfied, perturbation theory is in principle unsuitable for obtaining convergence results for eigenvectors. 
The reason why this condition appears in the unnormalized case but not in the normalized case lies in the structure of the respective limit operators, which, surprisingly, is more complicated in the unnormalized case than in the normalized one. In the next section we will construct an example where the second eigenvalue indeed lies within rg(d). This means that there actually exist situations in which Theorem 2 cannot be applied, and hence unnormalized spectral clustering might not converge.

Now we turn to the second question raised in the introduction: in case of convergence, is the limit clustering a reasonable clustering of the whole data space? To answer this question we analyze the structure of the limit operators (for simplicity we state this for the unnormalized case only). Assume that we are given a partition X = ∪_{i=1}^k X_i of the data space into k disjoint sets. If we order the sample points according to their memberships in the sets X_i, then we can write the Laplacian as a block matrix L_n = (L_{ij,n})_{i,j=1,...,k}, where each sub-matrix L_{ij,n} contains the rows of L_n corresponding to points in set X_i and the columns corresponding to points in X_j. In a similar way, the limit operator U can be decomposed into a matrix of operators U_{ij} : C(X_j) → C(X_i). Now we can show that for all i, j = 1, ..., k the sub-matrices (1/n)L_{ij,n} converge to the corresponding sub-operators U_{ij} such that their spectra converge in the same way as in Theorems 1 and 2. This is a very strong result, as it means that for every given partition of X, the structure of the operators is preserved in the limit process.

Theorem 3 (Structure of the limit operators) Let X = ∪_{i=1}^k X_i be a partition of the data space. Let L_{ij,n} be the sub-matrices of L_n introduced above, U_{ij} : C(X_j) → C(X_i) the restrictions of U corresponding to the sets X_i and X_j, and L'_{ij,n} and U'_{ij} the analogous quantities for the normalized case. 
Then under the general assumptions, (1/n)L_{ij,n} converges compactly to U_{ij} a.s., and L'_{ij,n} converges compactly to U'_{ij} a.s.

With this result it is then possible to give a first answer to the question of what the limit partitions look like. In Meila and Shi (2001) it was established that normalized spectral clustering tries to find a partition such that a random walk on the sample points tends to stay within each of the partition sets X_i instead of jumping between them. With the help of Theorem 3, the same can now be said for the normalized limit partition, and this can also be extended to the unnormalized case. The operators U and U' can be interpreted as diffusion operators on the data space. The limit clusterings try to find a partition such that the diffusion tends to stay within the sets X_i instead of jumping between them. In particular, the limit partition segments the data space into sets such that the similarity within the sets is high and the similarity between the sets is low, which intuitively is what clustering is supposed to do.

4 Mathematical details

In this section we explain the general constructions and steps needed to prove Theorems 1, 2, and 3. However, as the proofs are rather technical, we only present proof sketches that convey the overall strategy. Detailed proofs can be found in von Luxburg (2004), where all proofs are spelled out in full length. Moreover, we will focus on the proof of Theorem 2, as the other results can be proved similarly.

To be able to define convergence of linear operators, all operators have to act on the same space. As this is not the case for the matrices L_n for different n, for each L_n we will construct a related operator U_n on the space C(X) which will be used instead of L_n. In Step 2 we show that the interesting eigenvalues and eigenvectors of (1/n)L_n and U_n are in a one-to-one relationship. 
Then, in Step 3, we prove that the U_n converge in a strong sense to some limit operator U on C(X). As we show in Step 4, this convergence implies the convergence of the eigenvalues and eigenvectors of U_n. Finally, assembling the parts finishes the proof of Theorem 2.

Step 1: Construction of the operators U_n on C(X).

We first define the empirical and true degree functions in C(X) as

 d_n(x) := ∫ s(x, y) dP_n(y) and d(x) := ∫ s(x, y) dP(y).

Corresponding to the matrices D_n and K_n we introduce the following multiplication and integral operators on C(X):

 M_{d_n} f(x) := d_n(x) f(x) and M_d f(x) := d(x) f(x),

 S_n f(x) := ∫ s(x, y) f(y) dP_n(y) and S f(x) := ∫ s(x, y) f(y) dP(y).

Note that d_n(X_i) = (1/n) d_i, and for f ∈ C(X) and v := (f(X_1), ..., f(X_n)) it holds that (1/n)(D_n v)_i = M_{d_n} f(X_i) and (1/n)(K_n v)_i = S_n f(X_i). Hence the function d_n and the operators M_{d_n} and S_n are the counterparts of the discrete degrees (1/n) d_i and the matrices (1/n) D_n and (1/n) K_n. The scaling factor 1/n comes from the hidden 1/n-factor in the empirical distribution P_n. The natural pointwise limits of d_n, M_{d_n}, and S_n for n → ∞ are given by d, M_d, and S. The operators corresponding to the unnormalized Laplacian (1/n)L_n = (1/n)(D_n − K_n) and its limit operator are

 U_n f(x) := M_{d_n} f(x) − S_n f(x) and U f(x) := M_d f(x) − S f(x).

Step 2: Relations between σ((1/n)L_n) and σ(U_n).

Proposition 4 (Spectral properties) 1. The spectrum of U_n consists of rg(d_n), plus some isolated eigenvalues with finite multiplicity. The same holds for U and rg(d).
 2. If f ∈ C(X) is an eigenfunction of U_n with arbitrary eigenvalue λ, then the vector v ∈ R^n with v_i = f(X_i) is an eigenvector of the matrix (1/n)L_n with eigenvalue λ.
 3. 
If v is an eigenvector of the matrix (1/n)L_n with eigenvalue λ ∉ rg(d_n), then the function f(x) = (1/n)(Σ_j s(x, X_j) v_j)/(d_n(x) − λ) is the unique eigenfunction of U_n with eigenvalue λ satisfying f(X_i) = v_i.

Proof. It is well known that the (essential) spectrum of a multiplication operator coincides with the range of the multiplier function. Moreover, the spectrum of the sum of a bounded operator and a compact operator contains the essential spectrum of the bounded operator; additionally, it can only contain some isolated eigenvalues with finite multiplicity (e.g., Theorem IV.5.35 in Kato, 1966). The proofs of the other parts of this proposition can be obtained by elementary manipulations of the eigenvalue equations and are skipped.

Step 3: Convergence of U_n to U.

Dealing with the randomness. Recall that the operators U_n are random operators, as they depend on the given sample points X_1, ..., X_n via the empirical distribution P_n. One important tool to cope with this randomness is the following proposition:

Proposition 5 (Glivenko-Cantelli class) Let (X, d) be a compact metric space and s : X × X → R continuous. Then F := {s(x, ·); x ∈ X} is a Glivenko-Cantelli class, that is, sup_{x∈X} |∫ s(x, y) dP_n(y) − ∫ s(x, y) dP(y)| → 0 almost surely.

Proof. This proposition follows from Theorem 2.4.1 of van der Vaart and Wellner (1996).

Note that one direct consequence of this proposition is that ||d_n − d||_∞ → 0 a.s.

Types of convergence. Let E be an arbitrary Banach space and B its unit ball. A sequence (S_n)_n of linear operators on E is called collectively compact if the set ∪_n S_n B is relatively compact in E (with respect to the norm topology). A sequence of operators converges collectively compactly if it converges pointwise and if there exists some N ∈ N such that the operators (S_n − S)_{n>N} are collectively compact. 
A sequence of operators converges compactly if it converges pointwise and if for every sequence (x_n)_n in B, the sequence ((S − S_n) x_n)_n is relatively compact. See Anselone (1971) and Chatelin (1983) for background reading. A sequence (x_n)_n in E converges up to a change of sign to x ∈ E if there exists a sequence (a_n)_n of signs a_n ∈ {−1, +1} such that the sequence (a_n x_n)_n converges to x.

Proposition 6 (U_n converges compactly to U a.s.) Let X be a compact metric space and s : X × X → R continuous. Then U_n converges to U compactly a.s.

Proof. (a) S_n converges to S collectively compactly a.s. With the help of the Glivenko-Cantelli property in Proposition 5 it is easy to see that S_n converges to S pointwise, that is, ||S_n f − S f||_∞ → 0 a.s. for all f ∈ C(X). As the limit operator S is compact, to prove that (S_n − S)_n is collectively compact a.s. it is enough to prove that (S_n)_n is collectively compact a.s. This can be done with the Arzelà-Ascoli theorem.
 (b) M_{d_n} converges to M_d in operator norm a.s. This is a direct consequence of the Glivenko-Cantelli property of Proposition 5.
 (c) U_n = M_{d_n} − S_n converges to U = M_d − S compactly a.s. Both operator norm convergence and collectively compact convergence imply compact convergence (cf. Proposition 3.18 of Chatelin, 1983). Moreover, it is easy to see that the sum of two compactly converging sequences of operators converges compactly.

Step 4: Convergence of the eigenfunctions of U_n to those of U.

It is a result of perturbation theory (see the comprehensive treatment in Chatelin, 1983, especially Section 5.1) that compact convergence of operators implies the convergence of eigenvalues and spectral projections in the following way: if λ is an isolated eigenvalue in σ(U) with finite multiplicity, then there exists a sequence λ_n ∈ σ(U_n) of isolated eigenvalues with finite multiplicity such that λ_n → λ. 
If the first r eigenvalues of T have multiplicity 1, then the same holds for the first r eigenvalues of T_n for sufficiently large n, and the i-th eigenvalue of T_n converges to the i-th eigenvalue of T. The corresponding eigenvectors converge up to a change of sign. If the multiplicity of an eigenvalue is larger than 1 but finite, then the corresponding eigenspaces converge. Note that for eigenvalues which are not isolated in the spectrum, convergence cannot be asserted, and the same holds for the corresponding eigenvectors (e.g., Section IV.3 of Kato, 1966).
In our case, by Proposition 4 we know that the spectrum of U consists of the whole interval rg(d), plus possibly some isolated eigenvalues. Hence an eigenvalue λ ∈ σ(U) is isolated in the spectrum iff λ ∉ rg(d) holds, in which case convergence holds as stated above.

Step 5: Convergence of unnormalized spectral clustering.

Now we can put together the different parts. In the first two steps we transferred the problem of the convergence of the eigenvectors of (1/n)L_n to the convergence of the eigenfunctions of U_n. In Step 3 we showed that U_n converges compactly to the limit operator U, which according to Step 4 implies the convergence of the eigenfunctions of U_n. In terms of the eigenvectors of (1/n)L_n this means the following: if λ denotes the j-th eigenvalue of U with eigenfunction f ∈ C(X) and λ_n the j-th eigenvalue of (1/n)L_n with eigenvector v_n = (v_{n,1}, ..., v_{n,n}), then there exists a sequence of signs a_n ∈ {−1, +1} such that sup_{i=1,...,n} |a_n v_{n,i} − f(X_i)| → 0 a.s. As spectral clustering is constructed from the coordinates of the eigenvectors, this leads to the convergence of spectral clustering in the unnormalized case. This completes the proof of Theorem 2.

The proof of Theorem 1 can be obtained in a very similar way. 
Here the limit operator is

 U' f(x) := (I − S') f(x) := f(x) − ∫ ( s(x, y) / √(d(x) d(y)) ) f(y) dP(y).

The main difference to the unnormalized case is that the operator M_d in U is replaced by the identity operator I in U'. This simplifies matters, as one can easily express the spectrum of I − S' via the spectrum of the compact operator S'. From a different point of view, consider the identity operator as the operator of multiplication by the constant-one function 1. Its range is the single point rg(1) = {1}, and hence the critical interval rg(d) ⊂ σ(U) shrinks to the point 1 ∈ σ(U'), which in general is a non-isolated eigenvalue with infinite multiplicity.

Finally, note that it is also possible to prove more general versions of Theorems 1 and 2 where the eigenvalues have finite multiplicity larger than 1. Instead of the convergence of the eigenvectors we then obtain the convergence of the projections on the eigenspaces.

The proof of Theorem 3 works along the same lines as those of the other two theorems. The exact definitions of the operators considered in this case are

 U'_{ij} : C(X_j) → C(X_i), U'_{ij} f_j(x) := δ_{ij} f_i(x) − ∫ ( s_{ij}(x, y) / √(d_i(x) d_j(y)) ) f_j(y) dP_j(y),

 U_{ij} : C(X_j) → C(X_i), U_{ij} f_j(x) := δ_{ij} d_i(x) f_i(x) − ∫ s_{ij}(x, y) f_j(y) dP_j(y),

where d_i, f_i, P_i, and s_{ij} denote the restrictions of the corresponding functions to X_i and X_i × X_j, respectively, and δ_{ij} is 1 if i = j and 0 otherwise. For the diffusion interpretation, note that if there exists an ideal partition of the data space (that is, s(x_i, x_j) = 0 for x_i, x_j in different sets X_i and X_j), then the off-diagonal operators U'_{ij} and U_{ij} with i ≠ j vanish, and the first k eigenvectors of U and U' can be reconstructed from the piecewise constant eigenvectors of the diagonal operators U'_{ii} and U_{ii}. In this situation, spectral clustering recovers the ideal clustering. 
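This ideal case is easy to illustrate numerically. In the following sketch the block sizes and the constant similarity values are arbitrary toy choices; the point is that with vanishing off-diagonal blocks, the null space of the unnormalized Laplacian is spanned by vectors that are piecewise constant on the clusters:

```python
import numpy as np

# An "ideal partition": s(x_i, x_j) = 0 for points in different sets, so the
# off-diagonal Laplacian blocks vanish.
n1, n2 = 3, 4
K = np.zeros((n1 + n2, n1 + n2))
K[:n1, :n1] = 2.0          # constant similarity inside the first cluster
K[n1:, n1:] = 1.0          # constant similarity inside the second cluster

d = K.sum(axis=1)
L = np.diag(d) - K          # block-diagonal unnormalized Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
# Eigenvalue 0 has multiplicity 2, and every vector in its eigenspace is
# piecewise constant on the two clusters (a combination of the indicators).
```

The same piecewise constant structure survives approximately when small cross-similarities are added, which is the perturbation argument of the text.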
If there exists no ideal clustering, but there exists a partition such that the off-diagonal operators are "small" and the diagonal operators are "large", then it can be seen by perturbation theory arguments that spectral clustering will find such a partition. The off-diagonal operators can be interpreted as diffusion operators between different clusters (note that even in the unnormalized case, the multiplication operator only appears in the diagonal operators). Hence, constructing a clustering with small off-diagonal operators corresponds to a partition such that little diffusion between the clusters takes place.

Finally, we want to construct an example where the second eigenvalue λ of U satisfies λ ∈ rg(d). Let X = [1, 2] ⊂ R, s(x, y) := xy, and p a piecewise constant probability density on X with p(x) = c if 4/3 ≤ x < 5/3 and p(x) = (3 − c)/2 otherwise, for some fixed constant c ∈ [0, 3] (e.g., for small c this density has two clearly separated high-density regions). The degree function in this case is d(x) = 1.5x (independently of c) and has range [1.5, 3] on X. One can check that an eigenfunction of U for an eigenvalue λ ∉ rg(d) has the form f(x) = βx/(1.5x − λ) for some constant β, where the equation g(λ) := ∫_1^2 x²/(1.5x − λ) p(x) dx = 1 has to be satisfied. This means that λ ∉ rg(d) is an eigenvalue of U iff g(λ) = 1. For our simple density function p, this integral can be solved analytically. It can then be seen that g(λ) = 1 is only satisfied for λ = 0, hence the only eigenvalue outside of rg(d) is the trivial eigenvalue 0.

Note that in applications of spectral clustering, we do not know the limit operator U and hence cannot test whether its relevant eigenvalues lie in its essential spectrum rg(d) or not. 
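As a sanity check, the example can also be examined numerically on a finite sample; the sample size n and the constant c below are arbitrary choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 500, 0.3

# Sample from p: p(x) = c on [4/3, 5/3), p(x) = (3 - c)/2 elsewhere on [1, 2];
# the middle interval carries probability mass c/3.
mid = rng.random(n) < c / 3.0
outer = np.where(rng.random(n) < 0.5,
                 rng.uniform(1.0, 4/3, n),
                 rng.uniform(5/3, 2.0, n))
X = np.where(mid, rng.uniform(4/3, 5/3, n), outer)

K = np.outer(X, X)                 # s(x, y) = xy
d = K.sum(axis=1)                  # unnormalized degrees d_i
eigvals = np.sort(np.linalg.eigvalsh((np.diag(d) - K) / n))  # spectrum of (1/n)L_n

lo, hi = d.min() / n, d.max() / n  # empirical estimate of rg(d) = [1.5, 3]
# The second eigenvalue falls inside this critical interval, as predicted:
# apart from the trivial eigenvalue 0, the spectrum sits inside rg(d).
```

Comparing the low eigenvalues of (1/n)L_n with the empirical range of the degrees d_i/n thus makes the problem visible on a finite sample.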
If, for some special reason, one really wants to use unnormalized spectral clustering, one should at least estimate the critical region rg(d) by [min_i d_i/n, max_i d_i/n] and check whether the relevant eigenvalues of (1/n)L_n are inside or close to this interval. This observation gives an indication of whether the results obtained can be considered reliable, but it is not a valid statistical test.

5 Conclusions

We have shown that under standard assumptions, normalized spectral clustering always converges to a limit partition of the whole data space which depends only on the probability distribution P and the similarity function s. For unnormalized spectral clustering, this can only be guaranteed under the strong additional assumption that the first eigenvalues of the Laplacian do not fall inside the range of the degree function. As shown by our example, this condition has to be taken seriously.

Consistency results are a basic sanity check for the behavior of statistical learning algorithms. Algorithms which do not converge cannot be expected to exhibit reliable results on finite samples. Therefore, in the light of our theoretical analysis, we assert that the normalized version of spectral clustering should be preferred in practice. This suggestion also extends to other applications of graph Laplacians, including semi-supervised learning.

References

P. Anselone. Collectively Compact Operator Approximation Theory. Prentice-Hall, 1971.

M. Belkin and P. Niyogi. Using manifold structure for partially labeled classification. In Advances in Neural Information Processing Systems 15, 2003.

O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In Advances in Neural Information Processing Systems 15, 2003.

F. Chatelin. Spectral Approximation of Linear Operators. Academic Press, 1983.

S. Guattery and G. L. Miller. On the quality of spectral separators. 
SIAM Journal on Matrix Analysis and Applications, 19(3), 1998.

J. Hartigan. Statistical theory in clustering. Journal of Classification, 2:63–76, 1985.

R. Kannan, S. Vempala, and A. Vetta. On clusterings - good, bad and spectral. Technical report, Computer Science Department, Yale University, 2000.

T. Kato. Perturbation Theory for Linear Operators. Springer, Berlin, 1966.

M. Meila and J. Shi. A random walks view of spectral segmentation. In 8th International Workshop on Artificial Intelligence and Statistics, 2001.

A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 2001.

D. Pollard. Strong consistency of k-means clustering. Annals of Statistics, 9(1):135–140, 1981.

D. Spielman and S. Teng. Spectral partitioning works: planar graphs and finite element meshes. In 37th Annual Symposium on Foundations of Computer Science, 1996.

A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.

R. Van Driessche and D. Roose. An improved spectral bisection algorithm and its application to dynamic load balancing. Parallel Computing, 21(1), 1995.

U. von Luxburg. Statistical Learning with Similarity and Dissimilarity Functions. PhD thesis, draft, available at http://www.kyb.tuebingen.mpg.de/~ule, 2004.

U. von Luxburg, O. Bousquet, and M. Belkin. On the convergence of spectral clustering on random samples: the normalized case. In COLT, 2004.

Y. Weiss. Segmentation using eigenvectors: A unifying view. In Proceedings of the International Conference on Computer Vision, pages 975–982, 1999.

D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, 2004.

X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. 
In ICML, 2003.
", "award": [], "sourceid": 2692, "authors": [{"given_name": "Ulrike", "family_name": "Luxburg", "institution": null}, {"given_name": "Olivier", "family_name": "Bousquet", "institution": null}, {"given_name": "Mikhail", "family_name": "Belkin", "institution": null}]}