{"title": "Noise Thresholds for Spectral Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 954, "page_last": 962, "abstract": "Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results.", "full_text": "Noise Thresholds for Spectral Clustering\n\nSivaraman Balakrishnan\n\nMin Xu\n\nAkshay Krishnamurthy\n\nAarti Singh\n\nSchool of Computer Science, Carnegie Mellon University\n{sbalakri,minx,akshaykr,aarti}@cs.cmu.edu\n\nAbstract\n\nAlthough spectral clustering has enjoyed considerable empirical success in ma-\nchine learning, its theoretical properties are not yet fully developed. We analyze\nthe performance of a spectral algorithm for hierarchical clustering and show that\non a class of hierarchically structured similarity matrices, this algorithm can toler-\nate noise that grows with the number of data points while still perfectly recovering\nthe hierarchical clusters with high probability. We additionally improve upon pre-\nvious results for k-way spectral clustering to derive conditions under which spec-\ntral clustering makes no mistakes. 
Further, using minimax analysis, we derive\ntight upper and lower bounds for the clustering problem and compare the perfor-\nmance of spectral clustering to these information theoretic limits. We also present\nexperiments on simulated and real world data illustrating our results.\n\nIntroduction\n\n1\nClustering, a fundamental and ubiquitous problem in machine learning, is the task of organizing data\npoints into homogenous groups using a given measure of similarity. Two popular forms of clustering\nare k-way, where an algorithm directly partitions the data into k disjoint sets, and hierarchical,\nwhere the algorithm organizes the data into a hierarchy of groups. Popular algorithms for the k-way\nproblem include k-means, spectral clustering, and density-based clustering, while agglomerative\nmethods that merge clusters from the bottom up are popular for the latter problem.\nSpectral clustering algorithms embed the data points by projection onto a few eigenvectors of (some\nform of) the graph Laplacian matrix and use this spectral embedding to \ufb01nd a clustering. This\ntechnique has been shown to work on various arbitrarily shaped clusters and, in addition to being\nstraightforward to implement, often outperforms traditional clustering algorithms such as the k-\nmeans algorithm.\nReal world data is inevitably corrupted by noise and it is of interest to study the robustness of spectral\nclustering algorithms. This is the focus of our paper.\nOur main contributions are:\n\n\u2022 We leverage results from perturbation theory in a novel analysis of a spectral algorithm\nfor hierarchical clustering to understand its behavior in the presence of noise. We provide\nstrong guarantees on its correctness; in particular, we show that the amount of noise spectral\nclustering tolerates can grow rapidly with the size of the smallest cluster we want to resolve.\n\u2022 We sharpen existing results on k-way spectral clustering. 
In contrast with earlier work, we\nprovide precise error bounds through a careful characterization of a k-means style algo-\nrithm run on the spectral embedding of the data.\n\n\u2022 We also address the issue of optimal noise thresholds via the use of minimax theory. In\nparticular, we establish tight information-theoretic upper and lower bounds for cluster re-\nsolvability.\n\n1\n\n\f2 Related Work and De\ufb01nitions\n\nThere are several high-level justi\ufb01cations for the success of spectral clustering. The algorithm has\ndeep connections to various graph-cut problems, random walks on graphs, electric network theory,\nand via the graph Laplacian to the Laplace-Beltrami operator. See [16] for an overview.\nSeveral authors (see von Luxburg et. al. [17] and references therein) have shown various forms of\nasymptotic convergence for the Laplacian of a graph constructed from random samples drawn from\na distribution on or near a manifold. These results however often do not easily translate into precise\nguarantees for successful recovery of clusters, which is the emphasis of our work.\nThere has also been some theoretical work on spectral algorithms for cluster recovery in random\ngraph models. McSherry [9] studies the \u201ccluster-structured\u201d random graph model in which the\nprobability of adding an edge can vary depending on the clusters the edge connects. He considers a\nspecialization of this model, the planted partition model, which speci\ufb01es only two probabilities, one\nfor inter-cluster edges and another for intra-cluster edges. In this case, we can view the observed\nadjacency matrix as a random perturbation of a low rank \u201cexpected\u201d adjacency matrix which en-\ncodes the cluster membership. McSherry shows that one can recover the clusters from a low rank\napproximation of the observed (noisy) adjacency matrix. These results show that low-rank matrices\nhave spectra that are robust to noise. 
Our results, however, show that we can obtain similar insensitivity (to noise) guarantees for a class of interesting structured full-rank matrices, indicating that this robustness extends to a much broader class of matrices.\n\nMore recently, Rohe et al. [11] analyze spectral clustering in the stochastic block model (SBM), which is an example of a structured random graph. They consider the high-dimensional scenario where the number of clusters k grows with the number of data points n and show that under certain assumptions the average number of mistakes made by spectral clustering tends to 0 with increasing n. Our work on hierarchical clustering also has the same high-dimensional flavor since the number of clusters we resolve grows with n. However, in the hierarchical clustering setting, errors made at the bottom level propagate up the tree and we need to make precise arguments to ensure that the total number of errors tends to 0 with increasing n (see Theorem 1).\n\nSince Rohe et al. [11] and McSherry [9] consider random graph models, the \u201cnoise\u201d on each entry has bounded variance. We consider more general noise models and study the relation between errors in clustering and noise variance. Another related line of work is on the problem of spectrally separating mixtures of Gaussians [1, 2, 8].\n\nNg et al. [10] study k-way clustering and show that the eigenvectors of the graph Laplacian are stable in 2-norm under small perturbations. This justifies the use of k-means in the perturbed subspace since ideally, without noise, the spectral embedding by the top k eigenvectors of the graph Laplacian reflects the true cluster memberships. However, closeness in 2-norm does not translate into a strong bound on the total number of errors made by spectral clustering.\n\nHuang et al. 
[7] study the misclustering rate of spectral clustering under the somewhat unnatural assumption that every coordinate of the Laplacian\u2019s eigenvectors is perturbed by independent and identically distributed noise. In contrast, we specify our noise model as an additive perturbation to the similarity matrix, making no direct assumptions on how this affects the spectrum of the Laplacian. We show that the eigenvectors are stable in $\\ell_\\infty$-norm and use this result to precisely bound the misclustering rate of our algorithm.\n\n2.1 Definitions\n\nThe clustering problem can be defined as follows: Given an $(n \\times n)$ similarity matrix on $n$ data points, find a set $C$ of subsets of the points such that points belonging to the same subset have high similarity and points in different subsets have low similarity. Our first results focus on binary hierarchical clustering, which is formally defined as follows:\n\nDefinition 1 A hierarchical clustering $T$ on data points $\\{X_i\\}_{i=1}^n$ is a collection of clusters (subsets of the points) such that $C_0 := \\{X_i\\}_{i=1}^n \\in T$ and for any $C_i, C_j \\in T$, either $C_i \\subset C_j$, $C_j \\subset C_i$, or $C_i \\cap C_j = \\emptyset$. A binary hierarchical clustering $T$ is a hierarchical clustering such that for each non-atomic $C_k \\in T$, there exist two proper subsets $C_i, C_j \\in T$ with $C_i \\cap C_j = \\emptyset$ and $C_i \\cup C_j = C_k$. We label each cluster by a sequence $s$ of Ls and Rs so that $C_{s \\cdot L}$ and $C_{s \\cdot R}$ partition $C_s$, $C_{s \\cdot LL}$ and $C_{s \\cdot LR}$ partition $C_{s \\cdot L}$, and so on.\n\n[Figure 1(b), block layout of the ideal matrix: the off-diagonal blocks of cluster $C_s$ carry the range $[\\alpha_s, \\beta_s]$; inside its two diagonal blocks the off-diagonal ranges are $[\\alpha_{s \\cdot L}, \\beta_{s \\cdot L}]$ and $[\\alpha_{s \\cdot R}, \\beta_{s \\cdot R}]$, and one level further down $[\\alpha_{s \\cdot LL}, \\beta_{s \\cdot LL}]$, $[\\alpha_{s \\cdot LR}, \\beta_{s \\cdot LR}]$, $[\\alpha_{s \\cdot RL}, \\beta_{s \\cdot RL}]$ and $[\\alpha_{s \\cdot RR}, \\beta_{s \\cdot RR}]$.]\n\nFigure 1: (a): Two moons data set (Top). 
For a similarity function defined on the $\\epsilon$-neighborhood graph (Bottom), this data set forms an ideal matrix. (b) An ideal matrix for the hierarchical problem.\n\nIdeally, we would like that at all levels of the hierarchy, points within a cluster are more similar to each other than to points outside of the cluster. For a suitably chosen similarity function, a data set consisting of clusters that lie on arbitrary manifolds with complex shapes can result in this ideal case. As an example, in the two-moons data set in Figure 1(a), the popular technique of constructing a nearest neighbor graph and defining the distance between two points as the length of the longest edge on the shortest path between them results in an ideal similarity matrix. Other non-Euclidean similarity metrics (for instance density based similarity metrics [12]) can also allow for non-parametric cluster shapes.\n\nFor such ideal similarity matrices, we can show that the spectral clustering algorithm will deterministically recover all clusters in the hierarchy (see Theorem 5 in the appendix). However, since this ideal case does not hold in general, we focus on similarity matrices that can be decomposed into an ideal matrix and a high-variance noise term.\n\nDefinition 2 A similarity matrix $W$ is a noisy hierarchical block matrix (noisy HBM) if $W \\triangleq A + R$ where $A$ is ideal and $R$ is a perturbation matrix, defined as follows:\n\n\u2022 An ideal similarity matrix, shown in Figure 1(b), is characterized by ranges of off-block-diagonal similarity values $[\\alpha_s, \\beta_s]$ for each cluster $C_s$ such that if $x \\in C_{s \\cdot L}$ and $y \\in C_{s \\cdot R}$ then $\\alpha_s \\le A_{xy} \\le \\beta_s$. Additionally, $\\min\\{\\alpha_{s \\cdot R}, \\alpha_{s \\cdot L}\\} > \\beta_s$.\n\u2022 A symmetric $(n \\times n)$ matrix $R$ is a perturbation matrix with parameter $\\sigma$ if (a) $E(R_{ij}) = 0$, (b) the entries of $R$ are subgaussian, that is $E(\\exp(t R_{ij})) \\le \\exp(\\sigma^2 t^2 / 2)$, and (c) for each row $i$, $R_{i1}, \\ldots, R_{in}$ are independent.\n\nThe perturbations we consider are quite general and can accommodate bounded (with $\\sigma$ upper bounded by the range), Gaussian (where $\\sigma$ is the standard deviation), and several other common distributions. This model is well-suited to noise that arises from the direct measurement of similarities. It is also possible to assume instead that the measurements of individual data points are noisy, though we do not focus on this case in our paper.\n\nIn the k-way case, we consider the following similarity matrix, which is studied by Ng et al. [10].\n\nDefinition 3 $W$ is a noisy k-Block Diagonal matrix if $W \\triangleq A + R$ where $R$ is a perturbation matrix and $A$ is an ideal matrix for the k-way problem. An ideal matrix for the k-way problem has within-cluster similarities larger than $\\beta_0 > 0$ and between-cluster similarities 0.\n\nFinally, we define the combinatorial Laplacian matrix, which will be the focus of our spectral algorithm and our subsequent analysis.\n\nDefinition 4 The combinatorial Laplacian $L$ of a matrix $W$ is defined as $L \\triangleq D - W$ where $D$ is a diagonal matrix with $D_{ii} \\triangleq \\sum_{j=1}^{n} W_{ij}$.\n\nWe note that other analyses of spectral clustering have studied other Laplacian matrices, particularly the normalized Laplacians defined as $L_n \\triangleq D^{-1} L$ and $L_n \\triangleq D^{-1/2} L D^{-1/2}$. However, as we show in Appendix E, the normalized Laplacian can mis-cluster points even for an ideal noiseless similarity matrix.\n\nAlgorithm 1 HS\ninput (noisy) $n \\times n$ similarity matrix $W$\n  Compute Laplacian $L = D - W$\n  $v_2 \\leftarrow$ smallest non-constant eigenvector of $L$\n  $C_1 \\leftarrow \\{i : v_2(i) \\ge 0\\}$, $C_2 \\leftarrow \\{j : v_2(j) < 0\\}$\n  $C \\leftarrow \\{C_1, C_2\\} \\cup$ HS$(W_{C_1}) \\cup$ HS$(W_{C_2})$\noutput $C$\n\nFigure 2: An ideal matrix and a noisy HBM. Clusters at finer granularity are masked by noise.\n\nAlgorithm 2 K-WAY SPECTRAL\ninput (noisy) $n \\times n$ similarity matrix $W$, number of clusters $k$\n  Compute Laplacian $L = D - W$\n  $V \\leftarrow (n \\times k)$ matrix with columns $v_1, \\ldots, v_k$, where $v_i \\triangleq$ $i$th smallest eigenvector of $L$\n  $c_1 \\leftarrow V_1$ (the first row of $V$).\n  For $i = 2 \\ldots k$ let $c_i \\leftarrow \\mathrm{argmax}_{j \\in \\{1 \\ldots n\\}} \\min_{l \\in \\{1, \\ldots, i-1\\}} \\|V_j - V_{c_l}\\|_2$.\n  For $i = 1 \\ldots n$ set $c(i) = \\mathrm{argmin}_{j \\in \\{1 \\ldots k\\}} \\|V_i - V_{c_j}\\|_2$\noutput $C \\triangleq \\{\\{j \\in \\{1 \\ldots n\\} : c(j) = i\\}\\}_{i=1}^{k}$\n\n3 Algorithms and Main Results\n\nIn our analysis we study the algorithms for hierarchical and k-way clustering, outlined in Algorithms 1 and 2. Both of these algorithms take a similarity matrix $W$ and compute the eigenvectors corresponding to the smallest eigenvalues of the Laplacian of $W$. The algorithms then run simple procedures to recover the clustering from the spectral embedding of the data points by these eigenvectors. Our Algorithm 2 deviates slightly from the standard practice of running k-means in the perturbed subspace. We instead use the optimal algorithm for the k-center problem (Hochbaum-Shmoys [6]) because of its amenability to theoretical analysis. In this section we outline our main results; we sketch the proofs in the next section and defer full proofs to the Appendix.\n\nWe first state the following general assumptions, which we place on the ideal similarity matrix $A$:\n\nAssumption 1 For all $i, j$, $0 < A_{ij} \\le \\beta^*$ for some constant $\\beta^*$.\n\nAssumption 2 (Balanced clusters) There is a constant $\\eta \\ge 1$ such that at every split of the hierarchy $|C_{\\max}| / |C_{\\min}| \\le \\eta$, where $|C_{\\max}|, |C_{\\min}|$ are the sizes of the biggest and smallest clusters respectively.\n\nAssumption 3 (Range Restriction) For every cluster $s$, $\\min\\{\\alpha_{s \\cdot L}, \\alpha_{s \\cdot R}\\} - \\beta_s > \\eta(\\beta_s - \\alpha_s)$.\n\nIt is important to note that these assumptions are placed only on the ideal matrices. 
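As a rough illustration, Algorithm 1 (HS) can be sketched in pure Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the eigenvector is approximated by power iteration (any eigensolver would do), similarities are assumed nonnegative so that a Gershgorin bound gives a valid spectral shift, and the names hs, fiedler_vector, similarity and the min_split stopping parameter are hypothetical additions made for illustration.

```python
import math


def fiedler_vector(W, iters=500):
    # Approximate the smallest non-constant eigenvector of L = D - W by
    # power iteration on (shift * I - L), restricted to vectors that are
    # orthogonal to the all-ones vector.
    # Assumption: W is symmetric with nonnegative entries, so every
    # eigenvalue of L lies in [0, 2 * max degree] (Gershgorin bound).
    n = len(W)
    deg = [sum(row) for row in W]
    shift = 2.0 * max(deg) + 1.0
    # Deterministic, non-constant starting vector (hypothetical choice).
    v = [float((37 * i) % 11 - 5) for i in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # project out the constant eigenvector
        w = [
            (shift - deg[i]) * v[i] + sum(W[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def hs(W, points=None, min_split=2):
    # Recursively split by the sign of the eigenvector, as in Algorithm 1.
    # min_split is a hypothetical stopping rule: clusters with fewer than
    # min_split points are not split further.
    if points is None:
        points = list(range(len(W)))
    if len(points) < max(2, min_split):
        return []
    sub = [[W[i][j] for j in points] for i in points]
    v2 = fiedler_vector(sub)
    c1 = [p for p, x in zip(points, v2) if x >= 0]
    c2 = [p for p, x in zip(points, v2) if x < 0]
    if not c1 or not c2:
        return []
    return ([frozenset(c1), frozenset(c2)]
            + hs(W, c1, min_split) + hs(W, c2, min_split))


def similarity(i, j):
    # Noiseless three-level hierarchy on 8 points (illustrative values).
    if i // 2 == j // 2:
        return 1.0
    if i // 4 == j // 4:
        return 0.5
    return 0.1


W = [[similarity(i, j) for j in range(8)] for i in range(8)]
clusters = hs(W, min_split=4)
```

On this noiseless example the recursion first separates points 0 to 3 from points 4 to 7 and then splits each half into its two pairs. The min_split rule stands in for the fact that the analysis only guarantees recovery of clusters above a minimum size.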
The noisy HBMs can with high probability violate these assumptions.\n\nWe assume that the entries of $A$ are strictly greater than 0 for technical reasons; we believe, as confirmed empirically, that this restriction is not necessary for our results to hold. Assumption 2 says that at every level the largest cluster is only a constant fraction larger than the smallest. This can be relaxed, albeit at the cost of a worse rate. For the ideal matrix, Assumption 3 ensures that at every level of the hierarchy, the gap between the within-cluster similarities and between-cluster similarities is larger than the range of between-cluster similarities. Earlier papers [9, 11] assume that the ideal similarities are constant within a block, in which case the assumption is trivially satisfied by the definition of the ideal matrix. However, more generally this assumption is necessary to show that the entries of the eigenvector are safely bounded away from zero. If this assumption is violated by the ideal matrix, then the eigenvector entries can decay as fast as $O(1/n)$ (see Appendix E for more details), and our analysis shows that such matrices will no longer be robust to noise.\n\nOther analyses of spectral clustering often directly make less interpretable assumptions about the spectrum. For instance, Ng et al. [10] assume conditions on the eigengap of the normalized Laplacian, and this assumption implicitly creates constraints on the entries of the ideal matrix $A$ that can be hard to make explicit.\n\nTo state our theorems concisely we will define an additional quantity $\\gamma^\\star_S$. Intuitively, $\\gamma^\\star_S$ quantifies how close the ideal matrix comes to violating Assumption 3 over a set of clusters $S$.\n\nDefinition 5 For a set of clusters $S$, define $\\gamma^\\star_S \\triangleq \\min_{s \\in S} \\min\\{\\alpha_{s \\cdot L}, \\alpha_{s \\cdot R}\\} - \\beta_s - \\eta(\\beta_s - \\alpha_s)$.\n\nWe, as well as previous works [10, 11], rely on results from perturbation theory to bound the error in the observed eigenvectors in 2-norm. Using this approach, the straightforward way to analyze the number of errors is pessimistic since it assumes the difference between the two eigenvectors is concentrated on a few entries. However, we show that the perturbation is in fact generated by a random process and thus unlikely to be adversarially concentrated. We formalize this intuition to uniformly bound the perturbations on every entry and get a stronger guarantee.\n\nWe are now ready to state our main result for hierarchical spectral clustering. At a high level, this result gives conditions on the noise scale factor $\\sigma$ under which Algorithm HS will recover all clusters $s \\in S_m$, where $S_m$ is the set of all clusters of size at least $m$.\n\nTheorem 1 Suppose that $W = A + R$ is an $(n \\times n)$ noisy HBM where $A$ satisfies Assumptions 1, 2, and 3. Suppose that the scale factor of $R$ increases at $\\sigma = o\\left(\\min\\left({\\kappa^\\star}^5 \\sqrt{\\frac{m}{\\log n}},\\; {\\kappa^\\star}^4 \\sqrt[4]{\\frac{m}{\\log n}}\\right)\\right)$ where $\\kappa^\\star = \\min\\left(\\alpha_0, \\frac{\\gamma^\\star_{S_m}}{1+\\eta}\\right)$, $m > 0$ and $m = \\omega(\\log n)$ (see footnote 1). Then for all $n$ large enough, with probability at least $1 - 6/n$, HS, on input $W$, will exactly recover all clusters of size at least $m$.\n\nFootnote 1: Recall $a_n = o(b_n)$ and $b_n = \\omega(a_n)$ if $\\lim_{n \\to \\infty} a_n / b_n = 0$.\n\nA few remarks are in order:\n\n1. It is impossible to resolve the entire hierarchy, since small clusters can be irrecoverably buried in noise. The amount of noise that algorithm HS can tolerate is directly dependent on the size of the smallest cluster we want to resolve.\n\n2. As a consequence of our proof, we show that to resolve only the first level of the hierarchy, the amount of noise we can tolerate is (pessimistically) $o\\left({\\kappa^\\star}^5 \\sqrt[4]{n / \\log n}\\right)$, which grows rapidly with $n$.\n\n3. Under this scaling between $n$ and $\\sigma$, it can be shown that popular agglomerative algorithms such as single linkage will fail with high probability. We verify this negative result through experiments (see Section 5).\n\n4. Since we assume that $\\beta^*$ does not grow with $n$, both the range $(\\beta_s - \\alpha_s)$ and the gap $(\\min\\{\\alpha_{s \\cdot L}, \\alpha_{s \\cdot R}\\} - \\beta_s)$ must decrease with $n$, and hence $\\gamma^\\star_{S_m}$ must decrease as well. For example, if we have uniform ranges and gaps across all levels, then $\\gamma^\\star_{S_m} = \\Theta(1 / \\log n)$. For constant $\\alpha_0$, for $n$ large enough $\\kappa^\\star = \\frac{\\gamma^\\star_{S_m}}{1+\\eta}$. We see that in our analysis $\\gamma^\\star_{S_m}$ is a crucial determinant of the noise tolerance of spectral clustering.\n\nWe extend the intuition behind Theorem 1 to the k-way setting. Some arguments are more subtle since spectral clustering uses the subspace spanned by the k smallest eigenvectors of the Laplacian. We improve the results of Ng et al. [10] to provide a coordinate-wise bound on the perturbation of the subspace, and use this to make precise guarantees for Algorithm K-WAY SPECTRAL.\n\nTheorem 2 Suppose that $W = A + R$ is an $(n \\times n)$ noisy k-Block Diagonal matrix where $A$ satisfies Assumptions 1 and 2. Suppose that the scale factor of $R$ increases at rate $\\sigma = o\\left(\\frac{\\beta_0}{k} \\left(\\frac{n}{k \\log n}\\right)^{1/4}\\right)$. Then with probability $1 - 8/n$, for all $n$ large enough, K-WAY SPECTRAL will exactly recover the $k$ clusters.\n\n3.1 Information-Theoretic Limits\n\nHaving introduced our analysis for spectral clustering, a pertinent question remains. Is the algorithm optimal in its dependence on the various parameters of the problem?\n\nWe establish the minimax rate in the simplest setting of a single binary split and compare it to our own results on spectral clustering. With the necessary machinery in place, the minimax rate for the k-way problem follows easily. We derive lower bounds on the problem of correctly identifying two clusters under the assumption that the clusters are balanced. In particular, we derive conditions on $(n, \\sigma, \\gamma)$, i.e. the number of objects, the noise variance and the gap between inter and intra-cluster similarities, under which any method will make an error in identifying the correct clusters.\n\nTheorem 3 There exists a constant $\\alpha \\in (0, 1/8)$ such that if $\\sigma \\ge \\gamma \\sqrt{\\frac{n}{\\alpha \\log(n/2)}}$, the probability of failure of any estimator of the clustering remains bounded away from 0 as $n \\to \\infty$.\n\nUnder the conditions of this Theorem $\\gamma$ and $\\kappa^\\star$ coincide, provided the inter-cluster similarities remain bounded away from 0 by at least a constant. As a direct consequence of Theorem 1, spectral clustering requires $\\sigma \\le \\min\\left(\\gamma^5 \\sqrt{\\frac{n}{C \\log(n/2)}},\\; \\gamma^4 \\sqrt[4]{\\frac{n}{C \\log(n/2)}}\\right)$ (for a large enough constant $C$).\n\nThus, the noise threshold for spectral clustering does not match the lower bound. To establish that this lower bound is indeed tight, we need to demonstrate a (not necessarily computationally efficient) procedure that achieves this rate. We analyze a combinatorial procedure that solves the NP-hard problem of finding the minimum cut of size exactly $n/2$ by searching over all subsets. This algorithm is strongly related to spectral clustering with the combinatorial Laplacian, which solves a relaxation of the balanced minimum cut problem. We prove the following theorem in the appendix.\n\nTheorem 4 There exists a constant $C$ such that if $\\sigma < \\gamma \\sqrt{\\frac{n}{C \\log(n/2)}}$, the combinatorial procedure described above succeeds with probability at least $1 - \\frac{1}{n}$, which goes to 1 as $n \\to \\infty$.\n\nThis theorem and the lower bound together establish the minimax rate. It, however, remains an open problem to tighten the analysis of spectral clustering in this paper to match this rate. In the Appendix we modify the analysis of [9] to show that under the added restriction of block constant ideal similarities there is an efficient algorithm that achieves the minimax rate.\n\n4 Proof Outlines\n\nHere, we present proof sketches of our main theorems, deferring the details to the Appendix.\n\nOutline of proof of Theorem 1\n\nLet us first restrict our attention toward finding the first split in the hierarchical clustering. Once we prove that we can recover the first split correctly, we can then recursively apply the same arguments along with some delicate union bounds to prove that we will recover all large-enough splits of the hierarchy. To make the presentation clearer, we will only focus here on the scaling between $\\sigma^2$ and $n$. Of course, when we analyze deeper splits, $n$ becomes the size of the sub-cluster.\n\nLet $W = A + R$ be the $n \\times n$ noisy HBM. One can readily verify that the Laplacian of $W$, $L_W$, can be decomposed as $L_A + L_R$. Let $v^{(2)}, u^{(2)}$ be the second eigenvector of $L_A, L_W$ respectively.\n\nWe first show that the unperturbed $v^{(2)}$ can clearly distinguish the two outermost clusters and that $\\lambda_1$, $\\lambda_2$, and $\\lambda_3$ (the first, second, and third smallest eigenvalues of $L_W$ respectively) are far away from each other. More precisely we show $|v^{(2)}_i| = \\Theta(\\frac{1}{\\sqrt{n}})$ for all $i = 1, \\ldots, n$ and its sign corresponds to the cluster identity of point $i$. Further the eigen-gap, $\\lambda_2 - \\lambda_1 = \\lambda_2 = \\Theta(n)$, and $\\lambda_3 - \\lambda_2 = \\Theta(n)$. Now, using the well-known Davis-Kahan perturbation theorem, we can show that\n\n$$\\|v^{(2)} - u^{(2)}\\|_2 = O\\left(\\frac{\\sigma \\sqrt{n \\log n}}{\\min(\\lambda_2, \\lambda_3 - \\lambda_2)}\\right) = O\\left(\\sigma \\sqrt{\\frac{\\log n}{n}}\\right).$$\n\nThe most straightforward way of turning this $\\ell_2$-norm bound into a uniform entry-wise $\\ell_\\infty$ bound is to assume that only one coordinate has large perturbation and comprises all of the $\\ell_2$-perturbation. We perform a much more careful analysis to show that all coordinates uniformly have low perturbation. Specifically, we show that if $\\sigma = O\\left(\\sqrt[4]{\\frac{n}{\\log n}}\\right)$, then with high probability, $\\|v^{(2)}_i - u^{(2)}_i\\|_\\infty = O\\left(\\sqrt{\\frac{1}{n}}\\right)$.\n\nCombining this and the fact that $|v^{(2)}_i| = \\Theta(\\frac{1}{\\sqrt{n}})$, and performing careful comparison with the leading constants, we can conclude that spectral clustering will correctly recover the first split.\n\nOutline of proof of Theorem 2\n\nLeveraging our analysis of Theorem 1 we derive an $\\ell_\\infty$ bound on the bottom k-eigenvectors. One potential complication we need to resolve is that the k-Block Diagonal matrix has repeated eigenvalues and more careful subspace perturbation arguments are warranted.\n\n[Figure 3 panels: (a) probability of success vs. noise scale factor ($\\sigma$) for n = 256, 512, 1024, 2048; (b) probability of success vs. rescaled noise scale factor $\\Theta(\\sigma, n)$ for the same values of n; (c) fraction of tree correct vs. noise scale factor ($\\sigma$) for HS, SL, AL, CL; (d) fraction of tree correct vs. sequence length for HS, SL.]\n\nFigure 3: (a),(b): Threshold curves for the first split in HBMs. 
Comparison of clustering algorithms with n = 512, m = 9 (c), and on simulated phylogeny data (d).\n\nWe further propose an algorithm, K-WAY SPECTRAL, that differs from the standard k-means. The algorithm carefully chooses cluster centers and then simply assigns each point to its nearest center. The $\\ell_\\infty$ bound we derive is much stronger than $\\ell_2$ bounds prevalent in the literature and in a straightforward way provides a no-error guarantee on K-WAY SPECTRAL.\n\nOutline of proof of Theorem 3\n\nAs is typically the case with minimax analysis, we begin by restricting our attention to a small (but hard to distinguish) class of models, and follow this by the application of Fano\u2019s inequality. Models are indexed by $\\Theta(n, \\sigma, \\gamma, I_1)$, where $I_1$ denotes the indices of the rows (and columns) in the first cluster. For simplicity, we\u2019ll focus only on models with $|I_1| = n/2$.\n\nSince we are interested in the worst case we can make two further simplifications. The ideal (noiseless) matrix can be taken to be block-constant since the worst case is when the diagonal blocks are at their lower bound (which we call $p$) and the off diagonal blocks are at their upper bound ($q$). We consider matrices $W = A + R$, which are $(n \\times n)$ matrices, with $R_{ij} \\sim N(0, \\sigma^2)$.\n\nGiven the true parameter $\\theta_0$ we choose the following \u201chard\u201d subset $\\{\\theta_1, \\ldots, \\theta_M\\}$. We will select models which mis-cluster only the last object in $I_1$; there are exactly $n/2$ such models. Our proof is an application of Fano\u2019s inequality, using the Hamming distance and the KL-divergence between the true model $I_1$ and the estimated model $\\hat{I}_1$. 
See the appendix for calculations and proof details.\n\nThe proof of Theorem 4 follows from a careful union bound argument to show that even amongst the combinatorially large number of balanced cuts of the graph, the true cut has the lowest weight.\n\n5 Experiments\n\nWe evaluate our algorithms and theoretical guarantees on simulated matrices, synthetic phylogenies, and finally on two real biological datasets. Our experiments focus on the effect of noise on spectral clustering in comparison with agglomerative methods such as single, average, and complete linkage.\n\n5.1 Threshold Behavior\n\nOne of our primary interests is to empirically validate the relation between the scale factor $\\sigma$ and the sample size $n$ derived in our theorems. For a range of scale factors and noisy HBMs of varying size, we empirically compute the probability with which spectral clustering recovers the first split of the hierarchy. From the probability of success curves (Figure 3(a)), we can conclude that spectral clustering can tolerate noise that grows with the size of the clusters.\n\nWe further verify the dependence between $\\sigma$ and $n$ for recovering the first split. For the first split we observe that when we rescale the x-axis of the curves in Figure 3(a) by $\\sqrt{\\log(n)/n}$ the curves line up for different $n$. This shows that empirically, at least for the first split, spectral clustering appears to achieve the minimax rate for the problem.\n\n5.2 Simulations\n\nWe compare spectral clustering to several agglomerative methods on two forms of synthetic data: noisy HBMs and simulated phylogenetic data. In these simulations, we exploit knowledge of the true reference tree to quantitatively evaluate each algorithm\u2019s output as the fraction of triplets of leaves for which the most similar pair in the output tree matches that of the reference tree. 
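The triplets metric just described can be sketched as follows. This is an illustrative reading of the metric rather than the evaluation code used for the experiments: a tree is represented as a collection of clusters (sets of leaves), the most similar pair of a triplet is taken to be the pair that shares a cluster excluding the third leaf, and the names closest_pair and triplet_score are hypothetical.

```python
from itertools import combinations


def closest_pair(clusters, triple):
    # Return the pair of leaves in `triple` that shares the smallest
    # cluster excluding the third leaf, or None if no such pair exists
    # (for example when the tree does not resolve this triplet).
    best, best_size = None, None
    for pair in combinations(triple, 2):
        third = next(x for x in triple if x not in pair)
        for c in clusters:
            if pair[0] in c and pair[1] in c and third not in c:
                if best_size is None or len(c) < best_size:
                    best, best_size = frozenset(pair), len(c)
    return best


def triplet_score(output_clusters, reference_clusters, leaves):
    # Fraction of leaf triplets whose most similar pair agrees between
    # the output tree and the reference tree.
    triples = list(combinations(leaves, 3))
    hits = 0
    for t in triples:
        ref = closest_pair(reference_clusters, t)
        out = closest_pair(output_clusters, t)
        if ref is not None and ref == out:
            hits += 1
    return hits / len(triples)


leaves = [0, 1, 2, 3]
reference = [{0, 1, 2, 3}, {0, 1}, {2, 3}]       # balanced binary tree
caterpillar = [{0, 1, 2, 3}, {0, 1, 2}, {0, 1}]  # a different binary tree
same_tree = triplet_score(reference, reference, leaves)
other_tree = triplet_score(caterpillar, reference, leaves)
```

In this sketch a triplet that the reference tree does not resolve is counted as a miss, which is one of several reasonable conventions for truncated trees.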
One can verify that a tree has a score of 1 if and only if it is identical to the reference tree.\n\nInitially, we explore how HS compares to agglomerative algorithms on large noisy HBMs. In Figure 3(c), we compare performance, as measured by the triplets metric, of four clustering algorithms (HS, and single, average, and complete linkage) with n = 512 and m = 9. We also evaluate HS and single linkage as applied to reconstructing phylogenetic trees from genetic sequences. In Figure 3(d), we plot accuracy, again measured using the triplets metric, of the two algorithms as a function of sequence length (for sequences generated from the phyclust R package [3]), which is inversely correlated with noise (i.e. short sequences amount to noisy similarities). From these experiments, it is clear that HS consistently outperforms agglomerative methods, with tremendous improvements in the high-noise setting where it recovers a significant amount of the tree structure while agglomerative methods do not.\n\nFigure 4: Experiments with real world data. (a): Heatmaps of single linkage (left) and HS (right) on gene expression data with n = 2048. (b) $\\Delta$-entropy scores on real world data sets.\n\n5.3 Real-World Data\n\nWe apply hierarchical clustering methods to a yeast gene expression data set and one phylogenetic data set from the PFAM database [5]. To evaluate our methods, we use a $\\Delta$-entropy metric defined as follows: Given a permutation $\\pi$ and a similarity matrix $W$, we compute the rate of decay off of the diagonal as $\\hat{s}_d \\triangleq \\frac{1}{n-d} \\sum_{i=1}^{n-d} W_{\\pi(i), \\pi(i+d)}$, for $d \\in \\{1, \\ldots, n-1\\}$. Next, we compute the entropy $\\hat{E}(\\pi) \\triangleq -\\sum_{i=1}^{n-1} \\hat{p}_\\pi(i) \\log \\hat{p}_\\pi(i)$ where $\\hat{p}_\\pi(i) \\triangleq \\left(\\sum_{d=1}^{n-1} \\hat{s}_d\\right)^{-1} \\hat{s}_i$. Finally, we compute $\\Delta$-entropy as $\\Delta\\hat{E}(\\pi) = \\hat{E}(\\pi_{\\mathrm{random}}) - \\hat{E}(\\pi)$. 
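The Delta-entropy computation can be sketched as below. This is a minimal sketch, assuming the reading of the definition in which s_d is the mean similarity between points that sit d positions apart in the ordering, p is the vector of s_d values normalised to sum to one, and the entropy is taken over p. It also assumes nonnegative similarities so that p is a probability vector. The function names and the use of a single seeded random permutation as the baseline are hypothetical choices for illustration.

```python
import math
import random


def band_entropy(W, perm):
    # s[d - 1] is the mean similarity between points d positions apart
    # in the ordering given by perm.
    n = len(perm)
    s = [
        sum(W[perm[i]][perm[i + d]] for i in range(n - d)) / (n - d)
        for d in range(1, n)
    ]
    total = sum(s)
    p = [x / total for x in s]
    # Entropy of the normalised decay profile; zero entries contribute 0.
    return -sum(x * math.log(x) for x in p if x > 0)


def delta_entropy(W, perm, seed=0):
    # Entropy of a random ordering minus entropy of the given ordering.
    baseline = list(perm)
    random.Random(seed).shuffle(baseline)
    return band_entropy(W, baseline) - band_entropy(W, perm)


# Two noiseless clusters {0, 1} and {2, 3} (illustrative values).
W = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
ordered = band_entropy(W, [0, 1, 2, 3])   # clusters kept contiguous
shuffled = band_entropy(W, [0, 2, 3, 1])  # cluster {0, 1} pulled apart
```

An ordering that keeps clusters contiguous concentrates the similarity mass close to the diagonal, so its entropy is low and its Delta-entropy relative to a random ordering is high.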
A good clustering will have a large amount of the probability mass concentrated at a few of the $\\hat{p}_\\pi(i)$s, thus yielding a high $\\Delta\\hat{E}(\\pi)$. On the other hand, poor clusterings will specify a more uniform distribution and will have lower $\\Delta$-entropy.\n\nWe first compare HS to single linkage on yeast gene expression data from DeRisi et al. [4]. This dataset consists of 7 expression profiles, which we use to generate Pearson correlations that we use as similarities. We sampled gene subsets of size n = 512, 1024, and 2048 and ran both algorithms on the reduced similarity matrix. We report $\\Delta$-entropy scores in Figure 4(b). These scores quantitatively demonstrate that HS outperforms single linkage and additionally, we believe the clustering produced by HS (Figure 4(a)) is qualitatively better than that of single linkage.\n\nFinally, we run HS on real phylogeny data, specifically, a subset of the PDZ domain (PFAM Id: PF00595). We consider this family because it is a highly-studied domain of evolutionarily well-represented protein binding motifs. Using alignments of varying length, we generated similarity matrices and computed $\\Delta$-entropy of clusterings produced by both HS and Single Linkage. The results for three sequence lengths (Figure 4(b)) show that HS and Single Linkage are comparable.\n\n6 Discussion\n\nIn this paper we have presented a new analysis of spectral clustering in the presence of noise and established tight information theoretic upper and lower bounds. As our analysis of spectral clustering does not show that it is minimax-optimal, it remains an open problem to further tighten, or establish the tightness of, our analysis, and to find a computationally efficient minimax procedure in the general case when similarities are not block constant. Identifying conditions under which one can guarantee correctness for other forms of spectral clustering is another interesting direction. 
Finally,\nour results apply only to binary hierarchical clusterings, yet k-way hierarchies are common in\npractice. A future challenge is to extend our results to k-way hierarchies.\n7 Acknowledgements\nThis research is supported in part by AFOSR under grant FA9550-10-1-0382 and NSF under grant\nIIS-1116458. AK is supported in part by an NSF Graduate Research Fellowship. SB would like to\nthank Jaime Carbonell and Srivatsan Narayanan for several fruitful discussions.\n\nReferences\n[1] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In Computational Learning Theory, pages 458\u2013469, 2005.\n[2] S. Charles Brubaker and Santosh Vempala. Isotropic PCA and affine-invariant clustering. In FOCS, pages 551\u2013560, 2008.\n[3] Wei-Chen Chen. Phylogenetic Clustering with R package phyclust, 2010.\n[4] Joseph L. DeRisi, Vishwanath R. Iyer, and Patrick O. Brown. Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. Science, 278(5338):680\u2013686, 1997.\n[5] Robert D. Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne E. Pollington, O. Luke Gavin, Prasad Gunesekaran, Goran Ceric, Kristoffer Forslund, Liisa Holm, Erik L. Sonnhammer, Sean R. Eddy, and Alex Bateman. The Pfam Protein Families Database. Nucleic Acids Research, 2010.\n[6] Dorit S. Hochbaum and David B. Shmoys. A Best Possible Heuristic for the K-Center Problem. Mathematics of Operations Research, 10:180\u2013184, 1985.\n[7] Ling Huang, Donghui Yan, Michael I. Jordan, and Nina Taft. Spectral Clustering with Perturbed Data. In Advances in Neural Information Processing Systems, 2009.\n[8] Ravindran Kannan, Hadi Salmasian, and Santosh Vempala. The spectral method for general mixture models. In 18th Annual Conference on Learning Theory (COLT), pages 444\u2013457, 2005.\n[9] Frank McSherry. Spectral partitioning of random graphs. 
In IEEE Symposium on Foundations of Computer Science, page 529, 2001.\n[10] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On Spectral Clustering: Analysis and an Algorithm. In Advances in Neural Information Processing Systems, pages 849\u2013856. MIT Press, 2001.\n[11] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral Clustering and the High-Dimensional Stochastic Block Model. Technical Report 791, Statistics Department, UC Berkeley, 2010.\n[12] Sajama and Alon Orlitsky. Estimating and Computing Density Based Distance Metrics. In ICML05, 22nd International Conference on Machine Learning, 2005.\n[13] Dan Spielman. Lecture Notes on Spectral Graph Theory, 2009.\n[14] Terence Tao. Course Notes on Random Matrix Theory, 2010.\n[15] Alexandre B. Tsybakov. Introduction \u00e0 l'Estimation Non-param\u00e9trique. Springer, 2004.\n[16] Ulrike von Luxburg. A Tutorial on Spectral Clustering. Technical Report 149, Max Planck Institute for Biological Cybernetics, August 2006.\n[17] Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of Spectral Clustering. In The Annals of Statistics, pages 857\u2013864. MIT Press, 2004.\n", "award": [], "sourceid": 592, "authors": [{"given_name": "Sivaraman", "family_name": "Balakrishnan", "institution": null}, {"given_name": "Min", "family_name": "Xu", "institution": null}, {"given_name": "Akshay", "family_name": "Krishnamurthy", "institution": null}, {"given_name": "Aarti", "family_name": "Singh", "institution": null}]}