{"title": "Learning Spectral Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 305, "page_last": 312, "abstract": "", "full_text": "Learning Spectral Clustering

Francis R. Bach
Computer Science
University of California
Berkeley, CA 94720
fbach@cs.berkeley.edu

Michael I. Jordan
Computer Science and Statistics
University of California
Berkeley, CA 94720
jordan@cs.berkeley.edu

Abstract

Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive a new cost function for spectral clustering based on a measure of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing this cost function with respect to the partition leads to a new spectral clustering algorithm. Minimizing with respect to the similarity matrix leads to an algorithm for learning the similarity matrix. We develop a tractable approximation of our cost function that is based on the power method of computing eigenvectors.

1 Introduction

Spectral clustering has many applications in machine learning, exploratory data analysis, computer vision and speech processing. Most techniques explicitly or implicitly assume a metric or a similarity structure over the space of configurations, which is then used by clustering algorithms. The success of such algorithms depends heavily on the choice of the metric, but this choice is generally not treated as part of the learning problem.
Thus, time-consuming manual feature selection and weighting is often a necessary precursor to the use of spectral methods.

Several recent papers have considered ways to alleviate this burden by incorporating prior knowledge into the metric, either in the setting of K-means clustering [1, 2] or spectral clustering [3, 4]. In this paper, we consider a complementary approach, providing a general framework for learning the similarity matrix for spectral clustering from examples. We assume that we are given sample data with known partitions and are asked to build similarity matrices that will lead to these partitions when spectral clustering is performed. This problem is motivated by the availability of such datasets for at least two domains of application: in vision and image segmentation, a hand-segmented dataset is now available [5], while for the blind separation of speech signals via partitioning of the time-frequency plane [6], training examples can be created by mixing previously captured signals.

Another important motivation for our work is the need to develop spectral clustering methods that are robust to irrelevant features. Indeed, as we show in Section 4.2, the performance of current spectral methods can degrade dramatically in the presence of such irrelevant features. By using our learning algorithm to learn a diagonally-scaled Gaussian kernel for generating the affinity matrix, we obtain an algorithm that is significantly more robust.

Our work is based on a new cost function J(W, e) that characterizes how close the eigenstructure of a similarity matrix W is to a partition e. We derive this cost function in Section 2. As we show in Section 2.3, minimizing J with respect to e leads to a new clustering algorithm that takes the form of a weighted K-means algorithm.
Minimizing J with respect to W yields an algorithm for learning the similarity matrix, as we show in Section 4. Section 3 provides foundational material on the approximation of the eigensubspace of a symmetric matrix that is needed for Section 4.

2 Spectral clustering and normalized cuts

Given a dataset I of P points in a space X and a P × P “similarity matrix” (or “affinity matrix”) W that measures the similarity between the P points (W_{pp'} is large when points indexed by p and p' are likely to be in the same cluster), the goal of clustering is to organize the dataset into disjoint subsets with high intra-cluster similarity and low inter-cluster similarity. Throughout this paper we always assume that the elements of W are non-negative (W ≥ 0) and that W is symmetric (W = W^T).

Let D denote the diagonal matrix whose i-th diagonal element is the sum of the elements in the i-th row of W, i.e., D = diag(W1), where 1 is defined as the vector in R^P composed of ones. There are different variants of spectral clustering. In this paper we focus on the task of minimizing “normalized cuts.” The classical relaxation of this NP-hard problem [7, 8, 9] leads to an eigenvalue problem. In this section we show that the problem of finding a solution to the original problem that is closest to the relaxed solution can be solved by a weighted K-means algorithm.

2.1 Normalized cut and graph partitioning

The clustering problem is usually defined in terms of a complete graph with vertices V = {1, ..., P} and an affinity matrix with weights W_{pp'}, for p, p' ∈ V. We wish to find R disjoint clusters A = (A_r)_{r ∈ {1,...,R}}, where ∪_r A_r = V, that optimize a certain cost function.
An example of such a function is the R-way normalized cut defined as follows [7, 10]:

C(A, W) = Σ_{r=1}^R ( Σ_{i ∈ A_r, j ∈ V∖A_r} W_{ij} ) / ( Σ_{i ∈ A_r, j ∈ V} W_{ij} ).

Let e_r be the indicator vector in R^P for the r-th cluster, i.e., e_r ∈ {0, 1}^P is such that e_r has a nonzero component exactly at points in the r-th cluster. Knowledge of e = (e_r) is equivalent to knowledge of A = (A_r) and, when referring to partitions, we will use the two formulations interchangeably. A short calculation reveals that the normalized cut is then equal to C(e, W) = Σ_{r=1}^R e_r^T (D − W) e_r / (e_r^T D e_r).

2.2 Spectral relaxation and rounding

The following proposition, which extends a result of Shi and Malik [7] for two clusters to an arbitrary number of clusters, gives an alternative description of the clustering task, which will lead to a spectral relaxation:

Proposition 1 The R-way normalized cut is equal to R − tr Y^T D^{−1/2} W D^{−1/2} Y for any matrix Y ∈ R^{P×R} such that (a) the columns of D^{−1/2} Y are piecewise constant with respect to the clusters and (b) Y has orthonormal columns (Y^T Y = I).

Proof The constraint (a) is equivalent to the existence of a matrix Λ ∈ R^{R×R} such that D^{−1/2} Y = (e_1, . . . , e_R) Λ = EΛ. The constraint (b) is thus written as I = Y^T Y = Λ^T E^T D E Λ. The matrix E^T D E is diagonal, with elements e_r^T D e_r, and is thus positive and invertible. This immediately implies that ΛΛ^T = (E^T D E)^{−1}. This in turn implies that tr Y^T D^{−1/2} W D^{−1/2} Y = tr Λ^T E^T W E Λ = tr E^T W E ΛΛ^T = tr E^T W E (E^T D E)^{−1}, which is exactly the normalized cut (up to an additive constant).

By removing the constraint (a), we obtain a relaxed optimization problem, whose solutions involve the eigenstructure of D^{−1/2} W D^{−1/2} and which leads to the classical lower bound on the optimal normalized cut [8, 9].
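The R-way normalized cut above can be evaluated directly from the indicator vectors; the following is a minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def normalized_cut(W, labels, R):
    """R-way normalized cut: sum_r e_r^T (D - W) e_r / (e_r^T D e_r),
    a direct transcription of the indicator-vector formula in the text."""
    D = np.diag(W.sum(axis=1))
    total = 0.0
    for r in range(R):
        e = (labels == r).astype(float)        # indicator vector e_r
        total += e @ (D - W) @ e / (e @ D @ e)
    return total

# Two disconnected 2-point blocks: splitting along the blocks cuts nothing.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
print(normalized_cut(W, np.array([0, 0, 1, 1]), R=2))  # -> 0.0
```

Splitting the same graph across the blocks instead gives the maximal value for two clusters, since every within-cluster edge is cut.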
The following proposition gives the solution obtained from the relaxation (for the proof, see [11]):

Proposition 2 The maximum of tr Y^T D^{−1/2} W D^{−1/2} Y over matrices Y ∈ R^{P×R} such that Y^T Y = I is the sum of the R largest eigenvalues of D^{−1/2} W D^{−1/2}. It is attained at all Y of the form Y = U B_1, where U ∈ R^{P×R} is any orthonormal basis of the R-th principal subspace of D^{−1/2} W D^{−1/2} and B_1 is an arbitrary rotation matrix in R^{R×R}.

The solutions found by this relaxation will not in general be piecewise constant. In order to obtain a piecewise constant solution, we wish to find a piecewise constant matrix that is as close as possible to one of the possible Y obtained from the eigendecomposition. Since such matrices are defined up to a rotation matrix, it makes sense to compare the subspaces spanned by their columns. A common way to compare subspaces is to compare the orthogonal projection operators on those subspaces [12], that is, to compute the Frobenius norm between U U^T and Π₀ = Π₀(W, e) ≜ Σ_r D^{1/2} e_r e_r^T D^{1/2} / (e_r^T D e_r) (Π₀ is the orthogonal projection operator on the subspace spanned by the columns of D^{1/2} E = D^{1/2}(e_1, . . . , e_R), from Proposition 1). We thus define the following cost function:

J(W, e) = (1/2) ||U U^T − Π₀||_F²   (1)

Using the fact that both U U^T and Π₀ are orthogonal projection operators on linear subspaces of dimension R, a short calculation reveals that the cost function J(W, e) is equal to R − tr U U^T Π₀ = R − Σ_r e_r^T D^{1/2} U U^T D^{1/2} e_r / (e_r^T D e_r).
This cost function characterizes the ability of the matrix W to produce the partition e when using its eigenvectors. Minimizing with respect to e leads to a new clustering algorithm that we now present. Minimizing with respect to the matrix for a given partition e leads to the learning of the similarity matrix, as we show in Section 4.

2.3 Minimizing with respect to the partition

In this section, we show that minimizing J(W, e) is equivalent to a weighted K-means algorithm. The following theorem, inspired by the spectral relaxation of K-means presented in [8], shows that the cost function can be interpreted as a weighted distortion measure¹:

Theorem 1 Let W be an affinity matrix and let U = (u_1, . . . , u_P), where u_p ∈ R^R, be an orthonormal basis of the R-th principal subspace of D^{−1/2} W D^{−1/2}. For any partition e ≡ A, we have

J(W, e) = min_{(μ_1,...,μ_R) ∈ R^{R×R}} Σ_r Σ_{p ∈ A_r} d_p ||u_p d_p^{−1/2} − μ_r||².

Proof Let D(μ, A) = Σ_r Σ_{p ∈ A_r} d_p ||u_p d_p^{−1/2} − μ_r||². Minimizing D(μ, A) with respect to μ is a decoupled least-squares problem and we get:

min_μ D(μ, A) = Σ_p u_p^T u_p − Σ_r ||Σ_{p ∈ A_r} d_p^{1/2} u_p||² / ( Σ_{p ∈ A_r} d_p )

¹Note that a similar equivalence holds between normalized cuts and weighted K-means for positive semidefinite similarity matrices, which can be factorized as W = GG^T; this leads to an approximation algorithm for minimizing normalized cuts; i.e., we have: C(W, e) = min_{(μ_1,...,μ_R) ∈ R^{R×R}} Σ_r Σ_{p ∈ A_r} d_p ||g_p d_p^{−1} − μ_r||² + R − tr D^{−1/2} W D^{−1/2}.

Input: Similarity matrix W ∈ R^{P×P}.
Algorithm:
1. Compute first R eigenvectors U of D^{−1/2} W D^{−1/2}, where D = diag(W1).
2. Let U = (u_1, . . . , u_P) ∈ R^{R×P} and d_p = D_pp.
3.
Weighted K-means: while partition A is not stationary,
   a. For all r, μ_r = Σ_{p ∈ A_r} d_p^{1/2} u_p / Σ_{p ∈ A_r} d_p
   b. For all p, assign p to A_r where r = arg min_{r'} ||u_p d_p^{−1/2} − μ_{r'}||
Output: partition A, distortion measure Σ_r Σ_{p ∈ A_r} d_p ||u_p d_p^{−1/2} − μ_r||²

Figure 1: Spectral clustering algorithm.

Continuing the proof of Theorem 1, expanding the squared norms gives

min_μ D(μ, A) = Σ_p u_p^T u_p − Σ_r Σ_{p,p' ∈ A_r} d_p^{1/2} d_{p'}^{1/2} u_{p'}^T u_p / (e_r^T D e_r)
= R − Σ_r e_r^T D^{1/2} U U^T D^{1/2} e_r / (e_r^T D e_r) = J(W, e).

This theorem has an immediate algorithmic implication: to minimize the cost function J(W, e) with respect to the partition e, we can use a weighted K-means algorithm. The resulting algorithm is presented in Figure 1. While K-means is often used heuristically as a post-processor for spectral clustering [13], our approach provides a mathematical foundation for the use of K-means, and yields a specific weighted form of K-means that is appropriate for the problem.

2.4 Minimizing with respect to the similarity matrix

When the partition e is given, we can consider minimization with respect to W. As we have suggested, intuitively this has the effect of yielding a matrix W such that the result of spectral clustering with that W is as close as possible to e.
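The algorithm of Figure 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch: all names are ours, and the farthest-first initialization of the centers is a pragmatic choice of ours, not part of the paper's specification.

```python
import numpy as np

def spectral_cluster(W, R, n_iter=100):
    """Sketch of the Figure 1 algorithm: R principal eigenvectors of
    D^{-1/2} W D^{-1/2}, then weighted K-means in the u_p d_p^{-1/2} metric."""
    d = W.sum(axis=1)                        # diagonal of D = diag(W 1)
    M = W / np.sqrt(np.outer(d, d))          # D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(M)              # eigenvalues in ascending order
    U = vecs[:, -R:]                         # rows play the role of u_p
    Z = U / np.sqrt(d)[:, None]              # rows u_p d_p^{-1/2}
    idx = [0]                                # farthest-first center init (ours)
    while len(idx) < R:
        dist = ((Z[:, None] - Z[idx][None]) ** 2).sum(-1).min(axis=1)
        idx.append(int(dist.argmax()))
    mu = Z[idx].copy()
    for _ in range(n_iter):
        # step b: assign each p to the nearest mu_r
        labels = ((Z[:, None] - mu[None]) ** 2).sum(-1).argmin(axis=1)
        # step a: weighted cluster means (non-empty clusters only)
        for r in range(R):
            m = labels == r
            if m.any():
                mu[r] = (np.sqrt(d)[m, None] * U[m]).sum(0) / d[m].sum()
    return labels

# Two 2-point components: the algorithm should separate {0, 1} from {2, 3}.
W = np.kron(np.eye(2), np.ones((2, 2)))
labels = spectral_cluster(W, R=2)
```

On this toy affinity matrix the two components are recovered exactly, whatever arbitrary label numbering the run produces.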
We now make this notion precise, by showing that the cost function J(W, e) is an upper bound on the distance between the partition e and the result of spectral clustering using the similarity matrix W. The metric between two partitions e = (e_r) and f = (f_s), with R and S clusters respectively, is taken to be [14]:

d(e, f) = (1/2) || Σ_r e_r e_r^T / (e_r^T e_r) − Σ_s f_s f_s^T / (f_s^T f_s) ||_F² = (R + S)/2 − Σ_{r,s} (e_r^T f_s)² / ( (e_r^T e_r)(f_s^T f_s) )   (2)

This measure is always between zero and (R + S)/2 − 1, and is equal to zero if and only if e ≡ f. The following theorem shows that if we can perform weighted K-means exactly, we obtain a bound on the performance of our spectral clustering algorithm (for a proof, see [11]):

Theorem 2 Let η = max_p D_pp / min_p D_pp ≥ 1. If e(W) = arg min_e J(W, e), then for all partitions e, we have d(e, e(W)) ≤ 4η J(W, e).

3 Approximation of the cost function

In order to minimize the cost function J(W, e) with respect to W, which is the topic of Section 4, we need to optimize a function of the R-th principal subspace of the matrix D^{−1/2} W D^{−1/2}. In this section, we show how we can compute a differentiable approximation of the projection operator on this subspace.

3.1 Approximation of eigensubspace

Let X ∈ R^{P×P} be a real symmetric matrix. We assume that its eigenvalues are ordered by magnitude: |λ_1| ≥ |λ_2| ≥ · · · ≥ |λ_P|. We assume that |λ_R| > |λ_{R+1}|, so that the R-th principal subspace E_R is well defined, with orthogonal projection Π_R.

Our approximations are based on the power method to compute eigenvectors. It is well known that for almost all vectors v, the ratio X^q v / ||X^q v|| converges to an eigenvector corresponding to the largest eigenvalue [12].
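As a quick numerical illustration of the power method just described (the 2×2 matrix is our own toy example, with eigenvalues 3 and 1 and dominant eigenvector (1, 1)/√2):

```python
import numpy as np

# Power method: X^q v / ||X^q v|| approaches the dominant eigenvector.
X = np.array([[2.0, 1.0],
              [1.0, 2.0]])
v = np.random.default_rng(0).standard_normal(2)
for _ in range(50):                 # q = 50 applications of X
    v = X @ v
    v /= np.linalg.norm(v)          # renormalize to avoid overflow
print(np.abs(v))                    # -> approximately [0.7071, 0.7071]
```

The convergence factor here is |λ_2|/|λ_1| = 1/3 per iteration, so 50 iterations are far more than enough.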
The same method can be generalized to the computation of dominant eigensubspaces: if V is a matrix in R^{P×R}, the subspace generated by the R columns of X^q V will tend to the principal eigensubspace of X. Note that since we are interested only in subspaces, and in particular the orthogonal projection operators on those subspaces, we can choose any method for finding an orthonormal basis of range(X^q V). The QR decomposition is fast and stable and is usually the method used to compute such a basis (the algorithm is usually referred to as “orthogonal iteration” [12]). However, this does not lead to a differentiable function. We develop a different approach which does yield a differentiable function, as made precise in the following proposition (for a proof, see [11]):

Proposition 3 Let V ∈ R^{P×R} be such that η = max_{u ∈ E_R(X)^⊥, v ∈ range(V)} cos(u, v) < 1. Then the function Y ↦ Π̃_R(Y) = M (M^T M)^{−1} M^T, where M = Y^q V, is C^∞ in a neighborhood of X, and we have: ||Π̃_R(X) − Π_R||₂ ≤ η / (1 − η²)^{1/2} · ( |λ_{R+1}| / |λ_R| )^q.

This proposition shows that as q tends to infinity, the range of X^q V will tend to the principal eigensubspace. The rate of convergence is determined by the (multiplicative) eigengap |λ_{R+1}| / |λ_R| < 1: it is usually hard to compute the principal subspace of matrices with eigengap close to one. Note that taking powers of matrices without care can lead to disastrous results [12]. By using successive QR iterations, the computations can be made stable and the same technique can be used for the computation of the derivatives.

3.2 Potentially hard eigenvalue problems

In most of the literature on spectral clustering, it is taken for granted that the eigenvalue problem is easy to solve.
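The differentiable approximation of Proposition 3 can be sketched on a synthetic symmetric matrix with known eigenstructure (the construction and all names below are ours):

```python
import numpy as np

def approx_projector(X, V, q):
    """Proposition 3's approximation: with M = X^q V, return
    M (M^T M)^{-1} M^T, the orthogonal projector onto range(X^q V)."""
    M = np.linalg.matrix_power(X, q) @ V
    return M @ np.linalg.solve(M.T @ M, M.T)

# Symmetric matrix with eigenvalues 3, 2, 0.5, 0.4, 0.3, 0.1, so the
# R = 2 principal subspace is spanned by the first two columns of Q and
# the error should decay like (|lambda_3| / |lambda_2|)^q = 0.25^q.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
X = Q @ np.diag([3.0, 2.0, 0.5, 0.4, 0.3, 0.1]) @ Q.T
exact = Q[:, :2] @ Q[:, :2].T          # exact projector Pi_R
V = rng.standard_normal((6, 2))
err = np.linalg.norm(approx_projector(X, V, q=12) - exact)
```

With q = 12 the error is already tiny; note that for large q the matrix M^T M becomes ill-conditioned, which is one reason the paper recommends stabilizing the computation with successive QR iterations.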
It turns out that in many situations the (multiplicative) eigengap is very close to one, making the eigenvector computation difficult (examples are given in the next section). We acknowledge this potential problem by averaging over several initializations of the original subspace V. More precisely, let (V_m)_{m=1,...,M} be M subspaces of dimension R. Let B_m = Π(range((D^{−1/2} W D^{−1/2})^q V_m)) be the approximations of the projections on the R-th principal subspace² of D^{−1/2} W D^{−1/2}. The cost function that we use is the average error F(W, Π₀(e)) = (1/2M) Σ_{m=1}^M ||B_m − Π₀||_F². This cost function can be rewritten as the distance between the average of the B_m and Π₀, plus the variance of the approximations, thus explicitly penalizing the non-convergence of the power iterations. We choose V_m to be equal to D^{1/2} times a set of R indicator vectors corresponding to subsets of each cluster. In simulations, we used q = 128, M = R², and subsets containing 2/(log₂ q + 1) times the number of original points in the clusters.

3.3 Empirical comparisons

In this section, we study the ability of various cost functions to track the gold standard error measure in Eq. (2) as we vary the parameter α in the similarity matrix W_{pp'} = exp(−α ||x_p − x_{p'}||²). We study the cost function J(W, e), its approximation based on the power method presented in Section 3, and two existing approaches, one based on a Markov chain interpretation of spectral clustering [15] and one based on the alignment [16] of D^{−1/2} W D^{−1/2} and Π₀.
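The averaged cost F(W, Π₀(e)) of Section 3.2 can be sketched as follows. This is only an illustrative sketch with names of our own choosing; for simplicity it draws the V_m at random rather than from cluster subsets as the paper specifies.

```python
import numpy as np

def pi0(W, labels, R):
    """Target projector Pi_0 = sum_r D^{1/2} e_r e_r^T D^{1/2} / (e_r^T D e_r)."""
    d = W.sum(axis=1)
    P = np.zeros_like(W)
    for r in range(R):
        v = np.sqrt(d) * (labels == r)
        P += np.outer(v, v) / (v @ v)
    return P

def averaged_cost(W, labels, R, q=16, n_starts=4, seed=0):
    """F(W, Pi_0) = (1/2M) sum_m ||B_m - Pi_0||_F^2 over M random starts V_m."""
    rng = np.random.default_rng(seed)
    d = W.sum(axis=1)
    Sq = np.linalg.matrix_power(W / np.sqrt(np.outer(d, d)), q)
    target = pi0(W, labels, R)
    total = 0.0
    for _ in range(n_starts):
        M = Sq @ rng.standard_normal((len(W), R))
        # projector onto range((D^{-1/2} W D^{-1/2})^q V_m)
        B = M @ np.linalg.solve(M.T @ M, M.T)
        total += np.linalg.norm(B - target) ** 2
    return total / (2 * n_starts)

# On two clean components, the cost of the true partition is essentially zero.
W = np.kron(np.eye(2), np.ones((2, 2)))
print(averaged_cost(W, np.array([0, 0, 1, 1]), R=2))  # prints a value near zero
```

A mismatched partition on the same matrix yields a strictly positive cost, which is the signal the learning algorithm of Section 4 descends.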
We carry out this experiment for the simple clustering example shown in Figure 2(a).

²The matrix D^{−1/2} W D^{−1/2} always has the same largest eigenvalue 1, with eigenvector D^{1/2} 1, and we could consider instead the (R − 1)-st principal subspace of D^{−1/2} W D^{−1/2} − D^{1/2} 1 1^T D^{1/2} / (1^T D 1).

Figure 2: Empirical comparison of cost functions. (a) Data. (b) Eigengap of the similarity matrix as a function of α. (c) Gold standard clustering error (solid), spectral cost function J (dotted) and its approximation based on the power method (dashed). (d) Gold standard clustering error (solid), the alignment (dashed), and a Markov-chain-based cost, divided by 16 (dotted).

This apparently simple toy example captures much of the core difficulty of spectral clustering: nonlinear separability and thinness/sparsity of clusters (any point has very few near neighbors belonging to the same cluster, so that the weighted graph is sparse). In particular, in Figure 2(b) we plot the eigengap of the similarity matrix as a function of α, noting that at the optimum this gap is very close to one, and thus the eigenvalue problem is hard to solve.

In Figure 2(c) and (d), we plot the four cost functions against the gold standard. The gold standard curve shows that the optimal α lies near 2.5 on a log scale, and as seen in Figure 2(c), the minima of the new cost function and its approximation lie near to this value.
As seen in Figure 2(d), on the other hand, the other two cost functions show a poor match to the gold standard, and yield minima far from the optimum.

The problem with the alignment and Markov-chain-based cost functions is that these functions essentially measure the distance between the similarity matrix W (or a normalized version of W) and a matrix T which (after permutation) is block-diagonal with constant blocks. Unfortunately, in examples like the one in Figure 2, the optimal similarity matrix is very far from being block-diagonal with constant blocks. Rather, given that data points that lie in the same ring are in general far apart, the blocks are very sparse, not constant and full. Methods that try to find constant blocks cannot find the optimal matrices in these cases. In the language of spectral graph partitioning, where we have a weighted graph with weights W, each cluster is a connected but very sparse graph. The power W^q corresponds to the q-th power of the graph; i.e., the graph in which two vertices are linked by an edge if and only if they are linked by a path of length no more than q in the original graph. Thus taking powers can be interpreted as “thickening” the graph to make the clusters more apparent, while not changing the eigenstructure of the matrix (taking powers of symmetric matrices only changes the eigenvalues, not the eigenvectors).

4 Learning the similarity matrix

We now turn to the problem of learning the similarity matrix from data. We assume that we are given one or more sets of data for which the desired clustering is known. The goal is to design a “similarity map,” that is, a mapping from datasets of elements in X to the space of symmetric matrices with nonnegative elements. To turn this into a parametric learning problem, we focus on similarity matrices that are obtained as Gram matrices of a kernel function k(x, y) defined on X × X.
In particular, for concreteness and simplicity, we restrict ourselves in this paper to the case of Euclidean data (X = R^F) and a diagonally-scaled Gaussian kernel k_α(x, y) = exp(−(x − y)^T diag(α) (x − y)), where α ∈ R^F, while noting that our methods apply more generally.

4.1 Learning algorithm

We assume that we are given N datasets D_n, n ∈ {1, . . . , N}, of points in R^F. Each dataset D_n is composed of P_n points x_{np}, p ∈ {1, . . . , P_n}. Each dataset is segmented, that is, for each n we know the partition e_n, so that the “target” matrix Π₀(e_n, α) can be computed for each dataset. For each n, we have a similarity matrix W_n(α). The cost function that we use is H(α) = (1/N) Σ_n F(W_n(α), Π₀(e_n, α)) + C ||α||₁. The ℓ₁ penalty serves as a feature selection term, tending to make the solution sparse. The learning algorithm is the minimization of H(α) with respect to α ∈ R^F_+, using the method of conjugate gradient with line search.

Since the complexity of the cost function increases with q, we start the minimization with small q and gradually increase q up to its maximum value. We have observed that for small q, the function to optimize is smoother and thus easier to optimize; in particular, the long plateaus of constant values are less pronounced.

Testing. The output of the learning algorithm is a vector α ∈ R^F. In order to cluster previously unseen datasets, we compute the similarity matrix W and use the algorithm of Figure 1. In order to further enhance performance, we can also adopt an idea due to [13]: we hold the direction of α fixed but perform a line search on its norm.
This yields the real number λ such that the weighted distortion obtained after application of the spectral clustering algorithm of Figure 1, with the similarity matrices defined by λα, is minimum.³

4.2 Simulations

We performed simulations on synthetic datasets in two dimensions, where we consider datasets similar to the one in Figure 2, with two rings whose relative distance is constant across samples (but whose relative orientation has a random direction). We add D irrelevant dimensions of the same magnitude as the two relevant variables. The goal is thus to learn the diagonal scale α ∈ R^{D+2} of a Gaussian kernel that leads to the best clustering on unseen data. We learn α from N sample datasets (N = 1 or 10), and compute the clustering error of our algorithm, with and without adaptive tuning of the norm of α during testing (as described in Section 4.1), on ten previously unseen datasets. We compare to an approach that does not use the training data: α is taken to be the vector of all ones and we again search over the best possible norm during testing (we refer to this method as “no learning”). We report results in Table 1. Without feature selection, the performance of spectral clustering degrades very rapidly when the number of irrelevant features increases, while our learning approach is very robust, even with only one training dataset.

5 Conclusion

We have presented two algorithms, one for spectral clustering and one for learning the similarity matrix. These algorithms can be derived as the minimization of a single cost function with respect to its two arguments. This cost function depends directly on the eigenstructure of the similarity matrix. We have shown that it can be approximated efficiently using the power method, yielding a method for learning similarity matrices that can cluster effectively in cases in which non-adaptive approaches fail.
Note in particular that our new approach yields a spectral clustering method that is significantly more robust to irrelevant features than current methods.

We are currently applying our algorithm to problems in speech separation and image segmentation, in particular with the objective of selecting features from among the numerous features that are available in these domains [6, 7]. The number of points in such datasets can be very large, and we have developed efficient implementations of both learning and clustering based on sparsity and low-rank approximations [11].

³In [13], this procedure is used to learn one parameter of the similarity matrix with no training data; it cannot be used directly here to learn a more complex similarity matrix with more parameters, because it would lead to overfitting.

Table 1: Performance on synthetic datasets: clustering errors (multiplied by 100) for the method without learning (but with tuning) and for our learning method with and without tuning, with N = 1 or 10 training datasets; D is the number of irrelevant features.

  D  | no learning | learning w/o tuning | learning with tuning
     |             |   N=1   |   N=10    |   N=1   |   N=10
  0  |      0      |  15.5   |   10.5    |    0    |    0
  1  |    60.8     |  37.7   |    9.5    |    0    |    0
  2  |    79.8     |  36.9   |    9.5    |    0    |    0
  4  |    99.8     |  37.8   |    9.7    |   0.4   |    0
  8  |    99.8     |  37     |   10.7    |    0    |    0
 16  |    99.7     |  38.8   |   10.9    |   14    |    0
 32  |    99.9     |  38.9   |   15.1    |  14.6   |    6.1

Acknowledgments
We would like to acknowledge support from NSF grant IIS-9988642, MURI ONR-N00014-01-1-0890 and a grant from Intel Corporation.

References
[1] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained K-means clustering with background knowledge. In ICML, 2001.
[2] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS 15, 2003.
[3] S. X. Yu and J. Shi. Grouping with bias. In NIPS 14, 2002.
[4] S. D. Kamvar, D. Klein, and C. D. Manning. Spectral learning.
In IJCAI, 2003.
[5] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
[6] G. J. Brown and M. P. Cooke. Computational auditory scene analysis. Computer Speech and Language, 8:297–333, 1994.
[7] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888–905, 2000.
[8] H. Zha, C. Ding, M. Gu, X. He, and H. Simon. Spectral relaxation for K-means clustering. In NIPS 14, 2002.
[9] P. K. Chan, M. D. F. Schlag, and J. Y. Zien. Spectral K-way ratio-cut partitioning and clustering. IEEE Trans. CAD, 13(9):1088–1096, 1994.
[10] M. Gu, H. Zha, C. Ding, X. He, and H. Simon. Spectral relaxation models and structure analysis for K-way graph clustering and bi-clustering. Technical report, Penn. State Univ., Computer Science and Engineering, 2001.
[11] F. R. Bach and M. I. Jordan. Learning spectral clustering. Technical report, UC Berkeley, available at www.cs.berkeley.edu/~fbach, 2003.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
[13] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In NIPS 14, 2001.
[14] L. J. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
[15] M. Meila and J. Shi. Learning segmentation by random walks. In NIPS 13, 2002.
[16] N. Cristianini, J. Shawe-Taylor, and J. Kandola. Spectral kernel methods for clustering. In NIPS 14, 2002.
", "award": [], "sourceid": 2388, "authors": [{"given_name": "Francis", "family_name": "Bach", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}