{"title": "Spectral Relaxation for K-means Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1057, "page_last": 1064, "abstract": null, "full_text": "Spectral Relaxation for K-means Clustering\n\nHongyuan Zha & Xiaofeng He\nDept. of Comp. Sci. & Eng.\nThe Pennsylvania State University\nUniversity Park, PA 16802\n{zha,xhe}@cse.psu.edu\n\nChris Ding & Horst Simon\nNERSC Division\nLawrence Berkeley National Lab.\nUC Berkeley, Berkeley, CA 94720\n{chqding,hdsimon}@lbl.gov\n\nMing Gu\nDept. of Mathematics\nUC Berkeley, Berkeley, CA 94720\nmgu@math.berkeley.edu\n\nAbstract\n\nThe popular K-means clustering partitions a data set by minimizing a sum-of-squares cost function. A coordinate descent method is then used to find local minima. In this paper we show that the minimization can be reformulated as a trace maximization problem associated with the Gram matrix of the data vectors. Furthermore, we show that a relaxed version of the trace maximization problem possesses global optimal solutions which can be obtained by computing a partial eigendecomposition of the Gram matrix, and the cluster assignment for each data vector can be found by computing a pivoted QR decomposition of the eigenvector matrix. As a by-product we also derive a lower bound for the minimum of the sum-of-squares cost function.\n\n1 Introduction\n\nK-means is a very popular method for general clustering [6]. In K-means, clusters are represented by the centers of mass of their members, and it can be shown that the K-means algorithm of alternating between assigning each data vector to the nearest cluster center and computing the center of each cluster as the centroid of its member data vectors is equivalent to finding the minimum of a sum-of-squares cost function using coordinate descent. 
Despite the popularity of K-means clustering, one of its major drawbacks is that the coordinate descent search method is prone to local minima. Much research has been done on computing refined initial points and adding explicit constraints to the sum-of-squares cost function so that the search can converge to better local minima [1, 2]. In this paper we tackle the problem from a different angle: we find an equivalent formulation of the sum-of-squares minimization as a trace maximization problem with special constraints; relaxing the constraints leads to a maximization problem that possesses optimal global solutions. As a by-product we also obtain an easily computable lower bound for the minimum of the sum-of-squares cost function. Our work is inspired by [9, 3], where the connection to the Gram matrix and the extension of the K-means method to general Mercer kernels were investigated.\n\nThe rest of the paper is organized as follows: in section 2, we derive the equivalent trace maximization formulation and discuss its spectral relaxation. In section 3, we discuss how to assign cluster membership using pivoted QR decomposition, taking into account the special structure of the partial eigenvector matrix. Finally, in section 4, we illustrate the performance of the clustering algorithms using document clustering as an example.\n\nNotation. Throughout, ||·|| denotes the Euclidean norm of a vector. The trace of a matrix A, i.e., the sum of its diagonal elements, is denoted trace(A). The Frobenius norm of a matrix is ||A||_F = sqrt(trace(A^T A)). I_n denotes the identity matrix of order n.\n\n2 Spectral Relaxation\n\nGiven a set of m-dimensional data vectors a_i, i = 1, ..., n, we form the m-by-n data matrix A = [a_1, ..., a_n]. A partition Π of the data vectors can be written in the form\n\nA E = [A_1, ..., A_k],   (1)\n\nwhere E is a permutation matrix and A_i is m-by-s_i, i.e., the ith cluster contains the s_i data vectors collected in A_i. 
For a given partition Π in (1), the associated sum-of-squares cost function is defined as\n\nss(Π) = Σ_{i=1}^{k} Σ_{s=1}^{s_i} ||a_s^(i) - m_i||^2,   m_i = Σ_{s=1}^{s_i} a_s^(i) / s_i,\n\ni.e., m_i is the mean vector of the data vectors in cluster i. Let e be a vector of appropriate dimension with all elements equal to one; it is easy to see that m_i = A_i e / s_i and\n\nss_i ≡ Σ_{s=1}^{s_i} ||a_s^(i) - m_i||^2 = ||A_i - m_i e^T||_F^2 = ||A_i (I_{s_i} - e e^T / s_i)||_F^2.\n\nNotice that I_{s_i} - e e^T / s_i is a projection matrix with (I_{s_i} - e e^T / s_i)^2 = I_{s_i} - e e^T / s_i; it follows that\n\nss_i = trace(A_i (I_{s_i} - e e^T / s_i) A_i^T) = trace((I_{s_i} - e e^T / s_i) A_i^T A_i).\n\nTherefore,\n\nss(Π) = Σ_{i=1}^{k} ss_i = Σ_{i=1}^{k} ( trace(A_i^T A_i) - (e/√s_i)^T A_i^T A_i (e/√s_i) ).\n\nLet the n-by-k orthonormal matrix X be\n\nX = E diag(e/√s_1, ..., e/√s_k).   (2)\n\nThe sum-of-squares cost function can now be written as\n\nss(Π) = trace(A^T A) - trace(X^T A^T A X),\n\nand its minimization is equivalent to\n\nmax{ trace(X^T A^T A X) | X of the form in (2) }.\n\nREMARK. Without loss of generality, let E = I in (1), and let x_i be the cluster indicator vector, i.e., x_i^T = [0, ..., 0, 1, ..., 1, 0, ..., 0] with s_i ones in the positions of cluster i. Then it is easy to see that\n\ntrace(X^T A^T A X) = Σ_{i=1}^{k} x_i^T A^T A x_i / (x_i^T x_i) = Σ_{i=1}^{k} ||A x_i||^2 / ||x_i||^2.\n\nUsing the partition in (1), the right-hand side of the above can be written as Σ_{i=1}^{k} s_i ||m_i||^2, a weighted sum of the squared Euclidean norms of the mean vectors of the clusters.\n\nREMARK. If we consider the elements of the Gram matrix A^T A as measuring similarity between data vectors, then we have shown that Euclidean distance leads to Euclidean inner-product similarity. This inner-product can be replaced by a general Mercer kernel as is done in [9, 3]. 
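The trace identity above is easy to verify numerically. The following minimal sketch is not from the paper; it assumes NumPy is available, and the data matrix and partition are made up for illustration. It checks that ss(Π) = trace(A^T A) - trace(X^T A^T A X) for the scaled indicator matrix X of (2):

```python
import numpy as np

# Illustrative data: m=3 features, n=6 data vectors, k=3 clusters.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 6))
labels = np.array([0, 0, 1, 1, 1, 2])
k = labels.max() + 1

# Direct sum-of-squares cost: squared distances to each cluster mean.
ss_direct = sum(
    np.sum((A[:, labels == i] - A[:, labels == i].mean(axis=1, keepdims=True)) ** 2)
    for i in range(k)
)

# Trace form: X has entries 1/sqrt(s_i) on the rows of cluster i.
n = A.shape[1]
X = np.zeros((n, k))
for i in range(k):
    idx = np.flatnonzero(labels == i)
    X[idx, i] = 1.0 / np.sqrt(len(idx))

G = A.T @ A  # Gram matrix of the data vectors
ss_trace = np.trace(G) - np.trace(X.T @ G @ X)

assert np.isclose(ss_direct, ss_trace)
```

The check passes to machine precision because the columns of X are exactly orthonormal, so X^T X = I_k holds by construction.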
\n\nIgnoring the special structure of X and letting it be an arbitrary orthonormal matrix, we obtain the relaxed maximization problem\n\nmax_{X^T X = I_k} trace(X^T A^T A X).   (3)\n\nIt turns out that this trace maximization problem has a closed-form solution.\n\nTheorem (Ky Fan). Let H be a symmetric matrix with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_n and corresponding eigenvectors U = [u_1, ..., u_n]. Then\n\nλ_1 + ... + λ_k = max_{X^T X = I_k} trace(X^T H X).\n\nMoreover, the optimal X* is given by X* = [u_1, ..., u_k] Q with Q an arbitrary k-by-k orthogonal matrix.\n\nIt follows from the above theorem that we need to compute the k largest eigenvectors of the Gram matrix A^T A. As a by-product, we have\n\nmin_Π ss(Π) ≥ trace(A^T A) - max_{X^T X = I_k} trace(X^T A^T A X) = Σ_{i=k+1}^{min{m,n}} σ_i^2(A),   (4)\n\nwhere σ_i(A) is the ith largest singular value of A. This gives a lower bound for the minimum of the sum-of-squares cost function.\n\nREMARK. It is easy to see from the above derivation that we can replace A with A - a e^T, where a is an arbitrary vector. Then we have the following lower bound:\n\nmin_Π ss(Π) ≥ max_a Σ_{i=k+1}^{min{m,n}} σ_i^2(A - a e^T).\n\nREMARK. One might also try the following approach: notice that\n\n||A_i - m_i e^T||_F^2 = (1/(2 s_i)) Σ_{a_j ∈ A_i} Σ_{a_j' ∈ A_i} ||a_j - a_j'||^2.\n\nLet W = ( ||a_i - a_j||^2 )_{i,j=1}^{n}, and let x_i = [x_ij]_{j=1}^{n} with x_ij = 1 if a_j ∈ A_i and x_ij = 0 otherwise. Then\n\nss(Π) = (1/2) Σ_{i=1}^{k} x_i^T W x_i / (x_i^T x_i) ≥ (1/2) min_{Z^T Z = I_k} trace(Z^T W Z) = (1/2) Σ_{i=n-k+1}^{n} λ_i(W).\n\nUnfortunately, some of the smallest eigenvalues of W can be negative.\n\nLet X_k be the n-by-k matrix consisting of the k largest eigenvectors of A^T A. Each row of X_k corresponds to a data vector, and the above process can be considered as transforming the original data vectors, which live in an m-dimensional space, into new data vectors which now live in a k-dimensional space. 
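The lower bound (4) can be sanity-checked by brute force on a tiny instance. The sketch below is an assumed illustration (NumPy, made-up data, names not from the paper): it enumerates every 2-way partition of six points and compares the minimum cost against the tail sum of squared singular values:

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(1)
m, n, k = 3, 6, 2
A = rng.normal(size=(m, n))

# Lower bound (4): sum of squared singular values beyond the k largest.
sigma = np.linalg.svd(A, compute_uv=False)
lower_bound = np.sum(sigma[k:] ** 2)

def ss(labels):
    """Sum-of-squares cost of the partition encoded by cluster labels."""
    total = 0.0
    for i in set(labels):
        Ai = A[:, [j for j, l in enumerate(labels) if l == i]]
        total += np.sum((Ai - Ai.mean(axis=1, keepdims=True)) ** 2)
    return total

# Enumerate all assignments of the n points to k nonempty clusters.
costs = [ss(lab) for lab in product(range(k), repeat=n) if len(set(lab)) == k]

assert min(costs) >= lower_bound - 1e-9
```

Exhaustive enumeration is only feasible for toy sizes, of course; the point of the bound is precisely that it is cheap to compute when enumeration is not.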
One might be tempted to compute the cluster assignment by applying the ordinary K-means method to the data vectors in the reduced-dimension space. In the next section, we discuss an alternative that takes into account the structure of the eigenvector matrix X_k [5].\n\nREMARK. The similarity of the projection process to principal component analysis is deceiving: the goal here is not to reconstruct the data matrix using a low-rank approximation, but rather to capture its cluster structure.\n\n3 Cluster Assignment Using Pivoted QR Decomposition\n\nWithout loss of generality, let us assume that the best partition of the data vectors in A that minimizes ss(Π) is given by A = [A_1, ..., A_k], each submatrix A_i corresponding to a cluster. Now write the Gram matrix of A as\n\nA^T A = diag(A_1^T A_1, A_2^T A_2, ..., A_k^T A_k) + E ≡ B + E.\n\nIf the overlaps among the clusters represented by the submatrices A_i are small, then the norm of E will be small compared with that of the block diagonal matrix B in the above equation. Let the largest eigenvector of A_i^T A_i be y_i, i.e.,\n\nA_i^T A_i y_i = μ_i y_i,   ||y_i|| = 1,   i = 1, ..., k;\n\nthen the columns of the block diagonal matrix Y_k ≡ diag(y_1, ..., y_k) span an invariant subspace of B. Let the eigenvalues and eigenvectors of A^T A be\n\nλ_1 ≥ λ_2 ≥ ... ≥ λ_n,   A^T A x_i = λ_i x_i,   i = 1, ..., n.\n\nAssume that there is a gap between the two eigenvalue sets {μ_1, ..., μ_k} and {λ_{k+1}, ..., λ_n}, i.e.,\n\n0 < δ = min{ |μ_i - λ_j| : i = 1, ..., k, j = k+1, ..., n }.\n\nThen the Davis-Kahan sin(Θ) theorem states that ||Y_k^T [x_{k+1}, ..., x_n]|| ≤ ||E|| / δ [11, Theorem 3.4]. After some manipulation, it can be shown that\n\nX_k ≡ [x_1, ..., x_k] = Y_k V + O(||E||),\n\nwhere V is a k-by-k orthogonal matrix. Ignoring the O(||E||) term, we see that\n\nX_k^T ≈ V^T Y_k^T = [ y_{11} v_1, ..., y_{1 s_1} v_1 | ... | y_{k1} v_k, ..., y_{k s_k} v_k ],\n\nwith the first s_1 columns coming from cluster 1, and so on through cluster k, where we have used y_i^T = [y_{i1}, ..., y_{i s_i}] and V^T = [v_1, ..., v_k]. 
A key observation is that the v_i are orthogonal to each other: once we have selected a v_i, we can jump to the other clusters by looking at the orthogonal complement of v_i. Also notice that ||y_i|| = 1, so the elements of y_i cannot all be small. A robust implementation of this idea can be obtained as follows: we pick the column of X_k^T with the largest norm; say it belongs to cluster i. We then orthogonalize the rest of the columns of X_k^T against this column. For the columns belonging to cluster i the residual vectors will have small norm, while for the other columns the residual vectors will tend not to be small. We then pick the residual vector with the largest norm and orthogonalize the other residual vectors against it. The process can be carried out for k steps, and it turns out to be exactly QR decomposition with column pivoting applied to X_k^T [4], i.e., we find a permutation matrix P such that\n\nX_k^T P = Q R = Q [R_11, R_12],\n\nwhere Q is a k-by-k orthogonal matrix and R_11 is a k-by-k upper triangular matrix. We then compute the matrix\n\nR̂ = R_11^{-1} [R_11, R_12] P^T = [I_k, R_11^{-1} R_12] P^T.\n\nThe cluster membership of each data vector is then determined by the row index of the largest element in absolute value of the corresponding column of R̂.\n\nREMARK. Sometimes it may be advantageous to include more than k eigenvectors, forming X_s with s > k. We can still use QR decomposition with column pivoting to select k columns of X_s^T, which form an s-by-k matrix, say X̂. Then for each column z of X_s^T we compute the least squares solution t* = argmin_{t ∈ R^k} ||z - X̂ t||, and the cluster membership of z is determined by the row index of the largest element in absolute value of t*. 
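The assignment procedure just described can be sketched in a few lines. This is an assumed re-implementation (the paper gives no code): it uses the greedy largest-norm column pivoting described above in place of a library pivoted-QR routine, and it exploits the fact that solving against the selected pivot columns yields exactly R̂ = R_11^{-1}[R_11, R_12]P^T without forming Q or R explicitly:

```python
import numpy as np

def assign_clusters(Xk):
    """Xk: n-by-k matrix of the k largest eigenvectors of the Gram matrix."""
    Z = Xk.T.copy()          # k-by-n; columns are the projected data vectors
    k = Z.shape[0]
    pivots = []
    W = Z.copy()
    for _ in range(k):       # greedy column pivoting, as described in the text
        norms = np.linalg.norm(W, axis=0)
        norms[pivots] = -1.0                 # never re-select a pivot
        p = int(np.argmax(norms))
        pivots.append(p)
        q = W[:, p] / np.linalg.norm(W[:, p])
        W = W - np.outer(q, q @ W)           # orthogonalize remaining columns
    # (pivot columns)^{-1} Z equals R11^{-1} [R11, R12] P^T from the text.
    R = np.linalg.solve(Z[:, pivots], Z)
    return np.argmax(np.abs(R), axis=0)      # row index of largest |entry|

# Demo on three well-separated synthetic clusters (illustrative data only).
rng = np.random.default_rng(2)
centers = np.array([[10.0, 0, 0], [0, 10.0, 0], [0, 0, 10.0]])
labels = np.repeat([0, 1, 2], 5)
A = centers[labels].T + 0.1 * rng.normal(size=(3, 15))
G = A.T @ A
vals, vecs = np.linalg.eigh(G)   # eigh returns ascending eigenvalues
Xk = vecs[:, -3:]                # k = 3 largest eigenvectors
pred = assign_clusters(Xk)
assert len(set(pred.tolist())) == 3
```

Up to an arbitrary relabeling of the clusters, the prediction should agree with the true partition when the Gram matrix is close to block diagonal.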
\n\n4 Experimental Results\n\nIn this section we present our experimental results on clustering a dataset of newsgroup articles submitted to 20 newsgroups.¹ This dataset contains about 20,000 articles (email messages) evenly divided among the 20 newsgroups. We list the names of the newsgroups together with the associated group labels:\n\nNG1: alt.atheism, NG2: comp.graphics, NG3: comp.os.ms-windows.misc, NG4: comp.sys.ibm.pc.hardware, NG5: comp.sys.mac.hardware, NG6: comp.windows.x, NG7: misc.forsale, NG8: rec.autos, NG9: rec.motorcycles, NG10: rec.sport.baseball, NG11: rec.sport.hockey, NG12: sci.crypt, NG13: sci.electronics, NG14: sci.med, NG15: sci.space, NG16: soc.religion.christian, NG17: talk.politics.guns, NG18: talk.politics.mideast, NG19: talk.politics.misc, NG20: talk.religion.misc.\n\n¹ The newsgroup dataset together with the bow toolkit for processing it can be downloaded from http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html.\n\nFigure 1: Clustering accuracy for five newsgroups NG2/NG9/NG10/NG15/NG18: p-QR vs. p-Kmeans (left) and p-Kmeans vs. K-means (right).\n\nWe used the bow toolkit to construct the term-document matrix for this dataset; specifically, we used the tokenization option so that the UseNet headers are stripped, and we also applied stemming [8]. The following three preprocessing steps were then applied: 1) we applied the usual tf.idf weighting scheme; 2) we deleted words that appear too few times; 3) we normalized each document vector to unit Euclidean length. 
\n\nWe tested three clustering algorithms: 1) p-QR, the algorithm that computes the eigenvector matrix and then assigns cluster membership by pivoted QR decomposition; 2) p-Kmeans, which computes the eigenvector matrix and then applies K-means to the rows of the eigenvector matrix; 3) K-means, which is K-means applied directly to the original data vectors. For both K-means methods, we start with a set of cluster centers chosen randomly from the (projected) data vectors, and we also make sure that the same random set is used for both, for comparison. To assess the quality of a clustering algorithm, we take advantage of the fact that the newsgroup data are already labeled, and we measure the performance by the accuracy of the clustering algorithm against the document category labels [10]. In particular, for a k-cluster case, we compute a k-by-k confusion matrix C = [c_ij], with c_ij the number of documents in cluster i that belong to newsgroup category j. It is actually quite subtle to compute the accuracy from the confusion matrix, because we do not know which cluster matches which newsgroup category. An optimal way is to solve the maximization problem\n\nmax{ trace(C P) | P is a permutation matrix },\n\nand divide the maximum by the total number of documents to get the accuracy. This is equivalent to finding a perfect matching in a complete weighted bipartite graph, for which one can use the Kuhn-Munkres algorithm [7]. In all our experiments, we used a greedy algorithm to compute a sub-optimal solution. 
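For small k, the matching step can be solved exactly by brute force over permutations rather than by the Kuhn-Munkres algorithm or the greedy approximation the authors used. A minimal stdlib-only sketch (names are illustrative, not from the paper):

```python
from itertools import permutations

def clustering_accuracy(pred, true, k):
    """Accuracy of a clustering against labels, maximized over cluster relabelings."""
    # Confusion matrix: C[i][j] = documents in cluster i with true category j.
    C = [[0] * k for _ in range(k)]
    for p, t in zip(pred, true):
        C[p][t] += 1
    # max over permutations P of trace(CP), i.e. the best cluster-to-label match.
    best = max(sum(C[i][perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return best / len(pred)

# Same partition under different cluster names scores perfectly.
assert clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2], 3) == 1.0
```

Brute force costs k! permutation evaluations, which is fine for the k ≤ 10 used in these experiments but is exactly what Kuhn-Munkres avoids for larger k.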
\n\nTable 1: Comparison of p-QR, p-Kmeans, and K-means for two-way clustering\n\nNewsgroups | p-QR | p-Kmeans | K-means\nNG1/NG2 | 89.29 ± 7.51% | 89.62 ± 6.90% | 76.25 ± 13.06%\nNG2/NG3 | 62.37 ± 8.39% | 63.84 ± 8.74% | 61.62 ± 8.03%\nNG8/NG9 | 75.88 ± 8.88% | 77.64 ± 9.00% | 65.65 ± 9.26%\nNG10/NG11 | 73.32 ± 9.08% | 74.86 ± 8.89% | 62.04 ± 8.61%\nNG1/NG15 | 73.32 ± 9.08% | 74.86 ± 8.89% | 62.04 ± 8.61%\nNG18/NG19 | 63.86 ± 6.09% | 64.04 ± 7.23% | 63.66 ± 8.48%\n\nTable 2: Comparison of p-QR, p-Kmeans, and K-means for multi-way clustering\n\nNewsgroups | p-QR | p-Kmeans | K-means\nNG2/NG3/NG4/NG5/NG6 (50) | 40.36 ± 5.17% | 41.15 ± 5.73% | 35.77 ± 5.19%\nNG2/NG3/NG4/NG5/NG6 (100) | 41.67 ± 5.06% | 42.53 ± 5.02% | 37.20 ± 4.39%\nNG2/NG9/NG10/NG15/NG18 (50) | 77.83 ± 9.26% | 70.13 ± 11.67% | 58.10 ± 9.60%\nNG2/NG9/NG10/NG15/NG18 (100) | 79.91 ± 9.90% | 75.56 ± 10.63% | 66.37 ± 10.89%\nNG1/NG5/NG7/NG8/NG11/NG12/NG13/NG14/NG15/NG17 (50) | 60.21 ± 4.88% | 58.18 ± 4.41% | 40.18 ± 4.64%\nNG1/NG5/NG7/NG8/NG11/NG12/NG13/NG14/NG15/NG17 (100) | 65.08 ± 5.14% | 58.99 ± 5.22% | 48.33 ± 5.64%\n\nEXAMPLE 1. In this example, we look at binary clustering. We choose 50 random document vectors each from two newsgroups. We tested 100 runs for each pair of newsgroups, and list the means and standard deviations in Table 1. The two clustering algorithms p-QR and p-Kmeans are comparable to each other, and both are better, and sometimes substantially better, than K-means.\n\nEXAMPLE 2. In this example, we consider k-way clustering with k = 5 and k = 10. Three newsgroup sets are chosen, with 50 and 100 random samples from each newsgroup as indicated in the parentheses. 
Again, 100 runs are used for each test, and the means and standard deviations are listed in Table 2. Moreover, in Figure 1, we plot the accuracy of the 100 runs for the test NG2/NG9/NG10/NG15/NG18 (50). Both p-QR and p-Kmeans perform better than K-means. For newsgroup sets with small overlaps, p-QR performs better than p-Kmeans. This might be explained by the fact that p-QR exploits the special structure of the eigenvector matrix and is therefore more effective. As a less thorough comparison with the information bottleneck method used in [10]: there, 15 runs of NG2/NG9/NG10/NG15/NG18 (100) gave a mean accuracy of 56.67% with a maximum accuracy of 67.00%, and 15 runs of the 10-newsgroup set with 50 samples gave a mean accuracy of 35.00% with a maximum accuracy of about 40.00%.\n\nEXAMPLE 3. We compare against the lower bound given in (4). We list only a typical sample from NG2/NG9/NG10/NG15/NG18 (50). The column labeled "NG labels" indicates clustering using the newsgroup labels and by definition has 100% accuracy. It is quite clear that the newsgroup categories are not completely captured by the sum-of-squares cost function, because p-QR and "NG labels" both have higher accuracy but also larger sum-of-squares values. Interestingly, it seems that p-QR captures some of the information in the newsgroup categories.\n\n  | p-QR | p-Kmeans | K-means | NG labels | lower bound\naccuracy | 86.80% | 83.60% | 57.60% | 100% | N/A\nss(Π) | 224.1110 | 223.8966 | 228.8416 | 224.4040 | 219.0266\n\nAcknowledgments\n\nThis work was supported in part by NSF grant CCR-9901986 and by the Department of Energy through an LBL LDRD fund.\n\nReferences\n\n[1] P. S. Bradley and U. M. Fayyad. (1998). Refining Initial Points for K-Means Clustering. Proc. 15th International Conf. on Machine Learning, 91-99.\n\n[2] P. S. Bradley, K. Bennett and A. Demiriz. Constrained K-means Clustering. 
Microsoft Research Technical Report MSR-TR-2000-65, 2000.\n\n[3] M. Girolami. (2001). Mercer Kernel Based Clustering in Feature Space. To appear in IEEE Transactions on Neural Networks.\n\n[4] G. Golub and C. Van Loan. (1996). Matrix Computations. Johns Hopkins University Press, 3rd edition.\n\n[5] Ming Gu, Hongyuan Zha, Chris Ding, Xiaofeng He and Horst Simon. (2001). Spectral Embedding for K-Way Graph Clustering. Technical Report CSE-01-007, Department of Computer Science and Engineering, Pennsylvania State University.\n\n[6] J. A. Hartigan and M. A. Wong. (1979). A K-means Clustering Algorithm. Applied Statistics, 28:100-108.\n\n[7] L. Lovász and M. D. Plummer. (1986). Matching Theory. Amsterdam: North Holland.\n\n[8] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.\n\n[9] B. Schölkopf, A. Smola and K.-R. Müller. (1998). Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10:1299-1319.\n\n[10] N. Slonim and N. Tishby. (2000). Document clustering using word clusters via the information bottleneck method. Proceedings of SIGIR-2000.\n\n[11] G. W. Stewart and J.-G. Sun. (1990). Matrix Perturbation Theory. Academic Press, San Diego, CA.\n", "award": [], "sourceid": 1992, "authors": [{"given_name": "Hongyuan", "family_name": "Zha", "institution": null}, {"given_name": "Xiaofeng", "family_name": "He", "institution": null}, {"given_name": "Chris", "family_name": "Ding", "institution": null}, {"given_name": "Ming", "family_name": "Gu", "institution": null}, {"given_name": "Horst", "family_name": "Simon", "institution": null}]}