{"title": "Doubly Stochastic Normalization for Spectral Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1569, "page_last": 1576, "abstract": null, "full_text": "Doubly Stochastic Normalization for Spectral Clustering\n\nRon Zass and Amnon Shashua\n\nAbstract\nIn this paper we focus on the issue of normalization of the affinity matrix in spectral clustering. We show that the difference between N-cuts and Ratio-cuts is in the error measure being used (relative-entropy versus L1 norm) in finding the closest doubly-stochastic matrix to the input affinity matrix. We then develop a scheme for finding the optimal, under Frobenius norm, doubly-stochastic approximation using Von-Neumann's successive projections lemma. The new normalization scheme is simple and efficient and provides superior clustering performance over many of the standardized tests.\n\n1 Introduction\nThe problem of partitioning data points into a number of distinct sets, known as the clustering problem, is central in data analysis and machine learning. Typically, a graph-theoretic approach to clustering starts with a measure of pairwise affinity $K_{ij}$ measuring the degree of similarity between points $x_i$, $x_j$, followed by a normalization step, followed by the extraction of the leading eigenvectors which form an embedded coordinate system from which the partitioning is readily available. In this domain there are three principal dimensions which make a successful clustering: (i) the affinity measure, (ii) the normalization of the affinity matrix, and (iii) the particular clustering algorithm. Common practice indicates that the first two are largely responsible for the performance whereas the particulars of the clustering process itself have a relatively smaller impact on the performance. In this paper we focus on the normalization of the affinity matrix. We first show that the existing popular methods Ratio-cut (cf. 
[1]) and Normalized-cut [7] employ an implicit normalization which corresponds to L1 and relative-entropy based approximations of the affinity matrix K to a doubly stochastic matrix. We then introduce a Frobenius norm (L2) normalization algorithm based on a simple successive projections scheme (based on Von-Neumann's [5] successive projection lemma for finding the closest point in an intersection of sub-spaces) which finds the closest doubly stochastic matrix under the least-squares error norm. We demonstrate the impact of the various normalization schemes on a large variety of data sets and show that the new normalization algorithm often induces a significant performance boost in standardized tests. Taken together, we introduce a new tuning dimension to clustering algorithms allowing better control of the clustering performance.\n\n(Author affiliation: School of Engineering and Computer Science, Hebrew University of Jerusalem, Jerusalem 91904, Israel.)\n\n2 The Role of Doubly Stochastic Normalization\nIt has been shown in the past [11, 4] that K-means and spectral clustering are intimately related; in particular, [11] shows that the popular affinity matrix normalization such as employed by Normalized-cuts is related to a doubly-stochastic constraint induced by K-means. Since this background is key to our work we will briefly introduce the relevant arguments and derivations. Let $x_i ∈ R^N$, i = 1, ..., n, be points arranged in k (mutually exclusive) clusters $ψ_1, ..., ψ_k$ with $n_j$ points in cluster $ψ_j$ and $Σ_j n_j = n$. Let $K_{ij} = κ(x_i, x_j)$ be a symmetric positive-semi-definite affinity function, e.g. $K_{ij} = exp(-‖x_i - x_j‖^2 / σ^2)$. Then, the problem of finding the cluster assignments by maximizing:\n\n$max_{ψ_1,...,ψ_k} Σ_{j=1}^{k} (1/n_j) Σ_{(r,s)∈ψ_j} K_{r,s}$   (1)\n\nis equivalent to minimizing the \"kernel K-means\" problem:\n\n$min_{c_1,...,c_k, ψ_1,...,ψ_k} Σ_{j=1}^{k} Σ_{i∈ψ_j} ‖φ(x_i) - c_j‖^2$,\n\nwhere $φ(x_i)$ is a mapping associated with the kernel $κ(x_i, x_j) = φ(x_i)^T φ(x_j)$ and $c_j = (1/n_j) Σ_{i∈ψ_j} φ(x_i)$ are the class centers. 
After some algebraic manipulations it can be shown that the optimization setup of eqn. 1 is equivalent to the matrix form:\n\n$max_G tr(G^T K G)$ s.t. $G ≥ 0$, $G G^T 1 = 1$, $G^T G = I$   (2)\n\nwhere G is the desired assignment matrix with $G_{ij} = 1/√(n_j)$ if $i ∈ ψ_j$ and zero otherwise, and 1 is a column vector of ones. Note that every matrix in the feasible set defined by the constraints $G ≥ 0$, $G G^T 1 = 1$, $G^T G = I$ is of this form for some partitioning $ψ_1, ..., ψ_k$. Note also that the matrix $F = G G^T$ must be doubly stochastic (F is non-negative, symmetric and F1 = 1). Taken together, we see that the desire is to find a doubly-stochastic matrix F as close as possible to the input matrix K (in the sense that $Σ_{ij} F_{ij} K_{ij}$ is maximized over all feasible F), such that the symmetric decomposition $F = G G^T$ satisfies non-negativity ($G ≥ 0$) and orthonormality constraints ($G^T G = I$). To see the connection with spectral clustering, and N-cuts in particular, relax the non-negativity condition of eqn. 2 and define a two-stage approach: find the closest doubly stochastic matrix K' to K, after which we are left with a spectral decomposition problem:\n\n$max_G tr(G^T K' G)$ s.t. $G^T G = I$   (3)\n\nwhere G contains the leading k eigenvectors of K'. We will refer to the process of transforming K to K' as a normalization step. In N-cuts, the normalization takes the form $K' = D^{-1/2} K D^{-1/2}$ where D = diag(K1) (a diagonal matrix containing the row sums of K) [9]. In [11] it was shown that repeating the N-cuts normalization, i.e., setting up the iterative step $K^{(t+1)} = D^{-1/2} K^{(t)} D^{-1/2}$ where $D = diag(K^{(t)} 1)$ and $K^{(0)} = K$, converges to a doubly-stochastic matrix (a symmetric version of the well known \"iterative proportional fitting procedure\" [8]). The purpose of this brief background is to highlight the motivation for seeking a doubly-stochastic approximation to the input affinity matrix as part of the clustering process. The open question is: under what error measure should the approximation take place? 
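As a concrete illustration of the repeated N-cuts step described above, the following is a minimal NumPy sketch. The function names, the stopping tolerance, and the small RBF example affinity (including its bandwidth) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def ncuts_normalize(K):
    # One N-cuts step: D^{-1/2} K D^{-1/2}, with D = diag(K 1).
    d = K.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    return K * s[:, None] * s[None, :]

def iterate_ncuts(K, tol=1e-10, max_iter=10000):
    # Repeat the N-cuts step until every row sum is within tol of 1.
    # tol and max_iter are assumed values chosen for this sketch.
    X = np.asarray(K, dtype=float)
    for _ in range(max_iter):
        X = ncuts_normalize(X)
        if np.max(np.abs(X.sum(axis=1) - 1.0)) < tol:
            break
    return X

# Hypothetical example data: 8 random 2-D points with an RBF affinity.
rng = np.random.default_rng(0)
pts = rng.normal(size=(8, 2))
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / 4.0)
K_re = iterate_ncuts(K)
```

On this example the rows (and, by symmetry, the columns) of K_re sum to one up to the tolerance, which is the doubly stochastic property the text refers to.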
It is not difficult to show that repeating the N-cuts normalization converges to the global optimum under the relative entropy measure (see Appendix). Noting that spectral clustering optimizes the Frobenius norm, it seems less natural to have the normalization step optimize a relative entropy error measure. We will derive in this paper the normalization under the L1 norm and under the Frobenius norm. The purpose of the L1 norm is to show that the resulting scheme is equivalent to ratio-cut clustering, thereby not introducing a new clustering scheme but contributing to the unification and a better understanding of the differences between the N-cuts and Ratio-cuts schemes. The Frobenius norm normalization is a new formulation and is based on a simple iterative scheme. The resulting normalization proves quite practical and boosts the clustering performance in many of the standardized tests we conducted.\n\n3 Ratio-cut and the L1 Normalization\nGiven that our desire is to find a doubly stochastic approximation K' to the input affinity matrix K, we begin with the L1 norm approximation:\n\nProposition 1 (ratio-cut) The closest doubly stochastic matrix K' under the L1 error norm is\n\n$K' = K - D + I$,\n\nwhich leads to the ratio-cut clustering algorithm, i.e., the partitioning of the data set into two clusters is determined by the second smallest eigenvector of the Laplacian D - K, where D = diag(K1).\n\nProof: Let $r = min_F ‖K - F‖_1$ s.t. $F1 = 1$, $F = F^T$, where $‖A‖_1 = Σ_{ij} |A_{ij}|$ is the L1 norm. Since $‖K - F‖_1 ≥ ‖(K - F)1‖_1$ for any matrix F, we must have:\n\n$r ≥ ‖(K - F)1‖_1 = ‖D1 - 1‖_1 = ‖D - I‖_1$.\n\nLet $F = K - D + I$, then $‖K - (K - D + I)‖_1 = ‖D - I‖_1$. 
If v is an eigenvector of the Laplacian D - K with eigenvalue λ, then v is also an eigenvector of $K' = K - D + I$ with eigenvalue 1 - λ, and since (D - K)1 = 0 the smallest eigenvector v = 1 of the Laplacian is the largest of K', and the second smallest eigenvector of the Laplacian (the ratio-cut result) corresponds to the second largest eigenvector of K'. What we have so far is that the difference between N-cuts and Ratio-cuts as two popular spectral clustering schemes is that the former uses the relative entropy error measure in finding a doubly stochastic approximation to K and the latter uses the L1 norm error measure (which turns out to be the negative Laplacian with an added identity matrix).\n\n4 Normalizing under Frobenius Norm\nGiven that spectral clustering optimizes the Frobenius norm, there is a strong argument in favor of finding a Frobenius-norm optimum doubly stochastic approximation to K. The optimization setup is that of a quadratic linear programming (QLP). However, the special circumstances of our problem reduce the solution of the QLP to a very simple iterative computation, as described next. The closest doubly-stochastic matrix K' under Frobenius norm is the solution to the following QLP:\n\n$K' = argmin_F ‖K - F‖_F^2$ s.t. $F ≥ 0$, $F1 = 1$, $F = F^T$,   (4)\n\nwhere $‖A‖_F^2 = Σ_{ij} A_{ij}^2$ is the squared Frobenius norm. We define next two sub-problems, each with a closed-form solution, and derive our QLP solution by alternating successively between the two until convergence. Consider the affine sub-problem:\n\n$P_1(X) = argmin_F ‖X - F‖_F^2$ s.t. $F1 = 1$, $F = F^T$   (5)\n\nand the convex sub-problem:\n\n$P_2(X) = argmin_F ‖X - F‖_F^2$ s.t. $F ≥ 0$   (6)\n\nWe will use the Von-Neumann [5] successive projection lemma stating that $P_1 P_2 P_1 P_2 ... P_1(K)$ will converge onto the projection of K onto the intersection of the affine and conic subspaces described above (see footnote 1). 
Therefore, what remains to show is that the projections $P_1$ and $P_2$ can be solved efficiently (and in closed form). We begin with the solution for $P_1$. The Lagrangian corresponding to eqn. 5 takes the form:\n\n$L(F, μ_1, μ_2) = trace(F^T F - 2 X^T F) - μ_1^T (F1 - 1) - μ_2^T (F^T 1 - 1)$,\n\nwhere from the condition $F = F^T$ we have that $μ_1 = μ_2 = μ$. Setting the derivative with respect to F to zero yields (absorbing the constant factor into μ):\n\n$F = X + μ 1^T + 1 μ^T$.\n\nIsolate μ by multiplying by 1 on both sides: $μ = (nI + 11^T)^{-1} (I - X) 1$. Noting that $(nI + 11^T)^{-1} = (1/n)(I - (1/(2n)) 11^T)$ we obtain a closed form solution:\n\n$P_1(X) = X + ((1/n) I + (1^T X 1 / n^2) I - (1/n) X) 11^T - (1/n) 11^T X$.   (7)\n\nThe projection $P_2(X)$ can also be described in a simple closed form manner. Let $I_+$ be the set of indices corresponding to non-negative entries of X and $I_-$ the set of indices of negative entries of X. The criterion function $‖X - F‖_F^2$ becomes:\n\n$‖X - F‖_F^2 = Σ_{(i,j)∈I_+} (X_{ij} - F_{ij})^2 + Σ_{(i,j)∈I_-} (X_{ij} - F_{ij})^2$.\n\nClearly, the minimum energy over $F ≥ 0$ is obtained when $F_{ij} = X_{ij}$ for all $(i, j) ∈ I_+$ and zero otherwise. Let $th_0(X)$ stand for the operator that zeroes out all negative entries of X. Then, $P_2(X) = th_0(X)$.\n\nFootnote 1: Actually, the Von-Neumann lemma applies only to linear subspaces. The extension to convex subspaces involves a \"deflection\" component described by Dykstra [3]. However, it is possible to show that for this specific problem the deflection component is redundant and the Von-Neumann lemma still applies.\n\nFigure 1: Running times of the normalization algorithms (x-axis: # of data-points; y-axis: seconds). (a) the Frobenius scheme (legend: Projection) compared to a general Matlab QLP solver (legend: Matlab QP), (b) running time of the three normalization schemes (legend: L1, Frobenius, Relative Entropy).\n\nTo conclude, the global optimum of eqn. 
4 which returns the closest doubly stochastic matrix K' in Frobenius error norm to the input affinity matrix K is obtained by repeating the following steps:\n\nAlgorithm 1 (Frobenius-optimal Doubly Stochastic Normalization) finds the closest doubly stochastic approximation in Frobenius error norm to a given matrix K (global optimum of eqn. 4).\n1. Let $X^{(0)} = K$.\n2. Repeat for t = 0, 1, 2, ...\n(a) $X^{(t+1)} = P_1(X^{(t)})$\n(b) If $X^{(t+1)} ≥ 0$ then stop and set $K' = X^{(t+1)}$, otherwise set $X^{(t+1)} = th_0(X^{(t+1)})$.\n\nThis algorithm is simple and very efficient. Fig. 1a shows the running time of the algorithm compared to an off-the-shelf QLP Matlab solver over random matrices of increasing size; one can see that the run-time of our algorithm is a fraction of that of the standard QLP solver and scales very well with dimension. In fact the standard QLP solver can handle only small problem sizes. In Fig. 1b we plot the running times of all three normalization schemes: the L1 norm (computing the Laplacian), the relative-entropy (the iterative $D^{-1/2} K D^{-1/2}$), and the Frobenius scheme presented in this section. The Frobenius scheme is more efficient than the relative-entropy normalization (which is the least efficient among the three).\n\n5 Experiments\nFor the clustering algorithm into $k ≥ 2$ clusters we experimented with the spectral algorithms described in [10] and [6]. The latter uses the N-cuts normalization $D^{-1/2} K D^{-1/2}$ followed by K-means on the embedded coordinates (the leading k eigenvectors of the normalized affinity) and the former uses a certain discretization scheme to turn the k leading eigenvectors into an indicator matrix. Both algorithms produced similar results, thus we focused on [10] while replacing the normalization with the three schemes presented above. 
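To make the Frobenius scheme among these three concrete, the following is a minimal NumPy sketch of Algorithm 1 using the closed forms of eqn. 7 and the thresholding operator. The function names, the numerical tolerance that replaces the exact non-negativity test, the iteration cap, and the example affinity are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def project_affine(X):
    # P1 (eqn. 7) for symmetric X: entrywise,
    # X_ij + 1/n + (1^T X 1)/n^2 - (X 1)_i / n - (1^T X)_j / n.
    n = X.shape[0]
    r = X.sum(axis=1, keepdims=True)   # X 1, as a column
    c = X.sum(axis=0, keepdims=True)   # 1^T X, as a row
    s = X.sum()                        # 1^T X 1
    return X + (1.0 / n + s / n**2) - r / n - c / n

def frobenius_normalize(K, tol=1e-8, max_iter=10000):
    # Alternate P1 and P2 = th0 as in Algorithm 1.
    # tol and max_iter are assumed values; the paper tests X >= 0 exactly.
    X = np.asarray(K, dtype=float)
    for _ in range(max_iter):
        X = project_affine(X)
        if X.min() >= -tol:
            break
        X = np.maximum(X, 0.0)         # th0: zero out negative entries
    return X

# Hypothetical example data: 8 random 2-D points with an RBF affinity.
rng = np.random.default_rng(0)
pts = rng.normal(size=(8, 2))
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / 2.0)
K_frob = frobenius_normalize(K)
```

The sketch stops right after an affine projection, so the returned matrix has unit row sums and is non-negative only up to the chosen tolerance.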
We refer by \"Ncuts\" to the original normalization $D^{-1/2} K D^{-1/2}$, by \"RE\" to the iterative application of the original normalization (which is proven to converge to a doubly stochastic matrix [11]), by \"L1\" to the L1 doubly-stochastic normalization (which we have shown is equivalent to Ratio-cuts) and by \"Frobenius\" to the iterative Frobenius scheme based on Von-Neumann's lemma described in Section 4. We also included a \"None\" field which corresponds to no normalization being applied.\n\nTable 1: UCI datasets used, together with some characteristics and the best result (lowest error rate, in %) achieved using the different methods.\nDataset | Kernel | k | Size | Dim. | L1 | Frobenius | RE | NCuts | None\nSPECTF heart | RBF | 2 | 267 | 44 | 27.5 | 19.2 | 27.5 | 27.5 | 29.5\nPima | RBF | 2 | 768 | 8 | 36.2 | 35.2 | 34.9 | 35.2 | 35.4\nWine | RBF | 3 | 178 | 13 | 38.8 | 27.0 | 34.3 | 29.2 | 27.5\nSpamBase | RBF | 2 | 4601 | 57 | 36.1 | 30.3 | 37.7 | 31.8 | 30.4\nBUPA | Poly | 2 | 345 | 6 | 37.4 | 37.4 | 41.7 | 41.7 | 37.4\nWDBC | Poly | 2 | 569 | 30 | 18.8 | 11.1 | 37.4 | 37.4 | 18.8\n\nTable 2: Cancer datasets used, together with some characteristics and the best result (lowest error rate, in %) achieved using the different methods.\nDataset | Kernel | k | Size | #PC | L1 | Frobenius | RE | NCuts | None\nLeukemia | Poly | 2 | 72 | 5 | 27.8 | 16.7 | 36.1 | 38.9 | 30.6\nLung | Poly | 2 | 181 | 5 | 15.5 | 9.9 | 16.6 | 15.5 | 15.5\nProstate | RBF | 2 | 136 | 5 | 40.4 | 19.9 | 43.4 | 40.4 | 40.4\nProstate Outcome | RBF | 2 | 21 | 5 | 28.6 | 4.8 | 23.8 | 28.6 | 28.6\n\nWe begin with evaluating the clustering quality obtained under the different normalization methods taken over a number of well studied datasets from the UCI repository (see footnote 2). The data-sets are listed in Table 1 together with some of their characteristics. The best performance is the lowest error rate in each row (set in boldface in the original table). With the first four datasets we used an RBF kernel $exp(-‖x_i - x_j‖^2 / σ^2)$ for the affinity matrix, while for the latter two a polynomial kernel $(x_i^T x_j + 1)^d$ was used. The kernel parameters were calibrated independently for each method and for each dataset. 
In most cases the best performance was obtained with the Frobenius norm approximation, but as a general rule the type of normalization depends on the data. Also worth noting are instances, such as Wine and SpamBase, where RE or Ncuts actually worsen the performance. In that case the RE performance is worse than that of Ncuts, as the entire normalization direction is counter-productive. When RE outperforms None it also outperforms Ncuts (as can be expected since Ncuts is the first step in the iterative scheme of RE). With regard to tuning the affinity measure, we show in Fig. 2 the clustering performance of each dataset under each normalization scheme under varying kernel settings (σ and d values). Generally, the performance of the Frobenius normalization behaves in a smoother manner and is more stable under varying kernel settings than the other normalization schemes. Our next set of experiments was over some well studied cancer data-sets (see footnote 3). The data-sets are listed in Table 2 together with some of their characteristics. The column \"#PC\" refers to the number of principal components used in a PCA pre-processing for the purpose of dimensionality reduction prior to clustering. Note that better results can be achieved when using a more sophisticated pre-processing, but since the focus is on the performance of the clustering algorithms and not on the datasets, we prefer not to use the optimal pre-processing and leave the data noisy. 
\n\nFootnote 2: http://www.ics.uci.edu/~mlearn/MLRepository.html\nFootnote 3: All cancer datasets can be found at http://sdmc.i2r.a-star.edu.sg/rp/\n\nFigure 2: Error rate vs. similarity measure, for the UCI datasets listed in Table 1 (panels: SPECTF, Pima, Wine, SpamBase with x-axis sigma; BUPA, WDBC with x-axis degree; y-axis: % errors). L1 in magenta +; Frobenius in blue o; Relative Entropy in black; and Normalized-Cuts in red.\n\nFigure 3: Error rate vs. similarity measure, for the cancer datasets listed in Table 2 (panels: AML/ALL Leukemia, Lung Cancer with x-axis degree; Prostate, Prostate Outcome with x-axis sigma; y-axis: % errors). L1 in magenta +; Frobenius in blue o; Relative Entropy in black; and Normalized-Cuts in red.\n\nThe AML/ALL Leukemia dataset is a challenging benchmark common in the cancer community, where the task is to distinguish between two types of Leukemia. The original dataset consists of 7129 coordinates probed from 6817 human genes, and we perform PCA to obtain 5 leading principal components prior to clustering using a polynomial kernel. The Lung Cancer (Brigham and Women's Hospital, Harvard Medical School) dataset is another common benchmark that describes 12533 genes sampled from 181 tissues. The task is to distinguish between malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) of the lung. 
The Prostate dataset consists of 12,600 coordinates representing different genes, where the task is to identify prostate samples as tumor or non-tumor. We use the first five principal components as input for clustering using an RBF kernel. The Prostate Outcome dataset uses the same genes from another set of prostate samples, where the task is to predict the clinical outcome (relapse or non-relapse for at least four years). Finally, Fig. 3 shows the clustering performance of each dataset under each normalization scheme under varying kernel settings (σ and d values).\n\n6 Summary\nNormalization of the affinity matrix is a crucial element in the success of spectral clustering. The type of normalization performed by N-cuts is a step towards a doubly-stochastic approximation of the affinity matrix under relative entropy [11]. In this paper we have extended the normalization via doubly-stochasticity in three ways: (i) we have shown that the difference between N-Cuts and Ratio-cuts is in the error measure used to find the closest doubly stochastic approximation to the input affinity matrix, (ii) we have introduced a new normalization scheme based on Frobenius norm approximation; the scheme involves a succession of simple computations, is very simple to implement and is computationally efficient, and (iii) through extensive experimentation on standard data-sets we have shown the importance of normalization to the performance of spectral clustering.\n\nIn the experiments we have conducted the Frobenius normalization had the upper hand in most cases. We have also shown that the relative-entropy normalization is not always the right approach, as in some data-sets the performance worsened after the relative-entropy normalization but never worsened when the Frobenius normalization was applied.\n\nReferences\n[1] P. K. Chan, M. D. F. Schlag, and J. Y. Zien. Spectral k-way ratio-cut partitioning and clustering. 
IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems, 13(9):1088-1096, 1994.\n[2] I. Csiszar. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146-158, 1975.\n[3] R. L. Dykstra. An algorithm for restricted least squares regression. J. of the Amer. Stat. Assoc., 78:837-842, 1983.\n[4] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means, spectral clustering and normalized cuts. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 551-556, Aug. 2004.\n[5] J. Von Neumann. Functional Operators Vol. II. Princeton University Press, 1950.\n[6] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Proceedings of the conference on Neural Information Processing Systems (NIPS), 2001.\n[7] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 2000.\n[8] R. Sinkhorn and P. Knopp. Concerning non-negative matrices and doubly stochastic matrices. Pacific J. Math., 21:343-348, 1967.\n[9] Y. Weiss. Segmentation using eigenvectors: a unifying view. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1999.\n[10] S. X. Yu and J. Shi. Multiclass spectral clustering. In Proceedings of the International Conference on Computer Vision, 2003.\n[11] R. Zass and A. Shashua. A unifying approach to hard and probabilistic clustering. In Proceedings of the International Conference on Computer Vision, Beijing, China, Oct. 2005.\n\nA Normalized Cuts and Relative Entropy Normalization\nThe following proposition is an extension (symmetric version) of the claim about the iterative proportional fitting procedure converging in relative entropy error measure [2]:\n\nProposition 2 The closest doubly-stochastic matrix F under the relative-entropy error measure to a given symmetric matrix K, i.e., which minimizes:\n\n$min_F RE(F‖K)$ s.t. $F ≥ 0$, $F = F^T$, $F1 = 1$, $F^T 1 = 1$\n\nhas the form F = DKD for some (unique) diagonal matrix D.\n\nProof: The Lagrangian of the problem is:\n\n$L(λ, μ) = Σ_{ij} f_{ij} ln(f_{ij} / k_{ij}) + Σ_{ij} k_{ij} - Σ_{ij} f_{ij} - Σ_i λ_i (Σ_j f_{ij} - 1) - Σ_j μ_j (Σ_i f_{ij} - 1)$\n\nThe derivative with respect to $f_{ij}$ is:\n\n$∂L/∂f_{ij} = ln f_{ij} + 1 - ln k_{ij} - 1 - λ_i - μ_j = 0$\n\nfrom which we obtain:\n\n$f_{ij} = e^{λ_i} e^{μ_j} k_{ij}$\n\nLet $D_1 = diag(e^{λ_1}, ..., e^{λ_n})$ and $D_2 = diag(e^{μ_1}, ..., e^{μ_n})$, then we have:\n\n$F = D_1 K D_2$\n\nSince $F = F^T$ and K is symmetric we must have $D_1 = D_2$.\n", "award": [], "sourceid": 3049, "authors": [{"given_name": "Ron", "family_name": "Zass", "institution": null}, {"given_name": "Amnon", "family_name": "Shashua", "institution": null}]}