{"title": "Learning A Structured Optimal Bipartite Graph for Co-Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 4129, "page_last": 4138, "abstract": "Co-clustering methods have been widely applied to document clustering and gene expression analysis. These methods make use of the duality between features and samples such that the co-occurring structure of sample and feature clusters can be extracted. In graph based co-clustering methods, a bipartite graph is constructed to depict the relation between features and samples. Most existing co-clustering methods conduct clustering on the graph achieved from the original data matrix, which doesn\u2019t have explicit cluster structure, thus they require a post-processing step to obtain the clustering results. In this paper, we propose a novel co-clustering method to learn a bipartite graph with exactly k connected components, where k is the number of clusters. The new bipartite graph learned in our model approximates the original graph but maintains an explicit cluster structure, from which we can immediately get the clustering results without post-processing. Extensive empirical results are presented to verify the effectiveness and robustness of our model.", "full_text": "Learning A Structured Optimal Bipartite Graph\n\nfor Co-Clustering\n\nFeiping Nie1, Xiaoqian Wang2, Cheng Deng3, Heng Huang2\u2217\n\n1 School of Computer Science, Center for OPTIMAL, Northwestern Polytechnical University, China\n\n2 Department of Electrical and Computer Engineering, University of Pittsburgh, USA\n\n3 School of Electronic Engineering, Xidian University, China\nfeipingnie@gmail.com,xqwang1991@gmail.com\n\nchdeng@mail.xidian.edu.cn,heng.huang@pitt.edu\n\nAbstract\n\nCo-clustering methods have been widely applied to document clustering and gene\nexpression analysis. 
These methods make use of the duality between features and\nsamples such that the co-occurring structure of sample and feature clusters can be\nextracted. In graph based co-clustering methods, a bipartite graph is constructed\nto depict the relation between features and samples. Most existing co-clustering\nmethods conduct clustering on the graph achieved from the original data matrix,\nwhich doesn\u2019t have explicit cluster structure, thus they require a post-processing\nstep to obtain the clustering results. In this paper, we propose a novel co-clustering\nmethod to learn a bipartite graph with exactly k connected components, where k is\nthe number of clusters. The new bipartite graph learned in our model approximates\nthe original graph but maintains an explicit cluster structure, from which we can\nimmediately get the clustering results without post-processing. Extensive empirical\nresults are presented to verify the effectiveness and robustness of our model.\n\n1\n\nIntroduction\n\nClustering has long been a fundamental topic in unsupervised learning. The goal of clustering is to\npartition data into different groups. Clustering methods have been successfully applied to various\nareas, such as document clustering [3, 17], image segmentation [18, 7, 8] and bioinformatics [16, 14].\nIn clustering problems, the input data is usually formatted as a matrix, where one dimension represents\nsamples and the other denotes features. Each sample can be seen as a data point characterized by\na vector in the feature space. Alternatively, each feature can be regarded as a vector spanning in\nthe sample space. Traditional clustering methods propose to cluster samples according to their\ndistribution on features, or conversely, cluster features in terms of their distribution on samples.\nIn several types of data, such as document data and gene expression data, duality exists between\nsamples and features. 
For example, in document data, we can reasonably assume that documents can be clustered based on their relations with different word clusters, while word clusters are formed according to their associations with distinct document clusters. However, in the one-sided clustering mechanism, the duality between samples and features is not taken into consideration. To make full use of the duality information, co-clustering methods (also known as bi-clustering methods) have been proposed. The co-clustering mechanism takes advantage of the co-occurring cluster structure among features and samples to strengthen the clustering performance and gain a better interpretation of the pragmatic meaning of the clusters.

*This work was partially supported by U.S. NSF-IIS 1302675, NSF-IIS 1344152, NSF-DBI 1356628, NSF-IIS 1619308, NSF-IIS 1633753, NIH AG049371.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Several co-clustering methods have been put forward to depict the relations between samples and features. In graph based methods, the co-occurring structure between samples and features is usually treated as a bipartite graph, where the weights of edges indicate the relations between sample-feature pairs. In the left part of Fig. 1 we show an illustration of such a bipartite graph, where the blue nodes on the left represent features while the red nodes on the right show samples. The affinity between a feature and a sample is denoted by the weight of the corresponding edge. For example, B_{ij} denotes the affinity between the i-th feature and the j-th sample. In [4], the authors propose to minimize the cut between samples and features, which is equivalent to conducting spectral clustering on the bipartite graph. 
However, in this method, since the original graph does not display an explicit cluster structure, it still calls for a post-processing step such as K-means clustering to obtain the final clustering indicators, which may not be optimal.

To address this problem, in this paper we propose a novel graph based co-clustering model to learn a bipartite graph with exactly k connected components, where k is the number of clusters. The new bipartite graph learned in our model approximates the original graph but maintains an explicit cluster structure, from which we can directly get the clustering results without post-processing steps. To achieve such an ideal structure of the new bipartite graph, we impose constraints on the rank of its Laplacian or normalized Laplacian matrix and derive algorithms to optimize the objective. We conduct several experiments to evaluate the effectiveness and robustness of our model. On both synthetic and benchmark datasets we obtain equivalent or even better clustering results than other related methods.

Notations: Throughout the paper, all matrices are written in uppercase. For a matrix M, the ij-th element of M is denoted by m_{ij}. The trace of M is denoted by Tr(M). The ℓ2-norm of a vector v is denoted by ||v||_2, and the Frobenius norm of M is denoted by ||M||_F.

2 Bipartite Spectral Graph Partitioning Revisited

The classic Bipartite Spectral Graph Partitioning (BSGP) method [4] is very effective for co-clustering. In order to simultaneously partition the rows and columns of a data matrix B ∈ R^{n1×n2}, we first view B as the weight matrix of a bipartite graph, where the left-side nodes are the n1 rows of B, the right-side nodes are the n2 columns of B, and the weight connecting the i-th left-side node and the j-th right-side node is b_{ij} (see Fig. 1). 
The procedure of BSGP is as follows:

1) Calculate Ã = Du^{-1/2} B Dv^{-1/2}, where the diagonal matrices Du and Dv are defined in Eq. (6).
2) Calculate U and V, which are the leading k left and right singular vectors of Ã, respectively.
3) Run K-means on the rows of F defined in Eq. (6) to obtain the final clustering results.

The bipartite graph can be viewed as an undirected weighted graph G = {V, A} with n = n1 + n2 nodes, where V is the node set and the affinity matrix A ∈ R^{n×n} is

    A = [ 0    B
          B^T  0 ]                                              (1)

In the following, we will show that the BSGP method essentially performs spectral clustering with normalized cut on the graph G.

Suppose the graph G is partitioned into k components V = {V1, V2, ..., Vk}. According to spectral clustering, the normalized cut on the graph G = {V, A} is defined as

    Ncut = \sum_{i=1}^{k} cut(Vi, V\Vi) / assoc(Vi, V)          (2)

where cut(Vi, V\Vi) = \sum_{i∈Vi, j∈V\Vi} a_{ij} and assoc(Vi, V) = \sum_{i∈Vi, j∈V} a_{ij}.

Let Y ∈ R^{n×k} be the partition indicator matrix, i.e., y_{ij} = 1 indicates that the i-th node is partitioned into the j-th component. Then minimizing the normalized cut defined in Eq. (2) can be rewritten as the following problem:

    min_Y \sum_{i=1}^{k} (y_i^T L y_i) / (y_i^T D y_i)          (3)

Figure 1: Illustration of the structured optimal bipartite graph.

where y_i is the i-th column of Y, L = D − A ∈ R^{n×n} is the Laplacian matrix, and D ∈ R^{n×n} is the diagonal degree matrix defined as d_{ii} = \sum_j a_{ij}.

Let Z = Y(Y^T D Y)^{-1/2}, and denote the identity matrix by I; then problem (3) can be rewritten as

    min_{Z^T D Z = I} Tr(Z^T L Z)                               (4)

Further, denote F = D^{1/2} Z = D^{1/2} Y (Y^T D Y)^{-1/2}; then problem (4) can be rewritten as

    min_{F^T F = I} Tr(F^T L̃ F)                                (5)

where L̃ = I − D^{-1/2} A D^{-1/2} is the normalized Laplacian matrix.

We rewrite F and D as the following block matrices:

    F = [ U          D = [ Du  0
          V ],             0   Dv ]                             (6)

where U ∈ R^{n1×k}, V ∈ R^{n2×k}, Du ∈ R^{n1×n1}, Dv ∈ R^{n2×n2}.

Then according to the definition of A in Eq. (1), problem (5) can be further rewritten as

    max_{U^T U + V^T V = I} Tr(U^T Du^{-1/2} B Dv^{-1/2} V)     (7)

Note that in addition to the constraint U^T U + V^T V = I, U and V should be constrained to take discrete values according to the definitions of U and V. This discrete constraint makes the problem very difficult to solve. To address it, we first remove the discrete constraint to make problem (7) solvable with Lemma 1, and then run K-means on U and V to get the discrete solution.

Lemma 1 Suppose M ∈ R^{n1×n2}, X ∈ R^{n1×k}, Y ∈ R^{n2×k}. The optimal solutions to the problem

    max_{X^T X + Y^T Y = I} Tr(X^T M Y)                         (8)

are X = (√2/2) U1 and Y = (√2/2) V1, where U1, V1 are the leading k left and right singular vectors of M, respectively.

Proof: The Lagrangian function of the problem is L(X, Y, Λ) = Tr(X^T M Y) − Tr(Λ(X^T X + Y^T Y − I)). Setting the derivative of L(X, Y, Λ) w.r.t. X to zero gives M Y = XΛ; setting the derivative w.r.t. Y to zero gives M^T X = Y Λ. Thus M M^T X = M Y Λ = XΛ^2. Therefore, the optimal solution X should consist of eigenvectors of M M^T, i.e., left singular vectors of M. Similarly, the optimal solution Y should consist of right singular vectors of M. Since it is a maximization problem, the optimal solutions X, Y should be the leading k left and right singular vectors of M, respectively. □

According to Lemma 1, if the discrete constraint on U and V is not considered, the optimal solutions U and V to problem (7) are the leading k left and right singular vectors of Ã = Du^{-1/2} B Dv^{-1/2}, respectively. Since the solutions U and V are not discrete values, we need to run K-means on the rows of F defined in Eq. (6) to obtain the final clustering results.

3 Learning Structured Optimal Bipartite Graph for Co-Clustering

3.1 Motivation

We can see from the previous section that the given B or A does not have a very clear clustering structure (i.e., A is not a block diagonal matrix under any permutation) and that U and V are not discrete values, so we need to run K-means to obtain the final clustering results. 
However, K-means is very sensitive to the initialization, which makes the clustering performance unstable and suboptimal.

To address this challenging and fundamental problem, we aim to learn a new graph similarity matrix S ∈ R^{n×n} or P ∈ R^{n1×n2} of the form

    S = [ 0    P
          P^T  0 ]                                              (9)

such that the new graph is more suitable for the clustering task. In our strategy, we learn an S that has exactly k connected components, see Fig. 1. Obviously such a new graph can be considered an ideal graph for the clustering task, since it provides a clear clustering structure. If S has exactly k connected components, we can directly obtain the final clustering result based on S, without running K-means or other discretization procedures as traditional graph based clustering methods have to do.

The learned structured optimal graph similarity matrix S should be as close as possible to the given graph affinity matrix A, so we propose to solve the following problem:

    min_{P≥0, P1=1, S∈Ω} ||S − A||_F^2                          (10)

where Ω is the set of matrices S ∈ R^{n×n} that have exactly k connected components.

According to the special structure of A and S in Eq. (1) and Eq. (9), problem (10) can be written as

    min_{P≥0, P1=1, S∈Ω} ||P − B||_F^2                          (11)

The problem (11) seems very difficult to solve since the constraint S ∈ Ω is intractable to handle. 
In the next subsection, we will propose a novel and efficient algorithm to solve this problem.

3.2 Optimization

If the similarity matrix S is nonnegative, then the Laplacian matrix L_S = D_S − S associated with S has an important property as follows [13, 12, 11, 2].

Theorem 1 The multiplicity k of the eigenvalue 0 of the Laplacian matrix L_S is equal to the number of connected components in the graph associated with S.

Theorem 1 indicates that if rank(L_S) = n − k, the constraint S ∈ Ω will hold. Therefore, problem (11) can be rewritten as:

    min_{P≥0, P1=1, rank(L_S)=n−k} ||P − B||_F^2                (12)

Suppose σ_i(L_S) is the i-th smallest eigenvalue of L_S. Note that σ_i(L_S) ≥ 0 because L_S is positive semi-definite. Problem (12) is equivalent to the following problem for a large enough λ:

    min_{P≥0, P1=1} ||P − B||_F^2 + λ \sum_{i=1}^{k} σ_i(L_S)   (13)

When λ is large enough (note that σ_i(L_S) ≥ 0 for every i), the optimal solution S to problem (13) will make the second term \sum_{i=1}^{k} σ_i(L_S) zero, and thus the constraint rank(L_S) = n − k in problem (12) will be satisfied.

According to Ky Fan's theorem [6], we have:

    \sum_{i=1}^{k} σ_i(L_S) = min_{F∈R^{n×k}, F^T F=I} Tr(F^T L_S F)   (14)

Therefore, problem (13) is further equivalent to the following problem

    min_{P,F} ||P − B||_F^2 + λ Tr(F^T L_S F)
    s.t. P ≥ 0, P1 = 1, F ∈ R^{n×k}, F^T F = I                  (15)

The problem (15) is much easier to solve compared with the rank constrained problem (12). 
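Ky Fan's theorem in Eq. (14) is easy to verify numerically; the following sketch uses a random symmetric nonnegative similarity matrix as an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
S = rng.random((n, n))
S = (S + S.T) / 2                          # symmetric nonnegative similarities
L = np.diag(S.sum(axis=1)) - S             # Laplacian L_S = D_S - S

eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
F = eigvecs[:, :k]                         # eigenvectors of the k smallest eigenvalues

# Ky Fan: min_{F^T F = I} Tr(F^T L F) equals the sum of the k smallest eigenvalues,
# attained by the bottom-k eigenvectors chosen above.
assert np.isclose(np.trace(F.T @ L @ F), eigvals[:k].sum())

# Any other feasible F' (orthonormal columns) can only give a larger trace.
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
assert np.trace(Q.T @ L @ Q) >= eigvals[:k].sum() - 1e-9
```

This is exactly why the F-update of the alternating scheme below reduces to an eigen-decomposition.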
We can apply the alternating optimization technique to solve this problem.

When P is fixed, problem (15) becomes:

    min_{F∈R^{n×k}, F^T F=I} Tr(F^T L_S F)                      (16)

The optimal solution F is formed by the k eigenvectors of L_S corresponding to the k smallest eigenvalues.

When F is fixed, problem (15) becomes

    min_{P≥0, P1=1} ||P − B||_F^2 + λ Tr(F^T L_S F)             (17)

According to the property of the Laplacian matrix, we have the following relationship:

    Tr(F^T L_S F) = (1/2) \sum_{i=1}^{n} \sum_{j=1}^{n} ||f_i − f_j||_2^2 s_{ij}   (18)

where f_i is the i-th row of F. Thus, according to the structure of S defined in Eq. (9), Eq. (18) can be rewritten as

    Tr(F^T L_S F) = \sum_{i=1}^{n1} \sum_{j=1}^{n2} ||f_i − f_j||_2^2 p_{ij}       (19)

Based on Eq. (19), problem (17) can be rewritten as

    min_{P≥0, P1=1} \sum_{i=1}^{n1} \sum_{j=1}^{n2} (p_{ij} − b_{ij})^2 + λ ||f_i − f_j||_2^2 p_{ij}   (20)

Note that problem (20) is independent for different i, so we can solve the following problem individually for each i. Denote v_{ij} = ||f_i − f_j||_2^2, and denote v_i as the vector whose j-th element is v_{ij} (similarly for p_i and b_i); then for each i, problem (20) can be written in vector form as

    min_{p_i^T 1 = 1, p_i ≥ 0} || p_i − (b_i − (λ/2) v_i) ||_2^2   (21)

This problem can be solved by an efficient iterative algorithm [9].

The detailed algorithm to solve problem (15) is summarized in Algorithm 1. In the algorithm, we can update only the m nearest similarities for each data point in P, so the complexity of updating P and updating F (which only needs the top k eigenvectors of a very sparse matrix) can be reduced significantly. 
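The row-wise subproblem (21) is a Euclidean projection onto the probability simplex. The paper solves it with the iterative algorithm of [9]; as a hedged stand-in, the standard sorting-based closed-form projection gives the same result, and the values of b_i, v_i and λ below are illustrative:

```python
import numpy as np

def simplex_projection(c):
    """Euclidean projection of c onto {p : p >= 0, sum(p) = 1}
    (sorting-based closed form; a stand-in for the iterative solver of [9])."""
    u = np.sort(c)[::-1]                                   # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(c)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(c + tau, 0.0)

# Row-wise update of P for problem (21): p_i = Proj_simplex(b_i - (lambda/2) v_i).
b_i = np.array([0.4, 0.3, 0.2, 0.1])       # illustrative i-th row of B
v_i = np.array([0.0, 0.5, 1.0, 2.0])       # v_ij = ||f_i - f_j||^2 from the current F
lam = 0.5
p_i = simplex_projection(b_i - lam / 2 * v_i)
assert np.isclose(p_i.sum(), 1.0) and (p_i >= 0).all()
```

Large v_ij (distant embedding rows) drive the corresponding p_ij toward zero, which is what sparsifies the learned graph.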
Nevertheless, Algorithm 1 needs to conduct an eigen-decomposition of an n × n (n = n1 + n2) matrix in each iteration, which is time consuming. In the next section, we will propose another optimization algorithm, which only needs to conduct an SVD of an n1 × n2 matrix in each iteration, and thus is much more efficient than Algorithm 1.

Algorithm 1 Algorithm to solve the problem (15).
input: B ∈ R^{n1×n2}, cluster number k, a large enough λ.
output: P ∈ R^{n1×n2} and thus S ∈ R^{n×n} defined in Eq. (9) with exactly k connected components.
Initialize F ∈ R^{n×k}, formed by the k eigenvectors of L = D − A corresponding to the k smallest eigenvalues, where A is defined in Eq. (1).
while not converged do
  1. For each i, update the i-th row of P by solving problem (21), where the j-th element of v_i is v_{ij} = ||f_i − f_j||_2^2.
  2. Update F, formed by the k eigenvectors of L_S = D_S − S corresponding to the k smallest eigenvalues.
end while

4 Speed Up the Model

If the similarity matrix S is nonnegative, then the normalized Laplacian matrix L̃_S = I − D_S^{-1/2} S D_S^{-1/2} associated with S also has an important property as follows [11, 2].

Theorem 2 The multiplicity k of the eigenvalue 0 of the normalized Laplacian matrix L̃_S is equal to the number of connected components in the graph associated with S.

Theorem 2 indicates that if rank(L̃_S) = n − k, the constraint S ∈ Ω will hold. Therefore, problem (11) can also be rewritten as

    min_{P≥0, P1=1, rank(L̃_S)=n−k} ||P − B||_F^2               (22)

Similarly, problem (22) is equivalent to the following problem for a large enough value of λ:

    min_{P,F} ||P − B||_F^2 + λ Tr(F^T L̃_S F)
    s.t. P ≥ 0, P1 = 1, F ∈ R^{n×k}, F^T F = I                  (23)

Again, we can apply the alternating optimization technique to solve problem (23).

When P is fixed, since L̃_S = I − D_S^{-1/2} S D_S^{-1/2}, problem (23) becomes

    max_{F∈R^{n×k}, F^T F=I} Tr(F^T D_S^{-1/2} S D_S^{-1/2} F)  (24)

We rewrite F and D_S as the following block matrices:

    F = [ U        D_S = [ D_Su  0
          V ],             0     D_Sv ]                         (25)

where U ∈ R^{n1×k}, V ∈ R^{n2×k}, D_Su ∈ R^{n1×n1}, D_Sv ∈ R^{n2×n2}.

Then according to the definition of S in Eq. (9), problem (24) can be further rewritten as

    max_{U^T U + V^T V = I} Tr(U^T D_Su^{-1/2} P D_Sv^{-1/2} V) (26)

According to Lemma 1, the optimal solutions U and V to problem (26) are the leading k left and right singular vectors of S̃ = D_Su^{-1/2} P D_Sv^{-1/2}, respectively.

When F is fixed, problem (23) becomes

    min_P ||P − B||_F^2 + λ Tr(F^T L̃_S F)
    s.t. P ≥ 0, P1 = 1                                          (27)

According to the property of the normalized Laplacian matrix, we have the following relationship:

    Tr(F^T L̃_S F) = (1/2) \sum_{i=1}^{n} \sum_{j=1}^{n} || f_i/√d_i − f_j/√d_j ||_2^2 s_{ij}

Thus, according to the structure of S defined in Eq. (9), and denoting v_{ij} = || f_i/√d_i − f_j/√d_j ||_2^2, problem (27) can be rewritten as

    min_{P≥0, P1=1} \sum_{i=1}^{n1} \sum_{j=1}^{n2} (p_{ij} − b_{ij})^2 + λ v_{ij} p_{ij}   (28)

which has the same form as Eq. (20) and thus can be solved efficiently.

The detailed algorithm to solve problem (23) is summarized in Algorithm 2. In the algorithm, we can also update only the m nearest similarities for each data point in P, so the complexity of updating P and updating F can be reduced significantly.

Note that Algorithm 2 only needs to conduct an SVD of an n1 × n2 matrix in each iteration. In some cases, min(n1, n2) ≪ n1 + n2, so Algorithm 2 is much more efficient than Algorithm 1. Therefore, in the next section, we use Algorithm 2 to conduct the experiments.

Algorithm 2 Algorithm to solve the problem (23).
input: B ∈ R^{n1×n2}, cluster number k, a large enough λ.
output: P ∈ R^{n1×n2} and thus S ∈ R^{n×n} defined in Eq. (9) with exactly k connected components.
Initialize F ∈ R^{n×k}, formed by the k eigenvectors of L̃ = I − D^{-1/2} A D^{-1/2} corresponding to the k smallest eigenvalues, where A is defined in Eq. (1).
while not converged do
  1. For each i, update the i-th row of P by solving problem (21), where the j-th element of v_i is v_{ij} = || f_i/√d_i − f_j/√d_j ||_2^2.
  2. Update F = [U; V], where U and V are the leading k left and right singular vectors of S̃ = D_Su^{-1/2} P D_Sv^{-1/2}, respectively, and D_S is as in Eq. (25).
end while

5 Experimental Results

In this section, we conduct multiple experiments to evaluate our model. 
We will first introduce the experimental settings used throughout the section and then present evaluation results on both synthetic and benchmark datasets.

5.1 Experimental Settings

We compared our method (denoted by SOBG) with two related co-clustering methods: Bipartite Spectral Graph Partitioning (BSGP) [4] and Orthogonal Nonnegative Matrix Tri-Factorizations (ONMTF) [5]. We also included several one-sided clustering methods in the comparison: K-means clustering, Normalized Cut (NCut) and Nonnegative Matrix Factorization (NMF).

For methods requiring a similarity graph as input, i.e., NCut and NMF, we adopted the self-tuning Gaussian method [19] to construct the graph, where the number of neighbors was set to 5 and the σ value was self-tuned. In the experiments, four methods involve K-means clustering: K-means, NCut, BSGP and ONMTF (the latter three need K-means as a post-processing step to get the clustering results). When running K-means we used 100 random initializations for all four methods and recorded both the average performance over these 100 runs and the best run with respect to the K-means objective function value.

In our method, to accelerate the algorithmic procedure, we determined the parameter λ in a heuristic way: we first specified λ with an initial guess; then, in each iteration, we computed the number of zero eigenvalues of L̃_S: if it was larger than k, we divided λ by 2; if it was smaller than k, we multiplied λ by 2; otherwise we stopped the iteration.

The number of clusters was set to the ground truth. The evaluation of the different methods was based on the percentage of correctly clustered samples, i.e., clustering accuracy.

5.2 Results on Synthetic Data

In this subsection, we first apply our method to the synthetic data as a sanity check. 
The synthetic data is constructed as a two-dimensional matrix whose rows and columns come from three clusters, respectively. Row clusters and column clusters maintain mutual dependence, i.e., rows and columns from the first cluster form a block along the diagonal of the data matrix, and the same holds for the second and third clusters. The number of rows per cluster is 20, 30 and 40, respectively, while the number of columns is 30, 40 and 50. Each block is generated randomly with elements i.i.d. sampled from the Gaussian distribution N(0, 1). Also, we add noise to the "non-block" area of the data matrix, i.e., all entries in the matrix excluding the elements of the three clusters. The noise can be denoted as r · δ, where δ is Gaussian noise i.i.d. sampled from N(0, 1) and r is the portion of noise. We set r to {0.6, 0.7, 0.8, 0.9} respectively so as to evaluate the robustness of the different methods under various levels of disturbance.

[Figure 2 consists of eight panels, (a)-(h), one pair of matrices per noise level in {0.6, 0.7, 0.8, 0.9}.]

Figure 2: Illustration of the data matrix in different settings of noise. Different rows of figures come from different settings of noise. In each row, the figures in the left column are the original data matrices generated in the experiment, while the right column displays the bipartite matrix B learned in our model, which approximates the original data matrix and maintains the block structure.

We apply all comparing methods to the synthetic data and assess their ability to cluster the rows and columns. One-sided clustering methods are applied to the data twice (once to cluster rows and once to cluster columns) so that clustering accuracy on both dimensions can be obtained. Co-clustering methods obtain clustering results on both dimensions simultaneously in one run.

In Table 1 we summarize the clustering accuracy comparison on both rows and columns under different settings of noise. In Fig. 2 we display the corresponding original data matrices and the bipartite matrix B learned in our model.

Table 1: Clustering accuracy comparison on rows and columns of the synthetic data under different portions of noise.

Clustering accuracy (%) on rows:
Methods  | Noise=0.6 | Noise=0.7 | Noise=0.8 | Noise=0.9
K-means  | 99.17     | 97.50     | 71.67     | 39.17
NCut     | 99.17     | 95.00     | 46.67     | 38.33
NMF      | 98.33     | 95.00     | 46.67     | 37.50
BSGP     | 100.00    | 93.33     | 62.50     | 40.00
ONMTF    | 99.17     | 97.50     | 71.67     | 39.17
SOBG     | 100.00    | 100.00    | 98.33     | 84.17

Clustering accuracy (%) on columns:
Methods  | Noise=0.6 | Noise=0.7 | Noise=0.8 | Noise=0.9
K-means  | 100.00    | 95.56     | 51.11     | 46.67
NCut     | 100.00    | 91.11     | 60.00     | 38.89
NMF      | 100.00    | 90.00     | 47.78     | 37.78
BSGP     | 100.00    | 93.33     | 63.33     | 46.67
ONMTF    | 100.00    | 95.56     | 51.11     | 46.67
SOBG     | 100.00    | 100.00    | 100.00    | 87.78

We can notice that when the portion of noise r is relatively low, i.e., 0.6 and 0.7, the block structure of the original data is clear, and all methods perform fairly well in clustering both rows and columns. 
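The synthetic data construction described above can be sketched as follows; this is a minimal version with the stated block sizes, and the RNG seed is an arbitrary assumption:

```python
import numpy as np

def make_synthetic(r, rng):
    """Block-diagonal co-cluster matrix: row blocks of 20/30/40, column blocks of
    30/40/50, N(0,1) blocks, with r * N(0,1) noise in the non-block area."""
    rows, cols = [20, 30, 40], [30, 40, 50]
    n1, n2 = sum(rows), sum(cols)
    X = r * rng.standard_normal((n1, n2))       # noise everywhere outside the blocks
    i = j = 0
    row_labels, col_labels = [], []
    for c, (nr, nc) in enumerate(zip(rows, cols)):
        X[i:i + nr, j:j + nc] = rng.standard_normal((nr, nc))  # c-th diagonal block
        row_labels += [c] * nr
        col_labels += [c] * nc
        i, j = i + nr, j + nc
    return X, np.array(row_labels), np.array(col_labels)

X, row_labels, col_labels = make_synthetic(r=0.8, rng=np.random.default_rng(2))
assert X.shape == (90, 120)
```

Row and column labels are returned so clustering accuracy can be computed on both dimensions, as in Table 1.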
However, as r increases, the block structure in the original data blurs, which brings obstacles to the clustering task. With a high portion of noise, all other methods are disturbed to a large extent, while our method shows apparent robustness. Even when the portion of noise becomes as high as 0.9, so that the cluster structure of the original data is hard to distinguish by eye, our method still recovers a reasonable block arrangement with a clustering accuracy of over 80%. Also, we can find that co-clustering methods usually outperform one-sided clustering methods since they utilize the interrelations between rows and columns. The interpretation of the co-clustering structure strengthens the performance, which conforms to our theoretical analysis.

Table 2: Clustering accuracy (%) comparison on four benchmark datasets. For the four methods involving K-means clustering, i.e., K-means, NCut, BSGP and ONMTF, both the average performance (Ave) over 100 repetitions and the best run (Best) w.r.t. the K-means objective function value are reported.

Methods        | Reuters21578 | LUNG        | Prostate-MS | prostateCancerPSA410
K-means (Ave)  | 40.86±4.59   | 61.91±6.00  | 46.47±3.26  | 64.15±9.40
K-means (Best) | 32.77        | 71.43       | 45.34       | 62.92
NCut (Ave)     | 26.92±0.93   | 69.67±14.26 | 46.86±1.19  | 55.06±0.00
NCut (Best)    | 29.18        | 79.80       | 47.20       | 55.06
NMF            | 30.91        | 75.86       | 47.83       | 55.06
BSGP (Ave)     | 11.44±0.39   | 64.95±5.06  | 46.27±0.00  | 57.30±0.00
BSGP (Best)    | 11.26        | 70.94       | 46.27       | 57.30
ONMTF (Ave)    | 17.57±1.95   | 61.31±10.34 | 45.46±3.18  | 62.92±0.00
ONMTF (Best)   | 27.90        | 71.43       | 45.34       | 62.92
SOBG           | 43.94        | 78.82       | 62.73       | 69.66

5.3 Results on Benchmark Data

In this subsection, we use four benchmark datasets for the evaluation. 
One document dataset and three gene expression datasets participate in the experiment; their properties are described in detail below.

Reuters21578 dataset is processed and downloaded from http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html. It contains 8293 documents in 65 topics. Each document is described by its frequency over 18933 terms.

LUNG dataset [1] provides a source for the study of lung cancer. It has 203 samples in five classes, among which there are 139 adenocarcinoma (AD), 17 normal lung (NL), 6 small cell lung cancer (SMCL), 21 squamous cell carcinoma (SQ) and 20 pulmonary carcinoid (COID) samples. Each sample has 3312 genes.

Prostate-MS dataset [15] contains a total of 332 samples from three different classes: 69 samples diagnosed as prostate cancer, 190 samples of benign prostate hyperplasia, and 63 normal samples showing no evidence of disease. Each sample has 15154 genes.

ProstateCancerPSA410 dataset [10] describes gene information of patients with prostate-specific antigen (PSA)-recurrent prostate cancer. It includes a total of 89 samples from two classes. Each sample has 15154 genes.

Before clustering, feature scaling was performed on each dataset so that features lie on the same scale of [0, 1]. Also, the ℓ2-norm of each feature was normalized to 1.

Table 2 summarizes the clustering accuracy comparison on these benchmark datasets. Our method performs on par with or better than the alternatives on all these datasets, which verifies its effectiveness in practical situations. There is an interesting phenomenon that the advantage of our method tends to be more obvious for higher dimensional data. This is because high-dimensional features make the differences in distances between samples smaller, so the cluster structure of the original data becomes vague. 
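The preprocessing step described above (per-feature scaling to [0, 1] followed by ℓ2-normalization of each feature) can be sketched as follows; the guards against constant or all-zero features are an added assumption:

```python
import numpy as np

def preprocess(X):
    """Scale each feature (column) to [0, 1], then normalize its l2-norm to 1,
    as done before clustering in the experiments."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    X = (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # guard constant features
    norms = np.linalg.norm(X, axis=0)
    return X / np.where(norms > 0, norms, 1.0)       # guard all-zero features

X = preprocess(np.random.default_rng(3).standard_normal((50, 10)) * 7 + 2)
assert X.min() >= 0
assert np.allclose(np.linalg.norm(X, axis=0), 1.0)
```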
In this case, since our model is more robust compared with the\nalternative methods (veri\ufb01ed in the synthetic experiments), we can get better clustering results.\n\n6 Conclusions\n\nIn this paper, we proposed a novel graph based co-clustering model. Different from existing methods\nwhich conduct clustering on the graph achieved from the original data, our model learned a new\nbipartite graph with explicit cluster structure. By imposing the rank constraint on the Laplacian matrix\nof the new bipartite graph, we guaranteed the learned graph to have exactly k connected components,\nwhere k is the number of clusters. From this ideal structure of the new bipartite graph learned in\nour model, the obvious clustering structure can be obtained without resorting to post-processing\nsteps. We presented experimental results on both synthetic data and four benchmark datasets, which\nvalidated the effectiveness and robustness of our model.\n\n9\n\n\fReferences\n[1] A. Bhattacharjee, W. G. Richards, J. Staunton, C. Li, S. Monti, P. Vasa, C. Ladd, J. Beheshti,\nR. Bueno, M. Gillette, et al. Classi\ufb01cation of human lung carcinomas by mrna expression\npro\ufb01ling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of\nSciences, 98(24):13790\u201313795, 2001.\n\n[2] F. R. K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics,\n\nNo. 92, American Mathematical Society, February 1997.\n\n[3] X. Cui and T. E. Potok. Document clustering analysis based on hybrid pso+ k-means algorithm.\n\nJournal of Computer Sciences (special issue), 27:33, 2005.\n\n[4] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In\nProceedings of the seventh ACM SIGKDD international conference on Knowledge discovery\nand data mining, pages 269\u2013274. ACM, 2001.\n\n[5] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for\nclustering. 
In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 126–135. ACM, 2006.

[6] K. Fan. On a theorem of Weyl concerning eigenvalues of linear transformations. I. Proceedings of the National Academy of Sciences, 35(11):652–655, 1949.

[7] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.

[8] M. Gong, Y. Liang, J. Shi, W. Ma, and J. Ma. Fuzzy c-means clustering with local information and kernel metric for image segmentation. Image Processing, IEEE Transactions on, 22(2):573–584, 2013.

[9] J. Huang, F. Nie, and H. Huang. A new simplex sparse learning model to measure data similarity for clustering. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3569–3575, 2015.

[10] Z. Liao and M. W. Datta. A simple computer program for calculating PSA recurrence in prostate cancer patients. BMC Urology, 4(1):8, 2004.

[11] B. Mohar. The Laplacian spectrum of graphs. In Graph Theory, Combinatorics, and Applications, pages 871–898. Wiley, 1991.

[12] F. Nie, X. Wang, and H. Huang. Clustering and projected clustering with adaptive neighbors. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 977–986, 2014.

[13] F. Nie, X. Wang, M. I. Jordan, and H. Huang. The constrained Laplacian rank algorithm for graph-based clustering. In AAAI, pages 1969–1976, 2016.

[14] H.-W. Nützmann and A. Osbourn. Gene clustering in plant specialized metabolism. Current Opinion in Biotechnology, 26:91–99, 2014.

[15] E. F. Petricoin, D. K. Ornstein, C. P. Paweletz, A. Ardekani, P. S. Hackett, B. A. Hitt, A. Velassco, C. Trucco, L. Wiegand, K. Wood, et al. Serum proteomic patterns for detection of prostate cancer. Journal of the National Cancer Institute, 94(20):1576–1578, 2002.

[16] F. Piano, A.
J. Schetter, D. G. Morton, K. C. Gunsalus, V. Reinke, S. K. Kim, and K. J. Kemphues. Gene clustering based on RNAi phenotypes of ovary-enriched genes in C. elegans. Current Biology, 12(22):1959–1964, 2002.

[17] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing & Management, 42(2):373–386, 2006.

[18] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, 2000.

[19] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, 2004.