{"title": "Biclustering Using Message Passing", "book": "Advances in Neural Information Processing Systems", "page_first": 3617, "page_last": 3625, "abstract": "Biclustering is the analog of clustering on a bipartite graph. Existent methods infer biclusters through local search strategies that find one cluster at a time; a common technique is to update the row memberships based on the current column memberships, and vice versa. We propose a biclustering algorithm that maximizes a global objective function using message passing. Our objective function closely approximates a general likelihood function, separating a cluster size penalty term into row- and column-count penalties. Because we use a global optimization framework, our approach excels at resolving the overlaps between biclusters, which are important features of biclusters in practice. Moreover, Expectation-Maximization can be used to learn the model parameters if they are unknown. In simulations, we find that our method outperforms two of the best existing biclustering algorithms, ISA and LAS, when the planted clusters overlap. Applied to three gene expression datasets, our method finds coregulated gene clusters that have high quality in terms of cluster size and density.", "full_text": "Biclustering Using Message Passing\n\nLuke O\u2019Connor\n\nBioinformatics and Integrative Genomics\n\nHarvard University\n\nCambridge, MA 02138\n\nloconnor@g.harvard.edu\n\nSoheil Feizi\n\nElectrical Engineering and Computer Science\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\nsfeizi@mit.edu\n\nAbstract\n\nBiclustering is the analog of clustering on a bipartite graph. Existent methods infer\nbiclusters through local search strategies that \ufb01nd one cluster at a time; a common\ntechnique is to update the row memberships based on the current column member-\nships, and vice versa. We propose a biclustering algorithm that maximizes a global\nobjective function using message passing. 
Our objective function closely approximates a general likelihood function, separating a cluster size penalty term into row- and column-count penalties. Because we use a global optimization framework, our approach excels at resolving the overlaps between biclusters, which are important features of biclusters in practice. Moreover, Expectation-Maximization can be used to learn the model parameters if they are unknown. In simulations, we find that our method outperforms two of the best existing biclustering algorithms, ISA and LAS, when the planted clusters overlap. Applied to three gene expression datasets, our method finds coregulated gene clusters that have high quality in terms of cluster size and density.

1 Introduction

The term biclustering has been used to describe several distinct problem variants. In this paper, we consider the problem of biclustering as a bipartite analogue of clustering: Given an N × M matrix, a bicluster is a subset of rows that are heavily connected to a subset of columns. In this framework, biclustering methods are data mining techniques allowing simultaneous clustering of the rows and columns of a matrix. We suppose there are two possible distributions for edge weights in the bipartite graph: a within-cluster distribution and a background distribution. Unlike in the traditional clustering problem, in our setup, biclusters may overlap, and a node may not belong to any cluster. We emphasize the distinction between biclustering and the bipartite analog of graph partitioning, which might be called bipartitioning.

Biclustering has several noteworthy applications. It has been used to find modules of coregulated genes using microarray gene expression data [1] and to predict tumor phenotypes from their genotypes [2]. It has been used for document classification, clustering both documents and related words simultaneously [3].
In all of these applications, biclusters are expected to overlap with each other, and these overlaps themselves are often of interest (e.g., if one wishes to explore the relationships between document topics).

The biclustering problem is NP-hard (see Proposition 1). However, owing to its practical importance, several heuristic methods using local search strategies have been developed. A popular approach is to search for one bicluster at a time by iteratively assigning rows to a bicluster based on the columns, and vice versa. Two algorithms based on this approach are ISA [4] and LAS [5]. Another approach is an exhaustive search for complete bicliques used by Bimax [6]. This approach fragments large noisy clusters into small complete ones. SAMBA [7] uses a heuristic combinatorial search for locally optimal biclusters, motivated by an exhaustive search algorithm that is exponential in the maximum degree of the nodes. FABIA [8] uses a variational approach to fit a model with fuzzy biclusters. For more details about existent biclustering algorithms, and performance comparisons, see references [6], [9] and [10]. Most of these methods have two shortcomings: first, they apply a local optimality criterion to each bicluster individually. Because a collection of locally optimal biclusters might not be globally optimal, these local methods struggle to resolve overlapping clusters, which arise frequently in many applications. Second, the lack of a well-defined global objective function precludes an analytical characterization of their expected results.

Global optimization methods have been developed for problems closely related to biclustering, including clustering. Unlike most biclustering problem formulations, these are mostly partitioning problems: each node is assigned to one cluster or category.
Major recent progress has been made in the development of spectral clustering methods (see references [11] and [12]) and message-passing algorithms (see [13], [14] and [15]). In particular, Affinity Propagation [14] maximizes the sum of similarities to one central exemplar instead of overall cluster density. Reference [16] uses variational expectation-maximization to fit the latent block model, which is a binary model in which each row or column is assigned to a row or column cluster, and the probability of an edge is dictated by the respective cluster memberships. Row and column clusters are not paired to form biclusters.

In this paper, we propose a message-passing algorithm that searches for a globally optimal collection of possibly overlapping biclusters. Our method maximizes a likelihood function using an approximation that separates a cluster-size penalty term into a row-count penalty and a column-count penalty. This decoupling enables the messages of the max-sum algorithm to be computed efficiently, effectively breaking an intractable optimization into a pair of tractable ones that can be solved in nearly linear time. When the underlying model parameters are unknown, they can be learned using an expectation-maximization approach.

Our approach has several advantages over existing biclustering algorithms: the objective function of our biclustering method has the flexibility to handle diverse statistical models; the max-sum algorithm is a more robust optimization strategy than commonly used iterative approaches; and in particular, our global optimization technique excels at resolving overlapping biclusters. In simulations, our method outperforms two of the best existing biclustering algorithms, ISA and LAS, when the planted clusters overlap.
Applied to three gene expression datasets, our method found biclusters of high quality in terms of cluster size and density.

2 Methods

2.1 Problem statement

Let G = (V, W, E) be a weighted bipartite graph, with vertices V = (1, ..., N) and W = (1, ..., M), connected by edges with non-negative weights: E : V × W → [0, ∞). Let V_1, ..., V_K ⊂ V and W_1, ..., W_K ⊂ W. Let (V_k, W_k) = {(i, j) : i ∈ V_k, j ∈ W_k} be a bicluster: Graph edge weights e_ij are drawn independently from either a within-cluster distribution or a background distribution depending on whether, for some k, i ∈ V_k and j ∈ W_k. In this paper, we assume that the within-cluster and background distributions are homogeneous. However, our formulation can be extended to a general case in which the distributions are row- or column-dependent.

Let c^k_ij be the indicator for i ∈ V_k and j ∈ W_k. Let c_ij ≜ min(1, Σ_k c^k_ij) and let c ≜ (c^k_ij).

Definition 1 (Biclustering Problem). Let G = (V, W, E) be a bipartite graph with biclusters (V_1, W_1), ..., (V_K, W_K), within-cluster distribution f_1 and background distribution f_0. The problem is to find the maximum likelihood cluster assignments (up to reordering):

    ĉ = arg max_c Σ_{(i,j)} c_ij log( f_1(e_ij) / f_0(e_ij) ),    (1)

subject to the consistency constraints c^k_ij = c^k_rs = 1 ⇒ c^k_is = c^k_rj = 1, ∀ i, r ∈ V, ∀ j, s ∈ W.

Figure 1 demonstrates the problem qualitatively for an unweighted bipartite graph. In general, the combinatorial nature of a biclustering problem makes it computationally challenging.

Proposition 1. The clique problem can be reduced to the maximum likelihood problem of Definition 1. Thus, the biclustering problem is NP-hard.

Figure 1: Biclustering is the analogue of clustering on a bipartite graph.
(a) Biclustering allows nodes to be reordered in a manner that reveals modular structures in the bipartite graph. (b) The rows and columns of an adjacency matrix are similarly biclustered and reordered.

Proof. Proof is provided in Supplementary Note 1.

2.2 BCMP objective function

In this section, we introduce the global objective function considered in the proposed biclustering algorithm called Biclustering using Message Passing (BCMP). This objective function approximates the likelihood function of Definition 1. Let l_ij = log( f_1(e_ij) / f_0(e_ij) ) be the log-likelihood ratio score of tuple (i, j). Thus, the likelihood function of Definition 1 can be written as Σ c_ij l_ij. If there were no consistency constraints in the Optimization (1), an optimal maximum likelihood biclustering solution would be to set c_ij = 1 for all tuples with positive l_ij. Our key idea is to enforce the consistency constraints by introducing a cluster-size penalty function and shifting the log-likelihood ratios l_ij to recoup this penalty. Let N_k and M_k be the number of rows and columns, respectively, assigned to cluster k. We have,

    Σ_{(i,j)} c_ij l_ij  ≈(a)  Σ_{(i,j)} c_ij max(0, l_ij + δ) − δ Σ_{(i,j)} c_ij
                         =(b)  Σ_{(i,j)} c_ij max(0, l_ij + δ) + δ Σ_{(i,j)} max(0, −1 + Σ_k c^k_ij) − δ Σ_k N_k M_k
                         ≈(c)  Σ_{(i,j)} c_ij max(0, l_ij + δ) + δ Σ_{(i,j)} max(0, −1 + Σ_k c^k_ij) − (δ/2) Σ_k (r_k N_k² + r_k⁻¹ M_k²).    (2)

The approximation (a) holds when δ is large enough that thresholding l_ij at −δ has little effect on the resulting objective function.
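The right-hand side of approximation (c) is straightforward to evaluate for a candidate assignment c. A minimal numpy sketch (the function name, argument names, and array shapes are our own conventions, not the paper's implementation):

```python
import numpy as np

def approx_objective(l, c, delta, r):
    """Evaluate the decoupled approximation (c) of derivation (2).

    l     : (N, M) array of log-likelihood ratios l_ij.
    c     : (K, N, M) binary array; c[k, i, j] = 1 iff (i, j) is in bicluster k.
    delta : threshold / cluster-size penalty parameter.
    r     : length-K array of cluster shape parameters r_k.
    """
    c_ij = np.minimum(1, c.sum(axis=0))                 # c_ij = min(1, sum_k c^k_ij)
    term1 = (c_ij * np.maximum(0, l + delta)).sum()
    overlap = delta * np.maximum(0, c.sum(axis=0) - 1).sum()  # overlap credit
    N_k = c.any(axis=2).sum(axis=1)                     # rows used by each cluster
    M_k = c.any(axis=1).sum(axis=1)                     # columns used by each cluster
    penalty = (delta / 2) * (r * N_k**2 + M_k**2 / r).sum()
    return term1 + overlap - penalty

# Toy example: one 2x2 bicluster planted in a 3x3 matrix, l_ij = 0 everywhere.
c = np.zeros((1, 3, 3))
c[0, :2, :2] = 1
print(approx_objective(np.zeros((3, 3)), c, 0.5, np.array([1.0])))  # -> 0.0
```

For a single square cluster with r_k = M_k / N_k, the decoupled penalty coincides with the exact penalty δ N_k M_k, which is why the toy example above evaluates to exactly zero.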
In equation (b), we have expressed the second term of (a) in terms of a cluster size penalty −δ N_k M_k, and we have added back a term corresponding to the overlap between clusters. Because a cluster-size penalty function of the form N_k M_k leads to an intractable optimization in the max-sum framework, we approximate it using a decoupling approximation (c) where r_k is a cluster shape parameter:

    2 N_k M_k ≈ r_k N_k² + r_k⁻¹ M_k²,    (3)

when r_k ≈ M_k / N_k. The cluster-shape parameter can be iteratively tuned to fit the estimated biclusters.

Following equation (2), the BCMP objective function can be separated into three terms as follows:

    F(c) = Σ_{i,j} τ_ij + Σ_k η_k + Σ_k μ_k,    (4)

    τ_ij = ℓ_ij min(1, Σ_k c^k_ij) + δ max(0, Σ_k c^k_ij − 1),   ∀ (i, j) ∈ V × W,
    η_k = −(δ/2) r_k N_k²,    ∀ 1 ≤ k ≤ K,
    μ_k = −(δ/2) r_k⁻¹ M_k²,  ∀ 1 ≤ k ≤ K.    (5)

Here τ_ij, the tuple function, encourages heavier edges of the bipartite graph to be clustered. Its second term compensates for the fact that when biclusters overlap, the cluster-size penalty functions double-count the overlapping regions. ℓ_ij ≜ max(0, l_ij + δ) is the shifted log-likelihood ratio for observed edge weight e_ij. η_k and μ_k penalize the number of rows and columns of cluster k, N_k and M_k, respectively. Note that by introducing a penalty for each nonempty cluster, the number of clusters can be learned, and finding weak, spurious clusters can be avoided (see Supplementary Note 3.3).

Now, we analyze BCMP over the following model for a binary or unweighted bipartite graph:

Definition 2.
The binary biclustering model is a generative model for an N × M bipartite graph (V, W, E) with K biclusters distributed by uniform sampling with replacement, allowing for overlapping clusters. Within a bicluster, edges are drawn independently with probability p, and outside of a bicluster, they are drawn independently with probability q < p.

In the following, we assume that p, q, and K are given. We discuss the case that the model parameters are unknown in Section 2.4. The following proposition shows that optimizing the BCMP objective function solves the problem of Definition 1 in the case of the binary model:

Proposition 2. Let (e_ij) be a matrix generated by the binary model described in Definition 2. Suppose p, q and K are given. Suppose the maximum likelihood assignment of edges to biclusters, arg max(P(data|c)), is unique up to reordering. Let r_k = M′_k / N′_k be the cluster shape ratio for the k-th maximum likelihood cluster. Then, by using these values of r_k, setting ℓ_ij = e_ij for all (i, j), with cluster size penalty

    δ/2 = −log( (1 − p)/(1 − q) ) / ( 2 log( p(1 − q) / (q(1 − p)) ) ),    (6)

we have,

    arg max_c (P(data|c)) = arg max_c (F(c)).    (7)

Proof. The proof follows the derivation of equation (2). It is presented in Supplementary Note 2.

Remark 1. In the special case when q = 1 − p ∈ (0, 1/2), according to equation (6), we have δ/2 = 1/4. This is suggested as a reasonable initial value to choose when the true values of p and q are unknown; see Section 2.4 for a discussion of learning the model parameters.

The assumption that r_k = M′_k / N′_k may seem rather strong. However, it is essential as it justifies the decoupling equation (3) that enables a linear-time algorithm.
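For the binary model, the penalty of equation (6) can be computed directly from p and q. A small sketch (the function name is ours) that also illustrates Remark 1:

```python
import math

def cluster_size_penalty_half(p, q):
    """Return delta/2 from equation (6), assuming 0 < q < p < 1."""
    return -math.log((1 - p) / (1 - q)) / (2 * math.log(p * (1 - q) / (q * (1 - p))))

# Remark 1: whenever q = 1 - p, delta/2 reduces to 1/4, independent of p.
print(cluster_size_penalty_half(0.8, 0.2))  # -> 0.25 (up to floating-point rounding)
```

Since q < p makes both the numerator and denominator positive, the penalty is always positive, as a cluster-size penalty should be.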
In practice, if the initial choice of r_k is close enough to the actual ratio that a cluster is detected corresponding to the real cluster, r_k can be tuned to find the true value by iteratively updating it to fit the estimated bicluster. This iterative strategy works well in our simulations. For more details about automatically tuning the parameter r_k, see Supplementary Note 3.1.

In a more general statistical setting, log-likelihood ratios l_ij may be unbounded below, and the first step (a) of derivation (2) is an approximation; setting δ arbitrarily large will eventually lead to instability in the message updates.

2.3 Biclustering Using Message Passing

In this section, we use the max-sum algorithm to optimize the objective function of equation (4). For a review of the max-sum message update rules, see Supplementary Note 4. There are NM function nodes for the functions τ_ij, K function nodes for the functions η_k, and K function nodes for the functions μ_k. There are NMK binary variables, each attached to three function nodes: c^k_ij is attached to τ_ij, η_k, and μ_k (see Supplementary Figure 1). The incoming messages from these function nodes are named t^k_ij, n^k_ij, and m^k_ij, respectively. In the following, we describe messages for c^k_ij = c^1_12; other messages can be computed similarly.

First, we compute t^1_12:

    t^1_12(x) =(a) max_{c^2_12, ..., c^K_12} [ τ_12(x, c^2_12, ..., c^K_12) + Σ_{k≠1} m^k_12(c^k_12) + n^k_12(c^k_12) ]
              =(b) max_{c^2_12, ..., c^K_12} [ ℓ_12 min(1, Σ_k c^k_12) + δ max(0, Σ_k c^k_12 − 1) + Σ_{k≠1} c^k_12 (m^k_12 + n^k_12) ] + d_1,    (8)

where d_1 = Σ_{k≠1} m^k_12(0) + n^k_12(0) is a constant. Equality (a) comes from the definition of messages according to equation (6) in the Supplement. Equality (b) uses the definition of τ_12 of equation (5) and the definition of the scalar message of equation (8) in the Supplement. We can further simplify t_12 as follows:

    t^1_12(1) − d_1 =(c) ℓ_12 + Σ_{k≠1} max(0, δ + m^k_12 + n^k_12),
    t^1_12(0) − d_1 =(d) ℓ_12 − δ + Σ_{k≠1} max(0, δ + m^k_12 + n^k_12),   if ∃ k, n^k_12 + m^k_12 + δ > 0,
    t^1_12(0) − d_1 =(e) max(0, ℓ_12 + max_{k≠1}(m^k_12 + n^k_12)),   otherwise.    (9)

If c^1_12 = 1, we have min(1, Σ_k c^k_12) = 1, and max(0, Σ_k c^k_12 − 1) = Σ_{k≠1} c^k_12. These lead to equality (c). A similar argument can be made if c^1_12 = 0 but there exists a k such that n^k_12 + m^k_12 + δ > 0. This leads to equality (d). If c^1_12 = 0 and there is no k such that n^k_12 + m^k_12 + δ > 0, we compare the increase obtained by letting c^k_12 = 1 (i.e., ℓ_12) with the penalty (i.e., m^k_12 + n^k_12), for the best k. This leads to equality (e).

Remark 2. Computation of t^1_ij, ..., t^K_ij using equality (d) costs O(K), and not O(K²), as the summation need only be computed once.

Messages m^1_12 and n^1_12 are computed as follows:

    m^1_12(x) = max_{c^1 | c^1_12 = x} [ μ_1(c^1) + Σ_{(i,j)≠(1,2)} t^1_ij(c^1_ij) + n^1_ij(c^1_ij) ],
    n^1_12(x) = max_{c^1 | c^1_12 = x} [ η_1(c^1) + Σ_{(i,j)≠(1,2)} t^1_ij(c^1_ij) + m^1_ij(c^1_ij) ],    (10)

where c^1 = {c^1_ij : i ∈ V, j ∈ W}. To compute n^1_12 in constant time, we perform a preliminary optimization, ignoring the effect of edge (1, 2):

    arg max_{c^1} −(δ/2) N_1² + Σ_{(i,j)} t^1_ij(c^1_ij) + m^1_ij(c^1_ij).    (11)

Let s_i = Σ_{j=1}^{M} max(0, m^1_ij + t^1_ij) be the sum of positive incoming messages of row i. The function η_1 penalizes the number of rows containing some nonzero c^1_ij: if any message along that row is included, there is no additional penalty for including every positive message along that row. Thus, optimization (11) is computed by deciding which rows to include. This can be done efficiently through sorting: we sort row sums s_(1), ..., s_(N) at a cost of O(N log N). Then we proceed from largest to smallest, including row (N + 1 − i) if the marginal penalty (δ/2)(i² − (i − 1)²) = (δ/2)(2i − 1) is less than s_(N+1−i). After solving optimization (11), the messages n^1_ij can be computed in linear time, as we explain in Supplementary Note 5.

Remark 3. Computation of n^k_ij through sorting costs O(N log N).

Proposition 3 (Computational Complexity of BCMP). The computational complexity of BCMP over a bipartite graph with N rows, M columns, and K clusters is O(K(N + log M)(M + log N)).

Proof. For each iteration, there are NM messages t_ij to be computed at cost O(K) each. Before computing (n^k_ij), there are K sorting steps at a cost of O(M log M), after which each message may be computed in constant time. Likewise, there are K sorting steps at a cost of O(N log N) each before computing (m^k_ij).

We provide an empirical runtime example of the algorithm in Supplementary Figure 3.

2.4 Parameter learning using Expectation-Maximization

In the BCMP objective function described in Section 2.2, the parameters of the generative model were used to compute the log-likelihood ratios (l_ij). In practice, however, these parameters may be unknown.
Expectation-Maximization (EM) can be used to estimate these parameters. The use of EM in this setting is slightly unorthodox, as we estimate the hidden labels (cluster assignments) in the M step instead of the E step. However, the distinction between parameters and labels is not intrinsic in the definition of EM [17], and the true ML solution is still guaranteed to be a fixed point of the iterative process. Note that it is possible that the EM iterative procedure leads to a locally optimal solution; therefore, it is recommended to use several random re-initializations for the method.

The EM algorithm has three steps:

• Initialization: We choose initial values for the underlying model parameters θ and compute the log-likelihood ratios (l_ij) based on these values, denoting by F_0 the initial objective function.

• M step: We run BCMP to maximize the objective F_i(c). We denote the estimated cluster assignments by ĉ_i.

• E step: We compute the expected log-likelihood function as follows:

    F_{i+1}(c) = E_θ[ log P((e_ij) | θ) | c = ĉ_i ] = Σ_{(i,j)} E_θ[ log P(e_ij | θ) | c = ĉ_i ].    (12)

Conveniently, the expected-likelihood function takes the same form as the original likelihood function, with an input matrix of expected log-likelihood ratios. These can be computed efficiently if conjugate priors are available for the parameters. Therefore, BCMP can be used to maximize F_{i+1}. The algorithm terminates upon failure to improve the estimated likelihood F_i(ĉ_i).

For a discussion of the application of EM to the binary and Gaussian models, see Supplementary Note 6. In the case of the binary model, we use uniform Beta distributions as conjugate priors for p and q, and in the case of the Gaussian model, we use inverse-gamma-normal distributions as the priors for the variances and means.
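For the binary model, for instance, the E step reduces to recomputing the matrix of log-likelihood ratios from the Beta posteriors. A simplified sketch of one such update (it plugs in posterior means rather than the exact expected log-probabilities, and all names are ours, not the paper's implementation):

```python
import numpy as np

def em_binary_e_step(e, c_hat):
    """One simplified E step for the binary model with uniform Beta priors.

    e     : (N, M) binary adjacency matrix.
    c_hat : (N, M) binary matrix; 1 iff edge (i, j) was assigned to some
            bicluster in the last M step.
    Returns the updated matrix of log-likelihood ratios (l_ij).
    """
    inside, outside = c_hat == 1, c_hat == 0
    # Beta(1, 1) prior -> posterior mean is (successes + 1) / (trials + 2).
    p = (e[inside].sum() + 1) / (inside.sum() + 2)
    q = (e[outside].sum() + 1) / (outside.sum() + 2)
    # l_ij = log(p/q) for observed edges, log((1-p)/(1-q)) for absent ones.
    return e * np.log(p / q) + (1 - e) * np.log((1 - p) / (1 - q))
```

Since the estimated within-cluster density exceeds the background density, the updated ratios are positive on observed edges and negative on absent ones, so the next M step is again a well-posed BCMP instance.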
Even when convenient priors are not available, EM is still tractable as long as one can sample from the posterior distributions.

3 Evaluation results

We compared the performance of our biclustering algorithm with two methods, ISA and LAS, in simulations and in real gene expression datasets (Supplementary Note 8). ISA was chosen because it performed well in comparison studies [6], [9], and LAS was chosen because it outperformed ISA in preliminary simulations. Both ISA and LAS search for biclusters using iterative refinement. ISA assigns rows iteratively to clusters fractionally in proportion to the sum of their entries over columns. It repeats the same for column-cluster assignments, and this process is iterated until convergence. LAS uses a similar greedy iterative search without fractional memberships, and it masks already-detected clusters by mean subtraction.

In our simulations, we generate simulated bipartite graphs of size 100 × 100. We planted (possibly overlapping) biclusters as full blocks with two noise models:

• Bernoulli noise: we drew edges according to the binary model of Definition 2 with varying noise level q = 1 − p.

Figure 2: Performance comparison of the proposed method (BCMP) with ISA and LAS, for Bernoulli and Gaussian models, and for overlapping and non-overlapping biclusters. On the y axis is the total number of misclassified row-column pairs. Either the noise level or the amount of overlap is on the x axis.

• Gaussian noise: we drew edge weights within and outside of biclusters from normal distributions N(1, σ²) and N(0, σ²), respectively, for different values of σ.

For each of these cases, we ran simulations on three setups (see Figure 2):

• Non-overlapping clusters: three non-overlapping biclusters were planted in a 100 × 100 matrix with sizes 20 × 20, 15 × 20, and 15 × 10.
We varied the noise level.\n\u2022 Overlapping clusters with \ufb01xed overlap: Three overlapping biclusters with \ufb01xed overlaps\nwere planted in a 100 \u00d7 100 matrix with sizes 20 \u00d7 20, 20 \u00d7 10, and 10 \u00d7 30. We varied\nthe noise level.\n\u2022 Overlapping clusters with variable overlap: we planted two 30 \u00d7 30 biclusters in a 100 \u00d7\n100 matrix with variable amount of overlap between them, where the amount of overlap\nis de\ufb01ned as the fraction of rows and columns shared between the two clusters. We used\nBernoulli noise level q = 1 \u2212 p = 0.15, and Gaussian noise level \u03c3 = 0.7.\n\nThe methods used have some parameters to set. Pseudocode for BCMP is presented in Supplemen-\ntary Note 10. Here are the parameters that we used to run each method:\n\n\u2022 BCMP method with underlying parameters given: We computed the input matrix of shifted\nlog-likelihood ratios following the discussion in Section 2.2. The number of biclusters\nK was given. We initialized the cluster-shape parameters rk at 1 and updated them as\ndiscussed in Supplementary Note 3.1. In the case of Bernoulli noise, following Proposition\n2 and Remark 1, we set (cid:96)ij = eij and \u03b4\n2 = 1/4. 
In the case of Gaussian noise, we chose a threshold δ to maximize the unthresholded likelihood (see Supplementary Note 3.2).

• BCMP - EM method: Instead of taking the underlying model parameters as given, we estimated them using the procedure described in Section 2.4 and Supplementary Note 6. We used identical, uninformative priors on the parameters of the within-cluster and null distributions.

[Figure 2 appears here. Panels (a1)-(a3): Bernoulli noise; panels (b1)-(b3): Gaussian noise; columns: non-overlapping biclusters, overlapping biclusters (fixed overlap), and overlapping biclusters (variable overlap). Curves: BCMP-EM, BCMP, LAS, ISA; y axis, average number of misclassified tuples; x axis, noise level or overlap. Totals of clustered tuples: 850 (non-overlapping) and 900 (fixed overlap).]

• ISA method: We used the same threshold ranges for both rows and columns, attempting to find best-performing threshold values for each noise
level. These values were mostly\naround 1.5 for both noise types and for all three dataset types. We found positive biclusters,\nand used 20 reinitializations. Out of these 20 runs, we selected the best-performing run.\n\n\u2022 LAS method: There were no parameters to set. Since K was given, we selected the \ufb01rst K\n\nbiclusters discovered by LAS, which marginally increased its performance.\n\nEvaluation results of both noise models and non-overlapping and overlapping biclusters are shown\nin Figure 2. In the non-overlapping case, BCMP and LAS performed similarly well, better than\nISA. Both of these methods made few or no errors up until noise levels q = 0.2 and \u03c3 = .6 in\nBernoulli and Gaussian cases, respectively. When the parameters had to be estimated using EM,\nBCMP performed worse for higher levels of Gaussian noise but well otherwise. ISA outperformed\nBCMP and LAS at very high levels of Bernoulli noise; at such a high noise level, however, the\nresults of all three algorithms are comparable to a random guess.\nIn the presence of overlap between biclusters, BCMP outperformed both ISA and LAS except at very\nhigh noise levels. Whereas LAS and ISA struggled to resolve these clusters even in the absence of\nnoise, BCMP made few or no errors up until noise levels q = 0.2 and \u03c3 = .6 in Bernoulli and Gaus-\nsian cases, respectively. Notably, the overlapping clusters were more asymmetrical, demonstrating\nthe robustness of the strategy of iteratively tuning rk in our method. In simulations with variable\noverlaps between biclusters, for both noise models, BCMP outperformed LAS signi\ufb01cantly, while\nthe results for the ISA method were very poor (data not shown). 
These results demonstrate that BCMP excels at inferring overlapping biclusters.

4 Discussion and future directions

In this paper, we have proposed a new biclustering technique called Biclustering Using Message Passing that, unlike existent methods, infers a globally optimal collection of biclusters rather than a collection of locally optimal ones. This distinction is especially relevant in the presence of overlapping clusters, which are common in most applications. Such overlaps can be of importance if one is interested in the relationships among biclusters. We showed through simulations that our proposed method outperforms two popular existent methods, ISA and LAS, in both Bernoulli and Gaussian noise models, when the planted biclusters were overlapping. We also found that BCMP performed well when applied to gene expression datasets.

Biclustering is a problem that arises naturally in many applications. Often, a natural statistical model for the data is available; for example, a Poisson model can be used for document classification (see Supplementary Note 9). Even when no such statistical model is available, BCMP can be used to maximize a heuristic objective function such as the modularity function [19]. This heuristic is preferable to clustering the original adjacency matrix when the degrees of the nodes vary widely; see Supplementary Note 7.

The same optimization strategy used in this paper for biclustering can also be applied to perform clustering, generalizing the graph-partitioning problem by allowing nodes to be in zero or several clusters. We believe that the flexibility of our framework to fit various statistical and heuristic models will allow BCMP to be used in diverse clustering and biclustering applications.

Acknowledgments

We would like to thank Professor Manolis Kellis and Professor Muriel Médard for their advice and support.
We would like to thank the Harvard Division of Medical Sciences for supporting this\nproject.\n\n8\n\n\fReferences\n[1] Cheng, Yizong, and George M. Church. \"Biclustering of expression data.\" Ismb. Vol. 8. 2000.\n[2] Dao, Phuong, et al. \"Inferring cancer subnetwork markers using density-constrained bicluster-\n\ning.\" Bioinformatics 26.18 (2010): i625-i631.\n\n[3] Bisson, Gilles, and Fawad Hussain. \"Chi-sim: A new similarity measure for the co-clustering\ntask.\" Machine Learning and Applications, 2008. ICMLA\u201908. Seventh International Conference\non. IEEE, 2008.\n\n[4] Bergmann, Sven, Jan Ihmels, and Naama Barkai. \"Iterative signature algorithm for the analysis\n\nof large-scale gene expression data.\" Physical review E 67.3 (2003): 031902.\n\n[5] Shabalin, Andrey A., et al. \"Finding large average submatrices in high dimensional data.\" The\n\nAnnals of Applied Statistics (2009): 985-1012.\n\n[6] Prelic, Amela, et al. \"A systematic comparison and evaluation of biclustering methods for gene\n\nexpression data.\" Bioinformatics 22.9 (2006): 1122-1129.\n\n[7] Tanay, Amos, Roded Sharan, and Ron Shamir. \"Discovering statistically signi\ufb01cant biclusters\n\nin gene expression data.\" Bioinformatics 18.suppl 1 (2002): S136-S144.\n\n[8] Hochreiter, Sepp, et al. \"FABIA: factor analysis for bicluster acquisition.\" Bioinformatics 26.12\n\n(2010): 1520-1527.\n\n[9] Li, Li, et al. \"A comparison and evaluation of \ufb01ve biclustering algorithms by quantifying good-\n\nness of biclusters for gene expression data.\" BioData mining 5.1 (2012): 1-10.\n\n[10] Eren, Kemal, et al. \"A comparative analysis of biclustering algorithms for gene expression\n\ndata.\" Brie\ufb01ngs in bioinformatics 14.3 (2013): 279-292.\n\n[11] Nadakuditi, Raj Rao, and Mark EJ Newman. \"Graph spectra and the detectability of commu-\n\nnity structure in networks.\" Physical review letters 108.18 (2012): 188701.\n\n[12] Krzakala, Florent, et al. 
\"Spectral redemption in clustering sparse networks.\" Proceedings of\n\nthe National Academy of Sciences 110.52 (2013): 20935-20940.\n\n[13] Decelle, Aurelien, et al. \"Asymptotic analysis of the stochastic block model for modular net-\n\nworks and its algorithmic applications.\" Physical Review E 84.6 (2011): 066106.\n\n[14] Frey, Brendan J., and Delbert Dueck. \"Clustering by passing messages between data points.\"\n\nScience 315.5814 (2007): 972-976.\n\n[15] Dueck, Delbert, et al. \"Constructing treatment portfolios using af\ufb01nity propagation.\" Research\n\nin Computational Molecular Biology. Springer Berlin Heidelberg, 2008.\n\n[16] Govaert, G. and Nadif, M. \"Block clustering with bernoulli mixture models: Comparison of\n\ndifferent approaches.\" Computational Statistics and Data Analysis, 52 (2008): 3233-3245.\n\n[17] Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. \"Maximum likelihood from incom-\nplete data via the EM algorithm.\" Journal of the Royal Statistical Society. Series B (Method-\nological) (1977): 1-38.\n\n[18] Marbach, Daniel, et al. \"Wisdom of crowds for robust gene network inference.\" Nature meth-\n\nods 9.8 (2012): 796-804.\n\n[19] Newman, Mark EJ. \"Modularity and community structure in networks.\" Proceedings of the\n\nNational Academy of Sciences 103.23 (2006): 8577-8582.\n\n[20] Yedidia, Jonathan S., William T. Freeman, and Yair Weiss. \"Constructing free-energy approxi-\nmations and generalized belief propagation algorithms.\" Information Theory, IEEE Transactions\non 51.7 (2005): 2282-2312.\n\n[21] Caldas, Jos\u00e9, and Samuel Kaski. \"Bayesian biclustering with the plaid model.\" Machine Learn-\n\ning for Signal Processing, 2008. MLSP 2008. IEEE Workshop on. IEEE, 2008.\n\n9\n\n\f", "award": [], "sourceid": 1901, "authors": [{"given_name": "Luke", "family_name": "O'Connor", "institution": "Harvard University"}, {"given_name": "Soheil", "family_name": "Feizi", "institution": "Massachusetts Institute of Technology"}]}