{"title": "Learning Sparse Gaussian Graphical Models with Overlapping Blocks", "book": "Advances in Neural Information Processing Systems", "page_first": 3808, "page_last": 3816, "abstract": "We present a novel framework, called GRAB (GRaphical models with overlApping Blocks), to capture densely connected components in a network estimate. GRAB takes as input a data matrix of p variables and n samples, and jointly learns both a network among p variables and densely connected groups of variables (called `blocks'). GRAB has four major novelties as compared to existing network estimation methods: 1) It does not require the blocks to be given a priori. 2) Blocks can overlap. 3) It can jointly learn a network structure and overlapping blocks. 4) It solves a joint optimization problem with the block coordinate descent method that is convex in each step. We show that GRAB reveals the underlying network structure substantially better than four state-of-the-art competitors on synthetic data. When applied to cancer gene expression data, GRAB outperforms its competitors in revealing known functional gene sets and potentially novel genes that drive cancer.", "full_text": "Learning Sparse Gaussian Graphical Models with Overlapping Blocks\n\nMohammad Javad Hosseini1\n\nSu-In Lee1,2\n\n1Department of Computer Science & Engineering, University of Washington, Seattle\n\n2Department of Genome Sciences, University of Washington, Seattle\n\n{hosseini, suinlee}@cs.washington.edu\n\nAbstract\n\nWe present a novel framework, called GRAB (GRaphical models with overlApping\nBlocks), to capture densely connected components in a network estimate. GRAB\ntakes as input a data matrix of p variables and n samples and jointly learns both\na network of the p variables and densely connected groups of variables (called\n\u2018blocks\u2019). GRAB has four major novelties as compared to existing network\nestimation methods: 1) It does not require blocks to be given a priori. 
2) Blocks\ncan overlap. 3) It can jointly learn a network structure and overlapping blocks. 4)\nIt solves a joint optimization problem with the block coordinate descent method\nthat is convex in each step. We show that GRAB reveals the underlying network\nstructure substantially better than four state-of-the-art competitors on synthetic data.\nWhen applied to cancer gene expression data, GRAB outperforms its competitors\nin revealing known functional gene sets and potentially novel cancer driver genes.\n\n1 Introduction\n\nMany real-world networks contain subsets of variables densely connected to one another, a property\ncalled modularity (Fig 1A); however, standard network inference methods do not incorporate this\nproperty. As an example, biologists are increasingly interested in understanding how thousands of\ngenes interact with each other on the basis of gene expression data that measure expression levels of\np genes across n samples. This has stimulated considerable research into the structure estimation\nof a network from high-dimensional data ($p \\gg n$). It is well known that the network structure\ncorresponds to the non-zero pattern of the inverse covariance matrix, $\\Sigma^{-1}$ [1]. Thus, obtaining a\nsparse estimate of $\\Sigma^{-1}$ by using an $\\ell_1$ penalty has been a standard approach to inferring a network, a\nmethod called the graphical lasso [2]. However, applying an $\\ell_1$ penalty to each edge fails to reflect the\nfact that genes involved in similar functions are more likely to be connected with each other, and that\nthe organization of genes into functional modules is often not known.\nWe present a novel structural prior, called the GRAB prior, which encourages the network estimate to\nbe dense within a block (i.e., a subset of variables) and sparse between blocks, where blocks are not\ngiven a priori. 
Fig 1B illustrates the effectiveness of the GRAB prior (bottom) in a high-dimensional\nsetting (p = 200 and n = 100), where it is difficult to reveal the true underlying network by using\nthe graphical lasso (GLasso) (top). The major novelty of GRAB is four-fold:\nFirst, unlike previous work [3, 4, 5], GRAB allows each variable to belong to more than one block,\nwhich is an important property of many real-world networks. For example, genes important in disease\nprocesses are often involved in multiple functional modules [6], and identifying such genes would be\nof great scientific interest (Section 4.2). Although existing methods to learn non-overlapping blocks\nallow edges between different blocks, they use stronger regularization parameters for between-block\nedges, which decreases the power to detect variables associated with multiple blocks.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nSecond, GRAB jointly learns the network structure and the assignment of variables into overlapping\nblocks (Fig 2). Existing methods to incorporate blocks in network learning either use blocks given a\npriori or use a sequential approach that first learns blocks and then learns a network with the blocks held\nfixed. Interestingly, the GRAB algorithm can be viewed as a generalization of the joint learning of the\ndistance metric among p variables and graph-cut clustering of p variables into blocks (Section 3.4).\nThird, GRAB solves a joint optimization problem with the block coordinate descent method that is\nconvex in each step. This is a powerful feature that is difficult for existing methods that cluster\nvariables into blocks to achieve. This property guarantees the convergence of the learning algorithm\n(Section 3).\nFinally, the GRAB framework we present in this paper uses the Gaussian graphical model as a\nbaseline model. 
However, the GRAB prior, formulated as $\\mathrm{tr}(ZZ^\\top|\\Theta|)$ (Section 2.2), can be used in\nany kind of network model, such as pairwise Markov random fields.\nIn the following sections, we show that GRAB outperforms the graphical lasso [2] and existing\nmethods to learn blocks and network estimates [3, 4] on synthetic data and cancer gene expression\ndata. We also demonstrate GRAB\u2019s potential to identify novel genes that drive cancer.\n\nFigure 1: (A) A network with overlapping blocks (top) and its adjacency matrix (bottom). (B) Network estimates of GLasso (top) and GRAB (bottom) in a toy example. [Figure not reproduced; the example network has nodes x1 to x8.]\n\nFigure 2: The GRAB framework: an iterative algorithm that jointly learns $\\Theta$ and Z. [Figure not reproduced. Recoverable labels: Z, the assignment matrix (p by K); $ZZ^\\top$ (p by p), where $(ZZ^\\top)_{ij}$ is the similarity between the ith and jth variables, with Blocks 1 to 3 marked; Z-step (learning Z); $\\Theta$-step (learning $\\Theta$); sparsity pattern encouraged by $ZZ^\\top$; a panel contrasting SBM and GRAB with the legend $\\lambda_1$, $\\lambda_2$: based on Z; $\\lambda_w$: within-block; $\\lambda_b$: between-block.]\n\n2 GGM with Overlapping Blocks\n\n2.1 Background: High-Dimensional Gaussian Graphical Model (GGM)\nWe aim to learn a GGM of p variables on the basis of n observations ($p \\gg n$). That is, suppose that\n$X^{(1)}, \\ldots, X^{(n)}$ are i.i.d. $N(\\mu, \\Sigma)$, where $\\mu \\in \\mathbb{R}^p$ and $\\Sigma$ is a $p \\times p$ positive definite matrix. 
It is well\nknown that the sparsity pattern of $\\Sigma^{-1}$ determines the conditional independence structure of the p\nvariables; there is an edge between the ith and jth variables if and only if the (i, j) element of $\\Sigma^{-1}$ is\nnon-zero [1]. A number of authors have proposed to estimate $\\Sigma^{-1}$ using the graphical lasso [2, 7, 8]:\n\n$$\\underset{\\Theta \\succeq 0}{\\text{maximize}} \\quad \\log\\det\\Theta - \\mathrm{tr}(S\\Theta) - \\lambda\\|\\Theta\\|_1, \\tag{1}$$\n\nwhere the solution $\\hat{\\Theta}$ is an estimate of $\\Sigma^{-1}$, S denotes the empirical covariance matrix, and $\\lambda$ is a\nnonnegative tuning parameter that controls the strength of the $\\ell_1$ penalty applied to the elements of\n$\\Theta$. This amounts to maximizing a penalized log-likelihood.\n\n2.2 GGM with the Overlapping Block Prior\n\nHere, we present the GRAB prior, formulated as $\\mathrm{tr}(ZZ^\\top|\\Theta|)$, that encourages $\\Theta$ to have overlapping\nblocks. Let $X = \\{X_1, \\ldots, X_p\\}$ be variables in the network and Z be a real matrix of size $p \\times K$,\nwhere K is the total number of blocks. Each element $-1 \\le Z_{ik} \\le 1$ can be interpreted as a score\nrepresenting how likely the ith variable $X_i$ belongs to the kth block $B_k$. The ith row of Z, denoted\nby $Z_i$, can be interpreted as a low-rank embedding for the variable $X_i$ showing its block assignment\nscores. Then, the (i, j) element $(ZZ^\\top)_{ij} = \\sum_{k=1}^{K} Z_{ik}Z_{jk}$ (the dot product of $Z_i$ and $Z_j$) represents\nthe similarity between variables $X_i$ and $X_j$ in their embeddings.\nTo more clearly understand the impact of the GRAB prior on the sparsity structure of $\\Theta$, let us\nassume a hard assignment model in which we assign variables to blocks. Then, Z becomes a binary\nmatrix and the sparsity pattern of $ZZ^\\top$ would indicate the region covered by all K blocks (Fig 2A-B).\nThen, jointly learning Z and $\\Theta$ to increase $\\sum_{i,j}(ZZ^\\top)_{ij}|\\Theta_{ij}|$ would encourage $\\Theta$ to have a sparsity\nstructure imposed by $ZZ^\\top$. 
In the continuous case, it would encourage $|\\Theta_{ij}|$ to be non-zero when\n$X_i$ and $X_j$ have similar embeddings (i.e., the dot product of $Z_i$ and $Z_j$ is large).\nIncorporating the GRAB prior into Eq (1) as a structural prior leads to:\n\n$$\\underset{\\Theta \\succeq 0,\\, Z \\in D}{\\text{maximize}} \\quad \\log\\det\\Theta - \\mathrm{tr}(S\\Theta) - \\lambda\\left(\\|\\Theta\\|_1 - \\mathrm{tr}(ZZ^\\top|\\Theta|)\\right), \\tag{2}$$\n\nwhere $\\lambda$ is a non-negative tuning parameter. We can re-write Eq (2) as:\n\n$$\\underset{\\Theta \\succeq 0,\\, Z \\in D}{\\text{maximize}} \\quad \\log\\det\\Theta - \\mathrm{tr}(S\\Theta) - \\lambda\\sum_{i,j}\\left(1 - (ZZ^\\top)_{ij}\\right)|\\Theta_{ij}|. \\tag{3}$$\n\nWe use the value $\\lambda(1 - (ZZ^\\top)_{ij})$ as the sparsity tuning parameter for each (i, j) element $\\Theta_{ij}$. A\nnetwork edge that corresponds to two variables with similar embeddings would be penalized less.\nThe set $D \\subset [-1, 1]^{p \\times K}$ contains matrices Z satisfying the following constraints: (a) $\\|Z_i\\|_2 \\le 1$,\nwhere $Z_i$ denotes the ith row of Z. This constraint ensures the regularization parameters of all (i, j)\npairs of variables are non-negative. (b) $\\|Z\\|_F \\le \\beta$. In addition to the variable-specific constraint\non each $Z_i$ in (a), we need a global constraint on Z to prevent all regularization parameters from\nbecoming zero ($\\forall i, j: (ZZ^\\top)_{ij} = 1$). (c) $\\|Z\\|_2 \\le \\tau$, where $\\|\\cdot\\|_2$ of a matrix is its maximum singular\nvalue. This constraint prevents the case where all variables are assigned to one block.\nThere are two hyperparameters, $\\beta$ and $\\tau$; however, as described below, we set $\\tau = \\beta/\\sqrt{K}$, which\nguarantees that there are at least K non-empty blocks. In our experiments, we set the\nhyperparameter $\\beta = \\sqrt{p/2}$, which, intuitively, would allow each variable to get on average half of\nits largest possible squared norm. Given that $\\|Z\\|_F^2 = \\sum_{i=1}^{p}\\sigma_i^2$, where $\\sigma_i$ is the ith singular value\nof Z, constraint (b) gives $\\sum_{i=1}^{p}\\sigma_i^2 \\le \\beta^2$. We set $\\tau = \\beta/\\sqrt{K}$, where $\\tau$ is the upper bound\non the maximum singular value from constraint (c). Each $\\sigma_i^2$ is then at most $\\beta^2/K$, so when\nconstraint (b) is tight at least K singular values must be non-zero. This means that there would be at least\nK non-empty blocks given that the constraint (b) is tight. 
We show in Section 3 that this choice of\nhyperparameters makes our learning algorithm simpler (see Lemma 3.2).\n\n2.3 Probabilistic Interpretation\nThe joint distribution over X, $\\Theta$ and Z is: $P(X, \\Theta, Z) = P(X|\\Theta)P(\\Theta|Z)P(Z)$. The first two\nterms, $\\log\\det(\\Theta) - \\mathrm{tr}(S\\Theta)$, in Eq (3) correspond to $\\log P(X|\\Theta)$, the log-likelihood of the GGM\ngiven a particular parameter $\\Theta$ (i.e., an estimate of $\\Sigma^{-1}$), as described in Section 2.1. For $\\Theta \\succeq 0$,\n$P(\\Theta|Z) = \\prod P(\\Theta_{ij}|Z)$, where $P(\\Theta_{ij}|Z)$ represents a conditional probability over $\\Theta_{ij}$ given the\nblock assignment scores of $X_i$ and $X_j$. We use the Laplacian prior with the sparsity parameter value\n$\\lambda(1 - (ZZ^\\top)_{ij})$. For $\\Theta \\succeq 0$, $P(\\Theta|Z)$ is: $\\frac{1}{D}\\prod_{(i,j)}\\exp\\left(-\\lambda(1 - (ZZ^\\top)_{ij})|\\Theta_{ij}|\\right)$, where D is the\nnormalization constant. The prior probability P(Z) is proportional to D.\n\n2.4 Related Work\n\nTo our knowledge, GRAB is the first attempt to jointly learn the overlapping blocks and the structure\nof a conditional dependence network such as a GGM. Related work consists of 3 categories:\n1) Learning blocks with a network held fixed: This category includes (a) the stochastic block model\n(SBM) [9], (b) spectral clustering [10], and (c) a screening rule to identify non-overlapping blocks\nbased on the empirical covariance matrix [11].\n2) Learning a network with blocks given a priori and held fixed: This category includes (a) a method\nto solve graphical lasso with a group $\\ell_1$ penalty to encourage group sparsity of edges within pairs of\nblocks [12], and (b) an efficient learning algorithm for GGMs given a set of overlapping blocks [13].\n3) Learning non-overlapping blocks first and then the network given the blocks: (a) Marlin et\nal. (2009) extend the prior work [12] to identify non-overlapping blocks which are then used to\nlearn a network [3]. 
(b) Another method assigns each variable to one block, and uses different\nregularization parameters for within-block and between-block edges [14]. (c) Tan et al. (2015)\npropose to use hierarchical clustering (complete-linkage and average-linkage) to cluster variables\ninto non-overlapping blocks, and apply graphical lasso to each block [4].\n\n3 GRAB Learning Algorithm\n\n3.1 Overview\nOur learning algorithm jointly learns the block assignment scores Z and the network estimate $\\Theta$ by\nsolving Eq (2). We adopt the block coordinate descent (BCD) method to iteratively learn Z and $\\Theta$. Our\nlearning algorithm essentially performs adaptive distance (similarity) metric learning and clustering\nof variables into blocks simultaneously (Section 3.4). Given the current assignment of variables into\nblocks, Z, we learn a network among variables, $\\Theta$. Then, $|\\Theta|$ is used as a similarity matrix among\nvariables to update the assignment of variables to blocks, Z. We iterate until convergence.\nConvergence is theoretically guaranteed. Since our objective function is continuous on a compact\nlevel set, based on Theorem 4.1 in [15], the solution sequence of our method is defined and bounded.\nEvery coordinate block found by the $\\Theta$-step and Z-step is a stationary point of GRAB. We indeed\nobserved that the value of the objective function monotonically increases until convergence.\nIn the following, we show that the BCD method is convex in each step. We first re-write Eq (2)\nwith all the constraints explicitly:\n\n$$\\underset{\\Theta \\succeq 0,\\, Z}{\\text{maximize}} \\quad \\log\\det\\Theta - \\mathrm{tr}(S\\Theta) - \\lambda\\left(\\|\\Theta\\|_1 - \\mathrm{tr}(ZZ^\\top|\\Theta|)\\right) \\quad \\text{subject to} \\quad \\|Z\\|_2 \\le \\tau,\\ \\|Z_i\\|_2 \\le 1\\ (i \\in \\{1, \\ldots, p\\}),\\ \\|Z\\|_F \\le \\beta. \\tag{4}$$\n\nNow, we state the following lemma, the proof of which can be found in the Appendix.\nLemma 3.1 Eq (4) is equivalent to the following:\n\n$$\\underset{\\Theta \\succeq 0,\\, W \\succeq 0}{\\text{maximize}} \\quad \\log\\det\\Theta - \\mathrm{tr}(S\\Theta) - \\lambda\\left(\\|\\Theta\\|_1 - \\mathrm{tr}(W|\\Theta|)\\right) \\quad \\text{subject to} \\quad \\mathrm{rank}(W) \\le K,\\ W \\preceq \\tau^2 I,\\ \\mathrm{diag}(W) \\le 1,\\ \\mathrm{tr}(W) \\le \\beta^2, \\tag{5}$$\n\nwhere W is a $p \\times p$ matrix, K is the number of blocks, and I is the identity matrix of size p.\u00b9\nCorollary 3.1.1 Suppose that $(\\Theta^*, W^*)$ is the optimal solution of the optimization problem (5).\nThen, $(\\Theta^*, Z^* = U\\sqrt{D})$ is the optimal solution of problem (4), where $U \\in \\mathbb{R}^{p \\times K}$ is a matrix with\ncolumns containing the K eigenvectors of $W^*$ corresponding to the largest eigenvalues and D is a\ndiagonal matrix of the corresponding eigenvalues.\n\n3.2 Learning $\\Theta$ ($\\Theta$-step)\nTo estimate $\\Theta$ given Z, based on Eq (3), we solve the following problem:\n\n$$\\underset{\\Theta \\succeq 0}{\\text{maximize}} \\quad \\log\\det\\Theta - \\mathrm{tr}(S\\Theta) - \\sum_{(i,j)}\\Lambda_{ij}|\\Theta_{ij}|, \\tag{6}$$\n\nwhere $\\Lambda_{ij} = \\lambda(1 - (ZZ^\\top)_{ij})$. This is the graphical lasso with edge-specific regularization parameters\n$\\Lambda_{ij}$. Eq (6) is a convex problem and we solve it by adopting a standard solver for graphical lasso [16].\n\n\u00b9 In this paper, we assume diag is an operator that maps a vector to a diagonal matrix with the vector as its\ndiagonal, and maps a matrix to a vector containing its diagonal.\n\n3.3 Learning Z (Z-step)\nHere we describe how to learn Z given $\\Theta$. Instead of solving (4), we solve (5) because (5) is a\nconvex optimization problem with respect to W. Interestingly, we can remove the rank constraint,\n$\\mathrm{rank}(W) \\le K$; in Lemma 3.2, we show that with the choice of $\\tau = \\beta/\\sqrt{K}$, the rank constraint is\nautomatically satisfied. This leads to the following optimization problem:\n\n$$\\underset{W \\succeq 0}{\\text{maximize}} \\quad \\mathrm{tr}(W|\\Theta|) \\quad \\text{subject to} \\quad W \\preceq \\tau^2 I,\\ \\mathrm{diag}(W) \\le 1,\\ \\mathrm{tr}(W) \\le \\beta^2. \\tag{7}$$\n\nThis W-step is a semi-definite programming problem. We solve the dual of Eq (7), which leads to an\nefficient optimization problem.\u00b2 We introduce three dual variables: 1) a matrix $Y \\succeq 0$ for the $\\ell_2$\nnorm constraint, 2) a vector $v \\in \\mathbb{R}^p_+$ for the constraints on the diagonal and 3) a scalar $y \\ge 0$ for the\nconstraint on the trace. The Lagrangian is:\n\n$$L(W, Y, v, y) = \\mathrm{tr}(W|\\Theta|) + \\mathrm{tr}((\\tau^2 I - W)Y) + y(\\beta^2 - \\mathrm{tr}(W)) + v^\\top(1 - \\mathrm{diag}(W)). \\tag{8}$$\n\nThe dual function is:\n\n$$\\sup_{W \\succeq 0}\\ \\mathrm{tr}(W|\\Theta|) + \\mathrm{tr}((\\tau^2 I - W)Y) + y(\\beta^2 - \\mathrm{tr}(W)) + v^\\top(1 - \\mathrm{diag}(W)) = \\sup_{W \\succeq 0}\\ \\mathrm{tr}\\left(W(|\\Theta| - Y - yI - \\mathrm{diag}(v))\\right) + \\tau^2\\mathrm{tr}(Y) + y\\beta^2 + v^\\top 1 = \\begin{cases} \\tau^2\\mathrm{tr}(Y) + y\\beta^2 + v^\\top 1 & \\text{if } Y \\succeq |\\Theta| - yI - \\mathrm{diag}(v) \\\\ +\\infty & \\text{otherwise.} \\end{cases} \\tag{9}$$\n\nConsequently, we get the following dual problem for Eq (7):\n\n$$\\underset{Y, y, v}{\\text{minimize}} \\quad \\tau^2\\mathrm{tr}(Y) + y\\beta^2 + v^\\top 1 \\quad \\text{subject to} \\quad Y \\succeq |\\Theta| - yI - \\mathrm{diag}(v),\\ Y \\succeq 0,\\ y \\ge 0,\\ v \\ge 0. \\tag{10}$$\n\nEq (10) has a closed form solution in Y and y given that v is fixed. The dual problem boils down to:\n\n$$\\underset{v \\ge 0}{\\text{minimize}}\\ g(v) = \\underset{v \\ge 0}{\\text{minimize}} \\quad \\tau^2\\sum_{i=1}^{K}(C)_{+,i} + v^\\top 1, \\tag{11}$$\n\nwhere we have replaced $\\beta^2/\\tau^2$ with K (because $\\tau = \\beta/\\sqrt{K}$). We define $C = |\\Theta| - \\mathrm{diag}(v)$ and\nassume it has eigenvalues $(\\lambda_1, \\ldots, \\lambda_p)$ in descending order and $(C)_{+,i} = \\max(0, \\lambda_i)$. We solve\nEq (11) by the projected subgradient descent method, where the subgradient direction is:\n\n$$\\nabla_v g(v) = -\\tau^2\\,\\mathrm{diag}\\left(U_C 1_K(D_C) U_C^\\top\\right) + 1. \\tag{12}$$\n\n$D_C$ is the diagonal matrix of eigenvalues in descending order and $U_C$ is the matrix containing\northonormal eigenvectors of C as its columns. We define $1_K(D_C)$ as a binary diagonal matrix of size p whose\njth diagonal element is equal to 1 if and only if $j \\le K$ and $\\lambda_j > 0$.\nAfter finding the optimal $v^*$, the optimal solution $W^*$ can be obtained by:\n\n$$W^* = \\underset{W \\succeq 0}{\\mathrm{argmax}} \\quad \\mathrm{tr}\\left(W(|\\Theta| - \\mathrm{diag}(v^*))\\right) \\quad \\text{subject to} \\quad W \\preceq \\tau^2 I,\\ \\mathrm{tr}(W) \\le \\beta^2. \\tag{13}$$\n\nOne can see that the solution of problem (13) is $W^* = \\tau^2 U_{C^*} 1_{\\beta^2/\\tau^2}(D_{C^*}) U_{C^*}^\\top = \\tau^2 U_{C^*} 1_K(D_{C^*}) U_{C^*}^\\top$,\nwhere $C^*$, $U_{C^*}$, $D_{C^*}$ and $1_K(D_{C^*})$ are defined similarly to (12). By definition,\n$1_K(\\cdot)$ is a diagonal matrix with at most K nonzero elements. Therefore, $W^*$ will have rank at\nmost K, which means that we do not need the rank constraint on W. This leads to the following\nlemma.\nLemma 3.2 If we set $\\tau = \\beta/\\sqrt{K}$ in (5), the constraint $\\mathrm{rank}(W) \\le K$ will be automatically satisfied.\nFinally, we construct $Z^* = U\\sqrt{D}$ as instructed in Corollary 3.1.1. Note that in the intermediate\niterations, we do not need to compute Z; we need to construct the matrix $Z^*$ to find the overlapping\nblocks only after the learning algorithm has converged.\u00b3\n\n\u00b2 The primal problem has a strictly feasible solution $\\epsilon I$, where $\\epsilon$ is a small number and I is the identity matrix;\ntherefore strong duality holds.\n\n\u00b3 The source code is available at: http://suinlee.cs.washington.edu/software/grab\n\n3.4 A special case: K-way graph cut algorithm\n\nHere, we show that the GRAB algorithm generalizes the K-way graph cut algorithm in two ways: 1)\nGRAB allows each variable to be in multiple blocks with soft membership; and 2) GRAB updates a\nnetwork structure $\\Theta$, used as a similarity matrix, in each iteration. The proof is in the Appendix.\n\nLemma 3.3 Say that we use a binary matrix Z (hard assignment) with the following constraints: a)\nFor all variables i, $\\|Z_i\\|_2 \\le 1$, where $Z_i$ denotes the ith row of Z. 
b) For all blocks k, $\\|Z^k\\|_2 \\ge 1$,\nwhere $Z^k$ denotes the kth column of Z. This means that each variable can belong to only one block\n(i.e., non-overlapping blocks), and each block has at least one variable. Then GRAB is equivalent to\niterating between K-way graph-cut on $|\\Theta|$ to find Z and solving the graphical lasso problem to find $\\Theta$.\n\n4 Experimental Results\n\nWe present results on synthetically generated data and real data.\nComparison. Three state-of-the-art competitors are considered: UGL1 - unknown group $\\ell_1$\nregularization [3]; CGL - cluster graphical lasso [4]; and GLasso - standard graphical lasso [2]. CGL\nhas two variants depending on the type of hierarchical clustering used: average linkage clustering\n(CGL:ALC) and complete linkage clustering (CGL:CLC). Each method selects the regularization\nparameter using the standard cross-validation (or held-out validation) procedure.\nCGL and UGL1 have their own ways of selecting the number of blocks K [4, 3]. GRAB selects\nK based on the validation-set log-likelihood in initialization. We initialize GRAB by constructing\nthe Z matrix. We first perform spectral clustering on |S|, where S denotes the empirical covariance\nmatrix, then add overlap by assigning a random subset of variables to clusters with the highest\naverage correlation. Then, we project the Z matrix into the convex set defined in Section 2.2 and\nform $W = ZZ^\\top$. In the Z-step of the GRAB learning algorithm, we use step size $1/\\sqrt{t}$, where t is\nthe iteration number, and iterate until the relative change in the objective function is less than $10^{-6}$\n(Section 3.3). We use the warm-start technique between the BCD iterations.\nEvaluation criteria. In the synthetic data experiments (Section 4.1), we evaluate each method based\non the learned network with the optimal regularization parameter chosen for each method based\nonly on the training set. 
For the AML dataset (Section 4.2), we evaluate the learned blocks for varying\nregularization parameters (x-axis) to better illustrate the difference among the methods in terms of\ntheir performance. In all experiments, we standardize the data and show the average results over 10\nruns and the standard deviations as error bars.\n\n4.1 Synthetic Data Experiments\n\nData generation. We first generate $\\kappa$ overlapping blocks forming a chain, a random tree or a lattice.\nIn each case, two neighboring blocks overlap each other by o (the ratio of the variables shared between\ntwo overlapping blocks). Then, we randomly generate a true underlying network of p variables with\ndensity of 20%, and convert it to the precision matrix following the procedure of [17]. We generate\n100 training samples and 50 validation samples from the multivariate Gaussian distribution with\nmean zero and the covariance matrix equal to the inverse of the precision matrix.\nWe consider a varying number of true blocks $\\kappa \\in \\{9, 25, 49\\}$ and overlap ratio o = .25. For\n$\\kappa = 25$, we consider $o \\in \\{.1, .25, .4\\}$. We vary the number of variables $p \\in \\{400, 800\\}$ for the\nlattice-structured blocks. The results on the chain and random tree blocks are similar and so we\nprovide only the results for p = 400 for these block structures. For all methods, we considered the\nregularization parameter $\\lambda \\in [.02, .4]$ with step size .02.\nResults. Fig 3 compares five methods when a regularization parameter was selected for each method\nbased on the 50 validation samples. Each of the four plots corresponds to a different block structure or\nnumber of variables. Each bar group corresponds to a particular $(\\kappa, o, \\eta)$, in which we computed the\nmodularity measure $\\eta$ as (fraction of edges that fall within groups - expected fraction if edges were\ndistributed at random), as was done by [18]. Fig 3A shows how accurately each method recovers\nthe true network. 
For each method m, we compared the learned edges ($E_{Z,m}$) with those of the\nunderlying network ($E_Z$). By comparing $E_{Z,m}$ and $E_Z$, we can compute the precision and recall\nof network recovery. Since it is not enough to get only high precision or recall, we use the F1 (or\nF-measure) $= 2\\,\\frac{\\mathrm{pr} \\cdot \\mathrm{rec}}{\\mathrm{pr} + \\mathrm{rec}}$ as an evaluation metric.\n\nFigure 3: Comparison based on average network recovery F1 on synthetic data from lattice blocks,\nwhen p = 400 (first panel) and p = 800 (second panel), chain blocks (third panel) and random\nblocks (fourth panel) when p = 400. Each bar group corresponds to a particular (number of blocks $\\kappa$,\noverlap ratio o, modularity $\\eta$).\n[Figure not reproduced. Panel titles: Lattice: p=400, n=100; Lattice: p=800, n=100; Chain: p=400, n=100; Random: p=400, n=100. Bar-group labels ($\\kappa$, o, $\\eta$) as extracted: both lattice panels: (9, .25, 0.85), (25, .4, 0.88), (25, .25, 0.91), (25, .1, 0.96), (49, .25, 0.93); chain: (10, .25, 0.85), (30, .4, 0.94), (30, .25, 0.95), (30, .1, 0.96), (50, .25, 0.97); random: (10, .25, 0.87), (30, .4, 0.93), (30, .25, 0.96), (30, .1, 0.97), (50, .25, 0.96).]\n\nA number of authors have shown that identifying the underlying network structure is very challenging\nin the high-dimensional setting, resulting in low accuracies even on synthetic data [14, 19, 4]. Our\nresults also show that the F1 scores for network recovery are lower than 0.40. 
Despite that, GRAB identifies\nnetwork edges much more accurately than its competitors.\n\n4.2 Cancer Gene Expression Data\n\nWe consider the MILE data [20] that measure the mRNA expression levels of 16,853 genes in 541\npatients with acute myeloid leukemia (AML), an aggressive blood cancer. For a better visualization\nof the network in limited space (Fig 5), we selected 500 genes\u2074, consisting of the 488 highest varying\ngenes in MILE and 12 genes highly associated with AML: FLT3, NPM1, CEBPA, KIT, N-RAS,\nMLL, WT1, IDH1/2, TET2, DNMT3A, and ASXL1. These genes were identified by [21] in a large\nstudy on 1,185 patients with AML to be significantly mutated in these AML patients. These genes\nare well known to have a significant role in driving AML.\nHere, we evaluate GRAB and the other methods qualitatively in terms of how useful each method is\nfor cancer biologists to make discoveries from data. For that, we fix the number of blocks to be K = 10\nacross all methods such that we get an average of over 50 variables per block, which is considered close\nto the average number of genes in known pathways [22]. We varied K and obtained similar results.\nGenes in the same block are likely to share similar functions. Statistical significance of the overlap\nbetween gene clusters (here, blocks) and known functional gene sets has been widely used as an\nevaluation criterion [23, 5]. We show how to obtain blocks from the learned Z.\nObtaining blocks from Z. After the GRAB algorithm converges, we obtain a network estimate $\\Theta$\nand a block membership matrix Z. We find K overlapping blocks satisfying two constraints: a)\nthe maximum number of assignments is C; and b) each variable is assigned to $\\ge 1$ block. Here, we\nused C = 1.3p. We perform the following greedy procedure: 1) We first run the k-means clustering\nalgorithm on the p rows of the matrix Z.\u2075 2) We compute the similarity of variable i to block\n$B_k$ as $\\frac{1}{|B_k|}\\sum_{j \\in B_k}(ZZ^\\top)_{ij}$, where $|B_k|$ is the number of variables in $B_k$. Then, we add overlap by\nassigning $C - p$ variables to blocks with highest similarity.\nTo evaluate the blocks, we used 4,722 curated gene sets from the Molecular Signatures Database [24]\nand computed a p-value to measure the significance of the overlap between each block and each\ngene set. We consider the (block, gene set) pairs with false discovery rate (FDR)-corrected p < 0.05\nto be significantly overlapping pairs. When a block significantly overlaps with a gene set,\nwe consider the gene set to be revealed by the corresponding block. We compare GRAB with the\nmethods introduced in Section 4.1. Since we only need the blocks for this experiment, we added\ntwo more competitors: k-means and spectral clustering methods applied to |S|, where S denotes the\nempirical covariance matrix. Fig 4 shows the number of gene sets that are revealed by any block\n(FDR-corrected p < 0.05) in each method. GRAB significantly outperforms the other methods, which indicates the\nimportance of learning overlapping blocks; GRAB\u2019s overlapping blocks reveal the known functional\norganization of genes better than other methods. Fig 4 shows the average results of 10 random\ninitializations.\n\n\u2074 GRAB runs for 0.5-1.5 hours for 500 genes and up to 20 hours for 2,000 genes on a computer with a 2.5 GHz\nIntel Core i5 processor.\n\n\u2075 This resembles spectral clustering (equivalently, k-means on eigenvectors of the Laplacian matrix).\n\nFig 5 compares the learned networks $\\Theta$ by GLasso (A) and GRAB (B) when the regularization\nparameters are set such that the networks show a similar level of sparsity. For GRAB, we removed\nthe between-block edges and reordered genes such that the genes in the same blocks tend to appear\nnext to each other. GRAB shows a more interpretable network structure, highlighting the genes that\nbelong to multiple blocks.\nThe key innovation of GRAB is to allow for overlap between blocks. 
Interestingly, the 12 well-known\nAML genes are significantly enriched among the genes assigned to 3 or more blocks: FLT3, NPM1, TET2\nand DNMT3A are among them, whereas only 24 of the 500 genes are assigned to 3 or more blocks (p-value: 0.001)\n(Fig 5B). This supports our claim that variables assigned to multiple blocks are likely important. Out\nof the 24 genes assigned to $\\ge 3$ blocks, 12 are known to be involved in myeloid differentiation (the\nprocess impaired in AML) or other types of cancer. This can lead to new discoveries about the genes that\ndrive AML.\nThese genes include CCNA1, which has been shown to be significantly differentially expressed in AML\npatients [25]. TSPAN7 is expressed in acute myelocytic leukemia of some patients.\u2076 Several genes\nare associated with other types of cancer. For example, CCL20 is associated with pancreatic cancer\n[26]. ELOVL7 is involved in prostate cancer growth [27]. SCRN1 is a novel marker for prognosis\nin colorectal cancer [28]. These genes, which are assigned to many blocks and have been implicated in other\ncancers or leukemias, can lead to the discovery of novel AML driver genes.\n\nFigure 4: Average number of gene sets highly associated with blocks at a varying regularization parameter. The cross-validation results are consistent with these results. [Figure not reproduced.]\n\nFigure 5: Learned networks of: (A) GLasso and (B) GRAB. For GRAB, we have sorted the genes based on the blocks and highlighted the following 4 genes (out of the 12 highly associated genes with AML) that belong to many blocks: NPM1, FLT3, DNMT3A and TET2. [Figure not reproduced.]\n\n5 Discussion and Future Work\n\nWe present a novel general framework, called GRAB, that can explicitly model densely connected\nnetwork components that can overlap with each other in a graphical model. 
The novel GRAB\nstructural prior encourages the network estimate to be dense within each block (i.e., a densely\nconnected group of variables) and sparse between the variables in different blocks. The GRAB\nlearning algorithm adopts BCD and is convex in each step. We demonstrate the effectiveness of our\nframework on synthetic data and a cancer gene expression dataset. Our framework is general and can\nbe applied to other kinds of graphical models, such as pairwise Markov random fields.\nAcknowledgements: We give warm thanks to Reza Eghbali and Amin Jalali for many useful\ndiscussions. This work was supported by the National Science Foundation grant DBI-1355899 and\nthe American Cancer Society Research Scholar Award 127332-RSG-15-097-01-TBG.\n\n\u2076 http://www.genecards.org/cgi-bin/carddisp.pl?gene=TSPAN7\n\nReferences\n[1] S. L. Lauritzen. Graphical Models. Oxford Science Publications, 1996.\n[2] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432\u2013441, 2007.\n[3] B. M. Marlin and K. P. Murphy. Sparse gaussian graphical models with unknown block structure. pages 705\u2013712, 2009.\n[4] K. M. Tan, D. Witten, and A. Shojaie. The cluster graphical lasso for improved estimation of gaussian graphical models. Computational statistics & data analysis, 85:23\u201336, 2015.\n[5] S. Celik, B. A. Logsdon, and S.-I. Lee. Efficient dimensionality reduction for high-dimensional network estimation. ICML, 2014.\n[6] A. Lasorella, R. Benezra, and A. Iavarone. The id proteins: master regulators of cancer stem cells and tumour aggressiveness. Nature Reviews Cancer, 14(2):77\u201391, 2014.\n[7] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(10):19\u201335, 2007.\n[8] A. Rothman, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. 
Electronic Journal of Statistics, 2:494\u2013515, 2008.\n[9] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: Some \ufb01rst steps. Social Networks, 5(2):109\u2013137, 1983.\n[10] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888\u2013905, 2000.\n[11] D. M. Witten, J. H. Friedman, and N. Simon. New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20(4):892\u2013900, 2011.\n[12] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. UAI, 2008.\n[13] M. Grechkin, M. Fazel, D. Witten, and S.-I. Lee. Pathway graphical lasso. 2015.\n[14] C. Ambroise, J. Chiquet, and C. Matias. Inferring sparse Gaussian graphical models with latent structure. Electronic Journal of Statistics, 3:205\u2013238, 2009.\n[15] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475\u2013494, 2001.\n[16] C.-J. Hsieh, I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik. Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, pages 2330\u20132338, 2011.\n[17] Q. Liu and A. T. Ihler. Learning scale free networks by reweighted l1 regularization. In International Conference on Arti\ufb01cial Intelligence and Statistics, pages 40\u201348, 2011.\n[18] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577\u20138582, 2006.\n[19] K. Mohan, M. Chung, S. Han, D. Witten, S.-I. Lee, and M. Fazel. Structured learning of Gaussian graphical models. In NIPS, pages 620\u2013628, 2012.\n[20] T. Haferlach, A. Kohlmann, L. Wieczorek, et al. 
Clinical utility of microarray-based gene expression pro\ufb01ling in the diagnosis and subclassi\ufb01cation of leukemia. Journal of Clinical Oncology, 28(15):2529\u20132537, 2010.\n[21] Y. Shen, Y.-M. Zhu, X. Fan, et al. Gene mutation patterns and their prognostic impact in a cohort of 1185 patients with acute myeloid leukemia. Blood, 118(20):5593\u20135603, 2011.\n[22] E. Segal, M. Shapira, A. Regev, D. Pe\u2019er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their condition-speci\ufb01c regulators from gene expression data. Nature Genetics, 34(2):166\u2013176, 2003.\n[23] S.-I. Lee and S. Batzoglou. ICA-based clustering of genes from microarray expression data. In Advances in Neural Information Processing Systems, volume 16, 2003.\n[24] A. Liberzon, A. Subramanian, R. Pinchback, H. Thorvaldsd\u00f3ttir, P. Tamayo, and J. P. Mesirov. Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12):1739\u20131740, 2011.\n[25] Y. Fang, L. N. Xie, X. M. Liu, et al. Dysregulated module approach identi\ufb01es disrupted genes and pathways associated with acute myelocytic leukemia. Eur Rev Med Pharmacol Sci, 19(24):4811\u20134826, 2015.\n[26] C. Rubie, V. O. Frick, P. Ghadjar, et al. CCL20/CCR6 expression pro\ufb01le in pancreatic cancer. 2010.\n[27] K. Tamura, A. Makino, et al. Novel lipogenic enzyme ELOVL7 is involved in prostate cancer growth through saturated long-chain fatty acid metabolism. Cancer Research, 69(20):8133\u20138140, 2009.\n[28] N. Miyoshi, H. Ishii, K. Mimori, et al. SCRN1 is a novel marker for prognosis in colorectal cancer. Journal of Surgical Oncology, 101(2):156\u2013159, 2010.\n", "award": [], "sourceid": 1899, "authors": [{"given_name": "Mohammad Javad", "family_name": "Hosseini", "institution": "University of Washington"}, {"given_name": "Su-In", "family_name": "Lee", "institution": "University of Washington"}]}