{"title": "A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 2330, "page_last": 2338, "abstract": "In this paper, we consider the $\\ell_1$ regularized sparse inverse covariance matrix estimation problem with a very large number of variables. Even in the face of this high dimensionality, and with limited number of samples, recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field. Our proposed algorithm divides the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. We derive a bound on the distance of the approximate solution to the true solution. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, and in practice, is able to find effective partitions of the variables. We further use the approximate solution, i.e., solution resulting from solving the sub-problems, as an initial point to solve the original problem, and achieve a much faster computational procedure. As an example, a recent state-of-the-art method, QUIC requires 10 hours to solve a problem (with 10,000 nodes) that arises from a climate application, while our proposed algorithm, Divide and Conquer QUIC (DC-QUIC) only requires one hour to solve the problem.", "full_text": "A Divide-and-Conquer Procedure for Sparse Inverse\n\nCovariance Estimation\n\nCho-Jui Hsieh\n\nDept. of Computer Science\nUniversity of Texas, Austin\n\ncjhsieh@cs.utexas.edu\n\nInderjit S. Dhillon\n\nDept. of Computer Science\nUniversity of Texas, Austin\n\ninderjit@cs.utexas.edu\n\nPradeep Ravikumar\n\nDept. of Computer Science\n\nUniversity of Texas\n\npradeepr@cs.utexas.edu\n\nArindam Banerjee\n\nDept. 
of Computer Science & Engineering\nUniversity of Minnesota, Twin Cities\nbanerjee@cs.umn.edu\n\nAbstract\n\nWe consider the composite log-determinant optimization problem, arising from the ℓ1 regularized Gaussian maximum likelihood estimator of a sparse inverse covariance matrix, in a high-dimensional setting with a very large number of variables. Recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field, even in very high-dimensional regimes with a limited number of samples. In this paper, we are concerned with the computational cost of solving the above optimization problem. Our proposed algorithm partitions the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. Our key idea for the divide step, which produces the sub-problem partition, is as follows: we first derive a tractable bound on the quality of the approximate solution obtained from solving the corresponding sub-divided problems. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, in order to find effective partitions of the variables. For the conquer step, we use the approximate solution, i.e., the solution resulting from solving the sub-problems, as an initial point to solve the original problem, and thereby achieve a much faster computational procedure.\n\n1 Introduction\n\nLet {x1, x2, . . . , xn} be n sample points drawn from a p-dimensional Gaussian distribution N(μ, Σ), also known as a Gaussian Markov Random Field (GMRF), where each xi is a p-dimensional vector. 
An important problem is that of recovering the covariance matrix, or its inverse, given the samples in a high-dimensional regime where n ≪ p, and p could number in the tens of thousands. In such settings, the computational efficiency of any estimator becomes very important. A popular approach for such high-dimensional inverse covariance matrix estimation is to impose the structure of sparsity on the inverse covariance matrix (which can be shown to encourage conditional independences among the Gaussian variables), and to solve the following ℓ1 regularized maximum likelihood problem:\n\narg min_{Θ≻0} {−log det Θ + tr(SΘ) + λ‖Θ‖1} = arg min_{Θ≻0} f(Θ),  (1)\n\nwhere S = (1/n) Σ_{i=1}^n (xi − μ̃)(xi − μ̃)^T is the sample covariance matrix and μ̃ = (1/n) Σ_{i=1}^n xi is the sample mean. The key focus in this paper is on developing computationally efficient methods to solve this composite log-determinant optimization problem.\n\nDue in part to its importance, many optimization methods [4, 1, 9, 7, 6] have been developed in recent years for solving (1). However, these methods have a computational complexity of at least O(p^3) (typically this is the complexity per iteration). It is therefore hard to scale these procedures to problems with tens of thousands of variables. For instance, in a climate application, if we are modeling a GMRF over random variables corresponding to each Earth grid point, the number of nodes can easily number in the tens of thousands. 
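As a concrete illustration of problem (1), the sketch below forms the sample covariance S and evaluates the objective f(Θ) in plain numpy (a minimal sketch of ours; the function names `sample_cov` and `objective` are not from the paper):

```python
import numpy as np

def sample_cov(X):
    """Sample covariance S = (1/n) sum_i (x_i - mu)(x_i - mu)^T for an n x p data matrix X."""
    mu = X.mean(axis=0)
    Xc = X - mu
    return Xc.T @ Xc / X.shape[0]

def objective(Theta, S, lam):
    """f(Theta) from (1): -log det Theta + tr(S Theta) + lam * ||Theta||_1
    (the l1 norm is taken elementwise over all entries of Theta)."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite"
    return -logdet + np.trace(S @ Theta) + lam * np.abs(Theta).sum()
```

For instance, with S = Θ = I in dimension p = 3 and λ = 0.1, the objective evaluates to 0 + p + λp = 3.3.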
For this data, a recently proposed state-of-the-art method, QUIC [6], which uses a Newton-like method to solve (1), takes more than 10 hours to converge.\n\nA natural strategy when the computational complexity of a procedure scales poorly with the problem size is a divide-and-conquer strategy: given a partition of the set of nodes, we can first solve the ℓ1 regularized MLE over the sub-problems individually, and then, in a second step, aggregate the solutions together to get Θ̄. But how do we come up with a suitable partition? The main contribution of this paper is to provide a principled answer to this question. As we show, our resulting divide-and-conquer procedure produces overwhelming improvements in computational efficiency.\n\nInterestingly, [8] recently proposed a decomposition-based method for GMRFs. They first observe the following useful property of the composite log-determinant optimization problem in (1): if we threshold the off-diagonal elements of the sample covariance matrix S, and the resulting thresholded matrix is block-diagonal, then the corresponding inverse covariance matrix has the same block-diagonal sparsity structure as well. Using this property, they decomposed the problem along these block-diagonal components and solved the components separately, thus achieving a sharp computational gain. A major drawback to the approach of [8], however, is that the decomposition of the thresholded sample covariance matrix can often be very unbalanced; indeed, in many of our real-life examples, we found that the decomposition resulted in one giant component and several very small components. In these cases, the approach in [8] is only slightly faster than directly solving the entire problem.\n\nIn this paper, we propose a different strategy based on the following simple idea. 
Suppose we are given a particular partitioning, and solve the sub-problems specified by the partition components. The resulting decomposed estimator Θ̄ clearly need not be equal to the ℓ1 regularized MLE (1). However, can we use bounds on the deviation to propose a clustering criterion? We first derive a bound on ‖Θ̄ − Θ*‖F based on the off-diagonal error of the partition. Based on this bound, we propose a normalized-cut spectral clustering algorithm to minimize the off-diagonal error, which is able to find a balanced partition such that Θ̄ is very close to Θ*. Interestingly, we show that this clustering criterion can also be motivated as leveraging a property of the ℓ1 regularized MLE (1) that is more general than the one used in [8]. In the \u201cconquering\u201d step, we then use Θ̄ to initialize an iterative solver for the original problem (1). As we show, the resulting algorithm is much faster than other state-of-the-art methods. For example, our algorithm can achieve an accurate solution for the climate data problem in 1 hour, whereas directly solving it takes 10 hours.\n\nIn Section 2, we outline the standard skeleton of a divide-and-conquer framework for GMRF estimation. The key step in such a framework is to come up with a suitable and efficient clustering criterion. In Section 3, we then outline our clustering criteria. Finally, in Section 4 we show that in practice, our method achieves impressive improvements in computational efficiency.\n\n2 The Proposed Divide and Conquer Framework\n\nWe first set up some notation. In this paper, we will consider each p × p matrix X as an adjacency matrix, where V = {1, . . . , p} is the node set and Xij is the weighted link between node i and node j. We will use {Vc}k_{c=1} to denote a disjoint partitioning of the node set V, and each Vc will be called a partition or a cluster. Given a partition {Vc}k_{c=1}, our divide and conquer algorithm first solves the GMRF problem for all node partitions to get the inverse covariance matrices {Θ(c)}k_{c=1}, and then uses the block-diagonal matrix\n\nΘ̄ = diag(Θ(1), Θ(2), . . . , Θ(k))  (2)\n\nto initialize the solver for the whole GMRF. In this paper we use X(c) to denote the submatrix X_{Vc,Vc} for any matrix X. Notice that in our framework any sparse inverse covariance solver can be used; however, in this paper we will focus on using the state-of-the-art method QUIC [6] as the base solver, which was shown to have super-linear convergence when close to the solution. Using a better starting point enables QUIC to reach this region of super-linear convergence more quickly, as we will show later in our experiments.\n\nThe skeleton of the divide and conquer framework is quite simple and is summarized in Algorithm 1. In order for Algorithm 1 to be efficient, we require that Θ̄ defined in (2) be close to the optimal solution Θ* of the original problem. In the following, we will derive a bound for ‖Θ* − Θ̄‖F. Based on this bound, we propose a spectral clustering algorithm to find an effective partitioning of the nodes.\n\nAlgorithm 1: Divide and Conquer method for Sparse Inverse Covariance Estimation\nInput: Empirical covariance matrix S, scalar λ\nOutput: Θ*, the solution of (1)\n1 Obtain a partition of the nodes {Vc}k_{c=1};\n2 for c = 1, . . . , k do\n3   Solve (1) on S(c) and the subset of variables in Vc to get Θ(c);\n4 end\n5 Form Θ̄ from Θ(1), Θ(2), . . . , Θ(k) as in (2);\n6 Use Θ̄ as an initial point to solve the whole problem (1);\n\n2.1 Hierarchical Divide and Conquer Algorithm\n\nAssume we conduct a k-way clustering; then the initial time for solving the sub-problems is at least O(k(p/k)^3) = O(p^3/k^2), where p denotes the dimensionality. When we take k = 2, the divide and conquer algorithm can be at most 4 times faster than the original one. One can increase k; however, a larger k entails a worse initial point for training the whole problem. Based on this observation, we consider a hierarchical version of our divide-and-conquer algorithm: for solving the sub-problems we can again apply the divide and conquer algorithm. In this way, the initial time can be much less than O(p^3/k^2) if we apply the divide and conquer algorithm hierarchically at each level. In the experiments, we will see that this hierarchical method can further improve the performance of the divide-and-conquer algorithm.\n\n3 Main Results: Clustering Criteria for GMRF\n\nThis section outlines the main contribution of this paper: coming up with suitable and efficient clustering criteria for use within the divide and conquer framework of the previous section.\n\n3.1 Bounding the distance between Θ* and Θ̄\n\nTo start, we discuss the following result from [8], which we reproduce using the notation of this paper for convenience. Specifically, [8] shows that when all the between-cluster edges in S have absolute values smaller than λ, Θ* will have a block-diagonal structure.\n\nTheorem 1 ([8]). For any λ > 0 and a given partition {Vc}k_{c=1}, if |Sij| ≤ λ for all i, j in different partitions, then Θ* = Θ̄, where Θ* is the optimal solution of (1) and Θ̄ is as defined in (2).\n\nAs a consequence, if a partition {Vc}k_{c=1} satisfies the assumption of Theorem 1, Θ̄ and Θ* will be the same, and the last step of Algorithm 1 is no longer needed. Therefore the result in [8] may be viewed as a special case of our Divide-and-Conquer Algorithm 1. However, in most real examples a perfect partitioning as in Theorem 1 does not exist, which motivates a divide and conquer framework that does not need assumptions as stringent as those of Theorem 1. To allow a more general relationship between Θ* and Θ̄, we first prove a similar property for the following generalized inverse covariance problem:\n\nΘ* = arg min_{Θ≻0} {−log det Θ + tr(SΘ) + Σ_{i,j} Λij|Θij|} = arg min_{Θ≻0} f_Λ(Θ).  (3)\n\nIn the following, we use 1λ to denote the matrix with all elements equal to λ. Therefore (1) is a special case of (3) with Λ = 1λ. In (3), the regularization parameter Λ is a p × p matrix, where each element corresponds to a weighted regularization of the corresponding element of Θ. We can then prove the following theorem, as a generalization of Theorem 1.\n\nTheorem 2. For any matrix regularization parameter Λ (Λij > 0 ∀i, j) and a given partition {Vc}k_{c=1}, if |Sij| ≤ Λij for all i, j in different partitions, then the solution of (3) is the block-diagonal matrix Θ̄ defined in (2), where Θ(c) is the solution of (3) with sample covariance S(c) and regularization parameter Λ(c).\n\nProof. 
Consider the dual problem of (3):\n\nmax_{W≻0} log det W  s.t.  |Wij − Sij| ≤ Λij ∀i, j.  (4)\n\nBased on the condition stated in the theorem, we can easily verify that W̄ = Θ̄^{-1} is a feasible solution of (4), with objective function value Σ_{c=1}^k log det W̄(c). To show that W̄ is the optimal solution of (4), we consider an arbitrary feasible solution Ŵ. From Fischer's inequality [2], det Ŵ ≤ Π_{c=1}^k det Ŵ(c) for Ŵ ≻ 0. Since W̄(c) is the optimizer of the c-th block, det W̄(c) ≥ det Ŵ(c) for all c, which implies log det Ŵ ≤ log det W̄. Therefore Θ̄ is the primal optimal solution.\n\nNext we apply Theorem 2 to develop a decomposition method. Assume our goal is to solve (1) and we have clusters {Vc}k_{c=1} which may not satisfy the assumption in Theorem 1. We start by choosing a matrix regularization weight Λ̄ such that\n\nΛ̄ij = λ if i, j are in the same cluster, and Λ̄ij = max(|Sij|, λ) if i, j are in different clusters.  (5)\n\nNow consider the generalized inverse covariance problem (3) with this specified Λ̄. By construction, the assumption of Theorem 2 holds for Λ̄, so we can decompose this problem into k sub-problems; for each cluster c ∈ {1, . . . , k}, the sub-problem has the following form:\n\nΘ(c) = arg min_{Θ≻0} {−log det Θ + tr(S(c)Θ) + λ‖Θ‖1},\n\nwhere S(c) is the sample covariance matrix of cluster c. Therefore, Θ̄ is the optimal solution of problem (3) with Λ̄ as the regularization parameter. Based on this observation, we will now provide another view of our divide and conquer algorithm as follows. 
Considering the dual problem (4) of the sparse inverse covariance estimation problem with the weighted regularization, Algorithm 1 can be seen to first solve (4) with Λ = Λ̄ defined in (5) to get the initial point W̄, and then solve (4) with Λ = 1λ. Therefore we initially solve the problem with looser box constraints to get an initial guess, and then solve the problem with tighter constraints. Intuitively, if the relaxed constraints Λ̄ are close to the real constraints 1λ, the solutions W̄ and W* will be close to each other. In the following we derive a bound based on this observation.\n\nFor convenience, we use P_λ to denote the original dual problem (4) with Λ = 1λ, and P_Λ̄ to denote the relaxed dual problem with the edge weights defined in (5). Based on the above discussion, W* = (Θ*)^{-1} is the solution of P_λ and W̄ = Θ̄^{-1} is the solution of P_Λ̄. We define E as the following matrix:\n\nEij = 0 if i, j are in the same cluster, and Eij = max(|Sij| − λ, 0) otherwise.  (6)\n\nIf E = 0, all the between-cluster elements are below the threshold λ, so W* = W̄ by Theorem 2. In the following we consider the more interesting case where E ≠ 0. In this case ‖E‖F measures how much the between-cluster elements exceed the threshold λ, and a good clustering algorithm should be able to find a partition that minimizes ‖E‖F. In the following theorem we show that ‖W* − W̄‖F can be bounded by ‖E‖F, and therefore ‖Θ* − Θ̄‖F can also be bounded by ‖E‖F:\n\nTheorem 3. If there exists a γ > 0 such that ‖E‖2 ≤ (1 − γ)/‖W̄‖2, then\n\n‖W* − W̄‖F < [p max(σmax(W̄), σmax(W*)) / (γ σmin(W̄))] ‖E‖F,  (7)\n\n‖Θ* − Θ̄‖F ≤ [p max(σmax(Θ̄), σmax(Θ*))^2 σmax(Θ̄) / (γ min(σmin(Θ*), σmin(Θ̄)))] ‖E‖F,  (8)\n\nwhere σmin(·), σmax(·) denote the minimum/maximum singular values.\n\nProof. To prove Theorem 3, we need the following lemma, which is proved in the Appendix:\n\nLemma 1. If A is a positive definite matrix and there exists a γ > 0 such that ‖A^{-1}B‖2 ≤ 1 − γ, then\n\nlog det(A + B) ≥ log det A − [p/(γ σmin(A))] ‖B‖F.  (9)\n\nSince P_Λ̄ has looser box constraints than P_λ, W̄ may not be a feasible solution of P_λ. However, we can construct a feasible solution Ŵ = W̄ − G ∘ E, where Gij = sign(Wij) and ∘ denotes the entrywise product of two matrices. The assumption of this theorem implies that ‖G ∘ E‖2 ≤ (1 − γ)/‖W̄‖2, so ‖W̄^{-1}(G ∘ E)‖2 ≤ 1 − γ. From Lemma 1 we have log det Ŵ ≥ log det W̄ − [p/(γ σmin(W̄))] ‖E‖F. Since W* is the optimal solution of P_λ and Ŵ is a feasible solution of P_λ, log det W* ≥ log det Ŵ ≥ log det W̄ − [p/(γ σmin(W̄))] ‖E‖F. Also, since W̄ is the optimal solution of P_Λ̄ and W* is a feasible solution of P_Λ̄, we have log det W* < log det W̄. 
Therefore, |log det W̄ − log det W*| < [p/(γ σmin(W̄))] ‖E‖F. By the mean value theorem and some calculation, |log det W* − log det W̄| ≥ ‖W̄ − W*‖F / max(σmax(W̄), σmax(W*)), which implies (7). To establish the bound on Θ, we use the mean value theorem again with g(W) = W^{-1} = Θ and ∇g(W) = Θ ⊗ Θ, where ⊗ is the Kronecker product. Moreover, σmax(Θ ⊗ Θ) = (σmax(Θ))^2, so we can combine this with (7) to prove (8).\n\n3.2 Clustering algorithm\n\nIn order to obtain computational savings, the clustering algorithm used within the divide-and-conquer algorithm (Algorithm 1) should satisfy three conditions: (1) minimize the distance ‖Θ̄ − Θ*‖F between the approximate and the true solution, (2) be cheap to compute, and (3) partition the nodes into balanced clusters.\n\nAssume the real inverse covariance matrix Θ* is block-diagonal; then it is easy to show that W* is also block-diagonal. This is the case considered in [8]. Now let us assume Θ* has an almost block-diagonal structure, with a few nonzero entries in the off-diagonal blocks. Suppose Θ* = Θbd + v ei ej^T, where Θbd is the block-diagonal part of Θ* and ei denotes the i-th standard basis vector. Then, by the Sherman-Morrison formula,\n\nW* = (Θ*)^{-1} = (Θbd)^{-1} − [v / (1 + v ((Θbd)^{-1})ji)] θbd_i (θbd_j)^T,\n\nwhere θbd_i is the i-th column vector of (Θbd)^{-1}. Therefore adding one off-diagonal element to Θbd introduces at most one nonzero off-diagonal block in W. Moreover, if block (i, j) of W is already nonzero, adding more elements in block (i, j) of Θ will not introduce any more nonzero blocks in W. 
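This rank-one argument can be verified numerically. The following numpy sketch (our own illustration; the block sizes and the choice of i, j, v are arbitrary) builds a block-diagonal Θbd, adds a single off-diagonal entry, and checks via the Sherman-Morrison identity that the inverse changes in exactly one off-diagonal block:

```python
import numpy as np

rng = np.random.default_rng(0)

def spd(k):
    """A random k x k symmetric positive definite block."""
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

# Block-diagonal Theta_bd with two 3 x 3 blocks (nodes {0,1,2} and {3,4,5})
Theta_bd = np.zeros((6, 6))
Theta_bd[:3, :3] = spd(3)
Theta_bd[3:, 3:] = spd(3)

# Add one off-diagonal entry: Theta = Theta_bd + v * e_i e_j^T with i, j in different blocks
i, j, v = 1, 4, 0.3
Theta = Theta_bd.copy()
Theta[i, j] += v

W_bd = np.linalg.inv(Theta_bd)
W = np.linalg.inv(Theta)
diff = W - W_bd

# Sherman-Morrison prediction: a rank-one update supported on one off-diagonal block
pred = -v * np.outer(W_bd[:, i], W_bd[j, :]) / (1 + v * W_bd[j, i])
assert np.allclose(diff, pred)
# Fill-in appears only in the (rows 0-2, cols 3-5) block of W
assert np.allclose(diff[:3, :3], 0)
assert np.allclose(diff[3:, 3:], 0)
assert np.allclose(diff[3:, :3], 0)
```

Since W_bd is block-diagonal, the outer product of its i-th column and j-th row is nonzero only in the single block pairing the clusters of i and j, matching the claim above.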
As long as just a few entries in the off-diagonal blocks of Θ* are nonzero, W will be block-diagonal with a few nonzero off-diagonal blocks. Since ‖W* − S‖∞ ≤ λ, we are able to use the thresholded matrix Sλ to guess the clustering structure of Θ*.\n\nIn the following, we show that this observation is consistent with the bound of Theorem 3. From (8), ideally we want to find a partition minimizing ‖E‖* = Σi |σi(E)|. Since it is computationally difficult to optimize this directly, we can use the bound ‖E‖* ≤ √p ‖E‖F, so that minimizing ‖E‖F can be cast as a relaxation of the problem of minimizing ‖Θ̄ − Θ*‖F.\n\nTo find a partition minimizing ‖E‖F, we want a partition {Vc}k_{c=1} such that the sum of the between-cluster entries of Sλ is minimized, where Sλ is defined as\n\n(Sλ)ij = max(|Sij| − λ, 0)^2 ∀ i ≠ j, and (Sλ)ij = 0 ∀ i = j.  (10)\n\nAt the same time, we want to have balanced clusters. Therefore, we minimize the following normalized cut objective value [10]:\n\nNCut(Sλ, {Vc}k_{c=1}) = Σ_{c=1}^k [Σ_{i∈Vc, j∉Vc} (Sλ)ij] / d(Vc), where d(Vc) = Σ_{i∈Vc} Σ_{j=1}^p (Sλ)ij.  (11)\n\nIn (11), d(Vc) is the volume of the vertex set Vc, used for balancing cluster sizes, and the numerator is the sum of between-cluster entries, which corresponds to ‖E‖F^2. As shown in [10, 3], minimizing the normalized cut is equivalent to finding cluster indicators x1, . . . , xk to minimize\n\nmin_x Σ_{c=1}^k [xc^T (D − Sλ) xc / (xc^T D xc)] = trace(Y^T (I − D^{-1/2} Sλ D^{-1/2}) Y),  (12)\n\nwhere D is the diagonal matrix with Dii = Σ_{j=1}^p (Sλ)ij, Y = D^{1/2} X, and X = [x1 . . . xk]. Therefore, a common way of getting the cluster indicators is to compute the leading k eigenvectors of D^{-1/2} Sλ D^{-1/2} and then run kmeans on these eigenvectors.\n\nThe time complexity of the normalized cut on Sλ is dominated by computing the leading k eigenvectors of D^{-1/2} Sλ D^{-1/2}, which is at most O(p^3). Since most state-of-the-art methods for solving (1) require O(p^3) per iteration, the cost of clustering is no more than one iteration of the original solver. If Sλ is sparse, as is common in real situations, we can further speed up the clustering phase by using the Graclus multilevel algorithm, a faster heuristic for minimizing the normalized cut [3].\n\n4 Experimental Results\n\nIn this section, we first show that the normalized cut criterion on the thresholded matrix Sλ in (10) can capture the block-diagonal structure of the inverse covariance matrix before solving (1). Using the clustering results, we then show that our divide and conquer algorithm significantly reduces the time needed for solving the sparse inverse covariance estimation problem. We use the following datasets:\n\n1. Leukemia: Gene expression data, originally provided by [5]; we use the data after the pre-processing done in [7].\n2. Climate: This dataset is generated from NCEP/NCAR Reanalysis data 1, with focus on the daily temperature at several grid points on Earth. We treat each grid point as a random variable, and use the daily temperatures in year 2001 as features.\n3. Stock: Financial dataset downloaded from Yahoo Finance 2. 
We collected 3724 stocks, each with the daily closing price recorded over the 300 days preceding May 15, 2012.\n4. Synthetic: We generated synthetic data containing 20,000 nodes with 100 randomly generated group centers μ1, . . . , μ100, each of dimension 200, such that each group c has half of its nodes with feature μc and the other half with feature −μc. We then add Gaussian noise to the features.\n\nThe data statistics are summarized in Table 1.\n\nTable 1: Dataset Statistics\n  Leukemia: p = 1255, n = 72\n  Climate: p = 10512, n = 1464\n  Stock: p = 3724, n = 300\n  Synthetic: p = 20000, n = 200\n\n1 www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.surface.html\n2 http://finance.yahoo.com/\n\n4.1 Clustering quality on real datasets\n\nGiven a clustering partition {Vc}k_{c=1}, we use the following \u201cwithin-cluster ratio\u201d to determine its performance on Θ*:\n\nR({Vc}k_{c=1}) = [Σ_{c=1}^k Σ_{i,j: i≠j and i,j∈Vc} (Θ*ij)^2] / [Σ_{i≠j} (Θ*ij)^2].  (13)\n\nHigher values of R({Vc}k_{c=1}) are indicative of better performance of the clustering algorithm.\n\nTable 2: Within-cluster ratios (see (13)) on real datasets. We can see that our proposed clustering method, spectral clustering on Sλ, is very close to the clustering based on Θ̂ = Θ* ∘ Θ*, which we cannot see before solving (1).\n  Columns: Leukemia (λ = 0.5, λ = 0.3), Climate (λ = 0.005, λ = 0.001), Stock (λ = 0.0005, λ = 0.0001), Synthetic (λ = 0.005, λ = 0.001)\n  random clustering: 0.26, 0.24, 0.24, 0.25, 0.24, 0.24, 0.25, 0.24\n  spectral on Sλ:    0.91, 0.84, 0.87, 0.65, 0.96, 0.87, 0.98, 0.93\n  spectral on Θ̂:     0.93, 0.84, 0.90, 0.71, 0.97, 0.85, 0.99, 0.93\n\nIn Section 3.1, we presented theoretical justification for using normalized cut on the thresholded matrix Sλ. 
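The thresholding and normalized-cut steps of Section 3.2 can be sketched as follows for the two-cluster case (a minimal numpy sketch of ours, splitting on the sign of the second leading eigenvector of D^{-1/2} Sλ D^{-1/2} instead of running kmeans; all function names are hypothetical):

```python
import numpy as np

def threshold_matrix(S, lam):
    """S_lambda as in (10): squared soft-thresholded off-diagonals, zero diagonal."""
    Sl = np.maximum(np.abs(S) - lam, 0.0) ** 2
    np.fill_diagonal(Sl, 0.0)
    return Sl

def two_way_partition(Sl):
    """Two-way normalized-cut partition from the sign of the second leading
    eigenvector of D^{-1/2} S_lambda D^{-1/2}.  (For k > 2 one would run
    kmeans on the top-k eigenvectors instead, as described in the text.)"""
    d = Sl.sum(axis=1)
    d[d == 0] = 1.0                        # guard against isolated nodes
    dinv = 1.0 / np.sqrt(d)
    M = dinv[:, None] * Sl * dinv[None, :]
    _, V = np.linalg.eigh(M)               # eigenvalues in ascending order
    return (V[:, -2] >= 0).astype(int)     # split on the 2nd leading eigenvector

# Toy covariance: two tight groups {0,1,2}, {3,4,5} with one weak cross entry
S = np.eye(6)
for a, b in [(0, 1), (1, 2), (0, 2)]:
    S[a, b] = S[b, a] = 0.8
for a, b in [(3, 4), (4, 5), (3, 5)]:
    S[a, b] = S[b, a] = 0.7
S[2, 3] = S[3, 2] = 0.15                   # small between-cluster entry

labels = two_way_partition(threshold_matrix(S, 0.1))
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] == labels[5]
assert labels[0] != labels[3]
```

On this toy matrix the between-cluster entry |S23| = 0.15 barely exceeds λ = 0.1, so its thresholded weight (and hence the cut cost, i.e. ‖E‖F) across the two groups is tiny, and the partition recovers the two groups.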
Here we show that this strategy shows great promise on real datasets. Table 2 shows the within-cluster ratios (13) of the inverse covariance matrix using different clustering methods. We include the following methods in our comparison:\n\n\u2022 Random partition: partition the nodes randomly into k clusters. We use this as a baseline.\n\u2022 Spectral clustering on the thresholded matrix Sλ: our proposed method.\n\u2022 Spectral clustering on Θ̂ = Θ* ∘ Θ*, the element-wise square of Θ*: this is the best clustering we can conduct, as it directly minimizes the within-cluster ratio of the Θ* matrix. However, we cannot use this method in practice, as we do not know Θ*.\n\nWe can observe in Table 2 that our proposed spectral clustering on Sλ achieves almost the same performance as spectral clustering on Θ* ∘ Θ*, even though we do not know Θ*. Also, Figure 1 gives a pictorial view of how our clustering results help in recovering the sparse inverse covariance matrix at different levels. We run a hierarchical 2-way clustering on the Leukemia dataset, and plot the original Θ* (solution of (1)), Θ̄ with 1-level clustering, and Θ̄ with 2-level clustering. We can see that although our clustering method does not look at Θ*, the clustering result matches the nonzero pattern of Θ* pretty well.\n\nFigure 1: The clustering results and the nonzero patterns of the inverse covariance matrix Θ* on the Leukemia dataset: (a) the inverse covariance matrix Θ*; (b) the recovered Θ̄ from level-1 clusters; (c) the recovered Θ̄ from level-2 clusters. Although our clustering method does not look at Θ*, the clustering results match the nonzero pattern in Θ* pretty well.\n\n4.2 The performance of our divide and conquer algorithm\n\nNext, we investigate the time taken by our divide and conquer algorithm on large real and synthetic datasets. We include the following methods in our comparisons:\n\n\u2022 DC-QUIC-1: Divide and Conquer framework with QUIC and 1 level of clustering.\n\u2022 DC-QUIC-3: Divide and Conquer QUIC with 3 levels of hierarchical clustering.\n\u2022 QUIC: The original QUIC, a state-of-the-art second-order solver for sparse inverse covariance estimation [6].\n\u2022 QUIC-conn: The decomposition method described in [8], using QUIC to solve each smaller sub-problem.\n\u2022 Glasso: The block coordinate descent algorithm proposed in [4].\n\u2022 ALM: The alternating linearization algorithm proposed and implemented by [9].\n\nAll of our experiments are run on an Intel Xeon E5440 2.83GHz CPU with 32GB main memory. Figure 2 shows the results.\n\nFigure 2: Comparison of algorithms on real datasets: (a) Leukemia; (b) Stock; (c) Climate; (d) Synthetic. The results show that DC-QUIC is much faster than other state-of-the-art solvers.\n\nFor DC-QUIC and QUIC-conn, we show the run time of the whole process, including the preprocessing time. We can see that on the largest synthetic dataset, DC-QUIC is more than 10 times faster than QUIC, and thus also faster than Glasso and ALM. For the largest real dataset, Climate, with more than 10,000 points, QUIC takes more than 10 hours to get a reasonable solution (relative error=0), while DC-QUIC-3 converges in 1 hour. Moreover, on these 4 datasets QUIC-conn, using the decomposition method of [8], provides limited savings, in part because the connected components of the thresholded covariance matrix for each dataset turned out to consist of a giant component and multiple smaller components. 
DC-QUIC, however, was able to leverage a reasonably good clustered decomposition, which dramatically reduced the inference time.\n\nAcknowledgements\n\nWe would like to thank Soumyadeep Chatterjee and Puja Das for help with the climate and stock data. C.-J.H., I.S.D. and P.R. acknowledge the support of NSF under grant IIS-1018426. P.R. also acknowledges support from NSF IIS-1149803. A.B. acknowledges support from NSF grants IIS-0916750, IIS-0953274, and IIS-1029711.\n\nReferences\n[1] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9, June 2008.\n[2] R. Bhatia. Matrix Analysis. Springer Verlag, New York, 1997.\n[3] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(11):1944\u20131957, 2007.\n[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432\u2013441, July 2008.\n[5] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, and C. D. Bloomfield. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, pages 531\u2013537, 1999.\n[6] C.-J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In NIPS, 2011.\n[7] L. Li and K.-C. Toh. An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation, 2:291\u2013315, 2010.\n[8] R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 13:723\u2013736, 2012.\n[9] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In NIPS, 2010.\n[10] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):888\u2013905, 2000.\n", "award": [], "sourceid": 1139, "authors": [{"given_name": "Cho-jui", "family_name": "Hsieh", "institution": null}, {"given_name": "Arindam", "family_name": "Banerjee", "institution": null}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}]}