{"title": "Large Scale Distributed Sparse Precision Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 584, "page_last": 592, "abstract": "We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in column-blocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores.", "full_text": "Large Scale Distributed Sparse Precision Estimation\n\nDept. of Computer Science & Engg, University of Minnesota, Twin Cities\n\nHuahua Wang, Arindam Banerjee\n{huwang,banerjee}@cs.umn.edu\n\nCho-Jui Hsieh, Pradeep Ravikumar, Inderjit S. Dhillon\nDept. of Computer Science, University of Texas, Austin\n{cjhsieh,pradeepr,inderjit}@cs.utexas.edu\n\nAbstract\n\nWe consider the problem of sparse precision matrix estimation in high dimensions\nusing the CLIME estimator, which has several desirable theoretical properties. We\npresent an inexact alternating direction method of multiplier (ADMM) algorithm\nfor CLIME, and establish rates of convergence for both the objective and opti-\nmality conditions. 
Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in column-blocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores.\n\n1 Introduction\nConsider a p-dimensional probability distribution with true covariance matrix \u03a30 \u2208 S^p_++ and true precision (or inverse covariance) matrix \u03a90 = \u03a30^{-1} \u2208 S^p_++. Let [R1 \u00b7\u00b7\u00b7 Rn] \u2208 \u211d^{p\u00d7n} be n independent and identically distributed random samples drawn from this p-dimensional distribution. The centered normalized sample matrix A = [a1 \u00b7\u00b7\u00b7 an] \u2208 \u211d^{p\u00d7n} can be obtained as ai = (1/\u221an)(Ri - \u00afR), where \u00afR = (1/n)\u2211_i Ri, so that the sample covariance matrix can be computed as C = AA'. In recent years, considerable effort has been invested in obtaining an accurate estimate of the precision matrix \u02c6\u2126 based on the sample covariance matrix C in the 'low sample, high dimensions' setting, i.e., n \u226a p, especially when the true precision \u03a90 is assumed to be sparse [28]. Suitable estimators and corresponding statistical convergence rates have been established for a variety of settings, including distributions with sub-Gaussian tails and polynomial tails [25, 3, 19]. 
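The centering and normalization just described can be sketched in a few lines of NumPy (a minimal illustration of the definitions above; the function name is ours):

```python
import numpy as np

def sample_covariance(R):
    """Form the centered, normalized sample matrix A and covariance C = A A'.

    R is p x n: n i.i.d. samples of a p-dimensional variable, one per column.
    Mirrors a_i = (1/sqrt(n)) (R_i - Rbar) above; the function name is ours.
    """
    p, n = R.shape
    Rbar = R.mean(axis=1, keepdims=True)   # p x 1 sample mean
    A = (R - Rbar) / np.sqrt(n)            # centered, normalized samples
    C = A @ A.T                            # p x p sample covariance
    return A, C

# In the 'low sample, high dimensions' regime (n << p), C is rank-deficient:
rng = np.random.default_rng(0)
R = rng.standard_normal((50, 10))          # p = 50, n = 10
A, C = sample_covariance(R)
```

Note that centering makes rank(C) at most n - 1, which is what the low-rank multiplication trick in Section 3.2 exploits.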
Recent advances have also established parameter-free methods which achieve minimax rates of convergence [4, 19].\nSpurred by these advances in the statistical theory of precision matrix estimation, there has been considerable recent work on developing computationally efficient optimization methods for solving the corresponding statistical estimation problems: see [1, 8, 14, 21, 13], and references therein. While these methods are able to efficiently solve problems up to a few thousand variables, ultra-large-scale problems with millions of variables remain a challenge. Note further that in precision matrix estimation, the number of parameters scales quadratically with the number of variables, so that with a million dimensions p = 10^6, the total number of parameters to be estimated is a trillion, p^2 = 10^12. The focus of this paper is on designing an efficient distributed algorithm for precision matrix estimation under such ultra-large-scale dimensional settings.\nWe focus on the CLIME statistical estimator [3], which solves the following linear program (LP):\n\nmin ||\u02c6\u2126||_1 s.t. ||C\u02c6\u2126 - I||\u221e \u2264 \u03bb , (1)\n\nwhere \u03bb > 0 is a tuning parameter. The CLIME estimator not only has strong statistical guarantees [3], but also comes with inherent computational advantages. First, the LP in (1) does not explicitly enforce positive definiteness of \u02c6\u2126, which can be a challenge to handle efficiently in high dimensions. Second, (1) can be decomposed into p independent LPs, one for each column of \u02c6\u2126. This separable structure has motivated solvers for (1) which solve the LP column-by-column using interior point methods [3, 28] or the alternating direction method of multipliers (ADMM) [18]. 
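For intuition, a single column of (1) can be solved directly as a small LP (an illustrative sketch using SciPy's `linprog`, not the paper's solver; the u/v split is the standard l1 reformulation):

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(C, e, lam):
    """Solve one column of (1): min ||x||_1  s.t.  ||C x - e||_inf <= lam.

    Illustrative sketch only: the split x = u - v with u, v >= 0 makes the
    l1 objective linear, and the infinity-norm constraint becomes 2p linear
    inequalities. The function name is ours.
    """
    p = C.shape[0]
    cost = np.ones(2 * p)                   # sum(u) + sum(v) = ||x||_1
    M = np.hstack([C, -C])                  # C x written in terms of (u, v)
    A_ub = np.vstack([M, -M])               # C x - e <= lam and e - C x <= lam
    b_ub = np.concatenate([lam + e, lam - e])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v

# Sanity check: with C = I the solution is soft-thresholding of e.
x = clime_column(np.eye(5), np.eye(5)[:, 0], lam=0.3)   # -> [0.7, 0, 0, 0, 0]
```

Interior-point LP solvers like this scale poorly in p, which is exactly what motivates the ADMM variant developed below.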
However, these solvers do not scale well to ultra-high-dimensional problems:\nthey are not designed to run on hundreds to thousands of cores, and in particular require the entire\nsample covariance matrix C to be loaded into the memory of a single machine, which is impractical\neven for moderate sized problems.\nIn this paper, we present an ef\ufb01cient CLIME-ADMM variant along with a scalable distributed frame-\nwork for the computations [2, 26]. The proposed CLIME-ADMM algorithm can scale up to millions\nof dimensions, and can use up to thousands of cores in a shared-memory or distributed-memory ar-\nchitecture. The scalability of our method relies on the following key innovations. First, we propose\nan inexact ADMM [27, 12] algorithm targeted to CLIME, where each step is either elementwise\nparallel or involves suitable matrix multiplications. We show that the rates of convergence of the\nobjective to the optimum as well as residuals of constraint violation are both O(1/T ). Second, we\nsolve (1) in column-blocks of the precision matrix at a time, rather than one column at a time. Since\n(1) already decomposes columnwise, solving multiple columns together in blocks might not seem\nworthwhile. However, as we show our CLIME-ADMM working with column-blocks uses matrix-\nmatrix multiplications which, building on existing literature [15, 5, 11] and the underlying low rank\nand sparse structure inherent in the precision matrix estimation problem, can be made substantially\nmore ef\ufb01cient than repeated matrix-vector multiplications. Moreover, matrix multiplication can be\nfurther simpli\ufb01ed as block-by-block operations, which allows choosing optimal block sizes to min-\nimize cache misses, leading to high scalability and performance [16, 5, 15]. 
Lastly, since the core\ncomputations can be parallelized, CLIME-ADMM scales almost linearly with the number of cores.\nWe experiment with shared-memory and distributed-memory architectures to illustrate this point.\nEmpirically, CLIME-ADMM is shown to be much faster than existing methods for precision esti-\nmation, and scales well to high-dimensional problems, e.g., we estimate a precision matrix of one\nmillion dimension and one trillion parameters in 11 hours by running the algorithm on 400 cores.\nOur framework can be positioned as a part of the recent surge of effort in scaling up machine learn-\ning algorithms [29, 22, 6, 7, 20, 2, 23, 9] to \u201cBig Data\u201d. Scaling up machine learning algorithms\nthrough parallelization and distribution has been heavily explored on various architectures, includ-\ning shared-memory architectures [22], distributed memory architectures [23, 6, 9] and GPUs [24].\nSince MapReduce [7] is not ef\ufb01cient for optimization algorithms, [6] proposed a parameter server\nthat can be used to parallelize gradient descent algorithms for unconstrained optimization problems.\nHowever, this framework is ill-suited for the constrained optimization problems we consider here,\nbecause gradient descent methods require the projection at each iteration which involves all vari-\nables and thus ruins the parallelism. In other recent related work based on ADMM, [23] introduce\ngraph projection block splitting (GPBS) to split data into blocks so that examples and features can\nbe distributed among multiple cores. Our framework uses a more general blocking scheme (block\ncyclic distribution), which provides more options in choosing the optimal block size to improve the\nef\ufb01ciency in the use of memory hierarchies and minimize cache misses [16, 15, 5]. 
ADMM has also been used to solve constrained optimization in a distributed framework [9] for graphical model inference, but that work considers local constraints, in contrast to the global constraints in our framework.\nNotation: A matrix is denoted by a bold face upper case letter, e.g., A. An element of a matrix is denoted by an upper case letter with row index i and column index j, e.g., Aij is the ij-th element of A. A block of a matrix is denoted by a bold face lower case letter indexed by ij, e.g., Aij. \u20d7Aij represents the collection of blocks of matrix A on the ij-th core (see block cyclic distribution in Section 4). A' refers to the transpose of A. Matrix norms used are all elementwise norms, e.g., ||A||_1 = \u2211_{i=1}^p \u2211_{j=1}^n |Aij|, ||A||_2^2 = \u2211_{i=1}^p \u2211_{j=1}^n Aij^2, and ||A||\u221e = max_{1\u2264i\u2264p, 1\u2264j\u2264n} |Aij|. The matrix inner product is defined elementwise, e.g., \u27e8A, B\u27e9 = \u2211_{i=1}^p \u2211_{j=1}^n Aij Bij. X \u2208 \u211d^{p\u00d7k} denotes k (1 \u2264 k \u2264 p) columns of the precision matrix \u02c6\u2126, and E \u2208 \u211d^{p\u00d7k} denotes the same k columns of the identity matrix I \u2208 \u211d^{p\u00d7p}. Let \u03bbmax(C) be the largest eigenvalue of the covariance matrix C.\n\nAlgorithm 1 Column Block ADMM for CLIME\n1: Input: C, \u03bb, \u03c1, \u03b7\n2: Output: X\n3: Initialization: X0 = Z0 = Y0 = V0 = \u02c6V0 = 0\n4: for t = 0 to T - 1 do\n5: X-update: Xt+1 = soft(Xt - Vt, 1/\u03b7), where soft(X, \u03b3)ij = Xij - \u03b3 if Xij > \u03b3; Xij + \u03b3 if Xij < -\u03b3; 0 otherwise (2)\n6: Mat-Mul: Ut+1 = CXt+1 (sparse) or Ut+1 = A(A'Xt+1) (low rank)\n7: Z-update: Zt+1 = box(Ut+1 + Yt, E, \u03bb), where box(X, E, \u03bb)ij = Eij + \u03bb if Xij - Eij > \u03bb; Xij if |Xij - Eij| \u2264 \u03bb; Eij - \u03bb if Xij - Eij < -\u03bb\n8: Y-update: Yt+1 = Yt + Ut+1 - Zt+1\n9: Mat-Mul: \u02c6Vt+1 = CYt+1 (sparse) or \u02c6Vt+1 = A(A'Yt+1) (low rank)\n10: V-update: Vt+1 = (\u03c1/\u03b7)(2\u02c6Vt+1 - \u02c6Vt)\n11: end for\n\n2 Column Block ADMM for CLIME\nIn this section, we propose an algorithm that estimates the precision matrix in column blocks rather than column-by-column. Assuming a column block contains k (1 \u2264 k \u2264 p) columns, sparse precision matrix estimation amounts to solving \u2308p/k\u2309 independent linear programs. 
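The updates of Algorithm 1 can be sketched end-to-end in NumPy (our rendering, using the low-rank form C = AA' and the step-size condition \u03b7 \u2265 \u03c1\u03bbmax^2(C) from the convergence theorems below; helper names are ours):

```python
import numpy as np

def clime_admm(A, E, lam, rho=1.0, T=500):
    """Sketch of Algorithm 1 (our NumPy rendering, low-rank form C = A A').

    A   : p x n centered sample matrix, so C = A A' is never materialized.
    E   : p x k column block of the identity.
    lam : CLIME tuning parameter; rho, T : ADMM penalty and iteration count.
    eta is set from the condition eta >= rho * lambda_max(C)^2.
    """
    p, k = E.shape
    smax = np.linalg.norm(A, 2)            # sigma_max(A); lambda_max(C) = smax^2
    eta = rho * smax ** 4 + 1e-8

    def soft(M, g):                        # soft-thresholding (step 5)
        return np.sign(M) * np.maximum(np.abs(M) - g, 0.0)

    def box(M, g):                         # projection onto ||Z - E||_inf <= g (step 7)
        return np.clip(M, E - g, E + g)

    X = np.zeros((p, k)); Y = np.zeros((p, k))
    V = np.zeros((p, k)); Vhat = np.zeros((p, k))
    for _ in range(T):
        X = soft(X - V, 1.0 / eta)             # step 5: X-update
        U = A @ (A.T @ X)                      # step 6: U = C X via A (A' X)
        Z = box(U + Y, lam)                    # step 7: Z-update
        Y = Y + U - Z                          # step 8: dual update
        Vhat_new = A @ (A.T @ Y)               # step 9: Vhat = C Y
        V = (rho / eta) * (2 * Vhat_new - Vhat)  # step 10: V-update
        Vhat = Vhat_new
    return X
```

Every step is either elementwise or a pair of skinny matrix products, which is what makes the algorithm parallelize cleanly.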
Denoting by X \u2208 \u211d^{p\u00d7k} the corresponding k columns of \u02c6\u2126, (1) can be written as\n\nmin ||X||_1 s.t. ||CX - E||\u221e \u2264 \u03bb ,\n\nwhich can be rewritten in the following equality-constrained form:\n\nmin ||X||_1 s.t. ||Z - E||\u221e \u2264 \u03bb, CX = Z . (3)\n\nThrough the splitting variable Z \u2208 \u211d^{p\u00d7k}, the infinity-norm constraint becomes a box constraint and is separated from the l1-norm objective. We use ADMM to solve (3). The augmented Lagrangian of (3) is\n\nL_\u03c1 = ||X||_1 + \u03c1\u27e8Y, CX - Z\u27e9 + (\u03c1/2)||CX - Z||_2^2 , (4)\n\nwhere Y \u2208 \u211d^{p\u00d7k} is a scaled dual variable and \u03c1 > 0. ADMM yields the following iterates [2]:\n\nXt+1 = argmin_X ||X||_1 + (\u03c1/2)||CX - Zt + Yt||_2^2 , (5)\nZt+1 = argmin_{||Z-E||\u221e\u2264\u03bb} (\u03c1/2)||CXt+1 - Z + Yt||_2^2 , (6)\nYt+1 = Yt + CXt+1 - Zt+1 . (7)\n\nAs a Lasso problem, (5) can be solved using existing Lasso algorithms, but that would lead to a double-loop algorithm. (5) does not have a closed-form solution since C in the quadratic penalty term couples the entries of X. We decouple X by linearizing the quadratic penalty term and adding a proximal term as follows:\n\nXt+1 = argmin_X ||X||_1 + \u03b7\u27e8Vt, X\u27e9 + (\u03b7/2)||X - Xt||_2^2 , (8)\n\nwhere Vt = (\u03c1/\u03b7)C(Yt + CXt - Zt) and \u03b7 > 0. (8) is usually called an inexact ADMM update. Using (7), Vt = (\u03c1/\u03b7)C(2Yt - Yt-1). Letting \u02c6Vt = CYt, we have Vt = (\u03c1/\u03b7)(2\u02c6Vt - \u02c6Vt-1). (8) has the following closed-form solution:\n\nXt+1 = soft(Xt - Vt, 1/\u03b7) , (9)\n\nwhere soft denotes soft-thresholding and is defined in Step 5 of Algorithm 1.\nLet Ut+1 = CXt+1. (6) is a box-constrained quadratic program which has the following closed-form solution:\n\nZt+1 = box(Ut+1 + Yt, E, \u03bb) , (10)\n\nwhere box denotes the projection onto the infinity-norm constraint ||Z - E||\u221e \u2264 \u03bb and is defined in Step 7 of Algorithm 1. In particular, if ||Ut+1 + Yt - E||\u221e \u2264 \u03bb, then Zt+1 = Ut+1 + Yt and thus Yt+1 = Yt + Ut+1 - Zt+1 = 0.\nThe ADMM algorithm for CLIME is summarized in Algorithm 1. In Algorithm 1, while steps 5, 7, 8 and 10 amount to elementwise operations which cost O(pk), steps 6 and 9 involve matrix multiplication, which is the most computationally intensive part and costs O(p^2 k). The memory requirement includes O(pn) for A and O(pk) for each of the other six variables.\nAs the following results show, Algorithm 1 has an O(1/T) convergence rate for both the objective function and the residuals of the optimality conditions. The proof technique is similar to [26]; [12] shows a result similar to Theorem 2 but uses a different proof technique. For proofs, please see Appendix A in the supplement.\nTheorem 1 Let {Xt, Zt, Yt} be generated by Algorithm 1 and \u00afXT = (1/T)\u2211_{t=1}^T Xt. Assume X0 = Z0 = Y0 = 0 and \u03b7 \u2265 \u03c1\u03bbmax^2(C). For any CX = Z, we have\n\n||\u00afXT||_1 - ||X||_1 \u2264 \u03b7||X||_2^2 / (2T) . (11)\n\nTheorem 2 Let {Xt, Zt, Yt} be generated by Algorithm 1 and {X*, Z*, Y*} be a KKT point for the Lagrangian of (3). 
Assume X0 = Z0 = Y0 = 0 and \u03b7 \u2265 \u03c1\u03bbmax^2(C). We have\n\n||CXT - ZT||_2^2 + ||ZT - ZT-1||_2^2 + ||XT - XT-1||^2_{(\u03b7/\u03c1)I - C^2} \u2264 (||Y*||_2^2 + (\u03b7/\u03c1)||X*||_2^2) / T . (12)\n\n3 Leveraging Sparse, Low-Rank Structure\nIn this section, we consider a few possible directions that can further leverage the underlying structure of the problem; specifically, sparse and low-rank structure.\n\n3.1 Sparse Structure\nAs we detail here, there can be sparsity in the intermediate iterates, or in the sample covariance matrix itself (or a perturbed version thereof), which can be exploited to make our CLIME-ADMM variant more efficient.\nIterate Sparsity: As the iterations progress, the soft-thresholding operation will yield a sparse Xt+1, which can help speed up step 6: Ut+1 = CXt+1, via sparse matrix multiplication. Further, the box-thresholding operation will yield a sparse Yt+1. In the ideal case, if ||Ut+1 + Yt - E||\u221e \u2264 \u03bb in step 7, then Zt+1 = Ut+1 + Yt and thus Yt+1 = Yt + Ut+1 - Zt+1 = 0. More generally, Yt+1 will become sparse as the iterations proceed, which can help speed up step 9: \u02c6Vt+1 = CYt+1.\nSample Covariance Sparsity: We show that one can "perturb" the sample covariance to obtain a sparse and coarsened matrix, solve CLIME with this perturbed matrix, and yet retain strong statistical guarantees. The statistical guarantees for CLIME [3], including convergence in spectral, matrix L1, and Frobenius norms, only require from the sample covariance matrix C a deviation bound of the form ||C - \u03a30||\u221e \u2264 c\u221a(log p / n), for some constant c. 
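This deviation-bound slack suggests a simple truncation: entries of C at or below the c\u00b7\u221a(log p / n) level can be zeroed without affecting the bound (a sketch; the constant c and the function name are our choices):

```python
import numpy as np

def sparsify_covariance(C, n, c=1.0):
    """Zero out entries of C with |C_ij| <= c * sqrt(log p / n).

    Setting such entries to zero is a valid perturbation Delta (each
    |Delta_ij| is within the deviation level), so the perturbed matrix
    keeps the CLIME deviation bound while being sparse, speeding up the
    matrix products in steps 6 and 9. The constant c is an assumption.
    """
    p = C.shape[0]
    tau = c * np.sqrt(np.log(p) / n)
    return np.where(np.abs(C) <= tau, 0.0, C)
```

With p = 4 and n = 100 the threshold is about 0.118, so entries like 0.05 or 0.1 vanish while larger correlations survive.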
Accordingly, if we perturb the matrix C with a perturbation matrix \u2206 so that the perturbed matrix (C + \u2206) continues to satisfy the deviation bound, the statistical guarantees for CLIME would hold even if we used the perturbed matrix (C + \u2206). The following theorem (for details, please see Appendix B in the supplement) illustrates some perturbations \u2206 that satisfy this property:\n\nTheorem 3 Let the original random variables Ri be sub-Gaussian, with sample covariance C. Let \u2206 be a random perturbation matrix, where \u2206ij are independent sub-exponential random variables. Then, for positive constants c1, c2, c3, P(||C + \u2206 - \u03a30||\u221e \u2265 c1\u221a(log p / n)) \u2264 c2 p^{-c3}.\n\nAs a special case, one can thus perturb elements Cij with suitable constants \u2206ij satisfying |\u2206ij| \u2264 c\u221a(log p / n), so that the perturbed matrix is sparse, i.e., if |Cij| \u2264 c\u221a(log p / n), then it can be safely truncated to 0. Thus, in practice, even if the sample covariance matrix is only close to a sparse matrix [21, 13], or close to being block diagonal [21, 13], the complexity of matrix multiplication in steps 6 and 9 can be significantly reduced via the above perturbations.\n\n3.2 Low Rank Structure\nAlthough one can use the sparse structure of the matrices participating in the matrix multiplication to accelerate the algorithm, the implementation requires substantial work, since the dynamic sparsity of X and Y is unknown upfront and static sparsity of the sample covariance matrix may not exist. Since the method operates in a low-sample setting, we can alternatively use the low rank of the sample covariance matrix to reduce the complexity of matrix multiplication. Since C = AA' and p \u226b n, CX = A(A'X), and thus the computational complexity of matrix multiplication reduces from O(p^2 k) to O(npk), which can achieve significant speedup for small n. 
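In code, this associativity trick is one line; the two "skinny" products never touch a p x p matrix (sizes here are illustrative):

```python
import numpy as np

# With C = A A' and p >> n, never materialize the p x p matrix C:
# computing C X as A (A' X) costs O(npk) instead of O(p^2 k).
rng = np.random.default_rng(0)
p, n, k = 2000, 50, 20
A = rng.standard_normal((p, n)) / np.sqrt(n)   # centered sample matrix
X = rng.standard_normal((p, k))                # a column block

U_lowrank = A @ (A.T @ X)                      # two skinny products, O(npk)
U_direct = (A @ A.T) @ X                       # what we avoid at scale, O(p^2 k)
```

Both orderings give the same result; only the cost differs, and the gap grows linearly in p/n.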
We use such low-rank multiplications for the experiments in Section 5.\n\n4 Scalable Parallel Computation Framework\n\nIn this section, we elaborate on scalable frameworks for CLIME-ADMM in both shared-memory and distributed-memory architectures.\nIn a shared-memory architecture (e.g., a single machine), the data A is loaded into memory and shared by q cores, as shown in Figure 1(a). Assume the p \u00d7 p precision matrix \u02c6\u2126 is evenly divided into l = p/k (\u2265 q) column blocks, e.g., X1, \u00b7\u00b7\u00b7, Xq, \u00b7\u00b7\u00b7, Xl, so that each column block contains k columns. The column blocks are assigned to the q cores cyclically, meaning the j-th column block is assigned to the mod(j, q)-th core. The q cores can solve q column blocks in parallel without communication or synchronization, which can be implemented simply via multithreading. Meanwhile, another q column blocks wait in their respective queues. Figure 1(a) gives an example of solving 8 column blocks on 4 cores in a shared-memory environment: while the 4 cores are solving the first 4 column blocks, the next 4 column blocks wait in queues (red arrows).\nAlthough the shared-memory framework is free from communication and synchronization, its limited resources prevent it from scaling up to datasets with millions of dimensions, which cannot be loaded into the memory of a single machine or solved by tens of cores in a reasonable time. As more memory and computing power are needed for high dimensional datasets, we implement a framework for CLIME-ADMM in a distributed-memory architecture, which automatically distributes data among machines, parallelizes computation, and manages communication and synchronization among machines, as shown in Figure 1(b). Assume q processes are formed into an r \u00d7 c process grid and the p \u00d7 p precision matrix \u02c6\u2126 is evenly divided into l = p/k (\u2265 q) column blocks, e.g., Xj, 1 \u2264 j \u2264 l. 
We solve one column block Xj at a time in the process grid. Assume the data matrix A has been evenly distributed over the process grid and \u20d7Aij is the data on the ij-th core, i.e., A is the collection of the \u20d7Aij under a mapping scheme, which we will discuss later. Figure 1(b) illustrates the 2 \u00d7 2 process grid computing the first column block X1 while the second column block X2 waits in queues (red lines), assuming X1, X2 are distributed over the process grid in the same way as A, with \u20d7X1ij the block of X1 assigned to the ij-th core.\nA typical issue in parallel computation is load imbalance, which is mainly caused by computational disparity among cores and leads to unsatisfactory speedups. Since each step of CLIME-ADMM consists of basic operations like matrix multiplication, the distribution of sub-matrices over processes has a major impact on load balance and scalability. The following discussion focuses on the matrix multiplication in step 6 of Algorithm 1; the other steps can be easily incorporated into the framework. The matrix multiplication U = A(A'X1) can be decomposed into two steps, i.e., W = A'X1 and U = AW, where A \u2208 \u211d^{p\u00d7n}, X1 \u2208 \u211d^{p\u00d7k}, W \u2208 \u211d^{n\u00d7k} and U \u2208 \u211d^{p\u00d7k}. Dividing the matrices A, X evenly into r \u00d7 c large consecutive blocks as in [23] will lead to load imbalance. First, since the sparse structure of X changes over time (Section 3.1), large consecutive blocks may assign dense blocks to some processes and sparse blocks to others. Second, some processes will hold no blocks after the multiplication with large blocks, since W is small compared to A, X; e.g., p could be millions while n, k are hundreds. Third, large blocks may not fit in the cache, leading to cache misses. Therefore, we use block cyclic data distribution, which uses small nonconsecutive blocks and thus largely achieves load balance and scalability. 
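The cyclic assignment can be made concrete with a tiny helper (our sketch of the ScaLAPACK-style layout; the function name is ours):

```python
def block_cyclic_owner(i, j, pb, nb, r, c):
    """Owner (process-grid coordinates) of matrix entry (i, j) under a block
    cyclic distribution with block size pb x nb on an r x c process grid.

    Entry (i, j) lives in block (i // pb, j // nb); blocks are dealt out to
    the grid cyclically, so small blocks spread dense and sparse regions of
    the matrix evenly across processes.
    """
    return ((i // pb) % r, (j // nb) % c)
```

For example, with 10 x 10 blocks on a 2 x 2 grid, row-block 0 and row-block 2 of a matrix land on the same process row, which is exactly the wrap-around that balances the load.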
A matrix is first divided into consecutive blocks of size pb \u00d7 nb. Then the blocks are distributed over the process grid cyclically.\n\nFigure 1: CLIME-ADMM on shared-memory and distributed-memory architectures. (a) Shared-Memory. (b) Distributed-Memory. (c) Block Cyclic.\n\nFigure 1(c) illustrates how to distribute the matrix to a 2 \u00d7 2 process grid. A is divided into 3 \u00d7 2 consecutive blocks, where each block is of size pb \u00d7 nb. Blocks of the same color will be assigned to the same process. Green blocks will be assigned to the upper left process, i.e., \u20d7A11 = {a11, a13, a31, a33, a51, a53} in Figure 1(b). The distribution of X1 can be done in a similar way, except that the block size should be pb \u00d7 kb, where sharing pb guarantees that the matrix multiplication A'X1 works. In particular, we denote pb \u00d7 nb \u00d7 kb as the block size for matrix multiplication. To distribute the data in a block cyclic manner, we use a parallel I/O scheme, where processes access the data in parallel and only read/write their assigned blocks.\n5 Experimental Results\nIn this section, we present experimental results to compare CLIME-ADMM with existing algorithms and to show its scalability. In all experiments, we use the low rank property of the sample covariance matrix and do not assume any other special structure. Our algorithm is implemented on a shared-memory architecture using OpenMP (http://openmp.org/wp/) and on a distributed-memory architecture using OpenMPI (http://www.open-mpi.org) and ScaLAPACK [15] (http://www.netlib.org/scalapack/).\n5.1 Comparison with Existing Algorithms\nWe compare CLIME-ADMM with three other methods for estimating the inverse covariance matrix: CLIME, Tiger in the package flare1, and divide-and-conquer QUIC (DC-QUIC) [13]. 
The comparisons are run on an Intel Xeon E5540 2.83GHz CPU with 32GB main memory.\nWe test the efficiency of the above methods on both synthetic and real datasets. For synthetic datasets, we generate the underlying graphs with random nonzero pattern in the same way as in [14]. We control the sparsity of the underlying graph to be 0.05, and generate random graphs of various dimensions. Since each estimator has different parameters to control the sparsity, we set them individually to recover the graph with sparsity 0.05, and compare the time to get the solution. The column block size k for CLIME-ADMM is 100. Figure 2(a) shows that CLIME-ADMM is the most scalable estimator for large graphs. We compare the precision and recall of the different methods on recovering the ground truth graph structure. We run each method using different parameters (which control the sparsity of the solution), and plot the precision and recall for each solution in Figure 2(b). As Tiger is parameter-tuning free and achieves the minimax optimal rate [19], it achieves the best performance in terms of recall. The other three methods have similar performance. CLIME can also be made free of parameter tuning and achieve the optimal minimax rate by solving an additional linear program similar to (1) [4]. We refer the readers to [3, 4, 19] for detailed comparisons between the two models CLIME and Tiger, which is not the focus of this paper.\nWe further test the efficiency of the above algorithms on two real datasets, Leukemia and Climate (see Table 1). Leukemia is gene expression data provided by [10], and the pre-processing was done by [17]. The Climate dataset is the temperature data in year 2001 recorded by NCEP/NCAR Reanalysis data2 and preprocessed by [13]. Since the ground truth for the real datasets is unknown, we test the time taken for each method to recover graphs with 0.1 and 0.01 sparsity. The results are presented in Table 1. 
Although Tiger is faster than CLIME-ADMM on the small dimensional dataset Leukemia, it does not scale to the high dimensional dataset as well as CLIME-ADMM, which is mainly due to the fact that ADMM is not competitive with other methods on small problems but has superior scalability on big datasets [2]. DC-QUIC runs faster than the other methods for small sparsity but dramatically slows down when sparsity increases. DC-QUIC essentially works on a block-diagonal matrix obtained by thresholding the off-diagonal elements of the sample covariance matrix; a small sparsity generally leads to small diagonal blocks, which helps DC-QUIC make a giant leap forward in the computation. A block-diagonal structure in the sample covariance matrix can be easily incorporated into the matrix multiplication in CLIME-ADMM to achieve a sharp computational gain. On a single core, CLIME-ADMM is faster than flare ADMM.\n\n1The interior point method in [3] is written in R and is extremely slow. Therefore, we use flare, which is implemented in C with an R interface. http://cran.r-project.org/web/packages/flare/index.html\n2www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.surface.html\n\nFigure 2: Synthetic datasets. (a) Runtime. (b) Precision and recall.\nFigure 3: Shared-Memory. (a) Speedup S^col_k. (b) Speedup S^core_q.\nFigure 4: Distributed-Memory. (a) Speedup S^col_k. (b) Speedup S^core_q.\n\n
We also show the results of CLIME-ADMM on 8 cores, showing that CLIME-ADMM achieves a linear speedup (more results are given in Section 5.2). Note that Tiger can estimate the sparse precision matrix column-by-column in parallel, while CLIME-ADMM solves CLIME in column-blocks in parallel.\n5.2 Scalability of CLIME-ADMM\nWe evaluate the scalability of CLIME-ADMM in a shared-memory and a distributed-memory architecture in terms of two kinds of speedups. The first speedup is defined as the time on 1 core T^core_1 over q cores T^core_q, i.e., S^core_q = T^core_1 / T^core_q. The second speedup comes from the use of column blocks. Assume the total time for solving CLIME column-by-column (k = 1) is T^col_1, which is considered the baseline. The speedup of solving CLIME in column blocks of size k over a single column is defined as S^col_k = T^col_1 / T^col_k. The experiments are done on synthetic data generated in the same way as in Section 5.1. The number of samples is fixed to n = 200.\nShared-memory We estimate a precision matrix with p = 10^4 dimensions on a server with 20 cores and 64G memory. We use OpenMP to parallelize column blocks. We run the algorithm on different numbers of cores q = 1, 5, 10, 20, and with different column block sizes k. The speedup S^col_k is plotted in Figure 3(a), which shows the results for three different numbers of cores. When k \u2264 20, the speedups keep increasing with the number of columns k in each block. For k \u2265 20, the speedups are maintained on 1 core and 5 cores, but decrease on 10 and 20 cores. The total number of columns in shared memory is k \u00d7 q. For a fixed k, more columns are involved in the computation when more cores are used, leading to more memory consumption and competition for the shared cache. The speedup S^core_q is plotted in Figure 3(b), where T^core_1 is the time on a single core. The ideal linear speedups are achieved on 5 cores for all block sizes k. 
On 10\ncores, while small and medium column block sizes can maintain the ideal linear speedups, the large\ncolumn block sizes fail to scale linearly. The failure to achieve a linear speedup propagate to small\nand medium column block sizes on 20 cores, although their speedups are larger than large column\nblock size. As more and more column blocks are participating in the computation, the speed-ups\ndecrease possibly because of the competition for resources (e.g., L2 cache) in the shared-memory\nenvironment.\n\nis plotted in Figure 3(b), where T core\n\nk = T col\n\n1 /T col\n\n1\n\nq\n\n7\n\n\fTable 1: Comparison of runtime (sec) on real datasets.\nsparsity\n\nCLIME-ADMM\n\nDC-QUIC\n\nDataset\nLeukemia\n(1255 \u00d7 72)\nClimate\n(10512 \u00d7 1464)\n\n0.1\n0.01\n0.1\n0.01\n\n1 core\n48.64\n44.98\n\n8 cores\n6.27\n5.83\n\n4.76 hours\n4.46 hours\n\n0.6 hours\n0.56 hours\n\n93.88\n21.59\n\nTiger\n34.56\n17.10\n10.51 hours > 1 day\n2.12 hours > 1 day\n\n\ufb02are CLIME\n\n142.5\n87.60\n> 1 day\n> 1 day\n\nTable 2: Effect (runtime (sec)) of using different number of cores in a node with p = 106.\nUsing one core per node is the most ef\ufb01cient as there is no resource sharing with other cores.\n\nnode \u00d7core\n\n100\u00d71\n25\u00d7 4\n200\u00d71\n50\u00d74\n\nk = 1\n0.56\n1.02\n0.37\n0.74\n\nk = 5\n1.26\n2.40\n0.68\n1.44\n\nk = 10\n2.59\n3.42\n1.12\n2.33\n\nk = 50\n6.98\n8.25\n3.48\n4.49\n\nk = 100\n13.97\n16.44\n6.76\n8.33\n\nk = 500\n62.35\n84.08\n33.95\n48.20\n\nk = 1000\n136.96\n180.89\n70.59\n103.87\n\nq\n\n1\n\nDistributed-memory We estimate a precision matrix with one million dimensions (p = 106), which\ncontains one trillion parameters (p2 = 1012). The experiments are run on a cluster with 400 com-\nputing nodes. We use 1 core per node to avoid the competition for the resources as we observed in\n2 \u00d7 2 since p (cid:29) n. The block size\nthe shared-memory case. 
For q cores, we use a (q/2) × 2 process grid, since p ≫ n. The block size pb × nb × kb for matrix multiplication is 10 × 10 × 1 for k ≤ 10 and 10 × 10 × 10 for k > 10. Since the column-block CLIME problems are completely independent, we report the speedups for solving a single column block. The speedup S^col_k is plotted in Figure 4(a); the speedups are larger and more stable than in the shared-memory environment. The speedup keeps increasing with the column block size until it levels off at a certain value. For any column block size, the speedup also increases as the number of cores increases. The speedup S^core_q is plotted in Figure 4(b), where the baseline T^core_1 is the time on 50 cores. A single column (k = 1) fails to achieve linear speedups when hundreds of cores are used. However, with a column block of size k > 1, the ideal linear speedups are achieved with an increasing number of cores. Note that due to the distributed memory, the larger column block sizes also scale linearly, unlike in the shared-memory setting, where the speedups were limited by resource sharing. As we have seen, the best k depends on the size of the process grid, the block size in matrix multiplication, the cache size, and probably the sparsity pattern of the matrices. In Table 2, we compare the performance of using 1 core per node to that of using 4 cores per node, which mixes the effects of the shared-memory and distributed-memory architectures. For small column block sizes (k = 1, 5), using multiple cores in a node is almost two times slower than using a single core per node. For the other column block sizes, it is still about 30% slower. Finally, we ran CLIME-ADMM on 400 cores with one core per node and block size k = 500, and the entire computation took about 11 hours.

6 Conclusions

In this paper, we presented a large scale distributed framework for the estimation of sparse precision matrices using CLIME. Our framework can scale to millions of dimensions and run on hundreds of machines.
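To make the style of computation concrete, the following is a minimal serial sketch of a linearized (inexact) ADMM for one CLIME column block, min_X ||X||_1 s.t. ||C X − E||_∞ ≤ λ: each iteration reduces to matrix multiplications by C plus elementwise soft-thresholding and a box projection. The penalty ρ, the majorization constant η, and the fixed iteration count below are illustrative choices, not the settings or stopping rules of our distributed implementation.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def clime_admm(C, E, lam, rho=1.0, iters=500):
    """Linearized ADMM sketch for min ||X||_1 s.t. ||C X - E||_inf <= lam.

    Splitting: Z = C X, with Z constrained to the box |Z - E| <= lam.
    Y is the scaled dual variable. The X-update linearizes the quadratic
    term, so every step is a matrix multiply plus elementwise operations.
    """
    p, k = E.shape
    eta = np.linalg.norm(C, 2) ** 2          # majorization constant >= ||C||_2^2
    X = np.zeros((p, k))
    Z = np.zeros((p, k))
    Y = np.zeros((p, k))
    for _ in range(iters):
        # X-step: soft-thresholding after a linearized gradient step
        G = C.T @ (C @ X - Z + Y)            # gradient of (1/2)||C X - Z + Y||_F^2
        X = soft_threshold(X - G / eta, 1.0 / (rho * eta))
        # Z-step: project C X + Y onto the box {Z : |Z - E| <= lam}
        Z = np.clip(C @ X + Y, E - lam, E + lam)
        # dual update
        Y = Y + (C @ X - Z)
    return X
```

In the distributed setting, only the multiplications by C involve communication across the process grid; the soft-thresholding, projection, and dual updates are elementwise and embarrassingly parallel across the column block.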
The framework is based on inexact ADMM, which decomposes the constrained optimization problem into elementary matrix multiplications and elementwise operations. Convergence rates for both the objective and the optimality conditions are established. The proposed framework solves CLIME in column blocks and uses block cyclic distribution to achieve load balancing. We evaluated our algorithm on both shared-memory and distributed-memory architectures. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. The presented framework can be useful for a variety of other large scale constrained optimization problems, which we will explore in future work.

Acknowledgment

H. W. and A. B. acknowledge the support of NSF via IIS-0953274, IIS-1029711, IIS-0916750, IIS-0812183, and the technical support from the University of Minnesota Supercomputing Institute. H. W. acknowledges the support of a DDF (2013-2014) from the University of Minnesota. C.-J. H. and I. S. D. were supported by NSF grants CCF-1320746 and CCF-1117055. C.-J. H. also acknowledges the support of an IBM PhD fellowship. P. R. acknowledges the support of NSF via IIS-1149803, DMS-1264033 and ARO via W911NF-12-1-0390.