{"title": "BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 3165, "page_last": 3173, "abstract": "The l1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this paper, we develop an algorithm BigQUIC, which can solve 1 million dimensional l1-regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications,  we are able to theoretically analyze our procedure and show that BigQUIC can achieve super-linear or even quadratic convergence rates.", "full_text": "BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables\n\nCho-Jui Hsieh, M\u00e1ty\u00e1s A. Sustik, Inderjit S. Dhillon, Pradeep Ravikumar\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\n{cjhsieh,sustik,inderjit,pradeepr}@cs.utexas.edu\n\nRussell A. Poldrack\n\nDepartment of Psychology and Neurobiology\n\nUniversity of Texas at Austin\n\npoldrack@mail.utexas.edu\n\nAbstract\n\nThe $\\ell_1$-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. 
However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this paper, we develop an algorithm BIGQUIC, which can solve 1 million dimensional $\\ell_1$-regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that BIGQUIC can achieve super-linear or even quadratic convergence rates.\n\n1 Introduction\n\nLet $\\{y_1, y_2, \\ldots, y_n\\}$ be $n$ samples drawn from a $p$-dimensional Gaussian distribution $\\mathcal{N}(\\mu, \\Sigma)$, also known as a Gaussian Markov Random Field (GMRF). An important problem is that of recovering the covariance matrix (or its inverse) of this distribution, given the $n$ samples, in a high-dimensional regime where $n \\ll p$. A popular approach involves leveraging the structure of sparsity in the inverse covariance matrix, and solving the following $\\ell_1$-regularized maximum likelihood problem:\n\n$$\\arg\\min_{\\Theta \\succ 0} \\{-\\log\\det\\Theta + \\mathrm{tr}(S\\Theta) + \\lambda\\|\\Theta\\|_1\\} = \\arg\\min_{\\Theta \\succ 0} f(\\Theta), \\qquad (1)$$\n\nwhere $S = \\frac{1}{n}\\sum_{i=1}^{n}(y_i - \\tilde{\\mu})(y_i - \\tilde{\\mu})^T$ is the sample covariance matrix and $\\tilde{\\mu} = \\frac{1}{n}\\sum_{i=1}^{n} y_i$ is the sample mean. While the non-smooth log-determinant program in (1) is usually considered a difficult optimization problem to solve, due in part to its importance, there has been a long line of recent work on algorithms to solve (1): see [7, 6, 3, 16, 17, 18, 15, 11] and references therein. The state-of-the-art seems to be a second order method QUIC [9] that has been shown to achieve super-linear convergence rates. Complementary techniques such as exact covariance thresholding [13, 19], and the divide and conquer approach of [8], have also been proposed to speed up the solvers. However, as noted in [8], the above methods do not scale to problems with more than 20,000 variables, and typically require several hours even for smaller dimensional problems involving ten thousand variables. There has been some interest in statistical estimators other than (1) that are more amenable to optimization: including solving node-wise Lasso regression problems [14] and the separable linear program based CLIME estimator [2]. However the caveat with these estimators is that they are not guaranteed to yield a positive-definite covariance matrix, and typically yield less accurate parameters.\n\nWhat if we want to solve the M-estimator in (1) with a million variables? Note that the number of parameters in (1) is quadratic in the number of variables, so that for a million variables, we would have a trillion parameters. There has been considerable recent interest in such \u201cBig Data\u201d problems involving large-scale optimization: these however are either targeted to \u201cbig-n\u201d problems with a lot of samples, unlike the constraint of \u201cbig-p\u201d with a large number of variables in our problem, or are based on large-scale distributed and parallel frameworks, which require a cluster of processors, as well as software infrastructure to run the programs over such clusters. 
At least one caveat with such large-scale distributed frameworks is that they would be less amenable to exploratory data analysis by \u201clay users\u201d of such GMRFs. Here we ask the following ambitious but simple question: can we solve the M-estimator in (1) with a million variables using a single machine with bounded memory? This might not seem like a viable task at all in general, but note that the optimization problem in (1) arises from a very structured statistical estimation problem: can we leverage the underlying structure to be able to solve such an ultra-large-scale problem?\n\nIn this paper, we propose a new solver, BIGQUIC, to solve the $\\ell_1$-regularized Gaussian MLE problem with extremely high dimensional data. Our method can solve one million dimensional problems with 1000 billion variables using a single machine with 32 cores and 32G memory. Our proposed method is based on the state-of-the-art framework of QUIC [9, 8]. The key bottleneck with QUIC stems from the memory required to store the gradient $W = X^{-1}$ of the iterates $X$, which is a dense $p \\times p$ matrix, and the computation of the log-determinant function of a $p \\times p$ matrix. A starting point to reduce the memory footprint is to use sparse representations for the iterates $X$ and compute the elements of the empirical covariance matrix $S$ on demand from the sample data points. In addition we also have to avoid the storage of the dense matrix $X^{-1}$ and perform intermediate computations involving functions of such dense matrices on demand. These naive approaches to reduce the memory however would considerably increase the computational complexity, among other caveats, which would make the algorithm highly impractical.\n\nTo address this, we present three key innovations. 
Our first is to carry out the coordinate descent computations in a blockwise manner, and by selecting the blocks very carefully using an automated clustering scheme, we not only leverage sparsity of the iterates, but help cache computations suitably. Secondly, we reduce the computation of the log-determinant function to linear equation solving using the Schur complement, which also exploits the symmetry of the matrices in question. Lastly, since the Hessian computation is a key bottleneck in the second-order method, we compute it inexactly. We show that even with these modifications and inexact computations, we can still guarantee not only convergence of our overall procedure, but can easily control the degree of approximation of Hessian to achieve super-linear or even quadratic convergence rates. In spite of our low-memory footprint, these innovations allow us to beat the state of the art DC-QUIC algorithm (which has no memory limits) in computational complexity even on medium-size problems of a few thousand variables. Finally, we show how to parallelize our method in a multicore shared memory system.\n\nThe paper is organized as follows. In Section 2, we briefly review the QUIC algorithm and outline the difficulties of scaling QUIC to million dimensional data. Our algorithm is proposed in Section 3. We theoretically analyze our algorithm in Section 4, and present experimental results in Section 5.\n\n2 Difficulties in scaling QUIC to million dimensional data\n\nOur proposed algorithm is based on the framework of QUIC [9], which is a state of the art procedure for solving (1), based on a second-order optimization method. 
We present a brief review of the algorithm, and then explain the key bottlenecks that arise when scaling it to million dimensions.\n\nSince the objective function of (1) is non-smooth, we can separate the smooth and non-smooth part by $f(X) = g(X) + h(X)$, where $g(X) = -\\log\\det X + \\mathrm{tr}(SX)$ and $h(X) = \\lambda\\|X\\|_1$.\n\nQUIC is a second-order method that iteratively solves for a generalized Newton direction using coordinate descent; and then descends using this generalized Newton direction and line-search. To leverage the sparsity of the solution, the variables are partitioned into $S_{fixed}$ and $S_{free}$ sets:\n\n$$X_{ij} \\in S_{fixed} \\text{ if } |\\nabla_{ij} g(X)| \\le \\lambda_{ij} \\text{ and } X_{ij} = 0, \\qquad X_{ij} \\in S_{free} \\text{ otherwise.} \\qquad (2)$$\n\nOnly the free set $S_{free}$ is updated at each Newton iteration, reducing the number of variables to be updated to $m = |S_{free}|$, which is comparable to $\\|X^*\\|_0$, the sparsity of the solution.\n\nDifficulty in Approximating the Newton Direction. Let us first consider the generalized Newton direction for (1):\n\n$$D_t = \\arg\\min_{D} \\{\\bar{g}_{X_t}(D) + h(X_t + D)\\}, \\qquad (3)$$\n\nwhere\n\n$$\\bar{g}_{X_t}(D) = g(X_t) + \\mathrm{tr}(\\nabla g(X_t)^T D) + \\frac{1}{2}\\,\\mathrm{vec}(D)^T \\nabla^2 g(X_t)\\,\\mathrm{vec}(D). \\qquad (4)$$\n\nIn our problem $\\nabla g(X_t) = S - X_t^{-1}$ and $\\nabla^2 g(X_t) = X_t^{-1} \\otimes X_t^{-1}$, where $\\otimes$ denotes the Kronecker product of two matrices. When $X_t$ is sparse, the Newton direction computation (3) can be solved efficiently by coordinate descent [9]. The obvious implementation calls for the computation and storage of $W_t = X_t^{-1}$; using this to compute $a = W_{ij}^2 + W_{ii}W_{jj}$, $b = S_{ij} - W_{ij} + w_i^T D w_j$, and $c = X_{ij} + D_{ij}$. Armed with these quantities, the coordinate descent update for variable $D_{ij}$ takes the form:\n\n$$D_{ij} \\leftarrow D_{ij} - c + \\mathcal{S}(c - b/a, \\lambda_{ij}/a), \\qquad (5)$$\n\nwhere $\\mathcal{S}(z, r) = \\mathrm{sign}(z)\\max\\{|z| - r, 0\\}$ is the soft-thresholding function.\n\nThe key computational bottleneck here is in computing the terms $w_i^T D w_j$, which take $O(p^2)$ time when implemented naively. To address this, [9] proposed to store and maintain $U = DW$, which reduced the cost to $O(p)$ flops per update. However, this is not a strategy we can use when dealing with very large data sets: storing the $p$ by $p$ dense matrices $U$ and $W$ in memory would be prohibitive. The straightforward approach is to compute (and recompute when necessary) the elements of $W$ on demand, resulting in $O(p^2)$ time complexity.\n\nOur key innovation to address this is a novel block coordinate descent scheme, detailed in Section 3.1, that also uses clustering to strike a balance between memory use and computational cost while exploiting sparsity. The result is a procedure with comparable wall-time to that of QUIC on mid-sized problems and can scale up to very large problem instances that the original QUIC could not.\n\nDifficulty in the Line Search Procedure. After finding the generalized Newton direction $D_t$, QUIC then descends using this direction after a line-search via Armijo\u2019s rule. Specifically, it selects the largest step size $\\alpha \\in \\{\\beta^0, \\beta^1, \\ldots\\}$ such that $X + \\alpha D_t$ is (a) positive definite, and (b) satisfies the following sufficient decrease condition:\n\n$$f(X + \\alpha D^*) \\le f(X) + \\alpha\\sigma\\delta, \\quad \\delta = \\mathrm{tr}(\\nabla g(X)^T D^*) + \\|X + D^*\\|_1 - \\|X\\|_1. \\qquad (6)$$\n\nThe key computational bottleneck is checking positive definiteness (typically by computing the smallest eigenvalue), and the computation of the determinant of a sparse matrix with dimension that can reach a million. 
As we show in Appendix 6.4, the time and space complexity of classical sparse Cholesky decomposition generally grows quadratically with the dimensionality even when fixing the number of nonzero elements in the matrix, so it is nontrivial to address this problem. Our key innovation, detailed in Section 3.2, is an efficient procedure that checks both conditions (a) and (b) above using Schur complements and sparse linear equation solving. The computation only uses memory proportional to the number of nonzeros in the iterate.\n\nMany other difficulties arise when dealing with large sparse matrices in the sparse inverse covariance problem. We present some of them in Appendix 6.5.\n\n3 Our proposed algorithm\n\nIn this section, we describe our proposed algorithm, BIGQUIC, with the key innovations mentioned in the previous section. We assume that the iterates $X_t$ have $m$ nonzero elements, and that each iterate is stored in memory using a sparse format. We denote the size of the free set by $s$ and observe that it is usually very small and just a constant factor larger than $m^*$, the number of nonzeros in the final solution [9]. Also, the sample covariance matrix is stored in its factor form $S = YY^T$, where $Y$ is the normalized sample matrix. We now discuss a crucial element of BIGQUIC, our novel block coordinate descent scheme for solving each subproblem (3).\n\n3.1 Block Coordinate Descent method\n\nThe most expensive step during the coordinate descent update for $D_{ij}$ is the computation of $w_i^T D w_j$, where $w_i$ is the $i$-th column of $W = X^{-1}$; see (5). It is not possible to compute $W = X^{-1}$ with Cholesky factorization as was done in [9], nor can it be stored in memory. Note that $w_i$ is the solution of the linear system $Xw_i = e_i$. We thus use the conjugate gradient method (CG) to compute $w_i$, leveraging the fact that $X$ is a positive definite matrix. 
This solver requires only matrix vector products, which can be efficiently implemented for the sparse matrix $X$. CG has time complexity $O(mT)$, where $T$ is the number of iterations required to achieve the desired accuracy.\n\nVanilla Coordinate Descent. A single step of coordinate descent requires the solution of two linear systems $Xw_i = e_i$ and $Xw_j = e_j$ which yield the vectors $w_i$, $w_j$, and we can then compute $w_i^T D w_j$. The time complexity for each update would require $O(mT + s)$ operations, and the overall complexity will be $O(msT + s^2)$ for one full sweep through the entire matrix. Even when the matrix is sparse, the quadratic dependence on nonzero elements is expensive.\n\nOur Approach: Block Coordinate Descent with memory cache scheme. In the following we present a block coordinate descent scheme that can accelerate the update procedure by storing and reusing more results of the intermediate computations. The resulting increased memory use and speedup is controlled by the number of blocks employed, that we denote by $k$.\n\nAssume that only some columns of $W$ are stored in memory. In order to update $D_{ij}$, we need both $w_i$ and $w_j$; if either one is not directly available, we have to recompute it by CG and we call this a \u201ccache miss\u201d. A good update sequence can minimize the cache miss rate. While it is hard to find the optimal sequence in general, we successfully applied a block by block update sequence with a careful clustering scheme, where the number of cache misses is sufficiently small.\n\nAssume we pick $k$ such that we can store $p/k$ columns of $W$ ($p^2/k$ elements) in memory. Suppose we are given a partition of $N = \\{1, \\ldots, p\\}$ into $k$ blocks, $S_1, \\ldots, S_k$. We divide matrix $D$ into $k \\times k$ blocks accordingly. Within each block we run $T_{inner}$ sweeps over variables within that block, and in the outer iteration we sweep through all the blocks $T_{outer}$ times. We use the notation $W_{S_q}$ to denote a $p$ by $|S_q|$ matrix containing columns of $W$ that corresponds to the subset $S_q$.\n\nCoordinate descent within a block. To update the variables in the block $(S_z, S_q)$ of $D$, we first compute $W_{S_z}$ and $W_{S_q}$ by CG and store it in memory, meaning that there is no cache miss during the within-block coordinate updates. With $U_{S_q} = DW_{S_q}$ maintained, the update for $D_{ij}$ can be computed by $w_i^T u_j$ when $i \\in S_z$ and $j \\in S_q$. After updating each $D_{ij}$ to $D_{ij} + \\mu$, we can maintain $U_{S_q}$ by\n\n$$U_{it} \\leftarrow U_{it} + \\mu W_{jt}, \\quad U_{jt} \\leftarrow U_{jt} + \\mu W_{it}, \\quad \\forall t \\in S_q.$$\n\nThe above coordinate update computations cost only $O(p/k)$ operations because we only update a subset of the columns. Observe that $U_{rt}$ never changes when $r \\notin \\{S_z \\cup S_q\\}$.\n\nTherefore, we can use the following arrangement to further reduce the time complexity. Before running coordinate descent for the block we compute and store $P_{ij} = (w_i)_{S_{\\bar{z}\\bar{q}}}^T (u_j)_{S_{\\bar{z}\\bar{q}}}$ for all $(i, j)$ in the free set of the current block, where $S_{\\bar{z}\\bar{q}} = \\{i \\mid i \\notin S_z \\text{ and } i \\notin S_q\\}$. The term $w_i^T u_j$ for updating $D_{ij}$ can then be computed by $w_i^T u_j = P_{ij} + w_{S_z}^T u_{S_z} + w_{S_q}^T u_{S_q}$. With this trick, each coordinate descent step within the block only takes $O(p/k)$ time, and we only need to store $U_{S_z,S_q}$, which only requires $O(p^2/k^2)$ memory. Computing $P_{ij}$ takes $O(p)$ time for each $i, j$, so if we update each coordinate $T_{inner}$ times within a block, the time complexity is $O(p + T_{inner}p/k)$ and the amortized cost per coordinate update is only $O(p/T_{inner} + p/k)$. This time complexity suggests that we should run more iterations within each block.\n\nSweeping through all the blocks. To go through all the blocks, each time we select a $z \\in \\{1, \\ldots, k\\}$ and update blocks $(S_z, S_1), \\ldots, (S_z, S_k)$. Since all of them share $\\{w_i \\mid i \\in S_z\\}$, we first compute them and store in memory. When updating an off-diagonal block $(S_z, S_q)$, if the free sets are dense, we need to compute and store $\\{w_i \\mid i \\in S_q\\}$. 
So in total each block of $W$ will be computed $k$ times. The total time complexity becomes $O(kpmT)$, where $m$ is the number of nonzeros in $X$ and $T$ is the number of conjugate gradient iterations. Assuming the number of nonzeros in $X$ is close to the size of the free set ($m \\approx s$), each coordinate update costs $O(kpT)$ flops.\n\nSelecting the blocks using clustering. We now show that a careful selection of the blocks using a clustering scheme can lead to dramatic speedup for block coordinate descent. When updating variables in the block $(S_z, S_q)$, we would need the column $w_j$ only if some variable in $\\{D_{ij} \\mid i \\in S_z\\}$ lies in the free set. Leveraging this key observation, given two partitions $S_z$ and $S_q$, we define the set of boundary nodes as: $B(S_z, S_q) \\equiv \\{j \\mid j \\in S_q \\text{ and } \\exists i \\in S_z \\text{ s.t. } F_{ij} = 1\\}$, where the matrix $F$ is an indicator of the free set.\n\nThe number of columns to be computed in one sweep is then given by $p + \\sum_{z \\ne q} |B(S_z, S_q)|$. Therefore, we would like to find a partition $\\{S_1, \\ldots, S_k\\}$ for which $\\sum_{z \\ne q} |B(S_z, S_q)|$ is minimal. It appears to be hard to find the partitioning that minimizes the number of boundary nodes. However, we note that the number in question is bounded by the number of cross cluster edges: $|B(S_z, S_q)| < \\sum_{i \\in S_z, j \\in S_q} F_{ij}$. This suggests the use of graph clustering algorithms, such as METIS [10] or Graclus [5] which minimize the right hand side. Assuming that the ratio of between-cluster edges to the number of total edges is $r$, we observe a reduced time complexity of $O((p + rm)T)$ when computing elements of $W$, and $r$ is very small in real datasets. In real datasets, when we converge to very sparse solutions, more than 95% of edges are in the diagonal blocks. In case of the fMRI dataset with $p = 228483$, we used 20 blocks, and the total number of boundary nodes was only $|B| = 8697$. Compared to block coordinate descent with random partition, which generally needs to compute $228483 \\times 20$ columns, the clustering resulted in the computation of $228483 + 8697$ columns, thus achieving an almost 20 times speedup. In Appendix 6.6 we also discuss additional benefits of the graph clustering algorithm that results in accelerated convergence.\n\n3.2 Line Search\n\nThe line search step requires an efficient and scalable procedure that computes $\\log\\det(A)$ and checks the positive definiteness of a sparse matrix $A$. We present a procedure that has complexity of at most $O(mpT)$ where $T$ is the number of iterations used by the sparse linear solver. We note that computing $\\log\\det(A)$ for a large sparse matrix $A$ for which we only have a matrix-vector multiplication subroutine available is an interesting subproblem on its own and we expect that numerous other applications may benefit from the approach presented below. The following lemma can be proved by induction on $p$:\n\nLemma 1. If $A = \\begin{pmatrix} a & b^T \\\\ b & C \\end{pmatrix}$ is a partitioning of an arbitrary $p \\times p$ matrix, where $a$ is a scalar and $b$ is a $p - 1$ dimensional vector then $\\det(A) = \\det(C)(a - b^T C^{-1} b)$. Moreover, $A$ is positive definite if and only if $C$ is positive definite and $(a - b^T C^{-1} b) > 0$.\n\nThe above lemma allows us to compute the determinant by reducing it to solving linear systems; and also allows us to check positive-definiteness. Applying Lemma 1 recursively, we get\n\n$$\\log\\det A = \\sum_{i=1}^{p} \\log\\left(A_{ii} - A_{(i+1):p,i}^T A_{(i+1):p,(i+1):p}^{-1} A_{(i+1):p,i}\\right), \\qquad (7)$$\n\nwhere each $A_{i_1:i_2,j_1:j_2}$ denotes a submatrix of $A$ with row indexes $i_1, \\ldots, i_2$ and column indexes $j_1, \\ldots, j_2$. Each $A_{(i+1):p,(i+1):p}^{-1} A_{(i+1):p,i}$ in the above formula can be computed as the solution of a linear system and hence we can avoid the storage of the (dense) inverse matrix. 
By Lemma 1, we can check the positive definiteness by verifying that all the terms inside the logarithms in (7) are positive. Notice that we have to compute (7) in a reverse order ($i = p, \\ldots, 1$) to avoid the case that $A_{(i+1):p,(i+1):p}$ is non positive definite.\n\n3.3 Summary of the algorithm\n\nIn this section we present BIGQUIC as Algorithm 1. The detailed time complexity analysis is presented in Appendix 6.7. In summary, the time needed to compute the columns of $W$ in block coordinate descent, $O((p + |B|)mT T_{outer})$, dominates the time complexity, which underscores the importance of minimizing the number of boundary nodes $|B|$ via our clustering scheme.\n\nAlgorithm 1: BIGQUIC algorithm\n\nInput: Samples $Y$, regularization parameter $\\lambda$, initial iterate $X_0$\nOutput: Sequence $\\{X_t\\}$ that converges to $X^*$.\n\n1 for t = 0, 1, . . . do\n2     Compute $W_t = X_t^{-1}$ column by column, partition the variables into free and fixed sets.\n3     Run graph clustering algorithm based on absolute values on free set.\n4     for sweep = 1, . . . , $T_{outer}$ do\n5         for s = 1, . . . , k do\n6             Compute $W_{S_s}$ by CG.\n7             for q = 1, . . . , k do\n8                 Identify boundary nodes $B_{sq} := B(S_s, S_q) \\subset S_q$ (only needed if $s \\ne q$).\n9                 Compute $W_{B_{sq}}$ for boundary nodes (only needed if $s \\ne q$).\n10                Compute $U_{B_{sq}}$, and $P_{ij}$ for all $(i, j)$ in the current block.\n11                Conduct coordinate updates.\n12    Find the step size $\\alpha$ by the method proposed in Section 3.2.\n\nParallelization. While our method can run well on a single machine with a single core, here we point out components of our algorithm that can be \u201cembarrassingly\u201d parallelized on any single machine with multiple cores (with shared memory). We first note that we can obtain a good starting point for our algorithm by applying the divide-and-conquer framework proposed in [8]: this divides the problem into $k$ subproblems, which can then be independently solved in parallel. 
Consider the steps of our Algorithm 1 in BIGQUIC. In step 2, instead of computing columns of $W$ one by one, we can compute $t$ rows of $W$ at a time, and parallelize these $t$ jobs. A similar trick can be used in steps 6 and 9. In step 3, we use the multi-core version of METIS (ParMETIS) for graph clustering. In steps 8 and 10, the computations are naturally independent. In step 12, we compute each term in (7) independently and abort if any of the processes report non-positive definiteness. The only sequential part is the coordinate update in step 11, but note (see Section 3.1) that we have reduced the complexity of this step from $O(p)$ in QUIC to $O(p/k)$.\n\n4 Convergence Analysis\n\nIn this section, we present two main theoretical results. First, we show that our algorithm converges to the global optimum even with inaccurate Hessian computation. Second, we show that by a careful control of the error in the Hessian computation, BIGQUIC can still achieve a quadratic rate of convergence in terms of Newton iterations. Our analysis differs from that in QUIC [9], where the computations are all assumed to be accurate. [11] also provides a convergence analysis for general proximal Newton methods, but our algorithm with modifications such as fixed/free set selection does not exactly fall into their framework; moreover our analysis shows a quadratic convergence rate, while they only show a super-linear convergence rate.\n\nIn the BIGQUIC algorithm, we compute $w_i$ in two places. The first place is the gradient computation in the second term of (4), where $\\nabla g(X) = S - W$. The second place is in the third term of (4), where $\\nabla^2 g(X) = W \\otimes W$. 
At first glance they are equivalent and can be computed simultaneously, but it turns out that by carefully analysing the difference between two types of $w_i$, we can achieve much faster convergence, as discussed below.\n\nThe key observation is that we only require the gradient $W_{ij}$ for all $(i, j) \\in S_{free}$ to conduct coordinate descent updates. Since the free set is very sparse and can fit in memory, those $W_{ij}$ only need to be computed once and stored in memory. On the other hand, the computation of $w_i^T D w_j$ corresponds to the Hessian computation, and we need two columns for each coordinate update, which have to be computed repeatedly.\n\nIt is easy to produce an example where the algorithm converges to a wrong point when the gradient computation is not accurate, as shown in Figure 5(b) (in Appendix 6.5). Luckily, based on the above analysis the gradient only needs to be computed once per Newton iteration, so we can compute it with high precision. On the other hand, $w_i$ for the Hessian has to be computed repeatedly, so we do not want to spend too much time to compute each of them accurately. We define $\\hat{H}_t = \\hat{W}_t \\otimes \\hat{W}_t$ to be the approximated Hessian matrix, and derive the following theorem to show that even if the Hessian is inaccurate, BIGQUIC still converges to the global optimum. Notice that our proof covers the BIGQUIC algorithm with fixed/free set selection, and the only assumption is that subproblem (3) is solved exactly for each Newton iteration; we leave to future work the case where subproblems are solved approximately.\n\nTheorem 1. For solving (1), if $\\nabla g(X)$ is computed exactly and $\\bar{\\eta} I \\succeq \\hat{H}_t \\succeq \\underline{\\eta} I$ for some constants $\\bar{\\eta}, \\underline{\\eta} > 0$ at every Newton iteration, then BIGQUIC converges to the global optimum.\n\nThe proof is in Appendix 6.1. Theorem 1 suggests that we do not need very accurate Hessian computation for convergence. 
To have a super-linear convergence rate, we require the Hessian computation to be more and more accurate as $X_t$ approaches $X^*$. We first introduce the following notion of minimum norm subgradient to measure the optimality of $X$:\n\n$$\\mathrm{grad}^S_{ij} f(X) = \\begin{cases} \\nabla_{ij} g(X) + \\mathrm{sign}(X_{ij})\\lambda_{ij} & \\text{if } X_{ij} \\ne 0, \\\\ \\mathrm{sign}(\\nabla_{ij} g(X)) \\max(|\\nabla_{ij} g(X)| - \\lambda_{ij}, 0) & \\text{if } X_{ij} = 0. \\end{cases}$$\n\nThe following theorem then shows that if we compute the Hessian more and more accurately, BIGQUIC will have a super-linear or even quadratic convergence rate.\n\nTheorem 2. When applying BIGQUIC to solve (1), assume $\\nabla g(X_t)$ is exactly computed and $\\nabla^2 g(X_t)$ is approximated by $\\hat{H}_t$, and the following condition holds:\n\n$$\\nexists (i, j) \\text{ such that } X^*_{ij} = 0 \\text{ and } |\\nabla_{ij} g(X^*)| = \\lambda. \\qquad (8)$$\n\nThen $\\|X_{t+1} - X^*\\| = O(\\|X_t - X^*\\|^{1+p})$ as $t \\to \\infty$ for $0 < p \\le 1$ if and only if\n\n$$\\|\\hat{H}_t - \\nabla^2 g(X_t)\\| = O(\\|\\mathrm{grad}^S(X_t)\\|^p) \\text{ as } t \\to \\infty. \\qquad (9)$$\n\nThe proof is in Appendix 6.2. The assumption in (8) can be shown to be satisfied with very high probability (and was also satisfied in our experiments). Theorem 2 suggests that we can achieve super-linear, or even quadratic convergence rate by a careful control of the approximated Hessian $\\hat{H}_t$. In the BIGQUIC algorithm, we can further control $\\|\\hat{H}_t - \\nabla^2 g(X_t)\\|$ by the residual of conjugate gradient solvers to achieve desired convergence rate. Suppose the residual is $b_i = X\\hat{w}_i - e_i$ for each $i = 1, \\ldots, p$, and $B_t = [b_1 b_2 \\ldots b_p]$ is a collection of the residuals at the $t$-th iteration. The following theorem shows that we can control the convergence rate by controlling the norm of $B_t$.\n\nTheorem 3. In the BIGQUIC algorithm, if the residual matrix satisfies $\\|B_t\\| = O(\\|\\mathrm{grad}^S(X_t)\\|^p)$ for some $0 < p \\le 1$ as $t \\to \\infty$, then $\\|X_{t+1} - X^*\\| = O(\\|X_t - X^*\\|^{1+p})$ as $t \\to \\infty$.\n\nThe proof is in Appendix 6.3. Since $\\mathrm{grad}^S(X_t)$ can be easily computed without additional cost, and residuals $B$ can be naturally controlled when running conjugate gradient, we can easily control the asymptotic convergence rate in practice.\n\nFigure 1: The comparison of scalability on three types of graph structures: (a) chain graph, (b) random graph, (c) fMRI data. In all the experiments, BIGQUIC can solve larger problems than QUIC even with a single core, and using 32 cores BIGQUIC can solve million dimensional data in one day.\n\n5 Experimental Results\n\nIn this section, we show that our proposed method BIGQUIC can scale to high-dimensional datasets on both synthetic data and real data. All the experiments are run on a single computing node with 4 Intel Xeon E5-4650 2.7GHz CPUs, each with 8 cores and 32G memory.\n\nScalability of BIGQUIC on high-dimensional datasets. In the first set of experiments, we show BIGQUIC can scale to extremely high dimensional datasets. We conduct experiments on the following synthetic and real datasets:\n(1) Chain graphs: the ground truth precision matrix is set to be $\\Sigma^{-1}_{i,i-1} = -0.5$ and $\\Sigma^{-1}_{i,i} = 1.25$.\n(2) Graphs with random pattern: we use the procedure mentioned in Example 1 in [12] to generate random pattern. When generating the graph, we assume there are 500 clusters, and 90% of the edges are within clusters. We fix the average degree to be 10.\n(3) FMRI data: The original dataset has dimensionality $p$ = 228,483 and $n$ = 518. For scalability experiments, we subsample various number of random variables from the whole dataset.\n\nWe use $\\lambda = 0.5$ for chain and random graphs so that the number of recovered edges is close to the ground truth, and set the number of samples $n = 100$. 
We use \u03bb = 0.6 for the fMRI dataset, which recovers\na graph with average degree 20. We set the stopping condition to be grad^S(X_t) < 0.01 \u2016X_t\u2016_1. In\nall of our experiments, the number of nonzeros during the optimization phase does not exceed 5\u2016X^*\u2016_0 in\nintermediate steps, so we can always store the sparse representation of X_t in memory. For\nBIGQUIC, we set the number of blocks k to be the smallest number such that p/k columns of W can fit into\n32G memory. For both QUIC and BIGQUIC, we apply the divide-and-conquer method proposed\nin [8] with 10 clusters to get a better initial point. The results are shown in Figure 1. We can see\nthat BIGQUIC can solve one-million-dimensional chain graphs and random graphs in one day, and\nhandle the full fMRI dataset in about 5 hours.\nMore interestingly, even for datasets with size less than 30000, where p^2-sized matrices can fit in\nmemory, BIGQUIC is faster than QUIC because it exploits the sparsity. Figure 2 shows an example on\na sampled fMRI dataset with p = 20000, and we can see that BIGQUIC outperforms QUIC even when\nusing a single core. Also, BIGQUIC is much faster than other solvers, including Glasso [7] and\nALM [17]. Figure 3 shows the speedup on a multicore shared-memory machine. BIGQUIC can\nachieve about a 14-fold speedup using 16 cores, and a 20-fold speedup using 32 cores.\nfMRI dataset. An extensive resting-state fMRI dataset from a single individual was analyzed in order\nto test BIGQUIC on real-world data. The data (collected as part of the MyConnectome project:\nhttp://www.myconnectome.org) comprised 36 resting fMRI sessions collected across different\ndays using whole-brain multiband EPI acquisition, each lasting 10 minutes (TR=1.16 secs,\nmultiband factor=4, TE = 30 ms, voxel size = 2.4 mm isotropic, 68 slices, 518 time points). 
The data were preprocessed using FSL 5.0.2, including motion correction, scrubbing of motion frames,\nregistration of EPI images to a common high-resolution structural image using boundary-based registration,\nand affine transformation to MNI space. The full brain mask included 228,483 voxels.\nAfter motion scrubbing, the dataset included a total of 18,435 time points across all sessions.\n\nFigure 2: Comparison on fMRI data with p = 20000 (the maximum dimension that state-of-the-art\nsoftware can handle).\n\nFigure 3: The speedup of BIGQUIC when using multiple cores.\n\nBIGQUIC was applied to the full dataset: for the first time, we can learn a GMRF over the entire\nset of voxels, instead of over a smaller set of curated regions or supervoxels. Exploratory analyses\nover a range of \u03bb values suggested that \u03bb = 0.5 offered a reasonable level of sparsity. The resulting\ngraph was analyzed to determine whether it identified neuroscientifically plausible networks.\nDegree was computed for each vertex; high-degree regions were primarily found in gray matter regions,\nsuggesting that the method successfully identified plausible functional connections (see left\npanel of Figure 4). The structure of the graph was further examined in order to determine whether\nthe method identified plausible network modules. Modularity-based clustering [1] was applied to\nthe graph, resulting in 60 modules that exceeded the threshold size of 100 vertices. A number of\nneurobiologically plausible resting-state networks were identified, including \u201cdefault mode\u201d and\nsensorimotor networks (right panel of Figure 4). In addition, the method identified a number of\nstructured coherent noise sources (i.e., MRI artifacts) in the dataset. 
For both neurally plausible and\nartifactual modules, the modules detected by BIGQUIC are similar to those identified using independent\ncomponents analysis on the same dataset, without the need for the extensive dimensionality\nreduction (without statistical guarantees) inherent in such techniques.\n\nFigure 4: (Best viewed in color) Results from BIGQUIC analyses of resting-state fMRI data. Left panel: Map\nof degree distribution across voxels, thresholded at degree=20. Regions showing high degree were generally\nfound in the gray matter (as expected for truly connected functional regions), with very few high-degree voxels\nfound in the white matter. Right panel: Left-hemisphere surface renderings of two network modules obtained\nthrough graph clustering. The top panel shows a sensorimotor network; the bottom panel shows medial prefrontal,\nposterior cingulate, and lateral temporoparietal regions characteristic of the \u201cdefault mode\u201d generally observed\nduring the resting state. Both of these are commonly observed in analyses of resting-state fMRI data.\nAcknowledgments\nThis research was supported by NSF grant CCF-1320746 and NSF grant CCF-1117055. C.-J.H.\nalso acknowledges the support of an IBM PhD fellowship. P.R. acknowledges the support of ARO via\nW911NF-12-1-0390 and NSF via IIS-1149803, DMS-1264033. R.P. acknowledges the support of\nONR via N000140710116 and the James S. McDonnell Foundation.\n\nReferences\n[1] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of community\nhierarchies in large networks. J. Stat Mech, 2008.\n\n[2] T. Cai, W. Liu, and X. Luo. A constrained \u2113_1 minimization approach to sparse precision matrix\nestimation. Journal of American Statistical Association, 106:594\u2013607, 2011.\n\n[3] A. d\u2019Aspremont, O. Banerjee, and L. E. Ghaoui. First-order methods for sparse covariance\nselection. SIAM Journal on Matrix Analysis and its Applications, 30(1):56\u201366, 2008.\n\n[4] R. S. 
Dembo, S. C. Eisenstat, and T. Steihaug. Inexact Newton methods. SIAM J. Numerical\nAnal., 19(2):400\u2013408, 1982.\n\n[5] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel\napproach. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI),\n29(11):1944\u20131957, 2007.\n\n[6] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians.\nUAI, 2008.\n\n[7] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical\nlasso. Biostatistics, 9(3):432\u2013441, July 2008.\n\n[8] C.-J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A divide-and-conquer method for\nsparse inverse covariance estimation. In NIPS, 2012.\n\n[9] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance estimation\nusing quadratic approximation. 2013.\n\n[10] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular\ngraphs. SIAM J. Sci. Comput., 20(1):359\u2013392, 1999.\n\n[11] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite\nfunctions. In NIPS, 2012.\n\n[12] L. Li and K.-C. Toh. An inexact interior point method for \u2113_1-regularized sparse covariance\nselection. Mathematical Programming Computation, 2:291\u2013315, 2010.\n\n[13] R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for\nlarge-scale graphical lasso. Journal of Machine Learning Research, 13:723\u2013736, 2012.\n\n[14] N. Meinshausen and P. B\u00fchlmann. High dimensional graphs and variable selection with the\nlasso. Annals of Statistics, 34:1436\u20131462, 2006.\n\n[15] P. Olsen, F. Oztoprak, J. Nocedal, and S. Rennie. Newton-like methods for sparse inverse\ncovariance estimation. Technical report, Optimization Center, Northwestern University, 2012.\n\n[16] B. Rolfs, B. Rajaratnam, D. Guillot, A. Maleki, and I. Wong. 
Iterative thresholding algorithm\nfor sparse inverse covariance estimation. In NIPS, 2012.\n\n[17] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating\nlinearization methods. NIPS, 2010.\n\n[18] K. Scheinberg and I. Rish. Learning sparse Gaussian Markov networks using a greedy coordinate\nascent approach. In J. Balc\u00e1zar, F. Bonchi, A. Gionis, and M. Sebag, editors, Machine\nLearning and Knowledge Discovery in Databases, volume 6323 of Lecture Notes in Computer\nScience, pages 196\u2013212. Springer Berlin / Heidelberg, 2010.\n\n[19] D. M. Witten, J. H. Friedman, and N. Simon. New insights and faster computations for the\ngraphical lasso. Journal of Computational and Graphical Statistics, 20(4):892\u2013900, 2011.\n", "award": [], "sourceid": 1459, "authors": [{"given_name": "Cho-Jui", "family_name": "Hsieh", "institution": "UT Austin"}, {"given_name": "Matyas", "family_name": "Sustik", "institution": "University of Texas"}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": "University of Texas"}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": "UT Austin"}, {"given_name": "Russell", "family_name": "Poldrack", "institution": "University of Texas"}]}