{"title": "Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 2330, "page_last": 2338, "abstract": "The L_1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program.  In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton's method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when  compared to other state-of-the-art methods.", "full_text": "Sparse Inverse Covariance Matrix Estimation\n\nUsing Quadratic Approximation\n\nCho-Jui Hsieh, M\u00e1ty\u00e1s A. Sustik, Inderjit S. Dhillon, and Pradeep Ravikumar\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\n{cjhsieh,sustik,inderjit,pradeepr}@cs.utexas.edu\n\nAustin, TX 78712 USA\n\nAbstract\n\nThe \u21131 regularized Gaussian maximum likelihood estimator has been shown to\nhave strong statistical guarantees in recovering a sparse inverse covariance matrix,\nor alternatively the underlying graph structure of a Gaussian Markov Random\nField, from very limited samples. We propose a novel algorithm for solving the\nresulting optimization problem which is a regularized log-determinant program. 
In\ncontrast to other state-of-the-art methods that largely use first order gradient information,\nour algorithm is based on Newton\u2019s method and employs a quadratic approximation,\nbut with some modifications that leverage the structure of the sparse\nGaussian MLE problem. We show that our method is superlinearly convergent,\nand also present experimental results using synthetic and real application data that\ndemonstrate the considerable improvements in performance of our method when\ncompared to other state-of-the-art methods.\n\n1 Introduction\nGaussian Markov Random Fields; Covariance Estimation. Increasingly, in modern settings statistical\nproblems are high-dimensional, where the number of parameters is large when compared to\nthe number of observations. An important class of such problems involves estimating the graph\nstructure of a Gaussian Markov random field (GMRF) in the high-dimensional setting, with applications\nranging from inferring gene networks to analyzing social interactions. Specifically, given\nn independently drawn samples {y1, y2, . . . , yn} from a p-variate Gaussian distribution, so that\nyi \u223c N (\u00b5, \u03a3), the task is to estimate its inverse covariance matrix \u03a3^{\u22121}, also referred to as the\nprecision or concentration matrix. The non-zero pattern of this inverse covariance matrix \u03a3^{\u22121} can\nbe shown to correspond to the underlying graph structure of the GMRF. An active line of work in\nhigh-dimensional settings where p > n is thus based on imposing some low-dimensional structure,\nsuch as sparsity or graphical model structure on the model space. Accordingly, a line of recent\npapers [2, 8, 20] has proposed an estimator that minimizes the Gaussian negative log-likelihood regularized\nby the \u21131 norm of the entries (or the off-diagonal entries) of the inverse covariance matrix. 
The\nresulting optimization problem is a log-determinant program, which is convex, and can be solved in\npolynomial time.\nExisting Optimization Methods for the regularized Gaussian MLE. Due in part to its importance,\nthere has been an active line of work on efficient optimization methods for solving the \u21131 regularized\nGaussian MLE problem. In [8, 2] a block coordinate descent method has been proposed which is\ncalled the graphical lasso or GLASSO for short. Other recent algorithms proposed for this problem\ninclude PSM, which uses projected subgradients [5]; ALM, which uses alternating linearization [14]; IPM, an\ninexact interior point method [11]; and SINCO, a greedy coordinate descent method [15].\nFor typical high-dimensional statistical problems, optimization methods typically suffer sub-linear\nrates of convergence [1]. This would be too expensive for the Gaussian MLE problem, since the\nnumber of matrix entries scales quadratically with the number of nodes. Luckily, the log-determinant\nproblem has special structure; the log-determinant function is strongly convex and one can observe\nlinear (i.e. geometric) rates of convergence for the state-of-the-art methods listed above. However,\nat most linear rates in turn become infeasible when the problem size is very large, with the number\nof nodes in the thousands and the number of matrix entries to be estimated in the millions. Here\nwe ask the question: can we obtain superlinear rates of convergence for the optimization problem\nunderlying the \u21131 regularized Gaussian MLE?\nOne characteristic of these state-of-the-art methods is that they are first-order iterative methods that\nmainly use gradient information at each step. Such first-order methods have become increasingly\npopular in recent years for high-dimensional problems in part due to their ease of implementation,\nand because they require very little computation and memory at each step. 
The caveat is that they\nhave at most linear rates of convergence [3]. For superlinear rates, one has to consider second-order\nmethods which at least in part use the Hessian of the objective function. There are however some\ncaveats to the use of such second-order methods in high-dimensional settings. First, a straightforward\nimplementation of each second-order step would be very expensive for high-dimensional\nproblems. Secondly, the log-determinant function in the Gaussian MLE objective acts as a barrier\nfunction for the positive definite cone. This barrier property would be lost under quadratic approximations\nso there is a danger that Newton-like updates will not yield positive-definite matrices, unless\none explicitly enforces such a constraint in some manner.\nOur Contributions. In this paper, we present a new second-order algorithm to solve the \u21131 regularized\nGaussian MLE. We perform Newton steps that use iterative quadratic approximations of the\nGaussian negative log-likelihood, but with three innovations that enable finessing the caveats detailed\nabove. First, we provide an efficient method to compute the Newton direction. As in recent\nmethods [12, 9], we build on the observation that the Newton direction computation is a Lasso problem,\nand perform iterative coordinate descent to solve this Lasso problem. However, the naive approach\nhas an update cost of O(p^2) for performing each coordinate descent update in the inner loop,\nwhich makes it infeasible for this problem. But we show how a careful arrangement and\ncaching of the computations can reduce this cost to O(p). Secondly, we use an Armijo-rule based\nstep size selection rule to obtain a step-size that ensures sufficient descent and positive-definiteness\nof the next iterate. 
Thirdly, we use the form of the stationary condition characterizing the optimal\nsolution to then focus the Newton direction computation on a small subset of free variables, in a\nmanner that preserves the strong convergence guarantees of second-order descent.\nHere is a brief outline of the paper. In Section 3, we present our algorithm that combines quadratic\napproximation, Newton\u2019s method and coordinate descent. In Section 4, we show that our algorithm\nis not only convergent but superlinearly so. We summarize the experimental results in Section 5,\nusing real application data from [11] to compare the algorithms, as well as synthetic examples which\nreproduce experiments from [11]. We observe that our algorithm performs overwhelmingly better\n(quadratic instead of linear convergence) than the other solutions described in the literature.\n2 Problem Setup\nLet y be a p-variate Gaussian random vector, with distribution N (\u00b5, \u03a3). We are given n independently\ndrawn samples {y1, . . . , yn} of this random vector, so that the sample covariance matrix can\nbe written as\n\n$$S = \\frac{1}{n} \\sum_{k=1}^{n} (y_k - \\hat{\\mu})(y_k - \\hat{\\mu})^T, \\quad \\text{where } \\hat{\\mu} = \\frac{1}{n} \\sum_{i=1}^{n} y_i. \\qquad (1)$$\n\nGiven some regularization penalty \u03bb > 0, the \u21131 regularized Gaussian MLE for the inverse covariance\nmatrix can be computed by solving the following regularized log-determinant program:\n\n$$\\arg\\min_{X \\succ 0} \\left\\{ -\\log\\det X + \\mathrm{tr}(SX) + \\lambda \\|X\\|_1 \\right\\} = \\arg\\min_{X \\succ 0} f(X), \\qquad (2)$$\n\nwhere \u2016X\u20161 = \u2211_{i,j=1}^{p} |Xij| is the elementwise \u21131 norm of the p \u00d7 p matrix X. Our results\ncan be also extended to allow a regularization term of the form \u2016\u03bb \u25e6 X\u20161 = \u2211_{i,j=1}^{p} \u03bbij|Xij|,\ni.e. different nonnegative weights can be assigned to different entries. This would include for\ninstance the popular off-diagonal \u21131 regularization variant where we penalize \u2211_{i\u2260j} |Xij|, but not the\ndiagonal entries. The addition of such \u21131 regularization promotes sparsity in the inverse covariance\nmatrix, and thus encourages sparse graphical model structure. For further details on the background\nof \u21131 regularization in the context of GMRFs, we refer the reader to [20, 2, 8, 15].\n3 Quadratic Approximation Method\nOur approach is based on computing iterative quadratic approximations to the regularized Gaussian\nMLE objective f (X) in (2). This objective function f can be seen to comprise two parts, f (X) \u2261\ng(X) + h(X), where\n\n$$g(X) = -\\log\\det X + \\mathrm{tr}(SX) \\quad \\text{and} \\quad h(X) = \\lambda \\|X\\|_1. \\qquad (3)$$\n\nThe first component g(X) is twice differentiable, and strictly convex, while the second part\nh(X) is convex but non-differentiable. Following the standard approach [17, 21] to building a\nquadratic approximation around any iterate Xt for such composite functions, we build the second-order\nTaylor expansion of the smooth component g(X). The second-order expansion for the\nlog-determinant function (see for instance [4, Chapter A.4.3]) is given by\n\n$$\\log\\det(X_t + \\Delta) \\approx \\log\\det X_t + \\mathrm{tr}(X_t^{-1} \\Delta) - \\frac{1}{2} \\mathrm{tr}(X_t^{-1} \\Delta X_t^{-1} \\Delta).$$\n\nWe introduce $W_t = X_t^{-1}$ and write the second-order\napproximation $\\bar{g}_{X_t}(\\Delta)$ to g(X) = g(Xt + \u2206) as\n\n$$\\bar{g}_{X_t}(\\Delta) = \\mathrm{tr}((S - W_t)\\Delta) + \\frac{1}{2} \\mathrm{tr}(W_t \\Delta W_t \\Delta) - \\log\\det X_t + \\mathrm{tr}(S X_t). \\qquad (4)$$\n\nThe Newton direction Dt for the entire objective f (X) can then be written as the solution\nof the regularized quadratic program:\n\n$$D_t = \\arg\\min_{\\Delta} \\; \\bar{g}_{X_t}(\\Delta) + h(X_t + \\Delta). \\qquad (5)$$\n\nThis Newton direction can be used to compute iterative estimates {Xt} for solving the optimization\nproblem in (2). In the sequel, we will detail three innovations which make this approach feasible.\nFirstly, we provide an efficient method to compute the Newton direction. 
As in recent methods [12],\nwe build on the observation that the Newton direction computation is a Lasso problem, and perform\niterative coordinate descent to find its solution. However, the naive approach has an update cost of\nO(p^2) for performing each coordinate descent update in the inner loop, which makes it\ninfeasible for this problem. We show how a careful arrangement and caching of the computations\ncan reduce this cost to O(p). Secondly, we use an Armijo-rule based step size selection rule to obtain\na step-size that ensures sufficient descent and positive-definiteness of the next iterate. Thirdly, we\nuse the form of the stationary condition characterizing the optimal solution to then focus the Newton\ndirection computation on a small subset of free variables, in a manner that preserves the strong\nconvergence guarantees of second-order descent. We outline each of these three innovations in the\nfollowing three subsections. We then detail the complete method in Section 3.4.\n3.1 Computing the Newton Direction\nThe optimization problem in (5) is an \u21131 regularized least squares problem, also called Lasso [16]. It\nis straightforward to verify that for a symmetric matrix \u2206 we have tr(Wt\u2206Wt\u2206) = vec(\u2206)^T (Wt \u2297\nWt) vec(\u2206), where \u2297 denotes the Kronecker product and vec(X) is the vectorized listing of the\nelements of matrix X.\nIn [7, 18] the authors show that coordinate descent methods are very efficient for solving lasso type\nproblems. However, an obvious way to update each element of \u2206 to solve for the Newton direction\nin (5) needs O(p^2) floating point operations since Q := Wt \u2297 Wt is a p^2 \u00d7 p^2 matrix, thus yielding an\nO(p^4) procedure for approximating the Newton direction. As we show below, our implementation\nreduces the cost of one variable update to O(p) by exploiting the structure of Q or in other words\nthe specific form of the second order term tr(Wt\u2206Wt\u2206). 
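The Kronecker form of the quadratic term can be checked numerically on a toy example. The sketch below is an illustration only: the helper names (matmul, trace, kron, vec) and the 2 x 2 matrices are assumptions made for this example and are not taken from the QUIC implementation. It forms Q explicitly, which is exactly the p^2 x p^2 object whose cost the method is designed to avoid.

```python
def matmul(a, b):
    # Plain triple-loop product of two square lists-of-lists.
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def kron(a, b):
    # Kronecker product: entry (i*n + k, j*n + l) equals a[i][j] * b[k][l].
    n = len(a)
    return [[a[i][j] * b[k][l] for j in range(n) for l in range(n)]
            for i in range(n) for k in range(n)]


def vec(a):
    # Column-stacking vectorisation of a square matrix.
    n = len(a)
    return [a[i][j] for j in range(n) for i in range(n)]


W = [[2.0, 0.5], [0.5, 1.0]]        # symmetric, plays the role of W_t
Delta = [[0.3, -0.2], [-0.2, 0.1]]  # symmetric perturbation

# Left-hand side: tr(W Delta W Delta).
WD = matmul(W, Delta)
lhs = trace(matmul(WD, WD))

# Right-hand side: vec(Delta)^T (W kron W) vec(Delta).
Q = kron(W, W)
v = vec(Delta)
rhs = sum(v[r] * Q[r][c] * v[c] for r in range(len(v)) for c in range(len(v)))

print(lhs, rhs)
```

Both sides agree up to rounding error, which is why the second-order term can be handled through products with W alone rather than through Q.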
Next, we discuss the details.\nFor notational simplicity we will omit the Newton iteration index t in the derivations that follow.\n(Hence, the notation for $\\bar{g}_{X_t}$ is also simplified to $\\bar{g}$.) Furthermore, we omit the use of a separate\nindex for the coordinate descent updates. Thus, we simply use D to denote the current iterate\napproximating the Newton direction and use D\u2032 for the updated direction. Consider the coordinate\ndescent update for the variable Xij, with i < j, that preserves symmetry: $D' = D + \\mu (e_i e_j^T + e_j e_i^T)$.\nThe solution of the one-variable problem corresponding to (5) yields \u00b5:\n\n$$\\arg\\min_{\\mu} \\; \\bar{g}(D + \\mu (e_i e_j^T + e_j e_i^T)) + 2\\lambda |X_{ij} + D_{ij} + \\mu|. \\qquad (6)$$\n\nAs a matter of notation: we use xi to denote the i-th column of the matrix X. We expand the terms\nappearing in the definition of $\\bar{g}$ after substituting $D' = D + \\mu (e_i e_j^T + e_j e_i^T)$ for \u2206 in (4) and omit\nthe terms not dependent on \u00b5. The contribution of tr(SD\u2032) \u2212 tr(W D\u2032) yields 2\u00b5(Sij \u2212 Wij), while\nthe regularization term contributes 2\u03bb|Xij + Dij + \u00b5|, as seen from (6). The quadratic term can be\nrewritten using tr(AB) = tr(BA) and the symmetry of D and W to yield:\n\n$$\\mathrm{tr}(W D' W D') = \\mathrm{tr}(W D W D) + 4\\mu \\, w_i^T D w_j + 2\\mu^2 (W_{ij}^2 + W_{ii} W_{jj}). \\qquad (7)$$\n\nIn order to compute the single variable update we seek the minimum of the following function of \u00b5:\n\n$$\\frac{1}{2} (W_{ij}^2 + W_{ii} W_{jj}) \\mu^2 + (S_{ij} - W_{ij} + w_i^T D w_j) \\mu + \\lambda |X_{ij} + D_{ij} + \\mu|. \\qquad (8)$$\n\nLetting $a = W_{ij}^2 + W_{ii} W_{jj}$, $b = S_{ij} - W_{ij} + w_i^T D w_j$, and $c = X_{ij} + D_{ij}$, the minimum is achieved\nfor:\n\n$$\\mu = -c + \\mathcal{S}(c - b/a, \\lambda/a), \\qquad (9)$$\n\nwhere $\\mathcal{S}(z, r) = \\mathrm{sign}(z) \\max\\{|z| - r, 0\\}$ is the soft-thresholding function. The values of a and c\nare easy to compute. The main cost arises while computing the third term contributing to coefficient\nb, namely $w_i^T D w_j$. Direct computation requires O(p^2) time. Instead, we maintain U = DW by\nupdating two rows of the matrix U for every variable update in D costing O(p) flops, and then\ncompute $w_i^T u_j$ using also O(p) flops. Another way to view this arrangement is that we maintain a\ndecomposition $W D W = \\sum_{k=1}^{p} w_k u_k^T$ throughout the process by storing the uk vectors, allowing\nO(p) computation of update (9). In order to maintain the matrix U we also need to update two\ncoordinates of each uk when Dij is modified. We can compactly write the row updates of U as\nfollows: ui\u00b7 \u2190 ui\u00b7 + \u00b5wj\u00b7 and uj\u00b7 \u2190 uj\u00b7 + \u00b5wi\u00b7, where ui\u00b7 refers to the i-th row vector of U.\nWe note that the calculation of the Newton direction can be simplified if X is a diagonal matrix.\nFor instance, if we are starting from a diagonal matrix X0, the terms $w_i^T D w_j$ equal\nDij/((X0)ii(X0)jj), which are independent of each other, implying that we need to update\neach variable according to (9) only once, and the resulting D will be the optimum of (5). Hence, the\ntime cost of finding the first Newton direction is reduced from O(p^3) to O(p^2).\n3.2 Computing the Step Size\nFollowing the computation of the Newton direction Dt, we need to find a step size \u03b1 \u2208 (0, 1] that\nensures positive definiteness of the next iterate Xt + \u03b1Dt and sufficient decrease in the objective\nfunction.\nWe adopt Armijo\u2019s rule [3, 17] and try step-sizes \u03b1 \u2208 {\u03b2^0, \u03b2^1, \u03b2^2, . . . } with a constant decrease rate\n0 < \u03b2 < 1 (typically \u03b2 = 0.5) until we find the smallest k \u2208 N with \u03b1 = \u03b2^k such that Xt + \u03b1Dt\n(a) is positive-definite, and (b) satisfies the following condition:\n\n$$f(X_t + \\alpha D_t) \\le f(X_t) + \\alpha \\sigma \\Delta_t, \\quad \\Delta_t = \\mathrm{tr}(\\nabla g(X_t) D_t) + \\lambda \\|X_t + D_t\\|_1 - \\lambda \\|X_t\\|_1, \\qquad (10)$$\n\nwhere 0 < \u03c3 < 0.5 is a constant. 
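The backtracking rule in (10) can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' C++ code: the function names (cholesky_logdet, objective, armijo_step), the default sigma = 0.25 and the cap max_tries are hypothetical choices made here, and matrices are plain lists of lists. A failed Cholesky factorisation serves as the positive-definiteness test.

```python
import math


def cholesky_logdet(a):
    # Return log det(a) from a Cholesky factorisation, or None when a pivot
    # is not strictly positive (a is then not numerically positive definite).
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    logdet = 0.0
    for j in range(n):
        d = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            return None
        low[j][j] = math.sqrt(d)
        logdet += 2.0 * math.log(low[j][j])
        for i in range(j + 1, n):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = (a[i][j] - s) / low[j][j]
    return logdet


def l1_norm(a):
    return sum(abs(v) for row in a for v in row)


def objective(x, s, lam):
    # f(X) = -log det X + tr(SX) + lam * ||X||_1, or None if X is not PD.
    logdet = cholesky_logdet(x)
    if logdet is None:
        return None
    n = len(x)
    tr_sx = sum(s[i][j] * x[j][i] for i in range(n) for j in range(n))
    return -logdet + tr_sx + lam * l1_norm(x)


def armijo_step(x, d, s, w, lam, beta=0.5, sigma=0.25, max_tries=50):
    # w is assumed to be the inverse of x, so that grad g(X) = S - W.
    n = len(x)
    f0 = objective(x, s, lam)
    grad_dot_d = sum((s[i][j] - w[i][j]) * d[j][i]
                     for i in range(n) for j in range(n))
    full = [[x[i][j] + d[i][j] for j in range(n)] for i in range(n)]
    delta = grad_dot_d + lam * (l1_norm(full) - l1_norm(x))
    alpha = 1.0
    for _ in range(max_tries):
        trial = [[x[i][j] + alpha * d[i][j] for j in range(n)]
                 for i in range(n)]
        f_trial = objective(trial, s, lam)
        # Accept only if the trial point is PD and satisfies condition (10).
        if f_trial is not None and f_trial <= f0 + alpha * sigma * delta:
            return alpha, trial
        alpha *= beta
    raise RuntimeError('no acceptable step size found')


# Toy usage: X is the identity, so W = X^{-1} is the identity as well.
S = [[2.0, 0.3], [0.3, 1.5]]
X = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
D = [[-1.0, -0.2], [-0.2, -0.6]]   # a descent direction for this toy problem
alpha, X_next = armijo_step(X, D, S, W, lam=0.1)
print(alpha)
```

In this toy problem the full step is rejected because X + D has a zero pivot, and the first halved step is accepted.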
To verify positive definiteness, we use a Cholesky factorization\ncosting O(p^3) flops during the objective function evaluation to compute log det(Xt + \u03b1Dt) and this\nstep dominates the computational cost in the step-size computations. In the Appendix in Lemma 9\nwe show that for any Xt and Dt, there exists a $\\bar{\\alpha}_t > 0$ such that (10) and the positive-definiteness of\nXt + \u03b1Dt are satisfied for any $\\alpha \\in (0, \\bar{\\alpha}_t]$, so we can always find a step size satisfying (10) and the\npositive-definiteness even if we do not have the exact Newton direction. Following the line search\nand the Newton step update Xt+1 = Xt + \u03b1Dt we efficiently compute $W_{t+1} = X_{t+1}^{-1}$ by reusing\nthe Cholesky decomposition of Xt+1.\n3.3 Identifying which variables to update\nIn this section, we propose a way to select which variables to update that uses the stationary condition\nof the Gaussian MLE problem. At the start of any outer loop computing the Newton direction, we\npartition the variables into free and fixed sets based on the value of the gradient. Specifically, we\nclassify the (Xt)ij variable as fixed if |\u2207ij g(Xt)| < \u03bb \u2212 \u03b5 and (Xt)ij = 0, where \u03b5 > 0 is small.\n(We used \u03b5 = 0.01 in our experiments.) The remaining variables then constitute the free set. The\nfollowing lemma shows the property of the fixed set:\nLemma 1. For any Xt and the corresponding fixed and free sets Sfixed, Sfree, the optimized update\non the fixed set would not change any of the coordinates. In other words, the solution of the following\noptimization problem is \u2206 = 0:\n\n$$\\arg\\min_{\\Delta} f(X_t + \\Delta) \\quad \\text{such that } \\Delta_{ij} = 0 \\;\\; \\forall (i, j) \\in S_{free}.$$\n\nThe proof is given in Appendix 7.2.3. Based on the above observation, we perform the inner loop\ncoordinate descent updates restricted to the free set only (to find the Newton direction). 
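The free/fixed partition and the restricted coordinate descent can be sketched together. The code below is a hedged pure-Python illustration, not the QUIC source: the names soft_threshold, free_set and newton_direction, the fixed number of sweeps, and the toy 3 x 3 data are all assumptions made here. The diagonal case i = j uses a = W_ii^2, which follows from the same expansion as (7) but is not spelled out in the text above.

```python
def soft_threshold(z, r):
    # S(z, r) = sign(z) * max(|z| - r, 0)
    if z > r:
        return z - r
    if z < -r:
        return z + r
    return 0.0


def free_set(x, s, w, lam, eps=0.01):
    # Upper-triangular pairs (i <= j) that are not fixed. A variable is fixed
    # when x[i][j] == 0 and |grad_ij g(X)| = |s[i][j] - w[i][j]| < lam - eps.
    n = len(x)
    free = []
    for i in range(n):
        for j in range(i, n):
            fixed = x[i][j] == 0.0 and abs(s[i][j] - w[i][j]) < lam - eps
            if not fixed:
                free.append((i, j))
    return free


def newton_direction(x, s, w, lam, free, sweeps=20):
    # Coordinate descent over the free set only. The matrix u = d * w is kept
    # up to date so that each w_i^T D w_j costs O(p) instead of O(p^2).
    n = len(x)
    d = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for (i, j) in free:
            wi_d_wj = sum(w[k][i] * u[k][j] for k in range(n))
            b = s[i][j] - w[i][j] + wi_d_wj
            c = x[i][j] + d[i][j]
            if i == j:
                a = w[i][i] ** 2          # assumption: diagonal analogue of a
            else:
                a = w[i][j] ** 2 + w[i][i] * w[j][j]
            mu = -c + soft_threshold(c - b / a, lam / a)   # update (9)
            if mu == 0.0:
                continue
            d[i][j] += mu
            for k in range(n):            # row i of U gains mu * (row j of W)
                u[i][k] += mu * w[j][k]
            if i != j:
                d[j][i] += mu
                for k in range(n):        # row j of U gains mu * (row i of W)
                    u[j][k] += mu * w[i][k]
    return d


# Toy usage with X = I, so W = X^{-1} = I.
S = [[2.0, 0.3, 0.05], [0.3, 1.5, 0.0], [0.05, 0.0, 1.0]]
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
free = free_set(X, S, W, lam=0.1)
D = newton_direction(X, S, W, 0.1, free)
print(free)
```

In this example the pairs (0, 2) and (1, 2) are fixed, since their entries of X are zero and the gradient magnitudes are below lam - eps, so the direction stays zero there.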
Restricting the updates to the free set reduces\nthe number of variables over which we perform the coordinate descent from O(p^2) to the number\nof non-zeros in Xt, which in general is much smaller than p^2 when \u03bb is large and the solution is\nsparse. We have observed huge computational gains from this modification, and indeed in our main\ntheorem we show the superlinear convergence rate for the algorithm that includes this heuristic.\nThe attractive facet of this modification is that it leverages the sparsity of the solution and intermediate\niterates in a manner that falls within a block coordinate descent framework. Specifically, suppose\nas detailed above at any outer loop Newton iteration, we partition the variables into the fixed and\nfree set, and then first perform a Newton update restricted to the fixed block, followed by a Newton\nupdate on the free block. According to Lemma 1 a Newton update restricted to the fixed block does\nnot result in any changes.\nIn other words, performing the inner loop coordinate descent updates restricted to the free set is\nequivalent to two block Newton steps restricted to the fixed and free sets consecutively. Note further,\nthat the union of the free and fixed sets is the set of all variables, which as we show in the convergence\nanalysis in the appendix, is sufficient to ensure the convergence of the block Newton descent.\nBut would the size of free set be small? We initialize X0 to the identity matrix, which is indeed\nsparse. As the following lemma shows, if the limit of the iterates (the solution of the optimization\nproblem) is sparse, then after a finite number of iterations, the iterates Xt would also have the same\nsparsity pattern.\nLemma 2. Assume {Xt} converges to X\u2217. 
If for some index pair (i, j), |\u2207ij g(X\u2217)| < \u03bb (so that\n$X^*_{ij} = 0$), then there exists a constant $\\bar{t} > 0$ such that for all $t > \\bar{t}$, the iterates Xt satisfy\n\n$$|\\nabla_{ij} g(X_t)| < \\lambda \\text{ and } (X_t)_{ij} = 0. \\qquad (11)$$\n\nThe proof comes directly from Lemma 11 in the Appendix. Note that |\u2207ij g(X\u2217)| < \u03bb implying\n$X^*_{ij} = 0$ follows from the optimality condition of (2). A similar (so called shrinking) strategy is\nused in SVM or \u21131-regularized logistic regression problems as mentioned in [19]. In Appendix 7.4\nwe show in experiments that this strategy can reduce the number of free variables very quickly.\n3.4 The Quadratic Approximation based Method\nWe now have the machinery for a description of our algorithm QUIC, standing for QUadratic Inverse\nCovariance. A high level summary of the algorithm is shown in Algorithm 1, while the full\ndetails are given in Algorithm 2 in the Appendix.\nAlgorithm 1: Quadratic Approximation method for Sparse Inverse Covariance Learning (QUIC)\nInput: Empirical covariance matrix S, scalar \u03bb, initial X0, inner stopping tolerance \u03b5\nOutput: Sequence of Xt converging to $\\arg\\min_{X \\succ 0} f(X)$, where\nf (X) = \u2212 log det X + tr(SX) + \u03bb\u2016X\u20161.\n1 for t = 0, 1, . . . do\n2     Compute $W_t = X_t^{-1}$.\n3     Form the second order approximation $\\bar{f}_{X_t}(\\Delta) := \\bar{g}_{X_t}(\\Delta) + h(X_t + \\Delta)$ to $f(X_t + \\Delta)$.\n4     Partition the variables into free and fixed sets based on the gradient, see Section 3.3.\n5     Use coordinate descent to find the Newton direction $D_t = \\arg\\min_{\\Delta} \\bar{f}_{X_t}(\\Delta)$ over the\n      free variable set, see (6) and (9). (A Lasso problem.)\n6     Use an Armijo-rule based step-size selection to get \u03b1 s.t. Xt+1 = Xt + \u03b1Dt is positive definite\n      and the objective value sufficiently decreases, see (10).\n7 end\n4 Convergence Analysis\nIn this section, we show that our algorithm has strong convergence guarantees. 
Our first main result\nshows that our algorithm does converge to the optimum of (2). Our second result then shows that\nthe asymptotic convergence rate is actually superlinear, specifically quadratic.\n4.1 Convergence Guarantee\nWe build upon the convergence analysis in [17, 21] of the block coordinate gradient descent method\napplied to composite objectives. Specifically, [17, 21] consider iterative updates where at each\niteration t they update just a block of variables Jt. They then consider a Gauss-Seidel rule:\n\n$$\\bigcup_{j=0,\\ldots,T-1} J_{t+j} \\supseteq \\mathcal{N} \\quad \\forall \\, t = 1, 2, \\ldots, \\qquad (12)$$\n\nwhere N is the set of all variables and T is a fixed number. Note that the condition (12) ensures that\neach block of variables will be updated at least once every T iterations. Our Newton steps with the\nfree set modification are a special case of this framework: we set J2t, J2t+1 to be the fixed and free sets\nrespectively. As outlined in Section 3.3, our selection of the fixed sets ensures that a block update\nrestricted to the fixed set would not change any values since these variables in fixed sets already\nsatisfy the coordinatewise optimality condition. Thus, while our algorithm only explicitly updates\nthe free set block, this is equivalent to updating variables in fixed and free blocks consecutively. We\nalso have J2t \u222a J2t+1 = N, implying the Gauss-Seidel rule with T = 3.\nFurther, the composite objectives in [17, 21] have the form F (x) = g(x) + h(x), where g(x)\nis smooth (continuously differentiable), and h(x) is non-differentiable but separable. Note that in\nour case, the smooth component is the log-determinant function g(X) = \u2212 log det X + tr(SX),\nwhile the non-differentiable separable component is h(x) = \u03bb\u2016x\u20161. However, [17, 21] impose the\nadditional assumption that g(x) is smooth over the domain R^n. In our case g(x) is smooth over\nthe restricted domain of the positive definite cone $S^p_{++}$. 
In the appendix, we extend the analysis\nso that convergence still holds under our setting. In particular, we prove the following theorem in\nAppendix 7.2:\nTheorem 1. In Algorithm 1, the sequence {Xt} converges to the unique global optimum of (2).\n4.2 Asymptotic Convergence Rate\nIn addition to convergence, we further show that our algorithm has a quadratic asymptotic convergence\nrate.\nTheorem 2. Our algorithm QUIC converges quadratically, that is for some constant 0 < \u03ba < 1:\n\n$$\\lim_{t \\to \\infty} \\frac{\\|X_{t+1} - X^*\\|_F}{\\|X_t - X^*\\|_F^2} = \\kappa.$$\n\nThe proof, given in Appendix 7.3, first shows that the step size as computed in Section 3.2 would\neventually become equal to one, so that we would be eventually performing vanilla Newton updates.\nFurther we use the fact that after a finite number of iterations, the sign pattern of the iterates converges\nto the sign pattern of the limit. From these two assertions, we build on the convergence rate\nresult for constrained Newton methods in [6] to show that our method is quadratically convergent.\n5 Experiments\nIn this section, we compare our method QUIC with other state-of-the-art methods on both synthetic\nand real datasets. We have implemented QUIC in C++, and all the experiments were executed on\n2.83 GHz Xeon X5440 machines with 32G RAM and Linux OS.\nWe include the following algorithms in our comparisons:\n\u2022 ALM: the Alternating Linearization Method proposed by [14]. We use their MATLAB source\ncode for the experiments.\n\u2022 GLASSO: the block coordinate descent method proposed by [8]. We used their Fortran code\navailable from cran.r-project.org, version 1.3 released on 1/22/09.\n\u2022 PSM: the Projected Subgradient Method proposed by [5]. We use the MATLAB source code\navailable at http://www.cs.ubc.ca/~schmidtm/Software/PQN.html.\n\u2022 SINCO: the greedy coordinate descent method proposed by [15]. The code can be downloaded\nfrom https://projects.coin-or.org/OptiML/browser/trunk/sinco.\n\u2022 IPM: An inexact interior point method proposed by [11]. The source code can be downloaded\nfrom http://www.math.nus.edu.sg/~mattohkc/Covsel-0.zip.\n\nSince some of the above implementations do not support the generalized regularization term \u2016\u03bb \u25e6\nX\u20161, our comparisons use \u03bb\u2016X\u20161 as the regularization term.\nThe GLASSO algorithm description in [8] does not clearly specify the stopping criterion for the\nLasso iterations. Inspection of the available Fortran implementation has revealed that a separate\n\nTable 1: The comparisons on synthetic datasets. p stands for dimension, \u2016\u03a3^{\u22121}\u20160 indicates the\nnumber of nonzeros in ground truth inverse covariance matrix, \u2016X\u2217\u20160 is the number of nonzeros in\nthe solution, and \u03b5 is a specified relative error of objective value. \u2217 indicates the run time exceeds\nour time limit 30,000 seconds (8.3 hours). 
The results show that QUIC is overwhelmingly faster\nthan other methods, and is the only one which is able to scale up to solve problem where p = 10000.\n\nDataset setting\n\np #\u03a3\u22121#0\n\nParameter setting\n\u03bb #X \u2217#0\n\nTime (in seconds)\n\n\u0001 QUIC ALM Glasso PSM IPM Sinco\n0.30 18.89\n23.28 15.59 86.32 120.0\n2.26 41.85\n45.1 34.91 151.2 520.8\n11.28\n922\n3458 5246\n1068 567.9\n53.51\n*\n1734\n5754\n1258\n2119\n216.7 13820\n*\n*\n*\n8450\n986.6 28190\n*\n*\n* 19251\n0.52 42.34\n10.31 20.16 71.62 60.75\n1.2 28250\n20.43 59.89 116.7 683.3\n1.17 65.64\n17.96 23.53 78.27 576.0\n6.87\n*\n91.7 145.8 4449\n60.61\n23.25\n1429\n4928 7375\n1479\n1052\n160.2\n*\n*\n8097\n4232\n2561\n65.57\n*\n*\n5621\n2963\n3328\n478.8\n*\n*\n8356\n9541 13650\n337.7 26270 21298\n*\n*\n*\n1125\n*\n*\n*\n*\n*\n803.5\n*\n*\n*\n*\n*\n2951\n*\n*\n*\n*\n*\n\n10\u22122\n10\u22126\n10\u22122\n10\u22126\n10\u22122\n10\u22126\n10\u22122\n10\u22126\n10\u22122\n10\u22126\n10\u22122\n10\u22126\n10\u22122\n10\u22126\n10\u22122\n10\u22126\n10\u22122\n10\u22126\n\npattern\nchain\nchain\nchain\n\n1000\n\n2998\n\n0.4\n\n3028\n\n4000\n\n11998\n\n0.4\n\n11998\n\n10000\n\n29998\n\n0.4\n\n29998\n\nrandom 1000\n\n10758\n\nrandom 4000\n\n41112\n\nrandom 10000\n\n91410\n\n0.12\n\n10414\n\n0.075\n\n55830\n\n0.08\n\n41910\n\n0.05 247444\n\n0.08\n\n89652\n\n0.04 392786\n\nthreshold is computed and is used for these inner iterations. We found that under certain conditions\nthe threshold computed is smaller than the machine precision and as a result the overall algorithm\noccasionally displayed erratic convergencebehavior and slow performance. We modi\ufb01ed the Fortran\nimplementation of GLASSO to correct this error.\n5.1 Comparisons on synthetic datasets\nWe \ufb01rst compare the run times of the different methods on synthetic data. 
We generate the following two\ntypes of graph structures for the underlying Gaussian Markov Random Fields:\n\u2022 Chain Graphs: The ground truth inverse covariance matrix \u03a3^{\u22121} is set to be $(\\Sigma^{-1})_{i,i-1} = -0.5$ and\n$(\\Sigma^{-1})_{i,i} = 1.25$.\n\u2022 Graphs with Random Sparsity Structures: We use the procedure mentioned in Example 1 in [11]\nto generate inverse covariance matrices with random non-zero patterns. Specifically, we first\ngenerate a sparse matrix U with nonzero elements equal to \u00b11, set \u03a3^{\u22121} to be U^T U and then add\na diagonal term to ensure \u03a3^{\u22121} is positive definite. We control the number of nonzeros in U so\nthat the resulting \u03a3^{\u22121} has approximately 10p nonzero elements.\n\nGiven the inverse covariance matrix \u03a3^{\u22121}, we draw a limited number, n = p/2, of i.i.d. samples\nfrom the corresponding GMRF distribution, to simulate the high-dimensional setting. We then compare\nthe algorithms listed above when run on these samples.\nWe can use the minimum-norm sub-gradient defined in Lemma 5 in Appendix 7.2 as the stopping\ncondition, and computing it is easy because X^{\u22121} is available in QUIC. Table 1 shows the results\nfor timing comparisons in the synthetic datasets. We vary the dimensionality from 1000, 4000 to\n10000 for each dataset. For chain graphs, we select \u03bb so that the solution had the (approximately)\ncorrect number of nonzero elements. To test the performance of algorithms on different parameters\n(\u03bb), for random sparse pattern we test the speed under two values of \u03bb, one that recovers the correct number\nof nonzero elements, and one that yields 5 times that number of nonzero elements. We report the time\nfor each algorithm to achieve an \u03b5-accurate solution defined by f (X^k) \u2212 f (X\u2217) < \u03b5 f (X\u2217). 
Table 1\nshows the results for \u03b5 = 10^{\u22122} and 10^{\u22126}, where \u03b5 = 10^{\u22122} tests the ability for an algorithm to get a\ngood initial guess (the nonzero structure), and \u03b5 = 10^{\u22126} tests whether an algorithm can achieve an\naccurate solution. Table 1 shows that QUIC is consistently and overwhelmingly faster than other\nmethods, both initially with \u03b5 = 10^{\u22122}, and at \u03b5 = 10^{\u22126}. Moreover, for the p = 10000 random pattern,\nthere are p^2 = 100 million variables; the selection of fixed/free sets helps QUIC to focus only on\na very small part of the variables, and it can achieve an accurate solution in about 15 minutes, while other\nmethods fail to even have an initial guess within 8 hours. Notice that our \u03bb setting is smaller\nthan [14] because here we focus on the \u03bb which discovers true structure, therefore the comparison\nbetween ALM and PSM is different from [14].\n5.2 Experiments on real datasets\nWe use the real world biology datasets preprocessed by [11] to compare the performance of our\nmethod with other state-of-the-art methods. The regularization parameter \u03bb is set to 0.5 according\nto the experimental setting in [11]. Results on the following datasets are shown in Figure 1: Estrogen\n(p = 692), Arabidopsis (p = 834), Leukemia (p = 1,255), Hereditary (p = 1,869). We plot the\nrelative error (f (Xt) \u2212 f (X\u2217))/f (X\u2217) (on a log scale) against time in seconds. On these real\ndatasets, QUIC can be seen to achieve super-linear convergence, while other methods have at most\na linear convergence rate. Overall QUIC can be ten times faster than other methods, and even\nfaster when higher accuracy is desired.\n6 Acknowledgements\nWe would like to thank Professor Kim-Chuan Toh for providing the data set and the IPM code.\nWe would also like to thank Professor Katya Scheinberg and Shiqian Ma for providing the ALM\nimplementation. 
This research was supported by NSF grants IIS-1018426 and CCF-0728879. ISD\nacknowledges support from the Moncrief Grand Challenge Award.\n\n[Figure 1, four panels plotting relative error (log scale) against time (sec) for ALM, Sinco, PSM, Glasso, IPM, and QUIC: (a) Estrogen, p = 692; (b) Arabidopsis, p = 834; (c) Leukemia, p = 1,255; (d) hereditarybc, p = 1,869.]\n\nFigure 1: Comparison of algorithms on real datasets. The results show QUIC converges faster than\nother methods.\n\nReferences\n[1] A. Agarwal, S. Negahban, and M. Wainwright. Convergence rates of gradient methods for\nhigh-dimensional statistical recovery. In NIPS, 2010.\n\n[2] O. Banerjee, L. E. Ghaoui, and A. d\u2019Aspremont. Model selection through sparse maximum\nlikelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning\nResearch, 9, June 2008.\n\n[3] D. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, MA, 1995.\n\n[4] S. 
Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 7th printing\nedition, 2009.\n\n[5] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians.\nUAI, 2008.\n\n[6] J. Dunn. Newton\u2019s method and the Goldstein step-length rule for constrained minimization\nproblems. SIAM J. Control and Optimization, 18(6):659\u2013674, 1980.\n\n[7] J. Friedman, T. Hastie, H. H\u00f6fling, and R. Tibshirani. Pathwise coordinate optimization. Annals\nof Applied Statistics, 1(2):302\u2013332, 2007.\n\n[8] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical\nlasso. Biostatistics, 9(3):432\u2013441, July 2008.\n\n[9] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models\nvia coordinate descent. Journal of Statistical Software, 33(1):1\u201322, 2010.\n\n[10] E. S. Levitin and B. T. Polyak. Constrained minimization methods. U.S.S.R. Computational\nMath. and Math. Phys., 6:1\u201350, 1966.\n\n[11] L. Li and K.-C. Toh. An inexact interior point method for l1-regularized sparse covariance\nselection. Mathematical Programming Computation, 2:291\u2013315, 2010.\n\n[12] L. Meier, S. Van de Geer, and P. B\u00fchlmann. The group lasso for logistic regression. Journal\nof the Royal Statistical Society, Series B, 70:53\u201371, 2008.\n\n[13] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.\n\n[14] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating\nlinearization methods. NIPS, 2010.\n\n[15] K. Scheinberg and I. Rish. Learning sparse Gaussian Markov networks using a greedy coordinate\nascent approach. In J. Balc\u00e1zar, F. Bonchi, A. Gionis, and M. Sebag, editors, Machine\nLearning and Knowledge Discovery in Databases, volume 6323 of Lecture Notes in Computer\nScience, pages 196\u2013212. Springer Berlin / Heidelberg, 2010.\n\n[16] R. Tibshirani. 
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical\nSociety, 58:267\u2013288, 1996.\n\n[17] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization.\nMathematical Programming, 117:387\u2013423, 2007.\n\n[18] T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. The\nAnnals of Applied Statistics, 2(1):224\u2013244, 2008.\n\n[19] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. A comparison of optimization methods\nand software for large-scale l1-regularized linear classification. Journal of Machine Learning\nResearch, 11:3183\u20133234, 2010.\n\n[20] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model.\nBiometrika, 94:19\u201335, 2007.\n\n[21] S. Yun and K.-C. Toh. A coordinate gradient descent method for l1-regularized convex minimization.\nComputational Optimization and Applications, 48(2):273\u2013307, 2011.\n", "award": [], "sourceid": 1249, "authors": [{"given_name": "Cho-jui", "family_name": "Hsieh", "institution": null}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}, {"given_name": "M\u00e1ty\u00e1s", "family_name": "Sustik", "institution": null}]}