{"title": "Newton-Like Methods for Sparse Inverse Covariance Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 755, "page_last": 763, "abstract": "We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding method (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We show that quasi-Newton methods are also effective in this context, and describe a limited memory BFGS variant of the orthant-based Newton method. We present numerical results that suggest that all the techniques described in this paper have attractive properties and constitute useful tools for solving the sparse inverse covariance estimation problem. Comparisons with the method implemented in the QUIC software package are presented.", "full_text": "Newton-Like Methods for Sparse Inverse Covariance\n\nEstimation\n\nPeder A. Olsen\n\nIBM, T. J. Watson Research Center\n\npederao@us.ibm.com\n\nJorge Nocedal\n\nNorthwestern University\n\nnocedal@eecs.northwestern.edu\n\nFigen Oztoprak\nSabanci University\n\nfigen@sabanciuniv.edu\n\nSteven J. Rennie\n\nIBM, T. J. Watson Research Center\n\nsjrennie@us.ibm.com\n\nAbstract\n\nWe propose two classes of second-order optimization methods for solving the\nsparse inverse covariance estimation problem. The \ufb01rst approach, which we call\nthe Newton-LASSO method, minimizes a piecewise quadratic model of the objec-\ntive function at every iteration to generate a step. We employ the fast iterative\nshrinkage thresholding algorithm (FISTA) to solve this subproblem. The second\napproach, which we call the Orthant-Based Newton method, is a two-phase algo-\nrithm that \ufb01rst identi\ufb01es an orthant face and then minimizes a smooth quadratic ap-\nproximation of the objective function using the conjugate gradient method. These\nmethods exploit the structure of the Hessian to ef\ufb01ciently compute the search di-\nrection and to avoid explicitly storing the Hessian. We also propose a limited\nmemory BFGS variant of the orthant-based Newton method. Numerical results,\nincluding comparisons with the method implemented in the QUIC software [1],\nsuggest that all the techniques described in this paper constitute useful tools for\nthe solution of the sparse inverse covariance estimation problem.\n\n1\n\nIntroduction\n\nCovariance selection, \ufb01rst described in [2], has come to refer to the problem of estimating a nor-\nmal distribution that has a sparse inverse covariance matrix P, whose non-zero entries correspond\nto edges in an associated Gaussian Markov Random Field, [3]. A popular approach to covariance\nselection is to maximize an (cid:96)1 penalized log likelihood objective, [4]. This approach has also been\napplied to related problems, such as sparse multivariate regression with covariance estimation, [5],\nand covariance selection under a Kronecker product structure, [6]. In this paper, we consider the\nsame objective function as in these papers, and present several Newton-like algorithms for minimiz-\ning it.\nFollowing [4, 7, 8], we state the problem as\n\nP\u2217 = arg max\n\nlog det(P) \u2212 trace(SP) \u2212 \u03bb(cid:107)vec(P)(cid:107)1,\n\n(1)\n\nwhere \u03bb is a (\ufb01xed) regularization parameter,\n\nP(cid:31)0\n\n(cid:80)N\ni=1(xi \u2212 \u00b5)(xi \u2212 \u00b5)T\n\n(2)\nis the empirical sample covariance, \u00b5 is known, the xi \u2208 Rn are assumed to be independent,\nidentically distributed samples, and vec(P) de\ufb01nes a vector in Rn2 obtained by stacking the columns\nof P. We recast (1) as the minimization problem\n\nS = 1\nN\n\nF (P) def= L(P) + \u03bb(cid:107)vec(P)(cid:107)1,\n\nmin\nP(cid:31)0\n\n(3)\n\n1\n\n\fwhere L is the negative log likelihood function\n\nL(P) = \u2212log det(P) + trace(SP).\n\nThe convex problem (3) has a unique solution P\u2217 that satis\ufb01es the optimality conditions [7]\n\nS \u2212 [P\u2217]\u22121 + \u03bbZ\u2217 = 0,\n\n\uf8f1\uf8f2\uf8f3 1\n\n\u22121\n\u03b1 \u2208 [\u22121, 1]\n\nif P \u2217\nif P \u2217\nif P \u2217\n\nij > 0\nij < 0\nij = 0.\n\nwhere\n\nZ\u2217\nij =\n\nWe note that Z\u2217 solves the dual problem\nZ\u2217 = arg max(cid:107)vec(Z)(cid:107)\u221e\u22641\nS+\u03bbZ(cid:31)0\n\nU (Z),\n\nU (Z) = \u2212log det(S + \u03bbZ) + n.\n\nThe main contribution of this paper is to propose two classes of second-order methods for solving\nproblem (3). The \ufb01rst class employs a piecewise quadratic model in the step computation, and can be\nseen as a generalization of the sequential quadratic programming method for nonlinear programming\n[9]; the second class minimizes a smooth quadratic model of F over a chosen orthant face in Rn2.\nWe argue that both types of methods constitute useful tools for solving the sparse inverse covariance\nmatrix estimation problem.\nAn overview of recent work on the sparse inverse covariance estimation problem is given in [10, 11].\nFirst-order methods proposed include block-coordinate descent approaches, such as COVSEL, [4, 8]\nand GLASSO [12], greedy coordinate descent, known as SINCO [13], projected subgradient methods\nPSM [14], \ufb01rst order optimal gradient ascent [15], and the alternating linearization method ALM\n[16]. Second-order methods include the inexact interior point method IPM proposed in [17], and the\ncoordinate relaxation method described in [1] and implemented in the QUIC software. It is reported\nin [1] that QUIC is signi\ufb01cantly faster than the ALM, GLASSO, PSM, SINCO and IPM methods. We\ncompare the algorithms presented in this paper to the method implemented in QUIC.\n\n2 Newton Methods\n\nWe can de\ufb01ne a Newton iteration for problem (1) by constructing a quadratic, or piecewise\nquadratic, model of F using \ufb01rst and second derivative information. It is well known [4] that the\nderivatives of the log likelihood function (4) are given by\n\ng def= L(cid:48)(P) = vec(S \u2212 P\u22121)\n\n(7)\nwhere \u2297 denotes the Kronecker product. There are various ways of using these quantities to de\ufb01ne\na model of F , and each gives rise to a different Newton-like iteration.\nIn the Newton-LASSO Method, we approximate the objective function F at the current iterate Pk\nby the piecewise quadratic model\n\nand\n\nH def= L(cid:48)(cid:48)(P) = (P\u22121 \u2297 P\u22121),\n\nk vec(P \u2212 Pk) + 1\n\nqk(P) = L(Pk) + g(cid:62)\n\n2 vec(cid:62)(P \u2212 Pk)Hkvec(P \u2212 Pk) + \u03bb(cid:107)vec(P)(cid:107)1,\n\n(8)\nwhere gk = L(cid:48)(Pk), and similarly for Hk. The trial step of the algorithm is computed as a mini-\nmizer of this model, and a backtracking line search ensures that the new iterate lies in the positive\nde\ufb01nite cone and decreases the objective function F . We note that the minimization of qk is often\ncalled the LASSO problem [18] in the case when the unknown is a vector.\nIt is advantageous to perform the minimization of (8) in a reduced space; see e.g. [11] and the\nreferences therein. Speci\ufb01cally, at the beginning of the k-th iteration we de\ufb01ne the set Fk of (free)\nvariables that are allowed to move, and the active set Ak. To do so, we compute the steepest descent\nfor the function F , which is given by\n\n\u2212(gk + \u03bbZk),\n\n(4)\n\n(5)\n\n(6)\n\nwhere\n\n(Zk)ij =\n\n\uf8f1\uf8f4\uf8f4\uf8f4\uf8f2\uf8f4\uf8f4\uf8f4\uf8f3\n\n1\n\u22121\n\u22121\n1\n\u03bb [gk]ij\n\n\u2212 1\n\nif (Pk)ij > 0\nif (Pk)ij < 0\nif (Pk)ij = 0 and [gk]ij > \u03bb\nif (Pk)ij = 0 and [gk]ij < \u2212\u03bb\nif (Pk)ij = 0 and | [gk]ij| \u2264 \u03bb.\n\n(9)\n\n2\n\n\fThe sets Fk,Ak are obtained by considering a small step along this steepest descent direction, as this\nguarantees descent in qk(P). For variables satisfying the last condition in (9), a small perturbation\nof Pij will not decrease the model qk. This suggests de\ufb01ning the active and free sets of variables at\niteration k as\nAk = {(i, j)|(Pk)ij = 0 and |[gk]ij| \u2264 \u03bb},\n(10)\nThe algorithm minimizes the model qk over the set of free variables. Let us de\ufb01ne pF = vec(P)F ,\nto be the free variables, and let pkF = vecF (Pk) denote their value at the current iterate \u2013 and\nsimilarly for other quantities. Let us also de\ufb01ne HkF to be the matrix obtained by removing from\nHk the columns and rows corresponding to the active variables (with indices in Ak). The reduced\nmodel is given by\n\nFk = {(i, j)|(Pk)ij (cid:54)= 0 or |[gk]ij| > \u03bb}.\n\nqF (P) = L(Pk) + g(cid:62)\n\nThe search direction d is de\ufb01ned by\n\nkF (pF \u2212 pkF ) + 1\n2 (pF \u2212 pkF )(cid:62)HkF (pF \u2212 pkF ) + \u03bb(cid:107)pF(cid:107)1.\n(cid:21)\n\n(cid:20) \u02c6pF \u2212 pkF\n\n(cid:20) dF\n\n(cid:21)\n\nd =\n\n=\n\ndA\n\n0\n\n,\n\nwhere \u02c6pF is the minimizer of (11). The algorithm performs a line search along the direction D =\nmat(d), where the operator mat(d) satis\ufb01es mat(vec(D)) = D. The line search starts with the\nunit steplength and backtracks, if necessary, to obtain a new iterate Pk+1 that satis\ufb01es the suf\ufb01cient\ndecrease condition and positive de\ufb01niteness (checked using a Cholesky factorization):\n\nF (Pk+1) \u2212 F (Pk) < \u03c3 (qF (Pk+1) \u2212 qF (Pk))\n\nand Pk+1 (cid:31) 0,\n\nwhere \u03c3 \u2208 (0, 1).\nIt is suggested in [1] that coordinate descent is the most effective iteration for solving the LASSO\nproblem (11). We claim, however, that other techniques merit careful investigation. These include\ngradient projection [19] and iterative shrinkage thresholding algorithms, such as ISTA [20] and FISTA\n[21]. In section 3 we describe a Newton-LASSO method that employs the FISTA iteration.\nConvergence properties of the Newton-LASSO method that rely on the exact solution of the LASSO\nproblem (8) are given in [22]. In practice, it is more ef\ufb01cient to solve problem (8) inexactly, as\ndiscussed in section 6. The convergence properties of inexact Newton-LASSO methods will be the\nsubject of a future study.\nThe Orthant-Based Newton method computes steps by solving a smooth quadratic approximation\nof F over an appropriate orthant \u2013 or more precisely, over an orthant face in Rn2. The choice\nof this orthant face is done, as before, by computing the steepest descent direction of F , and is\ncharacterized by the matrix Zk in (9). Speci\ufb01cally the \ufb01rst four conditions in (9) identify an orthant\nin Rn2 where variables are allowed to move, while the last condition in (9) determines the variables\nto be held at zero. Therefore, the sets of free and active variables are de\ufb01ned as in (10). If we de\ufb01ne\nzF = vecF (Z), then the quadratic model of F over the current orthant face is given by\n(pF \u2212 pkF )(cid:62)HF (pF \u2212 pkF ) + \u03bbz(cid:62)\n\nQF (P) = L(Pk) + g(cid:62)\n\nF pF .\nF = pkF \u2212 H\u22121F (gF + \u03bbzF ), and the step of the algorithm is given by\n\nThe minimizer is p\u2217\n\nF (pF \u2212 pkF ) +\n\n(14)\n\n1\n2\n\n(cid:21)\n\n(cid:20) dF\n\ndA\n\n(cid:20) p\u2217\n\nF \u2212 pkF\n\n0\n\n(cid:21)\n\n(11)\n\n(12)\n\n(13)\n\nd =\n\n=\n\n.\n\n(15)\n\nIf pkF +d lies outside the current orthant, we project it onto this orthant and perform a backtracking\nline search to obtain the new iterate Pk+1, as discussed in section 4.\nThe orthant-based Newton method therefore moves from one orthant face to another, taking advan-\ntage of the fact that F is smooth in every orthant in Rn2. In Figure 1 we compare the two methods\ndiscussed so far.\n\nThe optimality conditions (5) show that P\u2217 is diagonal when \u03bb \u2265 |Sij| for all i (cid:54)= j, and given by\n(diag(S) + \u03bbI)\u22121. This suggests that a good choice for the initial value (for any value of \u03bb > 0) is\n(16)\n\nP0 = (diag(S) + \u03bbI)\u22121.\n\n3\n\n\fMethod NL (Newton-LASSO)\nRepeat:\n\n1. Phase I: Determine the sets of \ufb01xed and\n\nfree indices Ak and Fk, using (10).\n\n2. Phase II: Compute the Newton step D\ngiven by (12), by minimizing the piece-\nwise quadratic model (11) for the free\nvariables Fk.\n\n3. Globalization: Choose Pk+1 by per-\nforming an Armijo backtracking line\nsearch starting from Pk + D.\n\n4. k \u2190 k + 1.\n\nMethod OBN (Orthant-Based Newton)\nRepeat:\n\n1. Phase I: Determine the active orthant\nface through the matrix Zk given in (9).\n2. Phase II: Compute the Newton direc-\ntion D given by (15), by minimizing\nthe smooth quadratic model (14) for the\nfree variables Fk.\n\n3. Globalization: Choose Pk+1 in the cur-\nrent orthant by a projected backtracking\nline search starting from Pk + D.\n\n4. k \u2190 k + 1.\n\nFigure 1: Two classes of Newton methods for the inverse covariance estimation problem (3).\n\nNumerical experiments indicate that this choice is advantageous for all methods considered.\nA popular orthant based method for the case when the unknown is a vector is OWL [23]; see also\n[11]. Rather than using the Hessian (7), OWL employs a quasi-Newton approximation to minimize\nthe reduced quadratic, and applies an alignment procedure to ensure descent. However, for reasons\nthat are dif\ufb01cult to justify the OWL step employs the reduced inverse Hessian (as apposed to the\ninverse of the reduced Hessian), and this can give steps of poor quality. We have dispensed with\nthe alignment as it is not needed in our experiments. The convergence properties of OBM methods\nare the subject of a future study (we note in passing that the convergence proof given in [23] is not\ncorrect).\n\n3 A Newton-LASSO Method with FISTA Iteration\n\nLet us consider a particular instance of the Newton-LASSO method that employs the Fast Iterative\nShrinkage Thresholding Algorithm FISTA [21] to solve the reduced subproblem (11). We recall that\nfor the problem\n\nwhere f is a smooth convex quadratic function, the ISTA iteration [20] is given by\n\nwhere c is a constant such that cI \u2212 f(cid:48)(cid:48)(x) (cid:31) 0, and the FISTA acceleration is given by\n\nwhere \u02c6x1 = x0, t1 = 1, ti+1 =\noperator given by\n\n/2. Here S\u03bb/c denotes the soft thresholding\n\nWe can apply the ISTA iteration (18) to the reduced quadratic in (11) starting from x0 = vecFk (X0)\n(which is not necessarily equal to pk = vecFk (Pk)). Substituting in the expressions for the \ufb01rst and\nsecond derivative in (7) gives\n\n(cid:18)\n(cid:18)\n\nxi = S\u03bb/c\n\n= S\u03bb/c\n\nvecFk ( \u02c6Xi) \u2212 1\nc\nvecFk ( \u02c6Xi) \u2212 1\nc\n\n(cid:16)\ngkFk + HkFk vecFk ( \u02c6Xi \u2212 Pk)\nvecFk (S \u2212 2P\u22121\n\u02c6XiP\u22121\nk )\n\nk + P\u22121\n\nk\n\n(cid:17)(cid:19)\n(cid:19)\n\n,\n\n4\n\nf (x) + \u03bb(cid:107)x(cid:107)1,\n\nmin\nx\u2208Rn2\n\n(cid:18)\n\n(cid:19)\n\nxi = S\u03bb/c\n\n\u02c6xi \u2212 1\nc\n\n\u2207f (\u02c6xi)\n\n,\n\n(xi \u2212 xi\u22121),\n\n\u02c6xi+1 = xi +\n\n(cid:16)\n\nti \u2212 1\nti+1\n\n(cid:17)\n\n1 +(cid:112)1 + 4t2\n(cid:26)\n\ni\n\n(S\u03c3(y))i =\n\n0\n\nyi \u2212 \u03c3sign(yi)\n\nif |yi| \u2264 \u03c3,\notherwise.\n\n(17)\n\n(18)\n\n(19)\n\n\fwhere the constant c should satisfy c > 1/(eigminPk)2. The FISTA acceleration step is given by\n(19). Let \u00afx denote the free variables part of the (approximate) solution of (11) obtained by the\nFISTA iteration. Phase I of the Newton-LASSO-FISTA method selects the free and active sets, Fk,Ak,\nas indicated by (10). Phase II, applies the FISTA iteration to the reduced problem (11), and sets\nPk+1 \u2190 mat\n\n. The computational cost of K iterations of the FISTA algorithm is O(Kn3).\n\n(cid:18)\u00afx\n\n(cid:19)\n\n0\n\n4 An Orthant-Based Newton-CG Method\n\nWe now consider an orthant-based Newton method in which a quadratic model of F is minimized\napproximately using the conjugate gradient (CG) method. This approach is attractive since, in addi-\ntion to the usual advantages of CG (optimal Krylov iteration, \ufb02exibility), each CG iteration can be\nef\ufb01ciently computed by exploiting the structure of the Hessian matrix in (7).\nPhase I of the orthant-based Newton-CG method computes the matrix Zk given in (9), which is used\nto identify an orthant face in Rn2. Variables satisfying the last condition in (9) are held at zero and\ntheir indices are assigned to the set Ak, while the rest of the variables are assigned to Fk and are\nallowed to move according to the signs of Zk: variables with (Zk)ij = 1 must remain non-negative,\nand variables with (Zk)ij = \u22121 must remain non-positive.\nHaving identi\ufb01ed the current orthant face, phase II of the method constructs the quadratic model\nQF in the free variables, and computes an approximate solution by means of the conjugate gradient\nmethod, as described in Algorithm 1.\n\nConjugate Gradient Method for Problem (14)\ninput : Gradient g, orthant indicator z, current iterate P0, maximum steps K, residual tolerance\n\n\u0001r, and the regularization parameter \u03bb.\n\noutput: Approximate Newton direction d = cg(P0, g, z, \u03bb, K)\nn = size(P0, 1) , G = mat(g) , Z = mat(z);\nA = {(i, j) : [P0]ij = 0 & |Gij| \u2264 \u03bb};\nB = P\u22121\n0 , X0 = 0n\u00d7n, x0 = vec(X0);\nR0 = \u2212(G + \u03bbZ), [R0]A \u2190 0;\nk = 0, q0 = r0 = vec(R0);\nwhile k \u2264 min(n2, K) and (cid:107)rk(cid:107) > \u0001r do\n\n( \u2234 [r0]F = vF )\n\n;\n\nQk = reshape(qk, n, n);\nYk = BQkB, [Yk]A \u2190 0, yk = vec(Yk);\n\u03b1k = r(cid:62)\nk rk\nq(cid:62)\nk yk\nxk+1 = xk + \u03b1kqk;\nrk+1 = rk \u2212 \u03b1kyk;\nr(cid:62)\nk+1rk+1\n\u03b2k =\nr(cid:62)\nk rk\nqk+1 = rk+1 + \u03b2kqk;\nk \u2190 k + 1;\n\n;\n\nend\nreturn d = xk+1\n\nAlgorithm 1: CG Method for Minimizing the Reduced Model QF .\n\nThe search direction of the method is given by D = mat(d), where d denotes the output of Al-\ngorithm 1. If the trial step Pk + D lies in the current orthant, it is the optimal solution of (14).\nOtherwise, there is at least one index such that\n(i, j) \u2208 Ak and [L(cid:48)(Pk + D)]ij /\u2208 [\u2212\u03bb, \u03bb],\n\n(i, j) \u2208 Fk and (Pk + D)ijZij < 0.\n\nor\n\nIn this case, we perform a projected line search to \ufb01nd a point in the current orthant that yields a\ndecrease in F . Let \u03a0(\u00b7) denote the orthogonal projection onto the orthant face de\ufb01ned by Zk, i.e.,\n\n(cid:26)Pij\n\n0\n\n\u03a0(Pij) =\n\nif sign(Pij) = sign(Zk)ij\notherwise.\n\n(20)\n\n5\n\n\fThe line search computes a steplength \u03b1k to be the largest member of\n{1, 1/2, . . . , 1/2i, . . .} such that\n\nF (\u03a0(Pk + \u03b1kD)) \u2264 F (Pk) + \u03c3(cid:101)\u2207F (Pk)T (\u03a0(Pk + \u03b1kD) \u2212 Pk) ,\n\nwhere \u03c3 \u2208 (0, 1) is a given constant and (cid:101)\u2207F denotes the minimum norm subgradient of F . The\n\nnew iterate is de\ufb01ned as Pk+1 = \u03a0(Pk + \u03b1kD).\nThe conjugate gradient method requires computing matrix-vector products involving the reduced\nHessian, HkF . For our problem, we have\n\nthe sequence\n\n(21)\n\nHkF (pF \u2212 pkF ) = (cid:2)Hk\n= (cid:2)P\u22121\n\n(cid:0) pF\u2212pkF\n(cid:1)(cid:3)\nk mat(cid:0) pF\u2212pkF\n\nF\n\n0\n\n(cid:1) P\u22121\n\n(cid:3)\n\n(22)\nThe second line follows from the identity (A \u2297 B)vec(C) = vec(BCA(cid:62)). The cost of performing\nK steps of the CG algorithm is O(Kn3) operations, and K = n2 steps is needed to guarantee an\nexact solution. Our practical implementation computes a small number of CG steps relative to n,\nK = O(1), and as a result the search direction is not an accurate approximation of the true Newton\nstep. However, such inexact Newton steps achieve a good balance between the computational cost\nand the quality of the direction.\n\nF .\n\n0\n\nk\n\n5 Quasi-Newton Methods\n\nThe methods considered so far employ the exact Hessian of the likelihood function L, but one can\nalso approximate it using (limited memory) quasi-Newton updating. At \ufb01rst glance it may not seem\npromising to approximate a complicated Hessian like (7) in this manner, but we will see that quasi-\nNewton updating is indeed effective, provided that we store matrices using the compact limited\nmemory representations [9].\nLet us consider an orthant-based method that minimizes the quadratic model (14), where HF is\nreplaced by a limited memory BFGS matrix, which we denote by BF . This matrix is not formed\nexplicitly, but is de\ufb01ned in terms of the difference pairs\n\nyk = gk+1 \u2212 gk,\n\nsk = vec(Pk+1 \u2212 Pk).\n\n(23)\n\nIt is shown in [24, eq(5.11)] that the minimizer of the model QF is given by\n\nF = pF + B\u22121F (\u03bbzF \u2212 gF )\np\u2217\n= 1\n\n\u03b8 (\u03bbzF \u2212 gF ) + 1\n\n\u03b82 RTF W(I \u2212 1\n\n\u22121\n\nMWT RF (\u03bbzF \u2212 gF ).\n\n\u03b8 MWT RF RTF W)\n\n(24)\nHere RF is a matrix consisting of the set of unit vectors that span the subspace of free variables, \u03b8\nis a scalar, W is an n2 \u00d7 2t matrix containing the t most recent correction pairs (23), and M is a\n2t \u00d7 2t matrix formed by inner products between the correction pairs. The total cost of computing\nF is 2t2|F| + 4t|F| operations, where |F| is the cardinality of F. Since the memory\nthe minimizer p\u2217\nparameter t in the quasi-Newton updating scheme is chosen to be a small number, say between 5\nand 20, the cost of computing the subspace minimizer (24) is quite affordable. A similar approach\nwas taken in [25] for the related constrained inverse covariance sparsity problem.\nWe have noted above that OWL, which is an orthant based quasi-Newton method does not correctly\napproximate the minimizer (24). We note also that quasi-Newton updating can be employed in\nNewton-LASSO methods, but we do not discuss this approach here for the sake of brevity.\n\n6 Numerical Experiments\nWe generated test problems by \ufb01rst creating a random sparse inverse covariance matrix1, \u03a3\u22121, and\nthen sampling data to compute a corresponding non-sparse empirical covariance matrix S. The di-\nmensions, sparsity, and conditioning of the test problems are given along with the results in Table 2.\nFor each data set, we solved problem (3) with \u03bb values in the range [0.01, 0.5]. The number of\nsamples used to compute the sample covariance matrix was 10n.\n\n1http://www.cmap.polytechnique.fr/\u02dcaspremon/CovSelCode.html, [7]\n\n6\n\n\fThe algorithms we tested are listed in Table 1. With the exception of C:QUIC, all of these algorithms\nwere implemented in MATLAB. Here NL and OBN are abbreviations for the methods in Figure 1.\nNL-Coord is a MATLAB implementation of the QUIC algorithm that follows the C-version [1]\n\nDescription\nNewton-LASSO-FISTA method\nNewton-LASSO method using coordinate descent\nOrthant-based Newton-CG method with a limit of K CG iterations\nOBN-CG-K with K=5 initially and increased by 1 every 3 iterations.\n\nAlgorithm\nNL-FISTA\nNL-Coord\nOBN-CG-K\nOBN-CG-D\nOBN-LBFGS Orthant-based quasi-Newton method (see section 5)\nALM\u2217\nC:QUIC\n\nAlternating linearization method [26].\nThe C implementation of QUIC given in [1].\n\nTable 1: Algorithms tested. \u2217For ALM, the termination criteria was changed to the (cid:96)\u221e norm and the value\nof ABSTOL was set to 10\u22126 to match the stopping criteria of the other algorithms.\n\nfaithfully. We have also used the original C-implementation of QUIC and refer to it as C:QUIC. For\nthe Alternating Linearization Method (ALM) we utilized the MATLAB software available at [26],\nwhich implements the \ufb01rst-order method described in [16]. The NL-FISTA algorithm terminated\nthe FISTA iteration when the minimum norm subgradient of the LASSO subproblem qF became\nless than 1/10 of the minimum norm subgradient of F at the previous step.\nLet us compare the computational cost of the inner iteration techniques used in the Newton-like\nmethods discussed in this paper.\n(i) Applying K steps of the FISTA iteration requires O(Kn3)\noperations. (ii) Coordinate descent, as implemented in [1], requires O(Kn|F|) operations for K\ncoordinate descent sweeps through the set of free variables; (iii) Applying KCG iterations of the CG\nmethods costs O(KCGn3) operations.\nThe algorithms were terminated when either 10n iterations were executed or the minimum norm\nsubgradient of F was suf\ufb01ciently small, i.e. when (cid:107) \u02dc\u2207F (P)(cid:107)\u221e \u2264 10\u22126. The time limit of each run\nwas set to 5000 seconds.\nThe results presented in Table 2 show that the ALM method was never the fastest algorithm, but\nnonetheless outperformed some second-order methods when the solution was less sparse. The num-\nbers in bold indicate the fastest MATLAB implementation for each problem. As for the other meth-\nods, no algorithm appears to be consistently superior to the others, and the best choice may depend\non problem characteristics. The Newton-LASSO method with coordinate descent (NL-Coord) is\nthe most ef\ufb01cient when the sparsity level is below 1%, but the methods introduced in this paper,\nNL-FISTA, OBN-CG and OBN-LBFGS, seem more robust and ef\ufb01cient for problems that are less\nsparse. Based on these results, OBN-LBFGS appears to be the best choice as a universal solver for\nthe covariance selection problem. The C implementation of the QUIC algorithm is roughly \ufb01ve times\nfaster than its Matlab counterpart (OBN-Coord). C:QUIC was best in the two sparsest conditions,\nbut not in the two densest conditions. We expect that optimized C implementations of the presented\nalgorithms will also be signi\ufb01cantly faster. Note also that the crude strategy for dynamically in-\ncreasing the number of CG-steps in OBN-CG-D was effective, and we expect it could be further\nimproved. Our focus in this paper has been on exploring optimization methods and ideas rather\nthan implementation ef\ufb01ciency. However, we believe the observed trends will hold even for highly\noptimized versions of all tested algorithms.\n\n7\n\n\f0.5\n\ntime\n\niter\n\n0.1\n\ntime\n\niter\n\n0.05\n\ntime\n\niter\n\n0.01\n\ntime\n\n7.27%\n27.38\n\n22.01\n100.63\n26.24\n15.41\n21.92\n152.76\n15.62\n\n14.86%\n16.11\n\n13.12\n19.69\n7.36\n5.22\n11.42\n32.98\n3.79\n\n6.65%\n18.23\n\n106.79\n225.59\n87.73\n51.99\n80.02\n639.49\n46.14\n\n8.18%\n11.75\n\n72.21\n79.71\n35.85\n26.87\n40.31\n255.79\n17.42\n\n1.75%\n23.71\n\n1039.08\n1178.07\n896.24\n532.15\n497.31\n>5000\n183.53\n\n1.49%\n4.72\n\n153.18\n71.55\n71.54\n75.82\n78.34\n1262.83\n24.65\n\n10\n49\n97\n34\n178\n387\n41\n\n19\n14\n27\n15\n82\n78\n13\n\n9\n24\n51\n21\n111\n252\n22\n\n7\n12\n20\n12\n67\n99\n11\n\n10\n34\n78\n27\n155\n-\n17\n\n7\n7\n6\n6\n26\n76\n8\n\n11.83%\n51.01\n\n37.04\n279.69\n70.91\n43.29\n38.23\n115.11\n35.64\n\n25.66%\n32.27\n\n34.53\n71.51\n28.40\n14.14\n23.04\n61.35\n11.91\n\n13.19%\n39.59\n\n203.07\n951.23\n198.17\n132.38\n111.49\n462.34\n186.87\n\n18.38%\n26.75\n\n156.46\n408.62\n83.42\n78.98\n82.51\n267.02\n90.62\n\n4.33%\n46.54\n\n1490.37\n>5000\n2394.95\n1038.26\n785.36\n>5000\n818.54\n\n10.51%\n17.02\n\n694.93\n1152.86\n250.11\n188.93\n232.23\n1800.67\n256.90\n\n11\n66\n257\n65\n293\n284\n58\n\n15\n21\n101\n31\n155\n149\n21\n\n12\n36\n108\n39\n178\n186\n34\n\n10\n19\n47\n27\n124\n106\n19\n\n10\n-\n203\n43\n254\n-\n40\n\n9\n13\n21\n13\n71\n106\n13\n\n12\n103\n1221\n189\n519\n574\n100\n\n13\n55\n795\n176\n455\n720\n56\n\n12\n-\n1103\n171\n548\n734\n72\n\n22\n49\n681\n148\n397\n577\n52\n\n32.48%\n118.56\n\n47.33%\n99.49\n\n106.27\n1885.89\n373.63\n275.29\n84.13\n219.80\n206.42\n\n100.90\n791.84\n240.90\n243.55\n78.33\n292.43\n103.58\n\n25.03%\n132.13\n\n36.34%\n106.34\n\n801.79\n>5000\n2026.26\n1584.14\n384.30\n1826.29\n1445.17\n\n554.08\n4837.46\n1778.88\n2055.44\n297.90\n1448.83\n1100.72\n\n14.68%\n134.54\n-\n-\n-\n-\n610\n-\n-\n31.68%\n79.61\n\n>5000\n>5000\n>5000\n>5000\n2163.12\n>5000\n>5000\n\n12\n-\n397\n110\n318\n-\n33\n\n2852.86\n>5000\n4766.69\n5007.83\n1125.67\n>5000\n3899.68\n\n\u03bb\n\n= 3.5%\n\n= 2.4%\n\nproblem\n\nalgorithm\ncard(P\u2217)\ncond(P\u2217)\nNL-FISTA\nNL-Coord\nn = 500\nCard(\u03a3\u22121) OBN-CG-5\nOBN-CG-D\nOBN-LBFGS\nALM\nC:QUIC\ncard(P\u2217)\ncond(P\u2217)\nNL-FISTA\nNL-Coord\nn = 500\nCard(\u03a3\u22121) OBN-CG-5\nOBN-CG-D\n= 20.1%\nOBN-LBFGS\nALM\nC:QUIC\ncard(P\u2217)\ncond(P\u2217)\nNL-FISTA\nNL-Coord\nn = 1000\nCard(\u03a3\u22121) OBN-CG-5\nOBN-CG-D\nOBN-LBFGS\nALM\nC:QUIC\ncard(P\u2217)\ncond(P\u2217)\nNL-FISTA\nNL-Coord\nn = 1000\nCard(\u03a3\u22121) OBN-CG-5\nOBN-CG-D\nOBN-LBFGS\nALM\nC:QUIC\ncard(P\u2217)\ncond(P\u2217)\nNL-FISTA\nNL-Coord\nn = 2000\nCard(\u03a3\u22121) OBN-CG-5\nOBN-CG-D\nOBN-LBFGS\nALM\nC:QUIC\ncard(P\u2217)\ncond(P\u2217)\nNL-FISTA\nNL-Coord\nn = 2000\nCard(\u03a3\u22121) OBN-CG-5\nOBN-CG-D\n= 18.7%\nOBN-LBFGS\nALM\nC:QUIC\n\n= 11%\n\n= 1%\n\niter\n\n8\n21\n15\n12\n47\n445\n16\n\n4\n4\n3\n3\n9\n93\n6\n\n7\n9\n9\n8\n34\n247\n10\n\n4\n4\n3\n3\n8\n113\n6\n\n0.74%\n8.24\n\n5.71\n3.86\n4.07\n3.88\n5.37\n162.96\n0.74\n\n0.21%\n3.39\n\n1.25\n0.42\n0.83\n0.84\n1.00\n35.75\n0.19\n\n0.18%\n6.22\n\n28.20\n5.23\n15.34\n15.47\n18.27\n617.63\n2.38\n\n0.10%\n4.20\n\n9.03\n2.23\n4.70\n4.61\n4.29\n283.99\n1.18\n\n0.13%\n7.41\n\n8\n14\n13\n9\n41\n\n264.94\n54.33\n187.41\n127.11\n115.13\n- >5000\n18.07\n\n11\n\n0.05%\n2.32\nP\u2217 = P0\nP\u2217 = P0\nP\u2217 = P0\nP\u2217 = P0\nP\u2217 = P0\n52\n8\n\n874.22\n10.35\n\nTable 2: Results for 5 Newton-like methods and the QUIC, ALM method.\n\n8\n\n\fReferences\n[1] C. J. Hsieh, M. A. Sustik, P. Ravikumar, and I. S. Dhillon. Sparse inverse covariance matrix estimation\n\nusing quadratic approximation. Advances in Neural Information Processing Systems (NIPS), 24, 2011.\n\n[2] A. P. Dempster. Covariance selection. Biometrics, 28:157\u201375, 1972.\n[3] J. D. Picka. Gaussian Markov random \ufb01elds: theory and applications. Technometrics, 48(1):146\u2013147,\n\n2006.\n\n[4] O. Banerjee, L. El Ghaoui, A. d\u2019Aspremont, and G. Natsoulis. Convex optimization techniques for \ufb01tting\n\nsparse Gaussian graphical models. In ICML, pages 89\u201396. ACM, 2006.\n\n[5] A.J. Rothman, E. Levina, and J. Zhu. Sparse multivariate regression with covariance estimation. Journal\n\nof Computational and Graphical Statistics, 19(4):947\u2013962, 2010.\n\n[6] T. Tsiligkaridis and A. O. Hero III. Sparse covariance estimation under Kronecker product structure. In\n\nICASSP 2006 Proceedings, pages 3633\u20133636, Kyoto, Japan, 2012.\n\n[7] O. Banerjee, L. El Ghaoui, and A. d\u2019Aspremont. Model selection through sparse maximum likelihood\nestimation for multivariate gaussian or binary data. The Journal of Machine Learning Research, 9:485\u2013\n516, 2008.\n\n[8] A. d\u2019Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection.\n\nSIAM Journal on Matrix Analysis and Applications, 30(1):56\u201366, 2008.\n\n[9] J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. 1999.\n[10] I. Rish and G. Grabarnik. ELEN E6898 Sparse Signal Modeling (Spring 2011): Lecture 7, Beyond\n\nLASSO: Other Losses (Likelihoods). https://sites.google.com/site/eecs6898sparse2011/, 2011.\n\n[11] S. Sra, S. Nowozin, and S. J. Wright. Optimization for Machine Learning. MIT Press, 2011.\n[12] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical LASSO.\n\nBiostatistics, 9(3):432, 2008.\n\n[13] K. Scheinberg and I. Rish. SINCO-a greedy coordinate ascent method for sparse inverse covariance\n\nselection problem. Technical report, IBM RC24837, 2009.\n\n[14] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proc.\n\nof the Conf. on Uncertainty in AI. Citeseer, 2008.\n\n[15] Z. Lu. Smooth optimization approach for sparse covariance selection. Arxiv preprint arXiv:0904.0687,\n\n2009.\n\n[16] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization\n\nmethods. Arxiv preprint arXiv:1011.0097, 2010.\n\n[17] L. Li and K. C. Toh. An inexact interior point method for L1-regularized sparse covariance selection.\n\nMathematical Programming Computation, 2(3):291\u2013315, 2010.\n\n[18] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society\n\nB, 58(1):267\u2013288, 1996.\n\n[19] B. T. Polyak. The conjugate gradient method in extremal problems. U.S.S.R. Computational Mathematics\n\nand Mathematical Physics, 9:94\u2013112, 1969.\n\n[20] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems\nwith a sparsity constraint. Communications on pure and applied mathematics, 57(11):1413\u20131457, 2004.\n[21] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.\n\nSIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n\n[22] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization.\n\nMathematical Programming, 117(1):387\u2013423, 2009.\n\n[23] G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In ICML, pages 33\u201340.\n\nACM, 2007.\n\n[24] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimiza-\n\ntion. SIAM Journal on Scienti\ufb01c Computing, 16(5):1190\u20131208, 1995.\n\n[25] J. Dahl, V. Roychowdhury, and L. Vandenberghe. Maximum likelihood estimation of gaussian graphical\n\nmodels: numerical implementation and topology selection. UCLA Preprint, 2005.\n\n[26] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Matlab scripts for alternating direction method\n\nof multipliers. Technical report, http://www.stanford.edu/ boyd/papers/admm/, 2012.\n\n9\n\n\f", "award": [], "sourceid": 344, "authors": [{"given_name": "Figen", "family_name": "Oztoprak", "institution": null}, {"given_name": "Jorge", "family_name": "Nocedal", "institution": null}, {"given_name": "Steven", "family_name": "Rennie", "institution": null}, {"given_name": "Peder", "family_name": "Olsen", "institution": null}]}