{"title": "Sparse Inverse Covariance Selection via Alternating Linearization Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 2101, "page_last": 2109, "abstract": "Gaussian graphical models are of great interest in statistical learning. Because the conditional independencies between different nodes correspond to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data, by solving a convex maximum likelihood problem with an $\\ell_1$-regularization term. In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem's special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an $\\epsilon$-optimal solution in $O(1/\\epsilon)$ iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms.", "full_text": "Sparse Inverse Covariance Selection via\n\nAlternating Linearization Methods\n\nKatya Scheinberg\nDepartment of ISE\nLehigh University\n\nkatyas@lehigh.edu\n\nShiqian Ma, Donald Goldfarb\n\nDepartment of IEOR\nColumbia University\n\n{sm2756,goldfarb}@columbia.edu\n\nAbstract\n\nGaussian graphical models are of great interest in statistical learning. Because the\nconditional independencies between different nodes correspond to zero entries in\nthe inverse covariance matrix of the Gaussian distribution, one can learn the struc-\nture of the graph by estimating a sparse inverse covariance matrix from sample\ndata, by solving a convex maximum likelihood problem with an \u21131-regularization\nterm. 
In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem's special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an ϵ-optimal solution in O(1/ϵ) iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms.

1 Introduction

In multivariate data analysis, graphical models such as Gaussian Markov Random Fields provide a way to discover meaningful interactions among variables. Let Y = {y(1), . . . , y(n)} be an n-dimensional random vector following an n-variate Gaussian distribution N(µ, Σ), and let G = (V, E) be a Markov network representing the conditional independence structure of N(µ, Σ). Specifically, the set of vertices V = {1, . . . , n} corresponds to the set of variables in Y, and the edge set E contains an edge (i, j) if and only if y(i) is conditionally dependent on y(j) given all remaining variables; i.e., the lack of an edge between i and j denotes the conditional independence of y(i) and y(j), which corresponds to a zero entry in the inverse covariance matrix Σ⁻¹ ([1]). Thus learning the structure of this graphical model is equivalent to the problem of learning the zero-pattern of Σ⁻¹. To estimate this sparse inverse covariance matrix, one can solve the following sparse inverse covariance selection (SICS) problem: max_{X∈S^n_{++}} log det(X) − ⟨Σ̂, X⟩ − ρ∥X∥₀, where S^n_{++} denotes the set of n × n positive definite matrices, ∥X∥₀ is the number of nonzeros in X, Σ̂ = (1/p)∑_{i=1}^p (Y_i − β̂)(Y_i − β̂)⊤ is the sample covariance matrix, β̂ = (1/p)∑_{i=1}^p Y_i is the sample mean, and Y_i is the i-th random sample of Y. 
This problem is NP-hard in general due to the combinatorial nature of the cardinality term ρ∥X∥₀ ([2]). To get a numerically tractable problem, one can replace the cardinality term ∥X∥₀ by ∥X∥₁ := ∑_{i,j} |X_{ij}|, the envelope of ∥X∥₀ over the set {X ∈ R^{n×n} : ∥X∥_∞ ≤ 1} (see [3]). This results in the convex optimization problem (see, e.g., [4, 5, 6, 7]):

min_{X∈S^n_{++}} −log det(X) + ⟨Σ̂, X⟩ + ρ∥X∥₁.  (1)

Note that (1) can be rewritten as min_{X∈S^n_{++}} max_{∥U∥_∞≤ρ} −log det X + ⟨Σ̂ + U, X⟩, where ∥U∥_∞ is the largest absolute value of the entries of U. By exchanging the order of max and min, we obtain the dual problem max_{∥U∥_∞≤ρ} min_{X∈S^n_{++}} −log det X + ⟨Σ̂ + U, X⟩, which is equivalent to

max_{W∈S^n_{++}} {log det W + n : ∥W − Σ̂∥_∞ ≤ ρ}.  (2)

Both the primal and dual problems have strictly convex objectives; hence, their optimal solutions are unique. Given a dual solution W, X = W⁻¹ is primal feasible, resulting in the duality gap

gap := ⟨Σ̂, W⁻¹⟩ + ρ∥W⁻¹∥₁ − n.  (3)

The primal and the dual SICS problems (1) and (2) are semidefinite programming problems and can be solved via interior point methods (IPMs) in polynomial time. However, the per-iteration computational cost and memory requirements of an IPM are prohibitively high for the SICS problem. Although an approximate IPM has recently been proposed for the SICS problem [8], most of the methods developed for it are first-order methods. Banerjee et al. 
[7] proposed a block coordinate descent (BCD) method to solve the dual problem (2). Their method updates one row and one column of W in each iteration by solving a convex quadratic programming problem by an IPM. The glasso method of Friedman et al. [5] is based on the same BCD approach as in [7], but it solves each subproblem as a LASSO problem by yet another coordinate descent (CD) method [9]. Sun et al. [10] proposed solving the primal problem (1) by using a BCD method. They formulate the subproblem as a min-max problem and solve it using a prox method proposed by Nemirovski [11]. The SINCO method proposed by Scheinberg and Rish [12] is a greedy CD method applied to the primal problem. All of these BCD and CD approaches lack iteration complexity bounds. They have also been shown to be inferior in practice to gradient-based approaches. A projected gradient method for solving the dual problem (2) that is considered to be state-of-the-art for SICS was proposed by Duchi et al. [13]. However, there are no iteration complexity results for it either. Variants of Nesterov's method [14, 15] have been applied to solve the SICS problem. d'Aspremont et al. [16] applied Nesterov's optimal first-order method to solve the primal problem (1) after smoothing the nonsmooth ℓ1 term, obtaining an iteration complexity bound of O(1/ϵ) for an ϵ-optimal solution, but the implementation in [16] was very slow and did not produce good results. Lu [17] solved the dual problem (2), which is a smooth problem, by Nesterov's algorithm, and improved the iteration complexity to O(1/√ϵ). However, since the practical performance of this algorithm was not attractive, Lu gave a variant (VSM) of it that exhibited better performance. The iteration complexity of VSM is unknown. Yuan [18] proposed an alternating direction method based on an augmented Lagrangian framework (see the ADAL method (8) below). 
This method also lacks complexity results. The proximal point algo-\nrithm proposed by Wang et al. in [19] requires a reformulation of the problem that increases the size\nof the problem making it impractical for solving large-scale problems. Also, there is no iteration\ncomplexity bound for this algorithm. The IPM in [8] also requires such a reformulation.\nOur contribution. In this paper, we propose an alternating linearization method (ALM) for solving\nthe primal SICS problem. An advantage of solving the primal problem is that the \u21131 penalty term in\nthe objective function directly promotes sparsity in the optimal inverse covariance matrix.\nAlthough developed independently, our method is closely related to Yuan\u2019s method [18]. Both\nmethods exploit the special form of the primal problem (1) by alternatingly minimizing one of\nthe terms of the objective function plus an approximation to the other term. The main difference\nbetween the two methods is in the construction of these approximations. As we will show, our\nmethod has a theoretically justi\ufb01ed interpretation and is based on an algorithmic framework with\ncomplexity bounds, while no complexity bound is available for Yuan\u2019s method. Also our method\nhas an intuitive interpretation from a learning perspective. Extensive numerical test results on both\nsynthetic data and real problems have shown that our ALM algorithm signi\ufb01cantly outperforms\nother existing algorithms, such as the PSM algorithm proposed by Duchi et al. [13] and the VSM\nalgorithm proposed by Lu [17]. Note that it is shown in [13] and [17] that PSM and VSM outperform\nthe BCD method in [7] and glasso in [5].\nOrganization of the paper. In Section 2 we brie\ufb02y review alternating linearization methods for\nminimizing the sum of two convex functions and establish convergence and iteration complexity\nresults. We show how to use ALM to solve SICS problems and give intuition from a learning\nperspective in Section 3. 
Finally, we present numerical results on both synthetic and real data in Section 4, comparing ALM with the PSM algorithm [13] and the VSM algorithm [17].

2 Alternating Linearization Methods

We consider here the alternating linearization method (ALM) for solving the following problem:

min F(x) ≡ f(x) + g(x),  (4)

where f and g are both convex functions. An effective way to solve (4) is to “split” f and g by introducing a new variable, i.e., to rewrite (4) as

min_{x,y} {f(x) + g(y) : x − y = 0},  (5)

and apply an alternating direction augmented Lagrangian method to it. Given a penalty parameter 1/µ, at the k-th iteration, the augmented Lagrangian method minimizes the augmented Lagrangian function

L(x, y; λ) := f(x) + g(y) − ⟨λ, x − y⟩ + (1/2µ)∥x − y∥₂²

with respect to x and y, i.e., it solves the subproblem

(x^k, y^k) := argmin_{x,y} L(x, y; λ^k),  (6)

and updates the Lagrange multiplier λ via

λ^{k+1} := λ^k − (x^k − y^k)/µ.  (7)

Since minimizing L(x, y; λ) with respect to x and y jointly is usually difficult, while doing so with respect to x and y alternatingly can often be done efficiently, the following alternating direction version of the augmented Lagrangian method (ADAL) is often advocated (see, e.g., [20, 21]):

x^{k+1} := argmin_x L(x, y^k; λ^k);  y^{k+1} := argmin_y L(x^{k+1}, y; λ^k);  λ^{k+1} := λ^k − (x^{k+1} − y^{k+1})/µ.  (8)

If we also update λ after we solve the subproblem with respect to x, we get the following symmetric version of the ADAL method:

x^{k+1} := argmin_x L(x, y^k; λ_y^k);  λ_x^{k+1} := λ_y^k − (x^{k+1} − y^k)/µ;  y^{k+1} := argmin_y L(x^{k+1}, y; λ_x^{k+1});  λ_y^{k+1} := λ_x^{k+1} − (x^{k+1} − y^{k+1})/µ.  (9)

Algorithm (9) has certain theoretical advantages when f and g are smooth. In this case, from the first-order optimality conditions for the two subproblems in (9), we have that

λ_x^{k+1} = ∇f(x^{k+1})  and  λ_y^{k+1} = −∇g(y^{k+1}).  (10)

Substituting these relations into (9), we obtain the following equivalent algorithm for solving (4), which we refer to as the alternating linearization minimization (ALM) algorithm.

Algorithm 1 Alternating linearization method (ALM) for smooth problem
Input: x⁰ = y⁰
for k = 0, 1, · · · do
  1. Solve x^{k+1} := argmin_x Q_g(x, y^k) ≡ f(x) + g(y^k) + ⟨∇g(y^k), x − y^k⟩ + (1/2µ)∥x − y^k∥₂²;
  2. Solve y^{k+1} := argmin_y Q_f(x^{k+1}, y) ≡ f(x^{k+1}) + ⟨∇f(x^{k+1}), y − x^{k+1}⟩ + (1/2µ)∥y − x^{k+1}∥₂² + g(y);
end for

Algorithm 1 can be viewed in the following way: at each iteration we construct a quadratic approximation of the function g(x) at the current iterate y^k and minimize the sum of this approximation and f(x). The approximation is based on linearizing g(x) (hence the name ALM) and adding a “prox” term (1/2µ)∥x − y^k∥₂². When µ is small enough (µ ≤ 1/L(g), where L(g) is the Lipschitz constant of ∇g), this quadratic function, g(y^k) + ⟨∇g(y^k), x − y^k⟩ + (1/2µ)∥x − y^k∥₂², is an upper approximation to g(x), which means that the reduction in the value of F(x) achieved by minimizing Q_g(x, y^k) in Step 1 is not smaller than the reduction achieved in the value of Q_g(x, y^k) itself. 
Similarly, in Step 2 we build an upper approximation to f(x) at x^{k+1}, namely f(x^{k+1}) + ⟨∇f(x^{k+1}), y − x^{k+1}⟩ + (1/2µ)∥y − x^{k+1}∥₂², and minimize the sum Q_f(x^{k+1}, y) of it and g(y).

Let us now assume that f(x) is in the class C^{1,1} with Lipschitz constant L(f), while g(x) is simply convex. Then from the first-order optimality conditions for the second minimization in (9), we have −λ_y^{k+1} ∈ ∂g(y^{k+1}), the subdifferential of g(y) at y = y^{k+1}. Hence, replacing ∇g(y^k) in the definition of Q_g(x, y^k) by −λ_y^{k+1} in (9), we obtain the following modified version of (9).

Algorithm 2 Alternating linearization method with skipping step
Input: x⁰ = y⁰
for k = 0, 1, · · · do
  1. Solve x^{k+1} := argmin_x Q(x, y^k) ≡ f(x) + g(y^k) − ⟨λ^k, x − y^k⟩ + (1/2µ)∥x − y^k∥₂²;
  2. If F(x^{k+1}) > Q(x^{k+1}, y^k) then x^{k+1} := y^k.
  3. Solve y^{k+1} := argmin_y Q_f(x^{k+1}, y);
  4. λ^{k+1} = ∇f(x^{k+1}) − (x^{k+1} − y^{k+1})/µ.
end for

Algorithm 2 is identical to the symmetric ADAL algorithm (9) as long as F(x^{k+1}) ≤ Q(x^{k+1}, y^k) at each iteration (and to Algorithm 1 if g(x) is in C^{1,1} and µ ≤ 1/max{L(f), L(g)}). If this condition fails, then the algorithm simply sets x^{k+1} ← y^k. Algorithm 2 has the following convergence property and iteration complexity bound. For a proof see the Appendix.

Theorem 2.1. Assume ∇f is Lipschitz continuous with constant L(f). For β/L(f) ≤ µ ≤ 1/L(f), where 0 < β ≤ 1, Algorithm 2 satisfies

F(y^k) − F(x*) ≤ ∥x⁰ − x*∥²/(2µ(k + k_n)), ∀k,  (11)

where x* is an optimal solution of (4) and k_n is the number of iterations up to the k-th for which F(x^{k+1}) ≤ Q(x^{k+1}, y^k). 
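Before specializing to SICS, the scheme can be sanity-checked on a small smooth instance. The sketch below is ours, not from the paper's experiments: it takes f(x) = ½∥Ax − b∥², g(x) = ½∥x∥², and µ = 1/max{L(f), L(g)}, so that both subproblems of Algorithm 1 reduce to small linear solves.

```python
import numpy as np

# Toy smooth instance of Algorithm 1; A, b, f, g and mu are our illustrative choices.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

grad_f = lambda x: A.T @ (A @ x - b)   # f(x) = 0.5 * ||A x - b||^2
grad_g = lambda y: y                   # g(y) = 0.5 * ||y||^2

L = max(np.linalg.norm(A, 2) ** 2, 1.0)  # Lipschitz constants of grad f and grad g
mu = 1.0 / L

x = y = np.zeros(2)
for _ in range(500):
    # Step 1: x^{k+1} = argmin_x f(x) + <grad g(y^k), x - y^k> + (1/2mu)||x - y^k||^2
    x = np.linalg.solve(A.T @ A + np.eye(2) / mu, A.T @ b - grad_g(y) + y / mu)
    # Step 2: y^{k+1} = argmin_y g(y) + <grad f(x^{k+1}), y - x^{k+1}> + (1/2mu)||y - x^{k+1}||^2
    y = (x / mu - grad_f(x)) / (1.0 + 1.0 / mu)

# Exact minimizer of F(x) = f(x) + g(x) for comparison
x_star = np.linalg.solve(A.T @ A + np.eye(2), A.T @ b)
```

Both x and y approach x_star; with µ at the 1/max{L(f), L(g)} threshold, each step minimizes an upper approximation of F, which is the mechanism behind Theorem 2.1.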
Thus Algorithm 2 produces a sequence which converges to the optimal solution in function value, and the number of iterations needed is O(1/ϵ) for an ϵ-optimal solution.

If g(x) is also a smooth function in the class C^{1,1} with Lipschitz constant L(g) ≤ 1/µ, then Theorem 2.1 also applies to Algorithm 1, since in this case k_n = k (i.e., no “skipping” occurs). Note that the iteration complexity bound in Theorem 2.1 can be improved. Nesterov [15, 22] proved that one can obtain an optimal iteration complexity bound of O(1/√ϵ) using only first-order information. His acceleration technique is based on using a linear combination of previous iterates to obtain a point where the approximation is built. This technique has been exploited and extended by Tseng [23], Beck and Teboulle [24], Goldfarb et al. [25] and many others. A similar technique can be adopted to derive a fast version of Algorithm 2 that has an improved complexity bound of O(1/√ϵ), while keeping the computational effort in each iteration almost unchanged. However, we do not present this method here, since when applied to the SICS problem, it did not work as well as Algorithm 2.

3 ALM for SICS

The SICS problem

min_{X∈S^n_{++}} F(X) ≡ f(X) + g(X),  (12)

where f(X) = −log det(X) + ⟨Σ̂, X⟩ and g(X) = ρ∥X∥₁, is of the same form as (4). However, in this case neither f(X) nor g(X) has a Lipschitz continuous gradient. Moreover, f(X) is only defined for positive definite matrices, while g(X) is defined everywhere. These properties of the objective function make the SICS problem especially challenging for optimization methods. Nevertheless, we can still apply (9) to solve the problem directly. Moreover, we can apply Algorithm 2 and obtain the complexity bound in Theorem 2.1 as follows.

The log det(X) term in f(X) implicitly requires that X ∈ S^n_{++}, and the gradient of f(X), which is given by −X⁻¹ + Σ̂, is not Lipschitz continuous in S^n_{++}. Fortunately, as proved in Proposition 3.1 in [17], the optimal solution of (12) satisfies X* ⪰ αI, where α = 1/(∥Σ̂∥ + nρ). Therefore, if we define C := {X ∈ S^n : X ⪰ (α/2)I}, the SICS problem (12) can be formulated as:

min_{X,Y} {f(X) + g(Y) : X − Y = 0, X ∈ C, Y ∈ C}.  (13)

We can include the constraints X ∈ C in Step 1 and Y ∈ C in Step 3 of Algorithm 2. Theorem 2.1 can then be applied as discussed in [25]. However, a difficulty now arises when performing the minimization in Y. Without the constraint Y ∈ C, only a matrix shrinkage operation is needed, but with this additional constraint the problem becomes harder to solve. Minimization in X with or without the constraint X ∈ C is accomplished by performing an SVD; hence the constraint can be easily imposed.

Instead of imposing the constraint Y ∈ C we can obtain feasible solutions by a line search on µ. We know that the constraint X ⪰ (α/2)I is not tight at the solution. Hence if we start the algorithm with X ⪰ αI and restrict the step size µ to be sufficiently small, then the iterates of the method will remain in C.

Note, however, that the bound on the Lipschitz constant of the gradient of f(X) is 1/α² and hence can be very large. It is not practical to restrict µ in the algorithm to be smaller than α², since µ determines the step size at each iteration. Hence, for a practical approach we can only claim that the theoretical convergence rate bound holds in a small neighborhood of the optimal solution. 
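For illustration, the bound α and membership in the set C can be computed as follows. This is our sketch: we take ∥Σ̂∥ to be the spectral norm, an assumption about the norm intended in Proposition 3.1 of [17].

```python
import numpy as np

def alpha_bound(Sigma_hat, rho):
    """alpha = 1 / (||Sigma_hat|| + n * rho); the optimum of (12) satisfies X* >= alpha * I.
    The spectral norm for ||Sigma_hat|| is our assumption."""
    n = Sigma_hat.shape[0]
    return 1.0 / (np.linalg.norm(Sigma_hat, 2) + n * rho)

def in_C(X, alpha):
    """Check X >= (alpha/2) * I, i.e. membership in the set C."""
    return bool(np.linalg.eigvalsh(X).min() >= alpha / 2.0)
```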
We now present a practical version of our algorithm applied to the SICS problem.

Algorithm 3 Alternating linearization method (ALM) for SICS
Input: X⁰ = Y⁰, µ₀.
for k = 0, 1, · · · do
  0. Pick µ_{k+1} ≤ µ_k.
  1. Solve X^{k+1} := argmin_{X∈C} f(X) + g(Y^k) − ⟨Λ^k, X − Y^k⟩ + (1/2µ_{k+1})∥X − Y^k∥_F²;
  2. If g(X^{k+1}) > g(Y^k) − ⟨Λ^k, X^{k+1} − Y^k⟩ + (1/2µ_{k+1})∥X^{k+1} − Y^k∥_F², then X^{k+1} := Y^k.
  3. Solve Y^{k+1} := argmin_Y f(X^{k+1}) + ⟨∇f(X^{k+1}), Y − X^{k+1}⟩ + (1/2µ_{k+1})∥Y − X^{k+1}∥_F² + g(Y);
  4. Λ^{k+1} = ∇f(X^{k+1}) − (X^{k+1} − Y^{k+1})/µ_{k+1}.
end for

We now show how to solve the two optimization problems in Algorithm 3. The first-order optimality conditions for Step 1 in Algorithm 3, ignoring the constraint X ∈ C, are:

∇f(X) − Λ^k + (X − Y^k)/µ_{k+1} = 0.  (14)

Consider V Diag(d)V⊤, the spectral decomposition of Y^k + µ_{k+1}(Λ^k − Σ̂), and let

γ_i = (d_i + √(d_i² + 4µ_{k+1}))/2, i = 1, . . . , n.  (15)

Since ∇f(X) = −X⁻¹ + Σ̂, it is easy to verify that X^{k+1} := V Diag(γ)V⊤ satisfies (14). When the constraint X ∈ C is imposed, the optimal solution changes to X^{k+1} := V Diag(γ)V⊤ with γ_i = max{α/2, (d_i + √(d_i² + 4µ_{k+1}))/2}, i = 1, . . . , n. We observe that solving (14) requires approximately the same effort (O(n³)) as is required to compute ∇f(X^{k+1}). 
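The closed-form solve of (14) via (15) can be sketched as follows (a NumPy illustration; the function name and the omission of the X ∈ C safeguard are our choices):

```python
import numpy as np

def alm_x_step(Y, Lam, Sigma_hat, mu):
    """Solve (14): -X^{-1} + Sigma_hat - Lam + (X - Y)/mu = 0,
    via one spectral decomposition of Y + mu * (Lam - Sigma_hat)."""
    d, V = np.linalg.eigh(Y + mu * (Lam - Sigma_hat))
    gamma = (d + np.sqrt(d ** 2 + 4.0 * mu)) / 2.0   # eq. (15); strictly positive
    X = V @ np.diag(gamma) @ V.T                     # always positive definite
    X_inv = V @ np.diag(1.0 / gamma) @ V.T           # reused for grad f(X) = -X_inv + Sigma_hat
    return X, X_inv
```

Imposing X ∈ C only requires replacing gamma by np.maximum(gamma, alpha / 2).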
Moreover, from the solution to (14), ∇f(X^{k+1}) is obtained with only a negligible amount of additional effort, since (X^{k+1})⁻¹ = V Diag(γ)⁻¹V⊤.

The first-order optimality conditions for Step 3 in Algorithm 3 are:

0 ∈ ∇f(X^{k+1}) + (Y − X^{k+1})/µ_{k+1} + ∂g(Y).  (16)

Since g(Y) = ρ∥Y∥₁, it is well known that the solution to (16) is given by

Y^{k+1} = shrink(X^{k+1} − µ_{k+1}(Σ̂ − (X^{k+1})⁻¹), µ_{k+1}ρ),

where the “shrinkage operator” shrink(Z, ρ) updates each element Z_{ij} of the matrix Z by the formula shrink(Z, ρ)_{ij} = sgn(Z_{ij}) · max{|Z_{ij}| − ρ, 0}.

The O(n³) complexity of Step 1, which requires a spectral decomposition, dominates the O(n²) complexity of Step 3, which requires a simple shrinkage. There is no closed-form solution for the subproblem corresponding to Y when the constraint Y ∈ C is imposed. Hence, we neither impose this constraint explicitly nor do so by a line search on µ_k, since in practice this degrades the performance of the algorithm substantially. Thus, the resulting iterates Y^k may not be positive definite, while the iterates X^k remain so. Eventually, due to the convergence of Y^k and X^k, the Y^k iterates become positive definite and the constraint Y ∈ C is satisfied.

Let us now remark on the learning-based intuition behind Algorithm 3. We recall that −Λ^k ∈ ∂g(Y^k). 
The two steps of the algorithm can be written as

X^{k+1} := argmin_{X∈C} {f(X) + (1/2µ_{k+1})∥X − (Y^k + µ_{k+1}Λ^k)∥_F²}  (17)

and

Y^{k+1} := argmin_Y {g(Y) + (1/2µ_{k+1})∥Y − (X^{k+1} − µ_{k+1}(Σ̂ − (X^{k+1})⁻¹))∥_F²}.  (18)

The SICS problem is trying to optimize two conflicting objectives: on the one hand it tries to find a covariance matrix X⁻¹ that best fits the observed data, i.e., is as close to Σ̂ as possible, and on the other hand it tries to obtain a sparse matrix X. The proposed algorithm addresses these two objectives in an alternating manner. Given an initial “guess” of the sparse matrix Y^k, we update this guess by a subgradient descent step of length µ_{k+1}: Y^k + µ_{k+1}Λ^k. Recall that −Λ^k ∈ ∂g(Y^k). Then problem (17) seeks a solution X that optimizes the first objective (best fit of the data) while adding a regularization term which imposes a Gaussian prior on X whose mean is the current guess for the sparse matrix, Y^k + µ_{k+1}Λ^k. The solution to (17) gives us a guess for the inverse covariance X^{k+1}. We again update it by taking a gradient descent step: X^{k+1} − µ_{k+1}(Σ̂ − (X^{k+1})⁻¹). Then problem (18) seeks a sparse solution Y while also imposing a Gaussian prior on Y whose mean is the guess for the inverse covariance matrix, X^{k+1} − µ_{k+1}(Σ̂ − (X^{k+1})⁻¹). Hence the sequence of X^k's is a sequence of positive definite inverse covariance matrices that converges to a sparse matrix, while the sequence of Y^k's is a sequence of sparse matrices that converges to a positive definite inverse covariance matrix.

An important question is how to pick µ_{k+1}. Theory tells us that if we pick a small enough value, then we can obtain the complexity bounds. 
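The two prox steps (17)-(18), together with the skipping test and the multiplier update, can be assembled into a barebones sketch of the whole method. This is our simplification: a fixed µ instead of a continuation strategy, illustrative data, and no stopping test.

```python
import numpy as np

def sics_alm(Sigma_hat, rho, mu=0.1, iters=5000):
    """Barebones Algorithm 3 with a fixed mu (a sketch, not the tuned practical code)."""
    n = Sigma_hat.shape[0]
    shrink = lambda Z, t: np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)
    g = lambda Z: rho * np.abs(Z).sum()
    X = Y = np.eye(n)
    Lam = np.zeros((n, n))
    for _ in range(iters):
        # Step 1: closed-form X-update via one eigendecomposition, eqs. (14)-(15)
        d, V = np.linalg.eigh(Y + mu * (Lam - Sigma_hat))
        gamma = (d + np.sqrt(d ** 2 + 4.0 * mu)) / 2.0
        Xn = V @ np.diag(gamma) @ V.T
        X_inv = V @ np.diag(1.0 / gamma) @ V.T
        # Step 2: skipping test, equivalent to F(X^{k+1}) > Q(X^{k+1}, Y^k)
        if g(Xn) > g(Y) - np.sum(Lam * (Xn - Y)) + np.sum((Xn - Y) ** 2) / (2 * mu):
            Xn, X_inv = Y, np.linalg.inv(Y)
        X = Xn
        # Step 3: gradient step on f followed by shrinkage, eq. (18)
        Y = shrink(X - mu * (Sigma_hat - X_inv), mu * rho)
        # Step 4: multiplier update, Lam = grad f(X) - (X - Y)/mu
        Lam = (Sigma_hat - X_inv) - (X - Y) / mu
    return X, Y, Lam
```

By construction of the shrinkage step, −Λ^{k+1} ∈ ∂g(Y^{k+1}) at every iteration, so ∥Λ^k∥_∞ ≤ ρ throughout the run.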
However, in practice this value is too small for the method to make fast progress. We discuss the simple strategy that we use in the next section.

4 Numerical Experiments

In this section, we present numerical results on both synthetic and real data to demonstrate the efficiency of our SICS ALM algorithm. Our codes for ALM were written in MATLAB. All numerical experiments were run in MATLAB 7.3.0 on a Dell Precision 670 workstation with an Intel Xeon(TM) 3.4GHz CPU and 6GB of RAM.

Since −Λ^k ∈ ∂g(Y^k), ∥Λ^k∥_∞ ≤ ρ; hence Σ̂ − Λ^k is a feasible solution to the dual problem (2) as long as it is positive definite. Thus the duality gap at the k-th iteration is given by:

Dgap := −log det(X^k) + ⟨Σ̂, X^k⟩ + ρ∥X^k∥₁ − log det(Σ̂ − Λ^k) − n.  (19)

We define the relative duality gap as Rel.gap := Dgap/(1 + |pobj| + |dobj|), where pobj and dobj are, respectively, the objective function values of the primal problem (12) at the point X^k and the dual problem (2) at Σ̂ − Λ^k. Defining d_k(φ(x)) ≡ max{1, φ(x^k), φ(x^{k−1})}, we measure the relative changes of the objective function value F(X) and the iterates X and Y as follows:

Frel := |F(X^k) − F(X^{k−1})|/d_k(|F(X)|),  Xrel := ∥X^k − X^{k−1}∥_F/d_k(∥X∥_F),  Yrel := ∥Y^k − Y^{k−1}∥_F/d_k(∥Y∥_F).

We terminate ALM when either

(i) Dgap ≤ ϵ_gap  or  (ii) max{Frel, Xrel, Yrel} ≤ ϵ_rel.  (20)

Note that in (19), computing log det(X^k) is easy since the spectral decomposition of X^k is already available (see (14) and (15)), but computing log det(Σ̂ − Λ^k) requires another expensive spectral decomposition. Thus, in practice, we only check (20)(i) every N_gap iterations. 
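Criteria (19)-(20) translate directly into code. A sketch (our function names; the gap check would only be run every N_gap iterations):

```python
import numpy as np

def duality_gap(X, Lam, Sigma_hat, rho):
    """Dgap of (19); assumes Sigma_hat - Lam is positive definite."""
    n = X.shape[0]
    return (-np.linalg.slogdet(X)[1] + np.sum(Sigma_hat * X) + rho * np.abs(X).sum()
            - np.linalg.slogdet(Sigma_hat - Lam)[1] - n)

def d_k(*vals):
    """d_k(phi) = max{1, phi(x^k), phi(x^{k-1})} from the relative-change tests."""
    return max(1.0, *vals)

def should_stop(F_k, F_km1, X_k, X_km1, Y_k, Y_km1, eps_rel=1e-8):
    """Criterion (20)(ii): stop when all three relative changes fall below eps_rel."""
    fro = lambda Z: np.linalg.norm(Z, 'fro')
    F_rel = abs(F_k - F_km1) / d_k(abs(F_k), abs(F_km1))
    X_rel = fro(X_k - X_km1) / d_k(fro(X_k), fro(X_km1))
    Y_rel = fro(Y_k - Y_km1) / d_k(fro(Y_k), fro(Y_km1))
    return max(F_rel, X_rel, Y_rel) <= eps_rel
```

As a check, for Σ̂ = I and ρ = 0.1 the dual optimum of (2) is W = Σ̂ − Λ = 1.1·I, and the corresponding primal point X = W⁻¹ gives a zero gap.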
We check (20)(ii) at every iteration since this is inexpensive.

A continuation strategy for updating µ is also crucial to ALM. In our experiments, we adopted the following update rule. After every N_µ iterations, we set µ := max{µ · η_µ, µ̄}; i.e., we simply reduce µ by a constant factor η_µ every N_µ iterations until a desired lower bound µ̄ is achieved.

We compare ALM (i.e., Algorithm 3 with the above stopping criteria and µ updates) with the projected subgradient method (PSM) proposed by Duchi et al. in [13] and implemented by Mark Schmidt¹, and the smoothing method (VSM)² proposed by Lu in [17], which are considered to be the state-of-the-art algorithms for solving SICS problems. The per-iteration complexity of all three algorithms is roughly the same; hence a comparison of the number of iterations is meaningful. The parameters used in PSM and VSM are set at their default values. We used the following parameter values in ALM: ϵ_gap = 10⁻³, ϵ_rel = 10⁻⁸, N_gap = 20, N_µ = 20, µ̄ = max{µ₀η_µ⁸, 10⁻⁶}, η_µ = 1/3, where µ₀ is the initial µ, which is set according to ρ; specifically, in our experiments, µ₀ = 100/ρ if ρ < 0.5, µ₀ = ρ if 0.5 ≤ ρ ≤ 10, and µ₀ = ρ/100 if ρ > 10.

4.1 Experiments on synthetic data

We randomly created test problems using a procedure proposed by Scheinberg and Rish in [12]. Similar procedures were used by Wang et al. in [19] and Li and Toh in [8]. For a given dimension n, we first created a sparse matrix U ∈ R^{n×n} with nonzero entries equal to −1 or 1 with equal probability. Then we computed S := (U ∗ U⊤)⁻¹ as the true covariance matrix; hence, S⁻¹ was sparse. We then drew p = 5n i.i.d. vectors, Y₁, . . .
, Y_p, from the Gaussian distribution N(0, S) by using the mvnrnd function in MATLAB, and computed a sample covariance matrix Σ̂ := (1/p)∑_{i=1}^p Y_i Y_i⊤.

We compared ALM with PSM [13] and VSM [17] on these randomly created data with different ρ. The PSM code was terminated using its default stopping criteria, which included (20)(i) with ϵ_gap = 10⁻³. VSM was also terminated when Dgap ≤ 10⁻³. Since PSM and VSM solve the dual problem (2), the duality gap, which is given by (3), is available without any additional spectral decompositions. The results are shown in Table 1. All CPU times reported are in seconds.

Table 1: Comparison of ALM, PSM and VSM on synthetic data

ρ = 0.1:
n    | ALM: iter, Dgap, Rel.gap, CPU | PSM: iter, Dgap, Rel.gap, CPU | VSM: iter, Dgap, Rel.gap, CPU
200  | 300, 8.70e-4, 1.51e-6, 13 | 1682, 9.99e-4, 1.74e-6, 38 | 857, 9.97e-4, 1.73e-6, 37
500  | 220, 5.55e-4, 4.10e-7, 84 | 861, 9.98e-4, 7.38e-7, 205 | 946, 9.98e-4, 7.38e-7, 377
1000 | 180, 9.92e-4, 3.91e-7, 433 | 292, 9.91e-4, 3.91e-7, 446 | 741, 9.97e-4, 3.94e-7, 1928
1500 | 199, 1.73e-3, 4.86e-7, 1405 | 419, 9.76e-4, 2.74e-7, 1975 | 802, 9.98e-4, 2.80e-7, 6340
2000 | 200, 6.13e-5, 1.35e-8, 3110 | 349, 1.12e-3, 2.46e-7, 3759 | 915, 1.00e-3, 2.20e-7, 16085

ρ = 0.5:
200  | 140, 9.80e-4, 1.15e-6, 6 | 6106, 1.00e-3, 1.18e-6, 137 | 1000, 9.99e-4, 1.18e-6, 43
500  | 100, 1.69e-4, 7.59e-8, 39 | 903, 9.90e-4, 4.46e-7, 212 | 1067, 9.99e-4, 4.50e-7, 425
1000 | 100, 9.28e-4, 2.12e-7, 247 | 489, 9.80e-4, 2.24e-7, 749 | 1039, 9.95e-4, 2.27e-7, 2709
1500 | 140, 2.17e-4, 3.39e-8, 1014 | 746, 9.96e-4, 1.55e-7, 3514 | 1191, 9.96e-4, 1.55e-7, 9405
2000 | 160, 4.70e-4, 5.60e-8, 2529 | 613, 9.96e-4, 1.18e-7, 6519 | 1640, 9.99e-4, 1.19e-7, 28779

ρ = 1.0:
200  | 180, 4.63e-4, 4.63e-7, 8 | 7536, 1.00e-3, 1.00e-6, 171 | 1296, 9.96e-4, 9.96e-7, 57
500  | 140, 4.14e-4, 1.56e-7, 55 | 2099, 9.96e-4, 3.76e-7, 495 | 1015, 9.97e-4, 3.76e-7, 406
1000 | 160, 3.19e-4, 6.07e-8, 394 | 774, 9.83e-4, 1.87e-7, 1172 | 1310, 9.97e-4, 1.90e-7, 3426
1500 | 180, 8.28e-4, 1.07e-7, 1304 | 1088, 9.88e-4, 1.27e-7, 5100 | 1484, 9.96e-4, 1.28e-7, 11749
2000 | 240, 9.58e-4, 9.37e-8, 3794 | 1158, 9.35e-4, 9.15e-8, 12310 | 2132, 9.99e-4, 9.77e-8, 37406

From Table 1 we see that on these randomly created SICS problems, ALM outperforms PSM and VSM in both accuracy and CPU time, with the performance gap increasing as ρ increases. For example, for ρ = 1.0 and n = 2000, ALM achieves Dgap = 9.58e-4 in about 1 hour and 15 minutes, while PSM and VSM need about 3 hours and 25 minutes and 10 hours and 23 minutes, respectively, to achieve similar accuracy.

¹The MATLAB code can be downloaded from http://www.cs.ubc.ca/~schmidtm/Software/PQN.html
²The MATLAB code can be downloaded from http://www.math.sfu.ca/~zhaosong

4.2 Experiments on real data

We tested ALM on real data from gene expression networks using the five data sets from [8] provided to us by Kim-Chuan Toh: (1) Lymph node status; (2) Estrogen receptor; (3) Arabidopsis thaliana; (4) Leukemia; (5) Hereditary breast cancer. See [8] and references therein for descriptions of these data sets. Table 2 presents our test results. As suggested in [8], we set ρ = 0.5. 
From Table 2 we see that ALM is much faster and provided more accurate solutions than PSM and VSM.

Table 2: Comparison of ALM, PSM and VSM on real data

prob. | n | ALM: iter, Dgap, Rel.gap, CPU | PSM: iter, Dgap, Rel.gap, CPU | VSM: iter, Dgap, Rel.gap, CPU
(1) | 587 | 60, 9.41e-6, 5.78e-9, 35 | 178, 9.22e-4, 5.67e-7, 64 | 467, 9.78e-4, 6.01e-7, 273
(2) | 692 | 80, 6.13e-5, 3.32e-8, 73 | 969, 9.94e-4, 5.38e-7, 531 | 953, 9.52e-4, 5.16e-7, 884
(3) | 834 | 100, 7.26e-5, 3.27e-8, 150 | 723, 1.00e-3, 4.50e-7, 662 | 1097, 7.31e-4, 3.30e-7, 1668
(4) | 1255 | 120, 6.69e-4, 1.97e-7, 549 | 1405, 9.89e-4, 2.91e-7, 4041 | 1740, 9.36e-4, 2.76e-7, 8568
(5) | 1869 | 160, 5.59e-4, 1.18e-7, 2158 | 1639, 9.96e-4, 2.10e-7, 14505 | 3587, 9.93e-4, 2.09e-7, 52978

4.3 Solution Sparsity

In this section, we compare the sparsity patterns of the solutions produced by ALM, PSM and VSM. For ALM, the sparsity of the solution is given by the sparsity of Y. Since PSM and VSM solve the dual problem, the primal solution X, obtained by inverting the dual solution W, is never sparse due to floating point errors. Thus it is not fair to measure the sparsity of X or a truncated version of X. Instead, we measure the sparsity of solutions produced by PSM and VSM by appealing to complementary slackness. Specifically, the (i, j)-th element of the inverse covariance matrix is deemed to be nonzero if and only if |W_{ij} − Σ̂_{ij}| = ρ. We give results for a random problem (n = 500) and the first real data set in Table 3. For each value of ρ, the first three rows show the number of nonzeros in the solution and the last three rows show the number of entries that are nonzero in the solution produced by one of the methods but are zero in the solution produced by the other method. The sparsity of the ground-truth inverse covariance matrix of the synthetic data is 6.76%. 
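The complementary-slackness test just described can be sketched as follows (our naming; the tolerance is an implementation choice we add, since floating-point entries never hit ρ exactly):

```python
import numpy as np

def dual_support(W, Sigma_hat, rho, tol=1e-6):
    """Entry (i, j) is deemed nonzero in the primal solution iff the dual bound
    constraint is active: |W_ij - Sigma_hat_ij| = rho (up to tol)."""
    return np.abs(np.abs(W - Sigma_hat) - rho) <= tol

def nnz(mask):
    """Number of entries deemed nonzero, as reported in Table 3."""
    return int(mask.sum())
```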
From Table 3 we can see that when ρ is relatively large (ρ ≥ 0.5), all three algorithms produce solutions with exactly the same sparsity patterns; only when ρ is very small are there slight differences. We note that the ROC curves depicting the trade-off between the number of true positive elements recovered and the number of false positive elements, as a function of the regularization parameter ρ, are also almost identical for the three methods.

Table 3: Comparison of sparsity of solutions produced by ALM, PSM and VSM

ρ             100     50     10      5      1    0.5     0.1    0.05    0.01
                           synthetic problem data
ALM           700   2810  11844  15324  28758  37510   63000   75566  106882
PSM           700   2810  11844  15324  28758  37510   63000   75566  106870
VSM           700   2810  11844  15324  28758  37510   63000   75568  106876
ALM vs PSM      0      0      0      0      0      0       0       2      14
PSM vs VSM      0      0      0      0      0      0       0       0       8
VSM vs ALM      0      0      0      0      0      0       0       2       2
                             real problem data
ALM           587    587    587    587    587   4617   37613   65959  142053
PSM           587    587    587    587    587   4617   37615   65957  142051
VSM           587    587    587    587    587   4617   37613   65959  142051
ALM vs PSM      0      0      0      0      0      0       0       2       2
PSM vs VSM      0      0      0      0      0      0       2       0       0
VSM vs ALM      0      0      0      0      0      0       0       0       0

Acknowledgements

We would like to thank Professor Kim-Chuan Toh for providing the data sets used in Section 4.2. The research reported here was supported in part by NSF Grants DMS 06-06712 and DMS 10-16571, ONR Grant N00014-08-1-1118 and DOE Grant DE-FG02-08ER25856.

References
[1] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[2] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995.
[3] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods.
Springer-Verlag, New York, 1993.
[4] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
[5] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.
[6] M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical model selection using ℓ1-regularized logistic regression. NIPS, 19:1465–1472, 2007.
[7] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
[8] L. Li and K.-C. Toh. An inexact interior point method for ℓ1-regularized sparse covariance selection. Preprint, 2010.
[9] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58(1):267–288, 1996.
[10] L. Sun, R. Patel, J. Liu, K. Chen, T. Wu, J. Li, E. Reiman, and J. Ye. Mining brain region connectivity for Alzheimer's disease study via sparse inverse covariance estimation. KDD'09, 2009.
[11] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2005.
[12] K. Scheinberg and I. Rish. Sinco: a greedy coordinate ascent method for the sparse inverse covariance selection problem. 2009. Preprint available at http://www.optimization-online.org/DB_HTML/2009/07/2359.html.
[13] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. Conference on Uncertainty in Artificial Intelligence (UAI 2008), 2008.
[14] Y. E. Nesterov. Smooth minimization of non-smooth functions. Math. Program. Ser. A, 103:127–152, 2005.
[15] Y. E. Nesterov.
Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
[16] A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and Applications, 30(1):56–66, 2008.
[17] Z. Lu. Smooth optimization approach for sparse covariance selection. SIAM J. Optim., 19(4):1807–1827, 2009.
[18] X. Yuan. Alternating direction methods for sparse covariance selection. 2009. Preprint available at http://www.optimization-online.org/DB_HTML/2009/09/2390.html.
[19] C. Wang, D. Sun, and K.-C. Toh. Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. Preprint, 2009.
[20] M. Fortin and R. Glowinski. Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems. North-Holland Pub. Co., 1983.
[21] R. Glowinski and P. Le Tallec. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics. SIAM, Philadelphia, Pennsylvania, 1989.
[22] Y. E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Dokl. Akad. Nauk SSSR, 269:543–547, 1983.
[23] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM J. Optim., 2008.
[24] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
[25] D. Goldfarb, S. Ma, and K. Scheinberg. Fast alternating linearization methods for minimizing the sum of two convex functions. Technical report, Department of IEOR, Columbia University, 2010.