{"title": "Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1574, "page_last": 1582, "abstract": "Sparse graphical modelling/inverse covariance selection is an important problem in machine learning and has seen significant advances in recent years. A major focus has been on methods which perform model selection in high dimensions. To this end, numerous convex $\\ell_1$ regularization approaches have been proposed in the literature. It is not however clear which of these methods are optimal in any well-defined sense. A major gap in this regard pertains to the rate of convergence of proposed optimization methods. To address this, an iterative thresholding algorithm for numerically solving the $\\ell_1$-penalized maximum likelihood problem for sparse inverse covariance estimation is presented. The proximal gradient method considered in this paper is shown to converge at a linear rate, a result which is the first of its kind for numerically solving the sparse inverse covariance estimation problem. The convergence rate is provided in closed form, and is related to the condition number of the optimal point. Numerical results demonstrating the proven rate of convergence are presented.", "full_text": "Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation\n\nDominique Guillot\nDept. of Statistics\nStanford University\nStanford, CA 94305\ndguillot@stanford.edu\n\nBala Rajaratnam\nDept. of Statistics\nStanford University\nStanford, CA 94305\nbrajarat@stanford.edu\n\nBenjamin T. Rolfs\nICME\nStanford University\nStanford, CA 94305\nbenrolfs@stanford.edu\n\nArian Maleki\nDept. of ECE\nRice University\nHouston, TX 77005\narian.maleki@rice.edu\n\nIan Wong\nDept. 
of EE and Statistics\nStanford University\nStanford, CA 94305\nianw@stanford.edu\n\nAbstract\n\nThe ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log(1/ε)) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned.\n\n1 Introduction\n\nDatasets from a wide range of modern research areas are increasingly high dimensional, which presents a number of theoretical and practical challenges. A fundamental example is the problem of estimating the covariance matrix from a dataset of n samples {X^(i)}_{i=1}^n, drawn i.i.d. from a p-dimensional, zero-mean Gaussian distribution with covariance matrix Σ ∈ S^p_++, X^(i) ∼ N_p(0, Σ), where S^p_++ denotes the space of p × p symmetric, positive definite matrices. When n ≥ p, the maximum likelihood covariance estimator Σ̂ is the sample covariance matrix S = (1/n) ∑_{i=1}^n X^(i) X^(i)T. A problem however arises when n < p, due to the rank-deficiency in S. 
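This rank deficiency is easy to check numerically; the following is a small illustrative sketch (NumPy; the dimensions p = 50 and n = 20 are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 20  # more variables than samples

# n i.i.d. samples from a p-dimensional zero-mean Gaussian (identity covariance).
X = rng.standard_normal((n, p))
S = X.T @ X / n  # p x p sample covariance

# rank(S) <= n < p, so S is singular and S^{-1} does not exist.
rank_S = np.linalg.matrix_rank(S)
```

Since rank(S) ≤ min(n, p), any sample size n < p leaves S singular.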
In this sample deficient case, common throughout several modern applications such as genomics, finance, and earth sciences, the matrix S is not invertible, and thus cannot be directly used to obtain a well-defined estimator for the inverse covariance matrix Ω := Σ⁻¹.\n\nA related problem is the inference of a Gaussian graphical model ([27, 14]), that is, a sparsity pattern in the inverse covariance matrix, Ω. Gaussian graphical models provide a powerful means of dimensionality reduction in high-dimensional data. Moreover, such models allow for discovery of conditional independence relations between random variables since, for multivariate Gaussian data, sparsity in the inverse covariance matrix encodes conditional independences. Specifically, if X = (X_i)_{i=1}^p ∈ R^p is distributed as X ∼ N_p(0, Σ), then (Σ⁻¹)_ij = Ω_ij = 0 ⟺ X_i ⊥⊥ X_j | {X_k}_{k≠i,j}, where the notation A ⊥⊥ B | C denotes the conditional independence of A and B given the set of variables C (see [27, 14]). If a dataset, even one with n ≫ p, is drawn from a normal distribution with sparse inverse covariance matrix Ω, the inverse sample covariance matrix S⁻¹ will almost surely be a dense matrix, although the estimates for those Ω_ij which are equal to 0 may be very small in magnitude. As sparse estimates of Ω are more robust than S⁻¹, and since such sparsity may yield easily interpretable models, there exists significant impetus to perform sparse inverse covariance estimation in very high dimensional, low sample size settings.\n\nBanerjee et al. 
[1] proposed performing such sparse inverse covariance estimation by solving the ℓ1-penalized maximum likelihood estimation problem,\n\nΘ*_ρ = arg min_{Θ ∈ S^p_++} − log det Θ + ⟨S, Θ⟩ + ρ‖Θ‖₁,   (1)\n\nwhere ρ > 0 is a penalty parameter, ⟨S, Θ⟩ = Tr(SΘ), and ‖Θ‖₁ = ∑_{i,j} |Θ_ij|. For ρ > 0, Problem (1) is strongly convex and hence has a unique solution, which lies in the positive definite cone S^p_++ due to the log det term, and is hence invertible. Moreover, the ℓ1 penalty induces sparsity in Θ*_ρ, as it is the closest convex relaxation of the 0−1 penalty, ‖Θ‖₀ = ∑_{i,j} I(Θ_ij ≠ 0), where I(·) is the indicator function [5]. The unique optimal point of Problem (1), Θ*_ρ, is both invertible (for ρ > 0) and sparse (for sufficiently large ρ), and can be used as an inverse covariance matrix estimator.\n\nIn this paper, a proximal gradient method for solving Problem (1) is proposed. The resulting \u201cgraphical iterative shrinkage thresholding algorithm\u201d, or G-ISTA, is shown to converge at a linear rate to Θ*_ρ; that is, its iterates Θ_t are proven to satisfy\n\n‖Θ_{t+1} − Θ*_ρ‖_F ≤ s ‖Θ_t − Θ*_ρ‖_F,   (2)\n\nfor a fixed worst-case contraction constant s ∈ (0, 1), where ‖·‖_F denotes the Frobenius norm. The convergence rate s is provided explicitly in terms of S and ρ, and importantly, is related to the condition number of Θ*_ρ.\n\nThe paper is organized as follows. Section 2 describes prior work related to the solution of Problem (1). The G-ISTA algorithm is formulated in Section 3. 
Section 4 contains the convergence proofs of this algorithm, which constitute the primary mathematical result of this paper. Numerical results are presented in Section 5, and concluding remarks are made in Section 6.\n\n2 Prior Work\n\nWhile several excellent general convex solvers exist (for example, [11] and [4]), these are not always adept at handling high dimensional problems (i.e., p > 1000). As many modern datasets have several thousands of variables, numerous authors have proposed efficient algorithms designed specifically to solve the ℓ1-penalized sparse maximum likelihood covariance estimation problem (1).\n\nThese can be broadly categorized as either primal or dual methods. Following the literature, we refer to primal methods as those which directly solve Problem (1), yielding a concentration estimate. Dual methods [1] yield a covariance matrix by solving the constrained problem,\n\nminimize_{U ∈ R^{p×p}} − log det(S + U) − p   subject to ‖U‖_∞ ≤ ρ,   (3)\n\nwhere the primal and dual variables are related by Θ = (S + U)⁻¹. Both the primal and dual problems can be solved using block methods (also known as \u201crow by row\u201d methods), which sequentially optimize one row/column of the argument at each step until convergence. The primal and dual block problems both reduce to ℓ1-penalized regressions, which can be solved very efficiently.\n\n2.1 Dual Methods\n\nA number of dual methods for solving Problem (1) have been proposed in the literature. Banerjee et al. [1] consider a block coordinate descent algorithm to solve the block dual problem, which reduces each optimization step to solving a box-constrained quadratic program. Each of these quadratic programs is equivalent to performing a \u201classo\u201d (ℓ1-regularized) regression. 
Friedman et al. [10] iteratively solve the lasso regression as described in [1], but do so using coordinate-wise descent. Their widely used solver, known as the graphical lasso (glasso), is implemented on CRAN. Global convergence rates of these block coordinate methods are unknown. D'Aspremont et al. [9] use Nesterov's smooth approximation scheme, which produces an ε-optimal solution in O(1/ε) iterations. A variant of Nesterov's smooth method is shown to have an O(1/√ε) iteration complexity in [15, 16].\n\n2.2 Primal Methods\n\nInterest in primal methods for solving Problem (1) has been growing for many reasons. One important reason stems from the fact that convergence within a certain tolerance for the dual problem does not necessarily imply convergence within the same tolerance for the primal.\n\nYuan and Lin [30] use interior point methods based on the max-det problem studied in [26]. Yuan [31] uses an alternating-direction method, while Scheinberg et al. [24] propose a similar method and show a sublinear convergence rate. Mazumder and Hastie [18] consider block-coordinate descent approaches for the primal problem, similar to the dual approach taken in [10]. Mazumder and Agarwal [17] also solve the primal problem with block-coordinate descent, but at each iteration perform a partial as opposed to complete block optimization, resulting in a decreased computational complexity per iteration. Convergence rates of these primal methods have not been considered in the literature and hence theoretical guarantees are not available. Hsieh et al. [13] propose a second-order proximal point algorithm, called QUIC, which converges superlinearly locally around the optimum.\n\n3 Methodology\n\nIn this section, the graphical iterative shrinkage thresholding algorithm (G-ISTA) for solving the primal problem (1) is presented. 
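Concretely, the objective being minimized in Problem (1) can be evaluated with a single Cholesky factorization; a minimal sketch (the function name is ours, not taken from any of the packages above):

```python
import numpy as np

def penalized_neg_log_likelihood(Theta, S, rho):
    """Objective of Problem (1): -log det(Theta) + <S, Theta> + rho*||Theta||_1."""
    # The Cholesky factorization both certifies positive definiteness
    # (it raises LinAlgError otherwise) and gives log det cheaply.
    L = np.linalg.cholesky(Theta)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -log_det + np.sum(S * Theta) + rho * np.abs(Theta).sum()

# Sanity check: at Theta = S^{-1} (for positive definite S) and rho = 0,
# the objective equals log det(S) + p.
S = np.array([[2.0, 0.3],
              [0.3, 1.0]])
val = penalized_neg_log_likelihood(np.linalg.inv(S), S, rho=0.0)
```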
A rich body of mathematical and numerical work exists for general iterative shrinkage thresholding and related methods; see, in particular, [3, 8, 19, 20, 21, 25]. A brief description is provided here.\n\n3.1 General Iterative Shrinkage Thresholding (ISTA)\n\nIterative shrinkage thresholding algorithms (ISTA) are general first-order techniques for solving problems of the form\n\nminimize_{x ∈ X} F(x) := f(x) + g(x),   (4)\n\nwhere X is a Hilbert space with inner product ⟨·,·⟩ and associated norm ‖·‖, f : X → R is a continuously differentiable, convex function, and g : X → R is a lower semi-continuous, convex function, not necessarily smooth. The function f is also often assumed to have Lipschitz-continuous gradient ∇f; that is, there exists some constant L > 0 such that\n\n‖∇f(x₁) − ∇f(x₂)‖ ≤ L‖x₁ − x₂‖   (5)\n\nfor any x₁, x₂ ∈ X.\n\nFor a given lower semi-continuous convex function g, the proximity operator of g, denoted by prox_g : X → X, is given by\n\nprox_g(x) = arg min_{y ∈ X} { g(y) + (1/2)‖x − y‖² }.   (6)\n\nIt is well known (for example, [8]) that x* ∈ X is an optimal solution of Problem (4) if and only if\n\nx* = prox_{ζg}(x* − ζ∇f(x*))   (7)\n\nfor any ζ > 0. The above characterization suggests a method for optimizing Problem (4) based on the iteration\n\nx_{t+1} = prox_{ζ_t g}(x_t − ζ_t ∇f(x_t))   (8)\n\nfor some choice of step size, ζ_t. This simple method is referred to as an iterative shrinkage thresholding algorithm (ISTA). 
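For the ℓ1 penalty used later in this paper, the proximity operator in (6) reduces to entrywise soft-thresholding, and iteration (8) becomes a single line; a minimal sketch (the names and the 1-D illustration are ours):

```python
import numpy as np

def soft_threshold(X, tau):
    """prox of tau*||.||_1: shrink every entry toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def ista_step(x, grad_f, zeta, rho):
    """One iterate of (8) with g = rho*||.||_1."""
    return soft_threshold(x - zeta * grad_f(x), zeta * rho)

# 1-D example: minimize 0.5*(x - 3)^2 + rho*|x|.  The minimizer is the
# soft-thresholded target, x* = 3 - rho, here 2.
rho, zeta = 1.0, 1.0
x = np.array([0.0])
for _ in range(25):
    x = ista_step(x, lambda v: v - 3.0, zeta, rho)
```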
For a step size ζ_t ≤ 1/L, the ISTA iterates x_t are known to satisfy\n\nF(x_t) − F(x*) ≃ O(1/t), ∀t,   (9)\n\nwhere x* is some optimal point, which is to say, they converge to the space of optimal points at a sublinear rate. If no Lipschitz constant L for ∇f is known, the same convergence result still holds for ζ_t chosen such that\n\nf(x_{t+1}) ≤ Q_{ζ_t}(x_{t+1}, x_t),   (10)\n\nwhere Q_ζ(·,·) : X × X → R is a quadratic approximation to f, defined by\n\nQ_ζ(x, y) = f(y) + ⟨x − y, ∇f(y)⟩ + (1/2ζ)‖x − y‖².   (11)\n\nSee [3] for more details.\n\n3.2 Graphical Iterative Shrinkage Thresholding (G-ISTA)\n\nThe general method described in Section 3.1 can be adapted to the sparse inverse covariance estimation Problem (1). Using the notation introduced in Problem (4), define f, g : S^p_++ → R by f(X) = − log det(X) + ⟨S, X⟩ and g(X) = ρ‖X‖₁. Both are continuous convex functions defined on S^p_++. Although the function ∇f(X) = S − X⁻¹ is not Lipschitz continuous over S^p_++, it is Lipschitz continuous within any compact subset of S^p_++ (see Lemma 2 of the Supplemental section).\n\nLemma 1 ([1, 15]). The solution of Problem (1), Θ*_ρ, satisfies αI ⪯ Θ*_ρ ⪯ βI, for\n\nα = 1/(‖S‖₂ + pρ),   β = min{ (p − α Tr(S))/ρ , γ },   (12)\n\nand\n\nγ = min{ 1^T|S⁻¹|1, (p − √p α)‖S⁻¹‖₂ − (p − 1)α } if S ∈ S^p_++, and γ = 2 · 1^T|(S + (ρ/2)I)⁻¹|1 − Tr((S + (ρ/2)I)⁻¹) otherwise,   (13)\n\nwhere I denotes the p × p dimensional identity matrix and 1 denotes the p-dimensional vector of ones.\n\nNote that f + g as defined is a continuous, strongly convex function on S^p_++. Moreover, by Lemma 2 of the supplemental section, f has a Lipschitz continuous gradient when restricted to the compact domain aI ⪯ Θ ⪯ bI. Hence, f and g as defined meet the conditions described in Section 3.1. The proximity operator of ρ‖X‖₁ for ρ > 0 is the soft-thresholding operator, η_ρ : R^{p×p} → R^{p×p}, defined entrywise by\n\n[η_ρ(X)]_{i,j} = sgn(X_{i,j})(|X_{i,j}| − ρ)₊,   (14)\n\nwhere for some x ∈ R, (x)₊ := max(x, 0) (see [8]). Finally, the quadratic approximation Q_{ζ_t} of f, as in equation (11), is given by\n\nQ_{ζ_t}(Θ_{t+1}, Θ_t) = − log det(Θ_t) + ⟨S, Θ_t⟩ + ⟨Θ_{t+1} − Θ_t, S − Θ_t⁻¹⟩ + (1/2ζ_t)‖Θ_{t+1} − Θ_t‖²_F.   (15)\n\nThe G-ISTA algorithm for solving Problem (1) is given in Algorithm 1. As in [3], the algorithm uses a backtracking line search for the choice of step size. The procedure terminates when a pre-specified duality gap is attained. The authors found that an initial estimate of Θ₀ satisfying [Θ₀]_ii = 
The authors found that an initial estimate of \u03980 satisfying [\u03980]ii =\n\n4\n\n\f(Sii + \u03c1)\u22121 works well in practice. Note also that the positive de\ufb01nite check of \u0398t+1 during Step\n(1) of Algorithm 1 is accomplished using a Cholesky decomposition, and the inverse of \u0398t+1 is\ncomputed using that Cholesky factor.\n\nAlgorithm 1: G-ISTA for Problem (1)\ninput : Sample covariance matrix S, penalty parameter \u03c1, tolerance \u03b5, backtracking constant\n\nc \u2208 (0, 1), initial step size \u03b61,0, initial iterate \u03980. Set \u2206 := 2\u03b5.\n\nwhile \u2206 > \u03b5 do\n\n(1) Line search: Let \u03b6t be the largest element of {cj\u03b6t,0}j=0,1,... so that for\n\u0398t+1 = \u03b7\u03b6t\u03c1\n\n(cid:0)\u0398t \u2212 \u03b6t(S \u2212 \u0398\u22121\n\nt )(cid:1), the following are satis\ufb01ed:\n(cid:0)\u0398t \u2212 \u03b6t(S \u2212 \u0398\u22121\nt )(cid:1)\n\nand\n\n\u0398t+1 (cid:31) 0\n\nf (\u0398t+1) \u2264 Q\u03b6t(\u0398t+1, \u0398t),\n\nfor Q\u03b6t as de\ufb01ned in (15).\n(2) Update iterate: \u0398t+1 = \u03b7\u03b6t\u03c1\n(3) Set next initial step, \u03b6t+1,0. See Section 3.2.1.\n(4) Compute duality gap:\n\n\u2206 = \u2212 log det(S + Ut+1) \u2212 p \u2212 log det \u0398t+1 + (cid:104)S, \u0398(cid:105) + \u03c1(cid:107)\u0398t+1(cid:107)1 ,\n\nwhere (Ut+1)i,j = min{max{([\u0398\u22121\n\nt+1]i,j \u2212 Si,j),\u2212\u03c1}, \u03c1}.\n\nend\noutput: \u03b5-optimal solution to problem (1), \u0398\u2217\n\n\u03c1 = \u0398t+1.\n\n3.2.1 Choice of initial step size, \u03b60\nEach iteration of Algorithm 1 requires an initial step size, \u03b60. The results of Section 4 guarantee\nthat any \u03b60 \u2264 \u03bbmin(\u0398t)2 will be accepted by the line search criteria of Step 1 in the next iteration.\nHowever, in practice this choice of step is overly cautious; a much larger step can often be taken.\nOur implementation of Algorithm 1 chooses the Barzilai-Borwein step [2]. 
This step, given by\n\nζ_{t+1,0} = Tr((Θ_{t+1} − Θ_t)(Θ_{t+1} − Θ_t)) / Tr((Θ_{t+1} − Θ_t)(Θ_t⁻¹ − Θ_{t+1}⁻¹)),   (16)\n\nis also used in the SpaRSA algorithm [29], and approximates the Hessian around Θ_{t+1}. If a certain number of maximum backtracks do not result in an accepted step, G-ISTA takes the safe step, λ_min(Θ_t)². Such a safe step can be obtained from λ_max(Θ_t⁻¹), which in turn can be quickly approximated using power iteration.\n\n4 Convergence Analysis\n\nIn this section, linear convergence of Algorithm 1 is discussed. Throughout the section, Θ_t (t = 1, 2, ...) denote the iterates of Algorithm 1, and Θ*_ρ the optimal solution to Problem (1) for ρ > 0. The minimum and maximum eigenvalues of a symmetric matrix A are denoted by λ_min(A) and λ_max(A), respectively.\n\nTheorem 1. Assume that the iterates Θ_t of Algorithm 1 satisfy aI ⪯ Θ_t ⪯ bI, ∀t, for some fixed constants 0 < a < b. If ζ_t ≤ a², ∀t, then\n\n‖Θ_{t+1} − Θ*_ρ‖_F ≤ max{ |1 − ζ_t/a²|, |1 − ζ_t/b²| } ‖Θ_t − Θ*_ρ‖_F.   (17)\n\nFurthermore,\n\n1. The step size ζ_t which yields an optimal worst-case contraction bound s(ζ_t) is ζ = 2/(a⁻² + b⁻²).\n\n2. The optimal worst-case contraction bound corresponding to ζ = 2/(a⁻² + b⁻²) is given by\n\ns(ζ) := 1 − 2/(1 + b²/a²).\n\nProof. A direct proof is given in the appendix. 
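The optimal-step claim of Theorem 1 can be sanity-checked numerically by minimizing the worst-case factor s(ζ) = max{|1 − ζ/a²|, |1 − ζ/b²|} over a grid (a small check script of ours; the values a = 0.5 and b = 2 are arbitrary):

```python
import numpy as np

a, b = 0.5, 2.0  # hypothetical eigenvalue bounds, 0 < a < b

def s_factor(zeta):
    """Worst-case contraction factor appearing in (17)."""
    return max(abs(1.0 - zeta / a**2), abs(1.0 - zeta / b**2))

zeta_opt = 2.0 / (a**-2 + b**-2)           # claimed optimal step
s_opt = 1.0 - 2.0 / (1.0 + b**2 / a**2)    # claimed optimal bound

# Brute-force minimization on a fine grid over (0, b^2].
grid = np.linspace(1e-4, b**2, 20001)
zeta_best = grid[np.argmin([s_factor(z) for z in grid])]
```

At ζ = 2/(a⁻² + b⁻²) the two absolute-value terms are equal, which is why that step balances the best and worst curvature directions.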
Note that linear convergence of proximal gradient methods for strongly convex objective functions in general has already been proven (see Supplemental section).\n\nIt remains to show that there exist constants a and b which bound the eigenvalues of Θ_t, ∀t. The existence of such constants follows directly from Theorem 1, as the Θ_t lie in the bounded domain {Θ ∈ S^p_++ : f(Θ) + g(Θ) < f(Θ₀) + g(Θ₀)}, for all t. However, it is possible to specify the constants a and b to yield an explicit rate; this is done in Theorem 2.\n\nTheorem 2. Let ρ > 0, define α and β as in Lemma 1, and assume ζ_t ≤ α², ∀t. Then the iterates Θ_t of Algorithm 1 satisfy αI ⪯ Θ_t ⪯ b′I, ∀t, with b′ = ‖Θ*_ρ‖₂ + ‖Θ₀ − Θ*_ρ‖_F ≤ β + √p(β + α).\n\nProof. See the Supplementary section.\n\nImportantly, note that the bounds of Theorem 2 depend explicitly on the bounds on Θ*_ρ, as given by Lemma 1. These eigenvalue bounds on Θ_{t+1}, along with Theorem 1, provide a closed form linear convergence rate for Algorithm 1. This rate depends only on properties of the solution.\n\nTheorem 3. Let α and β be as in Lemma 1. Then for a constant step size ζ_t := ζ < α², the iterates of Algorithm 1 converge linearly with a rate of\n\ns(ζ) = 1 − 2α²/(α² + (β + √p(β − α))²) < 1.   (18)\n\nProof. 
By Theorem 2, for ζ < α², the iterates Θ_t satisfy\n\nαI ⪯ Θ_t ⪯ (‖Θ*_ρ‖₂ + ‖Θ₀ − Θ*_ρ‖_F) I   (19)\n\nfor all t. Moreover, since αI ⪯ Θ*_ρ ⪯ βI, if αI ⪯ Θ₀ ⪯ βI (for instance, by taking Θ₀ = (S + ρI)⁻¹ or some multiple of the identity), then this can be bounded as\n\n‖Θ*_ρ‖₂ + ‖Θ₀ − Θ*_ρ‖_F ≤ β + √p ‖Θ₀ − Θ*_ρ‖₂ ≤ β + √p(β − α).   (20)\n\nTherefore,\n\nαI ⪯ Θ_t ⪯ (β + √p(β − α)) I,   (21)\n\nand the result follows from Theorem 1.\n\nRemark 1. Note that the contraction constant (equation 18) of Theorem 3 is closely related to the condition number of Θ*_ρ,\n\nκ(Θ*_ρ) = λ_max(Θ*_ρ)/λ_min(Θ*_ρ) ≤ β/α,\n\nas\n\n1 − 2α²/(α² + (β + √p(β − α))²) ≥ 1 − 2α²/(α² + β²) ≥ 1 − 2κ(Θ*_ρ)⁻².   (22)\n\nTherefore, the worst case bound becomes close to 1 as the condition number of Θ*_ρ increases.\n\n5 Numerical Results\n\nIn this section, we provide numerical results for the G-ISTA algorithm. In Section 5.2, the theoretical results of Section 4 are demonstrated. Section 5.3 compares running times of the G-ISTA, glasso [10], and QUIC [13] algorithms. 
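For reference, a minimal NumPy sketch of a G-ISTA-style iteration (a simplification of ours, not the benchmarked implementation: a constant safe step with halving replaces the Barzilai-Borwein rule and full line search, and an iterate-change stopping rule replaces the duality gap; all names are ours):

```python
import numpy as np

def soft_threshold(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def g_ista(S, rho, tol=1e-8, max_iter=5000):
    """Proximal-gradient sketch for Problem (1) (simplified G-ISTA)."""
    p = S.shape[0]
    Theta = np.diag(1.0 / (np.diag(S) + rho))           # diagonal initial iterate
    zeta = 1.0 / (np.linalg.norm(S, 2) + p * rho) ** 2  # safe step, <= alpha^2
    for _ in range(max_iter):
        grad = S - np.linalg.inv(Theta)                 # grad f = S - Theta^{-1}
        Theta_next = soft_threshold(Theta - zeta * grad, zeta * rho)
        # halve the step if the candidate left the positive definite cone
        while np.min(np.linalg.eigvalsh(Theta_next)) <= 0.0:
            zeta *= 0.5
            Theta_next = soft_threshold(Theta - zeta * grad, zeta * rho)
        done = np.linalg.norm(Theta_next - Theta, 'fro') < tol
        Theta = Theta_next
        if done:
            break
    return Theta

# Toy 3-dimensional sample covariance.
S = np.array([[1.0, 0.4, 0.0],
              [0.4, 1.0, 0.2],
              [0.0, 0.2, 1.0]])
Theta_hat = g_ista(S, rho=0.1)
```

The returned iterate stays symmetric and positive definite because soft-thresholding preserves symmetry and the step-halving loop rejects indefinite candidates.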
All algorithms were implemented in C++, and run on an Intel i7-2600k 3.40GHz × 8 core machine with 16 GB of RAM.\n\n5.1 Synthetic Datasets\n\nSynthetic data for this section was generated following the method used by [16, 17]. For a fixed p, a p dimensional inverse covariance matrix Ω was generated with off-diagonal entries drawn i.i.d. from a uniform(−1, 1) distribution. These entries were set to zero with some fixed probability (in this case, either 0.97 or 0.85, to simulate a very sparse and a somewhat sparse model). Finally, a multiple of the identity was added to the resulting matrix so that the smallest eigenvalue was equal to 1. In this way, Ω was ensured to be sparse, positive definite, and well-conditioned. Datasets of n samples were then generated by drawing i.i.d. samples from a N_p(0, Ω⁻¹) distribution. For each value of p and sparsity level of Ω, n = 1.2p and n = 0.2p were tested, to represent both the n < p and n > p cases.\n\nproblem: p = 2000, n = 400, nnz(Ω) = 3%\nρ: 0.03 | 0.06 | 0.09 | 0.12\nnnz(Ω*ρ)/κ(Ω*ρ): 27.65%/48.14 | 15.08%/20.14 | 7.24%/7.25 | 2.39%/2.32\nglasso (time/iter): 1977.92/11 | 831.69/8 | 604.42/7 | 401.59/5\nG-ISTA (time/iter): 145.60/437 | 27.05/9 | 8.05/27 | 3.19/12\nQUIC (time/iter): 1481.80/21 | 257.97/11 | 68.49/8 | 15.25/6\n\nproblem: p = 2000, n = 2400, nnz(Ω) = 3%\nρ: 0.03 | 0.06 | 0.09 | 0.12\nnnz(Ω*ρ)/κ(Ω*ρ): 14.56%/10.25 | 3.11%/2.82 | 0.91%/1.51 | 0.11%/1.18\nglasso (time/iter): 667.29/7 | 490.90/6 | 318.24/4 | 233.94/3\nG-ISTA (time/iter): 14.09/47 | 3.51/13 | 2.72/10 | 2.20/8\nQUIC (time/iter): 211.29/10 | 24.98/7 | 5.16/5 | 1.56/4\n\nproblem: p = 2000, n = 400, nnz(Ω) = 15%\nρ: 0.03 | 0.06 | 0.09 | 0.12\nnnz(Ω*ρ)/κ(Ω*ρ): 27.35%/64.22 | 15.20%/28.50 | 7.87%/11.88 | 2.94%/2.87\nglasso (time/iter): 2163.33/11 | 862.39/8 | 616.81/7 | 48.47/7\nG-ISTA (time/iter): 251.51/714 | 47.35/148 | 7.96/28 | 3.18/12\nQUIC (time/iter): 1496.98/21 | 318.57/12 | 96.25/9 | 23.62/7\n\nproblem: p = 2000, n = 2400, nnz(Ω) = 15%\nρ: 0.03 | 0.06 | 0.09 | 0.12\nnnz(Ω*ρ)/κ(Ω*ρ): 19.98%/17.72 | 5.49%/4.03 | 65.47%/1.36 | 0.03%/1.09\nglasso (time/iter): 708.15/6 | 507.04/6 | 313.88/4 | 233.16/3\nG-ISTA (time/iter): 28.23/88 | 4.08/16 | 1.95/7 | 1.13/4\nQUIC (time/iter): 301.35/10 | 491.54/17 | 4.12/5 | 1.34/4\n\nTable 1: Timing comparisons for p = 2000 dimensional datasets, generated as in Section 5.1. Above, nnz(A) is the percentage of nonzero elements of matrix A; table entries are time (seconds)/iterations.\n\n5.2 Demonstration of Convergence Rates\n\nThe linear convergence rate derived for G-ISTA in Section 4 was shown to be heavily dependent on the conditioning of the final estimator. To demonstrate these results, G-ISTA was run on a synthetic dataset, as described in Section 5.1, with p = 500 and n = 300. Regularization parameters of ρ = 0.075, 0.1, 0.125, 0.15, and 0.175 were used. Note that as ρ increases, Θ*_ρ generally becomes better conditioned. For each value of ρ, the numerical optimum was computed to a duality gap of 10⁻¹⁰ using G-ISTA. 
These values of ρ resulted in sparsity levels of 81.80%, 89.67%, 94.97%, 97.82%, and 99.11%, respectively. G-ISTA was then run again, and the Frobenius norm argument errors at each iteration were stored. These errors were plotted on a log scale for each value of ρ to demonstrate the dependence of the convergence rate on condition number. See Figure 1, which clearly demonstrates the effects of conditioning.\n\nFigure 1: Semilog plot of ‖Θ_t − Θ*_ρ‖_F vs. iteration number t, demonstrating linear convergence rates of G-ISTA, and dependence of those rates on κ(Θ*_ρ).\n\n5.3 Timing Comparisons\n\nThe G-ISTA, glasso, and QUIC algorithms were run on synthetic datasets (real datasets are presented in the Supplemental section) of varying p, n, and with different levels of regularization, ρ. All algorithms were run to ensure a fixed duality gap, here taken to be 10⁻⁵. This comparison used efficient C++ implementations of each of the three algorithms investigated. The implementation of G-ISTA was adapted from the publicly available C++ implementation of QUIC by Hsieh et al. [13]. Running times were recorded and are presented in Table 1. Further comparisons are presented in the Supplementary section.\n\nRemark 2. The three algorithms' variable ability to take advantage of multiple processors is an important detail. The times presented in Table 1 are wall times, not CPU times. The comparisons were run on a multicore processor, and it is important to note that the Cholesky decompositions and inversions required by both G-ISTA and QUIC take advantage of multiple cores. On the other hand, the p²-dimensional lasso solve of QUIC and the p-dimensional lasso solve of glasso do not. 
For this reason, and because Cholesky factorizations and inversions make up the bulk of the computation required by G-ISTA, the CPU time of G-ISTA was typically greater than its wall time by a factor of roughly 4. The CPU and wall times of QUIC were more similar; the same applies to glasso.\n\n6 Conclusion\n\nIn this paper, a proximal gradient method was applied to the sparse inverse covariance problem. Linear convergence was discussed, with a fixed closed-form rate. Numerical results have also been presented, comparing G-ISTA to the widely-used glasso algorithm and the newer, but very fast, QUIC algorithm. These results indicate that G-ISTA is competitive, in particular for values of ρ which yield sparse, well-conditioned estimators. The G-ISTA algorithm was very fast on the synthetic examples of Section 5.3, which were generated from well-conditioned models. For poorly conditioned models, QUIC is very competitive. The Supplemental section gives two real datasets which demonstrate this. For many practical applications, however, obtaining an estimator that is well-conditioned is important ([23, 28]). To conclude, although second-order methods for the sparse inverse covariance problem have recently been shown to perform well, simple first-order methods cannot be ruled out, as they can also be very competitive in many cases.\n\n[Figure 1 legend: ρ = 0.075, κ(Θ*_ρ) = 7.263; ρ = 0.1, κ(Θ*_ρ) = 3.9637; ρ = 0.125, κ(Θ*_ρ) = 2.3581; ρ = 0.15, κ(Θ*_ρ) = 1.6996; ρ = 0.175, κ(Θ*_ρ) = 1.3968]\n\nReferences\n\n[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. 
Journal of Machine Learning Research, 9:485–516, 2008.\n\n[2] Jonathan Barzilai and Jonathan M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.\n\n[3] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183–202, 2009. ISSN 1936-4954.\n\n[4] S. Becker, E.J. Candes, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3:165–218, 2010.\n\n[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[6] P. Brohan, J. J. Kennedy, I. Harris, S. F. B. Tett, and P. D. Jones. Uncertainty estimates in regional and global observed temperature changes: A new data set from 1850. Journal of Geophysical Research, 111, 2006.\n\n[7] George H.G. Chen and R.T. Rockafellar. Convergence rates in forward-backward splitting. SIAM Journal on Optimization, 7:421–444, 1997.\n\n[8] Patrick L. Combettes and Valérie R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.\n\n[9] Alexandre D'Aspremont, Onureena Banerjee, and Laurent El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and Applications, 30(1):56–66, 2008.\n\n[10] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432–441, 2008.\n\n[11] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, April 2011.\n\n[12] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1990.\n\n[13] Cho-Jui Hsieh, Matyas A. Sustik, Inderjit S. Dhillon, and Pradeep K. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. 
In Advances in Neural Information Processing Systems 24, pages 2330–2338. 2011.\n\n[14] S.L. Lauritzen. Graphical Models. Oxford Science Publications. Clarendon Press, 1996.\n\n[15] Zhaosong Lu. Smooth optimization approach for sparse covariance selection. SIAM Journal on Optimization, 19(4):1807–1827, 2009. ISSN 1052-6234. doi: http://dx.doi.org/10.1137/070695915.\n\n[16] Zhaosong Lu. Adaptive first-order methods for general sparse inverse covariance selection. SIAM Journal on Matrix Analysis and Applications, 31:2000–2016, 2010.\n\n[17] Rahul Mazumder and Deepak K. Agarwal. A flexible, scalable and efficient algorithmic framework for the primal graphical lasso. Pre-print, 2011.\n\n[18] Rahul Mazumder and Trevor Hastie. The graphical lasso: New insights and alternatives. Pre-print, 2011.\n\n[19] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.\n\n[20] Yurii Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.\n\n[21] Yurii Nesterov. Gradient methods for minimizing composite objective function. CORE discussion papers, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2007.\n\n[22] Jennifer Pittman, Erich Huang, Holly Dressman, Cheng-Fang F. Horng, Skye H. Cheng, Mei-Hua H. Tsou, Chii-Ming M. Chen, Andrea Bild, Edwin S. Iversen, Andrew T. Huang, Joseph R. Nevins, and Mike West. Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proceedings of the National Academy of Sciences of the United States of America, 101(22):8431–8436, 2004.\n\n[23] Benjamin T. Rolfs and Bala Rajaratnam. A note on the lack of symmetry in the graphical lasso. Computational Statistics and Data Analysis, 2012.\n\n[24] Katya Scheinberg, Shiqian Ma, and Donald Goldfarb. 
Sparse inverse covariance selection via alternating linearization methods. In Advances in Neural Information Processing Systems 23, pages 2101–2109. 2010.\n\n[25] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.\n\n[26] Lieven Vandenberghe, Stephen Boyd, and Shao-Po Wu. Determinant maximization with linear matrix inequality constraints. SIAM Journal on Matrix Analysis and Applications, 19:499–533, 1996.\n\n[27] J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 1990.\n\n[28] J. Won, J. Lim, S. Kim, and B. Rajaratnam. Condition number regularized covariance estimation. Journal of the Royal Statistical Society Series B, 2012.\n\n[29] Stephen J. Wright, Robert D. Nowak, and Mário A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.\n\n[30] Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.\n\n[31] X.M. Yuan. Alternating direction method of multipliers for covariance selection models. Journal of Scientific Computing, pages 1–13, 2010.\n", "award": [], "sourceid": 740, "authors": [{"given_name": "Benjamin", "family_name": "Rolfs", "institution": null}, {"given_name": "Bala", "family_name": "Rajaratnam", "institution": null}, {"given_name": "Dominique", "family_name": "Guillot", "institution": null}, {"given_name": "Ian", "family_name": "Wong", "institution": null}, {"given_name": "Arian", "family_name": "Maleki", "institution": null}]}