{"title": "Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \\boldmath$\\ell_1$-regularized MLE", "book": "Advances in Neural Information Processing Systems", "page_first": 1329, "page_last": 1336, "abstract": null, "full_text": "Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of 1-regularized MLE\nPradeep Ravikumar , Garvesh Raskutti , Martin J. Wainwright and Bin Yu Department of Statistics , Department of EECS , University of California, Berkeley {pradeepr,garveshr,wainwright,binyu}@stat.berkeley.edu\n\nAbstract\nWe consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of study the performance of the 1 -regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the 1 -regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = (d2 log(p)), with the error decaying as O(exp(-c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.\n\n1 Introduction\nThe area of high-dimensional statistics deals with estimation in the \"large p, small n\" setting, where p and n correspond, respectively, to the dimensionality of the data and the sample size. Such highdimensional problems arise in a variety of applications, among them remote sensing, computational biology and natural language processing, where the model dimension may be comparable or substantially larger than the sample size. 
It is well known that such high-dimensional scaling can lead to dramatic breakdowns in many classical procedures. In the absence of additional model assumptions, it is frequently impossible to obtain consistent procedures when $p \\gg n$. Accordingly, an active line of statistical research is based on imposing various restrictions on the model---for instance, sparsity, manifold structure, or graphical model structure---and then studying the scaling behavior of different estimators as a function of sample size $n$, ambient dimension $p$ and additional parameters related to these structural assumptions. In this paper, we study the problem of estimating the graph structure of a Gauss-Markov random field (GMRF) in the high-dimensional setting. This graphical model selection problem can be reduced to the problem of estimating the zero-pattern of the inverse covariance or concentration matrix $\\Theta^*$. A line of recent work [1, 2, 3, 4] has studied estimators based on minimizing the Gaussian log-likelihood penalized by the $\\ell_1$ norm of the entries (or the off-diagonal entries) of the concentration matrix. The resulting optimization problem is a log-determinant program, which can be solved in polynomial time with interior point methods [5], or by faster coordinate descent algorithms [3, 4]. In recent work, Rothman et al. [1] have analyzed some aspects of high-dimensional behavior, in particular establishing consistency in Frobenius norm under certain conditions on the model covariance and under certain scalings of the sparsity, sample size, and ambient model dimension. The main contribution of this paper is to provide sufficient conditions for model selection consistency of the $\\ell_1$-regularized Gaussian MLE. It is worth noting that such a consistency result for structure learning of Gaussian graphical models cannot be derived from Frobenius norm consistency alone. For any concentration matrix $\\Theta$, denote the set of its non-zero off-diagonal entries\n\n\f\nby $E(\\Theta) = \\{(s, t) \\mid s \\neq t, \\; \\Theta_{st} \\neq 0\\}$. 
(As will be clarified below, the notation $E$ alludes to the fact that this set corresponds to the edges in the graph defining the GMRF.) Under certain technical conditions to be specified, we prove that the $\\ell_1$-regularized (on the off-diagonal entries of $\\Theta$) Gaussian MLE recovers this edge set with high probability, meaning that $P[E(\\hat{\\Theta}) = E(\\Theta^*)] \\to 1$. In many applications of graphical models (e.g., protein networks, social network analysis), it is this edge structure itself, as opposed to the weights $\\Theta^*_{st}$ on the edges, that is of primary interest. Moreover, we note that model selection consistency is useful even when one is interested in convergence in spectral or Frobenius norm; indeed, having extracted the set $E(\\Theta^*)$, we could then restrict to this subset, and estimate the non-zero entries of $\\Theta^*$ at the faster rates applicable to the reduced dimension. The remainder of this paper is organized as follows. In Section 2, we state our main result, discuss its connections to related work, and some of its consequences. Section 3 provides an outline of the proof. In Section 4, we provide some simulations that illustrate our results.\n\nNotation: For the convenience of the reader, we summarize here notation to be used throughout the paper. Given a vector $u \\in \\mathbb{R}^d$ and parameter $a \\in [1, \\infty]$, we use $\\|u\\|_a$ to denote the usual $\\ell_a$ norm. Given a matrix $U \\in \\mathbb{R}^{p \\times p}$ and parameters $a, b \\in [1, \\infty]$, we use $|||U|||_{a,b}$ to denote the induced matrix-operator norm $\\max_{\\|y\\|_a = 1} \\|U y\\|_b$; see [6] for background. Three cases of particular importance in this paper are the spectral norm $|||U|||_2$, corresponding to the maximal singular value of $U$; the $\\ell_\\infty/\\ell_\\infty$-operator norm, given by\n$|||U|||_\\infty := \\max_{j = 1, \\ldots, p} \\sum_{k=1}^p |U_{jk}|$, (1)\nand the $\\ell_1/\\ell_1$-operator norm, given by $|||U|||_1 = |||U^T|||_\\infty$. Finally, we use $\\|U\\|_\\infty$ to denote the element-wise maximum $\\max_{i,j} |U_{ij}|$; note that this is not a matrix norm, but rather a norm on the vectorized form of the matrix. 
For any matrix $U \\in \\mathbb{R}^{p \\times p}$, we use $\\mathrm{vec}(U) \\in \\mathbb{R}^{p^2}$ to denote its vectorized form, obtained by stacking up the rows of $U$. We use $\\langle U, V \\rangle := \\sum_{i,j} U_{ij} V_{ij}$ to denote the trace inner product on the space of symmetric matrices. Note that this inner product induces the Frobenius norm $|||U|||_F := \\sqrt{\\sum_{i,j} U_{ij}^2}$. Finally, for asymptotics, we use the following standard notation: we write $f(n) = O(g(n))$ if $f(n) \\leq c\\, g(n)$ for some constant $c < \\infty$, and $f(n) = \\Omega(g(n))$ if $f(n) \\geq c'\\, g(n)$ for some constant $c' > 0$. The notation $f(n) \\asymp g(n)$ means that $f(n) = O(g(n))$ and $f(n) = \\Omega(g(n))$.\n\n2 Background and statement of main result\nIn this section, we begin by setting up the problem, with some background on Gaussian MRFs and $\\ell_1$-regularization. We then state our main result, and discuss some of its consequences.\n\n2.1 Gaussian MRFs and $\\ell_1$-penalized estimation\nConsider an undirected graph $G = (V, E)$ with $p = |V|$ vertices, and let $X = (X_1, \\ldots, X_p)$ denote a $p$-dimensional Gaussian random vector, with variate $X_i$ identified with vertex $i \\in V$. A Gauss-Markov random field (MRF) is described by a density of the form\n$f(x_1, \\ldots, x_p; \\Theta) = \\frac{1}{\\sqrt{(2\\pi)^p \\det(\\Theta^{-1})}} \\exp\\left(-\\tfrac{1}{2} x^T \\Theta x\\right)$. (2)\nAs illustrated in Figure 1, Markov structure is reflected in the sparsity pattern of the inverse covariance or concentration matrix $\\Theta$, a $p \\times p$ symmetric matrix. In particular, by the Hammersley-Clifford theorem [7], it must satisfy $\\Theta_{ij} = 0$ for all $(i, j) \\notin E$. Consequently, the problem of graphical model selection is equivalent to estimating the off-diagonal zero-pattern of the concentration matrix--that is, the set $E(\\Theta) := \\{(i, j) \\in V \\times V \\mid i \\neq j, \\; \\Theta_{ij} \\neq 0\\}$.\n\nIn this paper, we study the minimizer of the $\\ell_1$-penalized Gaussian negative log-likelihood. Letting $\\langle A, B \\rangle := \\sum_{i,j} A_{ij} B_{ij}$ be the trace inner product on the space of symmetric matrices, this objective function takes the form\n$\\hat{\\Theta} = \\arg\\min_{\\Theta \\succ 0} \\left\\{ \\langle \\Theta, \\hat{\\Sigma} \\rangle - \\log\\det(\\Theta) + \\lambda_n \\|\\Theta\\|_{1,\\mathrm{off}} \\right\\} = \\arg\\min_{\\Theta \\succ 0} g(\\Theta; \\lambda_n, \\hat{\\Sigma})$. (3)\n\n\f\nFigure 1. (a) Simple undirected graph. A Gauss-Markov random field has a Gaussian variable $X_i$ associated with each vertex $i \\in V$. This graph has $p = 5$ vertices, maximum degree $d = 3$ and $s = 6$ edges. (b) Zero pattern of the inverse covariance $\\Theta^*$ associated with the GMRF in (a). The set $E(\\Theta^*)$ corresponds to the off-diagonal non-zeros (white blocks); the diagonal is also non-zero (grey squares), but these entries do not correspond to edges. The black squares correspond to non-edges, or zeros in $\\Theta^*$.\n\nHere $\\hat{\\Sigma}$ denotes the sample covariance--that is, $\\hat{\\Sigma} := \\frac{1}{n} \\sum_{k=1}^n X^{(k)} [X^{(k)}]^T$, where each $X^{(k)}$ is drawn in an i.i.d. manner according to the density (2). The quantity $\\lambda_n > 0$ is a user-defined regularization parameter, and $\\|\\Theta\\|_{1,\\mathrm{off}} := \\sum_{i \\neq j} |\\Theta_{ij}|$ is the off-diagonal $\\ell_1$ regularizer; note that it does not include the diagonal. Since the negative log-determinant is a strictly convex function [5], this problem always has a unique solution, so that there is no ambiguity in equation (3). We let $E(\\hat{\\Theta}) = \\{(i, j) \\mid i \\neq j, \\; \\hat{\\Theta}_{ij} \\neq 0\\}$ denote the edge set associated with the estimate. Of interest in this paper is studying the probability $P[E(\\hat{\\Theta}) = E(\\Theta^*)]$ as a function of the graph size $p$ (which serves as the \"model dimension\" for the Gauss-Markov model), the sample size $n$, and the structural properties of $\\Theta^*$. In particular, we define both the sparsity index\n$s := |E(\\Theta^*)| = |\\{(i, j) \\in V \\times V \\mid i \\neq j, \\; \\Theta^*_{ij} \\neq 0\\}|$, (4)\ncorresponding to the total number of edges, and the maximum degree or row cardinality\n$d := \\max_{j = 1, \\ldots, p} |\\{i \\mid \\Theta^*_{ij} \\neq 0\\}|$, (5)\ncorresponding to the maximum number of non-zeros in any row of $\\Theta^*$, or equivalently the maximum degree in the graph $G$, where we include the diagonal in the degree count.\n\n2.2 Statement of main result\nOur assumptions involve the Hessian with respect to $\\Theta$ of the objective function $g$ defined in equation (3), evaluated at the true model $\\Theta^*$. 
Using standard results on matrix derivatives [5], it can be shown that this Hessian takes the form\n$\\Gamma^* := \\nabla^2 g(\\Theta^*) = (\\Theta^*)^{-1} \\otimes (\\Theta^*)^{-1}$, (6)\nwhere $\\otimes$ denotes the Kronecker matrix product. By definition, $\\Gamma^*$ is a $p^2 \\times p^2$ matrix indexed by vertex pairs, so that entry $\\Gamma^*_{(j,k),(\\ell,m)}$ corresponds to the second partial derivative $\\partial^2 g / (\\partial \\Theta_{jk} \\partial \\Theta_{\\ell m})$, evaluated at $\\Theta = \\Theta^*$. When $X$ has a multivariate Gaussian distribution, $\\Gamma^*$ is the Fisher information of the model, and by standard results on cumulant functions in exponential families [8], we have the more specific expression $\\Gamma^*_{(j,k),(\\ell,m)} = \\mathrm{cov}\\{X_j X_k, \\; X_\\ell X_m\\}$. For this reason, $\\Gamma^*$ can be viewed as an edge-based counterpart to the usual covariance matrix $\\Sigma^*$. We define the set of non-zero off-diagonal entries in the model concentration matrix $\\Theta^*$:\n$S(\\Theta^*) := \\{(i, j) \\in V \\times V \\mid i \\neq j, \\; \\Theta^*_{ij} \\neq 0\\}$, (7)\nand let $\\bar{S}(\\Theta^*) = S(\\Theta^*) \\cup \\{(1, 1), \\ldots, (p, p)\\}$ be the augmented set including the diagonal. We let $S^c(\\Theta^*)$ denote the complement of $S(\\Theta^*)$ in the set $\\{1, \\ldots, p\\} \\times \\{1, \\ldots, p\\}$, corresponding to all pairs $(\\ell, m)$ for which $\\Theta^*_{\\ell m} = 0$. When it is clear from context, we shorten our notation for these sets to $S$ and $S^c$, respectively. Finally, for any two subsets $T$ and $T'$ of $V \\times V$, we use $\\Gamma^*_{T T'}$ to denote the $|T| \\times |T'|$ matrix with rows and columns of $\\Gamma^*$ indexed by $T$ and $T'$ respectively.\n\nWe require the following conditions on the Fisher information matrix $\\Gamma^*$:\n\n[A1] Incoherence condition: This condition captures the intuition that variable-pairs which are non-edges cannot exert an overly strong effect on variable-pairs which form edges of the Gaussian graphical model. For some fixed $\\alpha > 0$,\n$|||\\Gamma^*_{S^c S} (\\Gamma^*_{S S})^{-1}|||_\\infty \\leq (1 - \\alpha)$. (8)\nWe note that similar conditions arise in the analysis of the Lasso in linear regression [9, 10, 11].\n\n[A2] Covariance control: There exist constants $K_{\\Sigma^*}, K_{\\Gamma^*} < \\infty$ such that\n$|||(\\Theta^*)^{-1}|||_\\infty \\leq K_{\\Sigma^*}$, and $|||(\\Gamma^*_{S S})^{-1}|||_\\infty \\leq K_{\\Gamma^*}$. (9)\nThese assumptions require that the elements along any row of $(\\Theta^*)^{-1}$ and $(\\Gamma^*_{S S})^{-1}$ have bounded $\\ell_1$ norms. Note that similar assumptions are also required for consistency in Frobenius norm [1]. Recall from equations (4) and (5) the definitions of the sparsity index $s$ and maximum degree $d$, respectively. With this notation, we have:\n\nTheorem 1. Consider a Gaussian distribution with concentration matrix $\\Theta^*$ that satisfies conditions (A1) and (A2). Suppose the penalty is set as $\\lambda_n = C_1 \\sqrt{\\frac{\\log p}{n}}$, and the minimum edge-weight $\\theta_{\\min} := \\min_{(i,j) \\in S} |\\theta^*_{ij}|$ scales as $\\theta_{\\min} > C_2 \\sqrt{\\frac{\\log p}{n}}$ for some constants $C_1, C_2 > 0$. Further, suppose the triple $(n, d, p)$ satisfies the scaling\n$n > L\\, d^2 \\log(p)$, (10)\nfor some constant $L > 0$. Then the edge set $E(\\hat{\\Theta})$ specified by the estimator $\\hat{\\Theta}$ specifies the true edge set w.h.p.--in particular,\n$P[E(\\hat{\\Theta}) = E(\\Theta^*)] \\geq 1 - \\exp(-c \\log p) \\to 1$, (11)\nfor some constant $c > 0$.\n\nRemarks: Rothman et al. [1] prove that the error of the estimator $\\hat{\\Theta}$ in Frobenius norm obeys the bound $|||\\hat{\\Theta} - \\Theta^*|||_F^2 = O\\left(\\{(s + p) \\log p\\}/n\\right)$ with high probability. We note that model selection consistency does not follow from this result, since an estimate may be close in Frobenius norm while differing substantially in terms of zero-pattern. In one sense, the model selection criterion is more demanding, since given knowledge of the edge set $E(\\Theta^*)$, one could restrict estimation procedures to this subset, and so achieve faster rates. On the other hand, Theorem 1 requires incoherence conditions [A1] on the covariance matrix, which are not required for Frobenius norm consistency [1].\n\n2.3 Comparison to neighbor-based graphical model selection\nIt is interesting to compare the estimator (3) to the Gaussian neighborhood regression method studied by Meinshausen and Buhlmann [9], in which each node is linearly regressed with an $\\ell_1$ penalty (Lasso) on the rest of the nodes, and the location of the non-zero regression weights is taken as the neighborhood estimate of that node. 
These neighborhoods are then combined, by either an OR rule or an AND rule, to estimate the full graph. Wainwright [12] shows that the rate $n \\asymp d \\log p$ is a sharp threshold for the success/failure of neighborhood selection by the Lasso. By a union bound over the $p$ nodes, it follows that this threshold holds for the Meinshausen and Buhlmann approach as well. This is superior to the scaling in our result (10). However, the two methods rely on slightly different underlying assumptions, and the current form of the neighborhood-based approach requires solving a total of $p$ Lasso programs, as opposed to a single log-determinant problem. Below we show two cases where the Lasso irrepresentability condition holds, while the log-determinant requirement fails. However, in general, we do not know whether the log-determinant irrepresentability condition strictly dominates its analog for the Lasso.\n\n\f\n2.3.1 Illustration of irrepresentability: Diamond graph\nConsider the following Gaussian MRF example from [13]. Figure 2(a) shows a diamond-shaped graph $G = (V, E)$, with vertex set $V = \\{1, 2, 3, 4\\}$ and edge-set the fully connected graph over $V$ with the edge $(1, 4)$ removed.\n\nFigure 2: (a) Graph of the example discussed by [13]. (b) A simple 4-node star graph.\n\nThe covariance matrix is parameterized by the correlation parameter $\\rho \\in [0, 1/\\sqrt{2}]$: the diagonal entries are set to $\\Sigma^*_{ii} = 1$ for all $i \\in V$; the entries corresponding to edges are set to $\\Sigma^*_{ij} = \\rho$ for $(i, j) \\in E \\setminus \\{(2, 3)\\}$, with $\\Sigma^*_{23} = 0$; and finally the entry corresponding to the non-edge is set as $\\Sigma^*_{14} = 2\\rho^2$. For this model, [13] showed that the $\\ell_1$-regularized MLE fails to recover the graph structure for any sample size, if $\\rho > -1 + (3/2)^{1/2} \\approx 0.23$. It is instructive to compare this necessary condition to the sufficient condition provided in our analysis, namely the incoherence Assumption [A1] as applied to the Hessian $\\Gamma^*$. 
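The diamond-graph construction above can be verified directly: inverting the specified covariance should give a precision matrix that vanishes exactly at the non-edge $(1, 4)$ but not at the edge $(2, 3)$, even though $\\Sigma^*_{23} = 0$. A small numerical sketch (the choice $\\rho = 0.2$ is just an illustrative value within the allowed range):

```python
# Sketch: verify that the diamond-graph covariance encodes the claimed graph,
# i.e. the precision matrix is zero at the non-edge (1,4) (indices 0 and 3)
# and non-zero at the edge (2,3) (indices 1 and 2), despite Sigma_23 = 0.
import numpy as np

rho = 0.2
sigma = np.array([
    [1.0,        rho,  rho,  2 * rho**2],
    [rho,        1.0,  0.0,  rho       ],
    [rho,        0.0,  1.0,  rho       ],
    [2 * rho**2, rho,  rho,  1.0       ],
])
theta = np.linalg.inv(sigma)

print(abs(theta[0, 3]))   # ~0: (1,4) is a non-edge of the diamond graph
print(abs(theta[1, 2]))   # non-zero: (2,3) is an edge of the diamond graph
```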
For this particular example, a little calculation shows that Assumption [A1] is equivalent to the constraint $4|\\rho|(|\\rho| + 1) < 1$, an inequality which holds for all $\\rho \\in (-0.2017, 0.2017)$. Note that the upper value $0.2017$ is just below the necessary threshold $\\approx 0.23$ discussed by [13]. On the other hand, the irrepresentability condition for the Lasso requires only that $2|\\rho| < 1$, i.e., $\\rho \\in (-0.5, 0.5)$. Thus, in the regime $|\\rho| \\in [0.2017, 0.5)$, the Lasso irrepresentability condition holds while our log-determinant counterpart fails.\n\n2.3.2 Illustration of irrepresentability: Star graphs\nA second interesting example is the star-shaped graphical model, illustrated in Figure 2(b), which consists of a single hub node connected to the rest of the spoke nodes. We consider a four-node graph, with vertex set $V = \\{1, 2, 3, 4\\}$ and edge-set $E = \\{(1, s) \\mid s \\in \\{2, 3, 4\\}\\}$. The covariance matrix is parameterized by the correlation parameter $\\rho \\in [-1, 1]$: the diagonal entries are set to $\\Sigma^*_{ii} = 1$ for all $i \\in V$; the entries corresponding to edges are set to $\\Sigma^*_{ij} = \\rho$ for $(i, j) \\in E$; while the non-edge entries are set as $\\Sigma^*_{ij} = \\rho^2$ for $(i, j) \\notin E$. Consequently, for this particular example, Assumption [A1] reduces to the constraint $|\\rho|(|\\rho| + 2) < 1$, which holds for all $\\rho \\in (-0.414, 0.414)$. The irrepresentability condition for the Lasso on the other hand allows the full range $\\rho \\in (-1, 1)$. Thus there is again a regime, $|\\rho| \\in [0.414, 1)$, where the Lasso irrepresentability condition holds while the log-determinant counterpart fails.\n\n3 Proof outline\nTheorem 1 follows as a corollary to Theorem 2 in Ravikumar et al. [14], an extended and more general version of this paper. There we consider the more general problem of estimating the covariance matrix of a random vector (that need not necessarily be Gaussian) from i.i.d. samples, and we relax Assumption [A2], allowing the quantities $K_{\\Sigma^*}, K_{\\Gamma^*}$ to grow with sample size $n$. We provide here a high-level outline of the proof of Theorem 1, deferring details to the extended version [14]. 
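Since the analysis hinges on Assumption [A1], it is useful to be able to evaluate the incoherence quantity $|||\\Gamma^*_{S^c S}(\\Gamma^*_{SS})^{-1}|||_\\infty$ numerically for a candidate model. The following sketch does so directly from the definition $\\Gamma^* = (\\Theta^*)^{-1} \\otimes (\\Theta^*)^{-1}$, using the four-node star of Section 2.3.2 with the illustrative value $\\rho = 0.3$ (inside the predicted range $(-0.414, 0.414)$); the support-detection tolerance and the convention that $S$ includes the diagonal are implementation assumptions.

```python
# Sketch: numerically evaluate the incoherence quantity of Assumption [A1]
# for a given concentration matrix Theta*.
import numpy as np

def incoherence(theta_star, tol=1e-10):
    """Return |||Gamma*_{S^c S} (Gamma*_{SS})^{-1}|||_inf; [A1] needs < 1."""
    sigma = np.linalg.inv(theta_star)
    gamma = np.kron(sigma, sigma)                    # p^2 x p^2 Fisher information
    support = (np.abs(theta_star) > tol).flatten()   # vectorized support of Theta*
    g_ss = gamma[np.ix_(support, support)]
    g_cs = gamma[np.ix_(~support, support)]
    coef = g_cs @ np.linalg.inv(g_ss)
    return np.abs(coef).sum(axis=1).max()            # l_inf/l_inf operator norm

# Four-node star of Section 2.3.2 with rho = 0.3; the closed-form reduction
# predicts the value rho*(rho + 2) = 0.69 < 1, so [A1] should hold here.
rho = 0.3
sigma_star = np.full((4, 4), rho**2)
sigma_star[0, 1:] = sigma_star[1:, 0] = rho
np.fill_diagonal(sigma_star, 1.0)
theta_star = np.linalg.inv(sigma_star)
print(incoherence(theta_star))
```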
Our proofs are based on a technique that we call a primal-dual witness method, used previously in analysis of the Lasso [12]. It involves following a specific sequence of steps to construct a pair $(\\tilde{\\Theta}, \\tilde{Z})$ of symmetric matrices that together satisfy the optimality conditions associated with the convex program (3) with high probability. Thus, when the constructive procedure succeeds, $\\tilde{\\Theta}$ is equal to the unique solution $\\hat{\\Theta}$ of the convex program (3), and $\\tilde{Z}$ is an optimal solution to its\n\n\f\ndual. In this way, the estimator $\\hat{\\Theta}$ inherits from $\\tilde{\\Theta}$ various optimality properties in terms of its distance to the truth $\\Theta^*$, and its recovery of the signed sparsity pattern. To be clear, our procedure for constructing $\\tilde{\\Theta}$ is not a practical algorithm for solving the log-determinant problem (3), but rather is used as a proof technique for certifying the behavior of the $\\ell_1$-regularized MLE (3).\n\n3.1 Primal-dual witness approach\nAt the core of the primal-dual witness method are the standard convex optimality conditions that characterize the optimum of the convex program (3). For future reference, we note that the subdifferential of the norm $\\|\\cdot\\|_{1,\\mathrm{off}}$ evaluated at some $\\Theta$ consists of the set of all symmetric matrices $Z \\in \\mathbb{R}^{p \\times p}$ such that\n$Z_{ij} = 0$ if $i = j$; $\\quad Z_{ij} = \\mathrm{sign}(\\Theta_{ij})$ if $i \\neq j$ and $\\Theta_{ij} \\neq 0$; $\\quad Z_{ij} \\in [-1, +1]$ if $i \\neq j$ and $\\Theta_{ij} = 0$. (12)\n\nLemma 1. For any $\\lambda_n > 0$ and sample covariance $\\hat{\\Sigma}$ with strictly positive diagonal, the $\\ell_1$-regularized log-determinant problem (3) has a unique solution $\\hat{\\Theta} \\succ 0$ characterized by\n$\\hat{\\Sigma} - \\hat{\\Theta}^{-1} + \\lambda_n \\hat{Z} = 0$, (13)\nwhere $\\hat{Z}$ is an element of the subdifferential $\\partial \\|\\hat{\\Theta}\\|_{1,\\mathrm{off}}$.\n\nBased on this lemma, we construct the primal-dual witness solution $(\\tilde{\\Theta}, \\tilde{Z})$ as follows:\n\n(a) We determine the matrix $\\tilde{\\Theta}$ by solving the restricted log-determinant problem\n$\\tilde{\\Theta} := \\arg\\min_{\\Theta \\succ 0, \\; \\Theta_{S^c} = 0} \\left\\{ \\langle \\Theta, \\hat{\\Sigma} \\rangle - \\log\\det(\\Theta) + \\lambda_n \\|\\Theta\\|_{1,\\mathrm{off}} \\right\\}$. (14)\nNote that by construction, we have $\\tilde{\\Theta} \\succ 0$, and moreover $\\tilde{\\Theta}_{S^c} = 0$.\n\n(b) We choose $\\tilde{Z}_S$ as a member of the sub-differential of the regularizer $\\|\\cdot\\|_{1,\\mathrm{off}}$, evaluated at $\\tilde{\\Theta}$.\n\n(c) We set $\\tilde{Z}_{S^c}$ as\n$\\tilde{Z}_{S^c} = \\frac{1}{\\lambda_n} \\left[ -\\hat{\\Sigma} + \\tilde{\\Theta}^{-1} \\right]_{S^c}$, (15)\nwhich ensures that the constructed matrices $(\\tilde{\\Theta}, \\tilde{Z})$ satisfy the optimality condition (13).\n\n(d) We verify the strict dual feasibility condition $|\\tilde{Z}_{ij}| < 1$ for all $(i, j) \\in S^c$.\n\nTo clarify the nature of the construction, steps (a) through (c) suffice to obtain a pair $(\\tilde{\\Theta}, \\tilde{Z})$ that satisfy the optimality condition (13), but do not guarantee that $\\tilde{Z}$ is an element of the sub-differential $\\partial \\|\\tilde{\\Theta}\\|_{1,\\mathrm{off}}$. By construction, specifically step (b) of the construction ensures that the entries of $\\tilde{Z}$ in $S$ satisfy the sub-differential conditions, since $\\tilde{Z}_S$ is a member of the sub-differential of $\\partial \\|\\tilde{\\Theta}_S\\|_{1,\\mathrm{off}}$. The purpose of step (d), then, is to verify that the remaining elements of $\\tilde{Z}$ satisfy the necessary conditions to belong to the sub-differential. If the primal-dual witness construction succeeds, then it acts as a witness to the fact that the solution $\\tilde{\\Theta}$ to the restricted problem (14) is equivalent to the solution $\\hat{\\Theta}$ of the original (unrestricted) problem (3). We exploit this fact in our proof of Theorem 1: we first show that the primal-dual witness technique succeeds with high probability, from which we can conclude that the support of the optimal solution $\\hat{\\Theta}$ is contained within the support of the true $\\Theta^*$. The next step requires checking that none of the entries of $\\tilde{\\Theta}_S$ constructed in equation (14) are zero. It is to verify this that we require the lower bound assumption in Theorem 1 on the minimum value $\\theta_{\\min}$.\n\n\f\n4 Experiments\nIn this section, we describe some experiments which illustrate the model selection rates in Theorem 1. We solved the $\\ell_1$-penalized log-determinant optimization problem using the \"glasso\" program [4], which builds on the block coordinate descent algorithm of [3]. We report experiments for star-shaped graphs, which consist of one node connected to the rest of the nodes. These graphs allow us to vary both $d$ and $p$, since the degree of the central hub can be varied between $1$ and $p - 1$. 
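An empirical analogue of steps (b)-(d) can be carried out on any computed solution of (3): recover $\\hat{Z}$ from the stationarity condition (13) as $\\hat{Z} = (\\hat{\\Theta}^{-1} - \\hat{\\Sigma})/\\lambda_n$, and check strict dual feasibility $|\\hat{Z}_{ij}| < 1$ on the estimated zero set. The sketch below (an illustration, not part of the proof) again uses scikit-learn's GraphicalLasso as a generic solver; the star model, penalty level, and zero-detection tolerance are illustrative assumptions.

```python
# Sketch: recover Z from the stationarity condition (13) of a computed solution
# and check strict dual feasibility |Z_ij| < 1 on the estimated zero pattern.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(1)
p, n, lam = 4, 5000, 0.2
theta_star = np.eye(p)
theta_star[0, 1:] = theta_star[1:, 0] = 0.3        # star graph with hub at node 0
sigma_star = np.linalg.inv(theta_star)
X = rng.multivariate_normal(np.zeros(p), sigma_star, size=n)

model = GraphicalLasso(alpha=lam).fit(X)
theta_hat = model.precision_
sigma_hat = np.cov(X, rowvar=False, bias=True)     # empirical covariance (13)

Z = (np.linalg.inv(theta_hat) - sigma_hat) / lam   # rearranged stationarity (13)
off = ~np.eye(p, dtype=bool)
zero_set = off & (np.abs(theta_hat) < 1e-8)        # estimated analogue of S^c
print(np.abs(Z[zero_set]).max() < 1.0)             # strict dual feasibility check
```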
Applying the algorithm to these graphs should therefore provide some insight into how the required number of samples $n$ is related to $d$ and $p$. We tested varying graph sizes $p$ from $p = 64$ upwards to $p = 375$. The edge-weights were set as entries in the inverse of a covariance matrix with diagonal entries $\\Sigma^*_{ii} = 1$ for all $i = 1, \\ldots, p$, and $\\Sigma^*_{ij} = 2.5/d$ for all $(i, j) \\in E$, so that the quantities $(K_{\\Sigma^*}, K_{\\Gamma^*}, \\alpha)$ remain constant.\n\nDependence on graph size:\n\nFigure 3. Simulations for a star graph with varying number of nodes $p \\in \\{64, 100, 225, 375\\}$, fixed maximal degree $d = 40$, and edge covariances $\\Sigma^*_{ij} = 1/16$ for all edges. Plots of probability of correct signed edge-set recovery versus the sample size $n$ in panel (a), and versus the rescaled sample size $n/\\log p$ in panel (b). Each point corresponds to the average over $N = 100$ trials.\n\nPanel (a) of Figure 3 plots the probability of correct signed edge-set recovery against the sample size $n$ for star-shaped graphs of different sizes $p$. For each curve, the probability of success starts at zero (for small sample sizes $n$), but then transitions to one as the sample size is increased. As would be expected, it is more difficult to perform model selection for larger graph sizes, so that (for instance) the curve for $p = 375$ is shifted to the right relative to the curve for $p = 64$. Panel (b) of Figure 3 replots the same data, with the horizontal axis rescaled by $1/\\log p$. This scaling was chosen because our theory predicts that the sample size should scale logarithmically with $p$ (see equation (10)). Consistent with this prediction, when plotted against the rescaled sample size $n/\\log p$, the curves in panel (b) all stack up. 
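A small-scale analogue of this experiment is easy to reproduce. The sketch below (illustrative only: tiny $p$, few trials, and ad hoc choices of edge weight, penalty, and tolerance, rather than the settings used for Figure 3) estimates the probability of exact edge recovery for a star graph as $n$ grows.

```python
# Sketch: estimate the probability of exact edge-set recovery for a star-shaped
# GMRF as the sample size n grows, averaging over repeated trials.
import numpy as np
from sklearn.covariance import GraphicalLasso

def success_prob(p, n, trials=20, rho=0.25, alpha=0.12, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.eye(p)
    theta[0, 1:] = theta[1:, 0] = rho                # hub at node 0
    sigma = np.linalg.inv(theta)
    true_edges = {(0, j) for j in range(1, p)}
    hits = 0
    for _ in range(trials):
        X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
        prec = GraphicalLasso(alpha=alpha).fit(X).precision_
        edges = {(i, j) for i in range(p) for j in range(i + 1, p)
                 if abs(prec[i, j]) > 1e-3}
        hits += (edges == true_edges)
    return hits / trials

for n in (50, 200, 800):
    print(n, success_prob(p=8, n=n))
```

As in Figure 3, the estimated success probability should increase toward one as $n$ grows with $p$ held fixed.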
Consequently, the ratio $n/\\log p$ acts as an effective sample size in controlling the success of model selection, consistent with the predictions of Theorem 1.\n\nDependence on the maximum node degree: Panel (a) of Figure 4 plots the probability of correct signed edge-set recovery against the sample size $n$ for star-shaped graphs; each curve corresponds to a different choice of maximum node degree $d$, allowing us to investigate the dependence of the sample size on this parameter. So as to control these comparisons, we fixed the number of nodes to $p = 200$. Observe how the plots in panel (a) shift to the right as the maximum node degree $d$ is increased, showing that star-shaped graphs with higher degrees are more difficult. In panel (b) of Figure 4, we plot the same data versus the rescaled sample size $n/d$. Recall that if all the curves were to stack up under this rescaling, then it would mean that the required sample size $n$ scales linearly with $d$. These plots are closer to aligning than the unrescaled plots, but the agreement is not perfect. In particular, observe that the curve for the largest degree (right-most in panel (a)) remains a bit to the right in panel (b), which suggests that a somewhat more aggressive rescaling--perhaps $n/d^\\gamma$ for some $\\gamma \\in (1, 2)$--is appropriate. The sufficient condition from Theorem 1, as summarized\n\n\f\nFigure 4. Simulations for star graphs (\"truncated star\") with fixed number of nodes $p = 200$, varying maximal (hub) degree $d \\in \\{50, 60, 70, 80, 90, 100\\}$, and edge covariances $\\Sigma^*_{ij} = 2.5/d$. 
Plots of probability of correct signed edge-set recovery versus the sample size $n$ in panel (a), and versus the rescaled sample size $n/d$ in panel (b).\n\nin equation (10), is $n = \\Omega(d^2 \\log p)$, which appears to be overly conservative based on these data. Thus, it might be possible to tighten our theory under certain regimes.\n\nReferences\n[1] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electron. J. Statist., 2:494-515, 2008.\n[2] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.\n[3] A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM J. Matrix Anal. Appl., 30(1):56-66, 2008.\n[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostat., 9(3):432-441, 2007.\n[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.\n[6] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.\n[7] S. L. Lauritzen. Graphical Models. Oxford University Press, Oxford, 1996.\n[8] L. D. Brown. Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, Hayward, CA, 1986.\n[9] N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the Lasso. Ann. Statist., 34(3):1436-1462, 2006.\n[10] J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE Trans. Info. Theory, 51(3):1030-1051, 2006.\n[11] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2567, 2006.\n[12] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity using the Lasso. Technical Report 709, UC Berkeley, May 2006. To appear in IEEE Trans. Info. Theory.\n[13] N. Meinshausen. A note on the Lasso for graphical Gaussian model selection. 
Statistics and Probability Letters, 78(7):880-884, 2008.\n[14] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing $\\ell_1$-penalized log-determinant divergence. Technical Report 767, Department of Statistics, UC Berkeley, November 2008.\n", "award": [], "sourceid": 3436, "authors": [{"given_name": "Garvesh", "family_name": "Raskutti", "institution": null}, {"given_name": "Bin", "family_name": "Yu", "institution": null}, {"given_name": "Martin", "family_name": "Wainwright", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}]}