{"title": "Robust Gaussian Graphical Modeling with the Trimmed Graphical Lasso", "book": "Advances in Neural Information Processing Systems", "page_first": 2602, "page_last": 2610, "abstract": "Gaussian Graphical Models (GGMs) are popular tools for studying network structures. However, many modern applications such as gene network discovery and social interactions analysis often involve high-dimensional noisy data with outliers or heavier tails than the Gaussian distribution. In this paper, we propose the Trimmed Graphical Lasso for robust estimation of sparse GGMs. Our method guards against outliers by an implicit trimming mechanism akin to the popular Least Trimmed Squares method used for linear regression. We provide a rigorous statistical analysis of our estimator in the high-dimensional setting. In contrast, existing approaches for robust sparse GGMs estimation lack statistical guarantees. Our theoretical results are complemented by experiments on simulated and real gene expression data which further demonstrate the value of our approach.", "full_text": "Robust Gaussian Graphical Modeling with the Trimmed Graphical Lasso

Eunho Yang
IBM T.J. Watson Research Center
eunhyang@us.ibm.com

Aur\u00e9lie C. Lozano
IBM T.J. Watson Research Center
aclozano@us.ibm.com

Abstract

Gaussian Graphical Models (GGMs) are popular tools for studying network structures. However, many modern applications such as gene network discovery and social interactions analysis often involve high-dimensional noisy data with outliers or heavier tails than the Gaussian distribution. In this paper, we propose the Trimmed Graphical Lasso for robust estimation of sparse GGMs. Our method guards against outliers by an implicit trimming mechanism akin to the popular Least Trimmed Squares method used for linear regression. We provide a rigorous statistical analysis of our estimator in the high-dimensional setting. 
In contrast, existing approaches for robust sparse GGM estimation lack statistical guarantees. Our theoretical results are complemented by experiments on simulated and real gene expression data which further demonstrate the value of our approach.

1 Introduction

Gaussian graphical models (GGMs) form a powerful class of statistical models for representing distributions over a set of variables [1]. These models employ undirected graphs to encode conditional independence assumptions among the variables, which is particularly convenient for exploring network structures. GGMs are widely used in a variety of domains, including computational biology [2], natural language processing [3], image processing [4, 5, 6], statistical physics [7], and spatial statistics [8]. In many modern applications, the number of variables p can exceed the number of observations n. For instance, the number of genes in microarray data is typically larger than the sample size. In such high-dimensional settings, sparsity constraints are particularly pertinent for estimating GGMs, as they encourage only a few parameters to be non-zero and induce graphs with few edges. The most widely used estimator among others (see e.g. [9]) minimizes the Gaussian negative log-likelihood regularized by the ℓ1 norm of the entries (or the off-diagonal entries) of the precision matrix (see [10, 11, 12]). This estimator enjoys strong statistical guarantees (see e.g. [13]). The corresponding optimization problem is a log-determinant program that can be solved with interior point methods [14] or by coordinate descent algorithms [11, 12]. Alternatively, neighborhood selection [15, 16] can be employed to estimate conditional independence relationships separately for each node in the graph, via Lasso linear regression [17]. Under certain assumptions, the sparse GGM structure can still be recovered even under high-dimensional settings.

The aforementioned approaches rest on a fundamental assumption: the multivariate normality of the observations. However, outliers and corruption are frequently encountered in high-dimensional data (see e.g. [18] for gene expression data). Contamination of a few observations can drastically affect the quality of model estimation. It is therefore imperative to devise procedures that can cope with observations deviating from the model assumption. Despite this fact, little attention has been paid to robust estimation of high-dimensional graphical models. Relevant work includes [19], which leverages multivariate t-distributions for robustified inference and the EM algorithm. They also propose an alternative t-model which adds flexibility to the classical t but requires the use of Monte Carlo EM or variational approximation, as the likelihood function is not available explicitly. Another pertinent work is that of [20], which introduces a robustified likelihood function. A two-stage procedure is proposed for model estimation, where the graphical structure is first obtained via coordinate gradient descent and the concentration matrix coefficients are subsequently re-estimated using iterative proportional fitting so as to guarantee positive definiteness of the final estimate.

In this paper, we propose the Trimmed Graphical Lasso method for robust Gaussian graphical modeling in the sparse high-dimensional setting. Our approach is inspired by the classical Least Trimmed Squares method used for robust linear regression [21], in the sense that it disregards the observations that are judged less reliable. More specifically, the Trimmed Graphical Lasso seeks to minimize a weighted version of the negative log-likelihood regularized by the ℓ1 penalty on the concentration matrix for the GGM, under some simple constraints on the weights. These weights implicitly induce the trimming of certain observations. Our key contributions can be summarized as follows.

• We introduce the Trimmed Graphical Lasso formulation, along with two strategies for solving the objective. One involves solving a series of graphical lasso problems; the other is more efficient and leverages composite gradient descent in conjunction with partial optimization.
• As our key theoretical contribution, we provide statistical guarantees on the consistency of our estimator. To the best of our knowledge, this is in stark contrast with prior work on robust sparse GGM estimation (e.g. [19, 20]), which does not provide any statistical analysis.
• Experimental results under various data corruption scenarios further demonstrate the value of our approach.

2 Problem Setup and Robust Gaussian Graphical Models

Notation. For matrices U ∈ R^{p×p} and V ∈ R^{p×p}, ⟨⟨U, V⟩⟩ denotes the trace inner product tr(U V^T). For a matrix U ∈ R^{p×p} and a parameter a ∈ [1, ∞], ‖U‖_a denotes the element-wise ℓ_a norm, and ‖U‖_{a,off} the element-wise ℓ_a norm over off-diagonal entries only; for example, ‖U‖_{1,off} := Σ_{i≠j} |U_{ij}|. Finally, we use ‖U‖_F and |||U|||_2 to denote the Frobenius and spectral norms, respectively.

Setup. Let X = (X1, X2, . . . 
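As a concrete illustration of this standard (non-robust) baseline, the ℓ1-penalized Gaussian MLE can be fit with off-the-shelf tools. The following is a minimal sketch using scikit-learn; the choice of library, penalty value, and toy chain-graph example are ours, not part of the paper:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# sparse ground-truth precision matrix: chain graph on p = 5 variables
p = 5
Theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta), size=500)

# l1-penalized maximum likelihood estimate of the precision matrix;
# a larger alpha yields a sparser estimated graph
model = GraphicalLasso(alpha=0.05).fit(X)
Theta_hat = model.precision_
```

The non-zero pattern of `Theta_hat` (off the diagonal) is read off as the estimated edge set of the graph.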
, Xp) be a zero-mean Gaussian random field parameterized by a p × p concentration matrix Θ*:

P(X; Θ*) = exp( −(1/2)⟨⟨Θ*, XX^T⟩⟩ − A(Θ*) )    (1)

where A(Θ*) is the log-partition function of the Gaussian random field. Here, the probability density function in (1) is associated with the p-variate Gaussian distribution N(0, Σ*) where Σ* = (Θ*)^{-1}.

Given n i.i.d. samples {X^(1), . . . , X^(n)} from the high-dimensional Gaussian random field (1), the standard way to estimate the inverse covariance matrix is to solve the ℓ1-regularized maximum likelihood estimator (MLE), which can be written as the following regularized log-determinant program:

minimize_{Θ∈Ω}  ⟨⟨Θ, (1/n) Σ_{i=1}^n X^(i)(X^(i))^T⟩⟩ − log det(Θ) + λ‖Θ‖_{1,off}    (2)

where Ω is the space of symmetric positive definite matrices, and λ is a regularization parameter that encourages a sparse graph model structure.

Algorithm 1 Trimmed Graphical Lasso in (3)
  Initialize Θ^(0) (e.g. Θ^(0) = (S + λI)^{-1})
  repeat
    Compute w^(t) given Θ^(t−1), by assigning a weight of one to the h observations with lowest negative log-likelihood and a weight of zero to the remaining ones.
    ∇L^(t) ← (1/h) Σ_{i=1}^n w^(t)_i X^(i)(X^(i))^T − (Θ^(t−1))^{-1}
    Line search. Choose η^(t) (see Nesterov (2007) for a discussion of how the stepsize may be chosen), checking that the following update maintains positive definiteness. This can be verified via Cholesky factorization (as in [23]).
    Update. Θ^(t) ← S_{η^(t)λ}(Θ^(t−1) − η^(t)∇L^(t)), where S_ν is the soft-thresholding operator, [S_ν(U)]_{i,j} = sign(U_{i,j}) max(|U_{i,j}| − ν, 0), applied only to the off-diagonal elements of the matrix U.
    Compute (Θ^(t))^{-1} reusing the Cholesky factor.
  until stopping criterion is satisfied

In this paper, we consider the case where the number of random variables p may be substantially larger than the sample size n, while the concentration parameter of the underlying distribution is sparse:

(C-1) The number of non-zero off-diagonal entries of Θ* is at most k, that is, |{Θ*_{ij} : Θ*_{ij} ≠ 0 for i ≠ j}| ≤ k.

Now, suppose that n samples are drawn from this underlying distribution (1) with true parameter Θ*. We further allow some samples to be corrupted and not drawn from (1). Specifically, the set of sample indices {1, 2, . . . , n} is separated into two disjoint subsets: if the i-th sample is in the set of "good" samples, which we name G, then it is a genuine sample from (1) with the parameter Θ*. On the other hand, if the i-th sample is in the set of "bad" samples, B, the sample is corrupted. The identifications of G and B are hidden to us. However, we naturally assume that only a small number of samples are corrupted:

(C-2) Let h be the number of good samples: h := |G| and hence |B| = n − h. 
Then, we assume that the larger portion of the samples is genuine and uncorrupted, so that (|G| − |B|)/|G| ≥ α where 0 < α ≤ 1. For example, if 40% of the samples are corrupted, then α = (0.6n − 0.4n)/(0.6n) = 1/3.

In later sections, we will derive a robust estimator for corrupted samples of sparse Gaussian graphical models and provide statistical guarantees of our estimator under the conditions (C-1) and (C-2).

2.1 Trimmed Graphical Lasso

We now propose a Trimmed Graphical Lasso for robust estimation of sparse GGMs:

minimize_{Θ∈Ω, w}  ⟨⟨Θ, (1/h) Σ_{i=1}^n w_i X^(i)(X^(i))^T⟩⟩ − log det(Θ) + λ‖Θ‖_{1,off}
s.t.  w ∈ [0, 1]^n ,  1^T w = h ,  ‖Θ‖_1 ≤ R    (3)

where λ is a regularization parameter that decides the sparsity of our estimate, and h is another parameter that decides the number of samples (or the sum of the weights) used in the training. Ideally, h is set to the number of uncorrupted samples in G, but in practice we can tune h by cross-validation. Here, the constraint ‖Θ‖_1 ≤ R is required to analyze this non-convex optimization problem, as discussed in [22]. For the other tuning parameter R, any positive real value is sufficient as long as ‖Θ*‖_1 ≤ R. 
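As a concrete illustration of optimizing (3), the following is a minimal numpy sketch in the spirit of Algorithm 1 (composite gradient descent with hard 0/1 trimming weights). The function name, the fixed step size with simple halving in place of the line search, and the iteration count are our own simplifications, not the authors' reference implementation:

```python
import numpy as np

def trimmed_glasso(X, h, lam, step=0.01, n_iter=200):
    """Illustrative sketch of the Trimmed Graphical Lasso objective (3),
    optimized by alternating a trimming (w) step with a composite
    gradient step on the precision matrix Theta."""
    n, p = X.shape
    S = X.T @ X / n
    Theta = np.linalg.inv(S + lam * np.eye(p))  # suggested initialization
    w = np.ones(n)
    for _ in range(n_iter):
        # w-step: weight one on the h samples with lowest negative
        # log-likelihood; for fixed Theta the ordering is given by the
        # quadratic form x_i^T Theta x_i (the log det term is common)
        nll = np.einsum('ij,jk,ik->i', X, Theta, X)
        w = np.zeros(n)
        w[np.argsort(nll)[:h]] = 1.0
        # gradient of the smooth part: (1/h) sum_i w_i x_i x_i^T - Theta^{-1}
        Sw = (X * w[:, None]).T @ X / h
        grad = Sw - np.linalg.inv(Theta)
        # composite gradient step: soft-threshold off-diagonal entries only
        T = Theta - step * grad
        shrunk = np.sign(T) * np.maximum(np.abs(T) - step * lam, 0.0)
        np.fill_diagonal(shrunk, np.diag(T))
        # accept the iterate only if it stays positive definite
        try:
            np.linalg.cholesky(shrunk)
            Theta = shrunk
        except np.linalg.LinAlgError:
            step *= 0.5  # crude stand-in for the line search of Algorithm 1
    return Theta, w
```

Setting h = n keeps every sample at every iteration, so the w-step becomes a no-op and the scheme behaves like a composite-gradient solver for the vanilla ℓ1-regularized MLE.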
Finally, note that when h is fixed as n (and R is set as infinity), the optimization problem (3) simply reduces to the vanilla ℓ1-regularized MLE for sparse GGMs, without accounting for outliers.

The optimization problem (3) is convex in w as well as in Θ; however, this is not the case jointly. Nevertheless, we will show later that any local optimum of (3) is guaranteed to be strongly consistent under some fairly mild conditions.

Optimization. As we briefly discussed above, the problem (3) is not jointly convex but biconvex. One possible approach to solve the objective of (3) is thus to alternate between solving for Θ with fixed w and solving for w with fixed Θ. Given Θ, solving for w is straightforward and boils down to assigning a weight of one to the h observations with lowest negative log-likelihood and a weight of zero to the remaining ones. Given w, solving for Θ can be accomplished by any algorithm solving the "vanilla" graphical Lasso program, e.g. [11, 12]. Each step solves a convex problem, hence the objective is guaranteed to decrease at each iteration and will converge to a local minimum.

A more efficient optimization approach can be obtained by adopting a partial minimization strategy for Θ: rather than solving to completion for Θ each time w is updated, one performs a single-step update. This approach stems from considering the following equivalent reformulation of our objective:

minimize_{Θ∈Ω}  ⟨⟨Θ, (1/h) Σ_{i=1}^n w_i(Θ) X^(i)(X^(i))^T⟩⟩ − log det(Θ) + λ‖Θ‖_{1,off} ,  ‖Θ‖_1 ≤ R ,
s.t.  w(Θ) = argmin_{w∈[0,1]^n, 1^T w = h}  ⟨⟨Θ, (1/h) Σ_{i=1}^n w_i X^(i)(X^(i))^T⟩⟩    (4)

One can then leverage standard first-order methods such as projected and composite gradient descent [24] that will converge to local optima. The overall procedure is depicted in Algorithm 1. Therein we assume that we pick R sufficiently large, so one does not need to enforce the constraint ‖Θ‖_1 ≤ R explicitly. If needed, the constraint can be enforced by an additional projection step [22].

3 Statistical Guarantees of Trimmed Graphical Lasso

One of the main contributions of this paper is to provide statistical guarantees for our Trimmed Graphical Lasso estimator for GGMs. The optimization problem (3) is non-convex, and therefore gradient-type methods solving (3) will find estimators that are local minima. Hence, our theory in this section provides statistical error bounds for any local minimum, measured in the ‖·‖_F and ‖·‖_{1,off} norms simultaneously.

Suppose that we have some local optimum (Θ̃, w̃) of (3), obtained by an arbitrary gradient-based method. While Θ* is fixed unconditionally, we define w* as follows: for a sample index i ∈ G, w*_i is simply set to w̃_i so that w*_i − w̃_i = 0. Otherwise, for a sample index i ∈ B, we set w*_i = 0. Hence, w* is dependent on w̃.

In order to derive the upper bound on the Frobenius norm error, we first need to assume the standard restricted strong convexity condition of (3) with respect to the parameter Θ:

(C-3) (Restricted strong convexity condition) Let ∆ be an arbitrary error of the parameter Θ. That is, ∆ := Θ − Θ*. 
Then, for any possible error ∆ such that ‖∆‖_F ≤ 1,

⟨⟨(Θ*)^{-1} − (Θ* + ∆)^{-1}, ∆⟩⟩ ≥ κ_l ‖∆‖_F^2    (5)

where κ_l is a curvature parameter.

Note that in order to guarantee Frobenius-norm-based error bounds, (C-3) is required even for vanilla Gaussian graphical models without outliers, and it has been well studied by several works, such as the following lemma:

Lemma 1 (Section B.4 of [22]). For any ∆ ∈ R^{p×p} such that ‖∆‖_F ≤ 1,

⟨⟨(Θ*)^{-1} − (Θ* + ∆)^{-1}, ∆⟩⟩ ≥ (|||Θ*|||_2 + 1)^{-2} ‖∆‖_F^2 ,

thus (C-3) holds with κ_l = (|||Θ*|||_2 + 1)^{-2}.

While (C-3) is a standard condition that is also imposed for conventional estimators on a clean set of samples, we additionally require the following condition for successful estimation of (3) on corrupted samples:

(C-4) Consider an arbitrary local optimum (Θ̃, w̃). Let ∆̃ := Θ̃ − Θ* and Γ̃ := w̃ − w*. Then,

| ⟨⟨ (1/h) Σ_{i=1}^n Γ̃_i X^(i)(X^(i))^T, ∆̃ ⟩⟩ | ≤ τ_1(n, p) ‖∆̃‖_F + τ_2(n, p) ‖∆̃‖_1

for some positive quantities τ_1(n, p) and τ_2(n, p) depending on n and p. These will be specified below for some concrete examples.

(C-4) can be understood as a structural incoherence condition between the model parameter Θ and the weight parameter w. Such a condition is usually imposed when analyzing estimators with multiple parameters (for example, see [25] for a robust linear regression estimator). Since w* is defined depending on w̃, each local optimum has its own (C-4) condition. We will see in the sequel that under some reasonable cases, this condition holds for any local optimum with high probability. Also note that for the case with clean samples, the condition (C-4) is trivially satisfied, since Γ̃_i = 0 for all i ∈ {1, . . . , n} and hence the LHS becomes 0.

Armed with these conditions, we now state our main theorem on the error bounds of our estimator (3):

Theorem 1. Consider corrupted Gaussian graphical models. Let (Θ̃, w̃) be any local optimum of the M-estimator (3). Suppose that (Θ̃, w̃) satisfies the condition (C-4). Suppose also that the regularization parameter λ in (3) is set such that

4 max{ ‖(1/h) Σ_{i=1}^n w*_i X^(i)(X^(i))^T − (Θ*)^{-1}‖_∞ , τ_2(n, p) } ≤ λ ≤ (κ_l − τ_1(n, p)) / (3R) .    (6)

Then, this local optimum (Θ̃, w̃) is guaranteed to be consistent as follows:

‖Θ̃ − Θ*‖_F ≤ (1/κ_l) ( (3λ√(k + p))/2 + τ_1(n, p) ) ,
‖Θ̃ − Θ*‖_{1,off} ≤ (2/(λ κ_l)) ( (3λ√(k + p))/2 + τ_1(n, p) )^2 .    (7)

The statement in Theorem 1 holds deterministically, and the probabilistic statement arises when we show that (C-4) and (6) are satisfied for a given (Θ̃, w̃). Note that, defining L(Θ, w) := ⟨⟨Θ, (1/h) Σ_{i=1}^n w_i X^(i)(X^(i))^T⟩⟩ − log det(Θ), the lower bound in (6) is a standard way of choosing λ based on ‖∇_Θ L(Θ*, w*)‖_∞ (see [26] for details). Also, it is important to note that the term √(k + p) captures the relation between the element-wise ℓ1 norm and the error norm ‖·‖_F including diagonal entries. Due to the space limit, the proof of Theorem 1 (and all other proofs) are provided in the Supplements [27].

Now, it is natural to ask how easily we can satisfy the conditions in Theorem 1. Intuitively, it is impossible to recover the true parameter by a weighting approach as in (3) when the amount of corruption exceeds that of normal observation errors. To this end, suppose that we have some upper bound on the corruptions:

(C-B1) For some function f(·), we have (|||X^B|||_2)^2 ≤ f(X^B) √(h / log p)

where X^B denotes the sub-design matrix in R^{|B|×p} corresponding to the outliers. Under this assumption, we can properly choose the regularization parameter λ satisfying (6) as follows:

Corollary 1. Consider corrupted Gaussian graphical models with conditions (C-2) and (C-B1). Suppose that we choose the regularization parameter

λ = 4 max{ 8(max_i Σ*_{ii}) √(10τ log p / (h − |B|)) + (|B|/h) ‖Σ*‖_∞ , f(X^B) √(log p / h) } ≤ (κ_l − f(X^B) √(|B| log p / h)) / (3R) .

Then, any local optimum of (3) is guaranteed to satisfy (C-4) and to have the error bounds in (7), with probability at least 1 − c_1 exp(−c'_1 h λ^2) for some universal positive constants c_1 and c'_1.

If we further assume that the number of corrupted samples scales at most with √n:

(C-B2) |B| ≤ a√n for some constant a ≥ 0,

then we can derive the following result as another corollary of Theorem 1:

Corollary 2. Consider corrupted Gaussian graphical models. Suppose that the conditions (C-2), (C-B1) and (C-B2) hold. Also suppose that the regularization parameter λ is set as c √(log p / n), where c := 4 max{ 16(max_i Σ*_{ii}) √(5τ) + 2a ‖Σ*‖_∞ , √2 f(X^B) }. Then, if the sample size n is lower bounded as

n ≥ max{ 16a^2 , (|||Θ*|||_2 + 1)^4 ( 3Rc + f(X^B) √(2|B|) )^2 (log p) } ,

then any local optimum of (3) is guaranteed to satisfy (C-4) and to have the following error bound:

‖Θ̃ − Θ*‖_F ≤ (1/κ_l) ( (3c/2) √((k + p) log p / n) + f(X^B) √(2|B| log p / n) )    (8)

with probability at least 1 − c_1 exp(−c'_1 h λ^2) for some universal positive constants c_1 and c'_1.

Note that the ‖·‖_{1,off} norm based error bound can also be easily derived from (7) using this selection of λ. Corollary 2 reveals an interesting result: even when O(√n) samples out of the total n samples are corrupted, our estimator (3) can successfully recover the true parameter with the guaranteed error in (8). The first term in this bound is O(√((k + p) log p / n)), which exactly recovers the Frobenius error bound for the case without outliers (see [13, 22] for example). Due to the outliers, the performance degrades by the second term, which is O(√(|B| log p / n)). To the best of our knowledge, these are the first statistical error bounds on parameter estimation for Gaussian graphical models with outliers. Also note that Corollary 1 only concerns a single local optimum obtained by an arbitrary optimization algorithm. 
For the guarantees of multiple local optima simultaneously, we may use a union bound from the corollary.

When Outliers Follow a Gaussian Graphical Model. Now let us provide a concrete example and show how f(X^B) in (C-B1) is precisely specified in this case:

(C-B3) Outliers in the set B are drawn from another Gaussian graphical model (1) with a parameter (Σ^B)^{-1}.

This can be understood as a Gaussian mixture model where most of the samples are drawn from (Θ*)^{-1}, which we want to estimate, and a small portion of samples are drawn from Σ^B. In this case, Corollary 2 can be further sharpened as follows:

Corollary 3. Suppose that the conditions (C-2), (C-B2) and (C-B3) hold. Then the statement in Corollary 2 holds with f(X^B) := 2a (1 + 4√(log p))^2 |||Σ^B|||_2 / √(log p).

4 Experiments

In this section we corroborate the performance of our Trimmed Graphical Lasso (trim-glasso) algorithm on simulated data. We compare against glasso: the vanilla Graphical Lasso [11]; the t-lasso and t*-lasso methods [19]; and robust-LL: the robustified-likelihood approach of [20].

4.1 Simulated data

Our simulation setup is similar to [20] and is akin to gene regulatory networks. Namely, we consider four different scenarios where the outliers are generated from models with different graphical structures. Specifically, each sample is generated from the following mixture distribution:

y_k ∼ (1 − p_0) N_p(0, Θ^{-1}) + (p_0/2) N_p(−µ, Θ_o^{-1}) + (p_0/2) N_p(µ, Θ_o^{-1}) ,  k = 1, . . . , n,

where p_0 = 0.1, n = 100, and p = 150. Four different outlier distributions are considered:

M1: µ = (1, . . . , 1)^T , Θ_o = Θ̃ ,  M2: µ = (1.5, . . . 
, 1.5)^T , Θ_o = Θ̃ ,
M3: µ = (1, . . . , 1)^T , Θ_o = I_p ,  M4: µ = (1.5, . . . , 1.5)^T , Θ_o = I_p .

We also consider a scenario where the outliers are not symmetric about the mean, and simulate data from the following model:

M5: y_k ∼ (1 − p_0) N_p(0, Θ^{-1}) + p_0 N_p(2, I_p) ,  k = 1, . . . , n.

(a) M1 (b) M2 (c) M3 (d) M4
Figure 1: Average ROC curves for the comparison methods for contamination scenarios M1-M4.

For each simulation run, Θ is a randomly generated precision matrix corresponding to a network with 9 hub nodes, simulated as follows. Let A be the adjacency matrix of the network. For all i < j we set A_ij = 1 with probability 0.03, and zero otherwise. We set A_ji = A_ij. We then randomly select 9 hub nodes and set the elements of the corresponding rows and columns of A to one with probability 0.4 and zero otherwise. Using A, the simulated nonzero coefficients of the precision matrix are sampled as follows. First we create a matrix E so that E_ij = 0 if A_ij = 0, and E_ij is sampled uniformly from [−0.75, −0.23] ∪ [0.25, 0.75] if A_ij ≠ 0. Then we set E = (E + E^T)/2. Finally we set Θ = E + (0.1 − Λ_min(E)) I_p, where Λ_min(E) is the smallest eigenvalue of E. Θ̃ is a randomly generated precision matrix, generated in the same way as Θ.

For the robustness parameter β of the robust-LL method, we consider β ∈ {0.005, 0.01, 0.02, 0.03} as recommended in [20]. For the trim-glasso method we consider 100h/n ∈ {90, 85, 80}. Since all the robust comparison methods converge to a stationary point, we tested various initialization strategies for the concentration matrix, including I_p, (S + λI_p)^{-1} and the estimate from glasso. 
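The hub-network generator and the contaminated sampling scheme above can be sketched as follows; a minimal numpy illustration in which the function names and default arguments are ours:

```python
import numpy as np

def make_hub_precision(p, n_hubs=9, rng=None):
    """Random hub-network precision matrix following the recipe above
    (illustrative sketch, not the authors' code)."""
    rng = np.random.default_rng(rng)
    A = (rng.random((p, p)) < 0.03).astype(float)
    A = np.triu(A, 1)
    A = A + A.T                              # symmetric adjacency
    hubs = rng.choice(p, n_hubs, replace=False)
    for j in hubs:                           # densify hub rows/columns
        mask = rng.random(p) < 0.4
        A[j, mask] = A[mask, j] = 1.0
    np.fill_diagonal(A, 0.0)
    # nonzero entries uniform on [-0.75, -0.23] U [0.25, 0.75]
    signs = rng.choice([-1.0, 1.0], size=(p, p))
    mags = np.where(signs < 0,
                    rng.uniform(0.23, 0.75, (p, p)),
                    rng.uniform(0.25, 0.75, (p, p)))
    E = A * signs * mags
    E = (E + E.T) / 2                        # symmetrize
    # shift the spectrum so the smallest eigenvalue is 0.1 (positive definite)
    return E + (0.1 - np.linalg.eigvalsh(E).min()) * np.eye(p)

def sample_contaminated(n, Theta, Theta_o, mu, p0=0.1, rng=None):
    """Mixture (1-p0) N(0, Theta^-1) + (p0/2) N(+/-mu, Theta_o^-1)."""
    rng = np.random.default_rng(rng)
    p = Theta.shape[0]
    Sigma, Sigma_o = np.linalg.inv(Theta), np.linalg.inv(Theta_o)
    u = rng.random(n)
    Y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    out_pos = u > 1 - p0 / 2
    out_neg = (u > 1 - p0) & (u <= 1 - p0 / 2)
    for mask, m in ((out_pos, mu), (out_neg, -mu)):
        if mask.any():
            Y[mask] = rng.multivariate_normal(m, Sigma_o, size=mask.sum())
    return Y
```

For instance, scenario M3 corresponds to `sample_contaminated(100, Theta, np.eye(p), np.ones(p))` with `Theta = make_hub_precision(150)`.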
We did not observe any noticeable impact on the results.

Figure 1 presents the average ROC curves of the comparison methods over 100 simulated data sets for scenarios M1-M4 as the tuning parameter λ varies. In the figure, for the robust-LL and trim-glasso methods, we depict the best curves with respect to the parameters β and h, respectively. Due to space constraints, the detailed results for all the values of β and h considered, as well as the results for model M5, are provided in the Supplements [27].

From the ROC curves we can see that our proposed approach is competitive compared to the alternative robust approaches t-lasso, t*-lasso and robust-LL. The edge over glasso is even more pronounced for scenarios M2, M4 and M5. Surprisingly, trim-glasso with h/n = 80% achieves superior sensitivity for nearly any specificity.

Computationally, the trim-glasso method is also competitive compared to the alternatives. The average run-time over the path of tuning parameters λ is 45.78s for t-lasso, 22.14s for t*-lasso, 11.06s for robust-LL, 1.58s for trim-glasso, and 1.04s for glasso. Experiments were run in R on a single computing node with an Intel Core i5 2.5GHz CPU and 8G memory. For t-lasso, t*-lasso and robust-LL we used the R implementations provided by the methods' authors. For glasso we used the glassopath package.

Figure 2: (a) Histogram of standardized gene expression levels for gene ORC3. (b) Network estimated by trim-glasso.

4.2 Application to the analysis of Yeast Gene Expression Data

We analyze a yeast microarray dataset generated by [28]. The dataset concerns n = 112 yeast segregants (instances). 
We focused on p = 126 genes (variables) belonging to the cell-cycle pathway as provided by the KEGG database [29]. For each of these genes we standardize the gene expression data to zero mean and unit standard deviation. We observed that the expression levels of some genes are clearly not symmetric about their means and might include outliers. For example, the histogram of gene ORC3 is presented in Figure 2(a). For the robust-LL method we set β = 0.05, and for trim-glasso we use h/n = 80%. We use 5-fold CV to choose the tuning parameters for each method. After λ is chosen for each method, we rerun the methods using the full dataset to obtain the final precision matrix estimates.

Figure 2(b) shows the cell-cycle pathway estimated by our proposed method. For comparison, the cell-cycle pathway from the KEGG [29] is provided in the Supplements [27]. It is important to note that the KEGG graph corresponds to what is currently known about the pathway. It should not be treated as the ground truth. Certain discrepancies between the KEGG and estimated graphs may also be caused by inherent limitations in the dataset used for modeling. For instance, some edges in the cell-cycle pathway may not be observable from gene expression data. Additionally, the perturbation of cellular systems might not be strong enough to enable accurate inference of some of the links. glasso tends to estimate more links than the robust methods. We postulate that the lack of robustness might result in inaccurate network reconstruction and the identification of spurious links. Robust methods tend to estimate networks that are more consistent with the KEGG pathway (F1-score of 0.23 for glasso, 0.37 for t*-lasso, 0.39 for robust-LL and 0.41 for trim-glasso, where the F1 score is the harmonic mean between precision and recall). For instance, our approach recovers several characteristics of the KEGG pathway. 
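The F1 comparison above can be reproduced with a few lines of code; a sketch (function name and adjacency encoding are ours) that scores the off-diagonal support of an estimated precision matrix against a reference network such as the KEGG adjacency:

```python
import numpy as np

def edge_f1(theta_hat, adj_ref, tol=1e-8):
    """F1 score (harmonic mean of precision and recall) of the estimated
    edge set against a reference undirected adjacency matrix."""
    p = adj_ref.shape[0]
    iu = np.triu_indices(p, k=1)          # count each undirected edge once
    pred = np.abs(theta_hat[iu]) > tol
    truth = adj_ref[iu] > 0
    tp = np.sum(pred & truth)
    if pred.sum() == 0 or truth.sum() == 0 or tp == 0:
        return 0.0
    precision = tp / pred.sum()
    recall = tp / truth.sum()
    return 2 * precision * recall / (precision + recall)
```

A higher score means the estimated graph agrees more closely with the reference, matching the ranking reported above.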
For instance, genes CDC6 (a key regulator of DNA\nreplication playing important roles in the activation and maintenance of the checkpoint mechanisms\ncoordinating S phase and mitosis) and PDS1 (essential gene for meiotic progression and mitotic cell\ncycle arrest) are identi\ufb01ed as a hub genes, while genes CLB3,BRN1,YCG1 are unconnected to any\nother genes.\n\n8\n\nrescaled ORC3 gene expressionFrequency-3-2-101202468\fReferences\n[1] S.L. Lauritzen. Graphical models. Oxford University Press, USA, 1996.\n[2] Jung Hun Oh and Joseph O. Deasy. Inference of radio-responsive gene regulatory networks using the\n\ngraphical lasso algorithm. BMC Bioinformatics, 15(S-7):S5, 2014.\n\n[3] C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press,\n\n1999.\n\n[4] J.W. Woods. Markov image modeling. IEEE Transactions on Automatic Control, 23:846\u2013850, October\n\n1978.\n\n[5] M. Hassner and J. Sklansky. Markov random \ufb01eld models of digitized image texture. In ICPR78, pages\n\n538\u2013540, 1978.\n\n[6] G. Cross and A. Jain. Markov random \ufb01eld texture models. IEEE Trans. PAMI, 5:25\u201339, 1983.\n[7] E. Ising. Beitrag zur theorie der ferromagnetismus. Zeitschrift f\u00a8ur Physik, 31:253\u2013258, 1925.\n[8] B. D. Ripley. Spatial statistics. Wiley, New York, 1981.\n[9] E. Yang, A. C. Lozano, and P. Ravikumar. Elementary estimators for graphical models. In Neur. Info.\n\nProc. Sys. (NIPS), 27, 2014.\n\n[10] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):\n\n19\u201335, 2007.\n\n[11] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso.\n\nBiostatistics, 2007.\n\n[12] O. Bannerjee, , L. El Ghaoui, and A. d\u2019Aspremont. Model selection through sparse maximum likelihood\n\nestimation for multivariate Gaussian or binary data. Jour. Mach. Lear. Res., 9:485\u2013516, March 2008.\n\n[13] P. Ravikumar, M. J. Wainwright, G. 
Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
[14] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[15] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
[16] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu. Graphical models via generalized linear models. In Neur. Info. Proc. Sys. (NIPS), 25, 2012.
[17] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[18] Z. J. Daye, J. Chen, and H. Li. High-dimensional heteroscedastic regression with an application to eQTL data analysis. Biometrics, 68:316–326, 2012.
[19] M. Finegold and M. Drton. Robust graphical modeling of gene networks using classical and alternative t-distributions. The Annals of Applied Statistics, 5(2A):1057–1080, 2011.
[20] H. Sun and H. Li. Robust Gaussian graphical modeling via ℓ1 penalization. Biometrics, 68:1197–1206, 2012.
[21] A. Alfons, C. Croux, and S. Gelper. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat., 7:226–248, 2013.
[22] P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. arXiv preprint arXiv:1305.2436v2, 2013.
[23] C.-J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In Neur. Info. Proc. Sys. (NIPS), 24, 2011.
[24] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic Univ. Louvain (UCL), 2007.
[25] N. H. Nguyen and T. D.
Tran. Robust Lasso with missing and grossly corrupted observations. IEEE Trans. Info. Theory, 59(4):2036–2058, 2013.
[26] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
[27] E. Yang and A. C. Lozano. Robust Gaussian graphical modeling with the trimmed graphical Lasso. arXiv:1510.08512, 2015.
[28] R. B. Brem and L. Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proceedings of the National Academy of Sciences of the United States of America, 102(5):1572–1577, 2005.
[29] M. Kanehisa, S. Goto, Y. Sato, M. Kawashima, M. Furumichi, and M. Tanabe. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res., 42:D199–D205, 2014.