{"title": "Extended Bayesian Information Criteria for Gaussian Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 604, "page_last": 612, "abstract": "Gaussian graphical models with sparsity in the inverse covariance matrix are of significant interest in many modern applications. For the problem of recovering the graphical structure, information criteria provide useful optimization objectives for algorithms searching through sets of graphs or for selection of tuning parameters of other methods such as the graphical lasso, which is a likelihood penalization technique. In this paper we establish the asymptotic consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjuction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n.", "full_text": "Extended Bayesian Information Criteria for Gaussian\n\nGraphical Models\n\nRina Foygel\n\nUniversity of Chicago\n\nrina@uchicago.edu\n\nMathias Drton\n\nUniversity of Chicago\n\ndrton@uchicago.edu\n\nAbstract\n\nGaussian graphical models with sparsity in the inverse covariance matrix are of\nsigni\ufb01cant interest in many modern applications. For the problem of recovering\nthe graphical structure, information criteria provide useful optimization objectives\nfor algorithms searching through sets of graphs or for selection of tuning parame-\nters of other methods such as the graphical lasso, which is a likelihood penaliza-\ntion technique. 
In this paper we establish the consistency of an extended Bayesian information criterion for Gaussian graphical models in a scenario where both the number of variables p and the sample size n grow. Compared to earlier work on the regression case, our treatment allows for growth in the number of non-zero parameters in the true model, which is necessary in order to cover connected graphs. We demonstrate the performance of this criterion on simulated data when used in conjunction with the graphical lasso, and verify that the criterion indeed performs better than either cross-validation or the ordinary Bayesian information criterion when p and the number of non-zero parameters q both scale with n.

1 Introduction

This paper is concerned with the problem of model selection (or structure learning) in Gaussian graphical modelling. A Gaussian graphical model for a random vector X = (X1, . . . , Xp) is determined by a graph G on p nodes. The model comprises all multivariate normal distributions N(µ, Θ⁻¹) whose inverse covariance matrix satisfies that Θjk = 0 when {j, k} is not an edge in G. For background on these models, including a discussion of the conditional independence interpretation of the graph, we refer the reader to [1].
In many applications, in particular in the analysis of gene expression data, inference of the graph G is of significant interest. Information criteria provide an important tool for this problem. They provide the objective to be minimized in (heuristic) searches over the space of graphs and are sometimes used to select tuning parameters in other methods such as the graphical lasso of [2]. In this work we study an extended Bayesian information criterion (BIC) for Gaussian graphical models. 
Given a sample of n independent and identically distributed observations, this criterion takes the form

BIC_γ(E) = −2 ln(Θ̂(E)) + |E| log n + 4 |E| γ log p,   (1)

where E is the edge set of a candidate graph and ln(Θ̂(E)) denotes the maximized log-likelihood function of the associated model. (In this context an edge set comprises unordered pairs {j, k} of distinct elements in {1, . . . , p}.) The criterion is indexed by a parameter γ ∈ [0, 1]; see the Bayesian interpretation of γ given in [3]. If γ = 0, then the classical BIC of [4] is recovered, which is well known to lead to (asymptotically) consistent model selection in the setting of fixed number of variables p and growing sample size n. Consistency is understood to mean selection of the smallest true graph, whose edge set we denote E0. Positive γ leads to stronger penalization of large graphs, and our main result states that the (asymptotic) consistency of an exhaustive search over a restricted model space may then also hold in a scenario where p grows moderately with n (see the Main Theorem in Section 2). Our numerical work demonstrates that positive values of γ indeed lead to improved graph inference when p and n are of comparable size (Section 3).
The choice of the criterion in (1) is in analogy to a similar criterion for regression models that was first proposed in [5] and theoretically studied in [3, 6]. Our theoretical study employs ideas from these latter two papers as well as distribution theory available for decomposable graphical models. As mentioned above, we treat an exhaustive search over a restricted model space that contains all decomposable models given by an edge set of cardinality |E| ≤ q. One difference to the regression treatment of [3, 6] is that we do not fix the dimension bound q nor the dimension |E0| of the smallest true model. 
This is necessary for connected graphs to be covered by our work.
In practice, an exhaustive search is infeasible even for moderate values of p and q. Therefore, we must choose some method for preselecting a smaller set of models, each of which is then scored by applying the extended BIC (EBIC). Our simulations show that the combination of EBIC and graphical lasso gives good results well beyond the realm of the assumptions made in our theoretical analysis. This combination is consistent in settings where both the lasso and the exhaustive search are consistent, but in light of the good theoretical properties of lasso procedures (see [7]), studying this particular combination in itself would be an interesting topic for future work.

2 Consistency of the extended BIC for Gaussian graphical models

2.1 Notation and definitions

In the sequel we make no distinction between the edge set E of a graph on p nodes and the associated Gaussian graphical model. Without loss of generality we assume a zero mean vector for all distributions in the model. We also refer to E as a set of entries in a p × p matrix, meaning the 2|E| entries indexed by (j, k) and (k, j) for each {j, k} ∈ E. We use Δ to denote the index pairs (j, j) for the diagonal entries of the matrix.
Let Θ0 be a positive definite matrix supported on Δ ∪ E0. In other words, the non-zero entries of Θ0 are precisely the diagonal entries as well as the off-diagonal positions indexed by E0; note that a single edge in E0 corresponds to two positions in the matrix due to symmetry. Suppose the random vectors X1, . . . , Xn are independent and distributed identically according to N(0, Θ0⁻¹). Let S = (1/n) Σ_i X_i X_i^T be the sample covariance matrix. The Gaussian log-likelihood function simplifies to

ln(Θ) = (n/2) [log det(Θ) − trace(SΘ)].   (2)

We introduce some further notation. First, we define the maximum variance of the individual nodes:

σ²_max = max_j (Θ0⁻¹)_jj.

Next, we define θ0 = min_{e ∈ E0} |(Θ0)_e|, the minimum signal over the edges present in the graph. (For edge e = {j, k}, let (Θ0)_e = (Θ0)_jk = (Θ0)_kj.) Finally, we write λ_max for the maximum eigenvalue of Θ0. Observe that the product σ²_max λ_max is no larger than the condition number of Θ0 because 1/λ_min(Θ0) = λ_max(Θ0⁻¹) ≥ σ²_max.

2.2 Main result

Suppose that n tends to infinity with the following asymptotic assumptions on data and model:

  E0 is decomposable, with |E0| ≤ q,
  σ²_max λ_max ≤ C,
  p = O(n^κ), p → ∞,      (3)
  γ0 = γ − (1 − 1/(4κ)) > 0,
  (p + 2q) log p × λ²_max / θ²_0 = o(n).

Here C, κ > 0 and γ are fixed reals, while the integers p, q, the edge set E0, the matrix Θ0, and thus the quantities σ²_max, λ_max and θ0 are implicitly allowed to vary with n. We suppress this latter dependence on n in the notation. The ‘big oh’ O(·) and the ‘small oh’ o(·) are the Landau symbols.

Main Theorem. Suppose that conditions (3) hold. Let E be the set of all decomposable models E with |E| ≤ q. Then with probability tending to 1 as n → ∞,

E0 = arg min_{E ∈ E} BIC_γ(E).

That is, the extended BIC with parameter γ selects the smallest true model E0 when applied to any subset of E containing E0.
In order to prove this theorem we use two techniques for comparing likelihoods of different models. 
Firstly, in Chen and Chen’s work on the GLM case [6], the Taylor approximation to the log-likelihood function is used, and we will proceed similarly when comparing the smallest true model E0 to models E which do not contain E0. The technique produces a lower bound on the decrease in likelihood when the true model is replaced by a false model.
Theorem 1. Suppose that conditions (3) hold. Let E1 be the set of models E with E ⊅ E0 and |E| ≤ q. Then with probability tending to 1 as n → ∞,

ln(Θ0) − ln(Θ̂(E)) > 2q(log p)(1 + γ0)  ∀ E ∈ E1.

Secondly, Porteous [8] shows that in the case of two nested models which are both decomposable, the likelihood ratio (at the maximum likelihood estimates) follows a distribution that can be expressed exactly as a log product of Beta distributions. We will use this to address the comparison between the model E0 and decomposable models E containing E0 and obtain an upper bound on the improvement in likelihood when the true model is expanded to a larger decomposable model.
Theorem 2. Suppose that conditions (3) hold. Let E0 be the set of decomposable models E with E ⊃ E0 and |E| ≤ q. Then with probability tending to 1 as n → ∞,

ln(Θ̂(E)) − ln(Θ̂(E0)) < 2(1 + γ0)(|E| − |E0|) log p  ∀ E ∈ E0\{E0}.

Proof of the Main Theorem. With probability tending to 1 as n → ∞, both of the conclusions of Theorems 1 and 2 hold. We will show that both conclusions holding simultaneously implies the desired result.
Observe that E ⊂ E0 ∪ E1. Choose any E ∈ E\{E0}. 
If E ∈ E0, then (by Theorem 2):

BIC_γ(E) − BIC_γ(E0) = −2(ln(Θ̂(E)) − ln(Θ̂(E0))) + 4(1 + γ0)(|E| − |E0|) log p > 0.

If instead E ∈ E1, then (by Theorem 1, since |E0| ≤ q):

BIC_γ(E) − BIC_γ(E0) = −2(ln(Θ̂(E)) − ln(Θ̂(E0))) + 4(1 + γ0)(|E| − |E0|) log p > 0.

Therefore, for any E ∈ E\{E0}, BIC_γ(E) > BIC_γ(E0), which yields the desired result.

Some details on the proofs of Theorems 1 and 2 are given in the Appendix in Section 5.

3 Simulations

In this section, we demonstrate that the EBIC with positive γ indeed leads to better model selection properties in practically relevant settings. We let n grow, set p ∝ n^κ for various values of κ, and apply the EBIC with γ ∈ {0, 0.5, 1}, similarly to the choice made in the regression context by [3]. As mentioned in the introduction, we first use the graphical lasso of [2] (as implemented in the ‘glasso’ package for R) to define a small set of models to consider (details given below). From the selected set we choose the model with the lowest EBIC. This is repeated for 100 trials for each combination of values of n, p, γ in each scaling scenario. For each case, the average positive selection rate (PSR) and false discovery rate (FDR) are computed.
We recall that the graphical lasso places an ℓ1 penalty on the inverse covariance matrix. Given a penalty ρ ≥ 0, we obtain the estimate

Θ̂_ρ = arg min_Θ −ln(Θ) + ρ‖Θ‖₁.   (4)

Figure 1: The chain (top) and the ‘double chain’ (bottom) on 6 nodes.

(Here we may define ‖Θ‖₁ as the sum of absolute values of all entries, or only of off-diagonal entries; both variants are common). 
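To make the model-scoring step concrete, the EBIC in (1), together with the log-likelihood in (2), can be evaluated directly from an estimated inverse covariance matrix. The sketch below is in Python with numpy (the paper's own experiments use R); the function names are ours, and `theta_hat` stands for the maximum likelihood estimate Θ̂(E) for a candidate support.

```python
import numpy as np

def log_likelihood(theta, S, n):
    """Gaussian log-likelihood (2): l_n(Theta) = (n/2)[log det(Theta) - trace(S Theta)]."""
    sign, logdet = np.linalg.slogdet(theta)
    assert sign > 0, "Theta must be positive definite"
    return 0.5 * n * (logdet - np.trace(S @ theta))

def ebic(theta_hat, S, n, gamma):
    """Extended BIC (1): -2 l_n(theta_hat) + |E| log n + 4 |E| gamma log p,
    where |E| is the number of unordered off-diagonal non-zero pairs of theta_hat."""
    p = theta_hat.shape[0]
    num_edges = (np.count_nonzero(theta_hat) - np.count_nonzero(np.diag(theta_hat))) // 2
    return (-2.0 * log_likelihood(theta_hat, S, n)
            + num_edges * np.log(n)
            + 4.0 * gamma * num_edges * np.log(p))
```

Scoring each support along the glasso path with this function and keeping the minimizer is the data-driven choice of ρ discussed in the text; γ = 0 recovers the ordinary BIC.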
The ℓ1 penalty promotes zeros in the estimated inverse covariance matrix Θ̂_ρ; increasing the penalty yields an increase in sparsity. The ‘glasso path’, that is, the set of models recovered over the full range of penalties ρ ∈ [0, ∞), gives a small set of models which, roughly, include the ‘best’ models at various levels of sparsity. We may therefore apply the EBIC to this manageably small set of models (without further restriction to decomposable models). Consistency results on the graphical lasso require the penalty ρ to satisfy bounds that involve measures of regularity in the unknown matrix Θ0; see [7]. Minimizing the EBIC can be viewed as a data-driven method of tuning ρ, one that does not require creation of test data.
While cross-validation does not generally have consistency properties for model selection (see [9]), it is nevertheless interesting to compare our method to cross-validation. For the considered simulated data, we start with the set of models from the ‘glasso path’, as before, and then perform 100-fold cross-validation. For each model and each choice of training set and test set, we fit the model to the training set and then evaluate its performance on each sample in the test set, by measuring error in predicting each individual node conditional on the other nodes and then taking the sum of the squared errors. We note that this method is computationally much more intensive than the BIC or EBIC, because models need to be fitted many more times.

3.1 Design

In our simulations, we examine the EBIC as applied to the case where the graph is a chain with node j being connected to nodes j−1, j+1, and to the ‘double chain’, where node j is connected to nodes j−2, j−1, j+1, j+2. Figure 1 shows examples of the two types of graphs, which have on the order of p and 2p edges, respectively. 
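The two designs are straightforward to reproduce. The following Python/numpy sketch (helper names are ours; the paper's simulations use R) builds the tridiagonal and pentadiagonal precision matrices with the entry values specified in the next paragraph, and draws a sample from N(0, Θ0⁻¹).

```python
import numpy as np

def chain_precision(p, off=0.3):
    """Tridiagonal Theta_0 for the chain: unit diagonal, `off` on the first off-diagonals."""
    theta = np.eye(p)
    i = np.arange(p - 1)
    theta[i, i + 1] = theta[i + 1, i] = off
    return theta

def double_chain_precision(p, off1=0.2, off2=0.1):
    """Pentadiagonal Theta_0 for the 'double chain': unit diagonal, `off1` and `off2`
    on the first and second off-diagonals, respectively."""
    theta = np.eye(p)
    i1 = np.arange(p - 1)
    theta[i1, i1 + 1] = theta[i1 + 1, i1] = off1
    i2 = np.arange(p - 2)
    theta[i2, i2 + 2] = theta[i2 + 2, i2] = off2
    return theta

def sample(theta0, n, seed=0):
    """Draw n i.i.d. observations from N(0, Theta_0^{-1})."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(theta0.shape[0]), np.linalg.inv(theta0), size=n)
```

Both choices are strictly diagonally dominant, so Θ0 is positive definite for every p; this is one way to see that θ0, σ²_max and λ_max stay bounded as p grows.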
For both the chain and the double chain, we investigate four different scaling scenarios, with the exponent κ selected from {0.5, 0.9, 1, 1.1}. In each scenario, we test n = 100, 200, 400, 800, and define p ∝ n^κ with the constant of proportionality chosen such that p = 10 when n = 100, for better comparability.
In the case of a chain, the true inverse covariance matrix Θ0 is tridiagonal with all diagonal entries (Θ0)_{j,j} set equal to 1, and the entries (Θ0)_{j,j+1} = (Θ0)_{j+1,j} that are next to the main diagonal equal to 0.3. For the double chain, Θ0 has all diagonal entries equal to 1, the entries next to the main diagonal are (Θ0)_{j,j+1} = (Θ0)_{j+1,j} = 0.2 and the remaining non-zero entries are (Θ0)_{j,j+2} = (Θ0)_{j+2,j} = 0.1. In both cases, the choices result in values for θ0, σ²_max and λ_max that are bounded uniformly in the matrix size p.
For each data set generated from N(0, Θ0⁻¹), we use the ‘glasso’ package [2] in R to compute the ‘glasso path’. We choose 100 penalty values ρ which are logarithmically evenly spaced between ρ_max (the smallest value which will result in a no-edge model) and ρ_max/100. At each penalty value ρ, we compute Θ̂_ρ from (4) and define the model E_ρ based on this estimate’s support. The R routine also allows us to compute the unpenalized maximum likelihood estimate Θ̂(E_ρ). We may then readily compute the EBIC from (1). There is no guarantee that this procedure will find the model with the lowest EBIC along the full ‘glasso path’, let alone among the space of all possible models of size ≤ q. Nonetheless, it serves as a fast way to select a model without any manual tuning.

3.2 Results

Chain graph: The results for the chain graph are displayed in Figure 2. 
The figure shows the positive selection rate (PSR) and false discovery rate (FDR) in the four scaling scenarios. We observe that, for the larger sample sizes, the recovery of the non-zero coefficients is perfect or nearly perfect for all three values of γ; however, the FDR is noticeably better for the positive values of γ, especially for higher scaling exponents κ. Therefore, for moderately large n, the EBIC with γ = 0.5 or γ = 1 performs very well, while the ordinary BIC0 produces a non-trivial amount of false positives. For 100-fold cross-validation, while the PSR is initially slightly higher, the growing FDR demonstrates the extreme inconsistency of this method in the given setting.
Double chain graph: The results for the double chain graph are displayed in Figure 3. In each of the four scaling scenarios for this case, we see a noticeable decline in the PSR as γ increases. Nonetheless, for each value of γ, the PSR increases as n and p grow. Furthermore, the FDR for the ordinary BIC0 is again noticeably higher than for the positive values of γ, and in the scaling scenarios κ ≥ 0.9, the FDR for BIC0 is actually increasing as n and p grow, suggesting that asymptotic consistency may not hold in these cases, as is supported by our theoretical results. 100-fold cross-validation shows significantly better PSR than the BIC and EBIC methods, but the FDR is again extremely high and increases quickly as the model grows, which shows the unreliability of cross-validation in this setting. Similarly to what Chen and Chen [3] conclude for the regression case, it appears that the EBIC with parameter γ = 0.5 performs well. 
Although the PSR is necessarily lower than with γ = 0, the FDR is quite low and decreasing as n and p grow, as desired.
For both types of simulations, the results demonstrate the trade-off inherent in choosing γ in the finite (non-asymptotic) setting. For low values of γ, we are more likely to obtain a good (high) positive selection rate. For higher values of γ, we are more likely to obtain a good (low) false discovery rate. (In the Appendix, this corresponds to assumptions (5) and (6)). However, asymptotically, the conditions (3) guarantee consistency, meaning that the trade-off becomes irrelevant for large n and p. In the finite case, γ = 0.5 seems to be a good compromise in simulations, but determining the best value of γ in general settings remains an open question. Nonetheless, this method offers guaranteed asymptotic consistency for (known) values of γ depending only on n and p.

4 Discussion

We have proposed the use of an extended Bayesian information criterion for multivariate data generated by sparse graphical models. Our main result gives a specific scaling for the number of variables p, the sample size n, the bound on the number of edges q, and other technical quantities relating to the true model, which will ensure asymptotic consistency. Our simulation study demonstrates the practical potential of the extended BIC, particularly as a way to tune the graphical lasso. 
The results show that the extended BIC with positive γ gives strong improvement in false discovery rate over the classical BIC, and even more so over cross-validation, while showing comparable positive selection rate for the chain, where all the signals are fairly strong, and noticeably lower, but steadily increasing, positive selection rate for the double chain with a large number of weaker signals.

5 Appendix

We now sketch proofs of non-asymptotic versions of Theorems 1 and 2, which are formulated as Theorems 3 and 4. We also give a non-asymptotic formulation of the Main Theorem; see Theorem 5. In the non-asymptotic approach, we treat all quantities as fixed (e.g. n, p, q, etc.), state precise assumptions on those quantities, and then give an explicit lower bound on the probability of the extended BIC recovering the model E0 exactly. We do this to give an intuition for the magnitude of the sample size n necessary for a good chance of exact recovery in a given setting, but due to the proof techniques, the resulting implications about sample size are extremely conservative.

5.1 Preliminaries

We begin by stating two lemmas that are used in the proof of the main result, but are also more generally interesting as tools for precise bounds on Gaussian and chi-square distributions. First, Cai [10, Lemma 4] proves the following chi-square bound. For any n ≥ 1, λ > 0,

P{χ²_n > n(1 + λ)} ≤ (1/(λ√(πn))) e^{−(n/2)(λ − log(1+λ))}.

We can give an analogous left-tail upper bound. The proof is similar to Cai’s proof and omitted here. We will refer to these two bounds together as (CSB).

Figure 2: Simulation results when the true graph is a chain.

Lemma 1. 
For any λ > 0, for n such that n ≥ 4λ⁻² + 1,

P{χ²_n < n(1 − λ)} ≤ (1/(λ√(π(n−1)))) e^{((n−1)/2)(λ + log(1−λ))}.

Second, we give a distributional result about the sample correlation when sampling from a bivariate normal distribution.
Lemma 2. Suppose (X1, Y1), . . . , (Xn, Yn) are independent draws from a bivariate normal distribution with zero mean, variances equal to one and covariance ρ. Then the following distributional equivalence holds, where A and B are independent χ²_n variables:

Σ_{i=1}^n (X_i Y_i − ρ) =_D ((1+ρ)/2)(A − n) − ((1−ρ)/2)(B − n).

Proof. Let A1, B1, A2, B2, . . . , An, Bn be independent standard normal random variables. Define:

X_i = √((1+ρ)/2) A_i + √((1−ρ)/2) B_i;  Y_i = √((1+ρ)/2) A_i − √((1−ρ)/2) B_i;  A = Σ_{i=1}^n A_i²;  B = Σ_{i=1}^n B_i².

Then the variables X1, Y1, X2, Y2, . . . , Xn, Yn have the desired joint distribution, and A, B are independent χ²_n variables. The claim follows from writing Σ_i X_i Y_i in terms of A and B.

Figure 3: Simulation results when the true graph is a ‘double chain’.

5.2 Non-asymptotic versions of the theorems

We assume the following two conditions, where ε0, ε1 > 0, C ≥ σ²_max λ_max, κ = log_n p, and γ0 = γ − (1 − 1/(4κ)):

((p + 2q) log p / n) × (λ²_max / θ²_0) ≤ 1 / (3200 max{1 + γ0, (1 + ε1/2) C²})   (5)

2(√(1+γ0) − 1)√(2 log p) − log log p + log(4√(1+γ0)) + 1 ≥ ε0   (6)

Theorem 3. Suppose assumption (5) holds. Then with probability at least 1 − (1/√(π log p)) p^{−ε1}, for all E ⊅ E0 with |E| ≤ q,

ln(Θ0) − ln(Θ̂(E)) > 2q(log p)(1 + γ0).

Proof. We sketch a proof along the lines of the proof of Theorem 2 in [6], using Taylor series centered at the true Θ0 to approximate the likelihood at Θ̂(E). The score and the negative Hessian of the log-likelihood function in (2) are

s_n(Θ) = (d/dΘ) ln(Θ) = (n/2)(Θ⁻¹ − S),   H_n(Θ) = −(d/dΘ) s_n(Θ) = (n/2) Θ⁻¹ ⊗ Θ⁻¹.

Here, the symbol ⊗ denotes the Kronecker product of matrices. Note that, while we require Θ to be symmetric positive definite, this is not reflected in the derivatives above. We adopt this convention for the notational convenience in the sequel.
Next, observe that Θ̂(E) has support on Δ ∪ E0 ∪ E, and that by definition of θ0, we have the lower bound |Θ̂(E) − Θ0|_F ≥ θ0 in terms of the Frobenius norm. By concavity of the log-likelihood function, it suffices to show that the desired inequality holds for all Θ with support on Δ ∪ E0 ∪ E with |Θ − Θ0|_F = θ0. By Taylor expansion, for some Θ̃ on the path from Θ0 to Θ, we have:

ln(Θ) − ln(Θ0) = vec(Θ − Θ0)^T s_n(Θ0) − (1/2) vec(Θ − Θ0)^T H_n(Θ̃) vec(Θ − Θ0).

Next, by (CSB) and Lemma 2, with probability at least 1 − (1/√(π log p)) e^{−ε1 log p}, the following bound holds for all edges e in the complete graph (we omit the details):

(s_n(Θ0))²_e ≤ 6σ⁴_max (2 + ε1) n log p.

Now assume that this bound holds for all edges. 
Fix some E as above, and fix Θ with support on Δ ∪ E0 ∪ E, with |Θ − Θ0|_F = θ0. Note that the support has at most (p + 2q) entries. Therefore,

|vec(Θ − Θ0)^T s_n(Θ0)|² ≤ θ²_0 (p + 2q) × 6σ⁴_max (2 + ε1) n log p.

Furthermore, the eigenvalues of Θ are bounded by λ_max + θ0 ≤ 2λ_max, and so by properties of Kronecker products, the minimum eigenvalue of H_n(Θ̃) is at least (n/2)(2λ_max)⁻². We conclude that

ln(Θ) − ln(Θ0) ≤ √(θ²_0 (p + 2q) × 6σ⁴_max (2 + ε1) n log p) − (1/2) θ²_0 × (n/2)(2λ_max)⁻².

Combining this bound with our assumptions above, we obtain the desired result.
Theorem 4. Suppose additionally that assumption (6) holds (in particular, this implies that γ > 1 − 1/(4κ)). Then with probability at least 1 − (1/(4√(π log p))) p^{−ε0}/(1 − p^{−ε0}), for all decomposable models E such that E ⊋ E0 and |E| ≤ q,

ln(Θ̂(E)) − ln(Θ̂(E0)) < 2(1 + γ0)(|E| − |E0|) log p.

Proof. First, fix a single such model E, and define m = |E| − |E0|. By [8, 11], ln(Θ̂(E)) − ln(Θ̂(E0)) is distributed as −(n/2) log(Π_{i=1}^m B_i), where B_i ∼ Beta((n − c_i)/2, 1/2) are independent random variables and the constants c1, . . . , cm are bounded by 1 less than the maximal clique size of the graph given by model E, implying c_i ≤ √(2q) for each i. Also shown in [8] is the stochastic inequality −log(B_i) ≤ (1/(n − c_i − 1)) χ²_1. It follows that, stochastically,

ln(Θ̂(E)) − ln(Θ̂(E0)) ≤ (n/2) × (1/(n − √(2q) − 1)) χ²_m.

Finally, combining the assumptions on n, p, q and the (CSB) inequalities, we obtain:

P{ln(Θ̂(E)) − ln(Θ̂(E0)) ≥ 2(1 + γ0) m log(p)} ≤ (1/(4√(π log p))) e^{−(m/2)(4(1 + ε0/2) log p)}.

Next, note that the number of models E with E ⊃ E0 and |E| − |E0| = m is bounded by p^{2m}. Taking the union bound over all choices of m and all choices of E with that given m, we obtain that the desired result holds with the desired probability.

We are now ready to give a non-asymptotic version of the Main Theorem. For its proof, apply the union bound to the statements in Theorems 3 and 4, as in the asymptotic proof given in Section 2.
Theorem 5. Suppose assumptions (5) and (6) hold. Let E be the set of subsets E of edges between the p nodes, satisfying |E| ≤ q and representing a decomposable model. Then it holds with probability at least 1 − (1/(4√(π log p))) p^{−ε0}/(1 − p^{−ε0}) − (1/√(π log p)) p^{−ε1} that

E0 = arg min_{E ∈ E} BIC_γ(E).

That is, the extended BIC with parameter γ selects the smallest true model.

Finally, we note that translating the above to the asymptotic version of the result is simple. If the conditions (3) hold, then for sufficiently large n (and thus sufficiently large p), assumptions (5) and (6) hold. Furthermore, although we may not have the exact equality κ = log_n p, we will have log_n p → κ; this limit will be sufficient for the necessary inequalities to hold for sufficiently large n. The proofs then follow from the non-asymptotic results.

References

[1] Steffen L. Lauritzen. Graphical models, volume 17 of Oxford Statistical Science Series. 
The Clarendon Press Oxford University Press, New York, 1996. Oxford Science Publications.
[2] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
[3] Jiahua Chen and Zehua Chen. Extended Bayesian information criterion for model selection with large model space. Biometrika, 95:759–771, 2008.
[4] Gideon Schwarz. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.
[5] Malgorzata Bogdan, Jayanta K. Ghosh, and R. W. Doerge. Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics, 167:989–999, 2004.
[6] Jiahua Chen and Zehua Chen. Extended BIC for small-n-large-p sparse GLM. Preprint.
[7] Pradeep Ravikumar, Martin J. Wainwright, Garvesh Raskutti, and Bin Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. arXiv:0811.3628, 2008.
[8] B. T. Porteous. Stochastic inequalities relating a class of log-likelihood ratio statistics to their asymptotic χ² distribution. Ann. Statist., 17(4):1723–1734, 1989.
[9] Jun Shao. Linear model selection by cross-validation. J. Amer. Statist. Assoc., 88(422):486–494, 1993.
[10] T. Tony Cai. On block thresholding in wavelet regression: adaptivity, block size, and threshold level. Statist. Sinica, 12(4):1241–1273, 2002.
[11] P. Svante Eriksen. Tests in covariance selection models. Scand. J. Statist., 23(3):275–284, 1996.
", "award": [], "sourceid": 60, "authors": [{"given_name": "Rina", "family_name": "Foygel", "institution": null}, {"given_name": "Mathias", "family_name": "Drton", "institution": null}]}