{"title": "Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1432, "page_last": 1440, "abstract": "A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include $K$-fold cross-validation ($K$-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size asymptotically increases with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including $K$-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all competing procedures.", "full_text": "Stability Approach to Regularization Selection\n(StARS) for High Dimensional Graphical Models\n\nHan Liu Kathryn Roeder Larry Wasserman\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nAbstract\n\nA challenging problem in estimating high-dimensional graphical models is to\nchoose the regularization parameter in a data-dependent way. The standard tech-\nniques include K-fold cross-validation (K-CV), Akaike information criterion\n(AIC), and Bayesian information criterion (BIC). 
Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.

1 Introduction
Undirected graphical models have emerged as a useful tool because they allow for a stochastic description of complex associations in high-dimensional data. For example, biological processes in a cell lead to complex interactions among gene products. It is of interest to determine which features of the system are conditionally independent. Such problems require us to infer an undirected graph from i.i.d. observations. Each node in this graph corresponds to a random variable and the existence of an edge between a pair of nodes represents their conditional independence relationship.
Gaussian graphical models [4, 23, 5, 9] are by far the most popular approach for learning high dimensional undirected graph structures. Under the Gaussian assumption, the graph can be estimated using the sparsity pattern of the inverse covariance matrix. If two variables are conditionally independent, the corresponding element of the inverse covariance matrix is zero.
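This zero-pattern correspondence is easy to verify numerically. The following small sketch (our illustration, not from the paper; the chain-graph precision matrix is made up) builds a sparse precision matrix Ω, inverts it to obtain Σ, and reads the graph back off the support of Σ⁻¹:

```python
import numpy as np

# Hypothetical 4-variable Gaussian: specify a sparse precision matrix Omega
# directly; a zero off-diagonal entry encodes conditional independence.
Omega = np.array([
    [2.0, 0.6, 0.0, 0.0],
    [0.6, 2.0, 0.6, 0.0],
    [0.0, 0.6, 2.0, 0.6],
    [0.0, 0.0, 0.6, 2.0],
])  # chain graph 1-2-3-4

Sigma = np.linalg.inv(Omega)       # covariance of the model (dense in general)
Omega_back = np.linalg.inv(Sigma)  # recover the precision matrix

# The support of the precision matrix gives the edge set of the graph.
edges = {(i, j) for i in range(4) for j in range(i + 1, 4)
         if abs(Omega_back[i, j]) > 1e-8}
print(edges)  # the chain edges (0,1), (1,2), (2,3) and nothing else
```

Note that Σ itself is dense: the chain induces nonzero marginal correlations everywhere, which is why the graph must be read from Ω, not Σ.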
In many applications, estimating the inverse covariance matrix is statistically challenging because the number of features measured may be much larger than the number of collected samples. To handle this challenge, the graphical lasso or glasso [7, 24, 2] is rapidly becoming a popular method for estimating sparse undirected graphs. To use this method, however, the user must specify a regularization parameter λ that controls the sparsity of the graph. The choice of λ is critical since different λ's may lead to different scientific conclusions from the statistical inference. Other methods for estimating high dimensional graphs include [11, 14, 10]. They also require the user to specify a regularization parameter.
The standard methods for choosing the regularization parameter are AIC [1], BIC [19] and cross validation [6]. Though these methods have good theoretical properties in low dimensions, they are not suitable for high dimensional problems. In regression, cross-validation has been shown to overfit the data [22]. Likewise, AIC and BIC tend to perform poorly when the dimension is large relative to the sample size. Our simulations confirm that these methods perform poorly when used with glasso.
A new approach to model selection, based on model stability, has recently generated some interest in the literature [8]. The idea, as we develop it, is based on subsampling [15] and builds on the approach of Meinshausen and Bühlmann [12]. We draw many random subsamples and construct a graph from each subsample (unlike K-fold cross-validation, these subsamples are overlapping). We choose the regularization parameter so that the obtained graph is sparse and there is not too much variability across subsamples. More precisely, we start with a large regularization which corresponds to an empty, and hence highly stable, graph.
We gradually reduce the amount of regularization until there is a small but acceptable amount of variability of the graph across subsamples. In other words, we regularize to the point that we control the dissonance between graphs. The procedure is named StARS: Stability Approach to Regularization Selection. We study the performance of StARS by simulations and theoretical analysis in Sections 4 and 5. Although we focus here on graphical models, StARS is quite general and can be adapted to other settings including regression, classification, clustering, and dimensionality reduction.
In the context of clustering, results of stability methods have been mixed. Weaknesses of stability have been shown in [3]. However, the approach was successful for density-based clustering [17]. For graph selection, Meinshausen and Bühlmann [12] also used a stability criterion; however, their approach differs from StARS in its fundamental conception. They use subsampling to produce a new and more stable regularization path and then select a regularization parameter from this newly created path, whereas we propose to use subsampling to directly select one regularization parameter from the original path. Our aim is to ensure that the selected graph is sparse, but inclusive, while they aim to control the familywise type I errors. As a consequence, their goal is contrary to ours: instead of selecting a larger graph that contains the true graph, they try to select a smaller graph that is contained in the true graph. As we will discuss in Section 3, in specific application domains like gene regulatory network analysis, our goal for graph selection is more natural.

2 Estimating a High-dimensional Undirected Graph
Let X = (X(1), ..., X(p))^T be a random vector with distribution P. The undirected graph G = (V, E) associated with P has vertices V = {X(1), ..., X(p)} and a set of edges E corresponding to pairs of vertices.
In this paper, we also interchangeably use E to denote the adjacency matrix of the graph G. The edge corresponding to X(j) and X(k) is absent if X(j) and X(k) are conditionally independent given the other coordinates of X. The graph estimation problem is to infer E from i.i.d. observed data X1, ..., Xn where Xi = (Xi(1), ..., Xi(p))^T.
Suppose now that P is Gaussian with mean vector µ and covariance matrix Σ. Then the edge corresponding to X(j) and X(k) is absent if and only if Ωjk = 0 where Ω = Σ⁻¹. Hence, to estimate the graph we only need to estimate the sparsity pattern of Ω. When p could diverge with n, estimating Ω is difficult. A popular approach is the graphical lasso or glasso [7, 24, 2]. Using glasso, we estimate Ω as follows. Ignoring constants, the log-likelihood (after maximizing over µ) can be written as ℓ(Ω) = log|Ω| − trace(Σ̂Ω), where Σ̂ is the sample covariance matrix. With a positive regularization parameter λ, the glasso estimator Ω̂(λ) is obtained by minimizing the regularized negative log-likelihood
Ω̂(λ) = argmin_{Ω ≻ 0} { −ℓ(Ω) + λ‖Ω‖₁ }   (1)
where ‖Ω‖₁ = Σ_{j,k} |Ωjk| is the elementwise ℓ1-norm of Ω. The estimated graph Ĝ(λ) = (V, Ê(λ)) is then easily obtained from Ω̂(λ): for i ≠ j, an edge (i, j) ∈ Ê(λ) if and only if the corresponding entry in Ω̂(λ) is nonzero. Friedman et al. [7] give a fast algorithm for calculating Ω̂(λ) over a grid of λ's ranging from small to large. By taking advantage of the fact that the objective function in (1) is convex, their algorithm iteratively estimates a single row (and column) of Ω in each iteration by solving a lasso regression [21]. The resulting regularization path Ω̂(λ) for all λ's has been shown to have excellent theoretical properties [18, 16]. For example, Ravikumar et al. [16] show that, if the regularization parameter λ satisfies a certain rate, the corresponding estimator Ω̂(λ) could recover the true graph with high probability. However, these types of results are either asymptotic or non-asymptotic but with very large constants. They are not practical enough to guide the choice of the regularization parameter λ in finite-sample settings.

3 Regularization Selection
In Equation (1), the choice of λ is critical because λ controls the sparsity level of Ĝ(λ). Larger values of λ tend to yield sparser graphs and smaller values of λ yield denser graphs. It is convenient to define Λ = 1/λ so that small Λ corresponds to a more sparse graph. In particular, Λ = 0 corresponds to the empty graph with no edges. Given a grid of regularization parameters Gn = {Λ1, ..., ΛK}, our goal of graph regularization parameter selection is to choose one Λ̂ ∈ Gn such that the true graph E is contained in Ê(Λ̂) with high probability. In other words, we want to "overselect" instead of "underselect". Such a choice is motivated by application problems like gene regulatory network reconstruction, in which we aim to study the interactions of many genes. For these types of studies, we tolerate some false positives but not false negatives. Specifically, it is acceptable for an edge to be present in the estimated graph even though the two genes corresponding to this edge do not really interact with each other. Such false positives can generally be screened out by more fine-tuned downstream biological experiments. However, if one important interaction edge is omitted at the beginning, it is very difficult to re-discover it by follow-up analysis.
There is also a tradeoff: we want to select a denser graph which contains the true graph with high probability. At the same time, we want the graph to be as sparse as possible so that important information will not be buried by massive false positives. Based on this rationale, an "underselect" method, like the approach of Meinshausen and Bühlmann [12], does not really fit our goal. In the following, we start with an overview of several state-of-the-art regularization parameter selection methods for graphs. We then introduce our new StARS approach.

3.1 Existing Methods
The regularization parameter is often chosen using AIC or BIC. Let Ω̂(Λ) denote the estimator corresponding to Λ. Let d(Λ) denote the degree of freedom (or the effective number of free parameters) of the corresponding Gaussian model. AIC chooses Λ to minimize −2ℓ(Ω̂(Λ)) + 2d(Λ) and BIC chooses Λ to minimize −2ℓ(Ω̂(Λ)) + d(Λ) · log n. The usual theoretical justification for these methods assumes that the dimension p is fixed as n increases; however, in the case where p > n this justification is not applicable. In fact, it is not even straightforward how to estimate the degree of freedom d(Λ) when p is larger than n. A common practice is to calculate d(Λ) as d(Λ) = m(Λ)(m(Λ) − 1)/2 + p where m(Λ) denotes the number of nonzero elements of Ω̂(Λ). As we will see in our experiments, AIC and BIC tend to select overly dense graphs in high dimensions.
Another popular method is K-fold cross-validation (K-CV). For this procedure the data is partitioned into K subsets. Of the K subsets one is retained as the validation data, and the remaining K − 1 ones are used as training data.
For each Λ ∈ Gn, we estimate a graph on the K − 1 training sets and evaluate the negative log-likelihood on the retained validation set. The results are averaged over all K folds to obtain a single CV score. We then choose Λ to minimize the CV score over the whole grid Gn. In regression, cross-validation has been shown to overfit [22]. Our experiments will confirm this is true for graph estimation as well.

3.2 StARS: Stability Approach to Regularization Selection
The StARS approach is to choose Λ based on stability. When Λ is 0, the graph is empty and two datasets from P would both yield the same graph. As we increase Λ, the variability of the graph increases and hence the stability decreases. We increase Λ just until the point where the graph becomes variable as measured by the stability. StARS leads to a concrete rule for choosing Λ.
Let b = b(n) be such that 1 < b(n) < n. We draw N random subsamples S1, ..., SN from X1, ..., Xn, each of size b. There are (n choose b) such subsamples. Theoretically one uses all (n choose b) subsamples. However, Politis et al. [15] show that it suffices in practice to choose a large number N of subsamples at random. Note that, unlike bootstrapping [6], each subsample is drawn without replacement. For each Λ ∈ Gn, we construct a graph using the glasso for each subsample. This results in N estimated edge matrices Ê^b_1(Λ), ..., Ê^b_N(Λ). Focus for now on one edge (s, t) and one value of Λ. Let ψ^Λ(·) denote the glasso algorithm with the regularization parameter Λ. For any subsample Sj let ψ^Λ_st(Sj) = 1 if the algorithm puts an edge between (s, t) and ψ^Λ_st(Sj) = 0 if the algorithm does not put an edge between (s, t). Define θ^b_st(Λ) = P(ψ^Λ_st(X1, ..., Xb) = 1). To estimate θ^b_st(Λ), we use a U-statistic of order b, namely,
θ̂^b_st(Λ) = (1/N) Σ_{j=1}^{N} ψ^Λ_st(Sj).
Define the parameter ξ^b_st(Λ) = 2θ^b_st(Λ)(1 − θ^b_st(Λ)) and its estimate ξ̂^b_st(Λ) = 2θ̂^b_st(Λ)(1 − θ̂^b_st(Λ)). The quantity ξ̂^b_st(Λ) is the fraction of times two graphs built on different subsamples disagree on the edge (s, t); it measures the instability of that edge across subsamples. The total instability averages over all edges: D̂_b(Λ) = Σ_{s<t} ξ̂^b_st(Λ) / (p(p − 1)/2), with population version D_b(Λ) defined analogously from ξ^b_st(Λ). Since D̂_b(Λ) is not necessarily monotone in Λ, we work with the monotonized quantity D̄_b(Λ) = sup_{0≤t≤Λ} D̂_b(t). For a specified cut point β > 0, StARS selects Λ̂_s = sup{Λ ∈ Gn : D̄_b(Λ) ≤ β}.

4 Theory
We first establish concentration of D̂_b(Λ) around D_b(Λ). Since θ̂^b_st(Λ) is a U-statistic of order b with a kernel bounded in [0, 1], Hoeffding's inequality for U-statistics [20] gives, for each Λ ∈ Gn, each edge (s, t), and any ε > 0,
P(|θ̂^b_st(Λ) − θ^b_st(Λ)| > ε) ≤ 2 exp(−2nε²/b).   (4)
Now ξ̂^b_st(Λ) is just a function of the U-statistic θ̂^b_st(Λ). Note that
|ξ̂^b_st(Λ) − ξ^b_st(Λ)| = 2|θ̂^b_st(Λ)(1 − θ̂^b_st(Λ)) − θ^b_st(Λ)(1 − θ^b_st(Λ))|   (5)
 = 2|θ̂^b_st(Λ) − (θ̂^b_st(Λ))² − θ^b_st(Λ) + (θ^b_st(Λ))²|   (6)
 ≤ 2|θ̂^b_st(Λ) − θ^b_st(Λ)| + 2|(θ̂^b_st(Λ))² − (θ^b_st(Λ))²|   (7)
 = 2|θ̂^b_st(Λ) − θ^b_st(Λ)| + 2|θ̂^b_st(Λ) + θ^b_st(Λ)| · |θ̂^b_st(Λ) − θ^b_st(Λ)|   (8)
 ≤ 2|θ̂^b_st(Λ) − θ^b_st(Λ)| + 4|θ̂^b_st(Λ) − θ^b_st(Λ)|   (9)
 = 6|θ̂^b_st(Λ) − θ^b_st(Λ)|,   (10)
so that |ξ̂^b_st(Λ) − ξ^b_st(Λ)| ≤ 6|θ̂^b_st(Λ) − θ^b_st(Λ)|. Using (4) and the union bound over all the edges, we obtain: for each Λ ∈ Gn,
P(max_{s<t} |ξ̂^b_st(Λ) − ξ^b_st(Λ)| > 6ε) ≤ 2p² exp(−2nε²/b).   (11)
Since D̂_b(Λ) is an average of the ξ̂^b_st(Λ), we have |D̂_b(Λ) − D_b(Λ)| ≤ max_{s<t} |ξ̂^b_st(Λ) − ξ^b_st(Λ)|. Using two union bound arguments over the K values of Λ and all the p(p − 1)/2 edges, we have:
P(max_{Λ∈Gn} |D̂_b(Λ) − D_b(Λ)| ≥ ε) ≤ |Gn| · (p(p − 1)/2) · P(max_{s<t} |ξ̂^b_st(Λ) − ξ^b_st(Λ)| > ε)   (12)
 ≤ K · p⁴ · exp(−nε²/(18b)).   (13)
In particular, the right-hand side vanishes whenever the scaling of n, K, b, p satisfies
K · p⁴ · exp(−nε²/(18b)) → 0 as n → ∞.   (14)
We can now state the partial sparsistency result. Assume:
(A1) there exists Λo ∈ Gn such that max_{Λ≤Λo, Λ∈Gn} D_b(Λ) ≤ β/2 for all large n;
(A2) for any Λ ∈ Gn with Λ ≥ Λo, P(E ⊂ Ê_b(Λ)) → 1 as n → ∞.
Let Λ̂_s ∈ Gn be the parameter selected by StARS with cut point β > 0, and suppose n, K, b, p scale so that (14) holds with ε = β/2. Then
P(E ⊂ Ê_b(Λ̂_s)) → 1 as n → ∞.   (15)
Proof. We define An to be the event that max_{Λ∈Gn} |D̂_b(Λ) − D_b(Λ)| ≤ β/2. The scaling of n, K, b, p in the theorem satisfies the L.H.S. of (14), which implies that P(An) → 1 as n → ∞. Using (A1), we know that, on An,
max_{Λ≤Λo∧Λ∈Gn} D̂_b(Λ) ≤ max_{Λ∈Gn} |D̂_b(Λ) − D_b(Λ)| + max_{Λ≤Λo∧Λ∈Gn} D_b(Λ) ≤ β.   (16)
This implies that, on An, Λ̂_s ≥ Λo. The result follows by applying (A2) and a union bound.

5 Experimental Results
We now provide empirical evidence to illustrate the usefulness of StARS and compare it with several state-of-the-art competitors, including 10-fold cross-validation (K-CV), BIC, and AIC. For StARS we always use subsampling block size b(n) = ⌊10·√n⌋ and set the cut point β = 0.05. We first quantitatively evaluate these methods on two types of synthetic datasets, where the true graphs are known. We then illustrate StARS on a microarray dataset that records the gene expression levels from immortalized B cells of human subjects. On all high dimensional synthetic datasets, StARS significantly outperforms its competitors. On the microarray dataset, StARS obtains a remarkably simple graph while all competing methods select what appear to be overly dense graphs.

5.1 Synthetic Data
To quantitatively evaluate the graph estimation performance, we adopt the criteria of precision, recall, and F1-score from the information retrieval literature. Let G = (V, E) be a p-dimensional graph and let Ĝ = (V, Ê) be an estimated graph. We define precision = |Ê ∩ E|/|Ê|, recall = |Ê ∩ E|/|E|, and F1-score = 2 · precision · recall/(precision + recall).
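To make the selection rule of Section 3.2 concrete, here is a minimal sketch (our own illustration, not the authors' code): estimate θ̂^b_st(Λ) by subsampling without replacement, form the edge instabilities 2θ̂(1 − θ̂), average them into D̂_b(Λ), monotonize, and keep the largest Λ whose monotonized instability stays below β. The `fit_edges` callback is hypothetical; in practice it would run the glasso on each subsample.

```python
import numpy as np

def stars_select(X, lambdas, fit_edges, b=None, N=100, beta=0.05, seed=0):
    """Stability selection of the regularization parameter (StARS sketch).

    X        : (n, p) data matrix
    lambdas  : candidates ordered from strong to weak regularization,
               i.e. by increasing Lambda = 1/lambda
    fit_edges: hypothetical callback (X_sub, lam) -> boolean (p, p) adjacency
    """
    n, p = X.shape
    if b is None:
        b = int(10 * np.sqrt(n))   # block size used in the paper
    b = min(b, n)
    rng = np.random.default_rng(seed)
    d_hat = []
    for lam in lambdas:
        # theta[s, t]: fraction of subsample graphs containing edge (s, t)
        theta = np.zeros((p, p))
        for _ in range(N):
            idx = rng.choice(n, size=b, replace=False)  # without replacement
            theta += fit_edges(X[idx], lam)
        theta /= N
        xi = 2 * theta * (1 - theta)        # instability of each edge
        iu = np.triu_indices(p, k=1)
        d_hat.append(xi[iu].mean())         # total instability D_hat_b
    d_bar = np.maximum.accumulate(d_hat)    # monotonize along increasing Lambda
    ok = [i for i, d in enumerate(d_bar) if d <= beta]
    return lambdas[ok[-1]] if ok else lambdas[0]
```

Returning the first (most regularized) candidate when nothing meets the cut point is our own fallback choice, not prescribed by the paper.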
In other words, precision is the number of correctly estimated edges divided by the total number of edges in the estimated graph; recall is the number of correctly estimated edges divided by the total number of edges in the true graph; the F1-score can be viewed as a weighted average of the precision and recall, reaching its best value at 1 and its worst at 0. On the synthetic data where we know the true graphs, we also compare the previous methods with an oracle procedure which selects the optimal regularization parameter by minimizing the total number of different edges between the estimated and true graphs along the full regularization path. Since this oracle procedure requires knowledge of the true graph, it is not a practical method. We only present it here to calibrate the inherent challenge of each simulated scenario. To make the comparison fair, once the regularization parameters are selected, we estimate the oracle and StARS graphs only based on a subsampled dataset with size b(n) = ⌊10·√n⌋. In contrast, the K-CV, BIC, and AIC graphs are estimated using the full dataset. More details about this issue were discussed in Section 3.
We generate data from two types of sparse Gaussian graphs, neighborhood graphs and hub graphs, which mimic characteristics of real-world biological networks. The mean is set to be zero and the covariance matrix Σ = Ω⁻¹. For both graphs, the diagonal elements of Ω are set to be one. More specifically:
1. Neighborhood graph: We first uniformly sample y1, ..., yp from a unit square. We then set Ωij = Ωji = ρ with probability (√2π)⁻¹ exp(−4‖yi − yj‖²). All the rest Ωij are set to be zero. The number of nonzero off-diagonal elements of each row or column is restricted to be smaller than ⌊1/ρ⌋. In this paper, ρ is set to be 0.245.
2.
Hub graph: The rows/columns are partitioned into J equally-sized disjoint groups V1 ∪ V2 ∪ ... ∪ VJ = {1, ..., p}, and each group is associated with a "pivotal" row k. Let |V1| = s. We set Ωik = Ωki = ρ for i ∈ Vk and Ωik = Ωki = 0 otherwise. In our experiment, J = ⌊p/s⌋, k = 1, s + 1, 2s + 1, ..., and we always set ρ = 1/(s + 1) with s = 20.
We generate synthetic datasets in both low-dimensional (n = 800, p = 40) and high-dimensional (n = 400, p = 100) settings. Table 1 provides comparisons of all methods, where we repeat the experiments 100 times and report the averaged precision, recall, and F1-score with their standard errors.

Table 1: Quantitative comparison of different methods on the datasets from the neighborhood and hub graphs.

Neighborhood graph (n = 800, p = 40):
Methods  Precision      Recall         F1-score
Oracle   0.9222 (0.05)  0.9070 (0.07)  0.9119 (0.04)
StARS    0.7204 (0.08)  0.9530 (0.05)  0.8171 (0.05)
K-CV     0.1394 (0.02)  1.0000 (0.00)  0.2440 (0.04)
BIC      0.9738 (0.03)  0.9948 (0.02)  0.9839 (0.01)
AIC      0.8696 (0.11)  0.9996 (0.01)  0.9236 (0.07)

Neighborhood graph (n = 400, p = 100):
Methods  Precision      Recall         F1-score
Oracle   0.7473 (0.09)  0.8001 (0.06)  0.7672 (0.07)
StARS    0.6366 (0.07)  0.8718 (0.06)  0.7352 (0.07)
K-CV     0.1383 (0.01)  1.0000 (0.00)  0.2428 (0.01)
BIC      0.1796 (0.11)  1.0000 (0.00)  0.2933 (0.13)
AIC      0.1279 (0.00)  1.0000 (0.00)  0.2268 (0.01)

Hub graph (n = 800, p = 40):
Methods  Precision      Recall         F1-score
Oracle   0.9793 (0.01)  1.0000 (0.00)  0.9895 (0.01)
StARS    0.4377 (0.02)  1.0000 (0.00)  0.6086 (0.02)
K-CV     0.2383 (0.09)  1.0000 (0.00)  0.3769 (0.01)
BIC      0.4879 (0.05)  1.0000 (0.00)  0.6542 (0.05)
AIC      0.2522 (0.09)  1.0000 (0.00)  0.3951 (0.00)

Hub graph (n = 400, p = 100):
Methods  Precision      Recall         F1-score
Oracle   0.8976 (0.02)  1.0000 (0.00)  0.9459 (0.01)
StARS    0.4572 (0.01)  1.0000 (0.00)  0.6274 (0.01)
K-CV     0.1574 (0.01)  1.0000 (0.00)  0.2719 (0.00)
BIC      0.2155 (0.00)  1.0000 (0.00)  0.3545 (0.01)
AIC      0.1676 (0.00)  1.0000 (0.00)  0.2871 (0.00)

For low-dimensional settings where n ≫ p, the BIC criterion is very competitive and performs the best among all the methods. In high dimensional settings, however, StARS clearly outperforms all the competing methods for both neighborhood and hub graphs. This is consistent with our theory. At first sight, it might be surprising that for data from low-dimensional neighborhood graphs, BIC and AIC even outperform the oracle procedure! This is due to the fact that both BIC and AIC graphs are estimated using all the n = 800 data points, while the oracle graph is estimated using only the subsampled dataset with size b(n) = ⌊10·√n⌋ = 282. Direct usage of the full sample is an advantage of model selection methods that take the general form of BIC and AIC. In high dimensions, however, we see that even with this advantage, StARS clearly outperforms BIC and AIC. The estimated graphs for different methods in the setting n = 400, p = 100 are provided in Figures 1 and 2, from which we see that the StARS graph is almost as good as the oracle, while the K-CV, BIC, and AIC graphs are overly dense.

Figure 1: Comparison of different methods on the data from the neighborhood graphs (n = 400, p = 100). Panels: (a) true graph, (b) oracle graph, (c) StARS graph, (d) K-CV graph, (e) BIC graph, (f) AIC graph.

5.2 Microarray Data
We apply StARS to a dataset based on Affymetrix GeneChip microarrays for the gene expression levels from immortalized B cells of human subjects. The sample size is n = 294. The expression levels for each array are pre-processed by log-transformation and standardization as in [13]. Using a sub-pathway subset of 324 correlated genes, we study the estimated graphs obtained from each method under investigation. The StARS and BIC graphs are provided in Figure 3.
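The precision/recall/F1 criteria of Section 5.1 reduce to simple set operations on edge sets; a minimal helper (our illustration, not the authors' evaluation code):

```python
def graph_scores(true_edges, est_edges):
    """Precision, recall, and F1-score for an estimated edge set.

    Edges are unordered pairs, normalized here to sorted tuples.
    """
    true_edges = {tuple(sorted(e)) for e in true_edges}
    est_edges = {tuple(sorted(e)) for e in est_edges}
    tp = len(true_edges & est_edges)          # correctly estimated edges
    precision = tp / len(est_edges) if est_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Example: true chain 1-2-3; the estimate adds a false edge and misses one.
print(graph_scores({(1, 2), (2, 3)}, {(1, 2), (1, 3)}))  # (0.5, 0.5, 0.5)
```

The empty-set guards are our own convention for degenerate graphs; an "overselecting" method like StARS drives recall toward 1 at some cost in precision, which is exactly the trade-off Table 1 quantifies.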
We see that the StARS graph is remarkably simple and informative, exhibiting some cliques and hub genes. In contrast, the BIC graph is very dense and possibly useful association information is buried in the large number of estimated edges. The graphs selected using AIC and K-CV are even denser than the BIC graph and will be reported elsewhere. A full treatment of the biological implications of these two graphs, validated by enrichment analysis, will be provided in the full version of this paper.

Figure 2: Comparison of different methods on the data from the hub graphs (n = 400, p = 100). Panels: (a) true graph, (b) oracle graph, (c) StARS graph, (d) K-CV graph, (e) BIC graph, (f) AIC graph.

Figure 3: Microarray data example. The StARS graph is more informative than the BIC graph. Panels: (a) StARS graph, (b) BIC graph.

6 Conclusions
The problem of estimating structure in high dimensions is very challenging. Casting the problem in the context of a regularized optimization has led to some success, but the choice of the regularization parameter is critical. We present a new method, StARS, for choosing this parameter in high dimensional inference for undirected graphs. Like Meinshausen and Bühlmann's stability selection approach [12], our method makes use of subsampling, but it differs substantially from their approach in both implementation and goals. For graphical models, we choose the regularization parameter directly based on the edge stability. Under mild conditions, StARS is partially sparsistent. However, even without these conditions, StARS has a simple interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. Empirically, we show that StARS works significantly better than existing techniques on both synthetic and microarray datasets.
Although we focus here on graphical models, our new method is generally applicable to many problems that involve estimating structure, including regression, classification, density estimation, clustering, and dimensionality reduction.

References
[1] Hirotsugu Akaike. Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, (2):267–281, 1973.
[2] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d'Aspremont. Model selection through sparse maximum likelihood estimation. Journal of Machine Learning Research, 9:485–516, March 2008.
[3] Shai Ben-David, Ulrike von Luxburg, and David Pal. A sober look at clustering stability. In Proceedings of the Conference of Learning Theory, pages 5–19. Springer, 2006.
[4] Arthur P. Dempster. Covariance selection. Biometrics, 28:157–175, 1972.
[5] David Edwards. Introduction to Graphical Modelling. Springer-Verlag Inc, 1995.
[6] Bradley Efron. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM [Society for Industrial and Applied Mathematics], 1982.
[7] Jerome H. Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2007.
[8] Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.
[9] Steffen L. Lauritzen. Graphical Models. Oxford University Press, 1996.
[10] Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10:2295–2328, 2009.
[11] Nicolai Meinshausen and Peter Bühlmann. High dimensional graphs and variable selection with the Lasso. The Annals of Statistics, 34:1436–1462, 2006.
[12] Nicolai Meinshausen and Peter Bühlmann. Stability selection. To appear in Journal of the Royal Statistical Society, Series B, Methodological, 2010.
[13] Renuka R. Nayak, Michael Kearns, Richard S. Spielman, and Vivian G. Cheung. Coexpression network based on natural variation in human gene expression reveals gene interactions and functions. Genome Research, 19(11):1953–1962, November 2009.
[14] Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746, 2009.
[15] Dimitris N. Politis, Joseph P. Romano, and Michael Wolf. Subsampling (Springer Series in Statistics). Springer, 1st edition, August 1999.
[16] Pradeep Ravikumar, Martin Wainwright, Garvesh Raskutti, and Bin Yu. Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. In Advances in Neural Information Processing Systems 22, Cambridge, MA, 2009. MIT Press.
[17] Alessandro Rinaldo and Larry Wasserman. Generalized density clustering. arXiv/0907.3454, 2009.
[18] Adam J. Rothman, Peter J. Bickel, Elizaveta Levina, and Ji Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
[19] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461–464, 1978.
[20] Robert J. Serfling. Approximation Theorems of Mathematical Statistics. John Wiley and Sons, 1980.
[21] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, Methodological, 58:267–288, 1996.
[22] Larry Wasserman and Kathryn Roeder. High dimensional variable selection. Annals of Statistics, 37(5A):2178–2201, January 2009.
[23] Joe Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 1990.
[24] Ming Yuan and Yi Lin.
Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.", "award": [], "sourceid": 834, "authors": [{"given_name": "Han", "family_name": "Liu", "institution": null}, {"given_name": "Kathryn", "family_name": "Roeder", "institution": null}, {"given_name": "Larry", "family_name": "Wasserman", "institution": null}]}