{"title": "Testing for Differences in Gaussian Graphical Models: Applications to Brain Connectivity", "book": "Advances in Neural Information Processing Systems", "page_first": 595, "page_last": 603, "abstract": "Functional brain networks are well described and estimated from data with Gaussian Graphical Models (GGMs), e.g.\\ using sparse inverse covariance estimators. Comparing functional connectivity of subjects in two populations calls for comparing these estimated GGMs. Our goal is to identify differences in GGMs known to have similar structure. We characterize the uncertainty of differences with confidence intervals obtained using a parametric distribution on parameters of a sparse estimator. Sparse penalties enable statistical guarantees and interpretable models even in high-dimensional and low-sample settings. Characterizing the distributions of sparse models is inherently challenging as the penalties produce a biased estimator. Recent work invokes the sparsity assumptions to effectively remove the bias from a sparse estimator such as the lasso. These distributions can be used to give confidence intervals on edges in GGMs, and by extension their differences. However, in the case of comparing GGMs, these estimators do not make use of any assumed joint structure among the GGMs. Inspired by priors from brain functional connectivity we derive the distribution of parameter differences under a joint penalty when parameters are known to be sparse in the difference. This leads us to introduce the debiased multi-task fused lasso, whose distribution can be characterized in an efficient manner. We then show how the debiased lasso and multi-task fused lasso can be used to obtain confidence intervals on edge differences in GGMs. 
We validate the techniques proposed on a set of synthetic examples as well as a neuro-imaging dataset created for the study of autism.", "full_text": "Testing for Differences in Gaussian Graphical Models:\n\nApplications to Brain Connectivity\n\nEugene Belilovsky 1,2,3, Gael Varoquaux 2, Matthew Blaschko 3\n\n1 University of Paris-Saclay, 2 INRIA, 3 KU Leuven\n\n{eugene.belilovsky, gael.varoquaux}@inria.fr, matthew.blaschko@esat.kuleuven.be\n\nAbstract\n\nFunctional brain networks are well described and estimated from data with Gaussian\nGraphical Models (GGMs), e.g. using sparse inverse covariance estimators.\nComparing functional connectivity of subjects in two populations calls for comparing\nthese estimated GGMs. Our goal is to identify differences in GGMs known\nto have similar structure. We characterize the uncertainty of differences with\nconfidence intervals obtained using a parametric distribution on parameters of a\nsparse estimator. Sparse penalties enable statistical guarantees and interpretable\nmodels even in high-dimensional and low-sample settings. Characterizing the\ndistributions of sparse models is inherently challenging as the penalties produce\na biased estimator. Recent work invokes the sparsity assumptions to effectively\nremove the bias from a sparse estimator such as the lasso. These distributions can\nbe used to give confidence intervals on edges in GGMs, and by extension their\ndifferences. However, in the case of comparing GGMs, these estimators do not\nmake use of any assumed joint structure among the GGMs. Inspired by priors from\nbrain functional connectivity we derive the distribution of parameter differences\nunder a joint penalty when parameters are known to be sparse in the difference.\nThis leads us to introduce the debiased multi-task fused lasso, whose distribution\ncan be characterized in an efficient manner. 
We then show how the debiased lasso\nand multi-task fused lasso can be used to obtain confidence intervals on edge\ndifferences in GGMs. We validate the techniques proposed on a set of synthetic\nexamples as well as a neuro-imaging dataset created for the study of autism.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Introduction\nGaussian graphical models describe interactions in many real-world systems well. For instance,\ncorrelations in brain activity reveal brain interactions between distant regions, a process known as\nfunctional connectivity. Functional connectivity is an interesting probe of brain mechanisms as\nit persists in the absence of tasks (the so-called \u201cresting-state\u201d) and is thus applicable to the study of\npopulations of impaired subjects, as in neurologic or psychiatric diseases [3]. From a formal\nstandpoint, Gaussian graphical models are well suited to estimate brain connections from functional\nMagnetic Resonance Imaging (fMRI) signals [28, 33]. A set of brain regions and related functional\nconnections is then called a functional connectome [31, 3]. Its variation across subjects can capture\ncognition [26, 27] or pathology [17, 3]. However, the effects of pathologies are often very small, as\nresting-state fMRI is a weakly-constrained and noisy imaging modality, and the number of subjects\nin a study is often small given the cost of imaging. Statistical power is then a major concern [2]. The\nstatistical challenge is to increase the power to detect differences between Gaussian graphical models\nin the small-sample regime.\nIn these settings, estimation and comparison of Gaussian graphical models fall in the range of\nhigh-dimensional statistics: the number of degrees of freedom in the data is small compared to\nthe dimensionality of the model. 
In this regime, sparsity-promoting \u21131-based penalties can make\nestimation well-posed and recover good estimation performance despite the scarcity of data [29, 10,\n22, 6, 1]. These encompass sparse regression methods such as the lasso or recovery methods such as\nbasis pursuit, and can be applied to estimation of Gaussian graphical models with approaches such as\nthe graphical lasso [10]. There is now a wide body of literature which demonstrates the statistical\nproperties of these methods [1]. Crucial to applications in medicine or neuroscience, recent work\ncharacterizes the uncertainty, with confidence intervals and p-values, of the parameters selected by\nthese methods [15, 16, 19, 12]. These works focus primarily on the lasso and graphical lasso.\nApproaches to estimate statistical significance on sparse models fall into several general categories:\n(a) non-parametric sampling-based methods which are inherently expensive and have difficult\nlimiting distributions [1, 24, 5], (b) characterizations of the distribution of new parameters that enter a\nmodel along a regularization path [19, 12], or (c) for a particular regularization parameter, debiasing\nthe solution to obtain a new consistent estimator with known distribution [16, 15, 30]. While some of\nthe latter work has been used to characterize confidence intervals on network edge selection, there is\nno result, to our knowledge, on the important problem of identifying differences in networks. Here\nthe confidence in the result is even more critical, as the differences are the direct outcome used for\nneuroscience research or medical practice, and it is important to provide the practitioner with a measure of\nthe uncertainty.\nHere, we consider the setting of two datasets known to have very similar underlying signals, but\nwhich individually may not be very sparse. 
A motivating example is determining the difference\nin brain networks of subjects from different groups: population analysis of connectomes [31, 17].\nRecent literature in neuroscience [20] has suggested functional networks are not sparse. On the\nother hand, differences in connections across subjects should be sparse. Indeed the link between\nfunctional and anatomical brain networks [13] suggests they should not differ drastically from one\nsubject to another. From a neuroscientific standpoint we are interested in determining which edges\ndiffer between two populations (e.g. autistic and non-autistic). Furthermore we want to provide\nconfidence intervals on our results. We particularly focus on the setting where one dataset is larger\nthan the other. In many applications it is more difficult to collect data from one group (e.g. individuals with\nspecific pathologies) than from another.\nWe introduce an estimator tailored to this goal: the debiased multi-task fused lasso. We show that,\nwhen the underlying parameter differences are indeed sparse, we can obtain a tractable Gaussian\ndistribution for the parameter difference. This closed-form distribution underpins accurate hypothesis\ntesting and confidence intervals. We then use the relationship between nodewise regression and the\ninverse covariance matrix to apply our estimator to learning differences of Gaussian graphical models.\nThe paper is organized as follows. In Section 2 we review previous work on learning of GGMs and\nthe debiased lasso. Section 3 discusses a joint debiasing procedure that specifically debiases the\ndifference estimator. In Section 3.1 we introduce the debiased multi-task fused lasso and show how\nit can be used to learn parameter differences in linear models. In Section 3.2, we show how these\nresults can be used for GGMs. 
In Section 4 we validate our approach on synthetic and fMRI data.\n2 Background and Related Work\nDebiased Lasso A central starting point for our work is the debiased lasso [30, 16]. Here one\nconsiders the linear regression model, Y = X\u03b2 + \u03b5, with data matrix X and output Y, corrupted by\nnoise \u03b5 \u223c N(0, \u03c3_\u03b5^2 I). The lasso estimator is formulated as follows:\n\n$$\\hat{\\beta}^{\\lambda} = \\arg\\min_{\\beta} \\frac{1}{n} \\|Y - X\\beta\\|^2 + \\lambda \\|\\beta\\|_1 \\qquad (1)$$\n\nThe KKT conditions give \u02c6k\u03bb = (1/n) X^T (Y \u2212 X \u02c6\u03b2\u03bb), where \u02c6k\u03bb is the subgradient of \u03bb\u2016\u03b2\u20161. The debiased\nlasso estimator [30, 16] is then formulated as \u02c6\u03b2\u03bb_u = \u02c6\u03b2\u03bb + M \u02c6k\u03bb for some M that is constructed to give\nguarantees on the asymptotic distribution of \u02c6\u03b2\u03bb_u. Note that this estimator is not strictly unbiased in the\nfinite sample case, but has a bias that rapidly approaches zero (w.r.t. n) if M is chosen appropriately,\nthe true regressor \u03b2 is indeed sparse, and the design matrix satisfies a certain restricted eigenvalue\nproperty [30, 16]. We decompose the difference of this debiased estimator and the truth as follows:\n\n$$\\hat{\\beta}^{\\lambda}_u - \\beta = \\frac{1}{n} M X^T \\epsilon - (M \\hat{\\Sigma} - I)(\\hat{\\beta} - \\beta) \\qquad (2)$$\n\nThe first term is Gaussian and the second term is responsible for the bias. Using H\u00f6lder\u2019s inequality\nthe second term can be bounded by \u2016M \u02c6\u03a3 \u2212 I\u2016\u221e \u2016\u02c6\u03b2 \u2212 \u03b2\u20161. The first factor can be bounded through\nan appropriate selection of M, while the second factor is bounded by our implicit sparsity assumptions\ncoming from lasso theory [1]. Two approaches from the recent literature discuss how one can select\nM to appropriately debias this estimate. 
In [30] it suffices to use nodewise regression to learn an\ninverse covariance matrix which guarantees constraints on \u2016M \u02c6\u03a3 \u2212 I\u2016\u221e. A second approach by [16]\nproposes to solve a quadratic program to directly minimize the variance of the debiased estimator\nwhile constraining \u2016M \u02c6\u03a3 \u2212 I\u2016\u221e to induce sufficiently small bias.\nIntuitively the construction of \u02c6\u03b2\u03bb_u allows us to trade variance and bias via the M matrix. This allows\nus to overcome a naive bias-variance tradeoff by leveraging the sparsity assumptions that bound\n\u2016\u02c6\u03b2 \u2212 \u03b2\u20161. In the sequel we expand this idea to the case of debiased parameter difference estimates\nand sparsity assumptions on the parameter differences.\nIn the context of GGMs, the debiased lasso can give us an estimator that asymptotically converges to\nthe partial correlations. As highlighted by [34] we can thus use the debiased lasso to obtain difference\nestimators with known distributions. This allows us to obtain confidence intervals on edge differences\nbetween Gaussian graphical models. We discuss this further in the sequel.\n\nGaussian Graphical Model Structure Learning A standard approach to estimating Gaussian\ngraphical models in high dimensions is to assume sparsity of the precision matrix and impose a\nconstraint which limits the number of non-zero entries of the precision matrix. This constraint can\nbe achieved with an \u21131-norm regularizer as in the popular graphical lasso [10]. Many variants of this\napproach that incorporate further structural assumptions have been proposed [14, 6, 23].\nAn alternative that induces sparsity on the precision matrix indirectly is the neighborhood \u21131\nregression of [22]. Here the authors make use of a long known property that connects the entries\nof the precision matrix to the problem of regression of one variable on all the others [21]. 
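
The property relating regression coefficients to the precision matrix can be checked numerically on a small example. The sketch below is illustrative only: the 3 x 3 precision matrix is a made-up toy value, and the helper names inverse_3x3 and regression_coefficients are hypothetical, introduced for this example rather than taken from [21] or [22]. It computes the population regression coefficients of one node on the remaining nodes from the covariance matrix and compares them with the ratio -Theta[v][i] / Theta[v][v] read off the precision matrix Theta.

```python
# Illustrative check (toy numbers): for X ~ N(0, inverse(Theta)), regressing
# node v on the other nodes gives coefficients -Theta[v][i] / Theta[v][v].

def inverse_3x3(a):
    '''Invert a 3 x 3 matrix with the cofactor (adjugate) formula.'''
    det = (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
           - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
           + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    inv = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            rows = [k for k in range(3) if k != i]
            cols = [k for k in range(3) if k != j]
            minor = (a[rows[0]][cols[0]] * a[rows[1]][cols[1]]
                     - a[rows[0]][cols[1]] * a[rows[1]][cols[0]])
            # Entry (j, i) of the inverse is cofactor (i, j) divided by det.
            inv[j][i] = (-1) ** (i + j) * minor / det
    return inv


def regression_coefficients(sigma, v):
    '''Population least-squares coefficients of node v on the two other nodes.'''
    a, b = [k for k in range(3) if k != v]
    # Solve the 2 x 2 normal equations Sigma_cc * beta = Sigma_cv by Cramer's rule.
    det = sigma[a][a] * sigma[b][b] - sigma[a][b] * sigma[b][a]
    beta_a = (sigma[a][v] * sigma[b][b] - sigma[a][b] * sigma[b][v]) / det
    beta_b = (sigma[a][a] * sigma[b][v] - sigma[b][a] * sigma[a][v]) / det
    return {a: beta_a, b: beta_b}


theta = [[2.0, -0.5, 0.0],
         [-0.5, 2.0, 0.8],
         [0.0, 0.8, 1.5]]    # toy precision matrix: nodes 0 and 2 are not connected
sigma = inverse_3x3(theta)   # corresponding covariance matrix
v = 0
beta = regression_coefficients(sigma, v)
expected = {i: -theta[v][i] / theta[v][v] for i in beta}
print(beta)
print(expected)
```

In this toy case the coefficient of node 2 in the regression of node 0 vanishes because the corresponding precision entry is zero, which is the link between zero regression coefficients and missing edges that the nodewise approach relies on.
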
This\nproperty is critical to our proposed estimation as it allows relating regression models to finding edges\nconnected to specific nodes in the GGM.\nGGMs have been found to be good at recovering the main brain networks from fMRI data [28, 33].\nYet, recent work in neuroscience has shown that the structural wiring of the brain does not correspond\nto a very sparse network [20], thus questioning the underlying assumption of sparsity often used\nto estimate brain network connectivity. On the other hand, for the problem of finding differences\nbetween networks in two populations, sparsity may be a valid assumption. It is well known that\nanatomical brain connections tend to closely follow functional ones [13]. Since anatomical networks\ndo not differ drastically we can surmise that two brain networks should not differ much even in the\npresence of pathologies. The statistical method we present here leverages sparsity in the difference of\ntwo networks, to yield well-behaved estimation and hypothesis testing in the low-sample regime. Most\nclosely related to our work, [35, 9] recently consider a different approach to estimating difference\nnetworks, but do not consider assigning significance to the detection of edges.\n3 Debiased Difference Estimation\nIn many applications one may be interested in learning multiple linear models from data that share\nmany parameters. Situations such as this arise often in neuroimaging and bioinformatics applications.\nWe can often improve the learning procedure of such models by incorporating fused penalties that\npenalize the \u2016\u00b7\u20161 norm of the parameter differences or \u2016\u00b7\u20161,2 which encourages groups of parameters\nto shrink together. These methods have been shown to substantially improve the learning of the\njoint models. 
However, the differences between model parameters, which can have a high sample\ncomplexity when there are few of them, are often pointed out only in passing [4, 6, 14]. On the\nother hand, in many situations we might be interested in actually understanding and identifying\nthe differences between elements of the support. For example when considering brain networks of\npatients suffering from a pathology and healthy control subjects, the difference in brain connectivity\nmay be of great interest. Here we focus specifically on accurately identifying differences with\nsignificance.\nWe consider the case of two tasks (e.g. two groups of subjects), but the analysis can be easily extended\nto general multi-task settings. Consider the problem setting of data matrices X1 and X2, which\nare n1 \u00d7 p and n2 \u00d7 p, respectively. We model them as producing outputs Y1 and Y2, corrupted by\ndiagonal Gaussian noise \u03b51 and \u03b52 as follows\n\n$$Y_1 = X_1\\beta_1 + \\epsilon_1, \\quad Y_2 = X_2\\beta_2 + \\epsilon_2 \\qquad (3)$$\n\nLet S1 and S2 index the elements of the support of \u03b21 and \u03b22, respectively. Furthermore the support\nof \u03b21 \u2212 \u03b22 is indexed by Sd and finally the union of S1 and S2 is denoted Sa. Using a squared loss\nestimator producing independent estimates \u02c6\u03b21, \u02c6\u03b22 we can obtain a difference estimate \u02c6\u03b2d = \u02c6\u03b21 \u2212 \u02c6\u03b22.\nIn general if Sd is very small relative to Sa then it will be difficult to identify the support\nSd. This can be seen if we consider each of the individual components of the prediction errors. The\nlarger the true support Sa the more it will drown out the subset which corresponds to the difference\nsupport. This can be true even if one uses \u21131 regularizers over the parameter vectors. Consequently,\none cannot rely on the straightforward strategy of learning two independent estimates and taking their\ndifference. 
The problem is particularly pronounced in the common setting where one group has fewer\nsamples than the other. Thus here we consider the setting where n1 > n2 and possibly n1 \u226b n2.\nLet \u02c6\u03b21 and \u02c6\u03b22 be regularized least squares estimates. In our problem setting we wish to obtain\nconfidence intervals on debiased versions of the difference \u02c6\u03b2d = \u02c6\u03b21 \u2212 \u02c6\u03b22 in a high-dimensional\nsetting (in the sense that n2 < p). We aim to leverage assumptions about the form of the true \u03b2d,\nprimarily that it is sparse, while the independent \u02c6\u03b21 and \u02c6\u03b22 are weakly sparse or not sparse. We\nconsider a general case of a joint regularized least squares estimation of \u02c6\u03b21 and \u02c6\u03b22:\n\n$$\\min_{\\beta_1, \\beta_2} \\frac{1}{n_1} \\|Y_1 - X_1\\beta_1\\|^2 + \\frac{1}{n_2} \\|Y_2 - X_2\\beta_2\\|^2 + R(\\beta_1, \\beta_2) \\qquad (4)$$\n\nDifferentiating and using the KKT conditions gives\n\n$$\\hat{k}_{\\lambda} = \\begin{bmatrix} \\hat{k}_1 \\\\ \\hat{k}_2 \\end{bmatrix} = \\begin{bmatrix} \\frac{1}{n_1} X_1^T (Y_1 - X_1\\hat{\\beta}_1) \\\\ \\frac{1}{n_2} X_2^T (Y_2 - X_2\\hat{\\beta}_2) \\end{bmatrix} \\qquad (5)$$\n\nwhere \u02c6k\u03bb is the (sub)gradient of R(\u03b21, \u03b22). Substituting Equation (3), and writing \u02c6\u03a3j = Xj^T Xj / nj for the sample covariance matrices, we can now write\n\n$$\\hat{\\Sigma}_1(\\hat{\\beta}_1 - \\beta_1) + \\hat{k}_1 = \\frac{1}{n_1} X_1^T \\epsilon_1 \\quad \\text{and} \\quad \\hat{\\Sigma}_2(\\hat{\\beta}_2 - \\beta_2) + \\hat{k}_2 = \\frac{1}{n_2} X_2^T \\epsilon_2 \\qquad (6)$$\n\nWe would like to solve for the difference \u02c6\u03b21 \u2212 \u02c6\u03b22 but the covariance matrices may not be invertible.\nWe introduce matrices M1 and M2, which will allow us to isolate the relevant term. We will see that\nin addition these matrices will allow us to decouple the bias and variance of the estimators.\n\n$$M_1\\hat{\\Sigma}_1(\\hat{\\beta}_1 - \\beta_1) + M_1\\hat{k}_1 = \\frac{1}{n_1} M_1 X_1^T \\epsilon_1 \\quad \\text{and} \\quad M_2\\hat{\\Sigma}_2(\\hat{\\beta}_2 - \\beta_2) + M_2\\hat{k}_2 = \\frac{1}{n_2} M_2 X_2^T \\epsilon_2 \\qquad (7)$$\n\nSubtracting these and rearranging we can now isolate the difference estimator plus a term we add\nback controlled by M1 and M2:\n\n$$(\\hat{\\beta}_1 - \\hat{\\beta}_2) - (\\beta_1 - \\beta_2) + M_1\\hat{k}_1 - M_2\\hat{k}_2 = \\frac{1}{n_1} M_1 X_1^T \\epsilon_1 - \\frac{1}{n_2} M_2 X_2^T \\epsilon_2 - \\Delta \\qquad (8)$$\n\n$$\\Delta = (M_1\\hat{\\Sigma}_1 - I)(\\hat{\\beta}_1 - \\beta_1) - (M_2\\hat{\\Sigma}_2 - I)(\\hat{\\beta}_2 - \\beta_2) \\qquad (9)$$\n\nDenoting \u03b2d := \u03b21 \u2212 \u03b22 and \u03b2a := \u03b21 + \u03b22, we can reformulate \u2206:\n\n$$\\Delta = \\frac{(M_1\\hat{\\Sigma}_1 - I + M_2\\hat{\\Sigma}_2 - I)}{2} (\\hat{\\beta}_d - \\beta_d) + \\frac{(M_1\\hat{\\Sigma}_1 - M_2\\hat{\\Sigma}_2)}{2} (\\hat{\\beta}_a - \\beta_a) \\qquad (10)$$\n\nHere, \u2206 will control the bias of our estimator. Additionally, we want to minimize its variance,\n\n$$\\frac{1}{n_1} M_1\\hat{\\Sigma}_1 M_1^T \\hat{\\sigma}_1^2 + \\frac{1}{n_2} M_2\\hat{\\Sigma}_2 M_2^T \\hat{\\sigma}_2^2. \\qquad (11)$$\n\nWe can now overcome the limitations of a simple bias-variance trade-off by using an appropriate\nregularizer coupled with an assumption on the underlying signal \u03b21 and \u03b22. This will in turn make \u2206\nasymptotically vanish while minimizing the variance.\nSince we are interested in pointwise estimates, we can focus on bounding the infinity norm of \u2206.\n\n$$\\|\\Delta\\|_\\infty \\le \\frac{1}{2} \\underbrace{\\|M_1\\hat{\\Sigma}_1 + M_2\\hat{\\Sigma}_2 - 2I\\|_\\infty}_{\\mu_1} \\underbrace{\\|\\hat{\\beta}_d - \\beta_d\\|_1}_{l_d} + \\frac{1}{2} \\underbrace{\\|M_1\\hat{\\Sigma}_1 - M_2\\hat{\\Sigma}_2\\|_\\infty}_{\\mu_2} \\underbrace{\\|\\hat{\\beta}_a - \\beta_a\\|_1}_{l_a} \\qquad (12)$$\n\nWe can control the maximum bias by selecting M1 and M2 appropriately. If we use an appropriate\nregularizer coupled with sparsity assumptions we can bound the terms la and ld and use this knowledge\nto appropriately select M1 and M2 such that the bias becomes negligible.\nIf we had only the\nindependent parameter sparsity assumption we could apply the results of the debiased lasso and\nestimate M1 and M2 independently as in [16]. In the case of interest where \u03b21 and \u03b22 share many\nweights we can do better by taking this as an assumption and applying a sparsity regularization on the\ndifference by adding the term \u03bb2\u2016\u03b21 \u2212 \u03b22\u20161. Comparing the decoupled penalty to the proposed fused penalty\nwe see that ld would decrease at a given sample size. We now show how to jointly estimate\nM1 and M2 so that \u2016\u2206\u2016\u221e becomes negligible for a given n, p and sparsity assumption.\n3.1 Debiasing the Multi-Task Fused Lasso\nMotivated by the inductive hypothesis from neuroscience described above we introduce a consistent\nlow-variance estimator, the debiased multi-task fused lasso. 
We propose to use the following\nregularizer R(\u03b21, \u03b22) = \u03bb1\u2016\u03b21\u20161 + \u03bb1\u2016\u03b22\u20161 + \u03bb2\u2016\u03b21 \u2212 \u03b22\u20161. This penalty has been referred to in\nsome literature as the multi-task fused lasso [4]. We propose to then debias this estimate as shown in\n(8). We estimate the M1 and M2 matrices by solving the following QP for each row m1 and m2 of\nthe matrices M1 and M2.\n\n$$\\min_{m_1, m_2} \\frac{1}{n_1} m_1^T \\hat{\\Sigma}_1 m_1 + \\frac{1}{n_2} m_2^T \\hat{\\Sigma}_2 m_2 \\quad \\text{s.t.} \\quad \\|M_1\\hat{\\Sigma}_1 + M_2\\hat{\\Sigma}_2 - 2I\\|_\\infty \\le \\mu_1, \\; \\|M_1\\hat{\\Sigma}_1 - M_2\\hat{\\Sigma}_2\\|_\\infty \\le \\mu_2 \\qquad (13)$$\n\nThis directly minimizes the variance, while bounding the bias in the constraint. We now show how to\nset the bounds:\nProposition 1. Take \u03bb1 > 2\u221a(log p / n2) and \u03bb2 = O(\u03bb1). Denote sd the difference sparsity, s1,2 the\nparameter sparsity |S1| + |S2|, c > 1, a > 1, and 0 < m \u226a 1. When the compatibility condition\n[1, 11] holds the following bounds give la \u00b52 = o(1) and ld \u00b51 = o(1) and thus \u2016\u2206\u2016\u221e = o(1) with\nhigh probability.\n\n$$\\mu_1 \\le \\frac{1}{c \\lambda_2 s_d n_2^m} \\quad \\text{and} \\quad \\mu_2 \\le \\frac{1}{a(\\lambda_1 s_{1,2} + \\lambda_2 s_d) n_2^m} \\qquad (14)$$\n\nThe proof is given in the supplementary material. Using the prescribed Ms obtained with (13) and\n(14) we obtain an unbiased estimator given by (8) with variance (11).\n3.2 GGM Difference Structure Discovery with Significance\nThe debiased lasso and the debiased multi-task fused lasso, proposed in the previous section, can be\nused to learn the structure of a difference of Gaussian graphical models and to provide significance\nresults on the presence of edges within the difference graph. 
We refer to these two procedures as\nDifference of Neighborhoods Debiased Lasso Selection and Difference of Neighborhoods Debiased\nFused Lasso Selection.\nWe recall that the conditional independence properties of a GGM are given by the zeros of the\nprecision matrix and these zeros correspond to the zeros of regression parameters when regressing\none variable on all the others. As noted in [34], obtaining a debiased lasso estimate for each node in the graph\nleads to a sparse unbiased precision matrix estimate with a known asymptotic distribution.\nSubtracting these estimates for two different datasets gives us a difference estimate whose zeros\ncorrespond to no difference of graph edges in two GGMs. We can similarly use the debiased multi-task\nfused lasso described above and the joint debiasing procedure to obtain a test statistic for the\ndifference of networks. We now formalize this procedure.\n\nNotation Consider GGMs j = 1, 2, and let Xj denote the random variable in R^p associated with GGM\nj. We denote Xj,v the random variable associated with a node v of the GGM and Xj,vc all other\nnodes in the graph. We denote \u02c6\u03b2j,v the lasso or multi-task fused lasso estimate obtained by regressing\nXj,v on Xj,vc; then \u02c6\u03b2j,dL,v is the debiased version of \u02c6\u03b2j,v. Finally let \u03b2j,v denote the unknown regression,\nXj,v = Xj,vc\u03b2j,v + \u03b5j where \u03b5j \u223c N(0, \u03c3jI). 
Define \u02c6\u03b2^i_{D,v} = \u02c6\u03b2^i_{1,dL,v} \u2212 \u02c6\u03b2^i_{2,dL,v}, the test statistic\nassociated with the edge v, i in the difference of GGMs j = 1, 2.\n\nAlgorithm 1 Difference Network Selection with Neighborhood Debiased Lasso\nV = {1, ..., P}\nNxP Data Matrices, X1 and X2\nPx(P-1) Output Matrix B of test statistics\nfor v \u2208 V do\n  Estimate unbiased \u02c6\u03c31, \u02c6\u03c32 from X1,v, X2,v\n  for j \u2208 {1, 2} do\n    \u03b2j \u2190 SolveLasso(Xj,vc, Xj,v)\n    Mj \u2190 MEstimator(Xj,vc)\n    \u03b2j,U \u2190 \u03b2j + (1/nj) Mj X^T_{j,vc} (Xj,v \u2212 Xj,vc \u03b2j)\n  end for\n  \u03c3^2_d \u2190 diag((\u02c6\u03c31^2/n1) M1 \u02c6\u03a31 M1^T + (\u02c6\u03c32^2/n2) M2 \u02c6\u03a32 M2^T)\n  for i \u2208 vc do\n    B_{v,i} = (\u03b21,U,i \u2212 \u03b22,U,i) / \u221a(\u03c3^2_{d,i})\n  end for\nend for\n\nAlgorithm 2 Difference Network Selection with Neighborhood Debiased Fused Lasso\nV = {1, ..., P}\nNxP Data Matrices, X1 and X2\nPx(P-1) Output Matrix B of test statistics\nfor v \u2208 V do\n  Estimate unbiased \u02c6\u03c31, \u02c6\u03c32 from X1,v, X2,v\n  \u03b21, \u03b22 \u2190 FusedLasso(X1,vc, X1,v, X2,vc, X2,v)\n  M1, M2 \u2190 MEstimator(X1,vc, X2,vc)\n  for j \u2208 {1, 2} do\n    \u03b2j,U \u2190 \u03b2j + (1/nj) Mj X^T_{j,vc} (Xj,v \u2212 Xj,vc \u03b2j)\n  end for\n  \u03c3^2_d \u2190 diag((\u02c6\u03c31^2/n1) M1 \u02c6\u03a31 M1^T + (\u02c6\u03c32^2/n2) M2 \u02c6\u03a32 M2^T)\n  for i \u2208 vc do\n    B_{v,i} = (\u03b21,U,i \u2212 \u03b22,U,i) / \u221a(\u03c3^2_{d,i})\n  end for\nend for\n\nProposition 2. Let \u02c6\u03b2^i_{D,v}, M1 and M2 be computed as in [16] for the debiased lasso or as\nin Section 3.1 for the debiased multi-task fused lasso. 
When the respective assumptions of these\nestimators are satisfied the following holds w.h.p.\n\n$$\\hat{\\beta}^i_{D,v} - \\beta^i_{D,v} = W + o(1) \\quad \\text{where} \\quad W \\sim N(0, [\\sigma_1^2 M_1\\hat{\\Sigma}_1 M_1^T + \\sigma_2^2 M_2\\hat{\\Sigma}_2 M_2^T]_{i,i}) \\qquad (15)$$\n\nThis follows directly from the asymptotic consistency of each individual \u02c6\u03b2^i_{j,dL,v} for the debiased\nlasso and multi-task fused lasso.\nWe can now define the null hypothesis of interest as H0 : \u03981,(i,j) = \u03982,(i,j). Obtaining a test\nstatistic for each element \u03b2^i_{D,v} allows us to perform hypothesis testing on individual edges, all the\nedges, or groups of edges (controlling for the FWER). We summarize the Neighbourhood Debiased\nLasso Selection process in Algorithm 1 and the Neighbourhood Debiased Multi-Task Fused Lasso\nSelection in Algorithm 2 which can be used to obtain a matrix of all the relevant test statistics.\n4 Experiments\n4.1 Simulations\nWe generate synthetic data based on two Gaussian graphical models with 75 vertices. Each of the\nindividual graphs has a sparsity of 19% and their difference sparsity is 3%. We construct the\nmodels by taking two identical precision matrices and randomly removing some edges from both.\nWe generate synthetic data using both precision matrices. We use n1 = 800 samples for the first\ndataset and vary the size of the second dataset, n2 = 20, 30, ..., 150.\nWe perform a regression using the debiased lasso and the debiased multi-task fused lasso on each\nnode of the graphs. As an extra baseline we consider the projected ridge method from the R package\n\u201chdi\u201d [7]. We use the debiased lasso of [16] where we set \u03bb = k\u02c6\u03c3\u221a(log p/n). We select k by 3-fold\ncross validation over k = {0.1, ..., 100} and M as prescribed in [16] which we obtain by solving a\nquadratic program. \u02c6\u03c3 is an unbiased estimator of the noise variance. For the debiased multi-task fused lasso we let\nboth \u03bb1 = k1 \u02c6\u03c32 \u221a(log p/n2) and \u03bb2 = k2 \u02c6\u03c32 \u221a(log p/n2), and select based on 3-fold cross-validation\nfrom the same range as k. M1 and M2 are obtained as in Equation (13) with the bounds (14) being\nset with c = a = 2, sd = 2, s1,2 = 15, m = 0.01, and the cross validated \u03bb1 and \u03bb2. In both\ndebiased lasso and fused multi-task lasso cases we utilize the Mosek QP solver package to obtain M.\nFor the projected ridge method we use the hdi package to obtain two estimates of \u03b21 and \u03b22 along\nwith their upper bounded biases which are then used to obtain p-values for the difference.\nWe report the false positive rate, the power, the coverage and interval length as per [30] for the\ndifference of graphs. In these experiments we aggregate statistics to demonstrate power of the test\nstatistic; as such we consider each edge as a separate test and do not perform corrections. Table\n1 gives the numerical results for n2 = 60: the power and coverage are substantially better for the\ndebiased fused multi-task lasso, while at the same time the confidence intervals are smaller.\n\nMethod | FP | TP (Power) | Cov S | Cov S^c_d | len S | len S^c_d\nDeb. Lasso | 3.7% | 80.6% | 96.2% | 92% | 2.199 | 2.195\nDeb. Fused Lasso | 0.0% | 93.3% | 100% | 98.6% | 2.191 | 2.041\nRidge Projection | 0.0% | 18.6% | 100% | 100% | 5.544 | 5.544\n\nTable 1: Comparison of Debiased Lasso, Debiased\nFused Lasso, and Projected Ridge Regression for edge\nselection in difference of GGM. The significance level\nis 5%, n1 = 800 and n2 = 60. All methods have\nfalse positive rates below the significance level and the\ndebiased fused lasso dominates in terms of power. 
The\ncoverage of the difference support and non-difference\nsupport is also best for the debiased fused lasso, which\nsimultaneously has smaller confidence intervals on\naverage.\n\nFigure 1: Power of the test for different numbers of\nsamples in the second simulation, with n1 = 800. The\ndebiased fused lasso has highest statistical power.\n\nFigure 1 shows the power of the test for different values of n2. The fused lasso outperforms the other\nmethods substantially. Projected ridge regression is particularly weak in this scenario, as it uses a\nworst case p-value obtained using an estimate of an upper bound on the bias [7].\n4.2 Autism Dataset\nCorrelations in brain activity measured via fMRI reveal functional interactions between remote brain\nregions [18]. In population analysis, they are used to measure how connectivity varies between\ndifferent groups. Such analysis of brain function is particularly important in psychiatric diseases\nthat have no known anatomical support: the brain functions in a pathological way, but nothing\nabnormal is clearly visible in the brain tissues. Autism spectrum disorder is a typical example of such\nan ill-understood psychiatric disease. Resting-state fMRI is accumulated in an effort to shed light on\nthis disease\u2019s mechanisms by comparing the connectivity of autism patients versus control subjects. The\nABIDE (Autism Brain Imaging Data Exchange) dataset [8] gathers rest-fMRI from 1112 subjects,\nwith 539 individuals suffering from autism spectrum disorder and 573 typical controls. We\nuse the preprocessed and curated data (see footnote 1).\nIn a connectome analysis [31, 26], each subject is described by a GGM measuring functional\nconnectivity between a set of regions. We build a connectome from brain regions of interest based on\na multi-subject atlas (see footnote 2) of 39 functional regions derived from resting-state fMRI [32] (see Fig. 
4).\nWe are interested in determining edge differences between the autism group and the control group.\nWe use this data to show how our parametric hypothesis test can be used to determine differences in\nbrain networks. Since no ground truth exists for this problem, we use permutation testing to evaluate\nthe statistical procedures [25, 5]. Here we permute the two conditions (e.g. autism and control group)\nto compute a p-value and compare it to our test statistics. This provides us with a strict finite-sample\ncontrol on the error rate: a non-parametric validation of our parametric test.\nFor our experiments we take 2000 randomly chosen volumes from the control group subjects and\n100 volumes from the autism group subjects. We perform permutation testing using the de-biased\nlasso, de-biased multi-task fused lasso, and projected ridge regression. Parameters for the de-biased\nfused lasso are chosen as in the previous section. For the de-biased lasso we use the exact settings for\n\u03bb and constraints on M provided in the experimental section of [16]. Projected ridge regression is\nevaluated as in the previous section.\nFigure 2 shows a comparison of three parametric approaches versus their analogues obtained with\na permutation test. The chart plots the permutation p-values of each entry in the 38 \u00d7 39 B matrix\nagainst the expected parametric p-value. For all the methods the points are above the line, indicating that\nthe tests are not breaching the expected false positive rates. However, the de-biased lasso and ridge\nprojection are very conservative and lead to few detections. 
The de-biased multi-task fused lasso\nyields far more detections on the same dataset, within the expected false positive rate or near it.\n\n1 http://preprocessed-connectomes-project.github.io/abide/\n2 https://team.inria.fr/parietal/research/spatial_patterns/spatial-patterns-in-resting-state/\n\n[Figure 1 plot: power (0.0 to 1.0) against n2 (30 to 150) for ridge, lasso, and fused lasso.]\n\nFigure 2: Permutation testing comparing debiased fused lasso, debiased lasso, and projected ridge regression\non the ABIDE dataset. The chart plots the permutation p-values of each method on each possible edge against\nthe expected parametric p-value. The debiased lasso and ridge projection are very conservative and lead to few\ndetections. The fused lasso yields far more detections on the same dataset, almost all within the expected false\npositive rate.\n\nFigure 3: Reproducibility of results from sub-sampling\nusing uncorrected error rate. The fused lasso is much\nmore likely to detect edges and produce stable results.\nUsing corrected p-values no detections are made by\nlasso (figure in supplementary material).\n\nFigure 4: Outlines of the regions of the MSDL atlas.\n\nFigure 5: Connectome of repeatedly picked up edges\nin 100 trials. We only show edges selected more than\nonce. Darker red indicates more frequent selection.\n\nWe now analyse the reproducibility of the results by repeatedly sampling 100 subsets of the data (with\nthe same proportions n1 = 2000 and n2 = 100), obtaining the matrix of test statistics, and selecting edges\nthat fall below the 5% significance level. Figure 3 shows how often edges are selected multiple times\nacross subsamples. We report results with a threshold on uncorrected p-values, as the lasso procedure\nselects no edges with multiple comparison correction (supplementary materials give FDR-corrected\nresults for the de-biased fused multi-task lasso selection). 
Figure 5 shows a connectome of the edges\nfrequently selected by the de-biased fused multi-task lasso (with FDR correction).\n\n5 Conclusions\n\nWe have shown how to characterize the distribution of differences of sparse estimators and how to use\nthis distribution for confidence intervals and p-values on GGM network differences. For this purpose,\nwe have introduced the de-biased multi-task fused lasso. We have demonstrated on synthetic and\nreal data that this approach can provide accurate p-values and a sizable increase in statistical power\ncompared to standard procedures. The settings match those of population analysis for functional\nbrain connectivity, and the gain in statistical power is direly needed to tackle the low sample sizes [2].\nFuture work calls for expanding the analysis to cases with more than two groups as well as considering\nan \u2113_{1,2} penalty sometimes used at the group level [33]. Additionally, the squared loss objective\nexcessively optimizes prediction and could be modified to further lower the sample complexity in\nterms of parameter estimation.\n\n[Figure panels: parametric p-values against permutation p-values for fused lasso, lasso, and ridge, each with a zoom on the tail; fraction of connections occurring at least t times against number of occurrences t, for lasso and fused lasso, titled Reproducibility across subsamples.]\n\nAcknowledgements\nThis work is partially funded by Internal Funds KU Leuven, ERC Grant 259112, FP7-MC-CIG\n334380, DIGITEO 2013-0788D - SOPRANO, and ANR-11-BINF-0004 NiConnect.\n\nReferences\n[1] 
P. B\u00fchlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, 2011.\n[2] K. Button et al. Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14:365, 2013.\n[3] F. X. Castellanos et al. Clinical applications of the functional connectome. NeuroImage, 80:527, 2013.\n[4] X. Chen et al. Smoothing proximal gradient method for general structured sparse learning. In UAI, 2011.\n[5] B. Da Mota et al. Randomized parcellation based inference. NeuroImage, 89:203\u2013215, 2014.\n[6] P. Danaher, P. Wang, and D. Witten. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society (B), 76(2):373\u2013397, 2014.\n[7] R. Dezeure, P. B\u00fchlmann, L. Meier, and N. Meinshausen. High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statist. Sci., 30(4):533\u2013558, 2015.\n[8] A. Di Martino et al. The autism brain imaging data exchange: Towards a large-scale evaluation of the intrinsic brain architecture in autism. Mol. Psychiatry, 19:659, 2014.\n[9] F. Fazayeli and A. Banerjee. Generalized direct change estimation in Ising model structure. In ICML, 2016.\n[10] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432\u2013441, 2008.\n[11] A. Ganguly and W. Polonik. Local neighborhood fusion in locally constant Gaussian graphical models. arXiv:1410.8766, 2014.\n[12] M. G. G\u2019Sell, J. Taylor, and R. Tibshirani. Adaptive testing for the graphical lasso. arXiv:1307.4765, 2013.\n[13] C. Honey, O. Sporns, L. Cammoun, X. Gigandet, et al. Predicting human resting-state functional connectivity from structural connectivity. Proc. Nat. Acad. Sciences, 106:2035, 2009.\n[14] J. Honorio and D. Samaras. Multi-task learning of Gaussian graphical models. In ICML, 2010.\n[15] J. Jankov\u00e1 and S. van de Geer. 
Confidence intervals for high-dimensional inverse covariance estimation. Electron. J. Statist., 9(1):1205\u20131229, 2015.\n[16] A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869\u20132909, 2014.\n[17] C. Kelly, B. B. Biswal, R. C. Craddock, F. X. Castellanos, and M. P. Milham. Characterizing variation in the functional connectome: Promise and pitfalls. Trends in Cog. Sci., 16:181, 2012.\n[18] M. A. Lindquist et al. The statistical analysis of fMRI data. Stat. Sci., 23(4):439\u2013464, 2008.\n[19] R. Lockhart et al. A significance test for the lasso. Ann. Stat., 42:413, 2014.\n[20] N. T. Markov, M. Ercsey-Ravasz, D. C. Van Essen, K. Knoblauch, Z. Toroczkai, and H. Kennedy. Cortical high-density counterstream architectures. Science, 342(6158):1238406, 2013.\n[21] G. Marsaglia. Conditional means and covariances of normal variables with singular covariance matrix. Journal of the American Statistical Association, 59(308):1203\u20131204, 1964.\n[22] N. Meinshausen and P. B\u00fchlmann. High-dimensional graphs and variable selection with the lasso. Ann. Stat., pages 1436\u20131462, 2006.\n[23] K. Mohan et al. Structured learning of Gaussian graphical models. In NIPS, pages 620\u2013628, 2012.\n[24] M. Narayan and G. I. Allen. Mixed effects models to find differences in multi-subject functional connectivity. bioRxiv:027516, 2015.\n[25] T. E. Nichols and A. P. Holmes. Nonparametric permutation tests for functional neuroimaging: A primer with examples. Human Brain Mapping, 15(1):1\u201325, 2002.\n[26] J. Richiardi, H. Eryilmaz, S. Schwartz, P. Vuilleumier, and D. Van De Ville. Decoding brain states from fMRI connectivity graphs. NeuroImage, 56:616\u2013626, 2011.\n[27] W. R. Shirer, S. Ryali, E. Rykhlevskaia, V. Menon, and M. D. Greicius. 
Decoding subject-driven cognitive states with whole-brain connectivity patterns. Cerebral Cortex, 22(1):158\u2013165, 2012.\n[28] S. M. Smith et al. Network modelling methods for fMRI. NeuroImage, 54:875, 2011.\n[29] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, pages 267\u2013288, 1996.\n[30] S. Van de Geer, P. B\u00fchlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat., 42(3):1166\u20131202, 2014.\n[31] G. Varoquaux and R. C. Craddock. Learning and comparing functional connectomes across subjects. NeuroImage, 80:405\u2013415, 2013.\n[32] G. Varoquaux, A. Gramfort, F. Pedregosa, V. Michel, and B. Thirion. Multi-subject dictionary learning to segment an atlas of brain spontaneous activity. In IPMI, 2011.\n[33] G. Varoquaux, A. Gramfort, J.-B. Poline, and B. Thirion. Brain covariance selection: Better individual functional connectivity models using population prior. In NIPS, 2010.\n[34] L. Waldorp. Testing for graph differences using the desparsified lasso in high-dimensional data. Statistics Survey, 2014.\n[35] S. D. Zhao et al. Direct estimation of differential networks. Biometrika, 101(2):253\u2013268, 2014.\n", "award": [], "sourceid": 335, "authors": [{"given_name": "Eugene", "family_name": "Belilovsky", "institution": "CentraleSupelec"}, {"given_name": "Ga\u00ebl", "family_name": "Varoquaux", "institution": "INRIA"}, {"given_name": "Matthew", "family_name": "Blaschko", "institution": "KU Leuven"}]}