{"title": "Direct Estimation of Differential Functional Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2575, "page_last": 2585, "abstract": "We consider the problem of estimating the difference between two functional undirected graphical models with shared structures. In many applications, data are naturally regarded as high-dimensional random function vectors rather than multivariate scalars. For example, electroencephalography (EEG) data are more appropriately treated as functions of time. In these problems, not only can the number of functions measured per sample be large, but each function is itself an infinite dimensional object, making estimation of model parameters challenging. We develop a method that directly estimates the difference of graphs, avoiding separate estimation of each graph, and show it is consistent in certain high-dimensional settings. We illustrate finite sample properties of our method through simulation studies. Finally, we apply our method to EEG data to uncover differences in functional brain connectivity between alcoholics and control subjects.", "full_text": "Direct Estimation of Differential Functional\n\nGraphical Models\n\nBoxin Zhao\n\nDepartment of Statistics\nThe Unveristy of Chicago\n\nChicago, IL 60637\n\nboxinz@uchicago.edu\n\nY. Samuel Wang\n\nBooth School of Business\nThe Unveristy of Chicago\n\nChicago, IL 60637\n\nswang24@uchicago.edu\n\nMladen Kolar\n\nmkolar@chicagobooth.edu\n\nBooth School of Business\nThe Unveristy of Chicago\n\nChicago, IL 60637\n\nAbstract\n\nWe consider the problem of estimating the difference between two functional\nundirected graphical models with shared structures. In many applications, data\nare naturally regarded as high-dimensional random function vectors rather than\nmultivariate scalars. For example, electroencephalography (EEG) data are more\nappropriately treated as functions of time. 
In these problems, not only can the number of functions measured per sample be large, but each function is itself an infinite dimensional object, making estimation of model parameters challenging. We develop a method that directly estimates the difference of graphs, avoiding separate estimation of each graph, and show it is consistent in certain high-dimensional settings. We illustrate finite sample properties of our method through simulation studies. Finally, we apply our method to EEG data to uncover differences in functional brain connectivity between alcoholics and control subjects.

1 Introduction

Undirected graphical models are widely used to compactly represent pairwise conditional independence in complex systems. Let G = {V, E} denote an undirected graph, where V is the set of vertices with |V| = p and E ⊂ V^2 is the set of edges. For a random vector X = (X_1, ..., X_p)^T, we say that X satisfies the pairwise Markov property with respect to G if, whenever X_v and X_w are conditionally dependent given {X_u}_{u ∈ V∖{v,w}}, we have {v, w} ∈ E. When X follows a multivariate Gaussian distribution with covariance Σ = Θ^{-1}, Θ_vw ≠ 0 implies {v, w} ∈ E. Thus, recovering the structure of the undirected graph is equivalent to estimating the support of the precision matrix Θ [10, 13, 4, 24, 25].

We consider a setting where we observe two samples X and Y from (possibly) different distributions, and the primary object of interest is the difference between the conditional dependencies of each population rather than the conditional dependencies within each population. For example, in Section 4.3 we analyze neuroscience data sampled from a control group and a group of alcoholics, and seek to understand how the brain functional connectivity patterns in the alcoholics differ from the control group.
Thus, in this paper, the object of interest is the differential graph, G_Δ = {V, E_Δ}, which is defined through the difference between the precision matrix of X, Θ^X, and the precision matrix of Y, Θ^Y: Δ = Θ^X − Θ^Y. When Δ_vw ≠ 0, we include {v, w} ∈ E_Δ. This type of differential model has been adopted in [30, 22, 3].

In this paper, we are interested in estimating the differential graph in a more complicated setting. Instead of observing vector valued data, we assume the data are actually random vector valued functions (see [5] for a detailed exposition of random functions). Indeed, we aim to estimate the difference between two functional graphical models, and the method we propose combines ideas from graphical models for functional data and direct estimation of differential graphs.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Multivariate observations measured across time can be modeled as arising from distinct, but similar, distributions [9]. However, in some cases, it may be more natural to assume the data are measurements of an underlying continuous process [31, 18, 11, 28]. [31, 18] treat data as curves distributed according to a multivariate Gaussian process (MGP). [31] shows that Markov properties hold for Gaussian processes, while [18] shows how to consistently estimate underlying conditional independencies. We adopt the functional data point of view and assume the data are curves distributed according to an MGP. However, we consider two samples from distinct populations, with the primary goal of characterizing the difference between the conditional cross-covariance functions of each population. Naively, one could apply the procedure of [18] to each sample and then directly compare the resulting estimated conditional independence structures.
However, this approach would require sparsity in both of the underlying conditional independence graphs and would preclude many practical cases; e.g., neither graph could contain hub nodes with large degree. We develop a novel procedure that directly learns the difference between the conditional independence structures underlying two MGPs. Under an assumption that the difference is sparse, we can consistently learn the structure of the differential graph, even in the setting where the individual graphs are dense and separate estimation would suffer.

Our paper builds on recent literature on graphical models for vector valued data, which suggests that direct estimation of the differences between parameters of underlying distributions may yield better results. [12] considers data arising from pairwise interaction exponential families and proposes the Kullback-Leibler Importance Estimation Procedure (KLIEP) to explicitly estimate the ratio of densities. [21] uses KLIEP as a first step to directly estimate the difference between two directed graphs. Alternatively, [30, 26] consider two multivariate Gaussian samples and directly estimate the difference between the two precision matrices. When the difference is sparse, it can be consistently estimated even in the high-dimensional setting with dense underlying precision matrices. [22] extends this approach to Gaussian copula models.

The rest of the paper is organized as follows. In Section 2 we introduce our method for Functional Differential Graph Estimation (FuDGE). In Section 3 we provide conditions under which FuDGE consistently recovers the true differential graph. Simulations and real data analysis are provided in Section 4.¹ Discussion is provided in Section 5. The appendix contains all the technical proofs and additional simulation results.

We briefly introduce some notation used throughout the rest of the paper.
Let |·|_p denote the vector p-norm and ‖·‖_p denote the matrix/operator p-norm. For example, for a p × p matrix A with entries a_jk, |A|_1 = Σ_{j,k} |a_jk|, ‖A‖_1 = max_k Σ_j |a_jk|, |A|_∞ = max_{j,k} |a_jk|, and ‖A‖_∞ = max_j Σ_k |a_jk|. Let a_n ≍ b_n denote that z_1 ≤ inf_n |a_n/b_n| ≤ sup_n |a_n/b_n| ≤ z_2 for some positive constants z_1 and z_2. Let λ_min(A) and λ_max(A) denote the minimum and maximum eigenvalues of A, respectively. For a bivariate function g(s, t), we define the Hilbert-Schmidt norm of g (or, equivalently, the norm of the integral operator it corresponds to) as ‖g‖_HS^2 = ∫∫ {g(s, t)}^2 ds dt.

2 Methodology

2.1 Functional differential graphical model

Let X_i(t) = (X_i1(t), ..., X_ip(t))^T, i = 1, ..., n_X, and Y_i(t) = (Y_i1(t), ..., Y_ip(t))^T, i = 1, ..., n_Y, be iid p-dimensional multivariate Gaussian processes with mean zero and common domain T from two different, but connected, population distributions, where T is a closed subset of the real line.² Also, assume that for j = 1, ..., p, X_ij(t) and Y_ij(t) are random elements of a separable Hilbert space H. For brevity, we will generally only explicitly define notation for X_i(t); however, the reader should note that all notation for Y_i(t) is defined analogously.

Following [18], we define the conditional cross-covariance function for X_i(t) as

C^X_jl(s, t) = Cov(X_ij(s), X_il(t) | {X_ik(·)}_{k ≠ j,l}).   (2.1)

If C^X_jl(s, t) = 0 for all s, t ∈ T, then the random functions X_ij(t) and X_il(t) are conditionally independent given the other random functions.
The graph G_X = {V, E_X} represents the pairwise Markov properties of X_i(t) if

E_X = {(j, l) ∈ V^2 : j ≠ l and ∃(s, t) ∈ T^2 such that C^X_jl(s, t) ≠ 0}.   (2.2)

In this paper, the object of interest is C^Δ(s, t), where C^Δ_jl(s, t) = C^X_jl(s, t) − C^Y_jl(s, t). We define the differential graph to be G_Δ = {V, E_Δ}, where

E_Δ = {(j, l) ∈ V^2 : j ≠ l and ‖C^Δ_jl‖_HS ≠ 0}.   (2.3)

Again, we include an edge between j and l if the conditional dependence between X_ij(t) and X_il(t) given all the other curves differs from that of Y_ij(t) and Y_il(t) given all the other curves.

¹ The code for this part is available at https://github.com/boxinz17/FuDGE
² Both X_i(t) and Y_i(t) are indexed by i, but they are not paired observations and are completely independent. Also, we assume mean zero and a common domain T to simplify the notation, but the methodology and theory generalize to non-zero means and different time domains T_X and T_Y when fixing some bijection T_X ↦ T_Y.

2.2 Functional principal component analysis

Since X_i(t) and Y_i(t) are infinite dimensional objects, for practical estimation, we reduce the dimensionality using functional principal component analysis (FPCA). Similar to the way principal component analysis provides an L2 optimal lower dimensional representation of vector valued data, FPCA provides an L2 optimal finite dimensional representation of functional data. As in [18], for simplicity of exposition, we assume that we fully observe the functions X_i(t) and Y_i(t). However, FPCA can also be applied to both densely and sparsely observed functional data, as well as data containing measurement errors; such an extension is straightforward, cf. [23] and [20] for a recent overview. Let K^X_jj(t, s) = Cov(X_ij(t), X_ij(s)) denote the covariance function for X_ij.
Then, there exist orthonormal eigenfunctions and eigenvalues {φ^X_jk(t), λ^X_jk}_{k ∈ N} such that for all k ∈ N [5]:

∫_T K^X_jj(s, t) φ^X_jk(t) dt = λ^X_jk φ^X_jk(s).   (2.4)

Without loss of generality, assume λ^X_j1 ≥ λ^X_j2 ≥ ··· ≥ 0. By the Karhunen-Loève expansion [5, Theorem 7.3.5], X_ij(t) can be expressed as X_ij(t) = Σ_{k=1}^∞ a^X_ijk φ^X_jk(t), where the principal component scores satisfy a^X_ijk = ∫_T X_ij(t) φ^X_jk(t) dt and a^X_ijk ~ N(0, λ^X_jk) with E(a^X_ijk a^X_ijl) = 0 if k ≠ l. Because the eigenfunctions are orthonormal, the L2 projection of X_ij onto the span of the first M eigenfunctions is

X^M_ij(t) = Σ_{k=1}^M a^X_ijk φ^X_jk(t).   (2.5)

FPCA constructs estimators φ̂^X_jk(t) and â^X_ijk through the following procedure. First, we form an empirical estimate of the covariance function:

K̂^X_jj(s, t) = (1/n_X) Σ_{i=1}^{n_X} (X_ij(s) − X̄_j(s))(X_ij(t) − X̄_j(t)),

where X̄_j(t) = n_X^{-1} Σ_{i=1}^{n_X} X_ij(t). An eigendecomposition of K̂^X_jj(s, t) then directly provides the estimates λ̂^X_jk and φ̂^X_jk, which allow for computation of â^X_ijk = ∫_T X_ij(t) φ̂^X_jk(t) dt. Let a^{X,M}_ij = (a^X_ij1, ..., a^X_ijM)^T ∈ R^M and a^{X,M}_i = ((a^{X,M}_i1)^T, ..., (a^{X,M}_ip)^T)^T ∈ R^{pM}, with corresponding estimates â^{X,M}_ij and â^{X,M}_i. Since X^M_i(t) is a p-dimensional MGP, a^{X,M}_i has a multivariate Gaussian distribution with pM × pM covariance matrix, which we denote Σ^{X,M} = (Θ^{X,M})^{-1}. In practice, M can be selected by cross validation as in [18]. For (j, l) ∈ V^2, let Θ^{X,M}_jl be the M × M matrix corresponding to the (j, l)th submatrix of Θ^{X,M}.
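The FPCA steps just described (center the curves, estimate the covariance function, eigendecompose it, and integrate against the eigenfunctions to obtain scores) can be sketched numerically for curves observed on a dense common grid. The function name and the Riemann-sum approximation of the integrals below are our own illustration, not the authors' implementation:

```python
import numpy as np

def fpca_scores(X, tgrid, M):
    """Estimate the first M FPCA scores for one coordinate j.

    X     : (n, T) array, n curves observed on the common grid tgrid.
    tgrid : (T,) equally spaced time points; integrals are approximated
            by Riemann sums with spacing dt.
    M     : number of principal components to keep.
    Returns (scores, eigvals, eigfuns), where scores is (n, M).
    """
    n, T = X.shape
    dt = tgrid[1] - tgrid[0]
    Xc = X - X.mean(axis=0)             # center: X_ij(t) - Xbar_j(t)
    K = (Xc.T @ Xc) / n                 # empirical covariance K-hat(s, t) on the grid
    # Discretized eigenproblem of the integral operator:
    # the integral of K(s, t) phi(t) dt is approximated by (K * dt) @ phi
    evals, evecs = np.linalg.eigh(K * dt)
    idx = np.argsort(evals)[::-1][:M]   # top-M eigenvalues, descending
    eigvals = evals[idx]
    eigfuns = evecs[:, idx] / np.sqrt(dt)   # rescale so each phi_k has unit L2 norm
    scores = Xc @ eigfuns * dt          # a-hat_ik = integral of X_i(t) phi_k(t) dt
    return scores, eigvals, eigfuns
```

With curves simulated from a two-component Karhunen-Loève expansion, the recovered eigenvalues approximate the score variances and the eigenfunctions are unit-norm in L2.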
Let Δ^M = Θ^{X,M} − Θ^{Y,M} be the difference between the precision matrices of the first M principal component scores, where Δ^M_jl denotes the (j, l)th submatrix of Δ^M. In addition, let

E_{Δ^M} := {(j, l) ∈ V^2 : j ≠ l and ‖Δ^M_jl‖_F ≠ 0}   (2.6)

denote the set of non-zero blocks of the difference matrix Δ^M. In general E_{Δ^M} ≠ E_Δ; however, we will see that for certain M, by constructing a suitable estimator of Δ^M, we can still recover E_Δ.

2.3 Functional differential graph estimation

We now describe our method, FuDGE, for functional differential graph estimation. Let S^{X,M} and S^{Y,M} denote the sample covariances of â^{X,M}_i and â^{Y,M}_i. To estimate Δ^M, we solve the following problem with the group lasso penalty, which promotes blockwise sparsity in Δ̂^M [27]:

Δ̂^M ∈ argmin_{Δ ∈ R^{pM×pM}} L(Δ) + λ_n Σ_{(j,l) ∈ V^2} ‖Δ_jl‖_F,   (2.7)

where L(Δ) = tr[(1/2) S^{Y,M} Δ^T S^{X,M} Δ − Δ^T (S^{Y,M} − S^{X,M})]. Note that although the true Δ^M is symmetric, we do not enforce symmetry in Δ̂^M.

The design of the loss function L(Δ) in equation (2.7) is based on [15]: in order to construct a consistent M-estimator, we want the true parameter value Δ^M to minimize the population loss E[L(Δ)]. For a differentiable and convex loss function, this is equivalent to selecting L such that E[∇L(Δ^M)] = 0.
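This zero-gradient condition can be checked numerically: when Δ^M = (Σ^{X,M})^{-1} − (Σ^{Y,M})^{-1}, the population gradient Σ^{X,M} Δ^M Σ^{Y,M} − (Σ^{Y,M} − Σ^{X,M}) vanishes identically. A minimal sketch with synthetic covariance matrices of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6  # illustrative dimension pM of the score vector

def random_spd(d):
    """Random symmetric positive definite matrix (well conditioned)."""
    A = rng.normal(size=(d, d))
    return A @ A.T + d * np.eye(d)

Sx, Sy = random_spd(d), random_spd(d)          # stand-ins for Sigma^{X,M}, Sigma^{Y,M}
Delta = np.linalg.inv(Sx) - np.linalg.inv(Sy)  # true difference of precisions

# Population gradient of the loss, evaluated at the true Delta^M:
grad = Sx @ Delta @ Sy - (Sy - Sx)
print(np.max(np.abs(grad)))  # numerically zero
```

The identity holds because Sx (Sx^{-1} − Sy^{-1}) Sy = Sy − Sx for any invertible Sx, Sy.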
Since Δ^M = (Σ^{X,M})^{-1} − (Σ^{Y,M})^{-1}, it satisfies Σ^{X,M} Δ^M Σ^{Y,M} − (Σ^{Y,M} − Σ^{X,M}) = 0. By this observation, a choice for ∇L(Δ) is

∇L(Δ) = S^{X,M} Δ S^{Y,M} − (S^{Y,M} − S^{X,M}),   (2.8)

for which E[∇L(Δ^M)] = Σ^{X,M} Δ^M Σ^{Y,M} − (Σ^{Y,M} − Σ^{X,M}) = 0. Using properties of the differential of the trace function, this choice of ∇L(Δ) yields L(Δ) in (2.7). The chosen loss is quadratic (see (B.10) in the supplement) and leads to an efficient algorithm. Such a loss has been used in [22, 26, 14] and [30].

Finally, to form Ê_Δ, we threshold Δ̂^M by ε_n > 0, so that:

Ê_Δ = {(j, l) ∈ V^2 : j ≠ l and ‖Δ̂^M_jl‖_F > ε_n or ‖Δ̂^M_lj‖_F > ε_n}.   (2.9)

2.4 Optimization algorithm for FuDGE

Algorithm 1 Functional differential graph estimation
Input: S^{X,M}, S^{Y,M}, λ_n, η.
Output: Δ̂^M.
Initialize Δ^(0) = 0_{pM}.
repeat
  A = Δ − η∇L(Δ) = Δ − η [S^{X,M} Δ S^{Y,M} − (S^{Y,M} − S^{X,M})]
  for 1 ≤ j, l ≤ p do
    Δ_jl ← ((‖A_jl‖_F − λ_n η)_+ / ‖A_jl‖_F) · A_jl
  end for
until converged

The optimization problem (2.7) can be solved by a proximal gradient method [17], summarized in Algorithm 1.
Specifically, in each iteration step, we update the current value of Δ, denoted Δ^old, by solving the following problem:

Δ^new = argmin_Δ { (1/2) ‖Δ − (Δ^old − η∇L(Δ^old))‖_F^2 + η λ_n Σ_{j,l=1}^p ‖Δ_jl‖_F },   (2.10)

where ∇L(Δ) is defined in (2.8) and η is a user-specified step size. Note that ∇L(Δ) is Lipschitz continuous with Lipschitz constant λ^S_max = ‖S^{Y,M} ⊗ S^{X,M}‖_2 = λ_max(S^{Y,M}) λ_max(S^{X,M}), the largest eigenvalue of S^{X,M} ⊗ S^{Y,M}. Thus, for any η such that 0 < η ≤ 1/λ^S_max, the proximal gradient method is guaranteed to converge [1].

The update in (2.10) has the closed-form solution:

Δ^new_jl = [(‖A^old_jl‖_F − λ_n η)_+ / ‖A^old_jl‖_F] · A^old_jl,  1 ≤ j, l ≤ p,   (2.11)

where A^old = Δ^old − η∇L(Δ^old) and x_+ = max{0, x} denotes the positive part of x ∈ R. Detailed derivations are given in the appendix.

After performing FPCA, the proximal gradient descent method converges in O(λ^S_max/tol) iterations, where tol is the error tolerance; each iteration takes O((pM)^3) operations. See [19] for a convergence analysis of the proximal gradient descent algorithm.

3 Theoretical properties

In this section, we present theoretical properties of the proposed method. Again, we state assumptions explicitly for X_i(t), but also require the same conditions on Y_i(t).

Assumption 3.1.
Recall that λ^X_jk and φ^X_jk(t) are the eigenvalues and eigenfunctions of K^X_jj, the covariance function for X_ij(t), and that λ^X_jk > λ^X_jk′ for all k′ > k.
(i) Assume max_{j ∈ V} Σ_{k=1}^∞ λ^X_jk < ∞, and that there exists some constant β_X > 1 such that, for each k ∈ N, λ^X_jk ≍ k^{−β_X} and d^X_jk = O(k) uniformly in j ∈ V, where d^X_jk = 2√2 max{(λ^X_{j(k−1)} − λ^X_jk)^{−1}, (λ^X_jk − λ^X_{j(k+1)})^{−1}}.
(ii) Assume that for all k ∈ N, the φ^X_jk(t) are continuous on the compact set T and satisfy max_{j ∈ V} sup_{s ∈ T} sup_{k ≥ 1} |φ^X_jk(s)| = O(1).

The parameter β_X controls the decay rate of the eigenvalues, and d^X_jk = O(k) controls the decay rate of the eigen-gaps (see [2] for more details).

To recover the exact functional differential graph structure, we need further assumptions on the difference operator C^Δ = {C^X_jl(s, t) − C^Y_jl(s, t)}_{j,l ∈ V}. Let ν = ν(M) = max_{(j,l) ∈ V^2} | ‖C^Δ_jl‖_HS − ‖Δ^M_jl‖_F |, and let τ = min_{(j,l) ∈ E_Δ} ‖C^Δ_jl‖_HS, where τ > 0 by the definition in (2.3). Roughly speaking, ν(M) measures the bias due to using an M-dimensional approximation, and τ measures the strength of signal in the differential graph. A smaller τ implies that the graph is harder to recover, and in Theorem 3.1 we require the bias to be small compared to the signal.

Assumption 3.2. Assume that lim_{M→∞} ν(M) = 0.

We also require Assumption 3.3, which assumes sparsity in E_Δ.
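As a concrete companion to the estimator whose properties are analyzed here, the gradient step (2.8), the blockwise soft-threshold update (2.11), and the edge-set rule (2.9) from Section 2 can be sketched in a few lines of numpy. The function names, the convergence check, and the toy inputs in the sketch are our own illustrative choices, not the authors' released implementation:

```python
import numpy as np

def fudge(Sx, Sy, p, M, lam, eta=None, tol=1e-8, max_iter=5000):
    """Proximal gradient descent for the group-lasso problem (2.7).

    Sx, Sy : (pM, pM) sample covariances of the estimated FPCA scores.
    lam    : group lasso penalty; eta defaults to the inverse Lipschitz
             constant 1 / (lambda_max(Sx) * lambda_max(Sy)).
    Returns the estimated difference matrix Delta-hat^M.
    """
    d = p * M
    if eta is None:
        eta = 1.0 / (np.linalg.eigvalsh(Sx)[-1] * np.linalg.eigvalsh(Sy)[-1])
    Delta = np.zeros((d, d))
    for _ in range(max_iter):
        A = Delta - eta * (Sx @ Delta @ Sy - (Sy - Sx))  # gradient step, (2.8)
        new = np.zeros_like(Delta)
        for j in range(p):
            for l in range(p):
                blk = A[j*M:(j+1)*M, l*M:(l+1)*M]
                nrm = np.linalg.norm(blk)        # Frobenius norm of the block
                if nrm > lam * eta:              # blockwise soft-threshold, (2.11)
                    new[j*M:(j+1)*M, l*M:(l+1)*M] = (1 - lam * eta / nrm) * blk
        if np.max(np.abs(new - Delta)) < tol:
            Delta = new
            break
        Delta = new
    return Delta

def edge_set(Delta, p, M, eps=0.0):
    """Recover the edge set as in (2.9): off-diagonal blocks with norm > eps."""
    E = set()
    for j in range(p):
        for l in range(p):
            if j != l and np.linalg.norm(Delta[j*M:(j+1)*M, l*M:(l+1)*M]) > eps:
                E.add((j, l))
    return E
```

On a small example where only one pair of variables differs between the two precision matrices, the procedure recovers exactly that pair.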
Again, this does not preclude the case where E_X and E_Y are dense, as long as the difference between the two graphs is sparse. This assumption is common in the scalar setting; e.g., Condition 1 in [30].

Assumption 3.3. There are s edges in the differential graph; i.e., |E_Δ| = s.

Before we give conditions for recovering the differential graph with high probability, we first introduce some additional notation. Let n = min{n_X, n_Y}, σ_max = max{|Σ^{X,M}|_∞, |Σ^{Y,M}|_∞}, β = min{β_X, β_Y}, and λ*_min = λ_min(Σ^{X,M}) × λ_min(Σ^{Y,M}). Given a positive constant c_1, denote

δ = (1/√c_1) M^{1+β} √(2(log p + log M + log n)/n)   (3.1)

and

Γ = (9 λ_n^2 s)/κ_L^2 + (2 λ_n/κ_L)(ω_L + 2p^2 ν),   (3.2)

where

λ_n = 2M[(δ^2 + 2δσ_max)|Δ^M|_1 + 2δ],  κ_L = (1/2)λ*_min − 8M^2 s(δ^2 + 2δσ_max),  and  ω_L = 4M p^2 ν √(δ^2 + 2δσ_max).   (3.3)

Note that Γ implicitly depends on n through λ_n, κ_L, ω_L, and δ.

Theorem 3.1. There exist positive constants c_1 and c_2 such that, for n and M large enough to simultaneously satisfy

0 < Γ < (1/2)τ − ν(M)  and  δ < min{ (1/4)√((λ*_min + 16M^2 s σ_max^2)/(M^2 s)) − σ_max, c_1 },   (3.4)

setting ε_n ∈ (Γ + ν(M), τ − (Γ + ν(M))) ensures that

P(Ê_Δ = E_Δ) ≥ 1 − 2c_2/n^2.

[18] assumed for some finite M, for all j ∈ V, that λ^X_jm′ = 0 for all m′ > M.
Under this assumption, X_ij(t) = X^M_ij(t), and E_X will correspond exactly to the (j, l) ∈ V^2 such that ‖Θ^{X,M}_jl‖_F ≠ 0 [18, Lemma 1]. If the same eigenvalue condition holds for Y_i(t), then in our setting E_Δ = E_{Δ^M}. When this holds and we can fix M, we obtain consistency even in the high-dimensional setting, since ν = 0 and min{s log(pn)|Δ^M|_1^2/n, s√(log(pn)/n)} → 0 implies consistent estimation. However, even with an infinite number of positive eigenvalues, high-dimensional consistency is still possible for quickly decaying ν; e.g., if ν = o(p^{−2} M^{−1}), the same rate is achievable as when ν(M) = 0.

4 Experiments

4.1 Simulation study

In this section, we demonstrate properties of our method through simulations. In each setting, we generate n_X × p functional variables from graph G_X via X_ij(t) = b(t)^T δ^X_ij, where b(t) is a five-dimensional basis with disjoint support over [0, 1] with

b_k(t) = cos(10π(t − (2k − 1)/10)) + 1 if (k − 1)/5 ≤ t < k/5, and 0 otherwise,  for k = 1, ..., 5.

δ^X_i = ((δ^X_i1)^T, ..., (δ^X_ip)^T)^T ∈ R^{5p} follows a multivariate Gaussian distribution with precision matrix Ω^X. Y_ij(t) is generated in a similar way with precision matrix Ω^Y. We consider three models with different graph structures, and for each model, data are generated with n_X = n_Y = 100 and p = 30, 60, 90, 120. We repeat this 30 times for each p and model setting.

Model 1: This model is similar to the setting considered in [30], but modified to the functional case. We generate the support of Ω^X according to a graph with p(p − 1)/10 edges and a power-law degree distribution with an expected power parameter of 2.
Although the graph is sparse, with only 20% of all possible edges present, the power-law structure mimics certain real-world graphs [16] by creating hub nodes with large degree. For each nonzero block, we set Ω^X_jl = δ′I_5, where δ′ is sampled uniformly from ±[0.2, 0.5]. To ensure positive definiteness, we further scale each off-diagonal block by 1/2, 1/3, 1/4, 1/5 for p = 30, 60, 90, 120, respectively. Each diagonal element of Ω^X is set to 1, and the matrix is symmetrized by averaging it with its transpose. To get Ω^Y, we first select the largest hub nodes in G_X (i.e., the nodes with largest degree), and for each hub node we select the top (by magnitude) 20% of edges. For each selected edge, we set Ω^Y_jl = Ω^X_jl + W, where W_km = 0 for |k − m| ≤ 2 and W_km = c otherwise, with c generated in the same way as δ′. For all other blocks, Ω^Y_jl = Ω^X_jl.

Model 2: We first generate a tridiagonal block matrix Ω*_X with Ω*_{X,jj} = I_5, Ω*_{X,j,j+1} = Ω*_{X,j+1,j} = 0.6 I_5, and Ω*_{X,j,j+2} = Ω*_{X,j+2,j} = 0.4 I_5 for j = 1, ..., p. All other blocks are set to 0. We then set Ω*_{Y,j,j+3} = Ω*_{Y,j+3,j} = W for j = 1, 2, 3, 4, and let Ω*_{Y,jl} = Ω*_{X,jl} for all other blocks. Thus, we form G_Y by adding four edges to G_X. We let W_km = 0 when |k − m| ≤ 1, and W_km = c otherwise, with c = 1/10 for p = 30, c = 1/15 for p = 60, c = 1/20 for p = 90, and c = 1/25 for p = 120. Finally, we set Ω_X = Ω*_X + δI and Ω_Y = Ω*_Y + δI, where δ = max{|min(λ_min(Ω*_X), 0)|, |min(λ_min(Ω*_Y), 0)|} + 0.05.

Model 3: We generate Ω*_X according to an Erdős-Rényi graph. We first set Ω*_{X,jj} = I_5. With probability .8, we set Ω*_{X,jl} = Ω*_{X,lj} = 0.1 I_5, and set it to 0 otherwise. Thus, we expect 80% of all possible edges to be present. Then, we form G_Y by randomly adding s new edges to G_X, where s = 3 for p = 30, s = 4 for p = 60, s = 5 for p = 90, and s = 6 for p = 120. We set each corresponding block Ω*_{Y,jl} = W, where W_km = 0 when |k − m| ≤ 1 and W_km = c otherwise.
We let c = 2/5 for p = 30, c = 4/15 for p = 60, c = 1/5 for p = 90, and c = 4/25 for p = 120. Finally, we set Ω_X = Ω*_X + δI and Ω_Y = Ω*_Y + δI, where δ > max{|min(λ_min(Ω*_X), 0)|, |min(λ_min(Ω*_Y), 0)|} + 0.05.

Although the theory assumes fully observed functional data, in order to mimic a realistic setting, we use noisy observations at discrete time points, such that the actual data corresponding to X_ij are

h^X_ijk = X_ij(t_k) + e_ijk,  e_ijk ~ N(0, 0.5^2),

for 200 evenly spaced time points 0 = t_1 ≤ ··· ≤ t_200 = 1. The h^Y_ijk are obtained in a similar way. For each observation, we first estimate a function by fitting an L-dimensional B-spline basis. We then use these estimated functions for FPCA and our direct estimation procedure. Both M and L are chosen by 5-fold cross-validation, as discussed in [18]. Since ε_n in (2.9) is usually very small in practice, we simply let Ê_Δ = {(j, l) ∈ V^2 : j ≠ l and ‖Δ̂^M_jl‖_F + ‖Δ̂^M_lj‖_F > 0}. We can form a receiver operating characteristic (ROC) curve for recovery of E_Δ by using different values of the group lasso penalty λ_n defined in (2.7).

Figure 1: Average ROC curves across 30 simulations. Different columns correspond to different models; different rows correspond to different dimensions.

We compare FuDGE to three competing methods.
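As an illustration of the simulation designs above, the Model 2 precision matrices can be built as follows; the helper name and the default 0.05 shift argument are our own packaging of the construction described in the text:

```python
import numpy as np

def model2_precisions(p, c, shift=0.05):
    """Sketch of the Model 2 construction: block-tridiagonal Omega_X, and
    Omega_Y obtained by adding four (j, j+3) edges with block W.
    Requires p >= 7 so that the four added edges fit."""
    d = 5 * p
    I5 = np.eye(5)
    idx = np.arange(5)
    # W_km = c when |k - m| > 1, and 0 otherwise
    W = np.where(np.abs(idx[:, None] - idx[None, :]) > 1, c, 0.0)
    Ox = np.zeros((d, d))
    for j in range(p):
        Ox[5*j:5*j+5, 5*j:5*j+5] = I5
        if j + 1 < p:  # first off-diagonal blocks: 0.6 I_5
            Ox[5*j:5*j+5, 5*(j+1):5*(j+1)+5] = 0.6 * I5
            Ox[5*(j+1):5*(j+1)+5, 5*j:5*j+5] = 0.6 * I5
        if j + 2 < p:  # second off-diagonal blocks: 0.4 I_5
            Ox[5*j:5*j+5, 5*(j+2):5*(j+2)+5] = 0.4 * I5
            Ox[5*(j+2):5*(j+2)+5, 5*j:5*j+5] = 0.4 * I5
    Oy = Ox.copy()
    for j in range(4):  # four new edges (j, j+3), 0-based indexing
        Oy[5*j:5*j+5, 5*(j+3):5*(j+3)+5] = W
        Oy[5*(j+3):5*(j+3)+5, 5*j:5*j+5] = W
    # shift both matrices to be positive definite, as in the text
    delta = max(abs(min(np.linalg.eigvalsh(Ox).min(), 0.0)),
                abs(min(np.linalg.eigvalsh(Oy).min(), 0.0))) + shift
    return Ox + delta * np.eye(d), Oy + delta * np.eye(d)
```

Because the same diagonal shift is added to both matrices, the difference Ω_X − Ω_Y is supported exactly on the four added edge blocks.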
The first two competing methods separately estimate two functional graphical models using fglasso from [18]. Specifically, we use fglasso to estimate Θ̂^{X,M} and Θ̂^{Y,M}. We then set Ê_Δ to be all edges (j, l) ∈ V^2 such that ‖Θ̂^{X,M}_jl − Θ̂^{Y,M}_jl‖_F > ζ. For each separate fglasso problem, the penalization parameter is selected by maximizing AIC in the first competing method and by maximizing BIC in the second. We define the degrees of freedom for both AIC and BIC to be the number of edges included in the graph times M^2. We form an ROC curve by using different values of ζ.

The third competing method ignores the functional nature of the data. We select 15 equally spaced time points and implement a direct estimation method at each time point. Specifically, for each t, X_i(t) and Y_i(t) are simply p-dimensional random vectors, and we use their sample covariances in (2.7) to obtain a p × p matrix Δ̂. This produces 15 differential graphs, and we use a majority vote to form a single differential graph. The ROC curve is obtained by changing λ_n, the L1 penalty used for all time points.

Figure 2: Average ROC curves across 30 simulations for the example where the multiple-network strategy works better.

For each setting and method, the ROC curve averaged across the 30 replications is shown in Figure 1. We see that FuDGE clearly has the best overall performance in recovering the support of the differential graph. Among the competing methods, ignoring the functional structure and using a majority vote generally performs better than separately estimating two functional graphs.
A table with the average area under the ROC curve is given in the appendix.

4.2 An example where combining multiple networks at discrete time points works better

By construction, the simulations presented in Section 4.1 estimate E_Δ defined in (2.3), which is not equivalent to

Ẽ_Δ(t) = {(j, l) ∈ V^2 : j ≠ l, |C̃^X_jl(t) − C̃^Y_jl(t)| ≠ 0},   (4.1)

where

C̃^X_jl(t) = Cov(X_ij(t), X_il(t) | {X_ik(t)}_{k ≠ j,l}),   (4.2)

and C̃^Y_jl(t) is defined similarly. However, when Ẽ_Δ(t) = E_Δ for all t, the differential structure can be recovered by considering individual time points. Since considering time points individually requires estimating fewer parameters than the functional version, the multiple-network strategy then performs better than FuDGE.

Here, data are generated with n_X = n_Y = 100 and p = 30, 60, 90, 120. We repeat the simulation 30 times for each p. The model setting is similar to Model 2 in Section 4.1, but we make two major changes. First, when we generate the functional variables, we use a five-dimensional Fourier basis, so that all basis functions are supported over the entire interval, rather than having disjoint support as in Section 4.1. Second, we set the matrix W to be diagonal. Specifically, we let W_kk = c for k = 1, 2, ..., 5 and W_km = 0 for k ≠ m, where c is drawn uniformly from [0.6, 1] and scaled by 1/2 for p = 30, 1/3 for p = 60, and 1/4 for p = 90. All other settings are the same. The average ROC curves are shown in Figure 2, and the mean areas under the curves are shown in Table 2 in Section D.2 of the supplementary material.

In Section 4.1 we considered extreme settings where the data must be treated as functions, and here we consider an extreme setting where the functional nature is irrelevant.
In practice, however, the data may often lie between these two settings, and the method which performs better should depend on the variation of the differential structure across time. However, as it may be hard to measure this variation in practice, treating the data as functional objects should be a more robust choice.

Figure 3: Estimated differential graph for EEG data. The anterior region is at the top of the figure and the posterior region is at the bottom of the figure.

4.3 Neuroscience application

We apply our method to electroencephalogram (EEG) data obtained from an alcoholism study [29, 6, 18], which included 122 total subjects: 77 in the alcoholic group and 45 in the control group. Specifically, the EEG data were measured by placing p = 64 electrodes at various locations on the subject's scalp and measuring voltage values across time. We follow the preprocessing procedure in [8, 31], which filters the EEG signals at the α frequency bands between 8 and 12.5 Hz.

[18] separately estimates functional graphs for both groups, but we directly estimate the differential graph using FuDGE. We choose λ_n so that the estimated differential graph has approximately 1% of possible edges. The estimated edges of the differential graph are shown in Figure 3.

We see that edges are generally between nodes located in the same region, either the anterior region or the posterior region, and there is no edge that crosses between regions. This observation is consistent with the results in [18], where there are no connections between the frontal and back regions for either group. We also note that electrode CZ, lying in the central region, has a high degree in the estimated differential graph.
While there is no direct connection between the anterior and posterior regions, the central region may play a role in helping the two parts communicate.

5 Discussion

In this paper, we propose a method to directly estimate the differential graph for functional graphical models. In certain settings, direct estimation allows the differential graph to be recovered consistently, even if each underlying graph cannot be consistently recovered. Experiments on simulated data also show that preserving the functional nature of the data, rather than treating the data as multivariate scalars, can result in better estimation of the difference graph.

A key step in the procedure is first representing the functions with an M-dimensional basis using FPCA, and Assumption 3.2 ensures that there exists some M large enough so that the signal, τ, is larger than the bias, ν, due to using a finite-dimensional representation. Intuitively, ν is tied to the eigenvalue decay rate; however, we defer derivation of the explicit connection to future work. Finally, we have provided a method for direct estimation of the differential graph, but the development of methods that allow for inference and hypothesis testing in functional differential graphs would be a fruitful avenue for future work. For example, [7] has developed inference tools for high-dimensional Markov networks; future work may extend those results to the functional graph setting.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci., 2:183–202, 2009.

[2] D. Bosq. Linear Processes in Function Spaces, volume 149 of Lecture Notes in Statistics. Springer-Verlag, New York, 2000. Theory and applications.

[3] T. T. Cai.
Global testing and large-scale multiple testing for high-dimensional covariance structures. Annual Review of Statistics and Its Application, 4(1):423–446, 2017.

[4] M. Drton and M. H. Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365–393, 2017.

[5] T. Hsing and R. Eubank. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Wiley Series in Probability and Statistics. John Wiley & Sons, Ltd., Chichester, 2015.

[6] L. Ingber. Statistical mechanics of neocortical interactions: Canonical momenta indicators of electroencephalography. Physical Review E, 55(4):4578–4593, 1997.

[7] B. Kim, S. Liu, and M. Kolar. Two-sample inference for high-dimensional Markov networks. arXiv preprint arXiv:1905.00466, 2019.

[8] G. G. Knyazev. Motivation, emotion, and their inhibitory control mirrored in brain oscillations. Neuroscience & Biobehavioral Reviews, 31(3):377–395, 2007.

[9] M. Kolar, L. Song, A. Ahmed, and E. P. Xing. Estimating time-varying networks. Ann. Appl. Stat., 4(1):94–123, 2010.

[10] S. L. Lauritzen. Graphical Models, volume 17 of Oxford Statistical Science Series. The Clarendon Press, Oxford University Press, New York, 1996. Oxford Science Publications.

[11] B. Li and E. Solea. A nonparametric graphical model for functional data with application to brain networks based on fMRI. J. Amer. Statist. Assoc., 113(524):1637–1655, 2018.

[12] S. Liu, J. A. Quinn, M. U. Gutmann, T. Suzuki, and M. Sugiyama. Direct learning of sparse changes in Markov networks by density ratio estimation. Neural Comput., 26(6):1169–1197, 2014.

[13] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Ann. Stat., 34(3):1436–1462, 2006.

[14] S. Na, M. Kolar, and O. Koyejo.
Estimating differential latent variable graphical models with applications to brain connectivity. arXiv preprint arXiv:1909.05892, 2019.

[15] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci., 27(4):538–557, 2012.

[16] M. E. J. Newman. The structure and function of complex networks. SIAM Rev., 45(2):167–256, 2003.

[17] N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

[18] X. Qiao, S. Guo, and G. M. James. Functional graphical models. J. Amer. Statist. Assoc., 114(525):211–222, 2019.

[19] R. Tibshirani. Proximal gradient descent and acceleration. Lecture notes, 2010.

[20] J.-L. Wang, J.-M. Chiou, and H.-G. Müller. Functional data analysis. Annual Review of Statistics and Its Application, 3(1):257–295, 2016.

[21] Y. Wang, C. Squires, A. Belyaeva, and C. Uhler. Direct estimation of differences in causal graphs. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 3774–3785, 2018.

[22] P. Xu and Q. Gu. Semiparametric differential graph models. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1064–1072. Curran Associates, Inc., 2016.

[23] F. Yao and T. C. M. Lee. Penalized spline models for functional principal component analysis. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1):3–25, 2006.

[24] M. Yu, V. Gupta, and M. Kolar. Statistical inference for pairwise graphical models using score matching. In Advances in Neural Information Processing Systems 29.
Curran Associates, Inc., 2016.

[25] M. Yu, V. Gupta, and M. Kolar. Simultaneous inference for pairwise graphical models with generalized score matching. arXiv preprint arXiv:1905.06261, 2019.

[26] H. Yuan, R. Xi, C. Chen, and M. Deng. Differential network analysis via lasso penalized D-trace loss. Biometrika, 104(4):755–770, 2017.

[27] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. B, 68:49–67, 2006.

[28] C. Zhang, H. Yan, S. Lee, and J. Shi. Dynamic multivariate functional data modeling via sparse subspace learning. arXiv preprint arXiv:1804.03797, 2018.

[29] X. L. Zhang, H. Begleiter, B. Porjesz, W. Wang, and A. Litke. Event related potentials during object recognition tasks. Brain Research Bulletin, 38(6):531–538, 1995.

[30] S. D. Zhao, T. T. Cai, and H. Li. Direct estimation of differential networks. Biometrika, 101(2):253–268, 2014.

[31] H. Zhu, N. Strawn, and D. B. Dunson. Bayesian graphical models for multivariate functional data. J. Mach. Learn. Res., 17:Paper No. 204, 27, 2016.