{"title": "Direct Estimation of Differences in Causal Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 3770, "page_last": 3781, "abstract": "We consider the problem of estimating the differences between two causal directed acyclic graph (DAG) models with a shared topological order given i.i.d. samples from each model. This is of interest for example in genomics, where changes in the structure or edge weights of the underlying causal graphs reflect alterations in the gene regulatory networks. We here provide the first provably consistent method for directly estimating the differences in a pair of causal DAGs without separately learning two possibly large and dense DAG models and computing their difference. Our two-step algorithm first uses invariance tests between regression coefficients of the two data sets to estimate the skeleton of the difference graph and then orients some of the edges using invariance tests between regression residual variances. We demonstrate the properties of our method through a simulation study and apply it to the analysis of gene expression data from ovarian cancer and during T-cell activation.", "full_text": "Direct Estimation of Differences in Causal Graphs\n\nYuhao Wang\n\nChandler Squires\n\nLab for Information & Decision Systems\nand Institute for Data, Systems and Society\n\nMassachusetts Institute of Technology\n\nLab for Information & Decision Systems\nand Institute for Data, Systems and Society\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nyuhaow@mit.edu\n\nCambridge, MA 02139\ncsquires@mit.edu\n\nAnastasiya Belyaeva\n\nCaroline Uhler\n\nLab for Information & Decision Systems\nand Institute for Data, Systems and Society\n\nMassachusetts Institute of Technology\n\nLab for Information & Decision Systems\nand Institute for Data, Systems and Society\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\nbelyaeva@mit.edu\n\nCambridge, MA 
02139\n\ncuhler@mit.edu\n\nAbstract\n\nWe consider the problem of estimating the differences between two causal directed\nacyclic graph (DAG) models with a shared topological order given i.i.d. samples\nfrom each model. This is of interest for example in genomics, where changes in\nthe structure or edge weights of the underlying causal graphs re\ufb02ect alterations in\nthe gene regulatory networks. We here provide the \ufb01rst provably consistent method\nfor directly estimating the differences in a pair of causal DAGs without separately\nlearning two possibly large and dense DAG models and computing their difference.\nOur two-step algorithm \ufb01rst uses invariance tests between regression coef\ufb01cients\nof the two data sets to estimate the skeleton of the difference graph and then orients\nsome of the edges using invariance tests between regression residual variances. We\ndemonstrate the properties of our method through a simulation study and apply\nit to the analysis of gene expression data from ovarian cancer and during T-cell\nactivation.\n\n1\n\nIntroduction\n\nDirected acyclic graph (DAG) models, also known as Bayesian networks, are widely used to model\ncausal relationships in complex systems. Learning the causal DAG from observations on the nodes\nis an important problem across disciplines [8, 25, 30, 36]. A variety of causal inference algorithms\nbased on observational data have been developed, including the prominent PC [36] and GES [20]\nalgorithms, among others [35, 38]. However, these methods require strong assumptions [39]; in\nparticular, theoretical analysis of the PC [14] and GES [23, 40] algorithms have shown that these\nmethods are usually not consistent in the high-dimensional setting, i.e. 
when the number of nodes is of the same order or exceeds the number of samples, unless highly restrictive assumptions on the sparsity and/or the maximum degree of the underlying DAG are made.\nThe presence of high-degree hub nodes is a well-known feature of many networks [2, 3], thereby limiting the direct applicability of causal inference algorithms. However, in many applications, the end goal is not to recover the full causal DAG but to detect changes in the causal relations between two related networks. For example, in the analysis of EEG signals it is of interest to detect neurons or different brain regions that interact differently when the subject is performing different activities [31]; in biological pathways genes may control different sets of target genes under different cellular contexts or disease states [11, 28].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nDue to recent technological advances that allow the collection of large-scale EEG or single-cell gene expression data sets in different contexts there is a growing need for methods that can accurately identify differences in the underlying regulatory networks and thereby provide key insights into the underlying system [11, 28]. The limitations of causal inference algorithms for accurately learning large causal networks with hub nodes and the fact that the difference of two related networks is often sparse call for methods that directly learn the difference of two causal networks without having to estimate each network separately.\nThe complementary problem to learning the difference of two DAG models is the problem of inferring the causal structure that is invariant across different environments. Algorithms for this problem have been developed in recent literature [9, 27, 42]. However, note that the difference DAG can only be inferred from the invariant causal structure if the two DAGs are known. 
The problem of learning the difference between two networks has been considered previously in the undirected setting [17, 18, 43]. However, the undirected setting is often insufficient: only a causal (i.e., directed) network provides insights into the effects of interventions such as knocking out a particular gene. In this paper we provide to our knowledge the first provably consistent method for directly inferring the differences between pairs of causal DAG models that does not require learning each model separately.\nThe remainder of this paper is structured as follows: In Section 2, we set up the notation and review related work. In Section 3, we present our algorithm for directly estimating the difference of causal relationships and in Section 4, we provide consistency guarantees for our algorithm. In Section 5, we evaluate the performance of our algorithm on both simulated and real datasets including gene expression data from ovarian cancer and T-cell activation.\n\n2 Preliminaries and Related Work\n\nLet G = ([p], A) be a DAG with node set [p] := {1, · · · , p} and arrow set A. Without loss of generality we label the nodes such that if i → j in G then i < j (also known as a topological ordering). To each node i we associate a random variable Xi and let P be a joint distribution over the random vector X = (X1, · · · , Xp)^T. In this paper, we consider the setting where the causal DAG model is given by a linear structural equation model (SEM) with Gaussian noise, i.e.,\n\nX = B^T X + ε,\n\nwhere B (the autoregressive matrix) is strictly upper triangular consisting of the edge weights of G, i.e., Bij ≠ 0 if and only if i → j in G, and the noise ε ∼ N(0, Ω) with Ω := diag(σ²1, · · · , σ²p), i.e., there are no latent confounders. Denoting by Σ the covariance matrix of X and by Θ its inverse (i.e., the precision matrix), a short computation yields Θ = (I − B)Ω⁻¹(I − B)^T, and hence\n\nΘij = −Bij/σ²j + ∑_{k>j} BikBjk/σ²k for all i ≠ j, and Θii = 1/σ²i + ∑_{j>i} B²ij/σ²j for all i ∈ [p]. (1)\n\nThis shows that the support of Θ is given by the moral graph of G, obtained by adding an edge between pairs of nodes that have a common child and removing all edge orientations. By the causal Markov assumption, which we assume throughout, the missing edges in the moral graph encode a subset of the conditional independence (CI) relations implied by a DAG model on G; the complete set of CI relations is given by the d-separations that hold in G [15, Section 3.2.2]; i.e., Xi ⊥⊥ Xj | XS in P whenever nodes i and j are d-separated in G given a set S ⊆ [p] \\ {i, j}. The faithfulness assumption is the assertion that all CI relations entailed by P are implied by d-separation in G.\nA standard approach for causal inference is to first infer CI relations from the observations on the nodes of G and then to use the CI relations to learn the DAG structure. However, several DAGs can encode the same CI relations and therefore, G can only be identified up to an equivalence class of DAGs, known as the Markov equivalence class (MEC) of G, which we denote by [G]. In [41], the author gave a graphical characterization of the members of [G]; namely, two DAGs are in the same MEC if and only if they have the same skeleton (i.e., underlying undirected graph) and the same v-structures (i.e., induced subgraphs of the form i → j ← k). 
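As a quick sanity check of (1), the factorization Θ = (I − B)Ω⁻¹(I − B)^T and the entrywise formulae can be verified numerically on a small hand-picked DAG (an illustrative sketch of ours, not part of the paper's code):

```python
import numpy as np

# Numerical check of Theta = (I - B) Omega^{-1} (I - B)^T and of the entrywise
# formulae in (1), for the DAG 1 -> 2 -> 3 with the extra edge 1 -> 3.
p = 3
B = np.zeros((p, p))
B[0, 1], B[1, 2], B[0, 2] = 0.5, -0.8, 0.3     # strictly upper triangular
omega = np.array([1.0, 2.0, 0.5])              # noise variances sigma_j^2
I = np.eye(p)

# X = B^T X + eps  =>  X = (I - B^T)^{-1} eps, so
# Sigma = (I - B^T)^{-1} Omega (I - B)^{-1} and Theta = Sigma^{-1}.
Sigma = np.linalg.inv(I - B.T) @ np.diag(omega) @ np.linalg.inv(I - B)
Theta = np.linalg.inv(Sigma)
Theta_factored = (I - B) @ np.diag(1.0 / omega) @ (I - B).T
assert np.allclose(Theta, Theta_factored)

# Entrywise check of (1) for i < j:
# Theta_ij = -B_ij / sigma_j^2 + sum_{k>j} B_ik B_jk / sigma_k^2
i, j = 0, 1
lhs = Theta[i, j]
rhs = -B[i, j] / omega[j] + sum(B[i, k] * B[j, k] / omega[k] for k in range(j + 1, p))
assert np.isclose(lhs, rhs)
print("factorization and entrywise formula agree")
```

Note how the off-diagonal term picks up contributions from common children (k > j), which is exactly why the support of Θ is the moral graph rather than the skeleton of G.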
[G] can be represented combinatorially\nby a partially directed graph with skeleton G and an arrow for those edges in G that have the same\norientation in all members of [G]. This is known as the CP-DAG (or essential graph) of G [1].\nVarious algorithms have been developed for learning the MEC of G given observational data on the\nnodes, most notably the prominent GES [20] and PC algorithms [36]. While GES is a score-based\n\n2\n\n\fapproach that greedily optimizes a score such as the BIC (Bayesian Information Criterion) over the\nspace of MECs, the PC algorithm views causal inference as a constraint satisfaction problem with the\nconstraints being the CI relations. In a two-stage approach, the PC algorithm \ufb01rst learns the skeleton\nof the underlying DAG and then determines the v-structures, both from the inferred CI relations.\nGES and the PC algorithms are provably consistent, meaning they output the correct MEC given an\nin\ufb01nite amount of data, under the faithfulness assumption. Unfortunately, this assumption is very\nsensitive to hypothesis testing errors for inferring CI relations from data and violations are frequent\nespecially in non-sparse graphs [39]. If the noise variables in a linear SEM with additive noise are\nnon-Gaussian, the full causal DAG can be identi\ufb01ed (as opposed to just its MEC) [33], for example\nusing the prominent LiNGAM algorithm [33]. Non-Gaussianity and sparsity of the underlying graph\nin the high-dimensional setting are crucial for consistency of LiNGAM.\nIn this paper, we develop a two-stage approach, similar to the PC algorithm, for directly learning\nthe difference between two linear SEMs with additive Gaussian noise on the DAGs G and H. Note\nthat naive algorithms that separately estimate [G] and [H] and take their differences can only identify\nedges that appeared/disappeared and cannot identify changes in edge weights (since the DAGs are\nnot identi\ufb01able). Our algorithm overcomes this limitation. 
In addition, we show in Section 4 that instead of requiring the restrictive faithfulness assumption on both DAGs G and H, consistency of our algorithm only requires assumptions on the (usually) smaller and sparser network of differences.\nLet P(1) and P(2) be a pair of linear SEMs with Gaussian noise defined by (B(1), ε(1)) and (B(2), ε(2)). Throughout, we make the simplifying assumption that both B(1) and B(2) are strictly upper triangular, i.e., that the underlying DAGs G(1) and G(2) share the same topological order. This assumption is reasonable for example in applications to genomics, since genetic interactions may appear or disappear, change edge weights, but generally do not change directions. For example, in biological pathways an upstream gene does not generally become a downstream gene in different conditions. Hence B(1) − B(2) is also strictly upper triangular and we define the difference-DAG (D-DAG) of the two models by Δ := ([p], AΔ) with i → j ∈ AΔ if and only if B(1)ij ≠ B(2)ij; i.e., an edge i → j in Δ represents a change in the causal effect of i on j, including changes in the presence/absence of an effect as well as changes in edge weight. Given i.i.d. samples from P(1) and P(2), our goal is to infer Δ. Just like estimating a single causal DAG model, the D-DAG Δ is in general not completely identifiable, in which case we wish to identify the skeleton ¯Δ as well as a subset of arrows ˜AΔ.\nA simpler task is learning differences of undirected graphical models. Let Θ(1) and Θ(2) denote the precision matrices corresponding to P(1) and P(2). The support of Θ(k) consists of the edges in the undirected graph (UG) models corresponding to P(k). We define the difference-UG (D-UG) by ΔΘ := ([p], EΔΘ), with i − j ∈ EΔΘ if and only if Θ(1)ij ≠ Θ(2)ij for i ≠ j. Two recent methods that directly learn the difference of two UG models are KLIEP [18] and DPM [43]; for a review and comparison of these methods see [17]. These methods can be used as a first step towards estimating the D-DAG Δ: under genericity assumptions, the formulae for Θij in (1) imply that if B(1)ij ≠ B(2)ij then Θ(1)ij ≠ Θ(2)ij. Hence, the skeleton of Δ is a subgraph of ΔΘ, i.e., ¯Δ ⊆ ΔΘ. In the following section we present our algorithm showing how to obtain ¯Δ and determine some of the edge directions in Δ. We end this section with a piece of notation needed for introducing our algorithm; we define the set of changed nodes to be SΘ := {i | ∃j ∈ [p] such that Θ(1)ij ≠ Θ(2)ij}.\n\n3 Difference Causal Inference Algorithm\n\nIn Algorithm 1 we present our Difference Causal Inference (DCI) algorithm for directly learning the difference between two linear SEMs with additive Gaussian noise given i.i.d. samples from each model. Our algorithm consists of a two-step approach similar to the PC algorithm. The first step,\n\nAlgorithm 1 Difference Causal Inference (DCI) algorithm\nInput: Sample data ˆX(1), ˆX(2).\nOutput: Estimated skeleton ¯Δ and arrows ˜AΔ of the D-DAG Δ.\nEstimate the D-UG ΔΘ and SΘ; use Algorithm 2 to estimate ¯Δ; use Algorithm 3 to estimate ˜AΔ.\n\nAlgorithm 2 Estimating skeleton of the D-DAG\nInput: Sample data ˆX(1), ˆX(2), estimated D-UG ΔΘ, estimated set of changed nodes SΘ.\nOutput: Estimated skeleton ¯Δ.\nSet ¯Δ := ΔΘ;\nfor each edge i − j in ¯Δ do\nIf ∃S ⊆ SΘ \\ {i, j} such that β(k)i,j|S is invariant across k = {1, 2}, delete i − j in ¯Δ and continue to the next edge. Otherwise, continue.\nend for\n\nAlgorithm 3 Directing edges in the D-DAG\nInput: Sample data ˆX(1), ˆX(2), estimated set of changed nodes SΘ, estimated skeleton ¯Δ.\nOutput: Estimated set of arrows ˜AΔ.\nSet ˜AΔ := ∅;\nfor each node j incident to at least one undirected edge in ¯Δ do\nIf ∃S ⊆ SΘ \\ {j} such that σ(k)j|S is invariant across k = {1, 2}, add i → j to ˜AΔ for all i ∈ S, and add j → i to ˜AΔ for all i ∉ S and continue to the next node. Otherwise, continue.\nend for\nOrient as many undirected edges as possible via graph traversal using the following rule:\nOrient i − j into i → j whenever there is a chain i → ℓ1 → · · · → ℓt → j.\n\ndescribed in Algorithm 2, estimates the skeleton of the D-DAG by removing edges one-by-one. Algorithm 2 takes ΔΘ and SΘ as input. In the high-dimensional setting, KLIEP can be used to estimate ΔΘ and SΘ. For completeness, in the Supplementary Material we also provide a constraint-based method that consistently estimates ΔΘ and SΘ in the low-dimensional setting for general additive noise models. Finally, ΔΘ can also simply be chosen to be the complete graph with SΘ = [p]. These different initializations of Algorithm 2 are compared via simulations in Section 5. The second step of DCI, described in Algorithm 3, infers some of the edge directions in the D-DAG. While the PC algorithm uses CI tests based on the partial correlations for inferring the skeleton and for determining edge orientations, DCI tests the invariance of certain regression coefficients across the two data sets in the first step and the invariance of certain regression residual variances in the second step. 
These are similar to the regression invariances used in [9] and are introduced in the following definitions.\nDefinition 3.1. Given i, j ∈ [p] and S ⊆ [p] \\ {i, j}, let M := {i} ∪ S and let β(k)M be the best linear predictor of X(k)j given X(k)M, i.e., the minimizer of E[(X(k)j − (β(k)M)^T X(k)M)^2]. We define the regression coefficient β(k)i,j|S to be the entry in β(k)M corresponding to i.\nDefinition 3.2. For j ∈ [p] and S ⊆ [p] \\ {j}, we define (σ(k)j|S)^2 to be the variance of the regression residual when regressing X(k)j onto the random vector X(k)S.\nNote that in general β(k)i,j|S ≠ β(k)j,i|S. Each entry in B(k) can be interpreted as a regression coefficient, namely B(k)ij = β(k)i,j|(Pa(k)(j)\\{i}), where Pa(k)(j) denotes the parents of node j in G(k). Thus, when B(1)ij = B(2)ij, then there exists a set S such that β(k)i,j|S stays invariant across k = {1, 2}. This motivates using invariances between regression coefficients to determine the skeleton of the D-DAG. For orienting edges, observe that when σ(k)j stays invariant across the two conditions, σ(k)j|S would also stay invariant if S is chosen such that S = Pa(1)(j) ∪ Pa(2)(j). This motivates using invariances of residual variances to discover the parents of node j and assign orientations afterwards. Similar to [9] we use hypothesis tests based on the F-test for testing the invariance between regression coefficients and residual variances. See the Supplementary Material for details regarding the construction of these hypothesis tests, the derivation of their asymptotic distribution, and an example outlining the difference of this approach to [9] for invariant structure learning.\n\nExample 3.3. 
We end this section with a 4-node example showing how the DCI algorithm works. Let B(1) and B(2) be the autoregressive matrices defined by the edge weights of G(1) and G(2) and let the noise variances satisfy the following invariances:\n\nσ(1)1 ≠ σ(2)1, σ(1)3 = σ(2)3, (2)\nσ(1)2 = σ(2)2, σ(1)4 ≠ σ(2)4. (3)\n\n[Figure: the DAGs G(1) and G(2) on nodes 1, . . . , 4 with their edge weights, and the partially directed DCI output; the diagrams are not recoverable from the extraction.]\n\nInitiating Algorithm 2 with ΔΘ being the complete graph and SΘ = [4], the output of the DCI algorithm is shown above. Note that the edge 1 − 4 is not oriented, since σ(k)4|M and σ(k)1|M are not …\n\n4 Consistency of DCI\n\nThe DCI algorithm is consistent if it outputs a partially oriented graph ˆΔ that has the same skeleton as the true D-DAG and whose oriented edges are all correctly oriented. Just as methods for estimating a single DAG require assumptions on the underlying model (e.g. the faithfulness assumption) to ensure consistency, our method for estimating the D-DAG requires assumptions on relationships between the two underlying models. To define these assumptions it is helpful to view (σ(k)j)j∈[p] and the non-zero entries (B(k)ij)(i,j)∈A(k) as variables or indeterminates and each entry of Θ(k) as a rational function, i.e., a fraction of two polynomials in the variables B(k)ij and σ(k)j as defined in (1). Using Schur complements one can then similarly express β(k)v,w|S and (σ(k)w|S)^2 as a rational function in the entries of Θ(k) and hence as a rational function in the variables (B(k)ij)(i,j)∈A(k) and (σ(k)j)j∈[p]. The exact formulae are given in the Supplementary Material.\nClearly, if B(1)ij = B(2)ij ∀(i, j) and σ(1)j = σ(2)j ∀j ∈ [p], then β(1)v,w|S = β(2)v,w|S and σ(1)w|S = σ(2)w|S for all v, w, S. For consistency of the DCI algorithm we assume that the converse is true as well, namely that differences in Bij and σj in the two distributions are not “cancelled out” by changes in other variables and result in differences in the regression coefficients and regression residual variances. This allows us to deduce invariance patterns of the autoregressive matrix B(k) from invariance patterns of the regression coefficients and residual variances, and hence differences of the two causal DAGs.1\nAssumption 4.1. For any choice of i, j ∈ SΘ, if B(1)ij ≠ B(2)ij, then for all S ⊆ SΘ \\ {i, j} it holds that β(1)i,j|S ≠ β(2)i,j|S and β(1)j,i|S ≠ β(2)j,i|S.\nAssumption 4.2. For any choice of i, j ∈ SΘ it holds that\n1. if B(1)ij ≠ B(2)ij, then ∀S ⊆ SΘ \\ {i, j}, σ(1)j|S ≠ σ(2)j|S and σ(1)i|S∪{j} ≠ σ(2)i|S∪{j};\n2. if σ(1)j ≠ σ(2)j, then σ(1)j|S ≠ σ(2)j|S for all S ⊆ SΘ \\ {j}.\nAssumption 4.1 ensures that the skeleton of the D-DAG is inferred correctly, whereas Assumption 4.2 ensures that the arrows returned by the DCI algorithm are oriented correctly. These assumptions are the equivalent of the adjacency-faithfulness and orientation-faithfulness assumptions that ensure consistency of the PC algorithm for estimating the MEC of a causal DAG [29].\nWe now provide our main results, namely consistency of the DCI algorithm. For simplicity we here discuss the consistency guarantees when Algorithm 2 is initialized with ΔΘ being the complete graph and SΘ = [p]. However, in practice we recommend initialization using KLIEP (see also Section 5) to avoid performing an unnecessarily large number of conditional independence tests. 
The consistency guarantees for such an initialization, including a method for learning the D-DAG in general additive noise models (that are not necessarily Gaussian), are provided in the Supplementary Material.\n\n1This is similar to the faithfulness assumption in the Gaussian setting, where partial correlations are used for CI testing; the partial correlations are rational functions in the variables B(k)ij and σ(k)j, and the faithfulness assumption asserts that if a partial correlation ρij|S is zero then the corresponding rational function is identically equal to zero and hence Bij = 0 [16].\n\nTheorem 4.3. Given Assumption 4.1, Algorithm 2 is consistent in estimating the skeleton of the D-DAG Δ.\nThe proof is given in the Supplementary Material. The main ingredient is showing that if B(1)ij = B(2)ij, then there exists a conditioning set S ⊆ SΘ \\ {i, j} such that β(1)i,j|S = β(2)i,j|S, namely the parents of node j in both DAGs excluding node i. Next, we provide consistency guarantees for Algorithm 3.\nTheorem 4.4. Given Assumption 4.2, all arrows ˜AΔ output by Algorithm 3 are correctly oriented. In particular, if σ(k)i is invariant across k = {1, 2}, then all edges adjacent to i are oriented.\nSimilar to the proof of Theorem 4.3, the proof follows by interpreting the rational functions corresponding to regression residual variances in terms of d-connecting paths in G(k) and is given in the Supplementary Material. It is important to note that as a direct corollary to Theorem 4.4 we obtain sufficient conditions for full identifiability of the D-DAG (i.e., all arrows) using the DCI algorithm.\nCorollary 4.5. Given Assumptions 4.1 and 4.2, and assuming that the error variances are the same across the two distributions, i.e. 
Ω(1) = Ω(2), the DCI algorithm outputs the D-DAG Δ.\n\nIn addition, we conjecture that Algorithm 3 is complete, i.e., that it directs all edges that are identifiable in the D-DAG. We end this section with two remarks, namely regarding the sample complexity of the DCI algorithm and an evaluation of how restrictive Assumptions 4.1 and 4.2 are.\nRemark 4.6 (Sample complexity of DCI). For constraint-based methods such as the PC or DCI algorithms, the sample complexity is determined by the number of hypothesis tests performed by the algorithm [14]. In the high-dimensional setting, the number of hypothesis tests performed by the PC algorithm scales as O(p^s), where p is the number of nodes and s is the maximum degree of the DAG, thereby implying severe restrictions on the sparsity of the DAG given a reasonable sample size. Meanwhile, the number of hypothesis tests performed by the DCI algorithm scales as O(|ΔΘ| · 2^{|SΘ|−1}) and hence does not depend on the degree of the two DAGs. Therefore, even if the two DAGs G(1) and G(2) are high-dimensional and highly connected, the DCI algorithm is consistent and has a better sample complexity (as compared to estimating two DAGs separately) as long as the differences between G(1) and G(2) are sparse, i.e., |SΘ| is small compared to p and s.\nRemark 4.7 (Strength of Assumptions 4.1 and 4.2). Since faithfulness, a standard assumption for consistency of causal inference algorithms to estimate an MEC, is known to be restrictive [39], it is of interest to compare Assumptions 4.1 and 4.2 to the faithfulness assumption of P(k) with respect to G(k) for k ∈ {1, 2}. In the Supplementary Material we provide examples showing that Assumptions 4.1 and 4.2 do not imply the faithfulness assumption on the two distributions and vice-versa. 
However, in the finite sample regime we conjecture Assumptions 4.1 and 4.2 to be weaker than the faithfulness assumption: violations of faithfulness as well as of Assumptions 4.1 and 4.2 correspond to points that are close to conditional independence hypersurfaces [39]. The number of these hypersurfaces (and hence the number of violations) increases in s for the faithfulness assumption and in |SΘ| for Assumptions 4.1 and 4.2. Hence if the two DAGs G(1) and G(2) are large and complex while having a sparse difference, then |SΘ| ≪ s. See the Supplementary Material for more details.

5 Evaluation

In this section, we compare our DCI algorithm with PC and GES on both synthetic and real data. The code utilized for the following experiments can be found at https://github.com/csquires/dci.

5.1 Synthetic data

We analyze the performance of our algorithm in both the low- and high-dimensional settings. For both settings we generated 100 realizations of pairs of upper-triangular SEMs (B(1), ε(1)) and (B(2), ε(2)). For B(1), the graphical structure was generated from an Erdős–Rényi model on p nodes with expected neighbourhood size s, and n samples were drawn from each model. The edge weights were uniformly drawn from [−1, −0.25] ∪ [0.25, 1] to ensure that they were bounded away from zero. B(2) was then generated from B(1) by adding and removing edges with probability 0.1, i.e.,

B(2)_ij ~ Ber(0.9) · B(1)_ij i.i.d. if B(1)_ij ≠ 0,
B(2)_ij ~ Ber(0.1) · Unif([−1, −0.25] ∪ [0.25, 1]) i.i.d. if B(1)_ij = 0.

(a) skeleton  (b) skeleton & orientation  (c) changed variances

Figure 1: Proportion of consistently estimated D-DAGs for 100 realizations per setting with p = 10 nodes and sample size n.
Figures (a) and (b) show the proportion of consistently estimated D-DAGs when considering just the skeleton (Δ̄) and both skeleton and edge orientations (Δ), respectively; α is the significance level used for the hypothesis tests in the algorithms. Figure (c) shows the proportion of consistent estimates with respect to the number of changes in internal node variances v.

Note that while the DCI algorithm is able to identify changes in edge weights, we only generated DAG models that differ by edge insertions and deletions. This is to provide a fair comparison to the naive approach, where we separately estimate the two DAGs G(1) and G(2) and then take their difference, since this approach can only identify insertions and deletions of edges.

In Figure 1 we analyzed how the performance of the DCI algorithm changes over different choices of the significance level α. The simulations were performed on graphs with p = 10 nodes, neighborhood size s = 3 and sample size n ∈ {10^3, 10^4}. For Figure 1 (a) and (b) we set ε(1), ε(2) ∼ N(0, 1_p), which by Corollary 4.5 ensures that the D-DAG Δ is fully identifiable. We compared the performance of DCI to the naive approach, where we separately estimated the two DAGs G(1) and G(2) and then took their difference. For separate estimation we used the prominent PC and GES algorithms tailored to the Gaussian setting. Since KLIEP requires an additional tuning parameter, we here only analyzed initializations with the fully connected graph (DCI-FC) and with the constraint-based method described in the Supplementary Material (DCI-C), in order to isolate how α influences the performance of the DCI algorithm. Both initializations provide a provably consistent algorithm. Figure 1 (a) and (b) show the proportion of consistently estimated D-DAGs by just considering the skeleton (Δ̄) and both skeleton and orientations (Δ), respectively.
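The data-generating process for (B(1), B(2)) described above can be sketched in a few lines of NumPy. This is our own illustration of the stated setup, not code from the authors' repository; the function names and the edge probability s/(p − 1) (to obtain expected neighbourhood size s) are our assumptions.

```python
import numpy as np

def sample_sem_pair(p=10, s=3, seed=0):
    """Pair of upper-triangular SEM matrices (B1, B2): B1 from an
    Erdos-Renyi model on p nodes (edge probability s/(p-1), giving
    expected neighbourhood size s) with weights uniform on
    [-1, -0.25] u [0.25, 1]; B2 from B1 by removing existing edges
    and inserting absent ones, each with probability 0.1."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices(p, k=1)          # upper triangle = shared topological order
    m = len(iu[0])

    def weights(size):
        # magnitudes bounded away from zero, with random sign
        return rng.uniform(0.25, 1.0, size) * rng.choice([-1.0, 1.0], size)

    B1 = np.zeros((p, p))
    B1[iu] = weights(m) * (rng.random(m) < s / (p - 1))

    B2 = np.zeros((p, p))
    old = B1[iu]
    keep = rng.random(m) < 0.9            # Ber(0.9): keep an existing edge
    add = rng.random(m) < 0.1             # Ber(0.1): insert an absent edge
    B2[iu] = np.where(old != 0, old * keep, weights(m) * add)
    return B1, B2

def sample_data(B, n, rng):
    """n i.i.d. samples from the linear SEM X = X B + eps with
    eps ~ N(0, I_p), i.e. X = eps (I - B)^{-1}."""
    p = B.shape[0]
    return rng.standard_normal((n, p)) @ np.linalg.inv(np.eye(p) - B)
```

Separate samples `sample_data(B1, n, rng)` and `sample_data(B2, n, rng)` then serve as the two data sets on which the invariance tests are run.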
For PC and GES, we considered the set of edges that appeared in one estimated skeleton but not in the other as the estimated skeleton of the D-DAG Δ̄. In determining orientations, we considered the arrows that were directed in one estimated CP-DAG but not in the other as the estimated set of directed arrows. Since the main purpose of this low-dimensional simulation study is to validate our theoretical findings, we used the exact recovery rate as the evaluation criterion. In line with our theoretical findings, both variants of the DCI algorithm outperformed taking differences after separate estimation. Figure 1 (a) and (b) also show that the PC algorithm outperformed GES, which is unexpected given previous results showing that GES usually has a higher exact recovery rate than the PC algorithm for estimating a single DAG. This is due to the fact that while the PC algorithm usually estimates fewer DAGs correctly, the incorrectly estimated DAGs tend to look more similar to the true model than the incorrect estimates of GES (as also reported in [35]) and can still lead to a correct estimate of the D-DAG.

(a) D-DAG skeleton Δ̄  (b) D-DAG Δ  (c) T-cell activation

Figure 2: High-dimensional evaluation of the DCI algorithm in both simulation and real data; (a)−(b) are the ROC curves for estimating the D-DAG Δ and its skeleton Δ̄ with p = 100 nodes, expected neighbourhood size s = 10, n = 300 samples, and 5% change between DAGs; (c) shows the estimated D-DAG between gene expression data from naive and activated T cells.

In Figure 1 (c) we analyzed the effect of changes in the noise variances on estimation performance. We set ε(1) ∼ N(0, 1_p), while for ε(2) we randomly picked v nodes and uniformly sampled their variances from [1.25, 2].
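The two invariance tests at the heart of DCI can be illustrated as follows: a test of equal regression coefficients across the data sets (used by Algorithm 2 for the skeleton) and a test of equal regression residual variances (used by Algorithm 3 for orientations). This is a simplified sketch of our own, using a two-sample z-test and an F-test; it is not the exact test statistic of the DCI implementation, and the helper names are ours.

```python
import numpy as np
from scipy import stats

def _ols(X, target, preds):
    """OLS of column `target` on columns `preds` plus an intercept;
    returns (coefficients, residual variance, residual dof, (A^T A)^-1)."""
    A = np.column_stack([X[:, list(preds)], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, X[:, target], rcond=None)
    resid = X[:, target] - A @ coef
    dof = len(X) - A.shape[1]
    return coef, resid @ resid / dof, dof, np.linalg.inv(A.T @ A)

def coef_invariant(X1, X2, i, j, S, alpha=0.05):
    """Fail to reject beta^(1)_{i,j|S} = beta^(2)_{i,j|S} (skeleton test)."""
    out = []
    for X in (X1, X2):
        coef, s2, _, AtA_inv = _ols(X, j, [i] + list(S))
        out.append((coef[0], np.sqrt(s2 * AtA_inv[0, 0])))  # (beta_i, std. error)
    (b1, se1), (b2, se2) = out
    pval = 2 * stats.norm.sf(abs(b1 - b2) / np.hypot(se1, se2))
    return pval > alpha

def variance_invariant(X1, X2, i, S, alpha=0.05):
    """Fail to reject sigma^(1)_{i|S} = sigma^(2)_{i|S} (orientation test)."""
    _, s2a, da, _ = _ols(X1, i, S)
    _, s2b, db, _ = _ols(X2, i, S)
    f = s2a / s2b
    pval = 2 * min(stats.f.cdf(f, da, db), stats.f.sf(f, da, db))
    return pval > alpha
```

In this sketch, an edge i−j is deleted from the estimated skeleton as soon as `coef_invariant` holds for some S ⊆ SΘ \ {i, j}, and `variance_invariant` determines which edges adjacent to i can be oriented.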
We used \u03b1 = .05 as signi\ufb01cance level based on the evaluation from Figure 1.\nIn line with Theorem 4.4, as we increase the number of nodes i such that \u0001(1)\n, the number\nof edges whose orientations can be determined decreases. This is because Algorithm 3 can only\ndetermine an edge\u2019s orientation when the variance of at least one of its nodes is invariant. Moreover,\nFigure 1 (c) shows that the accuracy of Algorithm 2 is not impacted by changes in the noise variances.\nFinally, Figure 2 (a) - (b) show the performance (using ROC curves) of the DCI algorithm in the\nhigh-dimensional setting when initiated using KLIEP (DCI-K) and DCI-C. The simulations were\nperformed on graphs with p = 100 nodes, expected neighborhood size of s = 10, sample size\nn = 300, and \u0001(1), \u0001(2) \u223c N (0, 1p). B(2) was derived from B(1) so that the total number of changes\nwas 5% of the total number of edges in B(1), with an equal amount of insertions and deletions.\nFigure 2 (a) - (b) show that both DCI-C and DCI-K perform similarly well and outperform separate\nestimation using GES and the PC algorithm. The respective plots for 10% change between B(1) and\nB(2) are given in the Supplementary Material.\n\n(cid:54)= \u0001(2)\n\ni\n\ni\n\n5.2 Real data analysis\nOvarian cancer. We tested our method on an ovarian cancer data set [37] that contains two groups\nof patients with different survival rates and was previously analyzed using the DPM algorithm in\nthe undirected setting [43]. We followed the analysis of [43] and applied the DCI algorithm to gene\nexpression data from the apoptosis and TGF-\u03b2 pathways. 
In the apoptosis pathway we identified two hub nodes: BIRC3, also discovered by DPM, is an inhibitor of apoptosis [12] and one of the main dysregulated genes in ovarian cancer [13]; PRKAR2B, not identified by DPM, has been shown to be important in disease progression in ovarian cancer cells [4] and an important regulatory unit for cancer cell growth [5]. In addition, the RII-β protein encoded by PRKAR2B has been considered a therapeutic target for cancer therapy [6, 22], thereby confirming the relevance of our findings. With respect to the TGF-β pathway, the DCI method identified THBS2 and COMP as hub nodes. Both of these genes have been implicated in resistance to chemotherapy in epithelial ovarian cancer [19] and were also recovered by DPM. Overall, the D-UG discovered by DPM is comparable to the D-DAG found by our method. More details on this analysis are given in the Supplementary Material.

T cell activation. To demonstrate the relevance of our method for current genomics applications, we applied DCI to single-cell gene expression data of naive and activated T cells in order to study the pathways involved during the immune response to a pathogen. We analyzed data from 377 activated and 298 naive T cells obtained by [34] using the recent Drop-seq technology. From the previously identified differentially expressed genes between naive and activated T cells [32], we selected all genes that had a fold expression change above 10, resulting in 60 genes for further analysis.

We initiated DCI using KLIEP with edge weights thresholded at 0.005, and ran DCI for different tuning parameters, using cross-validation and stability selection as described in [21] to obtain the final output shown in Figure 2 (c). The genes with the highest out-degree, and hence of interest for future interventional experiments, are GZMB and UHRF1.
Interestingly, GZMB is known to induce cytotoxicity, which is important for attacking and killing invading pathogens. Furthermore, this gene has been reported as the most differentially expressed gene during T cell activation [10, 26]. UHRF1 has been shown to be critical for T cell maturation and proliferation through knockout experiments [7, 24]. Notably, the UHRF1 protein is a transcription factor, i.e., it binds to DNA sequences and regulates the expression of other genes, thereby confirming its role as an important causal regulator. Learning a D-DAG as opposed to a D-UG is crucial for prioritizing interventional experiments. In addition, the difference UG for this application would not only have been denser, but it would also have resulted in additional hub nodes such as FABP5, KLRC1, and ASNS, which based on the current biological literature seem secondary to T cell activation (FABP5 is involved in lipid binding, KLRC1 has a role in natural killer cells but not in T cells, and ASNS is an asparagine synthetase gene). The difference DAGs learned by separately applying the GES and PC algorithms on the naive and activated T cell data sets as well as on the ovarian cancer data sets are included in the Supplementary Material for comparison.

6 Discussion

We presented an algorithm for directly estimating the difference between two causal DAG models given i.i.d. samples from each model. To our knowledge this is the first such algorithm and is of particular interest for learning differences between related networks, where each network might be large and complex, while the difference is sparse. We provided consistency guarantees for our algorithm and showed on synthetic and real data that it outperforms the naive approach of separately estimating two DAG models and taking their difference.
While our proofs were for the setting with no latent variables, they extend to the setting where the edge weights and noise terms of all latent variables remain invariant across the two DAGs. We applied our algorithm to gene expression data in bulk and from single cells, showing that DCI is able to identify biologically relevant genes for ovarian cancer and T-cell activation. This positions DCI as a promising method for identifying intervention targets that are causal for a particular phenotype for subsequent experimental validation. A more careful analysis of the D-DAGs discovered by our DCI algorithm is needed to reveal its impact on scientific discovery.

In order to make DCI scale to networks with thousands of nodes, an important challenge is to reduce the number of hypothesis tests. As mentioned in Remark 4.6, currently the time complexity (given by the number of hypothesis tests) of DCI scales exponentially with respect to the size of SΘ. The PC algorithm overcomes this problem by dynamically updating the list of CI tests given the current estimate of the graph. It is an open problem whether one can similarly reduce the number of hypothesis tests for DCI. Another challenge is to relax Assumptions 4.1 and 4.2. Furthermore, in many applications (e.g., when comparing normal to disease states), there is an imbalance of data/prior knowledge for the two models and it is of interest to develop methods that can make use of this for learning the differences between the two models.

Finally, as described in Section 2, DCI is preferable to separate estimation methods like PC and GES since it can infer not only edges that appear or disappear, but also edges with changed edge weights. However, unlike separate estimation methods, DCI relies on the assumption that the two DAGs share a topological order.
Developing methods to directly estimate the difference of two DAGs that do not share a topological order is of great interest for future work.

Acknowledgements

Yuhao Wang was supported by ONR (N00014-17-1-2147), NSF (DMS-1651995) and the MIT-IBM Watson AI Lab. Anastasiya Belyaeva was supported by an NSF Graduate Research Fellowship (1122374) and the Abdul Latif Jameel World Water and Food Security Lab (J-WAFS) at MIT. Caroline Uhler was partially supported by ONR (N00014-17-1-2147), NSF (DMS-1651995), and a Sloan Fellowship.

References

[1] S. A. Andersson, D. Madigan, and M. D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541, 1997.

[2] A.-L. Barabási, N. Gulbahce, and J. Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1):56–68, 2011.

[3] A.-L. Barabási and Z. N. Oltvai. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5(2):101–113, 2004.

[4] C. Cheadle, M. Nesterova, T. Watkins, K. C. Barnes, J. C. Hall, A. Rosen, K. G. Becker, and Y. S. Cho-Chung. Regulatory subunits of PKA define an axis of cellular proliferation/differentiation in ovarian cancer cells. BMC Medical Genomics, 1(1):43, 2008.

[5] F. Chiaradonna, C. Balestrieri, D. Gaglio, and M. Vanoni. RAS and PKA pathways in cancer: new insight from transcriptional analysis. Frontiers in Bioscience, 13:5257–5278, 2008.

[6] Y. S. Cho-Chung. Antisense oligonucleotide inhibition of serine/threonine kinases: an innovative approach to cancer treatment. Pharmacology & Therapeutics, 82(2):437–449, 1999.

[7] Y. Cui, X. Chen, J. Zhang, X. Sun, H. Liu, L. Bai, C. Xu, and X. Liu. Uhrf1 controls iNKT cell survival and differentiation through the Akt-mTOR axis. Cell Reports, 15(2):256–263, 2016.

[8] N. Friedman, M. Linial, I. Nachman, and D. Pe'er.
Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601–620, 2000.

[9] A. Ghassami, S. Salehkaleybar, N. Kiyavash, and K. Zhang. Learning causal structures using regression invariance. In Advances in Neural Information Processing Systems, pages 3015–3025, 2017.

[10] L. A. Hatton. Molecular Mechanisms Regulating CD8+ T Cell Granzyme and Perforin Gene Expression. PhD thesis, University of Melbourne, 2013.

[11] N. J. Hudson, A. Reverter, and B. P. Dalrymple. A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation. PLoS Computational Biology, 5(5):e1000382, 2009.

[12] R. W. Johnstone, A. J. Frew, and M. J. Smyth. The TRAIL apoptotic pathway in cancer onset, progression and therapy. Nature Reviews Cancer, 8(10):782–798, 2008.

[13] J. Jönsson, K. Bartuma, M. Dominguez-Valentin, K. Harbst, Z. Ketabi, S. Malander, M. Jönsson, A. Carneiro, A. Måsbäck, G. Jönsson, and M. Nilbert. Distinct gene expression profiles in ovarian cancer linked to Lynch syndrome. Familial Cancer, 13:537–545, 2014.

[14] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8(Mar):613–636, 2007.

[15] S. L. Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.

[16] S. Lin, C. Uhler, B. Sturmfels, and P. Bühlmann. Hypersurfaces and their singularities in partial correlation testing. Foundations of Computational Mathematics, 14(5):1079–1116, 2014.

[17] S. Liu, K. Fukumizu, and T. Suzuki. Learning sparse structural changes in high-dimensional Markov networks. Behaviormetrika, 44(1):265–286, 2017.

[18] S. Liu, J. A. Quinn, M. U. Gutmann, T. Suzuki, and M. Sugiyama. Direct learning of sparse changes in Markov networks by density ratio estimation.
Neural Computation, 26(6):1169–1197, 2014.

[19] S. Marchini, R. Fruscio, L. Clivio, L. Beltrame, L. Porcu, I. F. Nerini, D. Cavalieri, G. Chiorino, G. Cattoretti, C. Mangioni, R. Milani, V. Torri, C. Romualdi, A. Zambelli, M. Romano, M. Signorelli, S. di Giandomenico, and M. D'Incalci. Resistance to platinum-based chemotherapy is associated with epithelial to mesenchymal transition in epithelial ovarian cancer. European Journal of Cancer, 49(2):520–530, 2013.

[20] C. Meek. Graphical Models: Selecting Causal and Statistical Models. PhD thesis, Carnegie Mellon University, 1997.

[21] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 72(4):417–473, 2010.

[22] T. Mikalsen, N. Gerits, and U. Moens. Inhibitors of signal transduction protein kinases as targets for cancer therapy. Biotechnology Annual Review, 12:153–223, 2006.

[23] P. Nandy, A. Hauser, and M. H. Maathuis. High-dimensional consistency in score-based and hybrid structure learning, 2015. To appear in Annals of Statistics.

[24] Y. Obata, Y. Furusawa, T. A. Endo, J. Sharif, D. Takahashi, K. Atarashi, M. Nakayama, S. Onawa, Y. Fujimura, M. Takahashi, T. Ikawa, T. Otsubo, Y. I. Kawamura, T. Dohi, S. Tajima, H. Masumoto, O. Ohara, K. Honda, S. Hori, H. Ohno, H. Koseki, and K. Hase. The epigenetic regulator Uhrf1 facilitates the proliferation and maturation of colonic regulatory T cells. Nature Immunology, 15(6):571–579, 2014.

[25] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

[26] A. Peixoto, C. Evaristo, I. Munitic, M. Monteiro, A. Charbit, B. Rocha, and H. Veiga-Fernandes. CD8 single-cell gene coexpression reveals three different effector types present at distinct phases of the immune response. The Journal of Experimental Medicine, 204(5):1193–1205, 2007.

[27] J. Peters, P. Bühlmann, and N.
Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.

[28] J. E. Pimanda, K. Ottersbach, K. Knezevic, S. Kinston, W. Y. Chan, N. K. Wilson, J. Landry, A. D. Wood, A. Kolb-Kokocinski, A. R. Green, D. Tannahill, G. Lacaud, V. Kouskoff, and B. Göttgens. Gata2, Fli1, and Scl form a recursively wired gene-regulatory circuit during early hematopoietic development. Proceedings of the National Academy of Sciences, 104(45):17692–17697, 2007.

[29] J. Ramsey, P. Spirtes, and J. Zhang. Adjacency-faithfulness and conservative causal inference. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 401–408. AUAI Press, 2006.

[30] J. M. Robins, M. A. Hernán, and B. Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550–560, 2000.

[31] S. Sanei and J. A. Chambers. EEG Signal Processing. John Wiley & Sons, 2013.

[32] S. Sarkar, V. Kalia, W. N. Haining, B. T. Konieczny, S. Subramaniam, and R. Ahmed. Functional and genomic profiling of effector CD8 T cell subsets with distinct memory fates. The Journal of Experimental Medicine, 205(3):625–640, 2008.

[33] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.

[34] M. Singer, C. Wang, L. Cong, N. D. Marjanovic, M. S. Kowalczyk, H. Zhang, J. Nyman, K. Sakuishi, S. Kurtulus, D. Gennert, J. Xia, J. Y. H. Kwon, J. Nevin, R. H. Herbst, I. Yanai, O. Rozenblatt-Rosen, V. K. Kuchroo, A. Regev, and A. C. Anderson. A distinct gene module for dysfunction uncoupled from activation in tumor-infiltrating T cells. Cell, 166(6):1500–1511, 2016.

[35] L. Solus, Y. Wang, L. Matejovicova, and C. Uhler.
Consistency guarantees for permutation-based causal inference algorithms, 2017.

[36] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.

[37] R. W. Tothill, A. V. Tinker, J. George, R. Brown, S. B. Fox, S. Lade, D. S. Johnson, M. K. Trivett, D. Etemadmoghadam, B. Locandro, N. Traficante, S. Fereday, J. A. Hung, Y. Chiew, I. Haviv, Australian Ovarian Cancer Study Group, D. Gertig, A. deFazio, and D. D. L. Bowtell. Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome. Clinical Cancer Research, 14(16):5198–5208, 2008.

[38] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78, 2006.

[39] C. Uhler, G. Raskutti, P. Bühlmann, and B. Yu. Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, 41(2):436–463, 2013.

[40] S. Van de Geer and P. Bühlmann. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. The Annals of Statistics, 41(2):536–567, 2013.

[41] T. Verma and J. Pearl. Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, pages 255–270. Elsevier Science Inc., 1990.

[42] K. Zhang, B. Huang, J. Zhang, C. Glymour, and B. Schölkopf. Causal discovery from non-stationary/heterogeneous data: Skeleton estimation and orientation determination. In IJCAI: Proceedings of the Conference, volume 2017, page 1347. NIH Public Access, 2017.

[43] S. D. Zhao, T. T. Cai, and H. Li. Direct estimation of differential networks.
Biometrika, 101(2):253–268, 2014.