{"title": "Learning Causal Structures Using Regression Invariance", "book": "Advances in Neural Information Processing Systems", "page_first": 3011, "page_last": 3021, "abstract": "We study causal discovery in a multi-environment setting, in which the functional relations for producing the variables from their direct causes remain the same across environments, while the distribution of exogenous noises may vary. We introduce the idea of using the invariance of the functional relations of the variables to their causes across a set of environments for structure learning. We define a notion of completeness for a causal inference algorithm in this setting and prove the existence of such algorithm by proposing the baseline algorithm. Additionally, we present an alternate algorithm that has significantly improved computational and sample complexity compared to the baseline algorithm. Experiment results show that the proposed algorithm outperforms the other existing algorithms.", "full_text": "Learning Causal Structures Using Regression\n\nInvariance\n\nAmirEmad Ghassami\u21e4\u2020, Saber Salehkaleybar\u2020, Negar Kiyavash\u21e4\u2020, Kun Zhang\u2021\n\n\u21e4Department of ECE, University of Illinois at Urbana-Champaign, Urbana, USA.\n\n\u2020Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, USA.\n\n\u2021Department of Philosophy, Carnegie Mellon University, Pittsburgh, USA.\n\n\u2020{ghassam2,sabersk,kiyavash}@illinois.edu, \u2021kunz1@cmu.edu\n\nAbstract\n\nWe study causal discovery in a multi-environment setting, in which the functional\nrelations for producing the variables from their direct causes remain the same\nacross environments, while the distribution of exogenous noises may vary. We\nintroduce the idea of using the invariance of the functional relations of the variables\nto their causes across a set of environments for structure learning. 
We define a notion of completeness for a causal inference algorithm in this setting and prove the existence of such an algorithm by proposing the baseline algorithm. Additionally, we present an alternate algorithm that has significantly improved computational and sample complexity compared to the baseline algorithm. Experimental results show that the proposed algorithm outperforms the other existing algorithms.

1 Introduction

Causal structure learning is a fundamental problem in machine learning with applications in multiple fields such as biology, economics, epidemiology, and computer science. When performing interventions in the system is not possible or too expensive (observation-only setting), the main approach to identifying the direction of influences and learning the causal structure is to run a constraint-based or a score-based causal discovery algorithm over the data. In this case, a "complete" observational algorithm learns the causal structure to the extent possible, which is the Markov equivalence class of the ground truth structure. When the experimenter is capable of intervening in the system to see the effect of varying one variable on the other variables (interventional setting), the causal structure can be learned exactly. In this setting, the most common identification procedure treats the variables whose distributions have varied as the descendants of the intervened variable, and hence the causal structure is reconstructed by performing interventions on different variables in the system [4, 11]. However, due to issues such as cost constraints and the infeasibility of performing certain interventions, the experimenter is usually not capable of performing arbitrary interventions.
In many real-life systems, due to changes in the variables of the environment, the data generating distribution varies over time.
Considering the setup after each change as a new environment, our goal is to exploit the differences across environments to learn the underlying causal structure. In this setting, we do not intervene in the system and only use the observational data taken from the environments. We consider a multi-environment setting, in which the functional relations for producing the variables from their parents remain the same across environments, while the distribution of exogenous noises may vary. Note that the standard interventional setting could be viewed as a special case of the multi-environment setting in which the location and distribution of the changes across environments are designed by the experimenter. Furthermore, as will be seen in Figure 1(a), there are cases where the ordinary interventional approaches cannot take advantage of changes across environments while these changes could be utilized to learn the causal structure uniquely. The multi-environment setting was also studied in [35, 23, 37]; we will put our work into perspective in relation to these in the Related Work.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

We focus on linear structural equation models (SEMs) with additive noise [1] as the underlying data generating model (see Section 2 for details). Note that this model is one of the most problematic models in the literature of causal inference, and if the noises follow a Gaussian distribution, for many structures, none of the existing observational approaches can identify the underlying causal structure uniquely¹.
The main idea in our proposed approach is to utilize the change of the regression coefficients, resulting from the changes across the environments, to distinguish causes from effects.

Figure 1: Simple examples of identifiable structures using the proposed approach.

Our approach is able to identify causal structures that were not identifiable using observational approaches, from information that was not usable in existing interventional approaches. Figure 1 shows two simple examples to illustrate this point. In this figure, a directed edge from variable Xi to Xj implies that Xi is a direct cause of Xj, and the change of an exogenous noise across environments is denoted by the flash sign. Consider the structure in Figure 1(a), with equations X1 = N1 and X2 = aX1 + N2, where N1 ~ N(0, σ1^2) and N2 ~ N(0, σ2^2) are independent mean-zero Gaussian exogenous noises. Suppose we are interested in finding out which variable is the cause and which is the effect. We are given two environments across which the exogenous noises of both X1 and X2 are varied. Denoting the regression coefficient resulting from regressing Xi on Xj by β_{Xj}(Xi), in this case we have β_{X2}(X1) = Cov(X1, X2)/Cov(X2) = a σ1^2/(a^2 σ1^2 + σ2^2), and β_{X1}(X2) = Cov(X1, X2)/Cov(X1) = a. Therefore, except for pathological values of the variances of the exogenous noises in the two environments, the regression coefficient resulting from regressing the cause variable on the effect variable varies between the two environments, while the regression coefficient from regressing the effect variable on the cause variable remains the same. Hence, the cause is distinguishable from the effect. Note that the structures X1 → X2 and X2 → X1 are in the same Markov equivalence class and hence not distinguishable using merely conditional independence tests.
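This two-environment computation is easy to verify numerically. The sketch below (the coefficient a = 1.5 and the noise scales are hypothetical choices for illustration) regresses each variable on the other in two environments in which both noise variances change:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 1.5  # hypothetical causal coefficient for X2 = a*X1 + N2

def sample_env(s1, s2, n=200_000):
    """One environment: X1 = N1, X2 = a*X1 + N2, with noise std devs s1, s2."""
    x1 = rng.normal(0.0, s1, n)
    x2 = a * x1 + rng.normal(0.0, s2, n)
    return x1, x2

def beta(y, x):
    """OLS coefficient of regressing y on x (zero-mean variables)."""
    return np.dot(x, y) / np.dot(x, x)

# Two environments in which the noises of BOTH variables change
for s1, s2 in [(1.0, 1.0), (2.0, 3.0)]:
    x1, x2 = sample_env(s1, s2)
    print(f"beta_X1(X2) = {beta(x2, x1):.3f}   beta_X2(X1) = {beta(x1, x2):.3f}")
```

Regressing the effect X2 on the cause X1 recovers a in both environments, while regressing X1 on X2 gives a σ1^2/(a^2 σ1^2 + σ2^2), which moves with the noise variances; this asymmetry is exactly what the method exploits.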
Also, since the exogenous noises of both variables have changed, ordinary interventional tests are not capable of using the information of these two environments to distinguish between the two structures [5]. Moreover, as will be explained shortly (see Related Work), since the exogenous noise of the target variable has changed, the invariant prediction method [23] cannot discern the correct structure either. As another example, consider the structure in Figure 1(b). Suppose the exogenous noise of X1 is varied across the two environments. Similar to the previous example, it can be shown that β_{X2}(X1) varies across the two environments while β_{X1}(X2) remains the same. This implies that the edge between X1 and X2 is directed from the former to the latter. Similarly, β_{X3}(X2) varies across the two environments while β_{X2}(X3) remains the same. This implies that X2 is the parent of X3. Therefore, the structure in Figure 1(b) is distinguishable using the proposed identification approach. Note that the invariant prediction method cannot identify the relation between X2 and X3, and conditional independence tests are also not able to distinguish this structure.
Related Work. The main approach to learning the causal structure in the observational setting is to run a constraint-based or a score-based algorithm over the data. The constraint-based approach [33, 21] is based on performing statistical tests to learn conditional independencies among the variables, along with applying the Meek rules introduced in [36]. The IC and IC* [21], PC, and FCI [33] algorithms are among the well-known examples of this approach. In the score-based approach, first a hypothesis space of potential models along with a scoring function is defined. The scoring function measures how well the model fits the observed data. Then the highest-scoring structure is chosen as the output (usually via greedy search).
The Greedy Equivalence Search (GES) algorithm [20, 2] is an example of the score-based approach. Such purely observational approaches reconstruct the causal graph up to Markov equivalence classes. Thus, the directions of some edges may remain unresolved. There are studies which attempt to identify the exact causal structure by restricting the model class [32, 12, 24, 22]. Most of such works consider SEMs with independent noise. The LiNGAM method [32] is a potent approach capable of structure learning in the linear SEM with additive noise², as long as the distribution of the noise is not Gaussian. The authors of [12] and [38] showed that a nonlinear SEM with additive noise, and even the post-nonlinear causal model, along with some mild conditions on the functions and data distributions, are not symmetric in the cause and effect. There is also a line of work on causal structure learning in models where each vertex of the graph represents a random process [26, 34, 25, 6, 7, 16]. In such models, a temporal relationship is considered among the variables and it is usually assumed that there is no instantaneous influence among the processes. In the interventional approach to causal structure learning, the experimenter picks specific variables and attempts to learn their relation with other variables by observing the effect of perturbing those variables on the distribution of others. In recent works, bounds on the required number of interventions for complete discovery of causal relationships, as well as passive and adaptive algorithms for minimizing the number of experiments, were derived [5, 9, 10, 11, 31].
In this work we assume that the functional relations of the variables to their direct causes are invariant across a set of environments.

¹As noted in [12], "nonlinearities can play a role similar to that of non-Gaussianity", and both lead to exact structure recovery.
²There are extensions to LiNGAM beyond the linear model [38].
Similar assumptions have been considered in other work [3, 30, 14, 13, 29, 23]. Specifically, [3], which studied finding the causal relation between two variables related to each other by an invertible function, assumes that "the distribution of the cause and the function mapping cause to effect are independent since they correspond to independent mechanisms of nature".
There is little work on the multi-environment setup [35, 23, 37]. In [35], the authors analyzed the classes of structures that are equivalent relative to a stream of distributions and presented algorithms that output graphical representations of these equivalence classes. They assumed that changing the distribution of a variable varies the marginal distribution of all its descendants. Naturally, this also assumes that they have access to enough samples to test each variable for a marginal distribution change. This approach cannot identify the causal relations among variables which are affected by environment changes in the same way. The work most closely related to our approach is the invariant prediction method [23], which utilizes different environments to estimate the set of predictors of a target variable. In that work, it is assumed that the exogenous noise of the target variable does not vary among the environments. In fact, the method crucially relies on this assumption, as it adds variables to the estimated predictors set only if they are necessary to keep the distribution of the target variable's noise fixed. Besides its high computational complexity, the invariant prediction framework may return a set which does not contain all the parents of the target variable. Additionally, the optimal predictor set (the output of the algorithm) is not necessarily unique. We will show that in many cases our proposed approach can overcome both of these issues.
Recently, the authors of [37] considered the setting in which changes in the mechanisms of variables prevent ordinary conditional-independence-based algorithms from discovering the correct structure. The authors modeled these changes as multiple environments and proposed a general solution for a non-parametric model which first detects the variables whose mechanisms changed and then finds causal relations among variables using conditional independence tests. Due to the generality of the model, this method may require a large number of samples.
Contribution. We propose a novel causal structure learning framework, which is capable of uniquely identifying causal structures that were not identifiable using observational approaches, from information that was not usable in existing interventional approaches. The main contribution of this work is to introduce the idea of using the invariance of the functional relations of the variables to their direct causes across a set of environments. In the special case of linear SEMs, this amounts to using the invariance of the coefficients for distinguishing the causes from the effects. We define a notion of completeness for a causal inference algorithm in this setting and prove the existence of such an algorithm by proposing the baseline algorithm (Section 3). Additionally, we present an alternate algorithm (Section 4) which has significantly improved computational and sample complexity compared to the baseline algorithm.
2 Regression-Based Causal Structure Learning
Definition 1. Consider a directed graph G = (V, E) with vertex set V and set of directed edges E. G is a DAG if it is a finite graph with no directed cycles. A DAG G is called causal if its vertices represent random variables V = {X1, ..., Xn} and a directed edge (Xi, Xj) indicates that variable Xi is a direct cause of variable Xj.
We consider a linear SEM [1] as the underlying data generating model.
In such a model the value of each variable Xj ∈ V is determined by a linear combination of the values of its causal parents PA(Xj) plus an additive exogenous noise Nj as follows

Xj = Σ_{Xi ∈ PA(Xj)} bji Xi + Nj,    ∀j ∈ {1, ..., p},    (1)

where the Nj's are jointly independent. This model can be represented by a single matrix equation X = BX + N. Further, we can write

X = AN,    (2)

where A = (I − B)^{-1}. This implies that each variable X ∈ V can be written as a linear combination of the exogenous noises in the system. We assume that in our model all variables are observable. Also, we focus on zero-mean Gaussian exogenous noise; the proposed approach can, however, be extended to any arbitrary distribution for the exogenous noise in the system. The following definitions will be used throughout the paper.
Definition 2. The graph union of a set G of mixed graphs³ over a skeleton is a mixed graph with the same skeleton as the members of G which contains the directed edge (X, Y) if ∃ G ∈ G such that (X, Y) ∈ E(G) and ∄ G′ ∈ G such that (Y, X) ∈ E(G′). The rest of the edges remain undirected.
Definition 3. Causal DAGs G1 and G2 over V are Markov equivalent if every distribution that is compatible with one of the graphs is also compatible with the other. Markov equivalence is an equivalence relationship over the set of all graphs over V [17]. The graph union of all DAGs in the Markov equivalence class of a DAG G is called the essential graph of G and is denoted by Ess(G).
We consider a multi-environment setting consisting of N environments E = {E1, ..., EN}. The structure of the causal DAG and the functional relations for producing the variables from their parents (the matrix B) remain the same across all environments; the exogenous noises may vary though. For a pair of environments Ei, Ej ∈ E, let Iij be the set of variables whose exogenous noises have changed between the two environments.
Given Iij, for any DAG G consistent with the essential graph⁴ obtained from an observational algorithm, define the regression invariance set as follows

R(G, Iij) := {(X, S) : X ∈ V, S ⊆ V\{X}, β^{(i)}_S(X) = β^{(j)}_S(X)},

where β^{(i)}_S(X) and β^{(j)}_S(X) are the regression coefficients of regressing variable X on S in environments Ei and Ej, respectively. In words, R(G, Iij) contains all pairs (X, S), X ∈ V, S ⊆ V\{X}, such that if we regress X on S, the regression coefficients do not change across Ei and Ej.
Definition 4. Given I, the set of variables whose exogenous noises have changed between two environments, DAGs G1 and G2 are called I-distinguishable if R(G1, I) ≠ R(G2, I).
We make the following assumption on the distributions of the exogenous noises.
Assumption 1 (Regression Stability Assumption). For a given set I and structure G, there exists ε0 > 0 such that for all 0 < ε ≤ ε0, perturbing the variance of the exogenous noises by ε does not change the regression invariance set R(G, I).
The purpose of Assumption 1 is to rule out pathological values of the variances of the exogenous noises in the two environments which create special regression relations. For instance, in Example 1, β^{(1)}_{X2}(X1) = β^{(2)}_{X2}(X1) only if σ1^2 σ̃2^2 = σ2^2 σ̃1^2, where σi^2 and σ̃i^2 are the variances of the exogenous noise of Xi in the environments E1 and E2, respectively. Note that this special relation between σ1^2, σ̃1^2, σ2^2, and σ̃2^2 has Lebesgue measure zero in the set of all possible values for the variances. We give the following examples as applications of our approach.
Example 1. Consider DAGs G1 : X1 → X2 and G2 : X1 ← X2. For I = {X1}, I = {X2} or I = {X1, X2}, calculating the regression coefficients as explained in Section 1, we see that (X1, {X2}) ∉ R(G1, I) but (X1, {X2}) ∈ R(G2, I). Hence G1 and G2 are I-distinguishable. As mentioned in Section 1, structures G1 and G2 are not distinguishable using observational tests. Also, in the case of I = {X1, X2}, the invariant prediction approach and the ordinary interventional tests, in which the experimenter expects that a change in the distribution of the effect would not perturb the marginal distribution of the cause variable, are not capable of distinguishing the two structures either.

Figure 2: DAGs related to Example 3.

³A mixed graph contains both directed and undirected edges.
⁴DAG G is consistent with mixed graph M if they have the same skeleton and G does not contain edge (X, Y) while M contains (Y, X).

Example 2. Consider the DAG G in Figure 1(b) with I = {X1}. Consider an alternative DAG G′ in which, compared to G, the directed edge (X1, X2) is replaced by (X2, X1), and a DAG G″ in which, compared to G, the directed edge (X2, X3) is replaced by (X3, X2). Since (X2, {X1}) ∈ R(G, I) while this pair is not in R(G′, I), and (X2, {X3}) ∉ R(G, I) while this pair belongs to R(G″, I), the structure of G is also distinguishable using the proposed identification approach. Note that the direction of the edges of G is not distinguishable using an observational test, as G has two other DAGs in its equivalence class. Also, the invariant prediction method cannot identify the relation between X2 and X3, since it can keep the variance of the noise of X3 fixed by setting the predictor set as {X2} or {X1}, which have empty intersection.
Example 3. Consider the structure in Figure 2(a) with I = {X2}. Among the six possible triangle DAGs, all of them are I-distinguishable from this structure and hence, with two environments differing in the exogenous noise of X2, this triangle DAG can be identified.
Note that all the triangle DAGs are in the same Markov equivalence class and hence, using the information of one environment alone, the observation-only setting cannot lead to identification, which makes this structure challenging to deal with [8]. For I = {X1}, the structure in Figure 2(b) is not I-distinguishable from a triangle DAG in which the direction of the edge (X2, X3) is flipped. These two DAGs are also not distinguishable using the invariant prediction method and the usual interventional approaches with an intervention on X1.
Let the structure G* be the ground truth DAG structure. Define G(G*, I) := {G : R(G, I) = R(G*, I)}, which is the set of all DAGs which are not I-distinguishable from G*. Using this set, we form the mixed graph M(G*, I) over V as the graph union of the members of G(G*, I).
Definition 5. Let Pi be the joint distribution over the set of variables V in environment Ei ∈ E. An algorithm A : ({Pi}^N_{i=1}) → M, which gets the joint distributions over V in the environments E = {Ei}^N_{i=1} as input and returns a mixed graph, is regression invariance complete if for any pair of environments Ei and Ej, with Iij as the set of variables whose exogenous noises have changed between Ei and Ej, the set of directed edges of M(G*, Iij) is a subset of the set of directed edges of the output of A.
In Section 3 we will introduce a structure learning algorithm which is complete in the sense of Definition 5.
3 Existence of Complete Algorithms
In this section we show the existence of a complete algorithm (in the sense of Definition 5) for learning the causal structure among a set of variables V whose dynamics satisfy the SEM in (1). The pseudo-code of the algorithm is presented in Algorithm 1.
Suppose G* is the ground truth structure.
The algorithm first runs a complete observational algorithm to obtain the essential graph Ess(G*). For each pair of environments {Ei, Ej} ⊆ E, the algorithm first calculates the regression coefficients β^{(i)}_S(Y) and β^{(j)}_S(Y) for all Y ∈ V and S ⊆ V\{Y}, and forms the regression invariance set Rij, which contains the pairs (Y, S) for which the regression coefficients did not change between Ei and Ej. Note that ideally Rij is equal to R(G*, Iij). Next, using the function ChangeFinder(·), we discover the set Iij, which is the set of variables whose exogenous noises have varied between the two environments Ei and Ej. Then, using the function ConsistentFinder(·), we find Gij, which is the set of all DAGs G that are consistent with Ess(G*) and satisfy R(G, Iij) = Rij. That is, this set is ideally equal to G(G*, Iij). After taking the union of the graphs in Gij, we form Mij, which is the mixed graph containing all causal relations distinguishable from the given regression information between the two environments. This graph is ideally equal to M(G*, Iij). After obtaining Mij for all pairs of environments, the algorithm forms a mixed graph ME by taking the graph union of the Mij's. We apply the Meek rules on ME to find all extra orientations and output M̂.
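The Rij step can be sketched numerically: under the linear-Gaussian model, estimate β_S(Y) in each environment by OLS and keep the pairs whose coefficients agree. The example below does this exhaustively for the chain X1 → X2 → X3 of Figure 1(b) with I12 = {X1}; the coefficients 0.8 and −1.2, the noise scales, and the fixed tolerance are hypothetical choices (the paper instead uses a formal hypothesis test):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def sample_env(noise_stds, n=400_000):
    """Chain X1 -> X2 -> X3 with hypothetical coefficients 0.8 and -1.2."""
    n1, n2, n3 = (rng.normal(0.0, s, n) for s in noise_stds)
    x1 = n1
    x2 = 0.8 * x1 + n2
    x3 = -1.2 * x2 + n3
    return np.column_stack([x1, x2, x3])

def ols(data, target, S):
    """OLS coefficients of regressing the target column on the columns in S."""
    X, y = data[:, list(S)], data[:, target]
    return np.linalg.lstsq(X, y, rcond=None)[0]

# I12 = {X1}: only the exogenous noise of X1 differs across the environments
envs = [sample_env([1.0, 1.0, 1.0]), sample_env([2.5, 1.0, 1.0])]

R = []  # empirical regression invariance set (variables indexed 0, 1, 2)
for x in range(3):
    others = [v for v in range(3) if v != x]
    for r in (1, 2):
        for S in itertools.combinations(others, r):
            b1, b2 = ols(envs[0], x, S), ols(envs[1], x, S)
            if np.allclose(b1, b2, atol=0.02):
                R.append((x, S))
print(R)
```

The pair (1, (0,)) (regressing X2 on X1) comes out invariant while (0, (1,)) does not, matching the discussion of Figure 1(b); the printed set is the sample analogue of R(G*, I12).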
Since for each pair of environments we search over all DAGs, and we take the graph union of the Mij's, the baseline algorithm is complete in the sense of Definition 5.

Algorithm 1 The Baseline Algorithm
Input: Joint distribution over V in environments E = {Ei}^N_{i=1}.
Obtain Ess(G*) by running a complete observational algorithm.
for each pair of environments {Ei, Ej} ⊆ E do
  Obtain Rij = {(Y, S) : Y ∈ V, S ⊆ V\{Y}, β^{(i)}_S(Y) = β^{(j)}_S(Y)}.
  Iij = ChangeFinder(Ei, Ej).
  Gij = ConsistentFinder(Ess(G*), Rij, Iij).
  Mij = ∪_{G ∈ Gij} G.
end for
ME = ∪_{1 ≤ i,j ≤ N} Mij.
Apply Meek rules on ME to get M̂.
Output: Mixed graph M̂.

Obtaining the set Rij: In this part, for a given significance level α, we show how the set Rij can be obtained with total probability of false rejection less than α. For given Y ∈ V and S ⊆ V\{Y} in the environments Ei and Ej, we define the null hypothesis H^{ij}_{0,Y,S} as follows:

H^{ij}_{0,Y,S} : ∃ β ∈ R^{|S|} such that β^{(i)}_S(Y) = β and β^{(j)}_S(Y) = β.    (3)

Let β̂^{(i)}_S(Y) and β̂^{(j)}_S(Y) be the estimates of β^{(i)}_S(Y) and β^{(j)}_S(Y), respectively, obtained using the ordinary least squares estimator, and define the test statistic

T̂ := (β̂^{(i)}_S(Y) − β̂^{(j)}_S(Y))^T (s_i^2 Σ̂_i^{−1} + s_j^2 Σ̂_j^{−1})^{−1} (β̂^{(i)}_S(Y) − β̂^{(j)}_S(Y)) / |S|,    (4)

where s_i^2 and s_j^2 are unbiased estimates of the variance of Y − (X_S)^T β^{(i)}_S(Y) and Y − (X_S)^T β^{(j)}_S(Y) in environments Ei and Ej, respectively, and Σ̂_i and Σ̂_j are sample covariance matrices of E[X_S (X_S)^T] in environments Ei and Ej, respectively.
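A minimal numerical sketch of this coefficient-equality test could look as follows. It uses the per-environment (X_S^T X_S)^{-1} form of the coefficient covariance, which absorbs the sample sizes into the Σ̂^{-1} terms; the simulated model and its coefficient 0.7 are hypothetical:

```python
import numpy as np
from scipy import stats

def ols_fit(X, y):
    """OLS estimate and unbiased residual variance in one environment."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    return beta, s2

def coef_invariance_pvalue(Xi, yi, Xj, yj):
    """p-value for H0: the regression vector of y on X_S is the same in Ei and Ej."""
    bi, s2i = ols_fit(Xi, yi)
    bj, s2j = ols_fit(Xj, yj)
    k = Xi.shape[1]
    cov = s2i * np.linalg.inv(Xi.T @ Xi) + s2j * np.linalg.inv(Xj.T @ Xj)
    T = (bi - bj) @ np.linalg.inv(cov) @ (bi - bj) / k
    n = min(len(yi), len(yj))
    return stats.f.sf(T, k, n - k)  # right tail of F(|S|, n - |S|)

# Demo: X1 -> X2 with the noise of X1 changed between the environments
rng = np.random.default_rng(3)
n = 50_000
x1_i = rng.normal(0, 1.0, n); x2_i = 0.7 * x1_i + rng.normal(0, 1.0, n)
x1_j = rng.normal(0, 2.0, n); x2_j = 0.7 * x1_j + rng.normal(0, 1.0, n)

# (X2, {X1}) is invariant here; (X1, {X2}) is not
print(coef_invariance_pvalue(x1_i[:, None], x2_i, x1_j[:, None], x2_j))
print(coef_invariance_pvalue(x2_i[:, None], x1_i, x2_j[:, None], x1_j))
```

In a typical run the first p-value is not small (the null holds) while the second is essentially zero, so a Bonferroni-style threshold rejects only the second.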
If the null hypothesis holds, then T̂ ~ F(|S|, n − |S|), where F(·, ·) is the F-distribution (see the supplementary material for details).
We set the p-value threshold of our test to α/(p × (2^{p−1} − 1)). Hence, by testing all null hypotheses H^{ij}_{0,Y,S} for any Y ∈ V and S ⊆ V\{Y}, we can obtain the set Rij with total probability of false rejection less than α.
Function ChangeFinder(·): We use Lemma 1 to find the set Iij.
Lemma 1. Given environments Ei and Ej, for a variable Y ∈ V, if E[(Y − (X_S)^T β^{(i)}_S(Y))^2 | Ei] ≠ E[(Y − (X_S)^T β^{(j)}_S(Y))^2 | Ej] for all S ⊆ N(Y), where N(Y) is the set of neighbors of Y, then the variance of the exogenous noise NY has changed between the two environments. Otherwise, the variance of NY is unchanged.
See the supplementary material for the proof.
Based on Lemma 1, for any variable Y, we try to find a set S ⊆ N(Y) for which the variance of Y − (X_S)^T β_S(Y) remains fixed between Ei and Ej by testing the following null hypothesis:

H̄^{ij}_{0,Y,S} : ∃ σ ∈ R s.t. E[(Y − (X_S)^T β^{(i)}_S(Y))^2 | Ei] = σ^2 and E[(Y − (X_S)^T β^{(j)}_S(Y))^2 | Ej] = σ^2.

In order to test this null hypothesis, we can compute the variance of Y − (X_S)^T β^{(i)}_S in Ei and of Y − (X_S)^T β^{(j)}_S in Ej and test whether these variances are equal using an F-test. If the p-value of the test for the set S is less than α/(p × 2^Δ), where Δ is the maximum degree of the causal graph, then we reject the null hypothesis H̄^{ij}_{0,Y,S}. If we reject the hypothesis tests H̄^{ij}_{0,Y,S} for all S ⊆ N(Y), then we add Y to the set Iij. Since we perform at most p × 2^Δ tests (for each variable, at most 2^Δ tests), we can obtain the set Iij with total probability of false rejection less than α.
Function ConsistentFinder(·): Let Dst be the set of all directed paths from variable Xs to variable Xt. For any directed path d ∈ Dst, we define the weight of d as w_d := Π_{(u,v)∈d} b_vu, where the b_vu are coefficients in (1).
By this definition, it can be seen that the entry (t, s) of the matrix A in (2) is equal to [A]_{ts} = Σ_{d ∈ Dst} w_d. Thus, the entries of the matrix A are multivariate polynomials of the entries of B. Furthermore,

β^{(i)}_S(Y) = E[X_S (X_S)^T | Ei]^{−1} E[X_S Y | Ei] = (A_S Λ_i A_S^T)^{−1} A_S Λ_i A_Y^T,    (5)

where A_S and A_Y are the rows corresponding to the set S and to Y in the matrix A, respectively, and Λ_i is a diagonal matrix with [Λ_i]_{kk} = E[(N_k)^2 | Ei]. Therefore, the entries of the vector β^{(i)}_S(Y) are rational functions of the entries of B and Λ_i. Hence, the entries of the Jacobian matrix of β^{(i)}_S(Y) with respect to the diagonal entries of Λ_i are also rational expressions of these parameters.
In the function ConsistentFinder(·), we select any directed graph G consistent with Ess(G*) and set b_vu = 0 if (u, v) ∉ G. In order to check whether G is in Gij, we initially set R(G, Iij) = ∅. Then, we compute the Jacobian matrix of β^{(i)}_S(Y) parametrically for any Y ∈ V and S ⊆ V\{Y}. As noted above, the entries of the Jacobian matrix can be obtained as rational expressions of the entries of B and Λ_i. If all columns of the Jacobian matrix corresponding to the elements of Iij are zero, then β^{(i)}_S(Y) does not change by varying the variances of the exogenous noises in Iij and hence we add (Y, S) to the set R(G, Iij).
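This symbolic Jacobian check can be sketched with sympy. The example below instantiates eq. (5) for a hypothetical chain X1 → X2 → X3 and tests whether β_S(Y) depends on the noise variance λ1 of X1:

```python
import sympy as sp

# Hypothetical chain X1 -> X2 -> X3: free coefficients and noise variances
b21, b32 = sp.symbols('b21 b32')
l1, l2, l3 = sp.symbols('lambda1 lambda2 lambda3', positive=True)

B = sp.Matrix([[0, 0, 0], [b21, 0, 0], [0, b32, 0]])
A = (sp.eye(3) - B).inv()      # A = (I - B)^{-1}, as in (2)
Lam = sp.diag(l1, l2, l3)      # diagonal matrix of exogenous noise variances

def beta(Y, S):
    """beta_S(Y) = (A_S Lambda A_S^T)^{-1} A_S Lambda A_Y^T, as in eq. (5)."""
    AS = A.extract(list(S), list(range(3)))
    AY = A.extract([Y], list(range(3)))
    return sp.simplify((AS * Lam * AS.T).inv() * AS * Lam * AY.T)

# Regressing X2 on {X1}: the derivative w.r.t. lambda1 vanishes -> invariant
print(sp.simplify(sp.diff(beta(1, [0])[0], l1)))  # 0
# Regressing X1 on {X2}: nonzero derivative -> changes when lambda1 changes
print(sp.simplify(sp.diff(beta(0, [1])[0], l1)))
```

ConsistentFinder accepts a candidate DAG exactly when such derivatives with respect to the variances in Iij vanish identically for precisely the pairs (Y, S) found in Rij.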
After checking all Y ∈ V and S ⊆ V\{Y}, we add the graph G to Gij if R(G, Iij) = Rij.

Algorithm 2 LRE Algorithm
Input: Joint distribution over V in environments E = {Ei}^N_{i=1}.
Stage 1: Obtain Ess(G*) by running a complete observational algorithm, and for all X ∈ V, form PA(X), CH(X), UK(X).
Stage 2:
for each pair of environments {Ei, Ej} ⊆ E do
  for all Y ∈ V do
    for each X ∈ UK(Y) do
      Compute β^{(i)}_X(Y), β^{(j)}_X(Y), β^{(i)}_Y(X), and β^{(j)}_Y(X).
      if β^{(i)}_X(Y) ≠ β^{(j)}_X(Y), but β^{(i)}_Y(X) = β^{(j)}_Y(X) then
        Set X as a child of Y and set Y as a parent of X.
      else if β^{(i)}_Y(X) ≠ β^{(j)}_Y(X), but β^{(i)}_X(Y) = β^{(j)}_X(Y) then
        Set X as a parent of Y and set Y as a child of X.
      else if β^{(i)}_X(Y) ≠ β^{(j)}_X(Y), and β^{(i)}_Y(X) ≠ β^{(j)}_Y(X) then
        Find a minimum set S ⊆ N(Y)\{X} such that β^{(i)}_{S∪{X}}(Y) = β^{(j)}_{S∪{X}}(Y).
        if S does not exist then
          Set X as a child of Y and set Y as a parent of X.
        else if β^{(i)}_S(Y) ≠ β^{(j)}_S(Y) then
          ∀W ∈ {X} ∪ S, set W as a parent of Y and set Y as a child of W.
        else
          ∀W ∈ S, set W as a parent of Y and set Y as a child of W.
        end if
      end if
    end for
  end for
end for
Stage 3: Apply Meek rules on the resulting mixed graph to obtain M̂.
Output: Mixed graph M̂.

4 LRE Algorithm
The baseline algorithm of Section 3 was presented to prove the existence of complete algorithms, but it is not practical due to its high computational and sample complexity. In this section we present the Local Regression Examiner (LRE) algorithm, which is an alternative, much more efficient algorithm for learning the causal structure among a set of variables V. The pseudo-code of the algorithm is presented in Algorithm 2. We make use of the following result in this algorithm.
Lemma 2. Consider adjacent variables X, Y ∈ V in the causal structure G.
For a pair of environments E_i and E_j, if (X, {Y}) ∈ R(G, I_ij), but (Y, {X}) ∉ R(G, I_ij), then Y is a parent of X.
See the supplementary material for the proof.
The LRE algorithm consists of three stages. In the first stage, similar to the baseline algorithm, it runs a complete observational algorithm to obtain the essential graph. Then for each variable X ∈ V, it forms the set of X's discovered parents PA(X) and discovered children CH(X), and leaves the remaining neighbors as unknown in UK(X). In the second stage, the goal is that for each variable Y ∈ V, we find Y's relation with its neighbors in UK(Y) based on the invariance of its regression on its neighbors across each pair of environments. To do so, for each pair of environments, after fixing a target variable Y and for each of its neighbors X ∈ UK(Y), the regression coefficients of X on Y and of Y on X are calculated. We will face one of the following cases:
• If neither coefficient changes, we do not make any decision about the relationship of X and Y. This case is similar to having only one environment, similar to the setup in [32].
• If one changes and the other is unchanged, Lemma 2 implies that the variable which fixes the coefficient when used as the regressor is the parent.
• If both change, we look for an auxiliary set S among Y's neighbors with the minimum number of elements for which β_{S∪{X}}^(i)(Y) = β_{S∪{X}}^(j)(Y). If no such S is found, it implies that X is a child of Y. Otherwise, if S and X are both required in the regressor set to fix the coefficient, we set {X} ∪ S as parents of Y; otherwise, if X is not required in the regressor set to fix the coefficient, although we still set S as parents of Y, we do not make any decision regarding the relationship between X and Y (Example 3 with I = {X_1} is an instance of this case).
After adding the discovered relationships to the initial mixed graph, in the third stage we apply the Meek rules on the resulting mixed graph to find all extra possible orientations and output M̂.
Analysis of LRE Algorithm. We can use the hypothesis test in (3) to check whether two vectors β_S^(i)(Y) and β_S^(j)(Y) are equal for any Y ∈ V and S ⊆ N(Y). If the p-value for the set S is less than α/(p × (2^Δ − 1)), then we reject the null hypothesis H^ij_{0,Y,S}. By doing so, we obtain the output with total probability of false rejection less than α. Regarding the computational complexity, since for each pair of environments, in the worst case we perform (2^Δ − 1) hypothesis tests for each variable Y ∈ V, and considering that we have C(N, 2) pairs of environments, the computational complexity of the LRE algorithm is of order C(N, 2)·p·(2^Δ − 1). Therefore, the bottleneck in the complexity of LRE is the requirement of running a complete observational algorithm in its first stage.

Figure 3: Comparison of the performance of the LRE, PC, IP, and LiNGAM algorithms. (a) Error ratio; (b) CW ratio; (c) CU ratio.

5 Experiments
We evaluate the performance of the LRE algorithm by testing it on both synthetic and real data. As seen in the pseudo-code in Algorithm 2, LRE has three stages, where in the first stage a complete observational algorithm is run.
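The pairwise check at the heart of Stage 2 can be sketched as follows. This is our own illustrative sketch, not the authors' code: `ols_coef` and `orient_pair` are hypothetical helpers, a plain numerical tolerance stands in for the paper's hypothesis test (3), and the auxiliary-set search for the "both changed" case is omitted.

```python
import numpy as np

def ols_coef(target, regressors):
    """Least-squares coefficients of `target` regressed on `regressors` (columns)."""
    coef, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    return coef

def orient_pair(x1, y1, x2, y2, tol=0.05):
    """Classify the X-Y edge from samples of X and Y in two environments.

    Returns 'X->Y', 'Y->X', 'both-changed' (the case needing an auxiliary
    set S), or 'undecided', mirroring the Stage-2 cases in the text.
    """
    bxy1 = ols_coef(y1, x1[:, None])[0]  # beta_X(Y) in environment 1
    bxy2 = ols_coef(y2, x2[:, None])[0]  # beta_X(Y) in environment 2
    byx1 = ols_coef(x1, y1[:, None])[0]  # beta_Y(X) in environment 1
    byx2 = ols_coef(x2, y2[:, None])[0]  # beta_Y(X) in environment 2
    xy_changed = abs(bxy1 - bxy2) > tol
    yx_changed = abs(byx1 - byx2) > tol
    if xy_changed and not yx_changed:
        # Regressing X on Y is invariant, so by Lemma 2, Y is the parent.
        return 'Y->X'
    if yx_changed and not xy_changed:
        return 'X->Y'
    if xy_changed and yx_changed:
        return 'both-changed'
    return 'undecided'
```

For example, if Y = 2X + noise and only the variance of X's exogenous noise differs between environments, the regression of Y on X keeps coefficient 2 while the reverse regression shifts, so the sketch orients the edge as X → Y.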
In our simulations, we used the PC algorithm [33], which is known to have a complexity of order O(p^Δ) when applied to a graph of order p with degree bound Δ.
Synthetic Data. We generated 100 DAGs of order p = 10 by first selecting a causal order for the variables and then connecting each pair of variables with probability 0.25. We generated data from a linear Gaussian SEM with coefficients drawn uniformly at random from [0.1, 2] and the variance of each exogenous noise drawn uniformly at random from [0.1, 4]. For each variable of each structure, 10^5 samples were generated. In our simulation, we only considered a scenario in which we have two environments E_1 and E_2, where in the second environment the exogenous noises of |I_12| variables were varied. The perturbed variables were chosen uniformly at random.
Figure 3 shows the performance of the LRE algorithm. Define a link to be any directed or undirected edge. The Error ratio is calculated as follows: Error ratio := (|missed links| + |extra detected links| + |correctly detected but wrongly directed edges|) / C(p, 2). Among the correctly detected links, define C := |correctly directed edges|, W := |wrongly directed edges|, and U := |undirected edges|. The CW and CU ratios are obtained as follows: CW ratio := C/(C + W), CU ratio := C/(C + U). As seen in Figure 3, even one change in the second environment (i.e., |I_12| = 1) increases the CU ratio of LRE by 8 percent compared to the PC algorithm. Also, the main source of error in the LRE algorithm results from the application of the PC algorithm. We also compared the Error ratio and CW ratio of the LRE algorithm with those of Invariant Prediction (IP) [23] and LiNGAM [32] (since there are no undirected edges in the output of IP and LiNGAM, the CU ratio of both would be one). For LiNGAM, we combined the data from the two environments as the input; therefore, the distribution of the exogenous noises of the variables in I_12 is not Gaussian anymore.
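The evaluation metrics defined above can be computed as in the following sketch (the function name and the edge-set representation are our own illustration, not the authors' evaluation code):

```python
from math import comb

def edge_metrics(true_edges, found_directed, found_undirected, p):
    """Error, CW, and CU ratios for a graph on p nodes.

    `true_edges` holds ground-truth directed edges (u, v) meaning u -> v;
    `found_directed` / `found_undirected` are the algorithm's output edges.
    """
    # A "link" ignores orientation, so compare unordered pairs.
    true_links = {frozenset(e) for e in true_edges}
    found_links = ({frozenset(e) for e in found_directed}
                   | {frozenset(e) for e in found_undirected})
    missed = len(true_links - found_links)
    extra = len(found_links - true_links)
    # Among correctly detected links: correctly / wrongly directed, undirected.
    c = sum(1 for e in found_directed if e in true_edges)
    w = sum(1 for e in found_directed
            if frozenset(e) in true_links and e not in true_edges)
    u = sum(1 for e in found_undirected if frozenset(e) in true_links)
    error_ratio = (missed + extra + w) / comb(p, 2)
    cw_ratio = c / (c + w) if c + w else 1.0
    cu_ratio = c / (c + u) if c + u else 1.0
    return error_ratio, cw_ratio, cu_ratio
```

For instance, with true edges {1→2, 2→3, 3→4}, output directed edges {1→2, 3→2} and undirected edge {3–4}, there are no missed or extra links, one wrongly directed edge, and one undirected edge, giving Error ratio 1/6 and CW = CU = 0.5.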
As can be seen in Figure 3(a), the Error ratio of IP increases as the size of I_12 increases. This is mainly because the IP approach assumes that the distribution of the exogenous noise of the target variable does not change, an assumption that may be violated as |I_12| grows. The simulation results show that the Error ratio of LiNGAM is approximately twice those of LRE and PC. We also see that LRE performed better than LiNGAM and IP in terms of CW ratio.

(We use the pcalg package [15] to run the PC algorithm on a set of random variables.)

Figure 4: Performance of the LRE algorithm on GRNs from the DREAM 3 challenge. All five networks have 10 genes, and the total number of edges in each network (from left to right) is 11, 15, 10, 25, and 22, respectively.

Real Data
a) We considered the dataset of educational attainment of teenagers [27]. The dataset was collected from 4739 pupils from about 1100 US high schools, with 13 attributes including gender, race, base-year composite test score, family income, whether the parent attended college, and county unemployment rate. We split the dataset into two parts, where the first part includes data from all pupils who live closer than 10 miles to some 4-year college. In our experiment, we tried to identify the potential causes that influence the years of education the pupils received. We ran the LRE algorithm on the two parts of the data as two environments with a significance level of 0.01 and obtained the following attributes as a possible set of parents of the target variable: base-year composite test score, whether the father was a college graduate, race, and whether the school was in an urban area. The IP method [23] also showed that the first two attributes have significant effects on the target variable.
b) We evaluated the performance of the LRE algorithm on gene regulatory networks (GRNs). A GRN is a collection of biological regulators that interact with each other.
In a GRN, transcription factors are the main players that activate genes. The interactions between transcription factors and regulated genes in a species' genome can be represented by a directed graph, in which a link is drawn whenever a transcription factor regulates a gene's expression. Moreover, some vertices have both functions, i.e., they are both a transcription factor and a regulated gene.
We considered the GRNs in the "DREAM 3 In Silico Network" challenge, conducted in 2008 [19]. The networks in this challenge were extracted from known biological interaction networks, and their structures are available in the open-source tool "GeneNetWeaver (GNW)" [28]. Since we knew the true causal structures of these GRNs, we obtained Ess(G*) and gave it as an input to the LRE algorithm. Furthermore, we used the GNW tool to get 10000 measurements of steady-state levels for every gene in the networks. In order to obtain measurements from the second environment, we increased the coefficients of the exogenous noise terms from 0.05 to 0.2 in the GNW tool. Figure 4 depicts the performance of the LRE algorithm on five networks extracted from the GRNs of E. coli and Yeast. The green, red, and yellow bars for each network show the number of correctly directed edges, wrongly directed edges, and undirected edges, respectively. Note that since we know the correct Ess(G*), there are no missed or extra detected links. As can be seen, the LRE algorithm has a fairly good accuracy (84% on average over all five networks) when it decides to orient an edge.
6 Conclusion
We studied the problem of causal structure learning in a multi-environment setting, in which the functional relations for producing the variables from their parents remain the same across environments, while the distribution of exogenous noises may vary.
We defined a notion of completeness for a causal discovery algorithm in this setting and proved the existence of such an algorithm. We proposed an efficient algorithm with low computational and sample complexity and evaluated its performance by testing it on synthetic and real data. The results show the efficacy of the proposed algorithm.

Acknowledgments
This work was supported in part by ARO grant W911NF-15-1-0281 and ONR grant W911NF-15-1-0479. Also, KZ acknowledges the support from NIH-1R01EB022858-01 FAIN-R01EB022858, NIH-1R01LM012087, and NIH-5U54HG008540-02 FAINU54HG008540. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

References
[1] K. A. Bollen. Structural equations with latent variables. Wiley Series in Probability and Mathematical Statistics. Wiley, 1989.
[2] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
[3] P. Daniusis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf. Distinguishing causes from effects using nonlinear acyclic causal models. In Proc. 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), 2010.
[4] F. Eberhardt. Causation and intervention. Unpublished doctoral dissertation, Carnegie Mellon University, 2007.
[5] F. Eberhardt, C. Glymour, and R. Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-05), pages 178–184, 2005.
[6] J. Etesami and N. Kiyavash. Directed information graphs: A generalization of linear dynamical graphs. In American Control Conference (ACC), pages 2563–2568. IEEE, 2014.
[7] J.
Etesami, N. Kiyavash, and T. Coleman. Learning minimal latent directed information\n\npolytrees. Neural computation, 2016.\n\n[8] A. Ghassami and N. Kiyavash. Interaction information for causal inference: The case of directed\n\ntriangle. In IEEE International Symposium on Information Theory (ISIT), 2017.\n\n[9] A. Ghassami, S. Salehkaleybar, and N. Kiyavash. Optimal experiment design for causal\n\ndiscovery from \ufb01xed number of experiments. arXiv preprint arXiv:1702.08567, 2017.\n\n[10] A. Ghassami, S. Salehkaleybar, N. Kiyavash, and E. Bareinboim. Budgeted experiment design\n\nfor causal structure learning. arXiv preprint arXiv:1709.03625, 2017.\n\n[11] A. Hauser and P. B\u00fchlmann. Two optimal strategies for active learning of causal models from\n\ninterventional data. International Journal of Approximate Reasoning, 55(4):926\u2013939, 2014.\n\n[12] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Sch\u00f6lkopf. Nonlinear causal discovery\nwith additive noise models. In Advances in neural information processing systems, pages\n689\u2013696, 2009.\n\n[13] D. Janzing, J. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniu\u0161is, B. Steudel, and\nB. Sch\u00f6lkopf. Information-geometric approach to inferring causal directions. Arti\ufb01cial Intelli-\ngence, 182:1\u201331, 2012.\n\n[14] D. Janzing and B. Scholkopf. Causal inference using the algorithmic markov condition. IEEE\n\nTransactions on Information Theory, 56(10):5168\u20135194, 2010.\n\n[15] M. Kalisch, M. M\u00e4chler, D. Colombo, M. H. Maathuis, P. B\u00fchlmann, et al. Causal inference\nusing graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1\u201326,\n2012.\n\n[16] S. Kim, C. J. Quinn, N. Kiyavash, and T. P. Coleman. Dynamic and succinct statistical analysis\n\nof neuroscience data. Proceedings of the IEEE, 102(5):683\u2013698, 2014.\n\n[17] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. 
MIT Press, 2009.
[18] H. Lütkepohl. New introduction to multiple time series analysis. Springer Science & Business Media, 2005.
[19] D. Marbach, T. Schaffter, C. Mattiussi, and D. Floreano. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of Computational Biology, 16(2):229–239, 2009.
[20] C. Meek. Graphical models: Selecting causal and statistical models. 1997.
[21] J. Pearl. Causality. Cambridge University Press, 2009.
[22] J. Peters and P. Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101:219–228, 2014.
[23] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.
[24] J. Peters, J. M. Mooij, D. Janzing, B. Schölkopf, et al. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15(1):2009–2053, 2014.
[25] C. J. Quinn, N. Kiyavash, and T. P. Coleman. Efficient methods to compute optimal tree approximations of directed information graphs. IEEE Transactions on Signal Processing, 61(12):3173–3182, 2013.
[26] C. J. Quinn, N. Kiyavash, and T. P. Coleman. Directed information graphs. IEEE Transactions on Information Theory, 61(12):6887–6909, 2015.
[27] C. E. Rouse. Democratization or diversion? The effect of community colleges on educational attainment. Journal of Business & Economic Statistics, 13(2):217–224, 1995.
[28] T. Schaffter, D. Marbach, and D. Floreano. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16):2263–2270, 2011.
[29] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij.
On causal and\nanticausal learning. In Proceedings of the 29th International Conference on Machine Learning\n(ICML), pages 1255\u20131262, 2012.\n\n[30] E. Sgouritsa, D. Janzing, P. Hennig, and B. Sch\u00f6lkopf. Inference of cause and effect with\n\nunsupervised inverse regression. In AISTATS, 2015.\n\n[31] K. Shanmugam, M. Kocaoglu, A. G. Dimakis, and S. Vishwanath. Learning causal graphs with\nsmall interventions. In Advances in Neural Information Processing Systems, pages 3195\u20133203,\n2015.\n\n[32] S. Shimizu, P. O. Hoyer, A. Hyv\u00e4rinen, and A. Kerminen. A linear non-gaussian acyclic model\n\nfor causal discovery. Journal of Machine Learning Research, 7(Oct):2003\u20132030, 2006.\n\n[33] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, prediction, and search. MIT press,\n\n2000.\n\n[34] J. Sun, D. Taylor, and E. M. Bollt. Causal network inference by optimal causation entropy.\n\nSIAM Journal on Applied Dynamical Systems, 14(1):73\u2013106, 2015.\n\n[35] J. Tian and J. Pearl. Causal discovery from changes. In Proceedings of the Seventeenth confer-\nence on Uncertainty in arti\ufb01cial intelligence, pages 512\u2013521. Morgan Kaufmann Publishers\nInc., 2001.\n\n[36] T. Verma and J. Pearl. An algorithm for deciding if a set of observed independencies has a\ncausal explanation. In Proceedings of the Eighth international conference on uncertainty in\narti\ufb01cial intelligence, pages 323\u2013330. Morgan Kaufmann Publishers Inc., 1992.\n\n[37] K. Zhang, B. Huang, J. Zhang, C. Glymour, and B. Sch\u00f6lkopf. Causal discovery in the presence\nof distribution shift: Skeleton estimation and orientation determination. In Proc. International\nJoint Conference on Arti\ufb01cial Intelligence (IJCAI 2017), 2017.\n\n[38] K. Zhang and A. Hyv\u00e4rinen. On the identi\ufb01ability of the post-nonlinear causal model. 
In Proc. 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montreal, Canada, 2009.