{"title": "Multi-domain Causal Structure Learning in Linear Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 6266, "page_last": 6276, "abstract": "We study the problem of causal structure learning in linear systems from observational data given in multiple domains, across which the causal coefficients and/or the distribution of the exogenous noises may vary. The main tool used in our approach is the principle that in a causally sufficient system, the causal modules, as well as their included parameters, change independently across domains. We first introduce our approach for finding causal direction in a system comprising two variables and propose efficient methods for identifying causal direction. Then we generalize our methods to causal structure learning in networks of variables. Most of previous work in structure learning from multi-domain data assume that certain types of invariance are held in causal modules across domains. Our approach unifies the idea in those works and generalizes to the case that there is no such invariance across the domains. Our proposed methods are generally capable of identifying causal direction from fewer than ten domains. When the invariance property holds, two domains are generally sufficient.", "full_text": "Multi-domain Causal Structure Learning\n\nin Linear Systems\n\nAmirEmad Ghassami\u21e4, Negar Kiyavash\u2020, Biwei Huang\u2021, Kun Zhang\u2021\n\n\u21e4Department of ECE, University of Illinois at Urbana-Champaign, Urbana, IL, USA.\n\n\u2020School of ISyE and ECE, Georgia Institute of Technology, Atlanta, GA, USA.\n\u2021Department of Philosophy, Carnegie Mellon University, Pittsburgh, PA, USA.\n\nAbstract\n\nWe study the problem of causal structure learning in linear systems from observa-\ntional data given in multiple domains, across which the causal coef\ufb01cients and/or\nthe distribution of the exogenous noises may vary. 
The main tool used in our approach is the principle that in a causally sufficient system, the causal modules, as well as their included parameters, change independently across domains. We first introduce our approach for finding causal direction in a system comprising two variables and propose efficient methods for identifying causal direction. Then we generalize our methods to causal structure learning in networks of variables. Most previous work in structure learning from multi-domain data assumes that certain types of invariance hold in causal modules across domains. Our approach unifies the idea in those works and generalizes to the case that there is no such invariance across the domains. Our proposed methods are generally capable of identifying causal direction from fewer than ten domains. When the invariance property holds, two domains are generally sufficient.

1 Introduction

Consider a system comprised of two dependent random variables X and Y with no latent confounders. Assuming that the dependency is due to a unidirectional causation, how can we determine which variable is the cause and which one is the effect? The gold standard for answering this question is performing controlled experiments or interventions on the variables. Randomizing or intervening on the cause variable may change the effect variable, but not vice versa [Ebe07]. The issue with this approach is that in many cases performing experiments is expensive, unethical, impossible, or even undefined. Therefore, there is great interest in finding causal relations from purely observational data. Unfortunately, for a general system, using traditional conditional independence-based methods on purely observational data can identify a structure only up to its Markov equivalence class [SGS00, Pea09]. For instance, for the aforementioned system of only two variables, the structures X → Y and Y →
X (the arrow indicates the causal direction) are Markov equivalent. Therefore, extra assumptions are required to make the problem well-defined.

One successful approach in the literature is to restrict the data generating model. For the case that data is generated from a structural equation model with additive noise [Bol89], if the noise is non-Gaussian [SHHK06], or if the equations are non-linear with some mild conditions [ZC06, HJM+09, ZH09], the structure may be identifiable. Note that for additive noise models, in the basic case of a linear system with Gaussian noise, the causal direction is non-identifiable [SHHK06].

Another recently developing approach is to assume that the data is non-homogeneous and is gathered from several different distributions [Hoo90, TP01, PBM16, ZHZ+17, GSKZ17, HZZ+17]. In many real-life settings, the data generating distribution may vary over time, or the dataset may be gathered from different domains and hence not follow a single distribution. While such data is usually problematic in statistical analysis and restricts the learning power, this property can be leveraged for the purpose of causal discovery, which is our focus in this paper. Moreover, it has been shown that causal modeling can be exploited to facilitate transfer learning [SJP+12, ZSMW13]. This is because of the coupling relationship between causal modeling and distribution change: causal models constrain how the data distribution may change, while distribution change exhibits such changes.

*Correspondence to: AmirEmad Ghassami, email: ghassam2@illinois.edu.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we focus on linear systems, with observational data for the variables in the system given in multiple domains, across which the causal coefficients and/or the distribution of the exogenous noises may vary.
We will present efficient approaches to exploiting such changes for causal structure learning. The proposed methods are based on the principle that although the cause and the effect variables are dependent, the mechanism that generates the cause variable changes independently from the mechanism that generates the effect variable across domains. The same principle was utilized in [ZHZ+17] for exploiting non-stationary or heterogeneous data for causal structure learning. However, since that work takes a non-parametric approach, it is restricted to general independence tests among distributions, which may not have high efficiency. The remaining previous works on causal discovery from multiple domains usually assume that a certain type of invariance holds across domains [PBM16, GSKZ17]. Our approach unifies the idea in those works, and generalizes to the case that there is no such invariance across the domains. We present structure learning methods which are generally capable of identifying causal direction from fewer than ten domains (as a special case, when the invariance property holds, two domains generally suffice).

The rest of the paper is organized as follows. We first introduce our approach for finding causal relations from multi-domain observational data in a system comprising two variables in Section 2. We propose our methods for identifying causal direction in Subsections 2.1 and 2.2. These methods are evaluated in Subsection 2.3. We generalize our method to causal structure learning in networks of variables in Section 3, where the generalized versions of our methods are presented in Subsections 3.1 and 3.2 and evaluated in Subsection 3.4. We also present an approach for combining our proposed methods with standard conditional independence-based structure learning algorithms in Subsection 3.3.
All the proofs are provided in the supplementary materials.

2 Identifying Causal Relation between Two Variables

In a general system, each variable is generated by a causal module, with conditional distribution P(effect | causes), which takes the direct causes of the variable as input. We assume that the system is causally sufficient, that is, the variables do not have latent confounders. Given a causal directed acyclic graph (DAG), the causal Markov condition [SGS00, Pea09], as a more formal version of Reichenbach's common cause principle [Rei91], says that any variable is independent of its non-descendants given its direct causes. The contrapositive implies that if two quantities are dependent, there must exist a causal path between them. As a consequence, when the joint distribution of a causally sufficient system changes, for any effect variable, the distribution of its direct causes, P(causes), must change independently from P(effect | causes); otherwise, the parameter sets (or variables, from a Bayesian perspective) involved in the two modules change dependently, so there must exist a causal path connecting the parameter sets, which implies the existence of confounders and contradicts the causal sufficiency assumption. This can be viewed as a realization of the modularity property of causal systems [Pea09], and as the dynamic counterpart of the principle of independent mechanisms, which states that causal modules are algorithmically independent [DJM+10], or of the exogeneity property of the causal system [ZZS15]. We formalize this characteristic as follows.

Definition 1 (Principle of Independent Changes (PIC)). In a causally sufficient system, the causal modules, as well as their included parameters, change independently across domains.

As mentioned in Section 1, this principle was used for causal discovery in nonparametric settings in [ZHZ+17, HZZ+17].
As will be explained in Subsection 2.2, the case of having an invariant causal mechanism is a special case, because a constant is independent from any other variable.

To introduce our methodology, we consider a system comprised of two dependent variables X and Y. Observational data for variables X and Y, or in the asymptotic case, the joint distribution of X and Y, in d domains D = {D^{(1)}, ..., D^{(d)}} is given. The goal is to discover the causal direction between X and Y. We denote the ground-truth cause variable and effect variable by C and E, respectively. The relation between C and E is assumed to be linear. Hence, the model in domain D^{(i)} ∈ D is denoted as follows:

domain D^{(i)}:   C = N_C^{(i)},   E = a^{(i)} C + N_E^{(i)},

where N_C^{(i)} and N_E^{(i)} are independent exogenous noises with variances (σ²_C)^{(i)} and (σ²_E)^{(i)}, respectively. Without loss of generality, we assume that the exogenous noises are zero-mean. We refer to the variances of the exogenous noises and the causal coefficient as the parameters of the system. In general, all three of these parameters can vary across the domains. For our parametric model of interest, σ²_C corresponds to the cause module, while a and σ²_E correspond to the effect generation module. Therefore, PIC implies that σ²_C changes independently from the pair (a, σ²_E). Note that in general, σ²_E need not be independent from a, as they both correspond to the mechanism generating the effect.

2.1 Proposed Approach

Let β_{Y|X} be the linear regression coefficient obtained from regressing Y on X, and let σ²_{Y|X} = Var(Y − β_{Y|X} · X), i.e., the variance of the residual of regressing Y on X.
For the causal direction, we have

β_{E|C} = a,   σ²_{C|∅} = σ²_C,   σ²_{E|C} = σ²_E,   σ²_{E|∅} = a²σ²_C + σ²_E.   (1)

For the reverse direction, we have

β_{C|E} = E[CE] / E[E²] = aσ²_C / (a²σ²_C + σ²_E),   (2)

σ²_{C|E} = Var(N_C − (aσ²_C / (a²σ²_C + σ²_E)) (aN_C + N_E)) = σ²_C σ²_E / (a²σ²_C + σ²_E).   (3)

We will utilize the change information across domains to find the causal direction as follows. For any parameter λ ∈ {σ²_{C|∅}, β_{E|C}, σ²_{E|C}, σ²_{E|∅}, β_{C|E}, σ²_{C|E}}, let λ^{(i)} denote the value of this parameter in domain D^{(i)}, 1 ≤ i ≤ d. Consider {λ^{(1)}, ..., λ^{(d)}} as samples from a random variable λ. As stated earlier, according to PIC, σ²_{C|∅} = σ²_C is independent from (β_{E|C}, σ²_{E|C}) = (a, σ²_E), while as we can see from the expressions in (2) and (3), such independence does not hold in general in the reverse direction. For instance, if a and σ²_E are both fixed, an increase in σ²_C always leads to an increase in β_{C|E} and σ²_{C|E}. Therefore, we propose our causal discovery method as follows. To test whether X is the cause of Y, we test the independence between σ²_{X|∅} and (β_{Y|X}, σ²_{Y|X}). If σ²_{X|∅} and (β_{Y|X}, σ²_{Y|X}) are independent but the counterpart in the reverse direction is not, X is considered as the cause variable and Y the effect variable. More specifically, for testing if X is the cause of Y, let Λ_{X→Y} = {|β_{Y|X}|, σ²_{Y|X}}, and define the dependence measure

T_{X→Y}(D) := Σ_{λ ∈ Λ_{X→Y}} I(λ, σ²_{X|∅}),

where any standard non-parametric measure of dependence I(·,·), such as mutual information, can be used. Therefore, for inferring the causal relation between X and Y, we calculate T_{X→Y}(D) and T_{Y→X}(D) and pick the direction which has the smaller value, i.e., arg min_{π ∈ {X→Y, Y→X}} T_π(D), as the correct causal direction.
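As a sanity check on equations (1)-(3) and on the asymmetry they induce, the following sketch simulates the model. All parameter values are hypothetical, and the quartile-binned contrast is only a crude stand-in for a generic dependence measure I(·,·), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- One domain: check the closed forms in equations (2)-(3). ---
a, var_c, var_e = 1.5, 2.0, 0.5          # hypothetical parameters
n = 200_000
C = rng.normal(0.0, np.sqrt(var_c), n)
E = a * C + rng.normal(0.0, np.sqrt(var_e), n)

beta_ce = np.dot(E, C) / np.dot(E, E)     # regress C on E
resvar_ce = np.var(C - beta_ce * E)
# equations (2)-(3) predict a*var_c/m = 0.6 and var_c*var_e/m = 0.2 with m = 5.0

# --- Many domains: module parameters drawn independently, as PIC posits. ---
d = 4000
a = rng.uniform(0.5, 2.0, d)
var_c = rng.uniform(0.5, 2.0, d)
var_e = rng.uniform(0.5, 2.0, d)

def contrast(cond, params, frac=0.25):
    # Crude dependence statistic: compare E[log lambda] between the lowest
    # and highest quartiles of the conditioning parameter.
    k = int(frac * len(cond))
    idx = np.argsort(cond)
    lo, hi = idx[:k], idx[-k:]
    return sum(abs(np.log(p[hi]).mean() - np.log(p[lo]).mean()) for p in params)

# Causal direction: (|beta_{E|C}|, sigma2_{E|C}) = (a, var_e) vs sigma2_{C|.}.
t_causal = contrast(var_c, [a, var_e])

# Reverse direction: closed forms (2)-(3) vs sigma2_{E|.}.
m = a**2 * var_c + var_e
t_reverse = contrast(m, [a * var_c / m, var_c * var_e / m])
```

In runs of this sketch, t_causal hovers near zero while t_reverse stays bounded away from it, matching the asymmetry that the proposed test exploits.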
Alternatively, one can use a test of statistical independence, such as the kernel-based one [GFT+08], to infer the direction.

2.1.1 Identical Boundaries Method

Although checking for independence is sufficient for discovering the causal relation, in general performing a non-parametric independence test may not be efficient. This may be especially problematic as in many applications the number of domains is small. In this subsection, we show that the parametric structure of our model can be exploited to devise an efficient independence test.

The main idea is as follows. Consider two bounded random variables λ and λ̃. We refer to the minimum and maximum of λ̃ as the boundaries of this variable. If λ and λ̃ are independent, then λ will have the same distribution conditioned on λ̃ = λ̃_min and λ̃ = λ̃_max, i.e., λ will have identical distributions on the boundaries of λ̃. Specifically, it will have the same expected value on the boundaries of λ̃. On the other hand, if λ and λ̃ are dependent, the expected values of λ on the boundaries of λ̃ are not necessarily different. However, we will show that if X → Y is not the causal direction and β_{Y|X} (and σ²_{Y|X}) is dependent on σ²_{X|∅}, then, due to the specific parametric structure of our model, the expected value of β_{Y|X} (and σ²_{Y|X}) on the boundaries of σ²_{X|∅} will not be identical.

For any parameter of interest, we denote its minimum and maximum value in the dataset with subscripts m̂ and M̂, respectively. For inferring if X is the cause of Y, let

I^IB(λ, σ²_{X|∅}) := E[log λ | log(σ²_{X|∅}) = (log(σ²_{X|∅}))_M̂] − E[log λ | log(σ²_{X|∅}) = (log(σ²_{X|∅}))_m̂],

and according to (3), we define the causal order indicator T^IB_{X→Y}(D) := Σ_{λ ∈ Λ_{X→Y}} |I^IB(λ, σ²_{X|∅})|, where IB stands for identical boundaries. We have the following result for identifiability.

Theorem 1. For dataset D = {D^{(1)}, ..., D^{(d)}}, as d → ∞, T^IB_{C→E}(D) → 0, and T^IB_{E→C}(D) → c, for some positive value c which is bounded away from 0.

Using Theorem 1, to see the causal relation between X and Y, we calculate T^IB_{X→Y}(D) and T^IB_{Y→X}(D) and pick the direction which has the smaller value, i.e., arg min_{π ∈ {X→Y, Y→X}} T^IB_π(D), as the true causal direction. We term this approach the Identical Boundaries (IB) method.

Note that in the IB method we only perform a first-order statistical test (i.e., regarding the mean) on the boundaries. Clearly, the idea can be extended to performing higher-order tests as well. We have provided the required formulation for the extension to a second-order test in the supplementary materials.

2.2 Minimal Changes Method

One may perform causal discovery from multiple domains by assuming certain invariances in the parameters across domains. More specifically, consider a pair of domains {D^{(i)}, D^{(j)}}. In order to find the causal direction, the authors of [GSKZ17] consider the particular case that a^{(i)} = a^{(j)} and either (σ²_C)^{(i)} = (σ²_C)^{(j)} or (σ²_E)^{(i)} = (σ²_E)^{(j)}. In this case, β^{(i)}_{E|C} = β^{(j)}_{E|C}, and we have β^{(i)}_{C|E} = β^{(j)}_{C|E} if and only if σ^{(i)}_C / σ^{(j)}_C = σ^{(i)}_E / σ^{(j)}_E. Therefore, comparing the coefficients in both directions, the causal direction is identifiable if σ^{(i)}_C / σ^{(j)}_C ≠ σ^{(i)}_E / σ^{(j)}_E. Note that if either of the variances does not change, this condition always holds. In [PBM16], the authors assume that the module generating the effect is invariant, i.e., a^{(i)} = a^{(j)} and (σ²_E)^{(i)} = (σ²_E)^{(j)}. Therefore, for variable E and any set S, if (σ²_{E|S})^{(i)} ≠ (σ²_{E|S})^{(j)}, then S is not the parent set of E. Similarly, the authors of [WSBU18] consider the case that a^{(i)} ≠ a^{(j)} and either σ^{(i)}_C = σ^{(j)}_C or σ^{(i)}_E = σ^{(j)}_E. Therefore, under this assumption, in this case for variable X ∈ {C, E} and set S ⊆ {C, E} \ X, if (σ²_{X|S})^{(i)} = (σ²_{X|S})^{(j)}, then S is the parent set of X.

We note that invariance is a special case of the condition of independent changes, as a constant is independent from any variable. Therefore, our idea introduced in Subsection 2.1 can be applied to the case of existence of an invariant parameter across domains. In this case, two domains are generally sufficient to identify the causal direction. Therefore, we give a unification and generalization of the perspectives of the aforementioned previous works. The assumptions in these works can be seen as particular realizations of the faithfulness assumption [SGS00], which is required for our proposed approach as well:

Assumption 1. For any pair of domains D^{(i)} and D^{(j)}, if (σ²_{E|∅})^{(i)} = (σ²_{E|∅})^{(j)}, or (β_{C|E})^{(i)} = (β_{C|E})^{(j)}, or (σ²_{C|E})^{(i)} = (σ²_{C|E})^{(j)}, then all parameters of the system are unchanged across D^{(i)} and D^{(j)}, i.e., (σ²_C)^{(i)} = (σ²_C)^{(j)}, and a^{(i)} = a^{(j)}, and (σ²_E)^{(i)} = (σ²_E)^{(j)}.

Assumption 1 is mild in the sense that it only rules out a 2-dimensional subspace of a 3-dimensional space. Therefore, considering the Lebesgue measure on the 3-dimensional space, we are only ruling out a measure-zero subset.

Since invariance is a special case of independent changes, based on PIC, a change in one causal module does not force any changes in another causal module, i.e., a change in, say, σ²_C, will not enforce any changes on β_{E|C} or σ²_{E|C}. However, in the reverse direction, as can be seen from the equations in (2) and (3), if any of the parameters a, σ²_C, and σ²_E varies across two domains, then by Assumption 1, all three quantities σ²_{E|∅}, β_{C|E}, and σ²_{C|E} will change. Based on this observation, we propose the following principle for causal discovery, which is the counterpart of PIC for the case of invariance.

Definition 2 (Principle of Minimal Changes (PMC)). Suppose Assumption 1 holds.
Compared to the direction from effect to cause, fewer or an equal number of changes are required in the causal direction to explain the variation in the joint distribution.

Therefore, for testing whether variable X is the cause of variable Y, we propose the following method. Let Λ'_{X→Y} := {σ²_{X|∅}, |β_{Y|X}|, σ²_{Y|X}}. We denote a member of Λ'_{X→Y} by λ, and its realization in domain D^{(i)} by λ^{(i)}. For any pair of domains {D^{(i)}, D^{(j)}}, let Q^{(i,j)}_{X→Y} := Σ_{λ ∈ Λ'_{X→Y}} 1[log λ^{(i)} ≠ log λ^{(j)}]. This quantity counts the number of members of Λ'_{X→Y} that vary across domains {D^{(i)}, D^{(j)}}. We define the causal direction indicator

T^MC_{X→Y}(D) := Σ_{1≤i<j≤d} Q^{(i,j)}_{X→Y},

and, following PMC, pick the direction which has the smaller value, i.e., arg min_{π ∈ {X→Y, Y→X}} T^MC_π(D), as the causal direction. We term this approach the Minimal Changes (MC) method.

2.3 Evaluation

Figure 1: F1 score versus number of domains for model 1 on the left and model 2 on the right (curves: IB, MC, and HSIC, each with and without a confounder).

3 Causal Structure Learning in Networks of Variables

We consider the following linear structural equation model (SEM) on a vector of variables X = (X_1, ..., X_p)^⊤:

X = B^⊤ X + N,   (4)

where B is the matrix of coefficients of edges, with B_ij denoting the coefficient from variable X_i to X_j. If B_ij ≠ 0, X_i is a parent of X_j and X_j is a child of X_i. We denote the parent set and children set of X by Pa(X) and Ch(X), respectively. A variable is called a source variable if it does not have any parents, and a sink variable if it does not have any children. Without loss of generality, we assume that B is a strictly upper triangular matrix. Since the underlying structure is a DAG, the rows and columns of B can be permuted for this condition to be satisfied. N = (N_1, ..., N_p)^⊤ is the vector of exogenous noises. We denote the variance of N_i by σ²_i, and assume that the elements of N are independent and, without loss of generality, zero-mean.
Also, we define Ω as the p × p diagonal matrix with values σ²_1, ..., σ²_p on the diagonal.

Algorithm 1 Multi-domain Causal Structure Learning.
Input: Data from variables V in domains D, initial order π_t on V. Initiation: π̂ = ∅.
while |π_t| ≠ 0 do
  for X ∈ π_t do
    Λ_{π_{X,1}}(n) = {|(B̂_{π_{X,1}})_{m,n}|, (Ω̂_{π_{X,1}})_{n,n} : 1 ≤ m < n}, 1 ≤ n ≤ |π_t|.
    T_{X,1}(D) = Σ_{λ ∈ Λ_{π_{X,1}}(n)} Σ_{k=1}^{n−1} Σ_{λ̃ ∈ Λ_{π_{X,1}}(k)} I(λ, λ̃), with n = |π_t|.
    For the IB method: T^IB_{X,1}(D) = Σ_{λ ∈ Λ_{π_{X,1}}(n)} Σ_{k=1}^{n−1} Σ_{λ̃ ∈ Λ_{π_{X,1}}(k)} |E[log λ | log λ̃ = (log λ̃)_M̂] − E[log λ | log λ̃ = (log λ̃)_m̂]|.
  end for
  X_last = arg min_{X ∈ π_t} T^IB_{X,1}(D).
  π̂ = concatenate(X_last, π̂), remove X_last from π_t.
end while
Output: π̂.

We define an ordering of the variables as a permutation of the variables of the system. An ordering on a set of variables and a DAG on those variables are consistent if, in the ordering, every variable comes after its parents and before its children. Note that given an undirected graph, an ordering determines the direction of all the edges of the graph uniquely; however, there may be more than one ordering consistent with a given DAG. For example, for the DAG W → X → Y ← Z, the orderings (W, Z, X, Y), (W, X, Z, Y), and (Z, W, X, Y) are consistent, but the ordering (W, X, Y, Z) is not consistent with the DAG, as it contradicts the direction of the edge between Y and Z.

Definition 3 (Causal Order). An ordering on the variables is called causal if it is consistent with the ground-truth causal DAG.

Since the skeleton of the causal DAG can be identified from basic conditional independence tests, the main challenge in causal structure learning is to find a causal order.
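The greedy sink-removal loop of Algorithm 1 can be sketched as follows. Here each candidate "move X to the last position" ordering is scored by an exact cross-domain invariance count (in the spirit of the MC variant, swapped in for the dependence measures I(·,·) of Algorithm 1), and the 3-variable chain, the two domains, and all parameter values are hypothetical:

```python
import numpy as np

def sem_cov(B, noise_vars):
    # Exact covariance of X for the SEM X = B^T X + N.
    A = np.linalg.inv(np.eye(len(noise_vars)) - B)
    return A.T @ np.diag(noise_vars) @ A

def reg_params(Sigma, order):
    # Regression coefficients and residual variance of each variable on its
    # predecessors in `order`, computed from an exact covariance matrix.
    out = []
    for m, v in enumerate(order):
        S = list(order[:m])
        if S:
            coef = np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[S, v])
            out.extend(coef)
            out.append(Sigma[v, v] - Sigma[v, S] @ coef)
        else:
            out.append(Sigma[v, v])
    return np.array(out)

def invariances(order, Sig_i, Sig_j):
    # Number of regression parameters unchanged across the two domains.
    return int(np.sum(np.isclose(reg_params(Sig_i, order),
                                 reg_params(Sig_j, order))))

def greedy_order(Sig_i, Sig_j):
    # Repeatedly pick as sink the variable whose "placed last" ordering leaves
    # the most parameters invariant, then remove it (bottom-up removal).
    remaining, suffix = list(range(len(Sig_i))), []
    while remaining:
        best = max(remaining,
                   key=lambda x: invariances(
                       [v for v in remaining if v != x] + [x], Sig_i, Sig_j))
        suffix.insert(0, best)
        remaining.remove(best)
    return suffix

# Hypothetical chain X0 -> X1 -> X2; across the two domains the X0 -> X1
# coefficient and the noise variance of X0 change, the module of X2 does not.
B1 = np.array([[0., 0.8, 0.], [0., 0., -1.2], [0., 0., 0.]])
B2 = np.array([[0., 0.5, 0.], [0., 0., -1.2], [0., 0., 0.]])
S1 = sem_cov(B1, [1.0, 0.5, 2.0])
S2 = sem_cov(B2, [3.0, 0.5, 2.0])
```

With these two domains, placing the true sink X2 last leaves the most parameters invariant, and the greedy loop recovers the causal order [0, 1, 2].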
In the following, we present our approaches to estimating a causal order on the variables of the system.

3.1 Proposed Approach

Suppose a candidate for the causal order on the variables, denoted by π, is given. In order to generalize our methodology to the network case, for each domain, we need to first estimate the regression coefficients and exogenous noise variances of each variable X_i on all the variables X_j with π^{−1}(X_j) < π^{−1}(X_i), i.e., all the variables which precede X_i in the given order. For the given order π on the variables, for 1 ≤ m ≤ p, let S_m := (π(1), ..., π(m−1))^⊤. We denote the estimated regression coefficients and exogenous noise variances for the given ordering π by B̂_π and Ω̂_π, respectively, where (B̂_π)_{(1:m−1),m} = β_{π(m)|S_m} and (Ω̂_π)_{m,m} = σ²_{π(m)|S_m}, and zero elsewhere. Note that if π is a causal order, then B̂_π = B and Ω̂_π = Ω up to permutation. Standard regression calculations can be used for obtaining B̂_π and Ω̂_π. We additionally propose an efficient method for estimating the regression coefficients and noise variances, provided in the Supplementary Materials. Our proposed method also makes a connection between the precision matrix and the adjacency matrix of a Bayesian network.

We will utilize the change information across domains to find a causal order as follows. Consider the matrix M = B + Ω. According to PIC, elements in each column of M should be jointly independent from elements in any other column, as they correspond to distinct causal modules. Therefore, we can set a metric for measuring dependencies, and orders that obtain the minimum value are causal orders. More specifically, for a given order π on the variables, let B̂_π and Ω̂_π be the outputs of regression.
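A minimal sketch of this per-ordering regression step (the SEM, the sample size, and all parameter values are hypothetical); for a causal order it should recover B and Ω up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical SEM X = B^T X + N on p = 3 variables, B strictly upper triangular.
B = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, -1.2],
              [0.0, 0.0, 0.0]])
noise_vars = np.array([1.0, 0.5, 2.0])

n = 100_000
N = rng.normal(0.0, np.sqrt(noise_vars), size=(n, 3))
X = N @ np.linalg.inv(np.eye(3) - B)     # X = B^T X + N  =>  X = N (I - B)^{-1}

def fit_ordering(X, order):
    # \hat B_pi and \hat Omega_pi: regress each variable on its predecessors
    # in `order`; entries are indexed by the original variable labels.
    p = X.shape[1]
    B_hat, omega_hat = np.zeros((p, p)), np.zeros(p)
    for m, v in enumerate(order):
        preds = list(order[:m])
        if preds:
            coef, *_ = np.linalg.lstsq(X[:, preds], X[:, v], rcond=None)
            B_hat[preds, v] = coef
            resid = X[:, v] - X[:, preds] @ coef
        else:
            resid = X[:, v]
        omega_hat[v] = resid.var()
    return B_hat, omega_hat

B_hat, omega_hat = fit_ordering(X, [0, 1, 2])   # a causal order for this DAG
```

For a non-causal ordering the same routine still runs, but the fitted coefficients and residual variances mix the modules, which is exactly what the cross-domain dependence tests detect.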
A naive way of applying the idea in Section 2 to networks is as follows. For 1 ≤ n ≤ p, let Λ_π(n) := {|(B̂_π)_{m,n}|, (Ω̂_π)_{n,n} : 1 ≤ m < n}, and define Q_π(n) := Σ_{λ ∈ Λ_π(n)} Σ_{k=1}^{n−1} Σ_{λ̃ ∈ Λ_π(k)} I(λ, λ̃), where I(·,·) is any standard measure of dependence. We define the causal order indicator for exhaustive search T^e_π(D) := Σ_{n=2}^{p} Q_π(n). Hence, one can estimate the causal order as π̂ = arg min_π T^e_π(D). Therefore, in low dimensions, the causal order can be found by exhaustive search over all orders. However, for large dimensions, this is infeasible, as the number of orders increases super-exponentially with the number of variables.

Algorithm 2 MC Causal Structure Learning.
Input: Data from variables V in domains D, initial order π_t on V. Initiation: π̂ = ∅.
while |π_t| ≠ 0 do
  for X ∈ π_t do
    Define Π_X = {π_t, π_{X,1}, π_{X,2}}.
    Λ'_π = {|(B̂_π)_{m,n}|, (Ω̂_π)_{m,m} : 1 ≤ m ≤ n ≤ |π_t|}, π ∈ Π_X.
    Q^{(i,j)}_π = Σ_{λ ∈ Λ'_π} 1[log λ^{(i)} = log λ^{(j)}].
    T^MC_π(D) = Σ_{1≤i<j≤d} Q^{(i,j)}_π.
    π_t = arg max_{π ∈ Π_X} T^MC_π(D).
  end for
  π̂ = concatenate(last variable of π_t, π̂), remove it from π_t.
end while
Output: π̂.

Using Theorem 5, one can estimate the causal order as π̂ = arg max_π T^{eMC}_π(D). Therefore, in low dimensions, the causal order can be found by exhaustive search over all orders. However, for large dimensions, this is infeasible. Therefore, we propose the following alternative method for applying the MC method to networks.

The pseudo-code of the proposed approach is presented in Algorithm 2. The main idea is that in each round, we find one variable with the lowest causal order and remove it from the list, until all the variables are ordered. The algorithm starts with a random initial order π_t on all variables. In each round, for each variable X, it forms 3 orders in the set Π_X.
π_t is the initial order; π_{X,1} is the same as π_t, but with X moved to the last position; and π_{X,2} is the same as π_t, but with X moved to one before the last position in the order.² The algorithm then calculates the quantity T^MC_π(D), given in line 7 of the pseudo-code, for each of the three orders in Π_X, and updates π_t to the element of Π_X that has the maximum value for this quantity. In the case of a tie, we prioritize the orders as follows: π_t > π_{X,2} > π_{X,1}. This prioritization guarantees that after performing the aforementioned update of π_t for all variables, the last variable in π_t will be a sink variable in the subgraph induced on the variables in π_t. We concatenate this variable to the left side of our estimated order π̂, remove it from π_t, and continue to the next round until all the variables are moved to π̂.

Theorem 6. In each round of Algorithm 2, if X_s is a sink variable, then for all π ∈ Π_{X_s}, T^MC_{π_{X_s,1}}(D) ≥ T^MC_π(D). Also, for any of X_s's parents, X_v, if there exists a pair of domains across which at least one and at most two of the quantities Var(X_v), B_{v,s}, and σ²_s vary, then at the end of the round, the last variable in π_t will be a sink variable.

Remark 1. Finding independence in Algorithm 1 and invariance in Algorithm 2 can also be done from top to bottom, similar to the approach used in [SIS+11]. That is, in each round we can also find a variable with the highest causal order.

3.3 Combining Proposed Methods with Conditional Independence-based Algorithms

One can also run our proposed methods after applying any standard conditional independence-based algorithm, such as PC [SGS00] or GES [Chi02], to the data from some or all domains. Therefore, our proposed approach learns beyond the Markov equivalence class.
The approach for combining our methods with standard conditional independence-based algorithms is as follows.

1. We run the observational algorithm on the data from all domains (or a subset, if the sample size is too big) to learn the essential graph (also known as the CPDAG) of the underlying DAG.

2. We note that an essential graph is a chain graph. For each chain component C, we denote its vertices by V(C) and the set of parents of V(C) by Pa(V(C)), and define V_C = V(C) ∪ Pa(V(C)). Note that a variable cannot have both ingoing and outgoing edges from variables in V(C); otherwise we would have a partially directed cycle, which is not allowed in a chain graph [AMP+97].

3. We apply our methods on each chain component separately and use the data corresponding to V_C as the input. We need to include Pa(V(C)) in V_C to ensure that the variables under consideration do not have any latent confounders.

4. In V_C, we locate the variables in Pa(V(C)) at the beginning of the order and use our methods to find the order on the remaining variables of V_C, i.e., on V(C).

5. We combine the orders obtained from the chain components.

Note that with sufficiently many changing domains, our proposed methods can learn the whole causal structure regardless of utilizing any conditional independence-based algorithm.

3.4 Evaluation

We considered model 1 described in Subsection 2.3 for generating the parameters of the system, with the number of generated samples in each domain equal to 10³.
² We have provided an example in the Supplementary Materials to demonstrate why it is required to consider both orders π_{X,1} and π_{X,2}.

After identifying the causal ordering,
(complete)\nd = 50 (complete)\nd = 2 (sparse)\nd = 10 (sparse)\nd = 20 (sparse)\nd = 50 (sparse)\n\n8\n\n6\n12\nNumber of variables\n\n10\n\n14\n\n16\n\nFigure 2: F1 score versus number of domains and number of variables.\n\nPHC\n\nCA3/DG\n\nwe then estimate the causal coef\ufb01cients B on each domain separately. We set a threshold \u21b5 = 0.1\non B from each domain; if |Bij| is larger than \u21b5, then there is an edge from Xi to Xj. Then if an\nedge appears in more than 80% of all domains, we take this edge in the \ufb01nal graph. The results\nare show in Figure 2. All experiments are performed either on complete graphs or on sparse graph\ngenerated from Erdos-Renyi model with parameter 0.3. In general, we observed better performance\non denser graphs. This is expected as having more parameters helps us in predicting the order. The\nIB and MC methods both showed high performance in our simulations, while the performance of\nnon-parametric HSIC test was in general not acceptable. We also compared the performance with\nLiNGAM Algorithm [SHHK06]. To do so, we applied LiNGAM algorithm to the pooled data of all\ndomains. As explained in [ZHZ+17], LiNAGM failed to perform well on our multi-domain data.\nfMRI hippocampus data: We applied our methods to fMRI\nhippocampus dataset [PL], which contains signals from six sep-\narate brain regions: perirhinal cortex (PRC), parahippocampal cor-\ntex (PHC), entorhinal cortex (ERC), subiculum (Sub), CA1, and\nCA3/Dentate Gyrus (CA3) in the resting states on the same per-\nson in 84 successive days. We used the anatomical connections\n[BB08, ZHZ+17] as a reference. We applied both MC and IB on this dataset. We investigated\nall possible causal orders and found the one that minimizes the causal order indicator for MC and\nIB. 
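The per-domain thresholding and cross-domain voting rule used in our experiments can be sketched as follows; `aggregate_edges` is our own name for this illustration, and the per-domain coefficient matrices are assumed to be estimated already.

```python
import numpy as np

def aggregate_edges(B_list, alpha=0.1, vote=0.8):
    """Keep edge X_i -> X_j if |B_ij| exceeds the threshold alpha in
    more than a `vote` fraction of the domains.

    B_list: list of per-domain estimated coefficient matrices B.
    Returns a boolean adjacency matrix of the final graph.
    """
    # Fraction of domains in which each entry survives the threshold.
    votes = np.mean([np.abs(B) > alpha for B in B_list], axis=0)
    return votes > vote
```

For instance, with three domains where |B_01| exceeds 0.1 in all of them but |B_10| exceeds it in none, only the edge X_0 → X_1 is retained. In the fMRI experiment below, the vote fraction is 0.6 instead of 0.8.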
After identifying the causal ordering, we estimated the causal coefficients B on each domain separately with threshold α = 0.1, and if an edge appeared in more than 60% of all domains, we took that edge in the final graph. The recovered causal graph between the six regions is shown in the figure on the right. The black edges are identified by both the MC and IB methods, the blue edges are identified only by the MC method, and the orange edges only by the IB method. The edges in the anatomical ground truth are as follows: PHC → ERC, PRC → ERC, ERC → CA3/DG, CA3/DG → CA1, CA1 → Sub, Sub → ERC, and ERC → CA1.

[Figure: recovered causal graph over PRC, PHC, ERC, Sub, CA1, and CA3/DG.]

4 Conclusion

We studied causal structure learning from multi-domain observational data. We proposed methods based on the principle that in a causally sufficient system, the causal modules, as well as their included parameters, change independently across domains. The main idea in our approach does not require any type of invariance across the domains. We first introduced our methods on a linear system comprised of two variables, and then proposed efficient algorithms to generalize the idea to the case of multiple variables. We evaluated our methods on both synthetic and real data. Our proposed methods are generally capable of identifying causal direction from fewer than ten domains, and when the invariance property holds, two domains are generally sufficient.

Acknowledgments

This work was supported in part by Army grant W911NF-15-1-0281 and NSF grant NSF CCF 1065022. This material is partially based upon work supported by the United States Air Force under Contract No. FA8650-17-C-7715, by the National Science Foundation under EAGER Grant No. IIS-1829681, and by the National Institutes of Health under Contract No.
NIH-1R01EB022858-01, FAINR01EB022858, NIH-1R01LM012087, NIH-5U54HG008540-02, and FAIN-U54HG008540, and work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force, the National Institutes of Health, or the National Science Foundation. We thank Clark Glymour and Malcolm Forster for helpful discussions, and appreciate the comments from the anonymous reviewers, which greatly helped to improve the paper.

References

[AMP+97] Steen A. Andersson, David Madigan, Michael D. Perlman, et al. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505-541, 1997.

[BB08] Chris M. Bird and Neil Burgess. The hippocampus and memory: insights from spatial processing. Nature Reviews Neuroscience, 9(3):182-194, 2008.

[Bol89] Kenneth A. Bollen. Structural Equations with Latent Variables. Wiley Series in Probability and Mathematical Statistics. Wiley, 1989.

[Chi02] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507-554, 2002.

[DJM+10] Povilas Daniusis, Dominik Janzing, Joris Mooij, Jakob Zscheischler, Bastian Steudel, Kun Zhang, and Bernhard Schölkopf. Distinguishing causes from effects using nonlinear acyclic causal models. In Proc. 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), 2010.

[Ebe07] Frederick Eberhardt. Causation and Intervention. Unpublished doctoral dissertation, Carnegie Mellon University, 2007.

[GFT+08] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B.
Schölkopf, and A. J. Smola. A kernel statistical test of independence. In NIPS 20, pages 585-592, Cambridge, MA, 2008. MIT Press.

[GSKZ17] AmirEmad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Kun Zhang. Learning causal structures using regression invariance. In Advances in Neural Information Processing Systems, pages 3015-3025, 2017.

[HJM+09] Patrik O. Hoyer, Dominik Janzing, Joris M. Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pages 689-696, 2009.

[Hoo90] K. Hoover. The logic of causal inference. Economics and Philosophy, 6:207-234, 1990.

[HZZ+17] Biwei Huang, Kun Zhang, Jiji Zhang, Ruben Sanchez Romero, Clark Glymour, and Bernhard Schölkopf. Behind distribution shift: Mining driving forces of changes and causal arrows. In Proceedings of the IEEE 17th International Conference on Data Mining (ICDM 2017), 2017.

[PBM16] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947-1012, 2016.

[Pea09] Judea Pearl. Causality. Cambridge University Press, 2009.

[PL] Poldrack and Laumann. https://openfmri.org/dataset/ds000031/, 2015.

[Pou11] Mohsen Pourahmadi. Covariance estimation: The GLM and regularization perspectives. Statistical Science, pages 369-387, 2011.

[Rei91] Hans Reichenbach. The Direction of Time, volume 65. University of California Press, 1991.

[SGS00] Peter Spirtes, Clark N. Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.

[SHHK06] Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for causal discovery.
Journal of Machine Learning Research, 7(Oct):2003-2030, 2006.

[SIS+11] Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O. Hoyer, and Kenneth Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225-1248, 2011.

[SJP+12] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In Proc. 29th International Conference on Machine Learning (ICML 2012), Edinburgh, Scotland, 2012.

[TP01] Jin Tian and Judea Pearl. Causal discovery from changes. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 512-521. Morgan Kaufmann Publishers Inc., 2001.

[WSBU18] Yuhao Wang, Chandler Squires, Anastasiya Belyaeva, and Caroline Uhler. Direct estimation of differences in causal graphs. arXiv preprint arXiv:1802.05631, 2018.

[XG16] Pan Xu and Quanquan Gu. Semiparametric differential graph models. In Advances in Neural Information Processing Systems, pages 1064-1072, 2016.

[ZC06] K. Zhang and L. Chan. Extensions of ICA for causality discovery in the Hong Kong stock market. In Proc. 13th International Conference on Neural Information Processing (ICONIP 2006), 2006.

[ZH09] Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proc. 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montreal, Canada, 2009.

[ZHZ+17] Kun Zhang, Biwei Huang, Jiji Zhang, Clark Glymour, and Bernhard Schölkopf. Causal discovery in the presence of distribution shift: Skeleton estimation and orientation determination. In Proc. International Joint Conference on Artificial Intelligence (IJCAI 2017), 2017.

[ZSMW13] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang.
Domain adaptation under target and conditional shift. In ICML, 2013.

[ZZS15] K. Zhang, J. Zhang, and B. Schölkopf. Distinguishing cause from effect based on exogeneity. In Proc. 15th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2015), 2015.