{"title": "Causal discovery in multiple models from different experiments", "book": "Advances in Neural Information Processing Systems", "page_first": 415, "page_last": 423, "abstract": "A long-standing open research problem is how to use information from different experiments, including background knowledge, to infer causal relations. Recent developments have shown ways to use multiple data sets, provided they originate from identical experiments. We present the MCI-algorithm as the first method that can infer provably valid causal relations in the large sample limit from different experiments. It is fast, reliable and produces very clear and easily interpretable output. It is based on a result that shows that constraint-based causal discovery is decomposable into a candidate pair identification and subsequent elimination step that can be applied separately from different models. We test the algorithm on a variety of synthetic input model sets to assess its behavior and the quality of the output. The method shows promising signs that it can be adapted to suit causal discovery in real-world application areas as well, including large databases.", "full_text": "Causal discovery in multiple models from different experiments\n\nTom Claassen\nRadboud University Nijmegen\nThe Netherlands\ntomc@cs.ru.nl\n\nTom Heskes\nRadboud University Nijmegen\nThe Netherlands\ntomh@cs.ru.nl\n\nAbstract\n\nA long-standing open research problem is how to use information from different experiments, including background knowledge, to infer causal relations. Recent developments have shown ways to use multiple data sets, provided they originate from identical experiments. We present the MCI-algorithm as the first method that can infer provably valid causal relations in the large sample limit from different experiments. It is fast, reliable and produces very clear and easily interpretable output. 
It is based on a result that shows that constraint-based causal discovery is\ndecomposable into a candidate pair identi\ufb01cation and subsequent elimination step\nthat can be applied separately from different models. We test the algorithm on a\nvariety of synthetic input model sets to assess its behavior and the quality of the\noutput. The method shows promising signs that it can be adapted to suit causal\ndiscovery in real-world application areas as well, including large databases.\n\n1\n\nIntroduction\n\nDiscovering causal relations from observational data is an important, ubiquitous problem in science.\nIn many application areas there is a multitude of data from different but related experiments. Often\nthe set of measured variables is not the same between trials, or the circumstances under which they\nwere conducted differed, making it dif\ufb01cult to compare and evaluate results, especially when they\nseem to contradict each other, e.g. when a certain dependency is observed in one experiment, but\nnot in another. Results obtained from one data set are often used to either corroborate or challenge\nresults from another. Yet how to reconcile information from multiple sources, including background\nknowledge, into a single, more informative model is still an open problem.\nConstraint-based methods like the FCI-algorithm [1] are provably correct in the large sample limit,\nas are Bayesian methods like the greedy search algorithm GES [2] (with additional post-processing\nsteps to handle hidden confounders). Both are de\ufb01ned in terms of modeling a single data set and have\nno principled means to relate to results from other sources in the process. Recent developments, like\nthe ION-algorithm by Tillman et al. [3], have shown that it is possible to integrate multiple, partially\noverlapping data sets. 
However, such algorithms are still essentially single model learners in the sense that they assume there is one single encapsulating structure that accounts for all observed dependencies in the different models. In practice, observed dependencies often differ between data sets, precisely because the experimental circumstances were not identical in different experiments, even when the causal system at the heart of it was the same. The method we develop in this article shows how to distinguish between causal dependencies internal to the system under investigation and merely contextual dependencies.\nMani et al. [4] recognized the 'local' aspect of causal discovery from Y-structures embedded in data: it suffices to establish a certain (in)dependency pattern between variables, without having to uncover the entire graph. In section 4 we take this one step further by showing that such causal discovery can be decomposed into two separate steps: a conditional independency to identify a pair of possible causal relations (one of which is true), and then a conditional dependency that eliminates one option, leaving the other. The two steps rely only on local (marginal) aspects of the distribution. As a result the conclusion remains valid, even when, unlike causal inference from Y-structures, the two pieces of information are taken from different models. This forms the basis underpinning the MCI-algorithm in section 6.\nSection 2 of this article introduces some basic terminology. Section 3 models different experiments. Section 4 establishes the link between conditional independence and local causal relations, which is used in section 5 to combine multiple models into a single causal graph. Section 6 describes a practical implementation in the form of the MCI-algorithm. 
Sections 7 and 8 discuss experimental results and suggest possible extensions to other application areas.\n\n2 Graphical model preliminaries\n\nFirst a few familiar notions from graphical model theory used throughout the article. A directed graph G is a pair ⟨V, E⟩, where V is a set of vertices or nodes and E is a set of edges between pairs of nodes, represented by arrows X → Y. A path π = ⟨V0, . . . , Vn⟩ between V0 and Vn in G is a sequence of distinct vertices such that for 0 ≤ i ≤ n−1, Vi and Vi+1 are connected by an edge in G. A directed path is a path that is traversed entirely in the direction of the arrows. A directed acyclic graph (DAG) is a directed graph that does not contain a directed path from any node to itself. A vertex X is an ancestor of Y (and Y is a descendant of X) if there is a directed path from X to Y in G or if X = Y. A vertex Z is a collider on a path π = ⟨. . . , X, Z, Y, . . .⟩ if it contains the subpath X → Z ← Y, otherwise it is a noncollider. A trek is a path that does not contain any collider.\nFor disjoint (sets of) vertices X, Y and Z in a DAG G, X is d-connected to Y conditional on Z (possibly empty), iff there exists an unblocked path π = ⟨X, . . . , Y⟩ between X and Y given Z, i.e. such that every collider on π is an ancestor of some Z ∈ Z and every noncollider on π is not in Z. If not, then all such paths are blocked, and X is said to be d-separated from Y; see [5, 1] for details.\nDefinition 1. Two nodes X and Y are minimally conditionally independent given a set of nodes Z, denoted [X ⊥⊥ Y | Z], iff X is conditionally independent of Y given a minimal set of nodes Z. 
Here minimal, indicated by the square brackets, implies that the relation does not hold for any proper subset Z′ ⊊ Z of the (possibly empty) set Z.\nA causal DAG GC is a graphical model in the form of a DAG where the arrows represent direct causal interactions between variables in a system [6]. There is a causal relation X ⇒ Y, iff there is a directed path from X to Y in GC. Absence of such a path is denoted X ⇏ Y. The causal Markov condition links the structure of a causal graph to its probabilistic concomitant [5]: two variables X and Y in a causal DAG GC are dependent given a set of nodes Z, iff they are connected by a path π in GC that is unblocked given Z; so there is a dependence X ⊥⊥̸ Y iff there is a trek between X and Y in the causal DAG.\nWe assume that the systems we consider correspond to some underlying causal DAG over a great many observed and unobserved nodes. The distribution over the subset of observed variables can then be represented by a (maximal) ancestral graph (MAG) [7]. Different MAGs can represent the same distribution, but only the invariant features, common to all MAGs that can faithfully represent that distribution, carry identifiable causal information. The complete partial ancestral graph (CPAG) P that represents the equivalence class [G] of a MAG G is a graph with either a tail '−', arrowhead '>' or circle mark '◦' at each end of an edge, such that P has the same adjacencies as G, and there is a tail or arrowhead on an edge in P iff it is invariant in [G], otherwise it has a circle mark [8]. The CPAG of a given MAG is unique and maximally informative for [G]. 
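Definition 1 can be made concrete with a short script. The following is a minimal sketch of our own (not code from the paper): it tests d-separation via the standard ancestral moral graph criterion and searches for minimal separating sets [X ⊥⊥ Y | Z] by brute force over subsets; the toy DAG X → Z ← W, Z → Y is a hypothetical example.

```python
from itertools import combinations

# Toy causal DAG as parent lists (a hypothetical example of ours):
# X -> Z <- W and Z -> Y.
DAG = {"X": [], "W": [], "Z": ["X", "W"], "Y": ["Z"]}

def ancestors(dag, nodes):
    """All ancestors of `nodes`, including the nodes themselves."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in dag[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, x, y, zs):
    """Test X _||_ Y | Z via the ancestral moral graph criterion."""
    keep = ancestors(dag, {x, y} | set(zs))
    adj = {v: set() for v in keep}
    for v in keep:                        # moralize the ancestral subgraph:
        ps = [p for p in dag[v] if p in keep]
        for p in ps:                      # undirected parent-child edges
            adj[v].add(p)
            adj[p].add(v)
        for a, b in combinations(ps, 2):  # marry co-parents
            adj[a].add(b)
            adj[b].add(a)
    frontier, seen = [x], {x}             # search around the conditioning set
    while frontier:
        v = frontier.pop()
        if v == y:
            return False
        for w in adj[v]:
            if w not in seen and w not in zs:
                seen.add(w)
                frontier.append(w)
    return True

def minimal_separators(dag, x, y):
    """All minimal Z with [X _||_ Y | Z]: no proper subset separates."""
    others = [v for v in dag if v not in (x, y)]
    seps = [set(s) for r in range(len(others) + 1)
            for s in combinations(others, r) if d_separated(dag, x, y, s)]
    return [z for z in seps if not any(z2 < z for z2 in seps)]
```

On this example `minimal_separators(DAG, "X", "Y")` yields the single minimal set {Z}, while conditioning on the collider Z makes the marginally independent X and W dependent, the pattern exploited by rule (2) below.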
We use CPAGs as a concise\nand intuitive graphical representation of all conditional (in)dependence relations between nodes in an\nobserved distribution; see [7, 8] for more information on how to read independencies directly from\na MAG/CPAG using the m-separation criterion, which is essentially just the d-separation criterion,\nonly applied to MAGs.\nThroughout this article we also adopt the causal faithfulness assumption, which implies that all and\nonly the conditional independence relations entailed by the causal Markov condition applied to the\ntrue causal DAG will hold in the joint probability distribution over the variables in GC. For an\nin-depth discussion of the justi\ufb01cation and connection between these assumptions, see [9].\n\n2\n\n\f3 Modeling the system\n\nRandom variation in a system corresponds to the impact of unknown external variables, see [5].\nSome of these external factors may be actively controlled, e.g. in clinical trials, or passively ob-\nserved as the natural embedding of a system in its environment. We refer to both observational and\ncontrolled studies as experiments. External factors that affect two or more variables in a system\nsimultaneously, can lead to dependencies that are not part of the system. Different external factors\nmay bring about observed dependencies that differ between models, seemingly contradicting each\nother. By modeling this external environment explicitly as a set of unobserved (hypothetical) context\nnodes that causally affect the system under scrutiny we can account for this effect.\nDe\ufb01nition 2. The external context GE of a causal DAG GC is a set of independent nodes U in\ncombination with links from every U \u2208 U to one or more nodes in GC. The total causal structure of\nan experiment then becomes GT = {GE + GC}.\nFigure 1 depicts a causal system in three different experiments (double lined arrows indicate direct\ncausal relations; dashed circles represent unobserved variables). 
The second and third experiment will result in an observed dependency between variables A and B, whereas the first one will not. The context only introduces arrows from nodes in GE to GC, which can never result in a cycle; therefore the structure of an experiment GT is also a causal DAG. Note that differences in dependencies can only arise from different structures of the external context.\n\nFigure 1: A causal system GC in different experiments\n\nIn this paradigm different experiments become variations in context of a constant causal system. The goal of causal discovery from multiple models can then be stated as: “Given experiments with unknown total causal structures GT = {GE + GC}, G′T = {G′E + GC}, etc., and known joint probability distributions P(V ⊂ GT), P′(V′ ⊂ G′T), etc., which variables are connected by a directed path in GC?” We assume that the large sample limit distributions P(V) are known and can be used to obtain categorical statements about probabilistic (in)dependencies between sets of nodes. Finally, we will assume there is no selection bias, see [10], nor blocking interventions on GC, as accounting for the impact would unnecessarily complicate the exposition.\n\n4 Causal relations in arbitrary context\n\nA remarkable result that, to the best of our knowledge, has not been noted before, is that a minimal conditional independence always implies the presence of a causal relation. (See appendix for an outline of all proofs in this article.)\nTheorem 1. 
Let X, Y, Z and W be four disjoint (sets of) nodes (possibly empty) in an experiment with causal structure GT = {GE + GC}; then the following rules apply, for arbitrary GE:\n(1) a minimal conditional independence [X ⊥⊥ Y | Z] implies causal links Z ⇒ X and/or Z ⇒ Y from every Z ∈ Z to X and/or Y in GC,\n(2) a conditional dependence X ⊥⊥̸ Y | Z ∪ W induced by a node W, i.e. with X ⊥⊥ Y | Z, implies that there are no causal links W ⇒ X, W ⇒ Y or W ⇒ Z for any Z ∈ Z in GC,\n(3) a conditional independence X ⊥⊥ Y | Z implies the absence of (direct) causal paths X ⇒ Y or X ⇐ Y in GC between X and Y that are not mediated by nodes in Z.\nThe theorem establishes independence patterns that signify (absence of) a causal origin, independent of the (unobserved) external background. Rule (1) identifies a candidate pair of causal relations from a conditional independence. Rule (2) identifies the absence of causal paths from unshielded colliders in G, see also [1]. Rule (3) eliminates direct causal links between variables.\nThe final step towards causal discovery from multiple models now takes a surprisingly simple form:\nLemma 1. Let X, Y and Z ∈ Z be disjoint (sets of) variables in an experiment with causal structure GT = {GE + GC}; then if there exists both:\n− a minimal conditional independence [X ⊥⊥ Y | Z],\n− established absence of a causal path (Z ⇏ X),\nthen there is a causal link (directed path) Z ⇒ Y in GC.\nThe crucial observation is that these two pieces of information can be obtained from different models. In fact, the origin of the information Z ⇏ X is irrelevant: be it from (in)dependencies via rule (2), other properties of the distribution, e.g. non-Gaussianity [11] or nonlinear features [12], or existing background knowledge. 
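As a toy illustration of how lemma 1 combines facts from different models, consider the following sketch. It is our own hypothetical example: the (in)dependence statements are hardcoded oracle facts for a Y-structure X → Z ← W, Z → Y, not results derived from data.

```python
# Experiment 1 reports the minimal conditional independence [X _||_ Y | {Z}].
minimal_ci = ("X", "Y", frozenset({"Z"}))

# Experiment 2 reports X _||_ W but X dependent on W given {Z}: by rule (2)
# the collider node Z cannot be a cause of X or of W.
no_cause = {("Z", "X"), ("Z", "W")}

def lemma1(minimal_ci, no_cause):
    """Rule (1) gives Z => X and/or Z => Y for every Z in the separating set;
    eliminating one side with a known non-causal link leaves the other."""
    x, y, zs = minimal_ci
    found = set()
    for z in zs:
        if (z, x) in no_cause:
            found.add((z, y))        # Z => Y is the only remaining option
        elif (z, y) in no_cause:
            found.add((z, x))
    return found

print(lemma1(minimal_ci, no_cause))  # → {('Z', 'Y')}
```

The point of the lemma is that `minimal_ci` and `no_cause` may come from two different experiments, as long as the causal system GC is the same in both.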
The only prerequisite for bringing results from various sources together is that the causal system at the centre is invariant, i.e. that the causal structure GC remains the same across the different experiments GT, G′T, etc.\nThis result also shows why the well-known Y-structure (4 nodes with X → Z ← W and Z → Y, see [4]) always enables identification of the causal link Z ⇒ Y: it is simply lemma 1 applied to overlapping nodes in a single model, in the form of rule (1) for [X ⊥⊥ Y | Z], together with the dependency X ⊥⊥̸ W | Z created by Z to eliminate the option Z ⇒ X by rule (2).\n\n5 Multiple models\n\nIn this article we focus on combining multiple conditional independence models represented by CPAGs. We want to use these models to convey as much about the underlying causal structure GC as possible. We choose a causal CPAG as the target output model: similar in form and interpretation to a CPAG, where tails and arrowheads now represent all known (non)causal relations. This is not necessarily an equivalence class in accordance with the rules in [8], as it may contain more explicit information. Ingredients for extracting this information are the rules in theorem 1, in combination with the standard properties of causal relations: acyclicity (if X ⇒ Y then Y ⇏ X) and transitivity (if X ⇒ Y and Y ⇒ Z then X ⇒ Z). As the causal system is assumed invariant, the established (absence of) causal relations in one model are valid in all models.\nA straightforward brute-force implementation is given by Algorithm 1. The input is a set of CPAG models Pi, representing the conditional (in)dependence information between a set of observed variables, e.g. as learned by the extended FCI-algorithm [1, 8], from a number of different experiments G(i)T on an invariant causal system GC. 
The output is the single causal CPAG G over the union of all nodes in the input models Pi.\n\nInput: set of CPAGs Pi, fully ◦−◦ connected graph G\nOutput: causal graph G\n1: for all Pi do\n2:    G ← eliminate all edges not appearing between nodes in Pi    ▷ Rule (3)\n3:    G ← all definite (non)causal connections between nodes in Pi    ▷ invariant structure\n4: end for\n5: repeat\n6:    for all Pi do\n7:       for all {X, Y, Z, W} ∈ Pi do\n8:          G ← (Z ⇏ {X, Y, W}), if X ⊥⊥ Y | W and X ⊥⊥̸ Y | {W ∪ Z}    ▷ Rule (2)\n9:          G ← (Z ⇒ Y), if [X ⊥⊥ Y | {Z ∪ W}] and (Z ⇏ X) ∈ GC    ▷ Rule (1)\n10:       end for\n11:    end for\n12: until no more new non/causal information found\n\nAlgorithm 1: Brute force implementation of rules (1)-(3)\n\nFigure 2: Three different experiments, one causal model\n\nAs an example, consider the three CPAG models on the l.h.s. of figure 2. None of these identifies a causal relation, yet despite the different (in)dependence relations, it is easily verified that the algorithm terminates after two loops with the nearly complete causal CPAG on the r.h.s. as the final output. Figure 1 shows corresponding experiments that explain the observed dependencies above. To the best of our knowledge, Algorithm 1 is the first algorithm ever to perform such a derivation.\nNevertheless, this brute-force approach exhibits a number of serious shortcomings. In the first place, the computational complexity of the repeated loop over all subsets in line 7 makes it not scalable: for small models like the ones in figure 2 the derivation is almost immediate, but for larger models it quickly becomes unfeasible. Secondly, for sparsely overlapping models, i.e. 
when the observed variables differ substantially between the models, the algorithm can miss certain relations: when a causal relation is found to be absent between two non-adjacent nodes, then this information cannot be recorded in G, and subsequent causal information identifiable by rule (1) may be lost. These problems are addressed in the next section, resulting in the MCI-algorithm.\n\n6 The MCI-algorithm\n\nTo tackle the computational complexity we first introduce the following notion: a path ⟨X, . . . , Y⟩ in a CPAG is called a possibly directed path (or p.d. path) from X to Y, if it can be converted into a directed path by changing circle marks into appropriate tails and arrowheads [6]. We can now state:\nTheorem 2. Let X and Y be two variables in an experiment with causal structure GT = {GE + GC}, and let P[G] be the corresponding CPAG over a subset of observed nodes from GC. Then the absence of a causal link (X ⇏ Y) is detectable from the conditional (in)dependence structure in this experiment iff there exists no p.d. path from X to Y in P[G].\nIn other words: X cannot be a cause (ancestor) of Y iff all paths from X to Y in the graph P[G] go against an invariant arrowhead (signifying non-ancestorship). We refer to this as rule (4). Calculating which variables are connected by a p.d. path from a given CPAG is straightforward: turn the graph into a {0, 1} adjacency matrix by setting all arrowheads to zero and all tails and circle marks to one, and compute the resulting reachability matrix. As this will uncover all detectable 'non-causal' relations in a CPAG in one go, it needs to be done only once for each model, and can be aggregated into a matrix MC to make all tests for rule (2) in line 8 superfluous. 
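The reachability computation behind rule (4) is easy to sketch. Below is a minimal illustration with our own code and a hypothetical 4-node CPAG (A ◦−◦ B, B → C, C → D): each edge end mark is stored per direction, departure over an arrowhead is forbidden, and Warshall-style closure yields the reachability matrix.

```python
import numpy as np

nodes = ["A", "B", "C", "D"]
# Mark at the X-end of each edge X *-* Y ('>' arrowhead, '-' tail, 'o' circle);
# this CPAG is a hypothetical example: A o-o B,  B --> C,  C --> D.
end_marks = {
    ("A", "B"): "o", ("B", "A"): "o",
    ("B", "C"): "-", ("C", "B"): ">",
    ("C", "D"): "-", ("D", "C"): ">",
}

n = len(nodes)
idx = {v: i for i, v in enumerate(nodes)}
A = np.eye(n, dtype=bool)              # every node trivially reaches itself
for (x, y), mark in end_marks.items():
    if mark != ">":                    # an arrowhead at x blocks departure
        A[idx[x], idx[y]] = True

for k in range(n):                     # transitive closure (Warshall)
    A |= np.outer(A[:, k], A[k, :])

def no_causal_path(x, y):
    """Rule (4): X => Y is excluded iff there is no p.d. path from X to Y."""
    return not A[idx[x], idx[y]]
```

Here `no_causal_path("C", "A")` holds (every path from C to A starts against an arrowhead), while A ⇒ D cannot be excluded, since A ◦−◦ B could still be oriented A → B.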
If we also record all other established (non)causal relations in the matrix MC as the algorithm progresses, then indirect causal relations are no longer lost when they cannot be transferred to the output graph G. The next lemma propagates indirect (non)causal information from MC to edge marks in the graph:\nLemma 2. Let X, Y and Z be disjoint sets of variables in an experiment with causal structure GT = {GE + GC}; then for every [X ⊥⊥ Y | Z]:\n− every (indirect) causal relation X ⇒ Y implies causal links Z ⇒ Y,\n− every (indirect) absence of a causal relation (X ⇏ Y) implies absence of causal links X ⇒ Z, i.e. X ⇏ Z.\nThe first makes it possible to orient indirect causal chains, the second shortens indirect non-causal links. We refer to these as rules (5) and (6), respectively. As a final improvement it is worth noting that for rules (1), (5) and (6) it is only relevant to know that a node Z occurs in some Z in a minimal conditional independence relation [X ⊥⊥ Y | Z] separating X and Y, but not what the other nodes in Z are or in what model(s) it occurred. We can introduce a structure SCI to record all nodes Z that occur in some minimal conditional independency in one of the models Pi for each combination of nodes (X, Y), before any of the rules (1), (5) or (6) is processed. As a result, in the repeated causal inference loop no conditional independence / m-separation tests need to be performed at all.\n\nInput: set of CPAGs Pi, fully ◦−◦ connected graph G\nOutput: causal graph G, causal relations matrix MC\n1: MC ← 0    ▷ no causal relations\n2: for all Pi do\n3:    G ← eliminate all edges not appearing between nodes in Pi    ▷ Rule (3)\n4:    MC ← (X ⇏ Y), if no p.d. path ⟨X, . . . , Y⟩ ∈ Pi    ▷ Rule (4)\n5:    MC ← (X ⇒ Y), if causal path ⟨X ⇒ . . . ⇒ Y⟩ ∈ Pi    ▷ transitivity\n6:    for all (X, Y, Z) ∈ Pi do\n7:       SCI ← triple (X, Y, Z), if Z ∈ Z for which [X ⊥⊥ Y | Z]    ▷ combined SCI-matrix\n8:    end for\n9: end for\n10: repeat\n11:    for all (X, Y, Z) ∈ G do\n12:       MC ← (Z ⇒ Y), for unused (X, Y, Z) ∈ SCI with (Z ⇏ X) ∈ MC    ▷ Rule (1)\n13:       MC ← (Z ⇒ Y), for unused (X ⇒ Y) ∈ MC with (X, Y, Z) ∈ SCI    ▷ Rule (5)\n14:       MC ← (X ⇏ Z), for unused (X ⇏ Y) ∈ MC with (X, Y, Z) ∈ SCI    ▷ Rule (6)\n15:    end for\n16: until no more new causal information found\n17: G ← non/causal info in MC    ▷ tails/arrowheads\n\nAlgorithm 2: MCI algorithm\n\nWith these results we can now give an improved version of the brute-force approach: the Multiple model Causal Inference (MCI) algorithm, above. The input is still a set of CPAG models from different experiments, but the output is now twofold: the graph G, containing the causal structure uncovered for the underlying system GC, as well as the matrix MC with an explicit representation of all (non)causal relations between observed variables, including remaining indirect information that cannot be read from the graph G.\nThe first stage (lines 2-9) is a pre-processing step to extract all necessary information for the second stage from each of the models separately. Building the SCI matrix is the most expensive step as it involves testing for conditional independencies (m-separation) for increasing sets of variables. This can be efficiently implemented by noting that nodes connected by an edge will not be separated and that many other combinations will not have to be tested as they contain a subset for which a (minimal) conditional independency has already been established. 
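The repeated causal inference loop is essentially a fixpoint computation over the pre-processed structures. The sketch below is a loose, simplified rendering of ours (set-valued stand-ins for SCI and MC, hypothetical facts in place of the pre-processing output, and only one orientation of rule (1) shown):

```python
# S_CI holds triples (X, Y, Z) with Z a member of some minimal separating
# set for X and Y; `causes` / `no_cause` together play the role of MC.
# These facts are a hypothetical example, not taken from the paper's figures.
S_CI = {("X", "Y", "Z")}
causes = set()                  # established Z => Y links
no_cause = {("Z", "X")}         # established non-causal links, e.g. from rule (4)

changed = True
while changed:
    changed = False
    for (x, y, z) in S_CI:
        new = set()
        if (z, x) in no_cause:          # Rule (1) + Lemma 1 (one side shown)
            new.add((z, y))
        if (x, y) in causes:            # Rule (5): X => Y gives Z => Y
            new.add((z, y))
        if new - causes:
            causes |= new
            changed = True
        if (x, y) in no_cause and (x, z) not in no_cause:   # Rule (6)
            no_cause.add((x, z))
            changed = True
    for (a, b) in list(causes):         # transitivity of causal links
        for (c, d) in list(causes):
            if b == c and (a, d) not in causes:
                causes.add((a, d))
                changed = True
```

Starting from the single separating triple and the single non-causal fact, the loop derives Z ⇒ Y in one pass and then terminates, mirroring lines 10-16 of Algorithm 2 on a tiny input.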
If a (non)causal relation is found between adjacent variables in G, or one that can be used to infer other intermediate relations (lines 13-14), then it can be marked as 'processed' to avoid unnecessary checks. Similar for the entries recorded in the minimal conditional independence structure SCI.\nThe MCI algorithm is provably sound in the sense that if all input CPAG models Pi are valid, then all (absence of) causal relations identified by the algorithm in the output graph G and (non)causal relations matrix MC are also valid, provided that the causal system GC is an invariant causal DAG and the causal faithfulness assumption is satisfied.\n\n7 Experimental results\n\nWe tested the MCI-algorithm on a variety of synthetic data sets to verify its validity and assess its behaviour and performance in uncovering causal information from multiple models. For the generation of random causal DAGs we used a variant of [13] to control the distribution of edges over nodes in the network. The random experiments in each run were generated from this causal DAG by including a random context and hidden nodes. For each network the corresponding CPAG was computed, and together used as the set of input models for the MCI-algorithm. The generated output G and MC were verified against the true causal DAG and expressed as a percentage of the true number of (non-)causal relations.\nTo assess the performance we introduced two reference methods to act as a benchmark for the MCI-algorithm (in the absence of other algorithms that can validly handle different contexts). The first is a common sense method, indicated as 'sum-FCI', that utilizes the transitive closure of all causal relations in the input CPAGs that could have been identified by FCI in the large sample limit. As the second benchmark we take all causal information contained in the CPAG over the union of observed variables, independent of the context, hence 'nc-CPAG' for 'no context'. Note that this is not really a method as it uses information directly derived from the true causal graph.\n\nFigure 3: Proportion of causal relations discovered by the MCI-algorithm in different settings; (a) causal relations vs. nr. of models, (b) causal relations vs. nr. of context nodes, (c) non-causal relations (⇏) vs. nr. of models; (top) identical observed nodes in input models, (bottom) only partially overlapping observed nodes\n\nIn figure 3, each graph depicts the percentage of causal (a&b) or non-causal (c) relations uncovered by each of the three methods: MCI, sum-FCI and nc-CPAG, as a function of the number of input models (a&c) or the number of nodes in the context (b), averaged over 200 runs, for both identical (top) or only partly overlapping (bottom) observed nodes in the input models. Performance is calculated as the proportion of uncovered relations as compared to the actual number of non/causal relations in the true causal graph over the union of observed nodes in each model set. In these runs the underlying causal graphs contained 12 nodes with edge degree ≤ 5. Tests for other, much sparser/denser graphs up to 20 nodes showed comparable results.\nSome typical behaviour is easily recognized: MCI always outperforms sum-FCI, and more input models always improve performance. Also non-causal information (c) is much easier to find than definite causal relations (a&b). For single models / no context the performance of all three methods is very similar, although not necessarily identical. The perceived initial drop in performance in fig. 3(c, bottom) is only because in going to two models the number of non-causal relations in the union rises more quickly than the number of new relations that is actually found (due to lack of overlap). 
A striking result that is clearly brought out is that adding random context actually improves the detection rate of causal relations. The rationale behind this effect is that externally induced links can introduce conditional dependencies, allowing the deduction of non-causal links that are not otherwise detectable; these in turn may lead to other causal relations that can be inferred, and so on. If the context is expanded further, at some point the detection rate will start to deteriorate as the causal structure will be swamped by the externally induced links (b).\nWe want to stress that for the many tens of thousands of (non)causal relations identified by the MCI algorithm in all the runs, not a single one was found to be invalid on comparing with the true causal graph. For ≳ 8 nodes the algorithm spends the majority of its time building the SCI matrix in lines 6-8. The actual number of minimal conditional independencies found, however, is quite low, typically in the order of a few dozen for graphs of up to 12 nodes.\n\n8 Conclusion\n\nWe have shown the first
principled algorithm that can use results from different experiments to\nuncover new (non)causal information. It is provably sound in the large sample limit, provided the\ninput models are learned by a valid algorithm like the FCI algorithm with CPAG extension [8]. In its\ncurrent implementation the MCI-algorithm is a fast and practical method that can easily be applied to\nsets of models of up to 20 nodes. Compared to related algorithms like ION, it produces very concise\nand easily interpretable output, and does not suffer from the inability to handle any differences in\nobserved dependencies between data sets [3]. For larger models it can be converted into an anytime\nalgorithm by running over minimal conditional independencies from subsets of increasing size: at\neach level all uncovered causal information is valid, and, for reasonably sparse models, most will\nalready be found at low levels. For very large models an exciting possibility is to target only speci\ufb01c\ncausal relations: \ufb01nding the right combination of (in)dependencies is suf\ufb01cient to decide if it is\ncausal, even when there is no hope of deriving a global CPAG model.\nFrom the construction of the MCI-algorithm it is sound, but not necessarily complete. Preliminary\nresults show that theorem 1 already covers all invariant arrowheads in the single model case [8], and\nsuggest that an additional rule is suf\ufb01cient to cover all tails as well. We aim to extend this result to\nthe multiple model domain. Integrating our approach with recent developments in causal discovery\nthat are not based on independence constraints [11, 12] can provide for even more detectable causal\ninformation. When applied to real data sets the large sample limit no longer applies and inconsistent\ncausal relations may result. It should be possible to exclude the contribution of such links on the\n\ufb01nal output. 
Alternatively, output might be generalized to quantities like 'the probability of a causal relation' based on the strength of appropriate conditional (in)dependencies in the available data.\nAcknowledgement This research was supported by VICI grant 639.023.604 from the Netherlands Organization for Scientific Research (NWO).\n\nAppendix - proof outlines\n\nTheorem 1 / Lemma 1 (for details see also [14])\n(1) Without selection bias, nodes X and Y are dependent iff they are connected by treks in GT. A node Z ∈ Z that blocks such a trek has a directed path in GC to X and/or Y, but can unblock other paths. These paths contain a trek between Z and X/Y, and must be blocked by a node Z′ ∈ Z\{Z}, which therefore also has a causal link to X or Y (possibly via Z). Z′ in turn can unblock other paths, etc. Minimality guarantees that it holds for all nodes in Z. Eliminating the option Z ⇒ X then leaves Z ⇒ Y as the only one (lemma 1).\n(2) To create the dependency, W must be a (descendant of a) collider between two unblocked paths π1 = ⟨X, . . . , W⟩ and π2 = ⟨Y, . . . , W⟩ given Z. Any directed path from W to a Z ∈ Z implies that conditioning on W is not needed when already conditioning on Z. In combination with π1 or π2, a directed path from W to X or Y in GC would make W a noncollider on an unblocked path between X and Y given Z, contrary to X ⊥⊥ Y | Z.\n(3) A directed path between X and Y that is not blocked by Z would result in X ⊥⊥̸ Y | Z, see [1].\nTheorem 2\n'⇐' follows from the fact that a directed path π = ⟨X, . . . , Y⟩ in the underlying causal DAG GC implies existence of a directed path in the true MAG over the observed nodes and therefore at least the existence of a p.d. 
path in the CPAG P[G].
'⇒' follows from the completeness of the CPAG in combination with theorem 2 in [8] on the orientability of CPAGs into MAGs. This, together with Meek's algorithm [15] for orienting chordal graphs into DAGs with no unshielded colliders, shows that it is always possible to turn a p.d. path into a directed path in a MAG that is a member of the equivalence class P[G]. Therefore, a p.d. path from X to Y in P[G] implies that there is at least some underlying causal DAG in which it is a causal path, and so it cannot correspond to a valid, detectable absence of a causal link. □

Lemma 2
From rule (1) in Theorem 1, [X ⊥⊥ Y | Z] implies causal links Z ⇒ X and/or Z ⇒ Y. If X ⇒ Y, then by transitivity Z ⇒ X also implies Z ⇒ Y. If X ⇏ Y, then for any Z ∈ Z, X ⇒ Z implies Z ⇒ Y and so (by transitivity) also X ⇒ Y, contradicting the assumption; therefore X ⇏ Z. □

References
[1] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search. Cambridge, Massachusetts: The MIT Press, 2nd ed., 2000.
[2] D. Chickering, "Optimal structure identification with greedy search," Journal of Machine Learning Research, vol. 3, pp. 507–554, 2002.
[3] R. Tillman, D. Danks, and C. Glymour, "Integrating locally learned causal structures with overlapping variables," in Advances in Neural Information Processing Systems 21, 2008.
[4] S. Mani, G. Cooper, and P. Spirtes, "A theoretical study of Y structures for causal discovery," in Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pp. 314–323, 2006.
[5] J. Pearl, Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
[6] J. Zhang, "Causal reasoning with ancestral graphs," Journal of Machine Learning Research, vol. 9, pp. 1437–1474, 2008.
[7] T.
Richardson and P. Spirtes, "Ancestral graph Markov models," Annals of Statistics, vol. 30, no. 4, pp. 962–1030, 2002.
[8] J. Zhang, "On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias," Artificial Intelligence, vol. 172, no. 16–17, pp. 1873–1896, 2008.
[9] J. Zhang and P. Spirtes, "Detection of unfaithfulness and robust causal inference," Minds and Machines, vol. 18, no. 2, pp. 239–271, 2008.
[10] P. Spirtes, C. Meek, and T. Richardson, "An algorithm for causal inference in the presence of latent variables and selection bias," in Computation, Causation, and Discovery, pp. 211–252, 1999.
[11] S. Shimizu, P. Hoyer, A. Hyvärinen, and A. Kerminen, "A linear non-Gaussian acyclic model for causal discovery," Journal of Machine Learning Research, vol. 7, pp. 2003–2030, 2006.
[12] P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf, "Nonlinear causal discovery with additive noise models," in Advances in Neural Information Processing Systems 21 (NIPS*2008), pp. 689–696, 2009.
[13] J. Ide and F. Cozman, "Random generation of Bayesian networks," in Advances in Artificial Intelligence, pp. 366–376, Springer Berlin, 2002.
[14] T. Claassen and T. Heskes, "Learning causal network structure from multiple (in)dependence models," in Proceedings of the Fifth European Workshop on Probabilistic Graphical Models, 2010.
[15] C. Meek, "Causal inference and causal explanation with background knowledge," in UAI, pp. 403–410, Morgan Kaufmann, 1995.