{"title": "Ancestral Causal Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 4466, "page_last": 4474, "abstract": "Constraint-based causal discovery from limited data is a notoriously difficult challenge due to the many borderline independence test decisions.  Several approaches to improve the reliability of the predictions by exploiting redundancy in the independence information have been proposed recently. Though promising, existing approaches can still be greatly improved in terms of accuracy and scalability. We present a novel method that reduces the combinatorial explosion of the search space by using a more coarse-grained representation of causal information, drastically reducing computation time. Additionally, we propose a method to score causal predictions based on their confidence. Crucially, our implementation also allows one to easily combine observational and interventional data and to incorporate various types of available background knowledge.  We prove soundness and asymptotic consistency of our method and demonstrate that it can outperform the state-of-the-art on synthetic data, achieving a speedup of several orders of magnitude. We illustrate its practical feasibility by applying it on a challenging protein data set.", "full_text": "Ancestral Causal Inference\n\nSara Magliacane\n\nVU Amsterdam & University of Amsterdam\n\nsara.magliacane@gmail.com\n\nTom Claassen\n\nRadboud University Nijmegen\n\ntomc@cs.ru.nl\n\nJoris M. Mooij\n\nUniversity of Amsterdam\n\nj.m.mooij@uva.nl\n\nAbstract\n\nConstraint-based causal discovery from limited data is a notoriously dif\ufb01cult chal-\nlenge due to the many borderline independence test decisions. Several approaches\nto improve the reliability of the predictions by exploiting redundancy in the inde-\npendence information have been proposed recently. Though promising, existing\napproaches can still be greatly improved in terms of accuracy and scalability. 
We\npresent a novel method that reduces the combinatorial explosion of the search space\nby using a more coarse-grained representation of causal information, drastically\nreducing computation time. Additionally, we propose a method to score causal predictions\nbased on their confidence. Crucially, our implementation also allows one\nto easily combine observational and interventional data and to incorporate various\ntypes of available background knowledge. We prove soundness and asymptotic\nconsistency of our method and demonstrate that it can outperform the state-of-the-art\non synthetic data, achieving a speedup of several orders of magnitude. We\nillustrate its practical feasibility by applying it to a challenging protein data set.\n\n1 Introduction\n\nDiscovering causal relations from data is at the foundation of the scientific method. Traditionally,\ncause-effect relations have been recovered from experimental data in which the variable of interest is\nperturbed, but seminal work like the do-calculus [16] and the PC/FCI algorithms [23, 26] demonstrates\nthat, under certain assumptions (e.g., the well-known Causal Markov and Faithfulness assumptions\n[23]), it is already possible to obtain substantial causal information by using only observational data.\nRecently, there have been several proposals for combining observational and experimental data to\ndiscover causal relations. These causal discovery methods are usually divided into two categories:\nconstraint-based and score-based methods. Score-based methods typically evaluate models using a\npenalized likelihood score, while constraint-based methods use statistical independences to express\nconstraints over possible causal models. The advantages of constraint-based over score-based methods\nare the ability to handle latent confounders and selection bias naturally, and that there is no need\nfor parametric modeling assumptions. 
Additionally, constraint-based methods expressed in logic\n[2, 3, 25, 8] allow for an easy integration of background knowledge, which is not trivial even for\nsimple cases in approaches that are not based on logic [1].\nTwo major disadvantages of traditional constraint-based methods are: (i) vulnerability to errors\nin statistical independence test results, which are quite common in real-world applications, (ii) no\nranking or estimation of the confidence in the causal predictions. Several approaches address the\nfirst issue and improve the reliability of constraint-based methods by exploiting redundancy in the\nindependence information [3, 8, 25]. The idea is to assign weights to the input statements that reflect\ntheir reliability, and then use a reasoning scheme that takes these weights into account. Several\nweighting schemes can be defined, from simple ways to attach weights to single independence\nstatements [8], to more complicated schemes to obtain weights for combinations of independence\nstatements [25, 3]. Unfortunately, these approaches have to sacrifice either accuracy by using a greedy\nmethod [3, 25], or scalability by formulating a discrete optimization problem on a super-exponentially\nlarge search space [8]. Additionally, the confidence estimation issue is addressed only in limited\ncases [17].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe propose Ancestral Causal Inference (ACI), a logic-based method that provides comparable\naccuracy to the best state-of-the-art constraint-based methods (e.g., [8]) for causal systems with\nlatent variables without feedback, but improves on their scalability by using a more coarse-grained\nrepresentation of causal information. 
Instead of representing all possible direct causal relations, in\nACI we represent and reason only with ancestral relations (\u201cindirect\u201d causal relations), developing\nspecialised ancestral reasoning rules. This representation, though still super-exponentially large,\ndrastically reduces computation time. Moreover, it turns out to be very convenient, because in\nreal-world applications the distinction between direct causal relations and ancestral relations is not\nalways clear or necessary. Given the estimated ancestral relations, the estimation can be re\ufb01ned to\ndirect causal relations by constraining standard methods to a smaller search space, if necessary.\nFurthermore, we propose a method to score predictions according to their con\ufb01dence. The con\ufb01dence\nscore can be thought of as an approximation to the marginal probability of an ancestral relation.\nScoring predictions enables one to rank them according to their reliability, allowing for higher\naccuracy. This is very important for practical applications, as the low reliability of the predictions of\nconstraint-based methods has been a major impediment to their wide-spread use.\nWe prove soundness and asymptotic consistency under mild conditions on the statistical tests for ACI\nand our scoring method. We show that ACI outperforms standard methods, like bootstrapped FCI\nand CFCI, in terms of accuracy, and achieves a speedup of several orders of magnitude over [8] on a\nsynthetic dataset. We illustrate its practical feasibility by applying it to a challenging protein data set\n[21] that so far had only been addressed with score-based methods and observe that it successfully\nrecovers from faithfulness violations. 
In this context, we showcase the flexibility of logic-based\napproaches by introducing weighted ancestral relation constraints that we obtain from a combination\nof observational and interventional data, and show that they substantially increase the reliability of\nthe predictions. Finally, we provide an open-source version of our algorithms and the evaluation\nframework, which can be easily extended, at http://github.com/caus-am/aci.\n\n2 Preliminaries and related work\n\nPreliminaries We assume that the data generating process can be modeled by a causal Directed\nAcyclic Graph (DAG) that may contain latent variables. For simplicity we also assume that there is\nno selection bias. Finally, we assume that the Causal Markov Assumption and the Causal Faithfulness\nAssumption [23] both hold. In other words, the conditional independences in the observational\ndistribution correspond one-to-one with the d-separations in the causal DAG. Throughout the paper\nwe represent variables with uppercase letters, while sets of variables are denoted by boldface. All\nproofs are provided in the Supplementary Material.\nA directed edge X \u2192 Y in the causal DAG represents a direct causal relation between cause X and\neffect Y. Intuitively, in this framework this indicates that manipulating X will produce a change in\nY, while manipulating Y will have no effect on X. A more detailed discussion can be found in [23].\nA sequence of directed edges X_1 \u2192 X_2 \u2192 \u00b7\u00b7\u00b7 \u2192 X_n is a directed path. If there exists a directed\npath from X to Y (or X = Y), then X is an ancestor of Y (denoted as X \u21e2 Y). Otherwise, X is\nnot an ancestor of Y (denoted as X \u21e2\u0338 Y). 
For a set of variables W, we write:\n\nX \u21e2 W := \u2203Y \u2208 W : X \u21e2 Y,\nX \u21e2\u0338 W := \u2200Y \u2208 W : X \u21e2\u0338 Y. (1)\n\nWe define an ancestral structure as any non-strict partial order on the observed variables of the DAG,\ni.e., any relation that satisfies the following axioms:\n\n(reflexivity): X \u21e2 X, (2)\n(transitivity): X \u21e2 Y \u2227 Y \u21e2 Z \u21d2 X \u21e2 Z, (3)\n(antisymmetry): X \u21e2 Y \u2227 Y \u21e2 X \u21d2 X = Y. (4)\n\nThe underlying causal DAG induces a unique \u201ctrue\u201d ancestral structure, which represents the transitive\nclosure of the direct causal relations projected on the observed variables.\nFor disjoint sets X, Y, W we denote conditional independence of X and Y given W as X \u22a5\u22a5 Y | W,\nand conditional dependence as X \u22a5\u0338\u22a5 Y | W. We call the cardinality |W| the order of the\nconditional (in)dependence relation. Following [2] we define a minimal conditional independence by:\n\nX \u22a5\u22a5 Y | W \u222a [Z] := (X \u22a5\u22a5 Y | W \u222a Z) \u2227 (X \u22a5\u0338\u22a5 Y | W),\n\nand similarly, a minimal conditional dependence by:\n\nX \u22a5\u0338\u22a5 Y | W \u222a [Z] := (X \u22a5\u0338\u22a5 Y | W \u222a Z) \u2227 (X \u22a5\u22a5 Y | W).\n\nThe square brackets indicate that Z is needed for the (in)dependence to hold in the context of W. Note\nthat the negation of a minimal conditional independence is not a minimal conditional dependence.\nMinimal conditional (in)dependences are closely related to ancestral relations, as pointed out in [2]:\nLemma 1. 
For disjoint (sets of) variables X, Y, Z, W:\n\nX \u22a5\u22a5 Y | W \u222a [Z] \u21d2 Z \u21e2 ({X, Y} \u222a W), (5)\nX \u22a5\u0338\u22a5 Y | W \u222a [Z] \u21d2 Z \u21e2\u0338 ({X, Y} \u222a W). (6)\n\nExploiting these rules (as well as others that will be introduced in Section 3) to deduce ancestral\nrelations directly from (in)dependences is key to the greatly improved scalability of our method.\n\nRelated work on conflict resolution One of the earliest algorithms to deal with conflicting inputs\nin constraint-based causal discovery is Conservative PC [18], which adds \u201credundant\u201d checks to the\nPC algorithm that allow it to detect inconsistencies in the inputs, and then makes only predictions that\ndo not rely on the ambiguous inputs. The same idea can be applied to FCI, yielding Conservative FCI\n(CFCI) [4, 10]. BCCD (Bayesian Constraint-based Causal Discovery) [3] uses Bayesian confidence\nestimates to process information in decreasing order of reliability, discarding contradictory inputs as\nthey arise. COmbINE (Causal discovery from Overlapping INtErventions) [25] is an algorithm that\ncombines the output of FCI on several overlapping observational and experimental datasets into a\nsingle causal model by first pooling and recalibrating the independence test p-values, and then adding\neach constraint incrementally in order of reliability to a SAT instance. Any constraint that makes the\nproblem unsatisfiable is discarded.\nOur approach is inspired by a method presented by Hyttinen, Eberhardt and J\u00e4rvisalo [8] (that\nwe will refer to as HEJ in this paper), in which causal discovery is formulated as a constrained\ndiscrete minimization problem. 
Given a list of weighted independence statements, HEJ searches\nfor the optimal causal graph G (an acyclic directed mixed graph, or ADMG) that minimizes the\nsum of the weights of the independence statements that are violated according to G. In order to\ntest whether a causal graph G induces a certain independence, the method creates an encoding DAG\nof d-connection graphs. D-connection graphs are graphs that can be obtained from a causal graph\nthrough a series of operations (conditioning, marginalization and interventions). An encoding DAG\nof d-connection graphs is a complex structure encoding all possible d-connection graphs and the\nsequence of operations that generated them from a given causal graph. This approach has been shown\nto correct errors in the inputs, but is computationally demanding because of the huge search space.\n\n3 ACI: Ancestral Causal Inference\n\nWe propose Ancestral Causal Inference (ACI), a causal discovery method that accurately reconstructs\nancestral structures, also in the presence of latent variables and statistical errors. ACI builds on HEJ\n[8], but rather than optimizing over encoding DAGs, ACI optimizes over the much simpler (but still\nvery expressive) ancestral structures.\nFor n variables, the number of possible ancestral structures is the number of partial orders\n(http://oeis.org/A001035), which grows as 2^(n^2/4 + o(n^2)) [11], while the number of DAGs can be\ncomputed with a well-known super-exponential recurrence formula (http://oeis.org/A003024).\nThe number of ADMGs is |DAG(n)| \u00d7 2^(n(n\u22121)/2). Although still super-exponential, the number of\nancestral structures grows asymptotically much more slowly than the number of DAGs and, even more so,\nthan the number of ADMGs. 
For example, for 7 variables, there are 6 \u00d7 10^6 ancestral structures but already 2.3 \u00d7 10^15\nADMGs, whose number is a lower bound on the number of encoding DAGs of d-connection graphs used by HEJ.\n\nNew rules The rules in HEJ explicitly encode marginalization and conditioning operations on\nd-connection graphs, so they cannot be easily adapted to work directly with ancestral relations.\nInstead, ACI encodes the ancestral reasoning rules (2)\u2013(6) and five novel causal reasoning rules:\nLemma 2. For disjoint (sets of) variables X, Y, U, Z, W:\n\n(X \u22a5\u22a5 Y | Z) \u2227 (X \u21e2\u0338 Z) \u21d2 X \u21e2\u0338 Y, (7)\nX \u22a5\u0338\u22a5 Y | W \u222a [Z] \u21d2 X \u22a5\u0338\u22a5 Z | W, (8)\nX \u22a5\u22a5 Y | W \u222a [Z] \u21d2 X \u22a5\u0338\u22a5 Z | W, (9)\n(X \u22a5\u22a5 Y | W \u222a [Z]) \u2227 (X \u22a5\u22a5 Z | W \u222a U) \u21d2 (X \u22a5\u22a5 Y | W \u222a U), (10)\n(Z \u22a5\u0338\u22a5 X | W) \u2227 (Z \u22a5\u0338\u22a5 Y | W) \u2227 (X \u22a5\u22a5 Y | W) \u21d2 X \u22a5\u0338\u22a5 Y | W \u222a Z. (11)\n\nWe prove the soundness of the rules in the Supplementary Material. We elaborate on some conjectures\nabout their completeness in the discussion after Theorem 1 in the next Section.\n\nOptimization of loss function We formulate causal discovery as an optimization problem where\na loss function is optimized over possible causal structures. Intuitively, the loss function sums the\nweights of all the inputs that are violated in a candidate causal structure.\nGiven a list I of weighted input statements (i_j, w_j), where i_j is the input statement and w_j is the\nassociated weight, we define the loss function as the sum of the weights of the input statements that\nare not satisfied in a given possible structure W \u2208 W, where W denotes the set of all possible causal\nstructures. 
Causal discovery is formulated as a discrete optimization problem:\n\nW* = arg min_{W \u2208 W} L(W; I), (12)\n\nL(W; I) := \u2211_{(i_j, w_j) \u2208 I : W \u222a R |= \u00aci_j} w_j, (13)\n\nwhere W \u222a R |= \u00aci_j means that input i_j is not satisfied in structure W according to the rules R.\nThis general formulation includes both HEJ and ACI, which differ in the types of possible structures\nW and the rules R. In HEJ W represents all possible causal graphs (specifically, acyclic directed\nmixed graphs, or ADMGs, in the acyclic case) and R are operations on d-connection graphs. In ACI\nW represents ancestral structures (defined with the rules (2)\u2013(4)) and the rules R are rules (5)\u2013(11).\n\nConstrained optimization in ASP The constrained optimization problem in (12) can be implemented\nusing a variety of methods. Given the complexity of the rules, a formulation in an expressive\nlogical language that supports optimization, e.g., Answer Set Programming (ASP), is very convenient.\nASP is a widely used declarative programming language based on the stable model semantics [12, 7]\nthat has successfully been applied to several NP-hard problems. For ACI we use the state-of-the-art\nASP solver clingo 4 [6]. We provide the encoding in the Supplementary Material.\n\nWeighting schemes ACI supports two types of input statements: conditional independences and\nancestral relations. These statements can each be assigned a weight that reflects their confidence. We\npropose two simple approaches with the desirable properties of making ACI asymptotically consistent\nunder mild assumptions (as described at the end of Section 4), and assigning a much smaller weight\nto independences than to dependences (which agrees with the intuition that one is confident about a\nmeasured strong dependence, but not about independence vs. weak dependence). 
The approaches are:\n\u2022 a frequentist approach, in which for any appropriate frequentist statistical test with independence\nas null hypothesis (resp. a non-ancestral relation), we define the weight:\n\nw = |log p \u2212 log \u03b1|, where p = p-value of the test, \u03b1 = significance level (e.g., 5%); (14)\n\n\u2022 a Bayesian approach, in which the weight of each input statement i using data set D is:\n\nw = log [p(i|D) / p(\u00aci|D)] = log [(p(D|i) / p(D|\u00aci)) \u00b7 (p(i) / p(\u00aci))], (15)\n\nwhere the prior probability p(i) can be used as a tuning parameter.\n\nGiven observational and interventional data, in which each intervention has a single known target (in\nparticular, it is not a fat-hand intervention [5]), a simple way to obtain a weighted ancestral statement\nX \u21e2 Y is with a two-sample test that tests whether the distribution of Y changes with respect to\nits observational distribution when intervening on X. This approach conveniently applies to various\ntypes of interventions: perfect interventions [16], soft interventions [14], mechanism changes [24],\nand activity interventions [15]. The two-sample test can also be implemented as an independence test\nthat tests for the independence of Y and I_X, the indicator variable that has value 0 for observational\nsamples and 1 for samples from the interventional distribution in which X has been intervened upon.\n\n4 Scoring causal predictions\n\nThe constrained minimization in (12) may produce several optimal solutions, because the underlying\nstructure may not be identifiable from the inputs. 
To address this issue, we propose to use the loss\nfunction (13) and score the confidence of a feature f (e.g., an ancestral relation X \u21e2 Y) as:\n\nC(f) = min_{W \u2208 W} L(W; I \u222a {(\u00acf, \u221e)}) \u2212 min_{W \u2208 W} L(W; I \u222a {(f, \u221e)}). (16)\n\nWithout going into details here, we note that the confidence (16) can be interpreted as a MAP\napproximation of the log-odds ratio of the probability that feature f is true in a Markov Logic model:\n\nP(f | I, R) / P(\u00acf | I, R) = [\u2211_{W \u2208 W} e^(\u2212L(W; I)) 1_{W \u222a R |= f}] / [\u2211_{W \u2208 W} e^(\u2212L(W; I)) 1_{W \u222a R |= \u00acf}]\n\u2248 [max_{W \u2208 W} e^(\u2212L(W; I \u222a {(f, \u221e)}))] / [max_{W \u2208 W} e^(\u2212L(W; I \u222a {(\u00acf, \u221e)}))] = e^(C(f)).\n\nIn this paper, we usually consider the features f to be ancestral relations, but the idea is more generally\napplicable. For example, combined with HEJ it can be used to score direct causal relations.\n\nSoundness and completeness Our scoring method is sound for oracle inputs:\nTheorem 1. Let R be sound (not necessarily complete) causal reasoning rules. For any feature f,\nthe confidence score C(f) of (16) is sound for oracle inputs with infinite weights.\nHere, soundness means that C(f) = \u221e if f is identifiable from the inputs, C(f) = \u2212\u221e if \u00acf\nis identifiable from the inputs, and C(f) = 0 otherwise (neither is identifiable). As features, we\ncan consider for example ancestral relations f = X \u21e2 Y for variables X, Y. We conjecture that\nthe rules (2)\u2013(11) are \u201corder-1-complete\u201d, i.e., they allow one to deduce all (non)ancestral relations\nthat are identifiable from oracle conditional independences of order \u2264 1 in observational data. For\nhigher-order inputs additional rules can be derived. However, our primary interest in this work is\nimproving computation time and accuracy, and we are willing to sacrifice completeness. 
A more\ndetailed study of the completeness properties is left as future work.\n\nAsymptotic consistency Denote the number of samples by N. For the frequentist weights in (14),\nwe assume that the statistical tests are consistent in the following sense:\n\nlog p_N \u2212 log \u03b1_N \u2192 { \u2212\u221e under H1; +\u221e under H0 } in probability, (17)\n\nas N \u2192 \u221e, where the null hypothesis H0 is independence/nonancestral relation and the alternative\nhypothesis H1 is dependence/ancestral relation. Note that we need to choose a sample-size dependent\nthreshold \u03b1_N such that \u03b1_N \u2192 0 at a suitable rate. Kalisch and B\u00fchlmann [9] show how this can be\ndone for partial correlation tests under the assumption that the distribution is multivariate Gaussian.\nFor the Bayesian weighting scheme in (15), we assume that for N \u2192 \u221e,\n\nw_N \u2192 { +\u221e if i is true; \u2212\u221e if i is false } in probability. (18)\n\nThis will hold (as long as there is no model misspecification) under mild technical conditions for\nfinite-dimensional exponential family models. In both cases, the probability of a type I or type II\nerror will converge to 0, and in addition, the magnitude of the corresponding weight will converge to \u221e.\nTheorem 2. Let R be sound (not necessarily complete) causal reasoning rules. 
For any feature f,\nthe confidence score C(f) of (16) is asymptotically consistent under assumption (17) or (18).\nHere, \u201casymptotically consistent\u201d means that the confidence score C(f) \u2192 \u221e in probability if f is\nidentifiably true, C(f) \u2192 \u2212\u221e in probability if f is identifiably false, and C(f) \u2192 0 in probability\notherwise.\n\n(a) Average execution time (s)\n\nn  c  ACI     HEJ     BAFCI  BACFCI\n6  1  0.21    12.09   8.39   12.51\n6  4  1.66    432.67  11.10  16.36\n7  1  1.03    715.74  9.37   15.12\n8  1  9.74    \u2265 2500  13.71  21.71\n9  1  146.66  \u226b 2500  18.28  28.51\n\nFigure 1: Execution time comparison on synthetic data for the frequentist test on 2000 synthetic\nmodels: (a) average execution time for different combinations of number of variables n and max.\norder c; (b) detailed plot of execution times for n = 7, c = 1 (logarithmic scale).\n\n5 Evaluation\n\nIn this section we report evaluations on synthetically generated data and an application on a real\ndataset. Crucially, in causal discovery precision is often more important than recall. In many real-world\napplications, discovering a few high-confidence causal relations is more useful than finding\nevery possible causal relation, as reflected in recently proposed algorithms, e.g., [17].\n\nCompared methods We compare the predictions of ACI and of the acyclic causally insufficient\nversion of HEJ [8], when used in combination with our scoring method (16). We also evaluate two\nstandard methods: Anytime FCI [22, 26] and Anytime CFCI [4], as implemented in the pcalg R\npackage [10]. We use the anytime versions of (C)FCI because they allow for independence test\nresults up to a certain order. We obtain the ancestral relations from the output PAG using Theorem\n3.1 from [20]. 
(Anytime) FCI and CFCI do not rank their predictions, but only predict the type of\nrelation: ancestral (which we convert to +1), non-ancestral (-1) and unknown (0). To get a scoring of\nthe predictions, we also compare with bootstrapped versions of Anytime FCI and Anytime CFCI.\nWe perform the bootstrap by repeating the following procedure 100 times: sample randomly half\nof the data, perform the independence tests, run Anytime (C)FCI. From the 100 output PAGs we\nextract the ancestral predictions and average them. We refer to these methods as BA(C)FCI. For a\nfair comparison, we use the same independence tests and thresholds for all methods.\n\nSynthetic data We simulate the data using the simulator from HEJ [8]: for each experimental\ncondition (e.g., a given number of variables n and order c), we generate randomly M linear acyclic\nmodels with latent variables and Gaussian noise and sample N = 500 data points. We then perform\nindependence tests up to order c and weight the (in)dependence statements using the weighting\nschemes described in Section 3. For the frequentist weights we use tests based on partial correlations\nand Fisher\u2019s z-transform to obtain approximate p-values (see, e.g., [9]) with signi\ufb01cance level\n\u03b1 = 0.05. For the Bayesian weights, we use the Bayesian test for conditional independence presented\nin [13] as implemented by HEJ with a prior probability of 0.1 for independence.\nIn Figure 1(a) we show the average execution times on a single core of a 2.80GHz CPU for different\ncombinations of n and c, while in Figure 1(b) we show the execution times for n = 7, c = 1, sorting\nthe execution times in ascending order. For 7 variables ACI is almost 3 orders of magnitude faster\nthan HEJ, and the difference grows exponentially as n increases. For 8 variables HEJ can complete\nonly four of the \ufb01rst 40 simulated models before the timeout of 2500s. 
For reference we add the\nexecution time for bootstrapped anytime FCI and CFCI.\nIn Figure 2 we show the accuracy of the predictions with precision-recall (PR) curves for both\nancestral (X \u21e2 Y) and nonancestral (X \u21e2\u0338 Y) relations, in different settings. In this figure, for\nACI and HEJ all of the results are computed using frequentist weights and, as in all evaluations, our\nscoring method (16). While for these two methods we use c = 1, for (bootstrapped) (C)FCI we use\nall possible independence test results (c = n \u2212 2). In this case, the anytime versions of FCI and CFCI\nare equivalent to the standard versions of FCI and CFCI. Since the overall results are similar, we\nreport the results with the Bayesian weights in the Supplementary Material.\n\n[Figure 1(b) plot: execution time (s), logarithmic scale, against instances sorted by solution time, for HEJ and ACI.]\n\n(a) PR ancestral: n=6\n(b) PR ancestral: n=6 (zoom)\n(c) PR nonancestral: n=6\n(d) PR ancestral: n=8\n(e) PR ancestral: n=8 (zoom)\n(f) PR nonancestral: n=8\n\nFigure 2: Accuracy on synthetic data for the two prediction tasks (ancestral and nonancestral relations)\nusing the frequentist test with \u03b1 = 0.05. The left column shows the precision-recall curve for ancestral\npredictions, the middle column shows a zoomed-in version in the interval (0,0.02), while the right\ncolumn shows the nonancestral predictions.\n\nIn the first row of Figure 2, we show the setting with n = 6 variables. The performances of HEJ\nand ACI coincide, performing significantly better for nonancestral predictions and the top ancestral\npredictions (see zoomed-in version in Figure 2(b)). This is remarkable, as HEJ and ACI use only\nindependence test results up to order c = 1, in contrast with (C)FCI which uses independence test\nresults of all orders. 
Interestingly, the two discrete optimization algorithms do not seem to benefit\nmuch from higher order independence tests, so we omit those results from the plots (although we add the\ngraphs in the Supplementary Material). Instead, bootstrapping traditional methods, oblivious to the\n(in)dependence weights, seems to produce surprisingly good results. Nevertheless, both ACI and HEJ\noutperform bootstrapped FCI and CFCI, suggesting these methods achieve nontrivial error-correction.\nIn the second row of Figure 2, we show the setting with 8 variables. In this setting HEJ is too slow. In\naddition to the previous plot, we plot the accuracy of ACI when there is oracle background knowledge\non the descendants of one variable (i = 1). This setting simulates the effect of using interventional\ndata, and we can see that the performance of ACI improves significantly, especially in the ancestral\npredictions. The performance of (bootstrapped) FCI and CFCI is limited by the fact that they cannot\ntake advantage of this background knowledge, except with complicated postprocessing [1].\n\nApplication on real data We consider the challenging task of reconstructing a signalling network\nfrom flow cytometry data [21] under different experimental conditions. Here we consider one\nexperimental condition as the observational setting and seven others as interventional settings. More\ndetails and more evaluations are reported in the Supplementary Material. In contrast to likelihood-based\napproaches like [21, 5, 15, 19], in our approach we do not need to model the interventions\nquantitatively. We only need to know the intervention targets, while the intervention types do not\nmatter. Another advantage of our approach is that it takes into account possible latent variables.\nWe use a t-test to test for each intervention and for each variable whether its distribution changes\nwith respect to the observational condition. 
We use the p-values of these tests as in (14) in order to\nobtain weighted ancestral relations that are used as input (with threshold \u03b1 = 0.05). For example, if\nadding U0126 (a MEK inhibitor) changes the distribution of RAF significantly with respect to the\nobservational baseline, we get a weighted ancestral relation MEK \u21e2 RAF. In addition, we use partial\ncorrelations up to order 1 (tested in the observational data only) to obtain weighted independences\nused as input. We use ACI with (16) to score the ancestral relations for each ordered pair of variables.\nThe main results are illustrated in Figure 3, where we compare ACI with bootstrapped anytime CFCI\nunder different inputs. The output for bootstrapped anytime FCI is similar, so we report it only in\nthe Supplementary Material. Algorithms like (anytime) (C)FCI can only use the independences in\nthe observational data as input and therefore miss the strongest signal, weighted ancestral relations,\nwhich are obtained by comparing interventional with observational data. In the Supplementary\nMaterial, we also compare with other methods ([17], [15]). Interestingly, as we show there, our\nresults are similar to the best acyclic model reconstructed by the score-based method from [15]. As for\nother constraint-based methods, HEJ is computationally infeasible in this setting, while COmbINE\nassumes perfect interventions (whereas this dataset contains mostly activity interventions).\n\n[Figure 2 plots: precision against recall; methods shown: Bootstrapped (100) CFCI, Bootstrapped (100) FCI, HEJ (c=1), ACI (c=1), ACI (c=1, i=1), Standard CFCI, Standard FCI.]\n\n(a) Bootstrapped (100) anytime CFCI (input: independences of order \u2264 1)\n(b) ACI (input: weighted ancestral relations)\n(c) ACI (input: independences of order \u2264 1, weighted ancestral relations)\n\nFigure 3: Results for flow cytometry dataset. Each matrix represents the ancestral relations, where\neach row represents a cause and each column an effect. The colors encode the confidence levels:\ngreen is positive, black is unknown, while red is negative. The intensity of the color represents the\ndegree of confidence. For example, ACI identifies MEK to be a cause of RAF with high confidence.\n\n
Notably, our algorithms can correctly recover from faithfulness violations (e.g., the independence\nbetween MEK and ERK), because they take into account the weight of the input statements (the weight\nof the independence is considerably smaller than that of the ancestral relation, which corresponds\nwith a quite significant change in distribution). In contrast, methods that start by reconstructing the\nskeleton, like (anytime) (C)FCI, would decide that MEK and ERK are nonadjacent, and are unable to\nrecover from that erroneous decision. This illustrates another advantage of our approach.\n\n6 Discussion and conclusions\n\nAs we have shown, ancestral structures are very well-suited for causal discovery. They offer a\nnatural way to incorporate background causal knowledge, e.g., from experimental data, and allow a\nhuge computational advantage over existing representations for error-correcting algorithms, such as\n[8]. When needed, ancestral structures can be mapped to a finer-grained representation with direct\ncausal relations, as we sketch in the Supplementary Material. 
Furthermore, confidence estimates on\ncausal predictions are extremely helpful in practice, and can significantly boost the reliability of the\noutput. Although standard methods, like bootstrapping (C)FCI, already provide reasonable estimates,\nmethods that take into account the confidence in the inputs, such as the one presented here, can lead to\nfurther improvements of the reliability of causal relations inferred from data.\nStrangely (or fortunately) enough, neither of the optimization methods seems to improve much with\nhigher-order independence test results. We conjecture that this may happen because our loss function\nessentially assumes that the test results are independent of one another (which is not true). Finding a\nway to take this into account in the loss function may further improve the achievable accuracy, but\nsuch an extension may not be straightforward.\n\nAcknowledgments\n\nSM and JMM were supported by NWO, the Netherlands Organization for Scientific Research\n(VIDI grant 639.072.410). SM was also supported by the Dutch programme COMMIT/ under the\nData2Semantics project. TC was supported by NWO grant 612.001.202 (MoCoCaDi), and EU-FP7\ngrant agreement n.603016 (MATRICS). We also thank Sofia Triantafillou for her feedback, especially\nfor pointing out the correct way to read ancestral relations from a PAG.\n\n[Figure 3 panels: matrices over the variables Raf, Mek, PLCg, PIP2, PIP3, Erk, Akt, PKA, PKC, p38, JNK, with panel titles BCFCI (indep. \u2264 1), ACI (ancestral relations), ACI (ancestral r. + indep. \u2264 1). Further extracted panel titles: Weighted causes(i,j), Weighted indep(i,j), Consensus graph, ACI (causes), ACI (causes + indeps), FCI, CFCI, Acyclic Joris graph; color scale from \u22121000 to 1000.]\n\nReferences\n\n[1] G. Borboudakis and I. Tsamardinos. Incorporating causal prior knowledge as path-constraints in Bayesian networks and Maximal Ancestral Graphs. In ICML, pages 1799\u20131806, 2012.\n\n[2] T. Claassen and T. Heskes. A logical characterization of constraint-based causal discovery. In UAI, pages 135\u2013144, 2011.\n\n[3] T. Claassen and T. Heskes. A Bayesian approach to constraint-based causal inference. In UAI, pages 207\u2013216, 2012.\n\n[4] D. Colombo, M. H. Maathuis, M. Kalisch, and T. S. Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 40(1):294\u2013321, 2012.\n\n[5] D. Eaton and K. Murphy. Exact Bayesian structure learning from uncertain interventions. In AISTATS, pages 107\u2013114, 2007.\n\n[6] M. Gebser, R. Kaminski, B. Kaufmann, and T. Schaub. Clingo = ASP + control: Extended report. Technical report, University of Potsdam, 2014. http://www.cs.uni-potsdam.de/wv/pdfformat/gekakasc14a.pdf.\n\n[7] M. Gelfond. 
Answer sets. In Handbook of Knowledge Representation, pages 285\u2013316. 2008.\n\n[8] A. Hyttinen, F. Eberhardt, and M. J\u00e4rvisalo. Constraint-based causal discovery: Conflict resolution with Answer Set Programming. In UAI, pages 340\u2013349, 2014.\n\n[9] M. Kalisch and P. B\u00fchlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8:613\u2013636, 2007.\n\n[10] M. Kalisch, M. M\u00e4chler, D. Colombo, M. Maathuis, and P. B\u00fchlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(1):1\u201326, 2012.\n\n[11] D. J. Kleitman and B. L. Rothschild. Asymptotic enumeration of partial orders on a finite set. Transactions of the American Mathematical Society, 205:205\u2013220, 1975.\n\n[12] V. Lifschitz. What is Answer Set Programming? In AAAI, pages 1594\u20131597, 2008.\n\n[13] D. Margaritis and F. Bromberg. Efficient Markov network discovery using particle filters. Computational Intelligence, 25(4):367\u2013394, 2009.\n\n[14] F. Markowetz, S. Grossmann, and R. Spang. Probabilistic soft interventions in conditional Gaussian networks. In AISTATS, pages 214\u2013221, 2005.\n\n[15] J. M. Mooij and T. Heskes. Cyclic causal discovery from continuous equilibrium data. In UAI, pages 431\u2013439, 2013.\n\n[16] J. Pearl. Causality: models, reasoning and inference. Cambridge University Press, 2009.\n\n[17] J. Peters, P. B\u00fchlmann, and N. Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B, 8(5):947\u20131012, 2015.\n\n[18] J. Ramsey, J. Zhang, and P. Spirtes. Adjacency-faithfulness and conservative causal inference. In UAI, pages 401\u2013408, 2006.\n\n[19] D. Rothenh\u00e4usler, C. Heinze, J. Peters, and N. Meinshausen. BACKSHIFT: Learning causal cyclic graphs from unknown shift interventions. 
In NIPS, pages 1513\u20131521, 2015.\n\n[20] A. Roumpelaki, G. Borboudakis, S. Triantafillou, and I. Tsamardinos. Marginal causal consistency in constraint-based causal learning. In Causation: Foundation to Application Workshop, UAI, 2016.\n\n[21] K. Sachs, O. Perez, D. Pe\u2019er, D. Lauffenburger, and G. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523\u2013529, 2005.\n\n[22] P. Spirtes. An anytime algorithm for causal inference. In AISTATS, pages 121\u2013128, 2001.\n\n[23] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT press, 2000.\n\n[24] J. Tian and J. Pearl. Causal discovery from changes. In UAI, pages 512\u2013521, 2001.\n\n[25] S. Triantafillou and I. Tsamardinos. Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 16:2147\u20132205, 2015.\n\n[26] J. Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17):1873\u20131896, 2008.\n", "award": [], "sourceid": 2214, "authors": [{"given_name": "Sara", "family_name": "Magliacane", "institution": "VU University Amsterdam"}, {"given_name": "Tom", "family_name": "Claassen", "institution": "Radboud University Nijmegen"}, {"given_name": "Joris", "family_name": "Mooij", "institution": "Radboud University Nijmegen"}]}