{"title": "Learning and Testing Causal Models with Interventions", "book": "Advances in Neural Information Processing Systems", "page_first": 9447, "page_last": 9460, "abstract": "We consider testing and learning problems on causal Bayesian networks as defined by Pearl (Pearl, 2009). Given a causal Bayesian network M on a graph with n discrete variables and bounded in-degree and bounded ``confounded components'', we show that O(log n) interventions on an unknown causal Bayesian network X on the same graph, and O(n/epsilon^2) samples per intervention, suffice to efficiently distinguish whether X=M or whether there exists some intervention under which X and M are farther than epsilon in total variation distance. We also obtain sample/time/intervention efficient algorithms for: (i) testing the identity of two unknown causal Bayesian networks on the same graph; and (ii) learning a causal Bayesian network on a given graph. Although our algorithms are non-adaptive, we show that adaptivity does not help in general: Omega(log n) interventions are necessary for testing the identity of two unknown causal Bayesian networks on the same graph, even adaptively. Our algorithms are enabled by a new subadditivity inequality for the squared Hellinger distance between two causal Bayesian networks.", "full_text": "Learning and Testing Causal Models with\n\nInterventions\n\nJayadev Acharya\u2217\nSchool of ECE\n\nCornell University\n\nArnab Bhattacharyya\u2217\n\nNational University of Singapore\n\n& Indian Institute of Science\n\nConstantinos Daskalakis\u2217\n\nEECS\nMIT\n\nacharya@cornell.edu\n\narnabb@iisc.ac.in\n\ncostis@csail.mit.edu\n\nSaravanan Kandasamy\u2217\n\nSTCS\n\nTata Institute of Fundamental Research\n\nsaravan.tuty@gmail.com\n\nAbstract\n\nWe consider testing and learning problems on causal Bayesian networks as de\ufb01ned\nby Pearl [Pea09]. 
Given a causal Bayesian network M on a graph with n discrete\nvariables and bounded in-degree and bounded \u201cconfounded components\u201d, we\nshow that O(log n) interventions on an unknown causal Bayesian network X\non the same graph, and O(n/\u00012) samples per intervention, suf\ufb01ce to ef\ufb01ciently\ndistinguish whether X = M or whether there exists some intervention under\nwhich X and M are farther than \u0001 in total variation distance. We also obtain\nsample/time/intervention ef\ufb01cient algorithms for: (i) testing the identity of two\nunknown causal Bayesian networks on the same graph; and (ii) learning a causal\nBayesian network on a given graph. Although our algorithms are non-adaptive, we\nshow that adaptivity does not help in general: \u2126(log n) interventions are necessary\nfor testing the identity of two unknown causal Bayesian networks on the same graph,\neven adaptively. Our algorithms are enabled by a new subadditivity inequality for\nthe squared Hellinger distance between two causal Bayesian networks.\n\n1\n\nIntroduction\n\nA central task in statistical inference is learning properties of a high-dimensional distribution over\nsome variables of interest given observational data. However, probability distributions only capture\nthe association between variables of interest and may not suf\ufb01ce to predict what the consequences\nwould be of setting some of the variables to particular values. A standard example illustrating the\npoint is this: From observational data, we may learn that atmospheric air pressure and the readout of\na barometer are correlated. But can we predict whether the atmospheric pressure would stay the same\nor go up if the barometer readout was forcefully increased by moving its needle?\nSuch issues are at the heart of causal inference, where the goal is to learn a causal model over some\nvariables of interest, which can predict the result of external interventions on the variables. 
For\nexample, a causal model on two variables of interest X and Y need not only determine conditional\nprobabilities of the form Pr[Y | X = x], but also interventional probabilities Pr[Y | do(X = x)]\n\u2217The authors are arranged in alphabetical ordering. Jayadev Acharya was supported by a Cornell University\nstartup grant. Arnab Bhattacharyya was partially supported by DST Ramanujan grant DSTO1358 and DRDO\nFrontiers Project DRDO0687. Constantinos Daskalakis was supported by NSF awards CCF-1617730 and\nIIS-1741137, a Simons Investigator Award, a Google Faculty Research Award, and an MIT-IBM Watson AI Lab\nresearch grant. Saravanan Kandasamy was partially supported by DRDO Frontiers Project DRDO0687.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fwhere, following Pearl\u2019s notation [Pea09], do(X = x) means that X has been forced to take the value\nx by an external action. In our previous example, Pr[Pressure | do(Barometer = b)] = Pr[Pressure]\nbut Pr[Barometer | do(Pressure = p)] (cid:54)= Pr[Barometer], re\ufb02ecting that the atmospheric pressure\ncauses the barometer readout, not the other way around.\nCausality has been the focus of extensive study, with a wide range of analytical frameworks proposed\nto capture causal relationships and perform causal inference. A prevalent class of causal models are\ngraphical causal models, going back to Wright [Wri21] who introduced such models for path analysis,\nand Haavelmo [Haa43] who used them to de\ufb01ne structural equation models. Today, graphical causal\nmodels are widely used to represent causal relationships in a variety of ways [SDLC93, GC99, Pea09,\nSGS00, Nea04, KF09].\nIn our work, we focus on the central model of causal Bayesian networks (CBNs) [Pea09, SGS00,\nNea04]. Recall that a (standard) Bayesian network is a distribution over several random variables\nthat is associated with a directed acyclic graph. 
The vertices of the graph are the random variables\nover which the distribution is de\ufb01ned, and the graph describes conditional independence properties of\nthe distribution. In particular, every variable is independent of its non-descendants, conditioned on\nthe values of its parents in the graph. A CBN is also associated with a directed acyclic graph (DAG)\nwhose vertices are the random variables on which the distribution is de\ufb01ned. However, a CBN is not\na single distribution over these variables but the collection of all possible interventional distributions,\nde\ufb01ned by setting any subset of the variables to any set of values. In particular, every vertex is both a\nvariable V and a mechanism to generate the value of V given the values of the parent vertices, and\nthe interventional distributions are de\ufb01ned in terms of these mechanisms.\nWe allow CBNs to contain both observable and unobservable (hidden) random variables. Importantly,\nwe allow unobservable confounding variables. These are variables that are not observable, yet they\nare ancestors of at least two observable variables. These are especially tricky in statistical inference,\nas they may lead to spurious associations.\n\n1.1 Our Contributions\n\nConsider the following situations:\n\n1. An engineer designs a large circuit using a circuit simulation program and then builds it\nin hardware. The simulator predicts relationships between the voltages and currents at\ndifferent nodes of the circuit. Now, the engineer would like to verify whether the simulator\u2019s\npredictions hold for the real circuit by doing a limited number of experiments (e.g., holding\nsome voltages at set levels, cutting some wires, etc.). If not, then she would want to learn a\nmodel for the system that has suf\ufb01ciently good accuracy.\n\n2. A biologist is studying the role of a set of genes in migraine. 
He would like to know\nwhether the mechanisms relating the products of these genes are approximately the same for\npatients with and without migraine. He has access to tools (e.g., CRISPR-based gene editing\ntechnologies [DPL+16]) that generate data for gene activation and knockout experiments.\n\nMotivated by such scenarios, we study the problems of hypothesis testing and learning CBNs when\nboth observational and interventional data are available. The main highlight of our work is that we\nprove bounds on the number of samples, interventions, and time steps required by our algorithms.\nTo de\ufb01ne our problems precisely, we need to specify what we consider to be a good approximation\nof a causal model. Given \u0001 \u2208 (0, 1), we say that two causal models M and N on a set of variables\nV\u222a U (observable and unobservable resp.) are \u0001-close (denoted \u2206(M,N ) \u2264 \u0001) if for every subset S\nof V and assignment s to S, performing the same intervention do(S = s) to both M and N leads to\nthe two interventional distributions being \u0001-close to each other in total variation distance. Otherwise,\nthe two models are said to be \u0001-far and \u2206(M,N ) > \u0001.\nThus, two models M and N are close according to the above de\ufb01nition if there is no intervention\nwhich can make the resulting distributions differ signi\ufb01cantly. This de\ufb01nition is motivated by the\nphilosophy articulated by Pearl (pp. 414, [Pea09]) that \u201ccausation is a summary of behavior under\nintervention\u201d. Intuitively, if there is some intervention that makes M and N behave differently, then\nM and N do not describe the same causal process. Without having any prior information about the\n\n2\n\n\fset of relevant interventions, we adopt a worst-case view and simply require that causal models M\nand N behave similarly for every intervention to be declared close to each other.2\nThe goodness-of-\ufb01t testing problem can now be described as follows. 
Suppose that a collection\nV \u222a U (observable and unobservable resp.) of n random variables are causally related to each other.\nLet M be a hypothesized causal model for V \u222a U that we are given explicitly. Suppose that the\ntrue model to describe the causal relationships is an unknown X . Then, the goodness-of-\ufb01t testing\nproblem is to distinguish between: (i) X = M, versus (ii) \u2206(X ,M) > \u0001, by sampling from and\nexperimenting on V, i.e. forcing some variables in V to certain values and sampling from the thus\nintervened upon distribution.\nWe study goodness-of-\ufb01t testing assuming X and M are causal Bayesian networks over a known\nDAG G. Given a DAG G, CBN M and \u0001 > 0, we denote the corresponding goodness-of-\ufb01t testing\nproblem CGFT(G,M, \u0001). For example, the engineer above, who wants to determine whether the\ncircuit behaves as the simulation software predicts, is interested in the problem CGFT(G,M, \u0001)\nwhere M is the simulator\u2019s prediction, G is determined by the circuit layout, and \u0001 is a user-speci\ufb01ed\naccuracy parameter. Here is our theorem for goodness-of-\ufb01t testing.\nTheorem 1 (Goodness-of-\ufb01t Testing \u2013 Informal). Let G be a DAG on n vertices with bounded\nin-degree and bounded \u201cconfounded components.\u201d Let M be a given CBN over G. Then, there exists\nan algorithm solving CGFT(G,M, \u0001) that makes O(log n) interventions, takes O(n/\u00012) samples\nper intervention and runs in time \u02dcO(n2/\u00012). Namely, the algorithm gets access to a CBN X over G,\naccepts with probability \u2265 2/3 if X = M and rejects with probability \u2265 2/3 if \u2206(X ,M) > \u0001.\n\nBy \u201cconfounded component\u201d in the above statement, we mean a c-component in G, as de\ufb01ned in\nDe\ufb01nition 7. 
Roughly, a c-component is a maximal set of observable vertices that are pairwise\nconnected by paths of the form Vi1 \u2190 Uj1 \u2192 Vi2 \u2190 Uj2 \u2192 Vi3 \u2190 \u00b7\u00b7\u00b7 \u2192 Vik where Vi\u2019s and Uj\u2019s\ncorrespond to observable and unobservable variables respectively. The decomposition of CBNs into\nc-components has been important in earlier work [TP02] and continues to be an important structural\nproperty here.\nWe can use our techniques to extend Theorem 1 in several ways:\n\n(1) In the two-sample testing problem for causal models, the tester gets access to two unknown\ncausal models X and Y on the same set of variables V \u222a U (observable and unobservable\nresp.). For a given \u0001 > 0, the goal is to distinguish between (i) X = Y and (ii) \u2206(X ,Y) > \u0001\nby sampling from and intervening on V in both X and Y.\nWe solve the two-sample testing problem when the inputs are two CBNs over the same DAG\nG in n variables; for a given \u0001 > 0 and DAG G, call the problem C2ST(G, \u0001). Speci\ufb01cally,\nwe show an algorithm to solve C2ST(G, \u0001) that makes O(log n) interventions on the input\nmodels X and Y, uses O(n/\u00012) samples per intervention and runs in time \u02dcO(n2/\u00012), when\nG has bounded in-degree and c-component size.3\n\n(2) For the C2ST(G, \u0001) problem, the requirement that G be fully known is rather strict. Instead,\nsuppose the common graph G is unknown and only bounds on its in-degree and maximum c-\ncomponent size are given. For example, the biologist above who wants to test whether certain\ncausal mechanisms are identical for patients with and without migraine can reasonably\nassume that the underlying causal graph is the same (even though he doesn\u2019t know what it is\nexactly) and that only the strengths of the relationships may differ between subjects with\nand without migraine. 
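Concretely, since c-components are just the connected components of the observable vertices under the "shares a hidden parent" relation (chains of such links realize the paths Vi1 ← Uj1 → Vi2 ← Uj2 → ... above), they can be computed with a standard union-find pass. The sketch below is ours; the dictionary-based graph encoding and function names are illustrative, not from the paper:

```python
from itertools import combinations

def c_components(observables, hidden_parents):
    """Partition observable vertices into c-components.

    hidden_parents maps each observable V to the set of its
    unobservable parents; two observables sharing a hidden parent
    are linked by a bi-directed edge, and c-components are the
    connected components of that relation.
    """
    parent = {v: v for v in observables}  # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    for v1, v2 in combinations(observables, 2):
        if hidden_parents[v1] & hidden_parents[v2]:  # common hidden parent
            union(v1, v2)

    comps = {}
    for v in observables:
        comps.setdefault(find(v), set()).add(v)
    return list(comps.values())

# Example: V1 <- U1 -> V2, V2 <- U2 -> V3; V4 is unconfounded.
obs = ["V1", "V2", "V3", "V4"]
hp = {"V1": {"U1"}, "V2": {"U1", "U2"}, "V3": {"U2"}, "V4": set()}
assert sorted(sorted(c) for c in c_components(obs, hp)) == [["V1", "V2", "V3"], ["V4"]]
```

Note that the chain through U1 and U2 merges V1, V2, and V3 into one c-component even though V1 and V3 share no hidden parent directly.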
For this problem, we obtain an ef\ufb01cient algorithm with nearly the\nsame number of samples and interventions as above.\n(3) The problem of learning a causal model can be posed as follows: the learning algorithm\ngets access to an unknown causal model X over a set of variables V \u222a U (observable and\nunobservable resp.), and its objective is to output a causal model N such that \u2206(X ,N ) \u2264 \u0001.\nWe consider the problem CL(G, \u0001) of learning a CBN over a known DAG G on the observ-\nable and unobservable variables. For example, this is the problem facing the engineer above\n\n2To quote Pearl again, \u201cIt is the nature of any causal explanation that its utility be proven not over standard\nsituations but rather over novel settings that require innovative manipulations of the standards.\u201d (pp. 219,\n[Pea09]).\n3Of course, it is allowed for the two networks to be different subgraphs of G. So, X could be de\ufb01ned by the\ngraph G1 and Y by G2. Our result holds when G1 \u222a G2 is a DAG with bounded in-degree and c-component\nsize.\n\n3\n\n\fwho wants to learn a good model for his circuit by conducting some experiments; the DAG\nG in this case is known from the circuit layout. Given a DAG G with bounded in-degree\nand c-component size and a parameter \u0001 > 0, we design an algorithm that on getting access\nto a CBN X de\ufb01ned over G, makes O(log n) interventions, uses \u02dcO(n2/\u00014) samples per\nintervention, runs in time \u02dcO(n3/\u00014), and returns an oracle N that can ef\ufb01ciently compute\nPX [V \\ T | do(T = t)] for any T \u2286 V and t \u2208 \u03a3|T| with error at most \u0001 in TV distance.\n\nThe sample complexity of our testing algorithms matches the state-of-the-art for testing identity of\n(standard) Bayes nets [DP17, CDKS17]. 
Designing a goodness-of-fit tester using o(n) samples is a very interesting challenge and seems to require fundamentally new techniques.

We also show that the number of interventions for C2ST(G, ε) and CL(G, ε) is nearly optimal, even in its dependence on the in-degree and c-component size, and even when the algorithms are allowed to be adaptive. By 'adaptive' we mean that the algorithms may choose future interventions based on the samples observed from past interventions. Specifically,

Theorem 2. There exists a causal graph G on n vertices, with maximum in-degree at most d and largest c-component size at most ℓ, such that Ω(|Σ|^{ℓd−2} log n) interventions are necessary for any algorithm (even an adaptive one) that solves C2ST(G, ε) or CL(G, ε), where Σ is the alphabet from which the variables take values.

We make no assumptions about the distributions or the functional relationships, and we show that in the worst case the K^{ℓd} bound that appears in the number of interventions is unavoidable. However, with further assumptions, one can hope to reduce this number. For example, if the graph has no hidden variables and each assignment to the parent sets occurs with large enough probability, then access to interventions is not necessary. Likewise, if the mechanism relating each variable to its parents can be modeled as a linear function, then a covering set of interventions is not needed.

1.2 Related Work

(A longer discussion of previous work on causality as well as on testing/learning distributions is in Appendix A.) There is a huge and old literature on causality, covering both learning and testing causal relationships, that is impossible to detail here. To the best of our knowledge, though, most previous work is on testing/learning only the causal graph, whereas our objective is to test/learn the entire causal model (i.e., the set of all interventional distributions). 
In fact, many of our results assume that\nthe causal graph is already known; as discussed in Section 1.4, we hope that in future work, this\nrequirement can be relaxed.\nMotivated by the problem of testing causal graphs, Tian and Pearl [TP02] derive functional constraints\namong the distributions of observed variables (not just conditional independence relations) in a causal\nBayesian network over the graph. Kang and Tian [KT06] derive such functional constraints on\ninterventional distributions. Although these results yield non-trivial constraints, it is not clear how to\nuse them for testing goodness-of-\ufb01t with statistical guarantees.\nThe problem of learning causal graphs has been extensively studied. [PV95, VP92, SGS00, ARSZ05,\nZha08] give algorithms to recover the class of causal graphs consistent with given conditional in-\ndependence relations in observational data. Subsequent work considered the setting when both\nobservational and interventional data are available. This setting has been a recent focus of study\n[HB12a, WSYU17, YKU18], motivated by advances in genomics that allow high-resolution observa-\ntional and interventional data for gene expression using \ufb02ow cytometry and CRISPR technologies\n[SPP+05, MBS+15, DPL+16]. [EGS05, Ebe07, HB12b] derived the minimum number of inter-\nventional distributions that suf\ufb01ce to fully identify the underlying causal graphs when there are no\nconfounding variables. Recently, Kocaoglu et al. [KSB17] showed an ef\ufb01cient randomized algorithm\nto learn a causal graph with confounding variables while minimizing the number of interventions\nfrom which conditional independence relations are obtained.\nFrom the perspective of query learning, learning circuits with value injection queries was introduced\nby Angluin et al. [AACW09]. The value injection query model is a deterministic circuit de\ufb01ned\nover an underlying directed acyclic graph whose output is determined by the value of the output\nnode. 
[AACW09] considers the problem of learning the outputs of all value injection queries (i.e.,\ninterventions) where the learner has oracle access to value injection queries with the objective of\n\n4\n\n\fminimizing the number of queries, when the size of alphabet set is constant. This was later generalized\nto large alphabet and analog circuits in [AACR08, Rey09].\nAll the works mentioned above assume access to an oracle that gives conditional independence\nrelations between variables in the observed and interventional distributions. This is clearly a problem-\natic assumption because it implicitly requires unbounded training data. For example, Scheines and\nSpirtes [SS08] have pointed out that measurement error, quantization and aggregation can easily alter\nconditional independence relations. The problem of developing \ufb01nite sample bounds for testing and\nlearning causal models has been repeatedly posed in the literature. The excellent survey by Guyon,\nJanzing and Sch\u00f6lkopf [GJS10] on causality from a machine learning perspective underlines the issue\nas one of the \u201cten open problems\u201d in the area. To the best of our knowledge, our work is the \ufb01rst to\nshow \ufb01nite sample complexity and running time bounds for inference problems on CBNs.\nAn application of our learning algorithm is to the problem of transportability, studied in [BP13, SP08,\nLH13, PB11, BP12], which refers to the notion of transferring causal knowledge from a set of source\ndomains to a target domain to identify causal effects in the target domain, when there are certain\ncommonalities between the source and target domains. Most work in this area assume the existence\nof an algorithm that learns the set of all interventions, that is the complete speci\ufb01cation of the source\ndomain model. 
Our learning algorithm can be used for this purpose; it is ef\ufb01cient in terms of time,\ninterventions, and sample complexity, and it learns each intervention distribution to error at most \u0001.\n\n1.3 Overview of our Techniques\n\nIn this section, we give an overview of the proof of Theorem 1 and the lower bound construction.\nWe start by making a well-known observation [TP02, VP90] that CBNs can be assumed to be over a\nparticular class of DAGs known as semi-Markovian causal graphs. A semi-Markovian causal graph\nis a DAG where every vertex corresponding to an unobservable variable is a root and has exactly two\nchildren, both observable. More details of the correspondence are given in Appendix I.\nIn a semi-Markovian causal graph, two observable vertices V1 and V2 are said to be connected\nby a bi-directed edge if there is a common unobservable parent of V1 and V2. Each connected\ncomponent of the graph restricted to bi-directed edges is called a c-component. The decomposition\ninto c-components forms a partition of the observable vertices, which gives very useful structural\ninformation about the causal model. In particular, a fact that is key to our whole analysis is that if\nN is a semi-Markovian Bayesian network on observable and unobservable variables V \u222a U with\nc-components C1, . . . , Cp, then for any v \u2208 \u03a3|V|:\n\nPN [v] =\n\nPN [ci | do(V \\ Ci = v \\ ci)]\n\n(1)\n\np(cid:89)\n\ni=1\n\nwhere \u03a3 is the alphabet set, ci is the restriction of v to Ci and v \\ ci is the restriction of v to V \\ Ci\n[TP02]. Moreover, one can write a similar formula (Lemma 9) for an interventional distribution on\nN instead of the observable distribution PN [v].\nThe most direct approach to test whether two causal Bayes networks X and Y are identical is to test\nwhether each interventional distribution is identical in the two models. 
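The factorization (1) can be sanity-checked numerically on a toy semi-Markovian network. In the sketch below (the three-variable model and all names are our own illustration, not from the paper), a hidden U confounds V1 and V2 while V3 depends only on V1, so the c-components are C1 = {V1, V2} and C2 = {V3}, and the product of the two interventional factors recovers the observational joint:

```python
import itertools

# Toy SMBN over binary variables: V1 <- U -> V2, and V1 -> V3.
# Mechanisms (Bernoulli tables chosen arbitrarily for illustration):
pU = {0: 0.6, 1: 0.4}
pV1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # P[V1 | U]
pV2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}   # P[V2 | U]
pV3 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}   # P[V3 | V1]

def joint(v1, v2, v3):
    # Observational distribution: marginalize out the hidden U.
    return sum(pU[u] * pV1[u][v1] * pV2[u][v2] * pV3[v1][v3] for u in (0, 1))

def factor_C1(v1, v2):
    # P[C1 | do(V3 = v3)]: cutting V3's mechanism does not affect {V1, V2}.
    return sum(pU[u] * pV1[u][v1] * pV2[u][v2] for u in (0, 1))

def factor_C2(v1, v3):
    # P[C2 | do(V1 = v1, V2 = v2)]: V3's mechanism only reads V1.
    return pV3[v1][v3]

# Check (1): P[v] equals the product of per-c-component interventional factors.
for v1, v2, v3 in itertools.product((0, 1), repeat=3):
    assert abs(joint(v1, v2, v3) - factor_C1(v1, v2) * factor_C2(v1, v3)) < 1e-12
```

The identity holds here because each factor of the joint sorts cleanly into one c-component's interventional term; the general statement (and its interventional extension, Lemma 9) is what our tests rely on.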
This strategy would require (|Σ| + 1)^n interventions, each on a variable set of size O(n), where n is the total number of observable vertices. To reduce the number of interventions as well as the sample complexity, a natural approach, given (1) and its extension to interventional distributions, is to test for identity between each pair of "local" distributions

PX[S | do(v \ s)]    and    PY[S | do(v \ s)]

for every subset S of a c-component C and every assignment v \ s to V \ S. We assume that each c-component is bounded, so each local distribution has bounded support. Moreover, using the conditional independence properties of Bayesian networks, note that in each local distribution we only need to intervene on the observable parents of S that lie outside S, not on all of V \ S.

Through a probabilistic argument, we efficiently find a small set I of covering interventions, defined as a set of interventions with the following property: for every subset S of a c-component and for every assignment pa(S) to the observable parents of S, there is an intervention I ∈ I that does not intervene on S and sets the parents of S to exactly pa(S). Our test performs all the interventions in I on both X and Y and hence can observe each of the local distributions PX[S | do(pa(S))] and
A bound on the total variation distance between the\ninterventional distributions and hence \u2206(X ,Y) follows. The subadditivity theorem is inspired from\n[DP17], where they showed that for Bayes networks, \u201ccloseness of local marginals implies closeness\nof the joint distribution\u201d. Our result is in a very different set-up, where we prove \u201ccloseness of local\ninterventions implies closeness of any joint interventional distribution\u201d, and requires a new proof\ntechnique. We relax the squared Hellinger distance between the interventional distributions as the\nobjective of a minimization program in which the constraints are that each pair of local distributions\nis \u03b3-close in squared Hellinger distance. By a sequence of transformations of the program, we lower\nbound its objective in terms of \u03b3, thus proving our result. In the absence of unobservable variables,\nthe analysis becomes much simpler and is sketched in Appendix B.\nRegarding the lower bound, we prove that the number of interventions required by our algorithms are\nindeed necessary for any algorithm that solves C2ST(G, \u0001) or CL(G, \u0001), even if the algorithms are\nprovided with in\ufb01nite samples/time. For any algorithm that fails to perform some local intervention I,\nwe provide a construction of two models which do not agree on I and agree on all other interventions.\nOur construction is designed in such a way that it allows adaptive algorithms. The idea is to show\nan adversary that, for each intervention, reveals a distribution to the algorithm. 
Towards the end, when the algorithm fails to perform some local intervention I, we can construct two models such that: i) the two models disagree on I, and the total variation distance between the corresponding interventional distributions equals one; and ii) for all other interventions, the interventional distributions revealed by the adversary match the corresponding distributions of both models. This, together with a probabilistic argument, shows the existence of a causal graph that requires a sufficiently large number of interventions to solve C2ST(G, ε) and CL(G, ε).

1.4 Future Directions

We hope that this work paves the way for future research on designing efficient algorithms with bounded sample complexity for learning and testing causal models. For the sake of concreteness, we list a few open problems.

• Interventional experiments are often expensive or infeasible, so one would like to deduce causal models from observations alone. In general, this is impossible. However, in identifiable CBNs (see [Tia02]), one can identify causal effects from observational data alone. Is there an efficient algorithm to learn an identifiable interventional distribution from samples?⁴

• A deficiency of our learning algorithm is that we assume the underlying causal graph is fully known. Can our learning algorithm be extended to the setting where the hypothesis only consists of some limited information about the causal graph (e.g., in-degree, c-component size) instead of the whole graph? This seems to be a hard problem. In fact, it is open how to efficiently learn the distribution given by a standard Bayesian network based on samples from it if we don't know the underlying graph [DP17, CDKS17].

• Our goodness-of-fit algorithm might reject even when the input X is very close to the hypothesis M. 
Is there a tolerant goodness-of-fit tester that accepts when Δ(X, M) ≤ ε₁ and rejects when Δ(X, M) > ε₂ for 0 < ε₁ < ε₂ < 1? Our current analysis does not extend to a tolerant tester. The same question holds for testing goodness-of-fit of standard Bayesian networks.

• In many applications, causal models are described in terms of structural equation models, in which each variable is a deterministic function of its parents as well as some stochastic error terms. Design sample- and time-efficient algorithms for testing and learning structural equation models. Other questions, such as evaluating counterfactual queries or doing policy analysis (see Chapter 7 of [Pea09]), also present interesting algorithmic problems.

⁴Schulman and Srivastava [SS16] have shown that under adversarial noise, there exist causal Bayesian networks on n nodes where estimating an identifiable intervention to precision d requires precision d + exp(n^{0.49}) in the estimates of the probabilities of observed events. However, this instability is likely due to the adversarial noise and does not preclude an efficient sampling-based algorithm, especially if we assume a balancedness condition as in [CDKS17].

2 Testing and Learning Algorithms for SMBNs

We use SMCG and SMBN to denote a semi-Markovian causal graph and a semi-Markovian Bayesian network, respectively, on V ∪ U, the observable and unobservable variables respectively. Let G_{d,ℓ} denote the class of SMCGs with maximum in-degree at most d and largest c-component size at most ℓ. For any subset S of observable variables, we use Pa(S) to denote the observable parents of S (excluding S), and pa(S) to denote an assignment to Pa(S). More formal definitions can be found in Appendix C.

First we recall a fast and sample-efficient test for squared Hellinger distance from [DKW18].

Lemma 1. 
[Hellinger Test, [DKW18]] Given O(min(D^{2/3}/ε^{8/3}, D^{3/4}/ε²)) samples from each of two unknown distributions P and Q over a support of size D, we can distinguish between P = Q and H²(P, Q) ≥ ε² with probability at least 2/3. This probability can be boosted to 1 − δ at the cost of an additional O(log(1/δ)) factor in the sample complexity. The running time of the algorithm is quasi-linear in the sample size.⁵

We also need the notion of covering intervention sets:

Definition 1. A set of interventions I is a covering intervention set if for every subset S of every c-component, and every assignment pa(S) ∈ Σ^{|Pa(S)|}, there exists an I ∈ I such that: (i) no node in S is intervened on in I; (ii) every node in Pa(S) is intervened on; and (iii) I restricted to Pa(S) has the assignment pa(S).

Our algorithms comprise two key arguments:

• A procedure to compute a covering intervention set I of small size, given as Lemma 2 below.

• A subadditivity result for CBNs, shown in Theorem 3, that allows us to localize the distances: we show that if two CBNs are far, then there exists a marginal distribution of some intervention in I under which the marginals are far.

Lemma 2. (Counting Lemma) Let G ∈ G_{d,ℓ} be an SMCG with n vertices and let Σ be an alphabet of size K. Then, there exists a covering intervention set I of size O(K^{ℓd}(3d)^{ℓ}(log n + ℓd log K)). If the total degree of G is bounded by d, then there exists such an I of size O(K^{ℓd}(3d)^{ℓ}ℓd² log K). In both cases, there is an Õ(n)-time algorithm to output I.

Theorem 3. (Subadditivity Theorem) Let M and N be two SMBNs defined on a known and common SMCG G ∈ G_{d,ℓ}. For a given intervention do(t), let V \ T partition into C = {C1, C2, . . . , Cp}, the c-components with respect to the induced graph G[V \ T]. 
Suppose

H²(P_M[Cj | do(pa(Cj))], P_N[Cj | do(pa(Cj))]) ≤ γ   ∀ j ∈ [p], ∀ pa(Cj) ∈ Σ^{|Pa(Cj)|}.   (2)

Then

H²(P_M[V \ T | do(t)], P_N[V \ T | do(t)]) ≤ ε   ∀ t ∈ Σ^{|T|},   (3)

where ε = γ |Σ|^{ℓ(d+1)} n.

The proof of Lemma 2 is given in Appendix E.1; the subadditivity theorem is proved in Appendix E.2.
Our main testing algorithm for C2ST(G, ε) is given below in Theorem 4, which yields Theorem 1 as a corollary, since two-sample tests are harder than one-sample tests. We also provide (1) an algorithm for C2ST(G, ε) when G ∈ G_{d,ℓ} is unknown, and (2) an algorithm for CL(G, ε). Both algorithms are similar to the one below and can be found in Appendix D.

⁵The sample complexity here improves on the previously known result of [DK16].

Theorem 4 (Algorithm for C2ST(G, ε)). Let G ∈ G_{d,ℓ} be an SMCG with n vertices, whose variables take values over a set Σ of size K. Then there is an algorithm that solves C2ST(G, ε), makes O(K^{ℓd}(3d)^ℓ log n) interventions on each of the unknown SMBNs X and Y, takes O(K^{ℓ(d+7/4)} n ε^{−2}) samples per intervention, and runs in time Õ(2^ℓ K^{ℓ(2d+7/4)} n² ε^{−2}).
When the maximum degree (in-degree plus out-degree) of G is bounded by d, our algorithm uses O(K^{ℓd}(3d)^ℓ ℓd² log K) interventions with the same sample complexity and running time as above.

Proof of Theorem 4. Our algorithm is described in Algorithm 1. The algorithm starts with a covering intervention set I. Lemma 2 gives an I with O(K^{ℓd}(3d)^ℓ(log n + ℓd log K)) interventions, and when the maximum degree is bounded by d, the same lemma gives an I of size O(K^{ℓd}(3d)^ℓ ℓd² log K).

Algorithm 1: Algorithm for C2ST(G, ε)

Input: a covering intervention set I

1.
Under each intervention I ∈ I:

(a) Obtain O(K^{ℓ(d+7/4)} n ε^{−2}) samples from the interventional distribution of I in both models X and Y.

(b) For every subset S of a c-component of G, if I does not intervene on S but sets Pa(S) to pa(S), then using Lemma 1 and the obtained samples, test (with error probability at most 1/(3K^{ℓd} 2^ℓ n)):

P_X[S | do(pa(S))] = P_Y[S | do(pa(S))]   vs   H²(P_X[S | do(pa(S))], P_Y[S | do(pa(S))]) ≥ ε²/(2K^{ℓ(d+1)} n).

Output "∆(X, Y) > ε" if the latter holds.

2. Output "X = Y".

We now analyze the performance of our algorithm.

Number of interventions, time, and sample requirements. The number of interventions is the size of I, bounded above. The number of samples per intervention is given in the algorithm. The algorithm performs at most n 2^ℓ K^{ℓd} sub-tests, and for each sub-test the running time is quasi-linear in the sample complexity (Lemma 1), for a total running time of Õ(2^ℓ K^{ℓ(2d+7/4)} n² ε^{−2}).

Correctness. In Theorem 3, we show that when ∆(X, Y) > ε, there exist a subset S of some c-component and an intervention I ∈ I that does not intervene on any node in S but intervenes on Pa(S) with some assignment pa(S) such that

H²(P_X[S | do(pa(S))], P_Y[S | do(pa(S))]) > ε²/(2K^{ℓ(d+1)} n).

This structural result is the key to our algorithm. Together with the fact that the total variation distance between two distributions is at most √2 times their Hellinger distance, it proves that P_X and P_Y are far in total variation distance.
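The relation invoked here, TV(P, Q) ≤ √2 · H(P, Q), is easy to check numerically. Below is a minimal Python sketch (the function names and the example distributions are ours, purely illustrative) computing both distances for discrete distributions represented as outcome-to-probability dictionaries:

```python
from math import sqrt

def hellinger_sq(p, q):
    """Squared Hellinger distance H^2(P, Q) = 1 - sum_x sqrt(P(x) * Q(x))."""
    support = set(p) | set(q)
    return 1.0 - sum(sqrt(p.get(x, 0.0) * q.get(x, 0.0)) for x in support)

def tv(p, q):
    """Total variation distance: half the L1 distance between P and Q."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Example: two distributions over a binary alphabet.
P = {0: 0.5, 1: 0.5}
Q = {0: 0.8, 1: 0.2}

# TV(P, Q) <= sqrt(2) * H(P, Q) holds for any pair of distributions.
assert tv(P, Q) <= sqrt(2) * sqrt(hellinger_sq(P, Q))
```

In the analysis above this inequality converts a Hellinger lower bound of ε²/(2K^{ℓ(d+1)} n) on some marginal into the claimed total-variation separation.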
To bound the error probability, note that the total number of sub-tests we run is at most K^{ℓd} n 2^ℓ and each sub-test errs with probability at most 1/(3K^{ℓd} 2^ℓ n); by the union bound, the error probability over the entire algorithm is at most 1/3.

3 Lower Bound on Interventional Complexity

Recall that in Section 2 we provided non-adaptive algorithms for C2ST(G, ε) and CL(G, ε). In this section we provide lower bounds on the number of interventions that any algorithm must make to solve these problems. Our lower bounds nearly match the upper bounds in Theorems 4 and 8, even when the algorithm is allowed to be adaptive (namely, future interventions may be decided based on the samples observed from past interventions). In other words, these lower bounds show that, in general, adaptivity cannot reduce the interventional complexity.
Theorem 5. There exists an SMCG G ∈ G_{d,ℓ} with n nodes such that Ω(K^{ℓd−2} log n) interventions are necessary for any algorithm (even an adaptive one) that solves C2ST(G, ε) or CL(G, ε).

This theorem is proved via the following ingredients.
Necessary condition. We obtain a necessary condition on the set of interventions I used by any algorithm that solves C2ST(G, ε) or CL(G, ε).
We consider SMCGs G with a specific structure and prove the necessary condition for these graphs: the vertices of G are the union of two disjoint sets A and B, such that G contains directed edges from A to B and bidirected edges within B. Further, all edges in G are of one of these two types. The next lemma is for graphs with this structure.
Lemma 3. Suppose an adaptive algorithm uses a sequence of interventions I to solve C2ST(G, ε) or CL(G, ε). Let C ⊆ B be a c-component of G. Then, for any assignment pa(C) ∈ Σ^{|Pa(C)|}, there is an intervention I ∈ I such that the following conditions hold:

C1.
I intervenes on Pa(C) with the corresponding assignment pa(C),⁶
C2. I does not intervene on any node in C.

Existence. We then show that there is a graph with the structure above for which |I| must be Ω(K^{ℓd−2} log n) in order for the condition to be satisfied. More precisely:
Lemma 4. There exist a graph G and a constant c such that, for any set of interventions I with |I| < c · K^{ℓd−2} log n, there are a c-component C ⊆ B of G and an assignment pa(C) such that no intervention in I both

• assigns pa(C) to Pa(C), and
• observes all variables in C.

Combining these two lemmas, we obtain the lower bound for the adaptive versions of C2ST(G, ε) and CL(G, ε). The proofs of Lemmas 3 and 4 are given in Appendix F.

Acknowledgments

We would like to thank Vasant Honavar, who told us about the problems considered here, for several helpful discussions that were essential for us to complete this work. We acknowledge the support of Google India and NeurIPS in the form of an International Travel Grant, which enabled Saravanan Kandasamy to attend the conference.

References
[AACR08] Dana Angluin, James Aspnes, Jiang Chen, and Lev Reyzin. Learning large-alphabet and analog circuits with value injection queries. Machine Learning, 72(1):113–138, Aug 2008.

[AACW09] Dana Angluin, James Aspnes, Jiang Chen, and Yinghua Wu. Learning a circuit by injecting values. J. Comput. Syst. Sci., 75(1):60–77, January 2009.

[ADK15] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems 28, NIPS '15, pages 3577–3598. Curran Associates, Inc., 2015.

[AGHP92] Noga Alon, Oded Goldreich, Johan Håstad, and René Peralta. Simple constructions of almost k-wise independent random variables.
Random Structures & Algorithms, 3(3):289–304, 1992.

[AKN06] Pieter Abbeel, Daphne Koller, and Andrew Y. Ng. Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research, 7(Aug):1743–1788, 2006.

⁶In our construction, Pa(C) always takes the value 0 in the natural distribution. Hence, interventions in which some vertices in Pa(C) are not intervened are not considered here, as they are equivalent to the case where those vertices are intervened with 0.

[ARSZ05] R Ayesha Ali, Thomas S Richardson, Peter Spirtes, and Jiji Zhang. Towards characterizing Markov equivalence classes for directed acyclic graphs with latent variables. In 21st Conference on Uncertainty in Artificial Intelligence, UAI 2005, 2005.

[AS04] Noga Alon and Joel H Spencer. The probabilistic method. John Wiley & Sons, 2004.

[BFR+00] Tuğkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science, FOCS '00, pages 259–269, Washington, DC, USA, 2000. IEEE Computer Society.

[BFRV11] Arnab Bhattacharyya, Eldar Fischer, Ronitt Rubinfeld, and Paul Valiant. Testing monotonicity of distributions over general partial orders. In ICS, pages 239–252, 2011.

[BL92] Kenneth A Bollen and J Scott Long. Tests for structural equation models: introduction. Sociological Methods & Research, 21(2):123–131, 1992.

[BMS08] Guy Bresler, Elchanan Mossel, and Allan Sly. Reconstruction of Markov random fields from samples: Some observations and algorithms. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 343–356. Springer, 2008.

[BP12] Elias Bareinboim and Judea Pearl. Transportability of causal effects: Completeness results.
In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI '12, pages 698–704. AAAI Press, 2012.

[BP13] E. Bareinboim and J. Pearl. Meta-transportability of causal effects: A formal approach. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 135–143, 2013.

[Bre15] Guy Bresler. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the 47th Annual ACM Symposium on the Theory of Computing, STOC '15, pages 771–782, New York, NY, USA, 2015. ACM.

[Can15] Clément L. Canonne. A survey on distribution testing: Your data is big. But is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22(63), 2015.

[CDKS17] Clément L. Canonne, Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Testing Bayesian networks. In Conference on Learning Theory, pages 370–448, 2017.

[CL68] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.

[DDK17] Constantinos Daskalakis, Nishanth Dikkala, and Gautam C. Kamath. Concentration of multilinear functions of the Ising model with applications to network data. In Advances in Neural Information Processing Systems (NIPS), 2017.

[DDK18] Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising models. In Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '18, Philadelphia, PA, USA, 2018. SIAM.

[DK16] Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. CoRR, abs/1601.05557, 2016.

[DKW18] Constantinos Daskalakis, Gautam Kamath, and John Wright. Which distribution distances are sublinearly testable?
In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '18, pages 2747–2764, Philadelphia, PA, USA, 2018. Society for Industrial and Applied Mathematics.

[DL12] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2012.

[DP17] Constantinos Daskalakis and Qinxuan Pan. Square Hellinger subadditivity for Bayesian networks and its applications to identity testing. Proceedings of Machine Learning Research, 65:1–7, 2017.

[DPL+16] Atray Dixit, Oren Parnas, Biyu Li, Jenny Chen, Charles P. Fulco, Livnat Jerby-Arnon, Nemanja D. Marjanovic, Danielle Dionne, Tyler Burks, Raktima Raychndhury, Britt Adamson, Thomas M. Norman, Eric S. Lander, Jonathan S. Weissman, Nir Friedman, and Aviv Regev. Perturb-seq: Dissecting molecular circuits with scalable single cell RNA profiling of pooled genetic screens. Cell, 167(7):1853–1866.e17, Dec 2016.

[Ebe07] Frederick Eberhardt. Causation and intervention. Doctoral dissertation, Carnegie Mellon University, 2007.

[EGL+92] Guy Even, Oded Goldreich, Michael Luby, Noam Nisan, and Boban Veličković. Approximations of general independent distributions. In Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, pages 10–16. ACM, 1992.

[EGS05] Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 178–184. AUAI Press, 2005.

[Fis25] Ronald Aylmer Fisher. Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, 1925.

[GC99] Clark N Glymour and Gregory Floyd Cooper.
Computation, causation, and discovery. AAAI Press, 1999.

[GJS10] Isabelle Guyon, Dominik Janzing, and Bernhard Schölkopf. Causality: Objectives and assessment. In Causality: Objectives and Assessment, pages 1–42, 2010.

[Gol17] Oded Goldreich. Introduction to property testing. Cambridge University Press, 2017.

[GR00] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Electronic Colloquium on Computational Complexity (ECCC), 7(20), 2000.

[Haa43] Trygve Haavelmo. The statistical implications of a system of simultaneous equations. Econometrica, Journal of the Econometric Society, pages 1–12, 1943.

[HB12a] Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. J. Mach. Learn. Res., 13(1):2409–2464, August 2012.

[HB12b] Alain Hauser and Peter Bühlmann. Two optimal strategies for active learning of causal networks from interventional data. In Proceedings of the Sixth European Workshop on Probabilistic Graphical Models, volume 119, 2012.

[HEH13] Antti Hyttinen, Frederick Eberhardt, and Patrik O Hoyer. Experiment selection for causal discovery. The Journal of Machine Learning Research, 14(1):3041–3071, 2013.

[HJM+09] Patrik O Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pages 689–696, 2009.

[JMZ+12] Dominik Janzing, Joris Mooij, Kun Zhang, Jan Lemeire, Jakob Zscheischler, Povilas Daniušis, Bastian Steudel, and Bernhard Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31, 2012.

[KDV17] Murat Kocaoglu, Alex Dimakis, and Sriram Vishwanath. Cost-optimal learning of causal graphs.
In International Conference on Machine Learning, pages 1875–1884, 2017.

[KF09] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

[KM17] Adam Klivans and Raghu Meka. Learning graphical models using multiplicative weights. arXiv preprint arXiv:1706.06274, 2017.

[KSB17] Murat Kocaoglu, Karthikeyan Shanmugam, and Elias Bareinboim. Experimental design for learning causal graphs with latent variables. In Advances in Neural Information Processing Systems, pages 7021–7031, 2017.

[KT06] Changsung Kang and Jin Tian. Inequality constraints in causal models with hidden variables. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 233–240. AUAI Press, 2006.

[LH13] Sanghack Lee and Vasant Honavar. m-Transportability: Transportability of a causal effect from multiple environments. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, July 14–18, 2013, Bellevue, Washington, USA, 2013.

[LR06] Erich L Lehmann and Joseph P Romano. Testing statistical hypotheses. Springer Science & Business Media, 2006.

[MBS+15] Evan Z. Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R. Bialas, Nolan Kamitaki, Emily M. Martersteck, John J. Trombetta, David A. Weitz, Joshua R. Sanes, Alex K. Shalek, Aviv Regev, and Steven A. McCarroll. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, May 2015.

[MMLM06] Stijn Meganck, Sam Maes, Philippe Leray, and Bernard Manderick. Learning semi-Markovian causal models using experiments. In Proceedings of the Third European Workshop on Probabilistic Graphical Models (PGM), 2006.

[Mos09] Robin A Moser. A constructive proof of the Lovász local lemma.
In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 343–350. ACM, 2009.

[MT10] Robin A Moser and Gábor Tardos. A constructive proof of the general Lovász local lemma. Journal of the ACM (JACM), 57(2):11, 2010.

[Nea04] Richard E Neapolitan. Learning Bayesian networks, volume 38. Pearson Prentice Hall, Upper Saddle River, NJ, 2004.

[NN90] J. Naor and M. Naor. Small-bias probability spaces: Efficient constructions and applications. In Proceedings of the Twenty-second Annual ACM Symposium on Theory of Computing, STOC '90, pages 213–223, New York, NY, USA, 1990. ACM.

[PB11] J. Pearl and E. Bareinboim. Transportability of causal and statistical relations: A formal approach. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence (AAAI-11), pages 247–254, Menlo Park, CA, August 7–11, 2011.

[Pea95] Judea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.

[Pea09] Judea Pearl. Causality. Cambridge University Press, 2009.

[PJS11] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2436–2450, 2011.

[PV95] Judea Pearl and Thomas S Verma. A theory of inferred causation. In Studies in Logic and the Foundations of Mathematics, volume 134, pages 789–811. Elsevier, 1995.

[Rey09] Lev Reyzin. Active learning of interaction networks. 2009.

[SDLC93] David J Spiegelhalter, A Philip Dawid, Steffen L Lauritzen, and Robert G Cowell. Bayesian analysis in expert systems. Statistical Science, pages 219–247, 1993.

[SGS00] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search.
MIT Press, 2000.

[SKDV15] Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, pages 3195–3203, 2015.

[SMR99] P Spirtes, C Meek, and T Richardson. An algorithm for causal inference in the presence of latent variables and selection bias. In Computation, Causation, and Discovery, 1999.

[SP08] I. Shpitser and J. Pearl. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9:1941–1979, 2008.

[SPP+05] Karen Sachs, Omar Perez, Dana Pe'er, Douglas A. Lauffenburger, and Garry P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.

[SS08] Richard Scheines and Peter Spirtes. Causal structure search: Philosophical foundations and problems, 2008.

[SS16] Leonard J. Schulman and Piyush Srivastava. Stability of causal inference. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI '16, pages 666–675, Arlington, Virginia, United States, 2016. AUAI Press.

[SSS+17] Rajat Sen, Ananda Theertha Suresh, Karthikeyan Shanmugam, Alexandros G Dimakis, and Sanjay Shakkottai. Model-powered conditional independence test. In Advances in Neural Information Processing Systems, pages 2955–2965, 2017.

[SW12] Narayana P Santhanam and Martin J Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134, 2012.

[Tia02] Jin Tian. Studies in causal reasoning and learning. University of California, Los Angeles, 2002.

[TP02] Jin Tian and Judea Pearl. On the testable implications of causal models with hidden variables.
In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, UAI '02, pages 519–527, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.

[VMLC16] Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction screening: Efficient and sample-optimal learning of Ising models. In Advances in Neural Information Processing Systems, pages 2595–2603, 2016.

[VP90] Thomas Verma and Judea Pearl. Causal networks: Semantics and expressiveness. In Proceedings of the Fourth Annual Conference on Uncertainty in Artificial Intelligence, UAI '88, pages 69–78, Amsterdam, The Netherlands, 1990. North-Holland Publishing Co.

[VP92] Thomas Verma and Judea Pearl. An algorithm for deciding if a set of observed independencies has a causal explanation. In Uncertainty in Artificial Intelligence, 1992, pages 323–330. Elsevier, 1992.

[Wri21] Sewall Wright. Correlation and causation. Journal of Agricultural Research, 20(7):557–585, 1921.

[WSYU17] Yuhao Wang, Liam Solus, Karren Yang, and Caroline Uhler. Permutation-based causal inference algorithms with interventions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5822–5831. Curran Associates, Inc., 2017.

[YKU18] Karren Yang, Abigail Katcoff, and Caroline Uhler. Characterizing and learning equivalence classes of causal DAGs under interventions. arXiv preprint arXiv:1802.06310, 2018.

[Zha08] Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16-17):1873–1896, 2008.

[ZPJS12] Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery.
arXiv preprint arXiv:1202.3775, 2012.