{"title": "Learning Causal Graphs with Small Interventions", "book": "Advances in Neural Information Processing Systems", "page_first": 3195, "page_last": 3203, "abstract": "We consider the problem of learning causal networks with interventions, when each intervention is limited in size under Pearl's Structural Equation Model with independent errors (SEM-IE). The objective is to minimize the number of experiments to discover the causal directions of all the edges in a causal graph. Previous work has focused on the use of separating systems for complete graphs for this task. We prove that any deterministic adaptive algorithm needs to be a separating system in order to learn complete graphs in the worst case. In addition, we present a novel separating system construction, whose size is close to optimal and is arguably simpler than previous work in combinatorics. We also develop a novel information theoretic lower bound on the number of interventions that applies in full generality, including for randomized adaptive learning algorithms. For general chordal graphs, we derive worst case lower bounds on the number of interventions. Building on observations about induced trees, we give a new deterministic adaptive algorithm to learn directions on any chordal skeleton completely. In the worst case, our achievable scheme is an $\\alpha$-approximation algorithm where $\\alpha$ is the independence number of the graph. We also show that there exist graph classes for which the sufficient number of experiments is close to the lower bound. In the other extreme, there are graph classes for which the required number of experiments is multiplicatively $\\alpha$ away from our lower bound. 
In simulations, our algorithm almost always performs very close to the lower bound, while the approach based on separating systems for complete graphs is significantly worse for random chordal graphs.", "full_text": "Learning Causal Graphs with Small Interventions\n\nKarthikeyan Shanmugam1, Murat Kocaoglu2, Alexandros G. Dimakis3, Sriram Vishwanath4\n\n1karthiksh@utexas.edu, 2mkocaoglu@utexas.edu, 3dimakis@austin.utexas.edu, 4sriram@ece.utexas.edu\n\nDepartment of Electrical and Computer Engineering\n\nThe University of Texas at Austin, USA\n\nAbstract\n\nWe consider the problem of learning causal networks with interventions, when each intervention is limited in size under Pearl's Structural Equation Model with independent errors (SEM-IE). The objective is to minimize the number of experiments to discover the causal directions of all the edges in a causal graph. Previous work has focused on the use of separating systems for complete graphs for this task. We prove that any deterministic adaptive algorithm needs to be a separating system in order to learn complete graphs in the worst case. In addition, we present a novel separating system construction, whose size is close to optimal and is arguably simpler than previous work in combinatorics. We also develop a novel information theoretic lower bound on the number of interventions that applies in full generality, including for randomized adaptive learning algorithms.\n\nFor general chordal graphs, we derive worst case lower bounds on the number of interventions. Building on observations about induced trees, we give a new deterministic adaptive algorithm to learn directions on any chordal skeleton completely. In the worst case, our achievable scheme is an α-approximation algorithm where α is the independence number of the graph. We also show that there exist graph classes for which the sufficient number of experiments is close to the lower bound. 
In the other extreme, there are graph classes for which the required number of experiments is multiplicatively α away from our lower bound.\n\nIn simulations, our algorithm almost always performs very close to the lower bound, while the approach based on separating systems for complete graphs is significantly worse for random chordal graphs.\n\n1 Introduction\n\nCausality is a fundamental concept in sciences and philosophy. The mathematical formulation of a theory of causality in a probabilistic sense has received significant attention recently (e.g. [1–5]). A formulation advocated by Pearl considers structural equation models: in this framework, X is a cause of Y if Y can be written as f(X, E), for some deterministic function f and some latent random variable E. Given two causally related variables X and Y, it is not possible to infer whether X causes Y or Y causes X from random samples, unless certain assumptions are made on the distribution of E and/or on f [6, 7]. For more than two random variables, directed acyclic graphs (DAGs) are the most common tool used for representing causal relations. For a given DAG D = (V, E), the directed edge (X, Y) ∈ E shows that X is a cause of Y.\n\nIf we make no assumptions on the data generating process, the standard way of inferring the causal directions is by performing experiments, the so-called interventions. An intervention requires modifying the process that generates the random variables: the experimenter has to enforce values on the random variables. This process is different from conditioning, as explained in detail in [1].\n\nThe natural problem to consider is therefore minimizing the number of interventions required to learn a causal DAG. Hauser et al. [2] developed an efficient algorithm that minimizes this number in the worst case. 
The algorithm is based on optimal coloring of chordal graphs and requires at most log χ interventions to learn any causal graph, where χ is the chromatic number of the chordal skeleton.\n\nHowever, one important open problem appears when one also considers the size of the used interventions: each intervention is an experiment where the scientist must force a set of variables to take random values. Unfortunately, the interventions obtained in [2] can involve up to n/2 variables. The simultaneous enforcing of many variables can be quite challenging in many applications: for example in biology, some variables may not be enforceable at all or may require complicated genomic interventions for each parameter.\n\nIn this paper, we consider the problem of learning a causal graph when intervention sizes are bounded by some parameter k. The first work we are aware of for this problem is by Eberhardt et al. [3], where an achievable scheme is provided. Furthermore, [8] shows that the set of interventions to fully identify a causal DAG must satisfy a specific set of combinatorial conditions called a separating system1, when the intervention size is not constrained or is 1. In [4], with the assumption that the same holds true for any intervention size, Hyttinen et al. draw connections between causality and known separating system constructions. One open problem is: if the learning algorithm is adaptive after each intervention, is a separating system still needed or can one do better? It was believed that adaptivity does not help in the worst case [8] and that one still needs a separating system.\n\nOur Contributions: We obtain several novel results for learning causal graphs with interventions bounded by size k. The problem can be separated into the special case where the underlying undirected graph (the skeleton) is the complete graph and the more general case where the underlying undirected graph is chordal.\n\n1. 
For complete graph skeletons, we show that any adaptive deterministic algorithm needs an (n, k) separating system. This implies that lower bounds for separating systems also hold for adaptive algorithms and resolves the previously mentioned open problem.\n\n2. We present a novel combinatorial construction of a separating system that is close to the previous lower bound. This simple construction may be of more general interest in combinatorics.\n\n3. Recently, [5] showed that randomized adaptive algorithms need only log log n interventions with high probability for the unbounded case. We extend this result and show that O((n/k) log log k) interventions of size bounded by k suffice with high probability.\n\n4. We present a more general information theoretic lower bound of n/(2k) to capture the performance of such randomized algorithms.\n\n5. We extend the lower bound for adaptive algorithms to general chordal graphs. We show that over all orientations, the number of experiments from a (χ(G), k) separating system is needed, where χ(G) is the chromatic number of the skeleton graph.\n\n6. We show two extremal classes of graphs. For one of them, the interventions through a (χ, k) separating system are sufficient. For the other class, we need α(χ − 1)/(2k) ≈ n/(2k) experiments in the worst case.\n\n7. We exploit the structural properties of chordal graphs to design a new deterministic adaptive algorithm that uses the idea of separating systems together with adaptability to Meek rules. We simulate our new algorithm and empirically observe that it performs quite close to the (χ, k) separating system. Our algorithm requires much fewer interventions compared to (n, k) separating systems.\n\n2 Background and Terminology\n\n2.1 Essential graphs\n\nA causal DAG D = (V, E) is a directed acyclic graph where V = {x1, x2, ..., xn} is a set of random variables and (x, y) ∈ E is a directed edge if and only if x is a direct cause of y. 
We adopt Pearl's structural equation model with independent errors (SEM-IE) in this work (see [1] for more details).\n\n1A separating system is a 0-1 matrix with n distinct columns and each row has at most k ones.\n\nVariables in S ⊆ V cause xi if xi = f({xj}j∈S, ei), where ei is a random variable independent of all other variables.\n\nThe causal relations of D imply a set of conditional independence (CI) relations between the variables. A conditional independence relation is of the following form: given Z, the set X and the set Y are conditionally independent for some disjoint subsets of variables X, Y, Z. Due to this, causal DAGs are also called causal Bayesian networks. A set V of variables is Bayesian with respect to a DAG D if the joint probability distribution of V can be factorized as a product of marginals of every variable conditioned on its parents.\n\nAll the CI relations that are learned statistically through observations can also be inferred from the Bayesian network using a graphical criterion called d-separation [9], assuming that the distribution is faithful to the graph2. Two causal DAGs are said to be Markov equivalent if they encode the same set of CIs. Two causal DAGs are Markov equivalent if and only if they have the same skeleton3 and the same immoralities4. The class of causal DAGs that encode the same set of CIs is called the Markov equivalence class. We denote the Markov equivalence class of a DAG D by [D]. The graph union5 of all DAGs in [D] is called the essential graph of D. It is denoted E(D). E(D) is always a chain graph with chordal6 chain components7 [11].\n\nThe d-separation criterion can be used to identify the skeleton and all the immoralities of the underlying causal DAG [9]. Additional edges can be identified using the fact that the underlying DAG is acyclic and there are no more immoralities. 
Meek derived 3 local rules (Meek rules), introduced in [12], to be recursively applied to identify every such additional edge (see Theorem 3 of [13]). The repeated application of Meek rules on this partially directed graph with identified immoralities, until they can no longer be used, yields the essential graph.\n\n2.2 Interventions and Active Learning\n\nGiven a set of variables V = {x1, ..., xn}, an intervention on a set S ⊂ V of the variables is an experiment where the performer forces each variable s ∈ S to take the value of another independent (from other variables) variable u, i.e., s = u. This operation, and how it affects the joint distribution, is formalized by the do operator by Pearl [1]. An intervention modifies the causal DAG D as follows: the post intervention DAG D{S} is obtained by removing the connections of nodes in S to their parents. The size of an intervention S is the number of intervened variables, i.e., |S|. Let Sc denote the complement of the set S.\n\nCI-based learning algorithms can be applied to D{S} to identify the set of removed edges, i.e., parents of S [9], and the remaining adjacent edges in the original skeleton are declared to be the children. Hence,\n\n(R0) The orientations of the edges of the cut between S and Sc in the original DAG D can be inferred.\n\nThen, 4 local Meek rules (introduced in [12]) are repeatedly applied to the original DAG D with the new directions learnt from the cut, until no more directed edges can be identified. Further application of CI-based algorithms on D will reveal no more information. The Meek rules are given below:\n\n(R1) (a − b) is oriented as (a → b) if ∃c s.t. (c → a) and (c, b) ∉ E.\n(R2) (a − b) is oriented as (a → b) if ∃c s.t. (a → c) and (c → b).\n(R3) (a − b) is oriented as (a → b) if ∃c, d s.t. (a − c), (a − d), (c → b), (d → b) and (c, d) ∉ E.\n\n2Given a Bayesian network, any CI relation implied by d-separation holds true. 
All the CIs implied by the distribution can be found using d-separation if the distribution is faithful. Faithfulness is a widely accepted assumption, since it is known that only a measure zero set of distributions are not faithful [10].\n\n3The skeleton of a DAG is the undirected graph obtained when directed edges are converted to undirected edges.\n4An induced subgraph on X, Y, Z is an immorality if X and Y are disconnected, X → Z and Y → Z.\n5The graph union of two DAGs D1 = (V, E1) and D2 = (V, E2) with the same skeleton is a partially directed graph D = (V, E), where (va, vb) ∈ E is undirected if the edges (va, vb) in E1 and E2 have different directions, and directed as va → vb if the edges (va, vb) in E1 and E2 are both directed as va → vb.\n6An undirected graph is chordal if it has no induced cycle of length greater than 3.\n7This means that E(D) can be decomposed as a sequence of undirected chordal graphs G1, G2, ..., Gm (chain components) such that there is a directed edge from a vertex in Gi to a vertex in Gj only if i < j.\n\n(R4) (a − c) is oriented as (a → c) if ∃b, d s.t. (b → c), (a − d), (a − b), (d → b) and (c, d) ∉ E.\n\nThe concepts of essential graphs and Markov equivalence classes are extended in [14] to incorporate the role of interventions: let I = {I1, I2, ..., Im} be a set of interventions and let the above process be followed after each intervention. The interventional Markov equivalence class (I equivalence) of a DAG is the set of DAGs that represent the same set of probability distributions obtained when the above process is applied after every intervention in I. It is denoted by [D]I. Similar to the observational case, the I essential graph of a DAG D is the graph union of all DAGs in the same I equivalence class; it is denoted by EI(D). We have the following sequence:\n\nD → CI learning → Meek rules → E(D) → I1 → (a) learn by R0 → (b) Meek rules → E{I1}(D) → I2 → ... → E{I1,I2}(D) → ...    (1)\n\nTherefore, after a set of interventions I has been performed, the essential graph EI(D) is a graph with some oriented edges that captures all the causal relations we have discovered so far, using I. Before any interventions happened, E(D) captures the initially known causal directions. It is known that EI(D) is a chain graph with chordal chain components. Therefore, when all the directed edges are removed, the graph becomes a set of disjoint chordal graphs.\n\n2.3 Problem Definition\n\nWe are interested in the following question:\nProblem 1. Given that all interventions in I are of size at most k < n/2 variables, i.e., |I| ≤ k, ∀I ∈ I, minimize the number of interventions |I| such that the partially directed graph with all directions learned so far satisfies EI(D) = D.\n\nThe question is the design of an algorithm that computes the small set of interventions I given E(D). Note, of course, that the unknown directions of the edges D are not available to the algorithm. One can view the design of I as an active learning process to find D from the essential graph E(D). E(D) is a chain graph with undirected chordal components and it is known that interventions on one chain component do not affect the discovery process of directed edges in the other components [15]. So we will assume that E(D) is undirected and a chordal graph to start with. Our notion of algorithm does not consider the time complexity (of statistical algorithms involved) of steps a and b in (1). Given m interventions, we only consider efficiently computing Im+1 using (possibly) the graph E{I1,...,Im}. We consider the following three classes of algorithms:\n\n1. Non-adaptive algorithm: The choice of I is fixed prior to the discovery process.\n2. Adaptive algorithm: At every step m, the choice of Im+1 is a deterministic function of E{I1,...,Im}(D).\n3. Randomized adaptive algorithm: At every step m, the choice of Im+1 is a random function of E{I1,...,Im}(D).\n\nThe problem is different for complete graphs versus more general chordal graphs, since rule R1 becomes applicable when the graph is not complete. Thus we give a separate treatment for each case. First, we provide algorithms for all three cases for learning the directions of complete graphs E(D) = Kn (undirected complete graph) on n vertices. Then, we generalize to chordal graph skeletons and provide a novel adaptive algorithm with upper and lower bounds on its performance. The missing proofs of the results that follow can be found in the Appendix.\n\n3 Complete Graphs\n\nIn this section, we consider the case where the skeleton we start with, i.e. E(D), is an undirected complete graph (denoted Kn). It is known that at any stage in (1) starting from E(D), rules R1, R3 and R4 do not apply. Further, the underlying DAG D is a directed clique. The directed clique is characterized by an ordering σ on [1 : n] such that, in the subgraph induced by σ(i), σ(i + 1), ..., σ(n), σ(i) has no incoming edges. Let D be denoted by K̃n(σ) for some ordering σ. Let [1 : n] denote the set {1, 2, ..., n}. We need the following results on separating systems for our first result regarding adaptive and non-adaptive algorithms for a complete graph.\n\n3.1 Separating System\n\nDefinition 1. [16, 17] An (n, k)-separating system on an n element set [1 : n] is a set of subsets S = {S1, S2, ..., Sm} such that |Si| ≤ k and for every pair i, j there is a subset S ∈ S such that either i ∈ S, j ∉ S or j ∈ S, i ∉ S. If a pair i, j satisfies the above condition with respect to S, then S is said to separate the pair i, j. Here, we consider the case when k < n/2.\n\nIn [16], Katona gave an (n, k)-separating system together with a lower bound on |S|. In [17], Wegener gave a simpler argument for the lower bound and also provided a tighter upper bound than the one in [16]. 
In this work, we give a different construction below, where the separating system size is at most ⌈log⌈n/k⌉ n⌉ larger than the construction of Wegener. However, our construction has a simpler description.\n\nLemma 1. There is a labeling procedure that produces distinct ℓ length labels for all elements in [1 : n] using letters from the integer alphabet {0, 1, ..., a} where ℓ = ⌈loga n⌉. Further, in every digit (or position), any integer letter is used at most ⌈n/a⌉ times.\n\nOnce we have a set of n string labels as in Lemma 1, our separating system construction is straightforward.\n\nTheorem 1. Consider an alphabet A = [0 : ⌈n/k⌉] of size ⌈n/k⌉ + 1 where k < n/2. Label every element of an n element set using a distinct string of letters from A of length ℓ = ⌈log⌈n/k⌉ n⌉ using the procedure in Lemma 1 with a = ⌈n/k⌉. For every 1 ≤ i ≤ ℓ and 1 ≤ j ≤ ⌈n/k⌉, choose the subset Si,j of vertices whose string's i-th letter is j. The set of all such subsets S = {Si,j} is a k-separating system on n elements and |S| ≤ ⌈n/k⌉⌈log⌈n/k⌉ n⌉.\n\n3.2 Adaptive algorithms: Equivalence to a Separating System\n\nConsider any non-adaptive algorithm that designs a set of interventions I, each of size at most k, to discover K̃n(σ). In the worst case over all σ, I has to be a separating system. This is already known. Now, we prove the necessity of a separating system for deterministic adaptive algorithms in the worst case.\n\nTheorem 2. Let there be an adaptive deterministic algorithm A that designs the set of interventions I such that the final graph learnt satisfies EI(D) = K̃n(σ) for any ground truth ordering σ, starting from the initial skeleton E(D) = Kn. Then, there exists a σ such that A designs an I which is a separating system.\n\nThe theorem above is independent of the individual intervention sizes. Therefore, we have the following theorem, which is a direct corollary of Theorem 2:\n\nTheorem 3. 
In the worst case over σ, any adaptive or non-adaptive deterministic algorithm on the DAG K̃n(σ) has to be such that (n/k) log⌈n/k⌉ n ≤ |I|. There is a feasible I with |I| ≤ (⌈n/k⌉ − 1)⌈log⌈n/k⌉ n⌉.\n\nProof. By Theorem 2, we need a separating system in the worst case, and the lower and upper bounds are from [16, 17].\n\n3.3 Randomized Adaptive Algorithms\n\nIn this section, we show that the total number of variable accesses needed to fully identify the complete causal DAG is Ω(n).\n\nTheorem 4. To fully identify a complete causal DAG K̃n(σ) on n variables using size-k interventions, n/(2k) interventions are necessary. Also, the total number of variables accessed is at least n/2.\n\nThe lower bound in Theorem 4 is information theoretic. We now give a randomized algorithm that requires O((n/k) log log k) experiments in expectation. We provide a straightforward generalization of [5], where the authors gave a randomized algorithm for unbounded intervention size.\n\nTheorem 5. Let E(D) be Kn and the experiment size k = n^r for some 0 < r < 1. Then there exists a randomized adaptive algorithm which designs an I such that EI(D) = D with high probability, and |I| = O((n/k) log log k) in expectation.\n\n4 General Chordal Graphs\n\nIn this section, we turn to interventions on a general DAG G. After the initial stages in (1), E(G) is a chain graph with chordal chain components. There are no further immoralities throughout the graph. In this work, we focus on one of the chordal chain components. Thus the DAG D we work on is assumed to be a directed graph with no immoralities and whose skeleton E(D) is chordal. We are interested in recovering D from E(D) using interventions of size at most k following (1).\n\n4.1 Bounds for Chordal skeletons\n\nWe provide a lower bound for both adaptive and non-adaptive deterministic schemes for a chordal skeleton E(D). Let χ(E(D)) be the coloring number of the given chordal graph. 
Since chordal graphs are perfect, it is the same as the clique number.\n\nTheorem 6. Given a chordal E(D), in the worst case over all DAGs D (which have skeleton E(D) and no immoralities), if every intervention is of size at most k, then |I| ≥ (χ(E(D))/k) log⌈χ(E(D))/k⌉ χ(E(D)) for any adaptive and non-adaptive algorithm with EI(D) = D.\n\nUpper bound: Clearly, the separating system based algorithm of Section 3 can be applied to the vertices in the chordal skeleton E(D) and it is possible to find all the directions. Thus, |I| ≤ (n/k) log⌈n/k⌉ n. This with the lower bound implies an α-approximation algorithm (since (n/k) log⌈n/k⌉ n ≤ α(E(D)) (χ(E(D))/k) log⌈χ(E(D))/k⌉ χ(E(D)), under a mild assumption χ(E(D)) ≤ n/e).\n\nRemark: The separating system on n nodes gives an α approximation. However, the new algorithm in Section 4.3 exploits chordality and performs much better empirically. It is possible to show that our heuristic also has an α approximation guarantee, but we skip that.\n\n4.2 Two extreme counter examples\n\nWe provide two classes of chordal skeletons G: one for which a number of interventions close to the lower bound is sufficient, and the other for which the number of interventions needed is very close to the upper bound.\n\nTheorem 7. There exist chordal skeletons such that for any algorithm with intervention size constraint k, the number of interventions |I| required is at least α(χ − 1)/(2k), where α and χ are the independence number and chromatic number respectively. There exist chordal graph classes such that |I| = ⌈χ/k⌉⌈log⌈χ/k⌉ χ⌉ is sufficient.\n\n4.3 An Improved Algorithm using Meek Rules\n\nIn this section, we design an adaptive deterministic algorithm that anticipates Meek rule R1 usage along with the idea of a separating system. We evaluate this experimentally on random chordal graphs. 
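Since the algorithm of this section is built around anticipating Meek rule R1, a minimal sketch of R1 propagation may help fix ideas. This is our own illustration, not the paper's implementation; the function name `apply_r1` and the edge representation are hypothetical choices:

```python
# Hypothetical sketch: repeated application of Meek rule R1 on a partially
# directed graph, illustrating how one root intervention orients a whole
# directed tree with no immoralities (cf. Lemma 2 below).

def adjacent(a, b, undirected, directed):
    # True if a and b are joined by an edge of either kind.
    return frozenset((a, b)) in undirected or (a, b) in directed or (b, a) in directed

def apply_r1(undirected, directed):
    """R1: orient a-b as a->b if some c->a exists with c and b non-adjacent.
    undirected: set of frozensets {a, b}; directed: set of (tail, head)."""
    undirected, directed = set(undirected), set(directed)
    changed = True
    while changed:
        changed = False
        for e in list(undirected):
            a, b = tuple(e)
            for (x, y) in ((a, b), (b, a)):
                if any(h == x and not adjacent(t, y, undirected, directed)
                       for (t, h) in directed):
                    undirected.discard(e)
                    directed.add((x, y))
                    changed = True
                    break
    return undirected, directed

# Directed path 0 -> 1 -> 2 -> 3 rooted at 0: intervening on node 0 reveals
# the edge 0 -> 1 (rule R0); R1 then orients the rest of the path.
skeleton = {frozenset((i, i + 1)) for i in range(3)}
und, dirs = apply_r1(skeleton - {frozenset((0, 1))}, {(0, 1)})
print(sorted(dirs))  # [(0, 1), (1, 2), (2, 3)]
```

The fixpoint loop mirrors the "repeated application till nothing more can be learnt" used throughout the paper; on trees only R1 fires, which is why a single well-chosen intervention can orient large subtrees.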
First, we make a few observations on learning connected directed trees T from the skeleton E(T) (undirected trees are chordal) that do not have immoralities, using Meek rule R1, where every intervention is of size k = 1. Because the tree has no cycle, Meek rules R2–R4 do not apply.\n\nLemma 2. Every node in a directed tree with no immoralities has at most one incoming edge. There is a root node with no incoming edges, and intervening on that node alone identifies the whole tree using repeated application of rule R1.\n\nLemma 3. If every intervention in I is of size at most 1, learning all directions on a directed tree T with no immoralities can be done adaptively with at most |I| ≤ O(log2 n), where n is the number of vertices in the tree. The algorithm runs in time poly(n).\n\nLemma 4. Given any chordal graph and a valid coloring, the graph induced by any two color classes is a forest.\n\nIn the next section, we combine the above single intervention adaptive algorithm on directed trees, which uses Meek rules, with the non-adaptive separating system approach.\n\n4.3.1 Description of the algorithm\n\nThe key motivation behind the algorithm is that a pair of color classes induces a forest (Lemma 4). Choosing the right node to intervene on leaves only a small subtree unlearnt, as in the proof of Lemma 3. In subsequent steps, suitable nodes in the remaining subtrees can be chosen until all edges are learnt. We give a brief description of the algorithm below.\n\nLet G denote the initial undirected chordal skeleton E(D) and let χ be its coloring number. Consider a (χ, k) separating system S = {Si}. To intervene on the actual graph, an intervention set Ii corresponding to Si is chosen. We would like to intervene on a node of color c ∈ Si.\n\nConsider a node v of color c. Now, we attach a score P(v, c) as follows. For any color c′ ∉ Si, consider the induced forest F(c, c′) on the color classes c and c′ in G. 
Consider the tree T(v, c, c′) containing node v in F. Let d(v) be the degree of v in T. Let T1, T2, ..., Td(v) be the resulting disjoint trees after node v is removed from T. If v is intervened on, according to the proof of Lemma 3: a) all edge directions in all trees Ti except one of them would be learnt when applying Meek rules and rule R0; b) all the directions from v to all its neighbors would be found.\n\nThe score is taken to be the total number of edge directions guaranteed to be learnt in the worst case. Therefore, the score P(v) is: P(v) = Σc′∉Si (|T(v, c, c′)| − max1≤j≤d(v) |Tj|). The node with the highest score among the color class c is used for the intervention Ii. After intervening on Ii, all the edges whose directions are known through Meek rules (by repeated application till nothing more can be learnt) and R0 are deleted from G. Once S is processed, we recolor the sparser graph G. We find a new S with the new chromatic number on G and the above procedure is repeated. The exact hybrid algorithm is described in Algorithm 1.\n\nTheorem 8. Given an undirected chordal skeleton G of an underlying directed graph with no immoralities, Algorithm 1 ends in finite time and it returns the correct underlying directed graph. The algorithm has runtime complexity polynomial in n.\n\nAlgorithm 1 Hybrid Algorithm using Meek rules with separating system\n1: Input: Chordal graph skeleton G = (V, E) with no immoralities.\n2: Initialize G⃗(V, Ed = ∅) with n nodes and no directed edges. Initialize time t = 1.\n3: while E ≠ ∅ do\n4: Color the chordal graph G with χ colors. ▷ Standard algorithms exist to do it in linear time\n5: Initialize color set C = {1, 2, ..., χ}. 
Form a (χ, min(k, ⌈χ/2⌉)) separating system S such that |S| ≤ k, ∀S ∈ S.\n6: for i = 1 until |S| do\n7: Initialize intervention It = ∅.\n8: for c ∈ Si and every node v in color class c do\n9: Consider F(c, c′), T(c, c′, v) and {Tj} (as per definitions in Sec. 4.3.1).\n10: Compute: P(v, c) = Σc′∈C∩Si^c (|T(c, c′, v)| − max1≤j≤d(v) |Tj|).\n11: end for\n12: if k ≤ χ/2 then\n13: It = It ∪c∈Si {argmaxv:P(v,c)≠0 P(v, c)}.\n14: else\n15: It = It ∪c∈Si {first ⌊k/⌈χ/2⌉⌋ nodes v with largest nonzero P(v, c)}.\n16: end if\n17: t = t + 1\n18: Apply R0 and Meek rules using Ed and E after intervention It. Add newly learnt directed edges to Ed and delete them from E.\n19: end for\n20: Remove all nodes which have degree 0 in G.\n21: end while\n22: return G⃗.\n\n5 Simulations\n\n[Figure 1: two plots of the number of experiments versus the chromatic number χ, for (a) n = 1000, k = 10 and (b) n = 2000, k = 10. Curves: Information Theoretic LB; Max. Clique Sep. Sys. Entropic LB; Max. Clique Sep. Sys. Achievable LB; Our Construction Clique Sep. Sys. LB; Our Heuristic Algorithm; Naive (n,k) Sep. Sys. based Algorithm; Separating System UB.]\n\nFigure 1: n: no. of vertices, k: intervention size bound. The number of experiments is compared between our heuristic and the naive algorithm based on the (n, k) separating system on random chordal graphs. 
The red markers represent the sizes of the (χ, k) separating system. Green circle markers and cyan square markers for the same χ value correspond to the number of experiments required by our heuristic and by the algorithm based on an (n, k) separating system (Theorem 1), respectively, on the same set of chordal graphs. Note that, when n = 1000 and n = 2000, the naive algorithm requires on average about 130 and 260 (close to n/k) experiments respectively, while our algorithm requires at most ≈ 40 (orderwise close to χ/k = 10) when χ = 100.\n\nWe simulate our new heuristic, namely Algorithm 1, on randomly generated chordal graphs and compare it with a naive algorithm that follows the intervention sets given by our (n, k) separating system as in Theorem 1. Both algorithms apply R0 and Meek rules after each intervention according to (1). We plot the following lower bounds: a) Information Theoretic LB of χ/(2k); b) Max. Clique Sep. Sys. Entropic LB, which is the chromatic number based lower bound of Theorem 6. Moreover, we use two known (χ, k) separating system constructions for the maximum clique size as \u201creferences\u201d: the best known (χ, k) separating system is shown by the label Max. Clique Sep. Sys. Achievable LB, and our new simpler separating system construction (Theorem 1) is shown by Our Construction Clique Sep. Sys. LB. As an upper bound, we use the size of the best known (n, k) separating system (without any Meek rules); it is denoted Separating System UB.\n\nRandom generation of chordal graphs: Start with a random ordering σ on the vertices. Consider every vertex starting from σ(n). For each vertex i, (j, i) ∈ E with probability inversely proportional to σ(i) for every j ∈ Si where Si = {v : σ−1(v) < σ−1(i)}. The proportionality constant is changed to adjust the sparsity of the graph. After all such j are considered, make Si ∩ ne(i) a clique by adding edges respecting the ordering σ, where ne(i) is the neighborhood of i. 
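The generation procedure above can be sketched as follows. This is our own minimal reading of it, not the authors' code: the sparsity constant `c`, the use of ordering positions in the edge probability, and the adjacency-set representation are all assumptions.

```python
import random

# Sketch of the random chordal graph generation described above.
# Assumption: edge probability c/(position + 1); `c` controls sparsity.
def random_chordal(n, c=2.0, seed=0):
    rng = random.Random(seed)
    sigma = list(range(n))                       # sigma[pos] = vertex at that position
    rng.shuffle(sigma)
    pos = {v: p for p, v in enumerate(sigma)}    # sigma^{-1}
    adj = {v: set() for v in range(n)}
    for p in range(n - 1, -1, -1):               # consider vertices starting from sigma(n)
        i = sigma[p]
        earlier = sigma[:p]                      # S_i = {v : sigma^{-1}(v) < sigma^{-1}(i)}
        for j in earlier:
            if rng.random() < min(1.0, c / (p + 1)):
                adj[i].add(j); adj[j].add(i)
        # make S_i ∩ ne(i) a clique, respecting the ordering sigma
        nbrs = [v for v in earlier if v in adj[i]]
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                adj[nbrs[a]].add(nbrs[b]); adj[nbrs[b]].add(nbrs[a])
    return sigma, pos, adj

# The clique-completion step makes every vertex's earlier neighbors a clique,
# so sigma is a perfect elimination ordering and the skeleton is chordal.
sigma, pos, adj = random_chordal(30, seed=1)
for i in range(30):
    earlier_nbrs = [v for v in adj[i] if pos[v] < pos[i]]
    for a in earlier_nbrs:
        for b in earlier_nbrs:
            assert a == b or b in adj[a]
print("PEO check passed")
```

The final loop checks exactly the property claimed in the text: the resulting ordering is a perfect elimination ordering, which certifies chordality of the skeleton.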
The resultant graph is a DAG and the corresponding skeleton is chordal. Also, σ is a perfect elimination ordering.

Results: We are interested in comparing our algorithm, and the naive one that depends on the (n, k) separating system, with the size of the (χ, k) separating system, which is roughly Õ(χ/k). Consider values around χ = 100 on the x-axis for the plots with n = 1000, k = 10 and n = 2000, k = 10. Note that our algorithm performs very close to the size of the (χ, k) separating system, i.e. Õ(χ/k). In fact, it is always < 40 in both cases, while the average performance of the naive algorithm goes from 130 (close to n/k = 100) to 260 (close to n/k = 200). The result points to the following: for random chordal graphs, the structured tree search allows us to learn the edges in a number of experiments quite close to the lower bound based only on the maximum clique size, and not on n. The plots for (n, k) = (500, 10) and (n, k) = (2000, 20) are given in the Appendix.

Acknowledgments
The authors acknowledge the support from NSF grants CCF 1344179, 1344364, 1407278, 1422549 and an ARO YIP award (W911NF-14-1-0258). We also thank Frederick Eberhardt for helpful discussions.
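For completeness, here is a small illustrative sketch of one way to build an (n, k) separating system: every set has size at most k, and every pair of elements is split by some set. This is not the labeling of Theorem 1; it is a CRT-style variant of the same idea (label each element by its residues modulo a few pairwise-coprime moduli of size about n/k), which yields a system of the same order, roughly (n/k)·log_{n/k} n sets.

```python
from math import prod

def is_prime(m):
    return m > 1 and all(m % d for d in range(2, int(m ** 0.5) + 1))

def separating_system(n, k):
    """Build an (n, k) separating system on {0, ..., n-1}: every set has
    size <= k, and for every pair i != j some set contains exactly one of
    them.  NOT the paper's Theorem 1 labeling; a CRT-based variant of the
    same labeling idea."""
    a = max(2, -(-n // k))        # ceil(n/k): target alphabet/modulus size
    moduli, m = [], a
    while prod(moduli) < n:       # distinct primes >= a until product covers n
        while not is_prime(m):
            m += 1
        moduli.append(m)
        m += 1
    sets = []
    for p in moduli:
        for s in range(1, p):     # residue-0 sets can be dropped: if two
            # elements differ mod p, at least one residue is nonzero, and
            # that residue's set separates the pair.
            block = [i for i in range(n) if i % p == s]
            if block:
                sets.append(block)
    return sets
```

Each set has at most ⌈n/p⌉ ≤ ⌈n/a⌉ ≤ k elements, and since the product of the moduli is at least n, distinct elements have distinct residue vectors by the Chinese Remainder Theorem, so every pair is separated.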