{"title": "Learning Bayesian networks with ancestral constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 2325, "page_last": 2333, "abstract": "We consider the problem of learning Bayesian networks optimally, when subject to background knowledge in the form of ancestral constraints. Our approach is based on a recently proposed framework for optimal structure learning based on non-decomposable scores, which is general enough to accommodate ancestral constraints. The proposed framework exploits oracles for learning structures using decomposable scores, which cannot accommodate ancestral constraints since they are non-decomposable. We show how to empower these oracles by passing them decomposable constraints that they can handle, which are inferred from ancestral constraints that they cannot handle. Empirically, we demonstrate that our approach can be orders-of-magnitude more efficient than alternative frameworks, such as those based on integer linear programming.", "full_text": "Learning Bayesian networks\nwith ancestral constraints\n\nEunice Yuh-Jie Chen and Yujia Shen and Arthur Choi and Adnan Darwiche\n\n{eyjchen,yujias,aychoi,darwiche}@cs.ucla.edu\n\nComputer Science Department\n\nUniversity of California\nLos Angeles, CA 90095\n\nAbstract\n\nWe consider the problem of learning Bayesian networks optimally, when subject\nto background knowledge in the form of ancestral constraints. Our approach is\nbased on a recently proposed framework for optimal structure learning based on\nnon-decomposable scores, which is general enough to accommodate ancestral\nconstraints. The proposed framework exploits oracles for learning structures using\ndecomposable scores, which cannot accommodate ancestral constraints since they\nare non-decomposable. We show how to empower these oracles by passing them\ndecomposable constraints that they can handle, which are inferred from ancestral\nconstraints that they cannot handle. 
Empirically, we demonstrate that our approach can be orders-of-magnitude more efficient than alternative frameworks, such as those based on integer linear programming.

1 Introduction

Bayesian networks learned from data are broadly used for classification, clustering, feature selection, and to determine associations and dependencies between random variables, in addition to discovering causes and effects; see, e.g., [Darwiche, 2009, Koller and Friedman, 2009, Murphy, 2012].

In this paper, we consider the task of learning Bayesian networks optimally, subject to background knowledge in the form of ancestral constraints. Such constraints are important in practice as they allow one to assert direct or indirect cause-and-effect relationships (or lack thereof) between random variables. Further, one expects that their presence should improve the efficiency of the learning process, as they reduce the size of the search space. However, nearly all mainstream approaches for optimal structure learning make a fundamental assumption: that the scoring function (i.e., the prior and likelihood) is decomposable. This in turn limits their ability to integrate ancestral constraints, which are non-decomposable. Such approaches only support structure-modular constraints, such as the presence or absence of edges, or order-modular constraints, such as pairwise constraints on topological orderings; see, e.g., [Koivisto and Sood, 2004, Parviainen and Koivisto, 2013].

Recently, a new framework has been proposed for optimal Bayesian network structure learning [Chen et al., 2015], but with non-decomposable priors and scores. This approach is based on navigating the seemingly intractable search space over all network structures (i.e., all DAGs). This intractability can be mitigated, however, by leveraging an omniscient oracle that can optimally learn structures with decomposable scores. 
This approach led to the first system for finding optimal DAGs (i.e., model selection) given order-modular priors (a type of non-decomposable prior) [Chen et al., 2015]. The approach was also applied to the enumeration of the k-best structures [Chen et al., 2015, 2016], where it was orders-of-magnitude more efficient than the existing state of the art [Tian et al., 2010, Cussens et al., 2013, Chen and Tian, 2014].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In this paper, we show how to incorporate non-decomposable constraints into the structure learning approach of Chen et al. [2015, 2016]. We consider learning with ancestral constraints, and inferring decomposable constraints from ancestral constraints to empower the oracle. In principle, structure learning approaches based on integer linear programming (ILP) and constraint programming (CP) can also represent ancestral constraints (and other non-decomposable constraints) [Jaakkola et al., 2010, Bartlett and Cussens, 2015, van Beek and Hoffmann, 2015].1 We empirically evaluate the proposed approach against those based on ILP, showing orders-of-magnitude improvements.

This paper is organized as follows. In Section 2, we review the problem of Bayesian network structure learning. In Section 3, we discuss ancestral constraints and how they relate to existing structure learning approaches. In Section 4, we introduce our approach for learning with ancestral constraints. In Section 5, we show how to infer decomposable constraints from non-decomposable ancestral constraints. We evaluate our approach empirically in Section 6, and conclude in Section 7.

2 Technical preliminaries

We use upper case letters X to denote variables and bold-face upper case letters X to denote sets of variables. We use X to denote a variable in a Bayesian network and U to denote its parents.

In score-based approaches to structure learning, we are given a complete dataset D and want to learn a DAG G that optimizes a decomposable score, which aggregates scores over the DAG families XU:

    score(G | D) = ∑_{XU} score(XU | D)    (1)

The MDL and BDeu scores are examples of decomposable scores; see, e.g., Darwiche [2009], Koller and Friedman [2009], Murphy [2012]. The seminal K2 algorithm is one of the first algorithms to exploit decomposable scores [Cooper and Herskovits, 1992]. The K2 algorithm optimizes Equation 1, but assumes that a DAG G is consistent with a given topological ordering σ. This assumption decomposes the structure learning problem into independent sub-problems, where we find the optimal set of parents for each variable X from those variables that precede X in the ordering σ.

We can find the DAG G that optimizes Equation 1 by running the K2 algorithm on all n! variable orderings σ, and then taking the DAG with the best score. Note that these n! instances share many computational sub-problems: finding the optimal set of parents for some variable X. One can aggregate these common sub-problems, leaving us with only n · 2^(n−1) unique sub-problems. This technique underlies a number of modern approaches to score-based structure learning, including some based on dynamic programming [Koivisto and Sood, 2004, Singh and Moore, 2005, Silander and Myllymäki, 2006], and related approaches based on heuristic search methods such as A* [Yuan et al., 2011, Yuan and Malone, 2013]. This aggregation of K2 sub-problems also corresponds to a search space called the order graph [Yuan et al., 2011, Yuan and Malone, 2013].

Bayesian network structure learning can also be formulated using integer linear programming (ILP), with Equation 1 as the linear objective function of an ILP. 
Further, for each variable X and candidate parent set U, we introduce an ILP variable I(X, U) ∈ {0, 1}: I(X, U) = 1 represents the event that X has parents U, and I(X, U) = 0 otherwise. We then assert constraints that each variable X has a unique set of parents, ∑_U I(X, U) = 1. Another set of constraints ensures that all variables X and their parents U yield an acyclic graph. One approach is to use cluster constraints [Jaakkola et al., 2010], where for each cluster C ⊆ X, at least one variable X in C has no parents in C: ∑_{X ∈ C} ∑_{U: U ∩ C = ∅} I(X, U) ≥ 1. Finally, we have the objective function of our ILP, ∑_{X ∈ X} ∑_{U ⊆ X \ {X}} score(XU | D) · I(X, U), which corresponds to Equation 1.

3 Ancestral constraints

An ancestral constraint specifies a relation between two variables X and Y in a DAG G. If X is an ancestor of Y, then there is a directed path connecting X to Y in G. If X is not an ancestor of Y, then there is no such path. Ancestral constraints can be used, for example, to express background knowledge in the form of cause-and-effect relations between variables. When X is an ancestor of Y, we have a positive ancestral constraint, denoted X ⇝ Y. When X is not an ancestor of Y, we have a

1To our knowledge, however, the ILP and CP approaches have not been previously evaluated in terms of their efficacy in structure learning with ancestral constraints.

(a) A BN graph    (b) An EC tree

Figure 1: Bayesian network search spaces for the set of variables X = {X1, X2, X3}.

negative ancestral constraint, denoted X ⇝̸ Y. In this case, there is no directed path from X to Y, but there may still be a directed path from Y to X. Positive ancestral constraints are transitive, i.e., if X ⇝ Y and Y ⇝ Z then X ⇝ Z. 
Negative ancestral constraints are not transitive.

Ancestral constraints are non-decomposable since we cannot, in general, check whether an ancestral constraint is satisfied or violated by independently checking the parents of each variable. For example, consider an optimal DAG, compatible with the ordering ⟨X1, X2, X3⟩, given the family scores:

X    U                          score
X1   {}                         1
X2   {}, {X1}                   1, 2
X3   {}, {X1}, {X2}, {X1, X2}   10, 10, 1, 10

The optimal DAG (with minimal score) in this case contains the single edge X2 → X3, with X1 disconnected. If we assert the ancestral constraint X1 ⇝ X3, then the optimal DAG is X1 → X2 → X3. Yet, we cannot enforce this ancestral constraint using independent, local constraints on the parents that each variable can take. In particular, the choice of parents for variable X2 and the choice of parents for variable X3 jointly determine whether X1 is an ancestor of X3. Hence, the K2 algorithm and approaches based on the order graph (dynamic programming and heuristic search) cannot enforce ancestral constraints. These approaches, however, can enforce decomposable constraints, such as the presence or absence of an edge U → X, or a limit on the size of a family XU. Interestingly, one can infer some decomposable constraints from non-decomposable ones. We discuss this technique extensively later, showing how it can have a significant impact on the efficiency of structure search.

Structure learning approaches based on ILP can in principle enforce non-decomposable constraints, when these can be encoded as linear constraints. In fact, ancestral relations have been employed in ILPs and other formalisms to enforce a graph's acyclicity; see, e.g., [Cussens, 2008]. However, to our knowledge, these approaches have not been evaluated for learning structures with ancestral constraints. 
We provide such an empirical evaluation in Section 6.2

4 Learning with constraints

In this section, we review two recently proposed search spaces for learning Bayesian networks: the BN graph and the EC tree [Chen et al., 2015, 2016]. We subsequently show how we can adapt the EC tree to facilitate the learning of Bayesian network structures under ancestral constraints.

4.1 BN graphs

The BN graph is a search space for learning structures with non-decomposable scores [Chen et al., 2015]. Figure 1(a) shows a BN graph over 3 variables, where nodes represent DAGs over different subsets of variables. A directed edge Gi → Gj, labeled by a family XU, exists in the BN graph iff Gj can be obtained from Gi by adding a leaf node X with parents U. Each edge has a cost, corresponding to the score of the family XU, as in Equation 1. Hence, a path from the root G0 to a DAG Gn yields the score of the DAG, score(Gn | D). As a result, the shortest path in the BN graph (the one with the lowest score) corresponds to an optimal DAG, as in Equation 1.

2We also make note of Borboudakis and Tsamardinos [2012], which uses ancestral constraints (path constraints) for constraint-based learning methods, such as the PC algorithm. Borboudakis and Tsamardinos [2013] further proposes a prior based on path beliefs (soft constraints), evaluated using greedy local search.

Unlike the order graph, the BN graph explicitly represents all possible DAGs. Hence, ancestral constraints can be easily integrated by pruning the search space, i.e., by pruning away those DAGs that do not satisfy the given constraints. 
Consider Figure 1(a) and the ancestral constraint X1 ⇝ X2. Since the DAG over X1 and X2 with no edges violates the constraint, we can prune this node, along with all of its descendants, as the descendants must also violate an ancestral constraint (adding new leaves to a DAG will not undo a violated ancestral constraint). Finding a shortest path in this pruned search space will yield an optimal Bayesian network satisfying a given set of ancestral constraints.

We can use A* search to find a shortest path in a BN graph. A* is a best-first search algorithm that uses an evaluation function f to guide the search. For a given DAG G, we have the evaluation function f(G) = g(G) + h(G), where g(G) is the actual cost to reach G from the root G0, and h(G) is the estimated cost to reach a leaf from G. A* search is guaranteed to find a shortest path when the heuristic function h is admissible, i.e., it does not over-estimate. Chen et al. [2015, 2016] showed that a heuristic function can be induced by any learning algorithm that takes a (partial) DAG as input and returns an optimal DAG that extends it. Learning systems based on the order graph fall in this category and can be viewed as powerful oracles that help us navigate the BN graph. We employed URLEARNING as an oracle in our experiments [Yuan and Malone, 2013]. We will later show how to empower this oracle by passing it decomposable constraints that we infer from a set of non-decomposable ancestral constraints; the impact of this empowerment turns out to be dramatic.

4.2 EC trees

The EC tree is a recently proposed search space that improves the BN graph along two dimensions [Chen et al., 2016]. First, it merges Markov-equivalent nodes in the BN graph. Second, it canonizes the resulting EC graph into a tree, where each node is reachable by a unique path from the root.

Two network structures are Markov equivalent iff they have the same undirected skeleton and the same v-structures. 
A Markov equivalence class can be represented by a completed, partially directed acyclic graph (CPDAG). The set of structures represented by a CPDAG P is denoted by class(P) and may contain exponentially many Markov equivalent structures.

Figure 1(b) illustrates an EC tree over 3 variables, where nodes represent CPDAGs over different subsets of variables. A directed edge Pi → Pj, labeled by a family XU, exists in the EC tree iff there exists a DAG Gj ∈ class(Pj) that can be obtained from a DAG Gi ∈ class(Pi) by adding a leaf node X with parents U, but where X must be the largest variable in Gj (according to some canonical ordering). Each edge of an EC tree has a cost score(XU | D), so the shortest path in the EC tree corresponds to an optimal equivalence class of Bayesian networks.

4.3 EC trees and ancestral constraints

A DAG G satisfies a set of ancestral constraints A (both over the same set of variables) iff the DAG G satisfies each constraint in A. Moreover, a CPDAG P satisfies A iff there exists a DAG G ∈ class(P) that satisfies A. We enforce ancestral constraints by pruning a CPDAG node P from an EC tree when P does not satisfy the constraints A. First, consider an ancestral constraint X1 ⇝̸ X2. A CPDAG P containing a directed path from X1 to X2 violates the constraint, as every structure in class(P) contains a path from X1 to X2. Next, consider an ancestral constraint X1 ⇝ X2. A CPDAG P with no partially directed paths from X1 to X2 violates the given constraint, as no structure in class(P) contains a path from X1 to X2.3 Given a CPDAG P, we first test for these two cases, which can be done efficiently. If these tests are inconclusive, we exhaustively enumerate the structures of class(P) to check whether any of them satisfies the given constraints. If not, we can prune P and its descendants from the EC tree. 
The soundness of this pruning step is due to the following.

Theorem 1 In an EC tree, a CPDAG P satisfies ancestral constraints A, both over the same set of variables X, iff its descendants satisfy A.

5 Projecting constraints

In this section, we show how one can project non-decomposable ancestral constraints onto decomposable edge and ordering constraints. For example, if G is a set of DAGs satisfying a set of ancestral constraints A, we want to find the edges that appear in all DAGs of G. These projected constraints can then be used to improve the efficiency of structure learning. Recall (from Section 4.1) that our approach to structure learning uses a heuristic function that utilizes an optimal structure learning algorithm for decomposable scores (the oracle). We tighten this heuristic (empower the oracle) by passing to it projected edge and ordering constraints, leading to a more efficient search when we are subject to non-decomposable ancestral constraints.

3A partially directed path from X to Y consists of undirected edges and directed edges oriented towards Y.

Given a set of ancestral constraints A, we shall show how to infer new edge and ordering constraints that we can utilize to empower our oracle. For the case of edge constraints, we propose a simple algorithm that can efficiently enumerate all inferrable edge constraints. For the case of ordering constraints, we propose a reduction to MaxSAT that can find a maximally large set of ordering constraints that can be jointly inferred from ancestral constraints.

5.1 Edge constraints

We now propose an algorithm for finding all edge constraints that can be inferred from a set of ancestral constraints A. We consider (decomposable) constraints on the presence of an edge, or the absence of an edge. 
We refer to edge presence constraints as positive constraints, denoted by X → Y, and refer to edge absence constraints as negative constraints, denoted by X ↛ Y.

We let E denote a set of edge constraints. We further let G(A) denote the set of DAGs G over the variables X that satisfy all ancestral constraints in the set A, and let G(E) denote the set of DAGs G that satisfy all edge constraints in E. Given a set of ancestral constraints A, we say that A entails a positive edge constraint X → Y iff G(A) ⊆ G(X → Y), and that A entails a negative edge constraint X ↛ Y iff G(A) ⊆ G(X ↛ Y). For example, consider the four DAGs over the variables X, Y and Z that satisfy the ancestral constraints X ⇝̸ Z and Y ⇝ Z.

[four example DAGs, shown as diagrams in the original]

First, we note that no DAG above contains the edge X → Z, since this would immediately violate the constraint X ⇝̸ Z. Next, no DAG above contains the edge X → Y. Suppose instead that this edge appeared; since Y ⇝ Z, we can infer X ⇝ Z, which contradicts the existing constraint X ⇝̸ Z. Hence, we can infer the negative edge constraint X ↛ Y. Finally, no DAG above contains the edge Z → Y, since this would lead to a directed cycle with the constraint Y ⇝ Z.

Before we present our algorithm for inferring edge constraints, we first revisit some properties of ancestral constraints that we will need. Note that given a set of ancestral constraints, we may be able to infer additional ancestral constraints. First, given two constraints X ⇝ Y and Y ⇝ Z, we can infer an additional ancestral constraint X ⇝ Z (by transitivity of ancestral relations). 
Second, if adding a path X ⇝ Y would create a directed cycle (e.g., if Y ⇝ X exists in A), or if it would violate an existing negative ancestral constraint (e.g., if X ⇝̸ Z and Y ⇝ Z exist in A), then we can infer a new negative constraint X ⇝̸ Y. By using a few rules based on the examples above, we can efficiently enumerate all of the ancestral constraints that are entailed by a given set of ancestral constraints (details omitted for space). Hence, we shall subsequently assume that a given set of ancestral constraints A already includes all ancestral constraints that can be entailed from it. We then refer to A as a maximum set of ancestral constraints.

We now consider how to infer edge constraints from a (maximum) set of ancestral constraints A. First, let α(X) be the set that consists of X and every X′ such that X′ ⇝ X ∈ A, and let β(X) be the set that consists of X and every X′ such that X ⇝ X′ ∈ A. In other words, α(X) contains X and all nodes that are constrained to be ancestors of X by A, i.e., each X′ ∈ α(X) is either X or an ancestor of X, for all DAGs G ∈ G(A). Similarly, β(X) contains X and all nodes that are constrained to be descendants of X by A.

First, we can check whether a negative edge constraint X ↛ Y is entailed by A by enumerating all possible Xa ⇝̸ Yb for all Xa ∈ α(X) and all Yb ∈ β(Y). If any Xa ⇝̸ Yb is in A, then we know that A entails X ↛ Y. That is, since Xa ⇝ X and Y ⇝ Yb, if there were a DAG G ∈ G(A) with the edge X → Y, then G would also have a path from Xa to Yb. Hence, we can infer X ↛ Y. 
This idea is summarized by the following theorem:

Theorem 2 Given a maximum set of ancestral constraints A, A entails the negative edge constraint X ↛ Y iff A contains a constraint Xa ⇝̸ Yb for some Xa ∈ α(X) and Yb ∈ β(Y).

Next, suppose that both (1) A dictates that X can reach Y, and (2) A dictates that there is no path from X to Z to Y, for any other variable Z. In this case, we can infer a positive edge constraint X → Y. We can again verify whether X → Y is entailed by A by enumerating all relevant candidates Z, based on the following theorem.

Theorem 3 Given a maximum set of ancestral constraints A, A entails the positive edge constraint X → Y iff A contains X ⇝ Y and, for all Z ∉ α(X) ∪ β(Y), the set A contains a constraint Xa ⇝̸ Zb or Za ⇝̸ Yb, where Xa ∈ α(X), Zb ∈ β(Z), Za ∈ α(Z) and Yb ∈ β(Y).

5.2 Topological ordering constraints

We next consider constraints on the topological orderings of a DAG. An ordering satisfies a constraint X < Y iff X appears before Y in the ordering. Further, an ordering constraint X < Y is compatible with a DAG G iff there exists a topological ordering of DAG G that satisfies the constraint X < Y. The negation of an ordering constraint X < Y is the ordering constraint Y < X. A given ordering satisfies either X < Y or Y < X, but not both at the same time. A DAG G may be compatible with both X < Y and Y < X through two different topological orderings.

We let O denote a set of ordering constraints, and let G(O) denote the set of DAGs G that are compatible with each ordering constraint in O. The task of determining whether a set of ordering constraints O is entailed by a set of ancestral constraints A, i.e., whether G(A) ⊆ G(O), is more subtle than the case of edge constraints. 
For example, consider the set of ancestral constraints A = {Z ⇝̸ Y, X ⇝̸ Z}. We can infer the ordering constraint Y < Z from the first constraint Z ⇝̸ Y, and Z < X from the second constraint X ⇝̸ Z.4 If we were to assume both ordering constraints, we could infer the third ordering constraint Y < X by transitivity. However, consider the following DAG G which satisfies A: the DAG with the single edge X → Y and with Z disconnected. This DAG is compatible with the constraint Y < Z as well as the constraint Z < X, but it is not compatible with the constraint Y < X. Consider the three topological orderings of the DAG G: ⟨X, Y, Z⟩, ⟨X, Z, Y⟩ and ⟨Z, X, Y⟩. We see that none of the orderings satisfies both ordering constraints at the same time. Hence, assuming both ordering constraints at the same time eliminates all topological orderings of the DAG G, and hence the DAG itself. Consider another example over the variables W, X, Y and Z with the set of ancestral constraints A = {W ⇝̸ Z, Y ⇝̸ X}. The following DAG G satisfies A: the DAG with the edges W → X and Y → Z. However, inferring the ordering constraints Z < W and X < Y from the ancestral constraints of A leads to a cycle in the above DAG (W < X < Y < Z < W), hence eliminating the DAG.

Hence, for a given set of ancestral constraints A, we want to infer from it a set O of ordering constraints that is as large as possible, but without eliminating any DAGs satisfying A. Roughly, this involves inferring ordering constraints X < Y from ancestral constraints Y ⇝̸ X, as long as the ordering constraints do not induce a cycle. We propose to encode the problem as an instance of MaxSAT [Li and Manyà, 2009]. 
Given a maximum set of ancestral constraints A, we construct a MaxSAT instance where propositional variables represent ordering constraints and ancestral constraints (true if the constraint is present, and false otherwise). The clauses encode the ancestral constraints, as well as constraints to ensure acyclicity. By maximizing the set of satisfied clauses, we then maximize the set of constraints X < Y selected. In turn, the (decomposable) ordering constraints can be used to empower an oracle during structure search. Our MaxSAT problem includes hard constraints (1-3), as well as soft constraints (4):

1. transitivity of orderings: for all X < Y, Y < Z: (X < Y) ∧ (Y < Z) ⇒ (X < Z)
2. a necessary condition for orderings: for all X < Y: (X < Y) ⇒ (Y ⇝̸ X)
3. a sufficient condition for acyclicity: for all X < Y and Z < W: (X < Y) ∧ (Z < W) ⇒ (X ⇝ Y) ∨ (Z ⇝ W) ∨ (X ⇝ Z) ∨ (Y ⇝ W) ∨ (X ⇝ W) ∨ (Y ⇝̸ Z)
4. infer orderings from ancestral constraints: for all X ⇝̸ Y in A: (X ⇝̸ Y) ⇒ (Y < X)

4To see this, consider any DAG G satisfying Z ⇝̸ Y. We can construct another DAG G′ from G by adding the edge Y → Z, since adding such an edge does not introduce a directed cycle. 
As a result, every topological ordering of G′ satisfies Y < Z, and G(Z ⇝̸ Y) ⊆ G(Y < Z).

Table 1: Time (in sec) used by the EC tree and GOBNILP to find optimal networks. < is less than 0.01 sec. n is the variable number, N is the dataset size, p is the percentage of the ancestral constraints.

Table 2: Time t (in sec) used by the EC tree and GOBNILP to find optimal networks, without any projected constraints, using a 32G memory and 2 hour time limit. s is the percentage of test cases that finish.

We remark that the above constraints are sufficient for finding a set of ordering constraints O that is entailed by a set of ancestral constraints A, which is formalized in the following theorem.

Theorem 4 Given a maximum set of ancestral constraints A, let O be a closed set of ordering constraints. The set O is entailed by A if O satisfies the following two statements:

1. for all X < Y in O, A contains Y ⇝̸ X
2. 
for all X < Y and Z < W in O, where X, Y, Z and W are distinct, A contains at least one of X ⇝ Y, Z ⇝ W, X ⇝ Z, Y ⇝ W, X ⇝ W, Y ⇝̸ Z.

6 Experiments

We now empirically evaluate the effectiveness of our approach to learning with ancestral constraints. We simulated different structure learning problems from the standard Bayesian network benchmarks5 ALARM, ANDES, CHILD, CPCS54, and HEPAR2, by (1) taking a random sub-network N of a given size,6 (2) simulating a training dataset from N of varying sizes, and (3) simulating a set of ancestral constraints of a given size, by randomly selecting ordered pairs whose ground-truth ancestral relations in N were used as constraints. In our experiments, we varied the number of variables in the learning problem (n), the size of the training dataset (N), and the percentage of the n(n − 1)/2 total ancestral relations that were given as constraints (p). We report results that were averaged over 50 different datasets: 5 datasets were simulated from each of 2 different sub-networks, which were taken from each of the 5 original networks mentioned above. Our experiments were run on a 2.67GHz Intel Xeon X5650 CPU. We assumed BDeu scores with an equivalent sample size of 1. We further pre-computed the scores of candidate parent sets, which were fed as input into each system evaluated. Finally, we used the EVASOLVER partial MaxSAT solver for inferring ordering constraints.7

In our first set of experiments, we compared our approach with the ILP-based system GOBNILP,8 where we encoded ancestral constraints using linear constraints, based on [Cussens, 2008]; note again that both are exact approaches for structure learning. In Table 1, we supplied both systems with decomposable constraints inferred via projection (which empowers the oracle for searching the EC tree, and provides redundant constraints for the ILP). 
In Table 2, we withheld the projected constraints.

5 The networks used in our experiments are available at http://www.bnlearn.com/bnrepository
6 We select random sets of nodes and all their ancestors, up to a connected sub-network of a given size.
7 Available at http://www.maxsat.udl.cat/14/solvers/eva500a__
8 Available at http://www.cs.york.ac.uk/aig/sw/gobnilp

Table 3: Time t (in sec) used by the EC tree to find optimal networks, under a 32GB memory and 2 hour time limit; < denotes less than 0.01 sec. n is the variable number, N is the dataset size, p is the percentage of the ancestral constraints, s is the percentage of test cases that finish, and ∆ is the edge difference between the learned and true networks. [The table's numeric columns, for n = 18 and n = 20 with N ∈ {512, 2048, 8192} and p from 0.00 to 1.00, were scrambled by extraction and are omitted.]

In Table 1, our approach is consistently orders-of-magnitude faster than GOBNILP, for almost all values of n, N and p that we varied.
This difference increased with the number of variables n.9 When we compare Table 2 to Table 1, we see that for the EC tree, the projection of constraints has a significant impact on the efficiency of learning (often by several orders of magnitude). For ILP, there is some mild overhead with a smaller number of variables (n = 12), but with a larger number of variables (n = 14), there were consistent improvements when projected constraints were used.

Next, we evaluate (1) how introducing ancestral constraints affects the efficiency of search, and (2) how scalable our approach is as we increase the number of variables in the learning problem. In Table 3, we report results where we varied the number of variables n ∈ {16, 18, 20}, and imposed a 2 hour time limit and a 32GB memory limit. First, we observe an easy-hard-easy trend as we increase the proportion p of ancestral constraints. When p is small, the learning problem is close to the unconstrained problem, and our oracle serves as an accurate heuristic. When p is large, the problem is highly constrained, and the search space is significantly reduced. In contrast, the ILP approach more consistently became easier as more constraints were provided (see Table 1). As expected, the learning problem becomes more challenging when we increase the number of variables n, and when less training data is available. We note that our approach scales to n = 20 variables here, which is comparable to the scalability of modern score-based approaches reported in the literature for BDeu scores; e.g., Yuan and Malone [2013] reported results for up to 26 variables.

Table 3 also reports the average structural Hamming distance ∆ between the learned network and the ground-truth network used to generate the data.
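The edge difference ∆ reported in Table 3 is a structural Hamming distance over directed edge sets. A minimal sketch of one common variant, in which missing, extra, and reversed edges each count once; whether this is the exact variant used in the paper is an assumption:

```python
def edge_difference(learned, true):
    """Structural Hamming distance between two DAGs, each given as a set
    of directed edges (u, v).  An edge present in only one graph counts
    as one difference, and a reversed edge counts as one (not two)."""
    diff = 0
    for (u, v) in learned:
        if (u, v) not in true and (v, u) not in true:
            diff += 1  # extra edge in the learned DAG
    for (u, v) in true:
        if (u, v) not in learned and (v, u) not in learned:
            diff += 1  # edge missing from the learned DAG
        elif (v, u) in learned and (u, v) not in learned:
            diff += 1  # edge present but reversed (counted once)
    return diff
```

Under this definition, identical DAGs give ∆ = 0, and a single reversed edge gives ∆ = 1.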
We see that the learned model becomes more accurate as the dataset size N and the proportion p of available constraints increase.10 We remark that a relatively small number of ancestral constraints (say 10%–25%) can have a similar impact on the quality of the learned network (relative to the ground-truth) as increasing the amount of data available from 512 to 2048, or from 2048 to 8192. This highlights the impact that background knowledge can have, in contrast to collecting more (potentially expensive) training data.

7 Conclusion

We proposed an approach for learning the structure of Bayesian networks optimally, subject to ancestral constraints. These constraints are non-decomposable, posing a particular difficulty for learning approaches based on decomposable scores. We utilized a search space for structure learning with non-decomposable scores, called the EC tree, and employed an oracle that optimizes decomposable scores. We proposed a sound and complete method for pruning the EC tree, based on ancestral constraints. We also showed how the employed oracle can be empowered by passing it decomposable constraints inferred from the non-decomposable ancestral constraints. Empirically, we showed that our approach is orders-of-magnitude more efficient compared to learning systems based on ILP.

Acknowledgments

This work was partially supported by NSF grant #IIS-1514253 and ONR grant #N00014-15-1-2339.

9 When no limits are placed on the sizes of families (as was done here), heuristic-search approaches (like ours) have been observed to scale better than ILP approaches [Yuan and Malone, 2013, Malone et al., 2014].
10 ∆ can be greater than 0 when p = 1, as there may be many DAGs that respect a set of ancestral constraints. For example, the DAG X → Y → Z expresses the same ancestral relations after adding the edge X → Z.

References

M. Bartlett and J. Cussens.
Integer linear programming for the Bayesian network structure learning problem. Artificial Intelligence, 2015.

G. Borboudakis and I. Tsamardinos. Incorporating causal prior knowledge as path-constraints in Bayesian networks and maximal ancestral graphs. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 2012.

G. Borboudakis and I. Tsamardinos. Scoring and searching over Bayesian networks with causal and associative priors. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 2013.

E. Y.-J. Chen, A. Choi, and A. Darwiche. Learning optimal Bayesian networks with DAG graphs. In Proceedings of the 4th IJCAI Workshop on Graph Structures for Knowledge Representation and Reasoning (GKR'15), 2015.

E. Y.-J. Chen, A. Choi, and A. Darwiche. Enumerating equivalence classes of Bayesian networks using EC graphs. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics, 2016.

Y. Chen and J. Tian. Finding the k-best equivalence classes of Bayesian network structures for model averaging. In Proceedings of the Twenty-Eighth Conference on Artificial Intelligence, 2014.

G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.

J. Cussens. Bayesian network learning by compiling to weighted MAX-SAT. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI), pages 105–112, 2008.

J. Cussens, M. Bartlett, E. M. Jones, and N. A. Sheehan. Maximum likelihood pedigree reconstruction using integer linear programming. Genetic Epidemiology, 37(1):69–83, 2013.

A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.

T. Jaakkola, D. Sontag, A. Globerson, and M. Meila.
Learning Bayesian network structure using LP relaxations. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 358–365, 2010.

M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

C.-M. Li and F. Manyà. MaxSAT, hard and soft constraints. In Handbook of Satisfiability, pages 613–631. 2009.

B. Malone, K. Kangas, M. Järvisalo, M. Koivisto, and P. Myllymäki. Predicting the hardness of learning Bayesian networks. In Proceedings of the Twenty-Eighth Conference on Artificial Intelligence, 2014.

K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

P. Parviainen and M. Koivisto. Finding optimal Bayesian networks using precedence constraints. Journal of Machine Learning Research, 14(1):1387–1415, 2013.

T. Silander and P. Myllymäki. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 445–452, 2006.

A. P. Singh and A. W. Moore. Finding optimal Bayesian networks by dynamic programming. Technical report, CMU-CALD-050106, 2005.

J. Tian, R. He, and L. Ram. Bayesian model averaging using the k-best Bayesian network structures. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 589–597, 2010.

P. van Beek and H. Hoffmann. Machine learning of Bayesian networks using constraint programming. In Proceedings of the 21st International Conference on Principles and Practice of Constraint Programming (CP), pages 429–445, 2015.

C. Yuan and B. Malone. Learning optimal Bayesian networks: A shortest path perspective.
Journal of Artificial Intelligence Research, 48:23–65, 2013.

C. Yuan, B. Malone, and X. Wu. Learning optimal Bayesian networks using A* search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pages 2186–2191, 2011.