{"title": "Learning Chordal Markov Networks by Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 2357, "page_last": 2365, "abstract": "We present an algorithm for finding a chordal Markov network that maximizes any given decomposable scoring function. The algorithm is based on a recursive characterization of clique trees, and it runs in O(4^n) time for n vertices. On an eight-vertex benchmark instance, our implementation turns out to be about ten million times faster than a recently proposed, constraint satisfaction based algorithm (Corander et al., NIPS 2013). Within a few hours, it is able to solve instances up to 18 vertices, and beyond if we restrict the maximum clique size. We also study the performance of a recent integer linear programming algorithm (Bartlett and Cussens, UAI 2013). Our results suggest that, unless we bound the clique sizes, currently only the dynamic programming algorithm is guaranteed to solve instances with around 15 or more vertices.", "full_text": "Learning Chordal Markov Networks\n\nby Dynamic Programming\n\nKustaa Kangas\n\nTeppo Niinim\u00a8aki Mikko Koivisto\n\nHelsinki Institute for Information Technology HIIT\n\nDepartment of Computer Science, University of Helsinki\n\n{jwkangas,tzniinim,mkhkoivi}@cs.helsinki.fi\n\nAbstract\n\nWe present an algorithm for \ufb01nding a chordal Markov network that maximizes\nany given decomposable scoring function. The algorithm is based on a recursive\ncharacterization of clique trees, and it runs in O(4n) time for n vertices. On\nan eight-vertex benchmark instance, our implementation turns out to be about\nten million times faster than a recently proposed, constraint satisfaction based\nalgorithm (Corander et al., NIPS 2013). 
Within a few hours, it is able to solve instances up to 18 vertices, and beyond if we restrict the maximum clique size. We also study the performance of a recent integer linear programming algorithm (Bartlett and Cussens, UAI 2013). Our results suggest that, unless we bound the clique sizes, currently only the dynamic programming algorithm is guaranteed to solve instances with around 15 or more vertices.

1 Introduction

Structure learning in Markov networks, also known as undirected graphical models or Markov random fields, has attracted considerable interest in computational statistics, machine learning, and artificial intelligence. Natural score-and-search formulations of the task have, however, proved to be computationally very challenging. For example, Srebro [1] showed that finding a maximum-likelihood chordal (or triangulated or decomposable) Markov network is NP-hard even for networks of treewidth at most 2, in sharp contrast to the treewidth-1 case [2]. Consequently, various approximate approaches and local search heuristics have been proposed [3, 1, 4, 5, 6, 7, 8, 9, 10, 11].

Only very recently, Corander et al. [12] published the first non-trivial algorithm that is guaranteed to find a globally optimal chordal Markov network. It is based on expressing the search space in terms of logical constraints and employing state-of-the-art solver technology equipped with optimization capabilities. To this end, they adopt the usual clique tree, or junction tree, representation of chordal graphs, and work with a particular characterization of clique trees, namely, that for any vertex of the graph the cliques containing that vertex induce a connected subtree in the clique tree. The key idea is to rephrase this property as what they call a balancing condition: for any vertex, the number of cliques that contain it is one larger than the number of edges (that is, intersections of adjacent cliques) that contain it.
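The balancing condition is easy to check mechanically. The following Python sketch illustrates it; the representation and function name are ours, chosen for illustration, and do not come from either paper:

```python
def is_balanced(cliques, tree_edges):
    """Check the balancing condition of Corander et al. [12]: every vertex
    is contained in exactly one more clique than there are clique-tree
    edges whose separator (the intersection of the two endpoint cliques)
    contains it.  `cliques` is a list of frozensets of vertices;
    `tree_edges` is a list of index pairs into `cliques`."""
    vertices = frozenset().union(*cliques)
    for v in vertices:
        in_cliques = sum(1 for c in cliques if v in c)
        in_separators = sum(1 for i, j in tree_edges
                            if v in cliques[i] & cliques[j])
        if in_cliques != in_separators + 1:
            return False
    return True
```

For the clique tree with cliques {0, 1, 2} and {2, 3} joined by a single edge, every vertex satisfies the condition; vertex 2, for instance, lies in two cliques and one separator.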
They show that with appropriate, efficient encodings of the constraints, an eight-vertex instance can be solved to the optimum in a few days of computing, which would have been impossible by a brute-force search. However, while the constraint satisfaction approach enables exploiting the powerful technology, it is currently not clear whether it scales to larger instances.

Here, we investigate an alternative approach to finding an optimal chordal Markov network. Like the work of Corander et al. [12], our algorithm stems from a particular characterization of clique trees of chordal graphs. However, our characterization is quite different, being recursive in nature. It accords with the structure of common scoring functions and so yields a natural dynamic programming algorithm that grows an optimal clique tree by selecting its cliques one by one. In its basic form, the algorithm is very inefficient. Fortunately, the fine structure of the scoring function enables us to further factorize the main dynamic programming step and so bring the time requirement down to O(4^n) for instances with n vertices. We also show that by setting the maximum clique size, equivalently the treewidth (plus one), to w ≤ n/4, the time requirement can be improved to O(3^{n-w} (n choose w)^w).

While our recursive characterization of clique trees and the resulting dynamic programming algorithm are new, they are similar in spirit to a recent work by Korhonen and Parviainen [13]. Their algorithm finds a bounded-treewidth Bayesian network structure that maximizes a decomposable score, running in 3^n n^{w+O(1)} time, where w is the treewidth bound. For large w it thus is superexponentially slower than our algorithm.
The problems solved by the two algorithms are, of course, different: the class of treewidth-w Bayesian networks properly extends the class of treewidth-w chordal Markov networks. There is also more recent work for finding bounded-treewidth Bayesian networks by employing constraint solvers: Berg et al. [14] solve the problem by casting it into maximum satisfiability, while Parviainen et al. [15] cast it into integer linear programming. For unbounded-treewidth Bayesian networks, O(2^n n^2)-time algorithms based on dynamic programming are available [16, 17, 18]. However, none of these dynamic programming algorithms, nor their A* search based variant [19], enables adding the constraints of chordality or bounded width.

But the integer linear programming approach to finding optimal Bayesian networks, especially the recent implementation by Bartlett and Cussens [20], also enables adding the further constraints.[1] We are not aware of any reasonable worst-case bounds for the algorithm's time complexity, nor any previous applications of the algorithm to the problem of learning chordal Markov networks. As a second contribution of this paper, we report on an experimental study of the algorithm's performance, using both synthetic data and some frequently used machine learning benchmark datasets.

The remainder of this article begins by formulating the learning task as an optimization problem. Next we present our recursive characterization of clique trees and a derivation of the dynamic programming algorithm, with a rigorous complexity analysis. The experimental setting and results are reported in a dedicated section. We end with a brief discussion.

2 The problem of learning chordal Markov networks

We adopt the hypergraph treatment of chordal Markov networks. For a gentler presentation and proofs, see Lauritzen and Spiegelhalter [21, Sections 6 and 7], Lauritzen [22], and references therein. Let p be a positive probability function over a product of n state spaces. Let G be an undirected graph on the vertex set V = {1, ..., n}, and call any maximal set of pairwise adjacent vertices of G a clique. Together, G and p form a Markov network if p(x_1, ..., x_n) = ∏_C ψ_C(x_C), where C runs through the cliques of G and each ψ_C is a mapping to positive reals. Here x_C denotes (x_v : v ∈ C).

The factors ψ_C take a particularly simple form when the graph G is chordal, that is, when every cycle of G of length greater than three has a chord, which is an edge of G joining two nonconsecutive vertices of the cycle. The chordality requirement can be expressed in terms of hypergraphs. Consider first an arbitrary hypergraph on V, identified with a collection C of subsets of V such that each element of V belongs to some set in C. We call C reduced if no set in C is a proper subset of another set in C, and acyclic if, in addition, the sets in C admit an ordering C_1, ..., C_m that has the running intersection property: for each 2 ≤ j ≤ m, the intersection S_j = C_j ∩ (C_1 ∪ ... ∪ C_{j-1}) is a subset of some C_i with i < j. We call the sets S_j the separators. The multiset of separators, denoted by S, does not depend on the ordering and is thus unique for an acyclic hypergraph. Now, letting C be the set of cliques of the chordal graph G, it is known that the hypergraph C is acyclic and that each factor ψ_{C_j}(x_{C_j}) can be specified as the ratio p(x_{C_j})/p(x_{S_j}) of marginal probabilities (where we define p(x_{S_1}) = 1). Also the converse holds: by connecting all pairs of vertices within each set of an acyclic hypergraph we obtain a chordal graph.

Given multiple observations over the product state space, the data, we associate with each hypergraph C on V a score s(C) = ∏_{C ∈ C} p(C) / ∏_{S ∈ S} p(S), where the local score p(A) measures the probability (density) of the data projected on A ⊆ V, possibly extended by some structure prior or penalization term. The structure learning problem is to find an acyclic hypergraph C on V that maximizes the score s(C). This formulation covers a Bayesian approach, in which each p(A) is the marginal likelihood for the data on A under a Dirichlet-multinomial model [23, 7, 12], but also the maximum-likelihood formulation, in which each p(A) is the empirical probability of the data on A [23, 1]. Motivated by these instantiations, we will assume that for any given A the value p(A) can be efficiently computed, and we treat the values as the problem input.

Our approach to the problem exploits the fact [22, Prop. 2.27] that a reduced hypergraph C is acyclic if and only if there is a junction tree T for C, that is, an undirected tree on the node set C that has the junction property (JP): for any two nodes A and B in C and any C on the unique path in T between A and B we have A ∩ B ⊆ C. Furthermore, by labeling each edge of T by the intersection of its endpoints, the edge labels amount to the multiset of separators of the hypergraph C.

[1] We thank an anonymous reviewer of an earlier version of this work for noticing this fact, which apparently was not well known in the community, including the authors and reviewers of Corander et al.'s work [12].
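The definitions above translate directly into code. The following Python sketch (our own illustrative naming, not from the paper) computes the separators S_j of an ordering with the running intersection property, and the score as the ratio of clique and separator local scores:

```python
from math import prod  # Python 3.8+

def separators(ordering):
    """Given sets C_1, ..., C_m (frozensets) in an ordering, return the
    multiset of separators S_j = C_j ∩ (C_1 ∪ ... ∪ C_{j-1}), or None
    if the ordering violates the running intersection property."""
    seps, seen = [], frozenset()
    for j, C in enumerate(ordering):
        if j > 0:
            S = C & seen
            if not any(S <= Ci for Ci in ordering[:j]):
                return None  # RIP fails: S is contained in no earlier set
            seps.append(S)
        seen |= C
    return seps

def score(cliques, seps, p):
    """Score of an acyclic hypergraph: the product of local scores of
    the cliques divided by those of the separators (p(∅) should be 1)."""
    return prod(p(C) for C in cliques) / prod(p(S) for S in seps)
```

For the ordering {1, 2, 3}, {2, 3, 4}, {3, 5} the separators are {2, 3} and {3}, matching the edge labels of the corresponding junction tree.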
Thus a junction tree gives the separators explicitly, which motivates us to write s(T) for the respective score s(C) and to solve the structure learning problem by finding a junction tree T over V that maximizes s(T). Here and henceforth, we say that a tree is over a set if the union of the tree's nodes equals the set. As our problem formulation does not explicitly refer to the underlying chordal graph and cliques, we will speak of junction trees instead of the equivalent but semantically more loaded clique trees. From here on, a junction tree refers specifically to a junction tree whose node set is a reduced hypergraph.

3 Recursive characterization and dynamic programming

The score of a junction tree obeys a recursive factorization along subtrees (by rooting the tree at any node), given in Section 3.2 below. While this is the essential structural property of the score for our dynamic programming algorithm, it does not readily yield the needed recurrence for the optimal score. Indeed, we need a characterization of, not a fixed junction tree, but the entire search space of junction trees, that accords with the factorization of the score. We next give such a characterization before we proceed to the derivation and analysis of the dynamic programming algorithm.

3.1 Recursive partition trees

We characterize the set of junction trees by expressing the ways in which they can partition V. The idea is that when any tree of interest is rooted at some node, the subtrees amount to a partition of not only the remaining nodes in the tree (which holds trivially) but also the remaining vertices (contained in the nodes); and the subtrees also satisfy this property. See Figure 1 for an illustration. If T is a tree over a set S, we write C(T) for its node set and V(T) for the union of its nodes, S.
For a family R of subsets of a set S, we say that R is a partition of S, denoted R ⊑ S, if the members of R are non-empty and pairwise disjoint, and their union is S.

Definition 1 (Recursive partition tree, RPT). Let T be a tree over a finite set V, rooted at C ∈ C(T). Denote by C_1, ..., C_k the children of C, by T_i the subtree rooted at C_i, and let R_i = V(T_i) \ C. We say that T is a recursive partition tree (RPT) if it satisfies the following three conditions: (R1) each T_i is a RPT over C_i ∪ R_i, (R2) {R_1, ..., R_k} ⊑ V \ C, and (R3) C ∩ C_i is a proper subset of both C and C_i. We denote by RPT(V, C) the set of all RPTs over V rooted at C.

We now present the following theorems to establish that, when edge directions are ignored, the definitions of junction trees and recursive partition trees are equivalent.

Theorem 1. A junction tree T is a RPT when rooted at any C ∈ C(T).

Theorem 2. A RPT is a junction tree (when considered undirected).

Our proofs of these results will use the following two observations:

Observation 3. A subtree of a junction tree is also a junction tree.

Observation 4. If T is a RPT, so is its every subtree rooted at any C ∈ C(T).

Proof of Theorem 1. Let T be a junction tree over V and consider an arbitrary C ∈ C(T). We show by induction over the number of nodes that T is a RPT when rooted at C. Let C_i, T_i, and R_i be defined as in Definition 1 and consider the three RPT conditions. If C is the only node in T, the conditions hold trivially. Assume they hold up to n - 1 nodes and consider the case |C(T)| = n. We show that each condition holds.

Figure 1: An example of a chordal graph and a corresponding recursive partition. The root node C = {3, 4, 5} (dark grey) partitions the remaining vertices into three disjoint sets R_1 = {0, 1, 2}, R_2 = {6}, and R_3 = {7, 8, 9} (light grey), which are connected to the root node by its child nodes C_1 = {1, 2, 3}, C_2 = {4, 5, 6}, and C_3 = {5, 7}, respectively (medium grey).

(R1) By Observation 3 each T_i is a junction tree and thus, by the induction assumption, a RPT. It remains to show that V(T_i) = C_i ∪ R_i. By definition both C_i ⊆ V(T_i) and R_i ⊆ V(T_i). Thus C_i ∪ R_i ⊆ V(T_i). Assume then that x ∈ V(T_i), i.e. x ∈ C' for some C' ∈ C(T_i). If x ∉ R_i, then by definition x ∈ C. Since C_i is on the path between C and C', by JP x ∈ C_i. Therefore V(T_i) ⊆ C_i ∪ R_i.

(R2) We show that the sets R_i partition V \ C. First, each R_i is non-empty since by definition of reduced hypergraph C_i is non-empty and not contained in C. Second, ∪_i R_i = ∪_i (V(T_i) \ C) = (C ∪ ∪_i V(T_i)) \ C = ∪C(T) \ C = V \ C. Finally, to see that the R_i are pairwise disjoint, assume to the contrary that x ∈ R_i ∩ R_j for distinct R_i and R_j. This implies x ∈ A ∩ B for some A ∈ C(T_i) and B ∈ C(T_j). Now, by JP x ∈ C, which contradicts the definition of R_i.

(R3) Follows by the definition of reduced hypergraph.

Proof of Theorem 2. Assume now that T is a RPT over V. We show that T is a junction tree. To see that T has JP, consider arbitrary A, B ∈ C(T). We show that A ∩ B is a subset of every C ∈ C(T) on the path between A and B.

Consider first the case that A is an ancestor of B and let B = C_1, ..., C_m = A be the path that connects them. We show by induction over m that C_1 ∩ C_m ⊆ C_i for every i = 1, ..., m. The base case m = 1 is trivial. Assume m > 1 and that the claim holds up to m - 1. If i = m, the claim is trivial. Let i < m. Denote by T_{m-1} the subtree rooted at C_{m-1} and let R_{m-1} = V(T_{m-1}) \ C_m. Since C_1 ⊆ V(T_{m-1}) we have that C_1 ∩ C_m = (C_1 ∩ V(T_{m-1})) ∩ C_m = C_1 ∩ (C_m ∩ V(T_{m-1})). By Observation 4, T_{m-1} is a RPT. Therefore, from (R1) it follows that V(T_{m-1}) = C_{m-1} ∪ R_{m-1} and thus C_m ∩ V(T_{m-1}) = (C_m ∩ C_{m-1}) ∪ (C_m ∩ R_{m-1}) = C_m ∩ C_{m-1}. Plugging this above and using the induction assumption, we get C_1 ∩ C_m = C_1 ∩ (C_m ∩ C_{m-1}) ⊆ C_1 ∩ C_{m-1} ⊆ C_i.

Consider now the case that A and B have a least common ancestor C. By Observation 4, the subtree rooted at C is a RPT. Thus, by (R1) and (R2) there are disjoint R and R' such that A ⊆ C ∪ R and B ⊆ C ∪ R'. Thus, A ∩ B ⊆ C, and consequently A ∩ B ⊆ A ∩ C. As we proved above, A ∩ C is a subset of every node on the path between A and C, and therefore A ∩ B is also a subset of every such node. Similarly, A ∩ B is a subset of every node on the path between B and C. Combining these results, we have that A ∩ B is a subset of every node on the path between A and B.

Finally, to see that C(T) is reduced, assume the opposite, that A ⊆ B for distinct A, B ∈ C(T). Let C be the node next to A on the path from A to B. By the initial assumption and JP, A ⊆ A ∩ B ⊆ C. As either A or C is a child of the other, this contradicts (R3) in the subtree rooted at the parent.

3.2 The main recurrence

We want to find a junction tree T over V that maximizes the score s(T). By Theorems 1 and 2 this is equivalent to finding a RPT T that maximizes s(T). Let T be a RPT rooted at C and denote by C_1, ..., C_k the children of C and by T_i the subtree rooted at C_i.
Then, the score factorizes as follows:

s(T) = p(C) ∏_{i=1}^{k} s(T_i) / p(C ∩ C_i).   (1)

To see this, observe that each term of s(T) is associated with a particular node or edge (separator) of T. Thus the product of the s(T_i) consists of exactly the terms of s(T), except for the ones associated with the root C of T and the edges between C and each C_i.

To make use of the above factorization, we introduce suitable constraints under which an optimal tree can be constructed from subtrees that are, in turn, optimal with respect to analogous constraints (cf. Bellman's principle of optimality). Specifically, we define a function f that gives the score of an optimal subtree over any subset of nodes as follows:

Definition 2. For S ⊂ V and ∅ ≠ R ⊆ V \ S, let f(S, R) be the score of an optimal RPT over S ∪ R rooted at a proper superset of S. That is,

f(S, R) = max { s(T) : S ⊂ C ⊆ S ∪ R, T ∈ RPT(S ∪ R, C) }.

Corollary 5. The score of an optimal RPT over V is given by f(∅, V).

We now show that f admits the following recurrence, which shall be used as the basis of our dynamic programming algorithm.

Lemma 6. Let S ⊂ V and ∅ ≠ R ⊆ V \ S. Then

f(S, R) = max_{S ⊂ C ⊆ S ∪ R} max_{{R_1, ..., R_k} ⊑ R \ C} max_{S_1, ..., S_k ⊂ C} p(C) ∏_{i=1}^{k} f(S_i, R_i) / p(S_i).

Proof. We first show inductively that the recurrence is well defined. Assume that the conditions S ⊂ V and ∅ ≠ R ⊆ V \ S hold. Observe that R is non-empty, every set has a partition, and C is selected to be non-empty. Therefore, all three maximizations are over non-empty ranges and it remains to show that the product over i = 1, ..., k is well defined. If |R| = 1, then R \ C = ∅ and the product equals 1 by convention.
Assume now that f(S, R) is defined when |R| < m and consider the case |R| = m. By construction S_i ⊂ V, ∅ ≠ R_i ⊆ V \ S_i, and |R_i| < |R| for every i = 1, ..., k. Thus, by the induction assumption each f(S_i, R_i) is defined and therefore the product is defined.

We now show that the recurrence indeed holds. Let the root C in Definition 2 be fixed and consider the maximization over the trees T. By Definition 1, choosing a tree T ∈ RPT(S ∪ R, C) is equivalent to choosing sets R_1, ..., R_k, sets C_1, ..., C_k, and trees T_1, ..., T_k such that (R0) R_i = V(T_i) \ C, (R1) T_i is a RPT over C_i ∪ R_i rooted at C_i, (R2) {R_1, ..., R_k} ⊑ (S ∪ R) \ C, and (R3) C ∩ C_i is a proper subset of C and C_i.

Observe first that (S ∪ R) \ C = R \ C and therefore (R2) is equivalent to choosing sets R_i such that {R_1, ..., R_k} ⊑ R \ C.

Denote by S_i the intersection C ∩ C_i. We show that together (R0) and (R1) are equivalent to saying that T_i is a RPT over S_i ∪ R_i rooted at C_i. Assume first that the conditions are true. By (R1) it is sufficient to show that C_i ∪ R_i = S_i ∪ R_i. From (R1) it follows that C_i ⊆ V(T_i) and therefore C_i \ C ⊆ V(T_i) \ C, which by (R0) implies C_i \ C ⊆ R_i. This in turn implies C_i ∪ R_i = (C_i ∩ C) ∪ (C_i \ C) ∪ R_i = S_i ∪ R_i. Assume then that T_i is a RPT over S_i ∪ R_i rooted at C_i. Condition (R0) holds since V(T_i) \ C = (S_i ∪ R_i) \ C = (S_i \ C) ∪ (R_i \ C) = ∅ ∪ R_i = R_i. Condition (R1) holds since S_i ⊆ C_i ⊆ V(T_i) = S_i ∪ R_i and thus S_i ∪ R_i = C_i ∪ R_i.

Finally, observe that (R3) is equivalent to first choosing S_i ⊂ C and then C_i ⊃ S_i. By (R1) it must also be that C_i ⊆ V(T_i) = S_i ∪ R_i. Based on these observations, we can now write

f(S, R) = max_{S ⊂ C ⊆ S ∪ R} max_{{R_1, ..., R_k} ⊑ R \ C} max_{S_1, ..., S_k ⊂ C} max_{∀i: S_i ⊂ C_i ⊆ R_i ∪ S_i and T_i a RPT over S_i ∪ R_i rooted at C_i} s(T).

Next we factorize s(T) using the factorization (1) of the score. In addition, once a root C, a partition {R_1, ..., R_k}, and separators {S_1, ..., S_k} have been fixed, each pair (C_i, T_i) can be chosen independently for different i. Thus, the above maximization can be written as

max_{S ⊂ C ⊆ S ∪ R} max_{{R_1, ..., R_k} ⊑ R \ C} max_{S_1, ..., S_k ⊂ C} p(C) ∏_{i=1}^{k} ( (1 / p(S_i)) · max_{S_i ⊂ C_i ⊆ R_i ∪ S_i, T_i ∈ RPT(S_i ∪ R_i, C_i)} s(T_i) ).

By applying Definition 2 to the inner maximization, the claim follows.

3.3 Fast evaluation

The direct evaluation of the recurrence in Lemma 6 would be very inefficient, especially since it involves maximization over all partitions of the vertex set. In order to evaluate it more efficiently, we decompose it into multiple recurrences, each of which can take advantage of dynamic programming. Observe first that we can rewrite the recurrence as

f(S, R) = max_{S ⊂ C ⊆ S ∪ R} max_{{R_1, ..., R_k} ⊑ R \ C} p(C) ∏_{i=1}^{k} h(C, R_i),   (2)

where

h(C, R) = max_{S ⊂ C} f(S, R) / p(S).   (3)

We have simply moved the maximization over S_i ⊂ C inside the product and written each factor using a new function h. Due to how the sets C and R_i are selected, the arguments to h are always non-empty and disjoint subsets of V. In a similar fashion, we can further rewrite recurrence (2) as

f(S, R) = max_{S ⊂ C ⊆ S ∪ R} p(C) g(C, R \ C),   (4)

where we define

g(C, U) = max_{{R_1, ..., R_k} ⊑ U} ∏_{i=1}^{k} h(C, R_i).

Again, note that C and U are disjoint and C is non-empty. If U = ∅, then g(C, U) = 1.
Otherwise,

g(C, U) = max_{∅ ≠ R ⊆ U} h(C, R) · max_{{R_2, ..., R_k} ⊑ U \ R} ∏_{i=2}^{k} h(C, R_i) = max_{∅ ≠ R ⊆ U} h(C, R) g(C, U \ R).   (5)

Thus, we have split the original recurrence into three simpler recurrences (4, 5, 3). We now obtain a straightforward dynamic programming algorithm that evaluates f, g, and h using these recurrences with memoization, and then outputs the score f(∅, V) of an optimal RPT.

3.4 Time and space requirements

We measure the time requirement by the number of basic operations, namely comparisons and arithmetic operations, executed for pairs of real numbers. Likewise, we measure the space requirement by the maximum number of real values stored at any point during the execution of the algorithm. We consider both time and space in the more general setting where the width w ≤ n of the optimal network is restricted by selecting every node (clique) C in recurrence (4) with the constraint |C| ≤ w. We prove the following bounds by counting, for each of the three functions, the associated subset triplets that meet the applicable disjointness, inclusion, and cardinality constraints:

Theorem 7. Let V be a set of size n and w ≤ n. Given the local scores of the subsets of V of size at most w as input, a maximum-score junction tree over V of width at most w can be found using 6 ∑_{i=0}^{w} (n choose i) 3^{n-i} basic operations and having a storage for 3 ∑_{i=0}^{w} (n choose i) 2^{n-i} real numbers.

Proof. To bound the number of basic operations needed, we consider the evaluation of each of the functions f, g, and h using the recurrences (4, 5, 3). Consider first f. Due to memoization, the algorithm executes at most two basic operations (one comparison and one multiplication) per triplet (S, R, C), with S and R disjoint, S ⊂ C ⊆ S ∪ R, and |C| ≤ w.
Subject to these constraints, a set C of size i can be chosen in (n choose i) ways, the set S ⊂ C in at most 2^i ways, and the set R \ C in 2^{n-i} ways. Thus, the number of basic operations needed is at most N_f = 2 ∑_{i=0}^{w} (n choose i) 2^{n-i} 2^i = 2^{n+1} ∑_{i=0}^{w} (n choose i).

Similarly, for h the algorithm executes at most two basic operations per triplet (C, R, S), with now C and R disjoint, |C| ≤ w, and S ⊂ C. A calculation gives the same bound as for f. Finally, consider g. Now the algorithm executes at most two basic operations per triplet (C, U, R), with C and U disjoint, |C| ≤ w, and ∅ ≠ R ⊆ U. A set C of size i can be chosen in (n choose i) ways, and the remaining n - i elements can be assigned into U and its subset R in 3^{n-i} ways. Thus, the number of basic operations needed is at most N_g = 2 ∑_{i=0}^{w} (n choose i) 3^{n-i}. Finally, it is sufficient to observe that there is a j such that (n choose i) 3^{n-i} is larger than (n choose i) 2^n when i ≤ j, and smaller when i > j. Now, because both terms sum up to the same value 4^n when i = 0, ..., n, the bound N_g is always greater than or equal to N_f.

We bound the storage requirement in a similar manner. For each function, the size of the first argument is at most w and the second argument is disjoint from the first, yielding the claimed bound.

Remark 1. For w = n, the bounds for the number of basic operations and storage requirement in Theorem 7 become 6 · 4^n and 3 · 3^n, respectively. When w ≤ n/4, the former bound can be replaced by 6w (n choose w) 3^{n-w}, since (n choose i) 3^{n-i} ≤ (n choose i+1) 3^{n-i-1} if and only if i ≤ (n-3)/4.

Remark 2. Memoization requires indexing with pairs of disjoint sets. Representing sets as integers allows efficient lookups into a two-dimensional array, using O(4^n) space. We can achieve O(3^n) space by mapping a pair of sets (A, B) to ∑_{a=1}^{n} 3^{a-1} I_a(A, B), where I_a(A, B) is 1 if a ∈ A, 2 if a ∈ B, and 0 otherwise. Each pair gets a unique index from 0 to 3^n - 1 into a compact array. A naïve evaluation of the index adds an O(n) factor to the running time. This can be improved to constant amortized time by updating the index incrementally while iterating over sets.

4 Experimental results

We have implemented the presented algorithm in a C++ program Junctor (Junction Trees Optimally Recursively).[2] In the experiments reported below, we compared the performance of Junctor and the integer linear programming based solver GOBNILP by Bartlett and Cussens [20]. While GOBNILP has been tailored for finding an optimal Bayesian network, it enables forbidding the so-called v-structures in the network and, thereby, finding an optimal chordal Markov network, provided that we use the BDeu score, as we have done, or some other special scoring function [23, 24]. We note that when forbidding v-structures, the standard score pruning rules [20, 25] are no longer valid.

Figure 2: The running time of Junctor and GOBNILP as a function of the number of vertices for varying widths w ∈ {3, 4, 5, 6, ∞}, on sparse (top) and dense (bottom) synthetic instances with 100 ("small"), 1000 ("medium"), and 10,000 ("large") data samples. The dashed red line indicates the 4-hour timeout or memout. For GOBNILP shown is the median of the running times on 15 random instances.

We first investigated the performance on synthetic data generated from Bayesian networks of varying size and density.
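Before turning to the experiments, the recurrences (3)-(5) of Section 3.3 admit a direct memoized implementation. The sketch below is a hypothetical Python illustration of ours, not the authors' C++ program Junctor; vertex sets are represented as bitmasks, and a hash-based cache stands in for the O(3^n) array indexing of Remark 2:

```python
from functools import lru_cache

def optimal_score(n, p, w=None):
    """Evaluate f(∅, V) by recurrences (3)-(5).  Vertex sets are bitmasks
    over {0, ..., n-1}; p maps a bitmask A to the local score p(A) and
    must satisfy p(0) = 1.  If w is given, cliques are restricted to at
    most w vertices.  (Illustrative sketch, not the Junctor code.)"""
    full, wmax = (1 << n) - 1, (n if w is None else w)

    def submasks(mask):  # all non-empty submasks of mask
        sub = mask
        while sub:
            yield sub
            sub = (sub - 1) & mask

    @lru_cache(maxsize=None)
    def f(S, R):  # recurrence (4): choose a root C with S ⊂ C ⊆ S ∪ R
        best = 0.0
        for D in submasks(R):          # C = S ∪ D with ∅ ≠ D ⊆ R
            C = S | D
            if bin(C).count('1') <= wmax:
                best = max(best, p(C) * g(C, R & ~C))
        return best

    @lru_cache(maxsize=None)
    def g(C, U):  # recurrence (5): split U into the subtrees' vertex sets
        if U == 0:
            return 1.0
        low = U & -U  # fix the lowest vertex of U into the part R,
        best = 0.0    # so each partition is enumerated only once
        for R in submasks(U):
            if R & low:
                best = max(best, h(C, R) * g(C, U & ~R))
        return best

    @lru_cache(maxsize=None)
    def h(C, R):  # recurrence (3): choose a separator S, proper subset of C
        best, S = 0.0, (C - 1) & C
        while True:
            best = max(best, f(S, R) / p(S))
            if S == 0:
                return best
            S = (S - 1) & C

    return f(0, full)
```

On two vertices with p({0, 1}) = 5 and all other local scores 1, the optimum is the single clique {0, 1} with score 5; bounding the width to 1 instead forces the two singleton cliques, with score 1.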
We generated 15 datasets for each combination of the number of vertices n from 8 to 18, maximum indegree k = 4 (sparse) or k = 8 (dense), and the number of samples m equaling 100, 1000, or 10,000, as follows: Along a random vertex ordering, we first drew for each vertex the number of its parents from the uniform distribution between 0 and k and then the actual parents uniformly at random from its predecessors in the vertex ordering. Next, we assigned each vertex two possible states and drew the parameters of the conditional distributions from the uniform distribution. Finally, from the obtained joint distribution, we drew m independent samples. The input for Junctor and GOBNILP was produced using the BDeu score with equivalent sample size 1. For both programs, we varied the maximum width parameter w from 3 to 6 and, in addition, examined the case of unbounded width (w = ∞). Because the performance of Junctor only depends on n and w, we ran it only once for each combination of the two.

[2] Junctor is publicly available at www.cs.helsinki.fi/u/jwkangas/junctor/.

Table 1: Benchmark instances with different numbers of attributes (n) and samples (m).

  Dataset      Abbr.   n    m
  Tic-tac-toe  X       10   958
  Poker        P       11   10000
  Bridges      B       12   108
  Flare        F       13   1066
  Zoo          Z       17   101
  Voting       V       17   435
  Tumor        T       18   339
  Lymph        L       19   148
  Hypothyroid          22   3772
  Mushroom             22   8124

Figure 3: The running time of Junctor against GOBNILP on the benchmark instances with at most 19 attributes, given in Table 1, for widths w ∈ {3, 4, 5, 6, ∞}. The dashed red line indicates the 4-hour timeout or memout.
In contrast, the performance of GOBNILP is very sensitive to various characteristics of the data, and therefore we ran it for all the combinations. All runs were allowed 4 CPU hours and 32 GB of memory. The results (Figure 2) show that for large widths Junctor scales better than GOBNILP (with respect to n), and even for low widths Junctor is superior to GOBNILP for smaller n. We found GOBNILP to exhibit moderate variance: 93% of all running times (excluding timeouts) were within a factor of 5 of the respective medians shown in Figure 2, and 73% were within a factor of 2. We observe that the running time of GOBNILP may behave "discontinuously" (e.g., on small datasets around 15 vertices with width 4).
We also evaluated both programs on several benchmark instances taken from the UCI repository [26]. The datasets are summarized in Table 1. Figure 3 shows the results on the instances with at most 19 attributes, for which the runs were, again, allowed 4 CPU hours and 32 GB of memory. The results are in good qualitative agreement with those obtained on synthetic data. For example, solving the Bridges dataset on 12 attributes with width 5 takes less than one second with Junctor but around 7 minutes with GOBNILP. For the two 22-attribute datasets we allowed both programs one week of CPU time and 128 GB of memory. Junctor was able to solve each within 33 hours for w = 3 and within 74 hours for w = 4. GOBNILP was able to solve Hypothyroid up to w = 6 (in 24 hours, or less for small widths), but Mushroom only up to w = 3. For higher widths GOBNILP ran out of time.

5 Concluding remarks

We have investigated the structure learning problem in chordal Markov networks. We showed that the commonly used scoring functions factorize in a way that enables a relatively efficient dynamic programming treatment. Our algorithm is the first that is guaranteed to solve moderate-size instances to optimality within reasonable time.
For example, whereas Corander et al. [12] report that their algorithm took more than 3 days on an eight-variable instance, our Junctor program solves any eight-variable instance within 20 milliseconds. We also reported on the first evaluation of GOBNILP [20] for solving the problem, which highlighted the advantages of the dynamic programming approach.

Acknowledgments

This work was supported by the Academy of Finland, grant 276864. The authors thank Matti Järvisalo for useful discussions on constraint programming approaches to learning Markov networks.

References

[1] N. Srebro. Maximum likelihood bounded tree-width Markov networks. Artificial Intelligence, 143(1):123–138, 2003.

[2] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462–467, 1968.

[3] S. Della Pietra, V. J. Della Pietra, and J. D. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.

[4] M. Narasimhan and J. A. Bilmes. PAC-learning bounded tree-width graphical models. In D. M. Chickering and J. Y. Halpern, editors, UAI, pages 410–417. AUAI Press, 2004.

[5] P. Abbeel, D. Koller, and A. Y. Ng. Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research, 7:1743–1788, 2006.

[6] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, NIPS. Curran Associates, Inc., 2007.

[7] J. Corander, M. Ekdahl, and T. Koski. Parallell interacting MCMC for learning of topologies of graphical models.
Data Mining and Knowledge Discovery, 17(3):431–456, 2008.

[8] G. Elidan and S. Gould. Learning bounded treewidth Bayesian networks. Journal of Machine Learning Research, 9:2699–2731, 2008.

[9] F. Bromberg, D. Margaritis, and V. Honavar. Efficient Markov network structure discovery using independence tests. Journal of Artificial Intelligence Research, 35:449–484, 2009.

[10] J. Davis and P. Domingos. Bottom-up learning of Markov network structure. In J. Fürnkranz and T. Joachims, editors, ICML, pages 271–278. Omnipress, 2010.

[11] J. Van Haaren and J. Davis. Markov network structure learning: A randomized feature generation approach. In J. Hoffmann and B. Selman, editors, AAAI, pages 1148–1154. AAAI Press, 2012.

[12] J. Corander, T. Janhunen, J. Rintanen, H. J. Nyman, and J. Pensar. Learning chordal Markov networks by constraint satisfaction. In C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors, NIPS, pages 1349–1357, 2013.

[13] J. Korhonen and P. Parviainen. Exact learning of bounded tree-width Bayesian networks. In C. M. Carvalho and P. Ravikumar, editors, AISTATS, volume 31 of JMLR Proceedings, pages 370–378. JMLR.org, 2013.

[14] J. Berg, M. Järvisalo, and B. Malone. Learning optimal bounded treewidth Bayesian networks via maximum satisfiability. In S. Kaski and J. Corander, editors, AISTATS, pages 86–95. JMLR.org, 2014.

[15] P. Parviainen, H. S. Farahani, and J. Lagergren. Learning bounded tree-width Bayesian networks using integer linear programming. In S. Kaski and J. Corander, editors, AISTATS, pages 751–759. JMLR.org, 2014.

[16] S. Ott, S. Imoto, and S. Miyano. Finding optimal models for small gene networks. In R. B. Altman, A. K. Dunker, L. Hunter, and T. E. Klein, editors, PSB, pages 557–567. World Scientific, 2004.

[17] M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks.
Journal of Machine Learning Research, 5:549–573, 2004.

[18] T. Silander and P. Myllymäki. A simple approach for finding the globally optimal Bayesian network structure. In R. Dechter and T. S. Richardson, editors, UAI, pages 445–452. AUAI Press, 2006.

[19] C. Yuan and B. Malone. Learning optimal Bayesian networks: A shortest path perspective. Journal of Artificial Intelligence Research, 48:23–65, 2013.

[20] M. Bartlett and J. Cussens. Advances in Bayesian network learning using integer programming. In UAI, pages 182–191. AUAI Press, 2013.

[21] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), 50(2):157–224, 1988.

[22] S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.

[23] A. P. Dawid and S. L. Lauritzen. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21(3):1272–1317, 1993.

[24] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.

[25] C. P. de Campos and Q. Ji. Efficient structure learning of Bayesian networks using constraints. Journal of Machine Learning Research, 12:663–689, 2011.

[26] K. Bache and M. Lichman. UCI machine learning repository, 2013.