{"title": "Advances in Learning Bayesian Networks of Bounded Treewidth", "book": "Advances in Neural Information Processing Systems", "page_first": 2285, "page_last": 2293, "abstract": "This work presents novel algorithms for learning Bayesian networks of bounded treewidth. Both exact and approximate methods are developed. The exact method combines mixed integer linear programming formulations for structure learning and treewidth computation. The approximate method consists in sampling k-trees (maximal graphs of treewidth k), and subsequently selecting, exactly or approximately, the best structure whose moral graph is a subgraph of that k-tree. The approaches are empirically compared to each other and to state-of-the-art methods on a collection of public data sets with up to 100 variables.", "full_text": "Advances in Learning Bayesian\nNetworks of Bounded Treewidth\n\nSiqi Nie\n\nRensselaer Polytechnic Institute\n\nTroy, NY, USA\nnies@rpi.edu\n\nCassio P. de Campos\n\nQueen\u2019s University Belfast\n\nBelfast, UK\n\nc.decampos@qub.ac.uk\n\nDenis D. Mau\u00e1\n\nUniversity of S\u00e3o Paulo\n\nS\u00e3o Paulo, Brazil\n\ndenis.maua@usp.br\n\nQiang Ji\n\nRensselaer Polytechnic Institute\n\nTroy, NY, USA\n\nqji@ecse.rpi.edu\n\nAbstract\n\nThis work presents novel algorithms for learning Bayesian networks of bounded\ntreewidth. Both exact and approximate methods are developed. The exact method\ncombines mixed integer linear programming formulations for structure learning\nand treewidth computation. The approximate method consists in sampling k-trees\n(maximal graphs of treewidth k), and subsequently selecting, exactly or approx-\nimately, the best structure whose moral graph is a subgraph of that k-tree. 
The\napproaches are empirically compared to each other and to state-of-the-art meth-\nods on a collection of public data sets with up to 100 variables.\n\n1 Introduction\n\nBayesian networks are graphical models widely used to represent joint probability distributions on\ncomplex multivariate domains. A Bayesian network comprises two parts: a directed acyclic graph\n(the structure) describing the relationships among the variables in the model, and a collection of\nconditional probability tables from which the joint distribution can be reconstructed. As the number\nof variables in the model increases, specifying the underlying structure becomes a daunting task,\nand practitioners often resort to learning Bayesian networks directly from data. Here, learning a\nBayesian network refers to inferring its structure from data, a task known to be NP-hard [9].\nLearned Bayesian networks are commonly used for drawing inferences such as querying the pos-\nterior probability of some variable given some evidence or finding the mode of the posterior joint\ndistribution. Those inferences are NP-hard to compute even approximately [23], and all known\nexact and provably good algorithms have worst-case time complexity exponential in the treewidth,\nwhich is a measure of the tree-likeness of the structure. In fact, under widely believed assumptions\nfrom complexity theory, exponential time complexity in the treewidth is inevitable for any algorithm\nthat performs exact inference [7, 20]. Thus, learning networks of small treewidth is essential if one\nwishes to ensure reliable and efficient inference. This is particularly important in the presence of\nmissing data, when learning becomes intertwined with inference [16]. There is a second reason to\nlimit the treewidth. 
Previous empirical results [15, 22] suggest that bounding the treewidth improves\nmodel performance on unseen data, hence improving the generalization ability of the model.\nIn this paper we present two novel ideas for score-based Bayesian network learning with a hard\nconstraint on treewidth. The first one is a mixed-integer linear programming (MILP) formulation\nof the problem (Section 3) that builds on existing MILP formulations for unconstrained learning\nof Bayesian networks [10, 11] and for computing the treewidth of a graph [17]. Unlike the MILP\nformulation of Parviainen et al. [21], the MILP problem we generate is of polynomial size in the\nnumber of variables, and dispenses with the use of cutting-plane techniques. This makes for a clean\nand succinct formulation that can be solved with a single call to any MILP optimizer. We provide\nsome empirical evidence (in Section 5) that suggests that our approach is not only simpler but often\nfaster. It also outperforms the dynamic programming approach of Korhonen and Parviainen [19].\nSince linear programming relaxations are used for solving the MILP problem, any MILP formula-\ntion can be used to provide approximate solutions and error estimates in an anytime fashion (i.e., the\nmethod can be stopped at any time during the computation with a feasible solution whose quality\nmonotonically improves with time). However, the MILP formulations (both ours and that of Parvi-\nainen et al. [21]) cannot cope with very large domains, even if we settle for approximate solutions.\nIn order to deal with large domains, we devise (in Section 4) an approximate method based on a\nuniform sampling of k-trees (maximal chordal graphs of treewidth k), which is achieved by using\na fast-to-compute bijection between k-trees and Dandelion codes [6]. 
For each sampled k-tree, we\neither run an exact algorithm similar to the one in [19] (when computationally appealing) to learn\nthe score-maximizing network whose moral graph is a subgraph of that k-tree, or we resort to a\nmore efficient method that takes partial variable orderings uniformly at random from a (relatively\nsmall) space of orderings that are compatible with the k-tree. We show empirically (in Section 5)\nthat our sampling-based methods are very effective in learning close-to-optimal structures and scale\nup to large domains. We conclude in Section 6 and point out possible future work. We begin with\nsome background knowledge and literature review on learning Bayesian networks (Section 2).\n\n2 Bayesian Network Structure Learning\n\nLet N be {1, ..., n} and consider a finite set X = {Xi : i \u2208 N} of categorical random variables\nXi taking values in finite sets Xi. A Bayesian network is a triple (X, G, \u03b8), where G = (N, A)\nis a directed acyclic graph (DAG) whose nodes are in one-to-one correspondence with variables in\nX, and \u03b8 = {\u03b8i(xi, xGi)} is a set of numerical parameters specifying (conditional) probabilities\n\u03b8i(xi, xGi) = Pr(xi|xGi), for every node i in G, value xi of Xi and assignment xGi to the parents\nGi of Xi in G. The structure G of the network represents a set of stochastic independence assess-\nments among variables in X called graphical Markov conditions: every variable Xi is conditionally\nindependent of its nondescendant nonparents given its parents. As a consequence, a Bayesian net-\nwork uniquely defines a joint probability distribution over X as the product of its parameters.\nAs is common in the literature, we formulate the problem of Bayesian network learning as an\noptimization over DAG structures guided by a score function. 
We only require that (i) the score\nfunction can be written as a sum of local score functions si(Gi), i \u2208 N, each depending only on\nthe corresponding parent set Gi and on the data, and (ii) the local score functions can be efficiently\ncomputed and stored [13, 14]. These properties are satisfied by commonly used score functions\nsuch as the Bayesian Dirichlet equivalent uniform score [18]. We assume the reader is familiar with\ngraph-theoretic concepts such as polytrees, chordal graphs, chordalizations, moral graphs, moral-\nizations, topological orders, (perfect) elimination orders, fill-in edges and clique-trees. References\n[1] and [20] are good starting points to the topic.\nMost score functions penalize model complexity in order to avoid overfitting. The way scores penal-\nize model complexity generally leads to learning structures of bounded in-degree, but even bounded\nin-degree graphs can have high treewidth (for instance, directed square grids have treewidth equal\nto the square root of the number of nodes, yet have maximum in-degree equal to two), which brings\ndifficulty to subsequent probabilistic inferences with the model [5].\nThe goal of this work is to develop methods that search for\n\nG\u2217 = argmax_{G \u2208 GN,k} \u2211_{i\u2208N} si(Gi) ,    (1)\n\nwhere GN,k is the set of all DAGs with node set N and treewidth at most k. Dasgupta proved\nNP-hardness of learning polytrees of bounded treewidth when the score is data log likelihood [12].\nKorhonen and Parviainen [19] adapted Srebro\u2019s complexity result for Markov networks [25] to show\nthat learning Bayesian networks of treewidth two or greater is NP-hard.\nIn comparison to the unconstrained problem, few algorithms have been designed for the bounded\ntreewidth case. 
Korhonen and Parviainen [19] developed an exact algorithm based on dynamic\nprogramming that learns optimal n-node structures of treewidth at most w in time 3^n n^{w+O(1)},\nwhich is above the 2^n n^{O(1)} time required by the best worst-case algorithms for learning optimal\nBayesian networks with no constraint on treewidth [24]. We shall refer to their method in the rest\nof this paper as K&P (after the authors\u2019 initials). Elidan and Gould [15] combined several heuristics\nfor treewidth computation and network structure learning in order to design approximate methods.\nOthers have addressed the similar (but not equivalent) problem of learning undirected models of\nbounded treewidth [2, 8, 25]. Very recently, there seems to be an increase of interest in the topic.\nBerg et al. [4] showed that the problem of learning bounded treewidth Bayesian networks can be\nreduced to a weighted maximum satisfiability problem, and subsequently solved by weighted MAX-\nSAT solvers. They report experimental results showing that their approach outperforms K&P. In the\nsame year, Parviainen et al. [21] showed that the problem can be reduced to a MILP. Their reduced\nMILP problem however has exponentially many constraints in the number of variables. Following\nthe work of Cussens [10], the authors avoid creating such large programs by a cutting plane gener-\nation mechanism, which iteratively includes a new constraint while the optimum is not found. The\ngeneration of each new constraint (cutting plane) requires solving another MILP problem. We shall\nrefer to their method from now on as TWILP (after the name of the software package the authors\nprovide).\n\n3 A Mixed Integer Linear Programming Approach\n\nThe first contribution of this work is the MILP formulation that we design to solve the problem of\nstructure learning with bounded treewidth. 
MILP formulations have been shown to be very effective for\nlearning Bayesian networks with no constraint on treewidth [3, 10], surpassing other attempts in\na range of data sets. The formulation is based on combining the MILP formulation for structure\nlearning in [11] with the MILP formulation presented in [17] for computing the treewidth of an\nundirected graph. There are however notable differences: for instance, we do not enforce a linear\nelimination ordering of nodes; instead we allow for partial orders which capture the equivalence be-\ntween different orders in terms of minimizing treewidth, and we represent such partial order by real\nnumbers instead of integers. We avoid the use of sophisticated techniques for solving MILP problems\nsuch as constraint generation [3, 10], which allows for an easy implementation and parallelization\n(MILP optimizers such as CPLEX can take advantage of that).\nFor each node i in N, let Pi be the collection of allowed parent sets for that node (these sets can\nbe specified manually by the user or simply defined as the subsets of N \\ {i} with cardinality less\nthan a given bound). We denote an element of Pi as Pit, with t = 1, ..., ri and ri = |Pi| (hence\nPit \u2282 N). We will refer to a DAG as valid if its node set is N and the parent set of each node i in it\nis an element of Pi. The following MILP problem can be used to find valid DAGs whose treewidth\nis at most w:\n\nMaximize  \u2211_{i,t} pit \u00b7 si(Pit)    (2)\n\nsubject to\n\n\u2211_{j\u2208N} yij \u2264 w,    \u2200i \u2208 N,    (3a)\n(n + 1) \u00b7 yij \u2264 n + zj \u2212 zi,    \u2200i, j \u2208 N,    (3b)\nyij + yik \u2212 yjk \u2212 ykj \u2264 1,    \u2200i, j, k \u2208 N,    (3c)\n\u2211_t pit = 1,    \u2200i \u2208 N,    (4a)\n(n + 1) \u00b7 pit \u2264 n + vj \u2212 vi,    \u2200i \u2208 N, \u2200t \u2208 {1, ..., ri}, \u2200j \u2208 Pit,    (4b)\npit \u2264 yij + yji,    \u2200i \u2208 N, \u2200t \u2208 {1, ..., ri}, \u2200j \u2208 Pit,    (4c)\npit \u2264 yjk + ykj,    \u2200i \u2208 N, \u2200t \u2208 {1, ..., ri}, \u2200j, k \u2208 Pit,    (4d)\nzi \u2208 [0, n], vi \u2208 [0, n], yij \u2208 {0, 1}, pit \u2208 {0, 1},    \u2200i, j \u2208 N, \u2200t \u2208 {1, ..., ri}.    (5)\n\n
The variables pit define which parent sets are chosen, while the variables vi guarantee that those\nchoices respect a linear ordering of the variables, and hence that the corresponding directed graph\nis acyclic. The variables yij specify a chordal moralization of this DAG with arcs respecting an\nelimination ordering of width at most w, which is given by the variables zi.\nThe following result shows that any solution to the MILP above can be decoded into a chordal graph\nof bounded treewidth and a suitable perfect elimination ordering.\n\nLemma 1. Let zi, yij, i, j \u2208 N, be variables satisfying Constraints (3) and (5). Then the undirected\ngraph M = (N, E), where E = {ij \u2208 N \u00d7 N : yij = 1 or yji = 1}, is chordal and has treewidth\nat most w. Any elimination ordering that extends the weak ordering induced by zi is perfect for M.\n\nThe graph M is used in the formulation as a template for the moral graph of a valid DAG:\n\nLemma 2. Let vi, pit, i \u2208 N, t = 1, ..., ri, be variables satisfying Constraints (4) and (5). Then\nthe directed graph G = (N, A), where Gi = {j : pit = 1 and j \u2208 Pit}, is acyclic and valid.\nMoreover the moral graph of G is a subgraph of the graph M defined in the previous lemma.\n\nThe previous lemmas suffice to show that the solutions of the MILP problem can be decoded into\nvalid DAGs of bounded treewidth:\n\nTheorem 1. Any solution to the MILP can be decoded into a valid DAG of treewidth less than or\nequal to w. In particular, the decoding of an optimal solution solves (1).\n\nThe MILP formulation can be directly fed into any off-the-shelf MILP optimizer. Most MILP op-\ntimizers (e.g. 
CPLEX) can be prematurely stopped while providing an incumbent solution and an\nerror estimate. Moreover, given enough resources (time and memory), these solvers always find\noptimal solutions. Hence, the MILP formulation provides an anytime algorithm that can be used to\nprovide both exact and approximate solutions.\nThe bottleneck in terms of efficiency of the MILP construction lies in the specification of Con-\nstraints (3c) and (4d), as there are \u0398(n^3) such constraints. Thus, as n increases even the linear\nrelaxations of the MILP problem become hard to solve. We demonstrate empirically in Section 5\nthat the quality of solutions found by the MILP approach in a reasonable amount of time degrades\nquickly as the number of variables exceeds a few dozen. In the next section, we present an approx-\nimate algorithm to overcome such limitations and handle large domains.\n\n4 A Sampling Based Approach\n\nA successful method for learning Bayesian networks of unconstrained treewidth on large domains\nis order-based local search, which consists in sampling topological orderings for the variables and\nselecting optimal compatible DAGs [26]. Given a topological ordering, the optimal DAG can be\nfound in linear time (assuming scores are given as input), hence rendering order-based search re-\nally effective in exploring the solution space. A naive extension of that approach to the bounded\ntreewidth case would be to (i) sample a topological order, (ii) find the optimal compatible DAG, (iii)\nverify the treewidth and discard if it exceeds the desired bound. There are two serious issues with\nthat approach. 
First, verifying the treewidth is an NP-hard problem, and although there are algorithms that run\nin time linear in n (but exponential in the treewidth), they perform poorly in practice; second, the vast\nmajority of structures would be discarded, since the most used score functions penalize the number\nof free parameters, which correlates poorly with treewidth [5].\nIn this section, we propose a more sophisticated extension of order-based search to learn bounded\ntreewidth structures. Our method relies on sampling k-trees, which are defined inductively as fol-\nlows [6]. A complete graph with k + 1 nodes (i.e., a (k + 1)-clique) is a k-tree. Let Tk = (V, E)\nbe a k-tree, K be a k-clique in it, and v be a node not in V. Then the graph obtained by connecting\nv to every node in K is also a k-tree. A k-tree is a maximal graph of treewidth k in the sense that\nno edge can be added without increasing the treewidth. Every graph of treewidth at most k is a\nsubgraph of some k-tree. Hence, Bayesian networks of treewidth bounded by k are exactly those\nwhose moral graph is a subgraph of some k-tree [19]. We are interested in k-trees over the nodes N\nof the Bayesian network and where k = w is the bound we impose on the treewidth.\nCaminiti et al. [6] proposed a linear time method (in both n and k) for coding and decoding k-\ntrees into what is called (generalized) Dandelion codes. They also established a bijection between\nDandelion codes and k-trees. Hence, sampling Dandelion codes is essentially equivalent to sampling\nk-trees. The former however is computationally much easier and faster to perform, especially if we\nwant to draw samples uniformly at random (uniform sampling provides good coverage of the space\nand produces low variance estimates across data sets). Formally, a Dandelion code is a pair (Q, S),\nwhere Q \u2286 N with |Q| = k and S is a list of n \u2212 k \u2212 2 pairs of integers drawn from N \u222a {\u03b5}, where\n\u03b5 is an arbitrary number not in N. 
Dandelion codes can be sampled uniformly by a trivial linear-time\nalgorithm that uniformly chooses k elements from N to build Q, then uniformly samples n \u2212 k \u2212 2\npairs of integers in N \u222a {\u03b5}. Algorithm 1 contains a high-level description of our approach.\n\nAlgorithm 1 Learning a structure of bounded treewidth by sampling Dandelion codes.\n% Takes a score function si, i \u2208 N, and an integer k, and outputs a DAG G\u2217 of treewidth \u2264 k.\n1 Initialize G\u2217 as an empty DAG.\n2 Repeat a certain number of iterations:\n2.a Uniformly sample a Dandelion code (Q, S) and decode it into Tk.\n2.b Search for a DAG G that maximizes the score function and is compatible with Tk.\n2.c If \u2211_{i\u2208N} si(Gi) > \u2211_{i\u2208N} si(G\u2217i), update G\u2217.\n\nWe assume from now on that a k-tree Tk is available, and consider the problem of searching for a\ncompatible DAG that maximizes the score (Step 2.b). Korhonen and Parviainen [19] presented an\nalgorithm (which we call K&P) that given an undirected graph M finds a DAG G maximizing the\nscore function such that the moralization of G is a subgraph of M. For fixed k, the algorithm runs in time and\nspace O(n) assuming the scores are part of the input (hence pre-computed and accessed in constant\ntime). We can use their algorithm to find the optimal structure whose moral graph is a subgraph of\nTk. We call this approach S+K&P as a reminder of (k-tree) sampling followed by K&P.\n\nTheorem 2. The size of the sampling space of S+K&P is less than e^{n log(nk)}. Each of its iterations\nruns in linear time in n (but exponential in k).\n\nAccording to the result above, the sampling space of S+K&P is not much bigger than that of stan-\ndard order-based local search (which is approximately e^{n log n}), especially if k \u226a n. 
The practical\ndrawback of this approach is the \u0398(k 3^k (k + 1)! n) time taken by K&P to process each sampled\nk-tree, which forbids its use for moderately high treewidth bounds (say, k \u2265 10). Our experiments\nin the next section further corroborate our claim: S+K&P often performs poorly even on small k,\nmostly due to the small number of k-trees sampled within the given time limit. A better approach is\nto sacrifice the optimality of the search for compatible DAGs in exchange for an efficiency gain. We\nnext present a method based on sampling topological orderings that achieves such a goal.\nLet Ci be the collection of maximal cliques of Tk that contain a certain node i (these can be obtained\nefficiently, as Tk is chordal), and consider a topological ordering < of N. For C \u2208 Ci, let C<i = {j \u2208 C : j <\ni}. We can find an optimal DAG G compatible with < and Tk by making Gi = argmax{si(P) :\nP \u2286 C<i, C \u2208 Ci} for each i \u2208 N. The graph G is acyclic since each parent set Gi respects the\ntopological ordering by construction. Its treewidth is at most k because both i and Gi belong to a\nclique C of Tk, which implies that the moralization of G is a subgraph of Tk.\nSampling topological orderings is both inefficient and wasteful, as many different topological orderings\nimpose the same constraints on the choices of Gi. To see this, consider the k-tree with edges 1\u20132,\n1\u20133, 2\u20133, 2\u20134 and 3\u20134. Since there is no edge connecting nodes 1 and 4, their relative ordering is\nirrelevant when choosing either G1 or G4. A better approach is to linearly order the nodes in each\nmaximal clique.\nA k-tree Tk can be represented by a clique-tree structure, which comprises its maximal cliques\nC1, ..., Cn\u2212k and a tree T over the maximal cliques. Every two adjacent cliques in T differ by\nexactly one node. 
Assume T is rooted at a clique R, so we can unambiguously refer to the (single)\nparent of a (maximal) clique and to its children. A clique-tree structure as such can directly be\nobtained from the process of decoding a Dandelion code [6]. The procedure in Algorithm 2 shows\nhow to efficiently obtain a collection of compatible orderings of the nodes of each clique of a k-tree.\n\nAlgorithm 2 Sampling a partial order within a k-tree.\n% Takes a k-tree represented as a clique-tree structure T rooted at R, and outputs a collection of\n% orderings \u03c3C for every maximal clique C of T.\n1 Sample an order \u03c3R of the nodes in R, paint R black and the other maximal cliques white.\n2 Repeat until all maximal cliques are painted black:\n2.a Take a white clique C whose parent clique P in T is black, and let i be the single node in C \\ P.\n2.b Sample a relative order for i with respect to \u03c3P (i.e., insert i into some arbitrary position of the\n    projection of \u03c3P onto C), and generate \u03c3C accordingly; when done paint C black.\n\nTable 1: Number of variables in the data sets.\n\ndata set   nursery  breast  housing  adult  zoo  letter  mushroom  wdbc  audio  hill  community\nvariables  9        10      14       15     17   17      22        31    62     100   100\n\nThe cliques in Algorithm 2 are processed in topological ordering in the clique-tree structure, which\nensures that the order \u03c3P of the parent P of a clique C is already defined when processing C (note\nthat the order in which we process cliques does not restrict the possible orderings among nodes). At\nthe end, we have a node ordering for each clique. Given such a collection of local orderings, we can\nefficiently learn the optimal parent set of every node i by\n\nGi = argmax_{P \u2286 C : P \u223c \u03c3C, C \u2208 Ci} si(P) ,    (6)\n\nwhere P \u223c \u03c3C denotes that the parent sets are constrained to be nodes smaller than i in \u03c3C. 
In\nfact, the choices made in (6) can be implemented together with step 2.b of Algorithm 2, providing\na slight increase of efficiency. We call the method obtained by Algorithm 1 with partial orderings\nestablished by Algorithm 2 and parent set selection by (6) S2, in allusion to the double sampling\nscheme of k-trees and local node orderings.\n\nTheorem 3. S2 samples DAGs \u03c3 on a sample space of size k! \u00b7 (k + 1)^{n\u2212k}, and runs in linear time\nin n and k.\n\nThe generation of partial orderings can also serve to implement the DAG search in S+K&P, by\nreplacing the sampling with complete enumeration of them. Then Step 2.b would be performed for\neach compatible ordering \u03c3P of the parent in a recursive way. Dynamic programming can be used\nto make the procedure more efficient. We have actually used this approach in our implementation\nof S+K&P. Finally, the sampling can be enhanced by some systematic search in the neighborhood\nof the sampled candidates. We have implemented and put in place a simple hill-climbing procedure\nfor that, even though the quality of solutions does not considerably improve by doing so.\n\n5 Experiments\n\nWe empirically analyzed the accuracy of the algorithms proposed here against each other and\nagainst the available implementations of TWILP (https://bitbucket.org/twilp/twilp/) and K&P\n(http://www.cs.helsinki.fi/u/jazkorho/aistats-2013/) on a collection of data sets from the UCI repos-\nitory. The S+K&P and S2 algorithms were implemented (purely) in Matlab. The data sets were\nselected so as to span a wide range of dimensionality, and were preprocessed to have variables dis-\ncretized over the median value when needed. 
Some columns of the original data sets audio and\ncommunity were discarded: 7 variables of audio had a constant value, 5 variables of community\nhad almost one different value per sample (such as personal data), and 22 variables had missing\ndata (Table 1 shows the number of (binary) variables after pre-processing). In all experiments, we\nmaximize the Bayesian Dirichlet equivalent uniform score with equivalent sample size equal to one.\n\n5.1 Exact Solutions\n\nWe refer to our MILP formulation as simply MILP hereafter. We compared MILP, TWILP and K&P\non the task of finding an optimal structure. Table 2 reports the running time on a selection of data\nsets of reasonably low dimensionality and small values for the treewidth bound. The experiments\nwere run on a computer with 32 cores, memory limit of 64GB, time limit of 3h and maximum\nnumber of parents of three (the latter restriction facilitates the experiments and does not constrain\nthe treewidth). In cases where MILP or TWILP did not finish we also report the error estimates from\nCPLEX (an error of e% means that the achieved solution is certainly not more than e% worse than\nthe optimal). While we emphasize that one should be careful when directly comparing execution\ntime between methods, as the implementations use different languages (we are running CPLEX\n12.4, the original K&P uses a Cython compiled Python code, TWILP uses a Python interface to\nCPLEX to generate the cutting plane mechanism), we note that MILP goes much further in terms\nof which data sets and treewidth values it can compute. MILP has found the optimal structure in\nall instances, but was not always able to certify its optimality in due time. TWILP found the optimum for\nall treewidth bounds only on the nursery and breast data sets. 
The results also suggest that MILP\nbecomes faster with the increase of the bound, while TWILP running times remain almost unaltered.\nThis might be explained by the fact that the MILP formulation is complete and the increase of the\nbound facilitates encountering good solutions, while TWILP needs to generate constraints until an\noptimal solution can be certified.\n\nTable 2: Time to learn an optimal Bayesian network subject to treewidth bound w. Dashes denote\nfailure to solve due to excessive memory demand. Entries of the form 3h [e%] give the CPLEX error\nestimate when the time limit was reached.\n\nmethod  w  nursery (n=9)  breast (n=10)  housing (n=14)  adult (n=15)  mushroom (n=22)\nMILP    2  1s             31s            3h [2.4%]       3h [0.39%]    3h [50%]\nMILP    3  <1s            19s            25m             3h [0.04%]    3h [19.3%]\nMILP    4  <1s            8s             80s             40m           3h [14.9%]\nMILP    5  <1s            8s             56s             37s           3h [11.2%]\nTWILP   2  5m             3h [0.5%]      3h [7%]         3h [0.6%]     3h [32%]\nTWILP   3  5s             3h [3%]        3h [9%]         3h [0.7%]     3h [31%]\nTWILP   4  <1s            3h [0.3%]      3h [9%]         3h [0.9%]     3h [27%]\nTWILP   5  <1s            3h [0.5%]      3h [7%]         3h [0.9%]     3h [23%]\nK&P     2  7s             26s            128m            137m          \u2013\nK&P     3  72s            5m             \u2013               \u2013             \u2013\nK&P     4  12m            103m           \u2013               \u2013             \u2013\nK&P     5  131m           \u2013              \u2013               \u2013             \u2013\n\n5.2 Approximate Solutions\n\nWe used treewidth bounds of 4 and 10, and maximum parent set size of 3, except for hill and\ncommunity, where it was set to 2 to help the integer programming approaches (which suffer the\nmost from large parent sets). To be fair to all methods, we pre-computed scores, and considered\nthem as input of the problem. Both MILP and TWILP used CPLEX 12.4 with a memory limit of\n64GB to solve the optimizations. We have allowed CPLEX to run up to three hours, collecting the\nincumbent solution after 10 minutes. S+K&P and S2 have been given 10 minutes. This evaluation\nat 10 minutes is to be seen as an early-stage comparison for applications that need a reasonably\nfast response. 
To account for the intrinsic variability of the performance of the sampling methods\nwith respect to the sampling seed, S+K&P and S2 were run ten times on each data set with different\nseeds; we report the minimum, median and maximum obtained values over the runs.\nFigure 1 shows the normalized scores (in percentage) of each method on each data set. The normal-\nized score of a method that returns a solution with score s on a certain data set is norm-score(s) =\n(s \u2212 s_empty)/(s_max \u2212 s_empty), where s_empty is the score of an empty DAG (used as baseline), and s_max\nis the maximum score over all methods in that data set. Hence, a normalized score of 0 indicates the\nmethod found solutions as good as the empty graph (a trivial solution), whereas a normalized score\nof 1 indicates the method performed best on that data set.\nThe exponential dependence on treewidth of S+K&P prevents it from running with treewidth bound greater\nthan 6. We see from the plot on the left that S2 is largely superior to S+K&P, even though the former\nmay find suboptimal networks for each given k-tree. This suggests that finding good k-trees is more im-\nportant than selecting good networks for a given k-tree. We also see that both integer programming\nformulations scale poorly with the number of variables, being unable to obtain satisfactory solutions\nfor data sets with more than 50 variables. For the hill data set and treewidth \u2264 4, MILP was not able\nto find a feasible solution within 10 minutes, and could only find the trivial solution (empty DAG)\nafter 3 hours; TWILP did not find any solution even after 3 hours. On the community data set with\ntreewidth \u2264 4, neither MILP nor TWILP found a solution within 3 hours. 
For treewidth \u2264 10 the\ninteger programming approaches performed even worse: TWILP could not provide a solution for\nthe audio, hill and community data sets; MILP could only find the empty graph.\nSince both S+K&P and S2 were implemented in Matlab, the comparison with either MILP or\nTWILP within the same time period (10 minutes) might be unfair (one could also try to improve\nthe MILP formulation, although it will eventually suffer from the problems discussed in Section 3).\nNevertheless, the results show that S2 is very competitive despite this implementation disadvantage.\n\n[Figure 1 plot: two panels (Treewidth \u2264 4 and Treewidth \u2264 10); horizontal axis: normalized score (%), from 0 to 100; vertical axis: data sets nursery, breast, housing, adult, zoo, letter, mushroom, wdbc, audio, hill, community; series: S+K&P, S2, MILP-10m, MILP-3h, TWILP-10m, TWILP-3h.]\n\nFigure 1: Normalized scores. Missing results indicate failure to provide a solution.\n\n6 Conclusions\n\nWe presented exact and approximate procedures to learn Bayesian networks of bounded treewidth.\nThe exact procedure is based on a MILP formulation, and is shown to outperform other methods for\nexact learning, including the different MILP formulation proposed in [21]. Our MILP approach is\nalso competitive when used to produce approximate solutions. However, due to the cubic number\nof constraints, the MILP formulation cannot cope with very large domains, and there is probably\nlittle we can do to considerably improve this situation. Constraint generation techniques [3] are yet\nto be explored, even though we do not expect them to produce dramatic performance gains, since the\ncompeting objectives of maximizing score and bounding treewidth usually lead to the generation of\na large number of constraints.\nTo tackle large problems, we developed an approximate algorithm that samples k-trees and then\nsearches for compatible structures. 
We derived two variants by trading off the computational effort\nspent on sampling k-trees against that spent on searching for compatible structures. The sampling-based\nmethods are empirically shown to provide fairly accurate solutions and to scale to large domains.\n\nAcknowledgments\n\nWe thank the authors of [19, 21] for making their software publicly available and the anonymous\nreviewers for their useful suggestions. Most of this work was performed while C. P. de Campos\nwas with the Dalle Molle Institute for Artificial Intelligence. This work was partially supported\nby Swiss NSF grant 200021 146606/1, by S\u00e3o Paulo Research Foundation (FAPESP) grant\n2013/23197-4, and by grant N00014-12-1-0868 from the US Office of Naval Research.\n\nReferences\n\n[1] S. Arnborg, D. Corneil, and A. Proskurowski. Complexity of finding embeddings in a k-tree. SIAM J. on Matrix Analysis and Applications, 8(2):277\u2013284, 1987.\n\n[2] F. R. Bach and M. I. Jordan. Thin junction trees. In Advances in Neural Inf. Proc. Systems 14, pages 569\u2013576, 2001.\n\n[3] M. Bartlett and J. Cussens. Advances in Bayesian Network Learning using Integer Programming. In Proc. 29th Conf. on Uncertainty in AI, pages 182\u2013191, 2013.\n\n[4] J. Berg, M. J\u00e4rvisalo, and B. Malone. Learning optimal bounded treewidth Bayesian networks via maximum satisfiability. In Proc. 17th Int. Conf. on AI and Stat., pages 86\u201395, 2014. JMLR W&CP 33.\n\n[5] A. Beygelzimer and I. Rish. Inference complexity as a model-selection criterion for learning Bayesian networks. In Proc. 8th Int. Conf. Princ. Knowledge Representation and Reasoning, pages 558\u2013567, 1998.\n\n[6] S. Caminiti, E. G. Fusco, and R. Petreschi. Bijective linear time coding and decoding for k-trees. Theory of Comp. Systems, 46(2):284\u2013300, 2010.\n\n[7] V. Chandrasekaran, N. Srebro, and P. Harsha. Complexity of inference in graphical models. In Proc. 24th Conf. 
on Uncertainty in AI, pages 70\u201378, 2008.\n\n[8] A. Chechetka and C. Guestrin. Efficient principled learning of thin junction trees. In Advances in Neural Inf. Proc. Systems, pages 273\u2013280, 2007.\n\n[9] D. M. Chickering. Learning Bayesian networks is NP-complete. In Learning from Data: AI and Stat. V, pages 121\u2013130. Springer-Verlag, 1996.\n\n[10] J. Cussens. Bayesian network learning with cutting planes. In Proc. 27th Conf. on Uncertainty in AI, pages 153\u2013160, 2011.\n\n[11] J. Cussens, M. Bartlett, E. M. Jones, and N. A. Sheehan. Maximum Likelihood Pedigree Reconstruction using Integer Linear Programming. Genetic Epidemiology, 37(1):69\u201383, 2013.\n\n[12] S. Dasgupta. Learning polytrees. In Proc. 15th Conf. on Uncertainty in AI, pages 134\u2013141, 1999.\n\n[13] C. P. de Campos and Q. Ji. Efficient structure learning of Bayesian networks using constraints. J. of Mach. Learning Res., 12:663\u2013689, 2011.\n\n[14] C. P. de Campos, Z. Zeng, and Q. Ji. Structure learning of Bayesian networks using constraints. In Proc. 26th Int. Conf. on Mach. Learning, pages 113\u2013120, 2009.\n\n[15] G. Elidan and S. Gould. Learning Bounded Treewidth Bayesian Networks. J. of Mach. Learning Res., 9:2699\u20132731, 2008.\n\n[16] N. Friedman. The Bayesian structural EM algorithm. In Proc. 14th Conf. on Uncertainty in AI, pages 129\u2013138, 1998.\n\n[17] A. Grigoriev, H. Ensinck, and N. Usotskaya. Integer linear programming formulations for treewidth. Technical report, Maastricht Res. School of Economics of Tech. and Organization, 2011.\n\n[18] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learning, 20(3):197\u2013243, 1995.\n\n[19] J. H. Korhonen and P. Parviainen. Exact learning of bounded tree-width Bayesian networks. In Proc. 16th Int. Conf. on AI and Stat., pages 370\u2013378, 2013. JMLR W&CP 31.\n\n[20] J. H. P. Kwisthout, H. 
L. Bodlaender, and L. C. van der Gaag. The Necessity of Bounded Treewidth for Efficient Inference in Bayesian Networks. In Proc. 19th European Conf. on AI, pages 237\u2013242, 2010.\n\n[21] P. Parviainen, H. S. Farahani, and J. Lagergren. Learning bounded tree-width Bayesian networks using integer linear programming. In Proc. 17th Int. Conf. on AI and Stat., pages 751\u2013759, 2014. JMLR W&CP 33.\n\n[22] E. Perrier, S. Imoto, and S. Miyano. Finding optimal Bayesian network given a super-structure. J. of Mach. Learning Res., 9(2):2251\u20132286, 2008.\n\n[23] D. Roth. On the hardness of approximate reasoning. Artif. Intell., 82(1\u20132):273\u2013302, 1996.\n\n[24] T. Silander and P. Myllymaki. A simple approach for finding the globally optimal Bayesian network structure. In Proc. 22nd Conf. on Uncertainty in AI, pages 445\u2013452, 2006.\n\n[25] N. Srebro. Maximum likelihood bounded tree-width Markov networks. Artif. Intell., 143(1):123\u2013138, 2003.\n\n[26] M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proc. 21st Conf. on Uncertainty in AI, pages 584\u2013590, 2005.\n", "award": [], "sourceid": 1216, "authors": [{"given_name": "Siqi", "family_name": "Nie", "institution": "RPI"}, {"given_name": "Denis", "family_name": "Maua", "institution": "USP"}, {"given_name": "Cassio", "family_name": "de Campos", "institution": "Queen's University Belfast"}, {"given_name": "Qiang", "family_name": "Ji", "institution": "RPI"}]}