{"title": "Learning Treewidth-Bounded Bayesian Networks with Thousands of Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 1462, "page_last": 1470, "abstract": "We present a method for learning treewidth-bounded Bayesian networks from data sets containing thousands of variables. Bounding the treewidth of a Bayesian network greatly reduces the complexity of inferences. Yet, being a global property of the graph, it considerably increases the difficulty of the learning process. Our novel algorithm accomplishes this task, scaling both to large domains and to large treewidths. Our novel approach consistently outperforms the state of the art on experiments with up to thousands of variables.", "full_text": "Learning Treewidth-Bounded Bayesian Networks\n\nwith Thousands of Variables\n\nMauro Scanagatta\nIDSIA\u2217, SUPSI\u2020 , USI\u2021\nLugano, Switzerland\nmauro@idsia.ch\n\nGiorgio Corani\n\nIDSIA\u2217, SUPSI\u2020 , USI\u2021\nLugano, Switzerland\n\ngiorgio@idsia.ch\n\nCassio P. de Campos\n\nQueen\u2019s University Belfast\n\nNorthern Ireland, UK\n\nc.decampos@qub.ac.uk\n\nMarco Zaffalon\n\nIDSIA\u2217\n\nLugano, Switzerland\n\nzaffalon@idsia.ch\n\nAbstract\n\nWe present a method for learning treewidth-bounded Bayesian networks from\ndata sets containing thousands of variables. Bounding the treewidth of a Bayesian\nnetwork greatly reduces the complexity of inferences. Yet, being a global property\nof the graph, it considerably increases the dif\ufb01culty of the learning process. Our\nnovel algorithm accomplishes this task, scaling both to large domains and to large\ntreewidths. Our novel approach consistently outperforms the state of the art on\nexperiments with up to thousands of variables.\n\nIntroduction\n\n1\nWe consider the problem of structural learning of Bayesian networks with bounded treewidth,\nadopting a score-based approach. 
Learning the structure of a bounded-treewidth Bayesian network is an NP-hard problem (Korhonen and Parviainen, 2013). Yet learning Bayesian networks with bounded treewidth is necessary to allow exact tractable inference, since the worst-case inference complexity is exponential in the treewidth k (under the exponential time hypothesis) (Kwisthout et al., 2010).\nA pioneering approach, polynomial in both the number of variables and the treewidth bound, has been proposed in Elidan and Gould (2009). It incrementally builds the network; at each arc addition it provides an upper bound on the treewidth of the learned structure. The limitation of this approach is that, as the number of variables increases, the gap between the bound and the actual treewidth becomes large, leading to sparse networks. An exact method has been proposed in Korhonen and Parviainen (2013), which finds the highest-scoring network with the desired treewidth. However, its complexity increases exponentially with the number of variables n; thus it has been applied in experiments with 15 variables at most. Parviainen et al. (2014) adopted an anytime integer linear programming (ILP) approach, called TWILP. If the algorithm is given enough time, it finds the highest-scoring network with bounded treewidth; otherwise it returns a sub-optimal DAG with bounded treewidth. The ILP problem has an exponential number of constraints in the number of variables; this limits its scalability, even if the constraints can be generated online. Berg et al. (2014) cast the problem of structural learning with limited treewidth as a problem of weighted partial Maximum Satisfiability. They solved the problem exactly through a MaxSAT solver and performed experiments with 30 variables at most.\nNie et al.\n
(2014) proposed an efficient anytime ILP approach with a polynomial number of constraints in the number of variables. Yet they report that the quality of the solutions quickly degrades as the number of variables exceeds a few dozen, and that no satisfactory solutions are found with data sets containing more than 50 variables. Approximate approaches are therefore needed to scale to larger domains.\n\n∗Istituto Dalle Molle di studi sull’Intelligenza Artificiale (IDSIA)\n†Scuola universitaria professionale della Svizzera italiana (SUPSI)\n‡Università della Svizzera italiana (USI)\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nNie et al. (2015) proposed the method S2. It exploits the notion of k-tree, which is an undirected maximal graph with treewidth k. A Bayesian network whose moral graph is a subgraph of a k-tree has treewidth bounded by k. S2 is an iterative algorithm; each iteration consists of two steps: a) sampling uniformly a k-tree from the space of k-trees and b) recovering a DAG whose moral graph is a subgraph of the most promising sampled k-tree. The goodness of the k-tree is assessed via a so-called informative score. Nie et al. (2016) further refine this idea, obtaining via A* the k-tree which maximizes the informative score; this algorithm is called S2+.\nRecent structural learning algorithms with unbounded treewidth (Scanagatta et al., 2015) can cope with thousands of variables, yet the unbounded treewidth provides no guarantee about the tractability of inference in the learned models. We aim at filling this gap, learning treewidth-bounded Bayesian network models in domains with thousands of variables.\nWe propose two novel methods for learning Bayesian networks with bounded treewidth.\n
They exploit the fact that any k-tree can be constructed by an iterative procedure that adds one variable at a time. We propose an iterative procedure that, given an order on the variables, builds a DAG G by adding one variable at a time. The moral graph of G is ensured to be a subgraph of a k-tree. The k-tree is designed so as to maximize the score of the resulting DAG. This is a major difference with respect to previous works (Nie et al., 2015, 2016), in which the k-trees were randomly sampled. We propose both an exact and an approximate variant of our algorithm; the latter is necessary to scale to thousands of variables.\nWe show that the search space of the presented algorithms does not span the whole space of bounded-treewidth DAGs. Yet our algorithms consistently outperform the state-of-the-art competitors for structural learning with bounded treewidth. For the first time we present experimental results for structural learning with bounded treewidth on domains involving up to ten thousand variables. Software and supplementary material are available from http://blip.idsia.ch.\n2 Structural learning\nConsider the problem of learning the structure of a Bayesian network from a complete data set of N instances D = {D1, ..., DN}. The set of n categorical random variables is X = {X1, ..., Xn}. The goal is to find the best DAG G = (V, E), where V is the collection of nodes and E is the collection of arcs. E can be represented by the set of parents Π1, ..., Πn of each variable.\nDifferent scores can be used to assess the fit of a DAG; we adopt the Bayesian information criterion (or simply BIC).\n
The BIC score is decomposable, being the sum of the scores of the individual variables:\n\nBIC(G) = \sum_{i=1}^{n} BIC(X_i, \Pi_i) = \sum_{i=1}^{n} \left( LL(X_i | \Pi_i) + Pen(X_i, \Pi_i) \right) = \sum_{i=1}^{n} \left( \sum_{\pi \in |\Pi_i|, x \in |X_i|} N_{x,\pi} \log \hat{\theta}_{x|\pi} - \frac{\log N}{2} (|X_i| - 1)|\Pi_i| \right)\n\nwhere \hat{\theta}_{x|\pi} is the maximum likelihood estimate of the conditional probability P(Xi = x | Πi = π), N_{x,π} represents the number of times (Xi = x ∧ Πi = π) appears in the data set, and | · | indicates the size of the Cartesian product space of the variables given as argument. Thus |Xi| is the number of states of Xi and |Πi| is the product of the numbers of states of the parents of Xi.\nExploiting decomposability, we first identify independently for each variable a list of candidate parent sets (parent set identification). Later, we select for each node the parent set that yields the highest-scoring treewidth-bounded DAG (structure optimization).\n2.1 Treewidth and k-trees\nWe illustrate the concept of treewidth following the notation of Elidan and Gould (2009). We denote an undirected graph as H = (V, E), where V is the vertex set and E is the edge set. A tree decomposition of H is a pair (C, T) where C = {C1, C2, ..., Cm} is a collection of subsets of V and T is a tree on C, so that:\n• ∪_{i=1}^{m} Ci = V;\n• for every edge which connects the vertices v1 and v2, there is a subset Ci which contains both v1 and v2;\n• for all i, j, k in {1, 2, ..., m}: if Cj is on the path between Ci and Ck in T, then Ci ∩ Ck ⊆ Cj.\nThe width of a tree decomposition is max(|Ci|) − 1, where |Ci| is the number of vertices in Ci. The treewidth of H is the minimum width among all possible tree decompositions of H.\nThe treewidth can be equivalently defined in terms of a triangulation of H.\n
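As a side illustration of the decomposable BIC score defined above, a single term BIC(Xi, Πi) can be computed directly from counts. This is a minimal sketch under assumed conventions (the list-of-dicts data layout and the function name are ours, not the authors' implementation):

```python
import math
from collections import Counter

def bic_term(data, child, parents, card):
    """BIC(X, Pi): sum over (pi, x) of N_{x,pi} * log(theta_{x|pi}),
    minus (log N / 2) * (|X| - 1) * |Pi|, with theta the ML estimate.
    `data` is a list of dicts mapping variable -> state; `card` maps
    variable -> number of states."""
    N = len(data)
    # N_{x,pi}: joint counts of (parent configuration, child state)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    # N_{pi}: marginal counts of the parent configuration
    marg = Counter(tuple(r[p] for p in parents) for r in data)
    # log-likelihood with theta_{x|pi} = N_{x,pi} / N_{pi}
    ll = sum(n * math.log(n / marg[pi]) for (pi, x), n in joint.items())
    # penalty: (|X|-1) free parameters per joint parent state
    pen = (math.log(N) / 2) * (card[child] - 1) * math.prod(card[p] for p in parents)
    return ll - pen
```

With an empty parent set, `math.prod` of no factors is 1, so the penalty reduces to (log N / 2)(|Xi| − 1), consistent with the formula above.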
A triangulated graph is an undirected graph in which every cycle of length greater than three contains a chord. The treewidth of a triangulated graph is the size of its maximum clique minus one. The treewidth of H is the minimum treewidth over all possible triangulations of H.\nThe treewidth of a Bayesian network is characterized with respect to all possible triangulations of its moral graph. The moral graph M of a DAG is an undirected graph that includes an edge i − j for every arc i → j in the DAG and an edge p − q for every pair of arcs p → i, q → i in the DAG. The treewidth of a DAG is the minimum treewidth over all possible triangulations of its moral graph M. Thus the size of the maximum clique of any moralized triangulation of G, minus one, is an upper bound on the treewidth of the model.\nk-trees An undirected graph Tk = (V, E) is a k-tree if it is a maximal graph of treewidth k: any edge added to Tk increases its treewidth. A k-tree is inductively defined as follows (Patil, 1986). Consider a (k + 1)-clique, namely a complete graph over k + 1 nodes. A (k + 1)-clique is a k-tree; it contains multiple k-cliques. Let us denote by z a node not yet included in the list of vertices V. Then the graph obtained by connecting z to every node of a k-clique of Tk is also a k-tree. The treewidth of any subgraph of a k-tree (partial k-tree) is bounded by k. Thus a DAG whose moral graph is a subgraph of a k-tree has treewidth bounded by k.\n3 Incremental treewidth-bounded structure learning\nOur approach for the structure optimization task proceeds by repeatedly sampling an order ≺ over the variables and then identifying the highest-scoring DAG with bounded treewidth consistent with the order.\n
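The inductive k-tree construction from Section 2.1 can be sketched directly. The following is a hedged illustration (function and variable names are ours): start from a (k + 1)-clique and repeatedly connect a new node to all vertices of an existing k-clique.

```python
import itertools
import random

def grow_ktree(n, k, seed=0):
    """Inductively build a k-tree over nodes 0..n-1 (Patil, 1986):
    start from a (k+1)-clique, then repeatedly connect a new node z
    to every vertex of an existing k-clique."""
    rng = random.Random(seed)
    # initial (k+1)-clique and its k-cliques
    edges = set(itertools.combinations(range(k + 1), 2))
    kcliques = [frozenset(c) for c in itertools.combinations(range(k + 1), k)]
    for z in range(k + 1, n):
        c = rng.choice(kcliques)                          # k-clique hosting z
        edges |= {tuple(sorted((z, v))) for v in c}       # connect z to all of c
        # {z} union c is a new (k+1)-clique; record its k-cliques containing z
        kcliques += [frozenset(({z} | c) - {v}) for v in c]
    return edges
```

A k-tree on n nodes has k·n − k(k + 1)/2 edges: the initial clique contributes k(k + 1)/2 edges and each of the n − k − 1 added nodes contributes k more, which gives a quick sanity check on the construction.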
An effective approach for structural learning based on order sampling has been introduced by Teyssier and Koller (2012); however, it does not enforce any treewidth constraint. The size of the search space of orders is n!; this is smaller than the search space of k-trees, O(e^{n log(nk)}). Once the order ≺ is sampled, we incrementally learn the DAG. At each iteration the moralization of the DAG is by design a subgraph of a k-tree, so the treewidth of the DAG eventually obtained is bounded by k. The algorithm proceeds as follows.\nInitialization The initial k-tree Kk+1 is constituted by the complete clique over the first k + 1 variables in the order. The initial DAG Gk+1 is learned over the same k + 1 variables. Since k + 1 is a tractable number of variables, we learn Gk+1 exactly, adopting the method of Cussens (2011). The moral graph of Gk+1 is a subgraph of Kk+1, and thus Gk+1 has bounded treewidth.\nAddition of the subsequent nodes We then iteratively add each remaining variable in the order. Consider the next variable in the order, X≺i, where i ∈ {k + 2, ..., n}. Let us denote by Gi−1 and Ki−1 the DAG and the k-tree which have to be updated by adding X≺i. We add X≺i to Gi−1, constraining its parent set Π≺i to be (a subset of) a k-clique in Ki−1. This yields the updated DAG Gi. We then update the k-tree, connecting X≺i to that k-clique. This yields the k-tree Ki; it contains an additional (k + 1)-clique compared to Ki−1. By construction, Ki is also a k-tree. The moralization of X≺i adds no arcs outside this (k + 1)-clique; thus the moral graph of Gi is a subgraph of Ki.\nPruning orders The initial k-tree Kk+1 and the initial DAG Gk+1 depend on which are the first k + 1 variables in the order, but not on their relative positions.\n
Thus all the orders which differ only in the relative positions of the first k + 1 elements are equivalent for our algorithm: they yield the same Kk+1 and Gk+1. Thus once we sample an order and perform structural learning, we prune the (k + 1)! − 1 orders which are equivalent to the current one.\nIn order to choose the parent set to be assigned to each variable added to the graph we propose two algorithms: k-A* and k-G.\n3.1 k-A*\nWe formulate the problem as a shortest-path problem. We define each state as a step towards the completion of the structure, in which a new variable is added to the DAG G. Given X≺i, the variable assigned in state S, we define a successor state of S for each k-clique to which we can link X≺i+1. We solve the problem through a path-finding A* search, with cost function for state S defined as f(S) = g(S) + h(S). The goal is to find the state which minimizes f(S) once all variables have been assigned.\nWe define g(S) and h(S) as:\n\ng(S) = \sum_{j=0}^{i} score(X_{\prec j}, \Pi_{\prec j}) ,  h(S) = \sum_{j=i+1}^{n} best(X_{\prec j}) .\n\ng(S) is the cost from the initial state to S; it corresponds to the sum of the scores of the already assigned parent sets. h(S) is the estimated cost from S to the goal; it is the sum of the best assignable parent sets for the remaining variables. Variable Xa can have Xb as parent only if Xb ≺ Xa.\nThe A* approach requires the h function to be admissible. The function h is admissible if the estimated cost is never greater than the true cost to the goal state.\n
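The f = g + h search just described can be sketched with a generic A* routine. This is a hedged sketch: the `successors` and `h` used below in the usage example are toy stand-ins for the k-clique expansion and the best-parent-set heuristic, with costs playing the role of (negated) scores so that lower is better.

```python
import heapq

def astar(start, is_goal, successors, h):
    """Generic A*: expand the state minimizing f = g + h.
    With h admissible and consistent, the first goal state popped is
    optimal and no state ever needs to be re-expanded."""
    frontier = [(h(start), 0.0, start)]
    closed = set()
    while frontier:
        f, g, state = heapq.heappop(frontier)
        if is_goal(state):
            return g, state
        if state in closed:
            continue
        closed.add(state)
        for nxt, cost in successors(state):
            if nxt not in closed:
                heapq.heappush(frontier, (g + cost + h(nxt), g + cost, nxt))
    return None
```

In the paper's setting, a state would be a prefix of the order with chosen parent sets, the step cost the (negated) score of the parent set assigned to the next variable, and h the sum of the best still-assignable parent-set costs for the remaining variables, which by construction never exceeds the true remaining cost.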
Our approach satisfies this property since the true cost of each step (the score of the parent set chosen for X≺i+1) is always equal to or greater than the estimated one (the score of the best selectable parent set for X≺i+1).\nThe previous discussion implies that h is consistent, meaning that for any state S and any successor T of S, h(S) ≤ h(T) + c(S, T), where c(S, T) is the cost of the edges added in T. The function f is therefore monotonically non-decreasing along any path, and the algorithm is guaranteed to find the optimal path as long as the goal state is reachable. Additionally, there is no need to process a state more than once, as no state is ever reached a second time with a lower cost.\n3.2 k-G\nA very high number of variables might prevent the use of k-A*. For those cases we propose k-G, a greedy alternative approach which chooses at each step the best local parent set. Given the set KC of existing k-cliques in K, we choose as parent set for X≺i:\n\n\Pi_{X_{\prec i}} = \operatorname{argmax}_{\pi \subseteq c,\, c \in KC} score(\pi) .\n\n3.3 Space of learnable DAGs\nA reverse topological order is an order {v1, ..., vn} over the vertices V of a DAG in which each vi appears before its parents Πi. The search space of our algorithms is restricted to the DAGs whose reverse topological order, when used as variable elimination order, has treewidth bounded by k. This prevents recovering DAGs which have bounded treewidth but lack this property.\nWe start by proving by induction that the reverse topological order has treewidth bounded by k in the DAGs recovered by our algorithms. Consider the incremental construction of the DAG previously discussed. The initial DAG Gk+1 is induced over k + 1 variables; thus every elimination ordering has treewidth bounded by k.\nFor the inductive case, assume that Gi−1 satisfies the property. Consider the next variable in the order, X≺i, where i ∈ {k + 2, ..., n}.\n
Its parent set Π≺i is a subset of a k-clique in Ki−1. The only neighbors of X≺i in the updated DAG Gi are its parents Π≺i. Consider performing variable elimination on the moral graph of Gi, using a reverse topological order. Then X≺i is eliminated before Π≺i, without introducing fill-in edges. Thus the treewidth associated with any reverse topological order is bounded by k. This property inductively extends to the addition of the following nodes, up to X≺n.\nInverted trees An example of a DAG not recoverable by our algorithms is the specific class of polytrees that we call inverted trees, that is, DAGs with out-degree equal to one. An inverted tree with m levels and treewidth k can be built as follows. Take the root node (level one) and connect it to k child nodes (level two). Connect each node of level two to k child nodes (level three). Proceed in this way up to the m-th level, and then invert the direction of all the arcs.\nFigure 1 shows an inverted tree with k = 2 and m = 3. It has treewidth two, since its moral graph is constituted by the cliques {A,B,E}, {C,D,F}, {E,F,G}. The treewidth associated with the reverse topological order is instead three, using the order G, F, D, C, E, A, B.\n\nFigure 1: Example of inverted tree.\n\nIf we run our algorithms with bounded treewidth k = 2, they will be unable to recover the actual inverted tree; they will instead identify a high-scoring DAG whose reverse topological order has treewidth 2.\n4 Experiments\nWe compare k-A*, k-G, S2, S2+ and TWILP in various experiments, through an indicator which we call W-score: the percentage of worsening of the BIC score of the given treewidth-bounded method compared to the score of the Gobnilp solver (Cussens, 2011).\n
Gobnilp achieves higher scores than the treewidth-bounded methods since it has no limit on the treewidth. Let us denote by G the BIC score achieved by Gobnilp and by T the BIC score obtained by the given treewidth-bounded method; notice that both G and T are negative. The W-score is\n\nW = \frac{G - T}{G} .\n\nW stands for worsening, and thus lower values of W are better. The lowest value of W is zero, while there is no upper bound.\nWe adopt BIC as the scoring function. The reason is that an algorithm for approximate exploration of the parent sets (Scanagatta et al., 2015), which allows a high in-degree even on large domains, exists at the moment only for BIC.\n4.1 Parent set score exploration\nBefore performing structural learning it is necessary to compute the scores of the candidate parent sets for each node (parent set exploration). The different structural learning methods are then provided with the same parent set scores.\nA treewidth k implies that one should explore all the parent sets up to size k; thus the complexity of parent set exploration increases exponentially with the treewidth. To let the parent set exploration scale efficiently with large treewidths and large numbers of variables we apply the approach of Scanagatta et al. (2015). It guides the exploration towards the most promising parent sets (with size up to k) without scoring them all, on the basis of an approximate score function that is computed in constant time; the actual score of the most promising parent sets is eventually computed. We allow 60 seconds for the computation of the scores of the parent sets of each variable, in each data set.\n4.2 Our implementation of S2 and S2+\nHere we provide some details of our implementation of S2 and S2+. The second phase of both S2 and S2+ looks for a DAG whose moralization is a subgraph of a chosen k-tree. For this task Nie et al. (2014) adopt an approximate approach based on partial order sampling (Algorithm 2).\n
We found that using Gobnilp for this task consistently yields slightly higher scores; thus we adopt this approach in our implementation. We believe this is because constraining the structure optimization to a subgraph of a k-tree leaves only a small number of allowed arcs for the DAG. Thus our implementation of S2 and S2+ finds the highest-scoring DAG whose moral graph is a subgraph of the provided k-tree.\n4.3 Learning inverted trees\nAs already discussed, our approach cannot learn an inverted tree with k parents per node if the given treewidth bound is k. In this section we study this worst-case scenario.\nWe start with treewidth k = 2. We consider the numbers of variables n ∈ {21, 41, 61, 81, 101}. For each value of n we generate 5 different inverted trees. To generate an inverted tree we first select a root variable X and add k parents to it as ΠX; then we continue by randomly choosing a leaf of the graph (at a generic iteration, there are leaves at different distances from X) and adding k parents to it, until the graph contains n variables.\nAll variables are binary and we sample their conditional probability tables from a Beta(1,1). We sample 10,000 instances from each generated inverted tree.\nWe then perform structural learning with k-A*, k-G, S2, S2+ and TWILP, setting k = 2 as the limit on the treewidth. We allow each method to run for ten minutes. S2, S2+ and TWILP could in principle recover the true structure, which our algorithms cannot. The results are shown in Fig. 2. Qualitatively similar results are obtained repeating the experiments with k = 4.\n\nFigure 2: Structural learning results when the actual DAGs are inverted trees (k = 2). Each point represents the mean W-score over 5 experiments.\n
Lower values of the W-score are better.\n\nDespite the unfavorable setting, both k-G and k-A* yield DAGs with higher scores than S2, S2+ and TWILP, consistently for each value of n. For n = 21 they found a close approximation to the optimal graph. S2, S2+ and TWILP found different structures, with close scores.\nThus the limitation of the space of learnable DAGs does not hurt the performance of k-G and k-A*. In fact S2 could theoretically recover the actual DAG, but this is not feasible in practice as it would require a prohibitive number of samples from the space of k-trees. The exact solver TWILP was unable to find the exact solution within the time limit; it thus returned the best solution achieved within the time limit.\n\n            S2        S2+       k-G       k-A*\nIterations  803150    3         7176      66\nMedian      −273600   −267921   −261648   −263250\nMax         −271484   −266593   −258601   −261474\n\nTable 1: Statistics of the solutions yielded by the different methods on an inverted tree (n = 100, k = 4).\n\nWe further investigate the differences between the methods in Table 1. Iterations is the number of proposed solutions: for S2 and S2+ it corresponds to the number of explored k-trees, while for k-G and k-A* it corresponds to the number of explored orders. During its execution, S2 samples almost one million k-trees; yet it yields the lowest-scoring DAGs among the different methods. This can be explained by considering that a randomly sampled k-tree has a low chance to cover a high-scoring DAG. S2+ recovers only a few k-trees, but their scores are higher than those of S2. Thus the informative score is effective at driving the search for good k-trees; yet it does not scale to large data sets, as we will see later. As for our methods, k-G samples a larger number of orders than k-A* does, and this allows it to achieve higher scores even if it deals sub-optimally with each single order. These statistics show a similar pattern in the next experiments as well.\n\nDATASET VAR.\n
GOBNILP\n\nS2\n\nS2+\n\nTWILP\n\nk-A*\nk-G\n\u221272159\n-72159 \u221272159 \u221272159 \u221272159 \u221272159\n\u22122698\n\u22122698\n-2698\n\u22123203\n-3185\n-3206\n-200431 \u2212200363\n-200142\n-183369 \u2212183241\n-181748\n\u2212613\n-608\n\u221255785\n-53104\n\u22127088\n-6919\n\u22122185\n-2173\n\u221282003\n-77555\n\u22121279\n-1277\n\n\u22122698\n-3252\n-201235\n-189539\n-620\n-68670\n-7213\n-2283\n-107252\n-1641\n\n\u22122698\n-3247\n-200926\n-186815\n-619\n-64769\n-7209\n-2208\n-88350\n-1427\n\n\u22122698\n-3213\n-200340\n-190086\n-620\n-68298\n-7190\n-2277\n\n-615\n-57021\n-7109\n-2201\n-82633\n-1284\n\nnursery\nbreast\nhousing\nadult\nletter\nzoo\n\nmushroom\n\nwdbc\naudio\n\nhill\n\ncommunity\n\n9\n10\n14\n15\n17\n17\n22\n31\n62\n100\n100\n\nTable 2: Comparison of the BIC scores yielded by the different algorithms on the data sets analyzed by Nie et al. (2016). The highest-scoring solution with limited treewidth is boldfaced. In the first column we report the score obtained by Gobnilp without bound on the treewidth.\n\n4.4 Small data sets\nWe now present experiments on the data sets considered by Nie et al. (2016), which involve up to 100 variables. We set the bounded treewidth to k = 4 and allow each method to run for ten minutes. We perform 10 experiments on each data set and report the median scores in Table 2.\nOn the smallest data sets all methods (including Gobnilp) achieve the same score. As the data sets become larger, both k-A* and k-G achieve higher scores than S2, S2+ and TWILP (which does not reach the exact solution within the time limit). Between our two novel algorithms, k-A* has a slight advantage over k-G.\n4.5 Large data sets\nWe now consider 10 large data sets (100 ≤ n ≤ 400), listed in Table 3. We no longer run TWILP, as\n
We no longer run TWILP, as\nit is unable to handle this number of variables.\n\nData set\nAudio\nJester\n\nData set\nNet\ufb02ix\n\nn\n100\n100 Accidents\nTable 3: Large data sets sorted according to the number of variables.\n\nn Data set\n100\nRetail\n111 Kosarek\n\nn Data set\n135\nAndes\n190 MSWeb\n\nn\n223\n294\n\nDNA\n\nData set\n\nPumsb-star\n\nn\n163\n180\n\nk-A*\n29/20/24\n\nS2\n30/30/29\n29/27/20\n\nS2+\n30/30/30\n29/27/21\n12/13/30\n\nk-G\nk-A*\nS2\n\nTable 4: Result on the 30 experiments on large data sets. Each cell report how many times the row\nalgorithm yields a higher score than the column algorithm for treewidth 2/5/8. For instance k-G wins\non all the 30 data sets against S2+ for each considered treewidth.\n\nWe consider the following treewidths: k \u2208 {2, 5, 8}. We split each data set randomly into three\nsubsets. Thus for each treewidth we run 10\u00b73=30 structural learning experiments.\nWe let each method run for one hour. For S2+, we adopt a more favorable approach, allowing it to\nrun for one hour; if after one hour the \ufb01rst k-tree was not yet solved, we allow it to run until it has\nsolved the \ufb01rst k-tree.\nIn Table 4 we report how many times each method wins against another for each treewidth, out\nof 30 experiments. The entries are boldfaced when the number of victories of an algorithm over\nanother is statistically signi\ufb01cant (p-value <0.05) according to the sign-test. Consistently for any\nchosen treewidth, k-G is signi\ufb01cantly better than any competitor, including k-A*; moreover, k-A* is\nsigni\ufb01cantly better than both S2 and S2+.\n\n7\n\n\fThis can be explained by considering that k-G explores more orders than k-A*, as for a given order it\nonly \ufb01nds an approximate solution. 
The results suggest that it is more important to explore many orders than to obtain the optimal DAG given an order.\n4.6 Very large data sets\nFinally, we consider 14 very large data sets, containing between 400 and 10000 variables. We split each data set randomly into three subsets; we thus perform 14·3 = 42 structural learning experiments with each algorithm.\nWe include three randomly generated synthetic data sets containing 2000, 4000 and 10000 variables respectively. These networks have been generated using the software BNGenerator.4 Each variable has a number of states randomly drawn from 2 to 4 and a number of parents randomly drawn from 0 to 6.\n\nData set (n): Diabets (413), Pigs (441), Book (500), EachMovie (500), Link (724), WebKB (839), Reuters-52 (889), C20NG (910), Munin (1041), BBC (1058), Ad (1556), R2 (2000), R4 (4000), R10 (10000).\nTable 5: Very large data sets sorted according to the number n of variables.\n\nWe let each method run for one hour. The only two algorithms able to cope with these data sets are k-G and S2: in all the experiments, both k-A* and S2+ fail to find even a single solution within the allowed time limit (we verified this is not due to memory issues). Between the two remaining methods, k-G wins 42 times out of 42; this dominance is clearly significant, and the result is consistently found under each choice of treewidth (k = 2, 5, 8). On average, the improvement of k-G over S2 fills about 60% of the gap which separates S2 from the unbounded solver.\n\nFigure 3: Boxplots of the W-scores, summarizing the results over the 14·3 = 42 structural learning experiments on very large data sets. Lower W-scores are better. The y-axis is shown in logarithmic scale.\n
In the label of the x-axis we also report the adopted treewidth for each method: 2, 5 or 8.\n\nThe W-scores of these 42 structural learning experiments are summarized in Figure 3. For both S2 and k-G, a larger treewidth allows recovering a higher-scoring graph, which in turn decreases the W-score. However, k-G scales better than S2 with respect to the treewidth: its W-score decreases more sharply as the treewidth increases. For S2 the difference between the treewidths appears negligible in the figure; this is due to the fact that the graphs it learns are actually sparse.\nFurther experimental documentation, including how the scores achieved by the algorithms evolve over time, is available from http://blip.idsia.ch.\n5 Conclusions\nOur novel approaches for treewidth-bounded structure learning scale effectively both in the number of variables and in the treewidth, outperforming the competitors.\nAcknowledgments\nWork partially supported by the Swiss NSF grants 200021_146606 / 1 and IZKSZ2_162188.\n\n4http://sites.poli.usp.br/pmr/ltd/Software/BNGenerator/\n\nReferences\nBerg J., Järvisalo M., and Malone B. Learning optimal bounded treewidth Bayesian networks via maximum satisfiability. In AISTATS-14: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014.\nCussens J. Bayesian network learning with cutting planes. In UAI-11: Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence, pages 153–160. AUAI Press, 2011.\nElidan G. and Gould S. Learning bounded treewidth Bayesian networks. In Advances in Neural Information Processing Systems 21, pages 417–424. Curran Associates, Inc., 2009.\nKorhonen J. H. and Parviainen P. Exact learning of bounded tree-width Bayesian networks. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, pages 370–378. JMLR W&CP 31, 2013.\nKwisthout J. H. P., Bodlaender H. L., and van der Gaag L. C.\n
The necessity of bounded treewidth for efficient inference in Bayesian networks. In ECAI-10: Proceedings of the 19th European Conference on Artificial Intelligence, 2010.\nNie S., Mauá D. D., de Campos C. P., and Ji Q. Advances in learning Bayesian networks of bounded treewidth. In Advances in Neural Information Processing Systems, pages 2285–2293, 2014.\nNie S., de Campos C. P., and Ji Q. Learning bounded tree-width Bayesian networks via sampling. In ECSQARU-15: Proceedings of the 13th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, pages 387–396, 2015.\nNie S., de Campos C. P., and Ji Q. Learning Bayesian networks with bounded treewidth via guided search. In AAAI-16: Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.\nParviainen P., Farahani H. S., and Lagergren J. Learning bounded tree-width Bayesian networks using integer linear programming. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, 2014.\nPatil H. P. On the structure of k-trees. Journal of Combinatorics, Information and System Sciences, pages 57–64, 1986.\nScanagatta M., de Campos C. P., Corani G., and Zaffalon M. Learning Bayesian networks with thousands of variables. In NIPS-15: Advances in Neural Information Processing Systems 28, pages 1855–1863, 2015.\nTeyssier M. and Koller D. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. CoRR, abs/1207.1429, 2012.\n", "award": [], "sourceid": 820, "authors": [{"given_name": "Mauro", "family_name": "Scanagatta", "institution": "Idsia"}, {"given_name": "Giorgio", "family_name": "Corani", "institution": "Idsia"}, {"given_name": "Cassio", "family_name": "de Campos", "institution": "Queen's University Belfast"}, {"given_name": "Marco", "family_name": "Zaffalon", "institution": "IDSIA"}]}