{"title": "Bayesian Structure Learning by Recursive Bootstrap", "book": "Advances in Neural Information Processing Systems", "page_first": 10525, "page_last": 10535, "abstract": "We address the problem of Bayesian structure learning for domains with hundreds of variables by employing non-parametric bootstrap, recursively. We propose a method that covers both model averaging and model selection in the same framework. The proposed method deals with the main weakness of constraint-based learning---sensitivity to errors in the independence tests---by a novel way of combining bootstrap with constraint-based learning. Essentially, we provide an algorithm for learning a tree, in which each node represents a scored CPDAG for a subset of variables and the level of the node corresponds to the maximal order of conditional independencies that are encoded in the graph. As higher order independencies are tested in deeper recursive calls, they benefit from more bootstrap samples, and therefore are more resistant to the curse-of-dimensionality. Moreover, the re-use of stable low order independencies allows greater computational efficiency. We also provide an algorithm for sampling CPDAGs efficiently from their posterior given the learned tree. That is, not from the full posterior, but from a reduced space of CPDAGs encoded in the learned tree. We empirically demonstrate that the proposed algorithm scales well to hundreds of variables, and learns better MAP models and more reliable causal relationships between variables, than other state-of-the-art-methods.", "full_text": "Bayesian Structure Learning by Recursive Bootstrap\n\nRaanan Y. 
Rohekar\u2217\n\nIntel AI Lab\n\nraanan.yehezkel@intel.com\n\nYaniv Gurwicz\u2217\nIntel AI Lab\n\nyaniv.gurwicz@intel.com\n\nShami Nisimov\u2217\nIntel AI Lab\n\nshami.nisimov@intel.com\n\nGuy Koren\nIntel AI Lab\n\nguy.koren@intel.com\n\nGal Novik\nIntel AI Lab\n\ngal.novik@intel.com\n\nAbstract\n\nWe address the problem of Bayesian structure learning for domains with hundreds\nof variables by employing non-parametric bootstrap, recursively. We propose a\nmethod that covers both model averaging and model selection in the same frame-\nwork. The proposed method deals with the main weakness of constraint-based\nlearning\u2014sensitivity to errors in the independence tests\u2014by a novel way of combin-\ning bootstrap with constraint-based learning. Essentially, we provide an algorithm\nfor learning a tree, in which each node represents a scored CPDAG for a subset of\nvariables and the level of the node corresponds to the maximal order of conditional\nindependencies that are encoded in the graph. As higher order independencies are\ntested in deeper recursive calls, they bene\ufb01t from more bootstrap samples, and\ntherefore are more resistant to the curse-of-dimensionality. Moreover, the re-use\nof stable low order independencies allows greater computational ef\ufb01ciency. We\nalso provide an algorithm for sampling CPDAGs ef\ufb01ciently from their posterior\ngiven the learned tree. That is, not from the full posterior, but from a reduced\nspace of CPDAGs encoded in the learned tree. We empirically demonstrate that\nthe proposed algorithm scales well to hundreds of variables, and learns better\nMAP models and more reliable causal relationships between variables, than other\nstate-of-the-art-methods.\n\n1\n\nIntroduction\n\nBayesian networks (BN) are probabilistic graphical models, commonly used for probabilistic infer-\nence, density estimation, and causal modeling (Darwiche, 2009; Pearl, 2009; Murphy, 2012; Spirtes\net al., 2000). 
The graph of a BN is a DAG over random variables, encoding conditional independence assertions. Learning this DAG structure, G, from data, D, has been a fundamental problem for the past two decades. Often, it is desired to learn an equivalence class (EC) of DAGs, that is, a CPDAG. DAGs in an EC are Markov equivalent; that is, given an observed dataset, they are statistically indistinguishable and represent the same set of independence assertions (Verma & Pearl, 1990). Commonly, two main scenarios are considered. In one scenario, the posterior probability, P(G|D) (or some other structure scoring metric), peaks sharply around a single structure, G_MAP. Here, the goal is to find the highest-scoring structure—a maximum-a-posteriori (MAP) estimation (model selection). In a second scenario, several distinct structures have high posterior probabilities, which is common when the data size is small compared to the domain size (Friedman & Koller, 2003). In this case, learning model structure or causal relationships between variables using a single MAP model may give unreliable conclusions. Thus, instead of learning a single structure, graphs are sampled from the posterior probability, P(G|D), and the posterior probabilities of hypotheses-of-interest, e.g., structural features, f, are computed in a model-averaging manner.

∗Equal contribution.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Examples of structural features are: the existence of a directed edge from node X to node Y, X → Y; a Markov blanket feature, X ∈ MB(Y); and a directed path feature, X ⇝ Y. Another example is the computation of the posterior predictive probability, P(D_new|D). In the model selection scenario, it is equal to P(D_new|G_MAP). In the model averaging scenario, it is obtained by averaging over all the DAG structures, G. That is,

$$P(D_{\mathrm{new}} \mid D) = \sum_{G \in \mathcal{G}} P(D_{\mathrm{new}} \mid G)\, P(G \mid D).$$

The number of DAG structures is super-exponential in the number of nodes, $O(n!\, 2^{\binom{n}{2}})$, rendering an exhaustive search for an optimal DAG, or averaging over all DAGs, intractable for many real-world problems. In fact, it was shown that recovering an optimal DAG with a bounded in-degree is NP-hard (Chickering et al., 1995).

In this paper we propose: (1) an algorithm, called B-RAI, that learns a generative tree, T, for CPDAGs (equivalence classes), and (2) an efficient algorithm for sampling CPDAGs from this tree. The proposed algorithm, B-RAI, applies non-parametric bootstrap in a recursive manner, and combines CI-tests and scoring.

2 Related Work

Previously, two main approaches for structure learning were studied: score-based (search-and-score) and constraint-based. Score-based approaches combine a scoring function, such as BDe (Cooper & Herskovits, 1992), with a strategy for searching through the space of structures, such as greedy equivalence search (Chickering, 2002). Constraint-based approaches (Pearl, 2009; Spirtes et al., 2000) find the optimal structures in the large sample limit by testing conditional independence (CI) between pairs of variables. They are generally faster than score-based approaches, scale well for large domains, and have a well-defined stopping criterion (e.g., maximal order of conditional independence). However, these methods are sensitive to errors in the independence tests, especially in the case of high-order conditional-independence tests and small training sets. Some methods are a hybrid between the score-based and constraint-based methods and have been empirically shown to have superior performance (Tsamardinos et al., 2006).

Recently, important advances have been reported for finding optimal solutions.
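The super-exponential count mentioned in the introduction can be checked exactly: the number of labeled DAGs on n nodes satisfies Robinson's recurrence. A minimal sketch, not part of B-RAI itself (`num_dags` is a hypothetical helper name):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Count labeled DAGs on n nodes via Robinson's recurrence:
    a(n) = sum_{k=1..n} (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k), with a(0) = 1,
    where k counts the nodes with no incoming edge."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print([num_dags(n) for n in range(1, 6)])  # → [1, 3, 25, 543, 29281]
```

Already at n = 10 the count exceeds 10^18, which is why exhaustive enumeration is abandoned in favor of the search and sampling schemes discussed in this section.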
Firstly, the efficiency in finding an optimal structure (MAP) has been significantly improved (Koivisto & Sood, 2004; Silander & Myllymäki, 2006; Jaakkola et al., 2010; Yuan & Malone, 2013). Secondly, several methods for finding the k-most likely structures have been proposed (Tian et al., 2010; Chen & Tian, 2014; Chen et al., 2016). Many of these advances are based on defining new search spaces and efficient search strategies for these spaces. Nevertheless, they are still limited to relatively small domains (up to 25 variables). Another class of methods is based on MCMC (Friedman & Koller, 2003; Eaton & Murphy, 2007; Grzegorczyk & Husmeier, 2008; Niinimäki & Koivisto, 2013; Su & Borsuk, 2016), in which graphs are sampled from the posterior distribution. However, there is no guarantee on the quality of the approximation in finite runs (the chain may not mix well or converge). Moreover, these methods have high computational costs, and, in practice, they are restricted to small domains.

3 Proposed Method

We propose learning a tree, T, by applying non-parametric bootstrap recursively, testing conditional independence, and scoring the leaves of T using a Bayesian score.

3.1 Recursive Autonomy Identification

We first briefly describe the RAI algorithm, proposed by Yehezkel & Lerner (2009), which, given a dataset, D, constructs a CPDAG in a recursive manner. RAI is a constraint-based structure learning algorithm. That is, it learns a structure by performing independence tests between pairs of variables conditioned on a set of variables (CI-tests). As illustrated in Figure 1, the CPDAG is constructed recursively, from level n = 0. In each level of recursion, the current CPDAG is first refined by removing edges between nodes that are independent conditioned on a set of size n, and directing the edges.
Then, the CPDAG is partitioned into (1) ancestor, X^(n)_Ai, and (2) descendant, X^(n)_D, groups. Each group is autonomous in that it includes the parents of its members (Yehezkel & Lerner, 2009). Further, each autonomous group from the n-th recursion level is independently partitioned, resulting in a new level, n + 1. Each such CPDAG (a subgraph over the autonomous set) is progressively partitioned (in a recursive manner) until a termination condition is satisfied (independence tests with condition set size n cannot be performed), at which point the resulting CPDAG (a subgraph) at that level is returned to its parent (the previous recursive call). Similarly, each group in its turn, at each recursion level, gathers back the CPDAGs (subgraphs) from the recursion level that followed it, and then returns itself to the recursion level that precedes it, until the highest recursion level, n = 0, is reached and the final CPDAG is fully constructed.

Figure 1: An example of an execution tree of RAI. An arrow indicates a recursive call. Each CPDAG is partitioned into an ancestor group, X^(n)_Ai, and a descendant group, X^(n)_D, and then each group is further partitioned, recursively, with n + 1. Each circle represents a distinct subset of variables (for example, X^(1)_A1 in different circles represents different subsets). Best viewed in color.

3.2 Uncertainty in Conditional Independence Tests

Constraint-based structure learning algorithms, such as RAI, are proved to recover the true underlying CPDAG when using an optimal independence test. In practice, independence is estimated from finite-size, noisy datasets.
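These independence decisions rest on an estimate of conditional mutual information computed from data, made precise in Equation 1. A minimal plug-in sketch of that estimate, for intuition only (the function and its argument layout are hypothetical, not the paper's implementation):

```python
import itertools
import math

def cmi(p, xs, ys, zs):
    """Plug-in conditional mutual information MI(X;Y|Z) from a joint table.

    p maps (x, y, z) to P(x, y, z), where z is a tuple with one entry per
    conditioning variable; xs, ys are value ranges, zs a tuple of ranges."""
    total = 0.0
    for x, y in itertools.product(xs, ys):
        for z in itertools.product(*zs):
            pxyz = p.get((x, y, z), 0.0)
            if pxyz == 0.0:
                continue  # zero-probability cells contribute nothing
            pz = sum(p.get((a, b, z), 0.0) for a in xs for b in ys)
            pxz = sum(p.get((x, b, z), 0.0) for b in ys)
            pyz = sum(p.get((a, y, z), 0.0) for a in xs)
            # P(x,y|z) / (P(x|z) P(y|z)) = P(x,y,z) P(z) / (P(x,z) P(y,z))
            total += pxyz * math.log(pxyz * pz / (pxz * pyz))
    return total

# X ⊥⊥ Y | Z by construction, so the plug-in estimate is numerically zero.
pz = {0: 0.4, 1: 0.6}
px_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}
py_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}
p = {(x, y, (z,)): pz[z] * px_z[z][x] * py_z[z][y]
     for x in (0, 1) for y in (0, 1) for z in (0, 1)}
print(abs(cmi(p, (0, 1), (0, 1), ((0, 1),))) < 1e-9)  # True
```

Because the joint here factorizes as P(z)P(x|z)P(y|z), the estimate vanishes; with finite data, the same quantity fluctuates around its true value, which is exactly the uncertainty the bootstrap is later used to model.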
For example, the dependency between X and Y conditioned on a set Z = {Z_i}_{i=1}^l is estimated by thresholding the conditional mutual information,

$$\widehat{MI}(X, Y \mid \mathbf{Z}) = \sum_{X}\sum_{Y}\sum_{Z_1}\cdots\sum_{Z_l} P(X, Y, \mathbf{Z}) \log \frac{P(X, Y \mid \mathbf{Z})}{P(X \mid \mathbf{Z})\, P(Y \mid \mathbf{Z})}, \qquad (1)$$

where the probabilities are estimated from a limited dataset D. Obviously, this measure suffers from the curse-of-dimensionality: for large condition set sizes, l, it becomes unreliable. The relation between the optimal conditional mutual information, MI, and the conditional mutual information, $\widehat{MI}$, estimated from a limited dataset D, is

$$\widehat{MI}(X, Y \mid \mathbf{Z}) = MI(X, Y \mid \mathbf{Z}) + \sum_{m=1}^{\infty} C_m + \epsilon, \qquad (2)$$

where $\sum_{m=1}^{\infty} C_m$ is an estimate of the average bias for limited data (Treves & Panzeri, 1995), and $\epsilon$ is a zero-mean random variable with unknown distribution. Lerner et al. (2013) proposed thresholding $\widehat{MI}$ with the leading term of the bias, C_1, to test independence. Nevertheless, there is still uncertainty in the estimation due to the unknown distribution of $\epsilon$, which may lead to erroneous independence assertions. One inherent limitation of the RAI algorithm, as well as other constraint-based algorithms, is its sensitivity to errors in independence testing. An error in an early stage may lead to additional errors in later stages. We propose modeling this uncertainty using non-parametric bootstrap.

The bootstrap principle is to approximate a population distribution by a sample distribution (Efron & Tibshirani, 1994). In its most common form, the bootstrap takes as input a data set D and an estimator ψ. To generate a sample from the bootstrapped distribution, a dataset D̃ of cardinality equal to that of D is sampled uniformly with replacement from D. The bootstrap sample estimate is then taken to be ψ(D̃). When this process is repeated several times, it produces several resampled datasets, estimators, and thereafter sample estimates, from which a final estimate can be made by MAP or model averaging (Friedman et al., 1999). The bootstrap is widely acclaimed as a great advance in applied statistics and even comes with theoretical guarantees (Bickel & Freedman, 1981).

We propose estimating the result of the n + 1 recursive call (ψ) for each autonomous group using non-parametric bootstrap.

3.3 Graph Generative Tree

We now describe a method for constructing a tree, T, from which CPDAGs can be sampled. In essence, we replace each node in the execution tree, as illustrated in Figure 1, with a bootstrap-node, as illustrated in Figure 2. In the bootstrap-node, for each autonomous group (X^(n)_Ai and X^(n)_D), s datasets, {D̃_t}_{t=1}^s, are sampled with replacement from the training data D, where |D̃_t| = |D|. This results in a recursive application of bootstrap. Finally, we calculate log[P(D|G)] for each leaf node in the tree (G is the CPDAG in the leaf), using a decomposable score,

$$\log[P(D \mid G)] = \sum_{i} \mathrm{score}(X_i \mid \pi_i; D), \qquad (3)$$

where π_i are the parents of node X_i—for example, the Bayesian score (Heckerman et al., 1995).

Figure 2: The bootstrap node. For each autonomous group, s recursive calls are performed. Each call uses a sampled dataset, D̃_t ∼ D, where t ∈ {1, ···, s}. Best viewed in color.

The recursive construction of T is described in Algorithm 1. The algorithm starts with condition set size n = 0, G a complete graph, and a set of exogenous variables X_ex = ∅. The set X_ex is exogenous to G and consists of the parents of X. First, an exit condition is tested (line 2). It is satisfied if there are not enough variables for a condition set of size n.
In this case, T is set to be a leaf node, and the input graph G is scored using the full training data (not the sampled data). It is important to note that only leaf nodes of T are scored. From this point, the recursive procedure will trace back, adding parent nodes to T.

Algorithm 1: Construct a graph generative tree, T
 1: R ← B-RAI(G, X, X_ex, n, D, D̃)
    Input: an initial CPDAG G over endogenous X & exogenous nodes X_ex, a desired resolution n, full training data D, and sampled training data D̃.
    Output: R, root of the graph generative tree T.
 2: if max_{Xi ∈ X}(|π_i| − 1) < n then            ▷ exit condition (test maximal indegree)
 3:     sc ← Score(G, D)
 4:     R ← a leaf node with content (G, sc)
 5:     return R
 6: G′ ← IncreaseResolution(G, n, D̃)              ▷ n-th order independencies
 7: {X_D, X_A1, ..., X_AK} ← SplitAutonomous(X, G′)   ▷ identify autonomies
 8: R ← a new root node                            ▷ a bootstrap node
 9: for t ∈ 1...s do                               ▷ bootstrap
10:     D̃′ ← sample with replacement from D
11:     for i ∈ 1...K do
12:         R_t^Ai ← B-RAI(G′, X_Ai, X_ex, n + 1, D, D̃′)              ▷ a recursive call
13:         set R_t^Ai to be the child of R with label Anc_t^i
14:     R_t^D ← B-RAI(G′, X_D, X_ex ∪ {X_Ai}_{i=1}^K, n + 1, D, D̃′)   ▷ a recursive call
15:     set R_t^D to be the child of R with label Dec_t
16: return R

The procedure IncreaseResolution (line 6) disconnects conditionally independent variables in two steps. First, it tests dependency between X_ex and X, i.e., X ⊥⊥ X′ | S for every connected pair X ∈ X and X′ ∈ X_ex, given a condition set S ⊂ {X_ex ∪ X} of size n. Next, it tests
Next, it tests\ndependencies within X, i.e., Xi \u22a5\u22a5 Xj|S for every connected pair, Xi, Xj \u2208 X, given a condition\nset S \u2282 {X ex \u222a X} of size n. After removing the corresponding edges, the remaining edges are\ndirected by applying two rules (Pearl, 2009; Spirtes et al., 2000). First, v-structures are identi\ufb01ed\nand directed. Then, edges are continually directed, by avoiding the creation of new v-structures and\ndirected cycles, until no more edges can be directed. Following the terminology of Yehezkel & Lerner\n(2009), we say that G(cid:48) is set by increasing the graph d-separation resolution from n \u2212 1 to n.\nThe procedure SplitAutonomous (line 7) identi\ufb01es autonomous sets, one descendant set, X D, and\nK ancestor sets, X A1, . . . , X AK in two steps. First, the variables having the lowest topological\norder (the highest indexes in a topological sort) are grouped into X D. Speci\ufb01cally, X D consists\nof all the nodes without outgoing directed edges (undirected edges between them may be present).\nThen, X D is removed (temporarily) from G(cid:48) revealing unconnected sub-structures. The number of\nunconnected sub-structures is denoted by K and the nodes set of each sub-structure is denoted by\nX Ai (i \u2208 {1 . . . K}).\nAn autonomous set in G(cid:48) includes all its nodes\u2019 parents (complying with the Markov property) and\ntherefore a sub-tree can further be constructed independently, using a recursive call with n + 1. First,\ns datasets are sampled from D (line 10) and the algorithm is called recursively for each dataset and\nfor each autonomous set (for ancestor sets in line 12, and descendant set in line 14). This recursive\ndecomposition of X is similar to that of RAI (Figure 1). The result of each recursive call is a tree.\nThese trees are merged into a single tree, T , by setting a common parent node, R (line 8), for the\nroots of each subtree (line 15). 
From the resulting tree, CPDAGs can be generated (sampled), as described in the next section. Thus, we call it a graph generative tree (GGT).

Complexity. The computational complexity of the two main operations, CI tests and scoring, is analyzed. The number of CI tests performed by B-RAI has a complexity of $O(n^k s^{k+1})$, where s is the number of splits, n is the number of variables, and k is the maximal order of conditional independence in the data. The running-trace of RAI in the worst-case scenario can be viewed as a single path in the GGT (root to leaf). This has a complexity of $O(n^k)$. Thus, for the number of CI tests, the ratio between B-RAI and RAI is $\sum_{i=0}^{k} s^i$. For the Bayesian scoring function (scoring a node given its parents), the complexity is $O(n s^k)$, as only the leaves of the GGT are scored. Note that the worst-case scenario is the case where the true underlying graph is a complete graph, which is not typical in real-world cases. In practice, significantly fewer CI tests and scoring operations are performed, as evident in the short run-times in our experiments.

3.3.1 Sampling CPDAGs

In essence, following a path along the learned GGT, T, choosing one of s possible labeled children at each bootstrap node, results in a single CPDAG. In Algorithm 2 we provide a method for sampling CPDAGs proportionally to their scores. The score calculations and CPDAG selections from the GGT are performed backwards, from the leaves to the root (as opposed to the tree construction, which is performed in a top-down manner). For each autonomous group, given s sampled CPDAGs and their scores returned from s recursive calls (lines 8 & 13), the algorithm samples one of the s results (lines 9 & 14) proportionally to their (log) score. We use the Boltzmann distribution,

$$P(t; \{sc_{t'}\}_{t'=1}^{s}) = \frac{\exp[sc_t/\gamma]}{\sum_{t'=1}^{s} \exp[sc_{t'}/\gamma]}, \qquad (4)$$

where γ is a “temperature” term.
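Sampling an index proportionally to exponentiated log scores, as in Equation 4, can be sketched with a numerically stable softmax (a hypothetical helper, not the authors' code; the maximum log score is subtracted before exponentiation so that large-magnitude scores do not underflow):

```python
import math
import random

def sample_index(log_scores, gamma=1.0, rng=random):
    """Sample index t with probability proportional to exp(sc_t / gamma),
    i.e., a Boltzmann/softmax distribution over log scores."""
    m = max(log_scores)
    w = [math.exp((sc - m) / gamma) for sc in log_scores]  # stable weights
    r = rng.random() * sum(w)
    acc = 0.0
    for t, wt in enumerate(w):                             # inverse-CDF walk
        acc += wt
        if r <= acc:
            return t
    return len(w) - 1  # guard against floating-point round-off

rng = random.Random(7)
print(sample_index([-3.2, -1.1, -2.7], gamma=1.0, rng=rng))  # an index in {0, 1, 2}
```

With γ near 0 the highest-scoring index is returned almost surely, and with very large γ the choice is nearly uniform, matching the limits discussed next.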
When γ → ∞, results are sampled from a uniform distribution, and when γ → 0 the index of the maximal value is selected (arg max). We set γ = 1 and use the Bayesian score, BDeu (Heckerman et al., 1995). Finally, the sampled CPDAGs are merged (line 16), and the sum of scores of all autonomous sets (line 17) is the score of the merged CPDAG.

Another common task is finding the CPDAG having the highest score (model selection). In our case, G_MAP = arg max_{G ∈ T} [log P(D|G)]. The use of a decomposable score enables an efficient recursive algorithm to recover G_MAP. Thus, this algorithm is similar to Algorithm 2, where sampling (lines 9 & 14) is replaced by t′ = arg max_t P(t; {sc_t}_{t=1}^s).

Algorithm 2: Sample a CPDAG from T
 1: (G, sc) ← SampleCPDAG(R)
    Input: R, root of a graph generative tree, T.
    Output: G, a sampled CPDAG, and its score sc.
 2: if R is a leaf node then                         ▷ exit condition
 3:     (G, sc) ← content of the leaf node R
 4:     return (G, sc)
 5: for i ∈ 1...K do
 6:     for t ∈ 1...s do
 7:         R′ ← Child(R, Anc_t^i)                   ▷ select Anc_i sub-tree
 8:         (G_t, sc_t) ← SampleCPDAG(R′)            ▷ a recursive call
 9:     sample t′ ∼ P(t′; {sc_t}_{t=1}^s) (see Equation 4)
10:     G^Ai ← G_t′ and sc^Ai ← sc_t′
11: for t ∈ 1...s do
12:     R′ ← Child(R, Dec_t)                         ▷ select Dec sub-tree
13:     (G_t, sc_t) ← SampleCPDAG(R′)                ▷ a recursive call
14: sample t′ ∼ P(t′; {sc_t}_{t=1}^s) (see Equation 4)
15: G^D ← G_t′ and sc^D ← sc_t′
16: G ← ∪_{i=1}^K G^Ai ∪ G^D      ▷ recall that G^D includes edges incoming from the G^Ai
17: sc ← sc^D + Σ_{i=1}^K sc^Ai   ▷ summation is used since the score is decomposable
18: return (G, sc)

4 Experiments

We use common networks² and datasets³ to analyze B-RAI in three aspects: (1) computational efficiency compared to classic bootstrap, (2) model averaging, and (3) model selection. Experiments were performed using the Bayes net toolbox (Murphy, 2001). Conditional mutual information was used for CI testing, and BDeu with ESS = 1 for scoring.

4.1 GGT Efficiency

In the large sample limit, independent bootstrap samples will yield similar CI-test results. Thus, all the paths in T will represent the same single CPDAG. Since RAI is proved to learn the true underlying graph in the large sample limit (Yehezkel & Lerner, 2009), this single CPDAG will also be the true underlying graph. On the other hand, we expect that for very small sample sizes, each path in T will be unique. In Figure 3-left, we apply B-RAI, with s = 3, for different sample sizes (50–500) and count the number of unique CPDAGs in T. As expected, the number of unique CPDAGs increases as the sample size decreases.

Figure 3: Left: Number of unique CPDAGs in a GGT with s = 3 as a function of data size (averaged over 10 different Alarm datasets). Right: predictive log-likelihood as a function of computational complexity (number of CI-tests). The MAP estimation from B-RAI, with s ∈ {2, 3, 4, 5}, is compared to selecting the highest-scoring CPDAG from l classic bootstrap samples (averaged over 1000 trials).

Next, we compare the computational complexity (number of CI-tests) of B-RAI to classic non-parametric bootstrap over RAI.
For learning, we use 10 independent Alarm datasets, each having 500 samples. For calculating the posterior predictive probability, we use 10 different Alarm datasets, each having 5000 samples. We learn B-RAI using four different values of s, {2, 3, 4, 5}, resulting in four different GGTs, {T_2, ..., T_5}. We record the number of CI-tests that are performed when learning each GGT. Next, a CPDAG having the highest score is found in each GGT and the predictive log-likelihood is calculated. Similarly, for classic bootstrap, we sample l datasets with replacement, learn l different CPDAGs using RAI, and record the number of CI-tests. The different values of l that we tested are {1, 2, ..., 27}, where the number of CI-tests required by 27 independent runs of RAI is similar to that of B-RAI with s = 5. From the l resulting CPDAGs, we select the one having the highest score and calculate the predictive log-likelihood. This experiment is repeated 1000 times. Average results are reported in Figure 3-right. Note that the four points on the B-RAI curve represent T_2, ..., T_5, where T_2 requires the fewest CI-tests and T_5 the most.
This demonstrates the efficiency of recursively applying bootstrap, relying on reliable results of the calling recursive function (lower n), compared to classic bootstrap.

² www.bnlearn.com/bnrepository/
³ www.dsl-lab.org/supplements/mmhc_paper/mmhc_index.html

4.2 Model Averaging

We compare B-RAI to the following algorithms: (1) an exact method (Chen & Tian, 2014) that finds the k CPDAGs having the highest scores—k-best; (2) an MCMC algorithm (Eaton & Murphy, 2007) that uses an optimal proposal distribution—DP-MCMC; and (3) non-parametric bootstrap applied to an algorithm that was shown to scale well for domains having hundreds of variables (Tsamardinos et al., 2006)—BS-MMHC. We use four common networks, Asia, Cancer, Earthquake, and Survey, and sampled 500 data samples from each network. We then repeat the experiment for three different values of k ∈ {5, 10, 15}. It is important to note that the k-best algorithm produces optimal results. Moreover, we sample 10,000 CPDAGs using the DP-MCMC algorithm, providing near-optimal results. However, these algorithms are impractical for domains with n > 25, and are used as optimal baselines.

The posterior probabilities of three types of structural features, f, are evaluated: edge, f = X → Y; Markov blanket, f = X ∈ MB(Y); and path, f = X ⇝ Y. From the k-best algorithm we calculate the k CPDAGs having the highest scores (an optimal solution); from the samples of DP-MCMC and BS-MMHC, we select the k CPDAGs having the highest scores; and from the B-RAI tree we select the k routes leading to the highest-scoring CPDAGs.
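For every method, the selected structures then enter a score-weighted vote over features, made precise in Equation 5. A minimal sketch of that computation, with hypothetical names and DAGs represented as sets of directed edges:

```python
import math

def feature_posterior(dags, log_scores, feature):
    """Approximate P(f|D) as a vote over DAGs weighted by P(G, D), as in
    Equation 5; the max log score is subtracted for numerical stability."""
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]   # proportional to P(G, D)
    hits = sum(w for g, w in zip(dags, weights) if feature(g))
    return hits / sum(weights)

# Two candidate DAGs over one edge, with joint scores in ratio 3:1.
dags = [{("X", "Y")}, {("Y", "X")}]
p_edge = feature_posterior(dags, [math.log(3.0), math.log(1.0)],
                           lambda g: ("X", "Y") in g)
print(round(p_edge, 6))  # 0.75
```

Thresholding such posteriors at varying levels is what produces the ROC curves reported in this section.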
Next, for each CPDAG (for all algorithms) we enumerate all the DAGs (recall that a CPDAG is a family of Markov-equivalent DAGs), resulting in a set of DAGs, G. Finally, we calculate the posterior probability of structural features,

$$P(f \mid D) \approx \frac{\sum_{G \in \mathcal{G}} f(G)\, P(G, D)}{\sum_{G \in \mathcal{G}} P(G, D)}, \qquad (5)$$

where f(G) = 1 if the feature exists in the graph, and f(G) = 0 otherwise. In Figure 4 we report the area under the ROC curve for each of the features, for different k values and different datasets. True-positive and false-positive rates are calculated after thresholding P(f|D) at different values, resulting in an ROC curve. It is evident that B-RAI provides results competitive with the optimal k-best algorithm.

Figure 4: Area under the ROC curve of three different structural features: a directed edge, Markov blanket (MB), and path. Each column of plots represents a different k value (k ∈ {5, 10, 15}), and each row of plots represents a different dataset. Each bar color represents a different method.
The accuracy of B-RAI is on par with that of the exact k-best and DP-MCMC methods, and is better than that of the approximate BS-MMHC method.

4.3 Model Selection

In this experiment, we examine the applicability of B-RAI in large domains having up to hundreds of variables. Since optimal methods are intractable in these domains, we compare the MAP estimation of B-RAI to three algorithms: RAI (Yehezkel & Lerner, 2009), MMHC (Tsamardinos et al., 2006), and classic bootstrap applied to MMHC, BS-MMHC. Both RAI and MMHC were previously reported to achieve state-of-the-art estimation in large domains. For BS-MMHC, 1000 CPDAGs were learned from bootstrap-sampled datasets and the CPDAG having the highest score was selected.

We use eight publicly available databases and networks, commonly used for model selection (Tsamardinos et al., 2006; Yehezkel & Lerner, 2009).
Each of the eight databases consists of 10 datasets, each having 500 samples, for training, and 10 datasets, each having 5000 samples, for calculating the posterior predictive probability. Thus, for each of the 8 databases, we repeat our experiments 10 times. Results are provided in Table 1. The longest running time of B-RAI (implemented in Matlab) was recorded for the Link dataset (724 nodes): ∼2 hours. It is evident that B-RAI learns structures that have significantly higher posterior predictive probabilities (as well as scores on the training datasets; not reported). We also measured the structural Hamming distance (SHD) to the true structure. For all datasets we found B-RAI to have the lowest percentage of missing edges, nearly 0% of extra edges, and the lowest direction errors. The smallest improvement of B-RAI compared to BS-MMHC was for Munin: 5% fewer missing edges and 15% fewer direction errors.

Table 1: Log probabilities of MAP models

Dataset    | Nodes | RAI              | MMHC            | BS-MMHC 1000 Max. | B-RAI MAP, s = 3
Child      | 20    | -68861 (±110)    | -73290 (±60)    | -72125 (±26)      | -65671 (±79)
Insurance  | 27    | -71296 (±115)    | -89670 (±118)   | -85915 (±88)      | -70634 (±99)
Mildew     | 35    | -288677 (±1323)  | -296375 (±119)  | -296815 (±270)    | -279686 (±1028)
Alarm      | 37    | -52198 (±117)    | -86190 (±220)   | -80645 (±84)      | -51173.5 (±127)
Barley     | 48    | -340317 (±389)   | -380790 (±414)  | -380305 (±380)    | -339057 (±804)
Hailfinder | 56    | -291632.5 (±259) | -308125 (±33)   | -306930 (±95)     | -289074 (±120)
Munin      | 189   | -447481 (±1148)  | -455290 (±270)  | -442860 (±255)    | -436309 (±593)
Link       | 724   | -1857751 (±887)  | -1907700 (±432) | -1863546 (±320)   | -1772132 (±659)

5 Conclusions

We proposed a method that covers both model averaging and model selection in the same framework. The B-RAI algorithm recursively constructs a tree
of CPDAGs, T. Each of these CPDAGs is split into autonomous sets, and bootstrap is applied recursively to each set independently. In general, CI tests suffer from the curse-of-dimensionality. However, in B-RAI, higher order CI tests are performed in deeper recursive calls, and therefore inherently benefit from more bootstrap samples. Moreover, computational efficiency is gained by re-using stable lower order CI tests. Sampling CPDAGs from this tree, as well as finding a MAP model, is efficient. Moreover, the number of unique CPDAGs encoded within the learned tree is determined automatically.

In the large sample limit, independent bootstrap samples yield similar CI-test results; thus, all the paths in T represent the same single CPDAG, the true underlying graph. On the other hand, we found that for a small sample size (small training set), each path in T represents a unique CPDAG. Thus, B-RAI has the virtue of inherently identifying the number of unique CPDAGs required to capture the distribution. It follows that for small sample sizes a larger s (number of bootstrap splits) is required in order to capture a large number of CPDAGs in T, whereas for relatively large sample sizes a small s is sufficient. This alleviates the computational cost.

We empirically demonstrate that while B-RAI has accuracy comparable to optimal (exact) methods on small domains, it also scales to large domains with hundreds of variables, for which exact methods are impractical. In these large domains, B-RAI provides the highest scoring CPDAGs and the most reliable structural features on all tested benchmarks.
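The stabilizing effect of bootstrap voting on independence decisions, discussed above, can be illustrated with a minimal sketch. This is a toy marginal-independence test with a majority vote over non-parametric bootstrap resamples; the test, thresholds, and function names are illustrative assumptions, not the B-RAI implementation.

```python
import random

def correlated(pairs, threshold=0.2):
    """Toy independence test for two binary variables: declare dependence
    when the empirical Pearson correlation exceeds a threshold.
    (Illustrative stand-in for a proper CI test.)"""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs) / n
    vx = sum((x - mx) ** 2 for x, _ in pairs) / n
    vy = sum((y - my) ** 2 for _, y in pairs) / n
    if vx == 0 or vy == 0:
        return False  # degenerate resample: no evidence of dependence
    return abs(cov / (vx * vy) ** 0.5) > threshold

def bootstrap_vote(pairs, test, n_boot=51, seed=0):
    """Non-parametric bootstrap: rerun `test` on resamples drawn with
    replacement and return the majority decision."""
    rng = random.Random(seed)
    n = len(pairs)
    votes = sum(test([pairs[rng.randrange(n)] for _ in range(n)])
                for _ in range(n_boot))
    return votes > n_boot // 2

# Synthetic data: `dep` couples y to x (y copies x 80% of the time),
# while `ind` draws the two variables independently.
rng = random.Random(1)
dep = [(x, x if rng.random() < 0.8 else 1 - x)
       for x in (rng.randint(0, 1) for _ in range(400))]
ind = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(400)]
```

Calling `bootstrap_vote(dep, correlated)` and `bootstrap_vote(ind, correlated)` separates the two cases; a single noisy test on one sample can err, but the majority over resamples is far more stable. In B-RAI the analogous effect is obtained recursively: higher order CI tests are run in deeper calls and therefore benefit from more bootstrap samples.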