{"title": "Bayesian Network Induction via Local Neighborhoods", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 511, "abstract": null, "full_text": "Bayesian Network Induction via Local \n\nNeighborhoods \n\nDimitris Margaritis \n\nDepartment of Computer Science \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nD.Margaritis@cs.cmu.edu \n\nSebastian Thrun \n\nDepartment of Computer Science \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \nS. Thrun@cs.cmu.edu \n\nAbstract \n\nIn recent years, Bayesian networks have become highly successful tool for di(cid:173)\nagnosis, analysis, and decision making in real-world domains. We present an \nefficient algorithm for learning Bayes networks from data. Our approach con(cid:173)\nstructs Bayesian networks by first identifying each node's Markov blankets, then \nconnecting nodes in a maximally consistent way. In contrast to the majority of \nwork, which typically uses hill-climbing approaches that may produce dense and \ncausally incorrect nets, our approach yields much more compact causal networks \nby heeding independencies in the data. Compact causal networks facilitate fast in(cid:173)\nference and are also easier to understand. We prove that under mild assumptions, \nour approach requires time polynomial in the size of the data and the number of \nnodes. A randomized variant, also presented here, yields comparable results at \nmuch higher speeds. \n\n1 Introduction \n\nA great number of scientific fields today benefit from being able to automatically estimate \nthe probability of certain quantities of interest that may be difficult or expensive to observe \ndirectly. For example, a doctor may be interested in estimating the probability of heart \ndisease from indications of high blood pressure and other directly measurable quantities. 
\nA computer vision system may benefit from a probability distribution of buildings based \non indicators of horizontal and vertical straight lines. Probability densities proliferate the \nsciences today and advances in its estimation are likely to have a wide impact on many \ndifferent fields . \nBayesian networks are a succinct and efficient way to represent a joint probability distri(cid:173)\nbution among a set of variables. As such, they have been applied to fields such as those \nmentioned [Herskovits90][Agosta88]. Besides their ability for density estimation, their \nsemantics lend them to what is sometimes loosely referred to as causal discovery, namely \ndirectional relationships among quantities involved. It has been widely accepted that the \nmost parsimonious representation for a Bayesian net is one that closely represents the causal \nindependence relationships that may exist. For these reasons, there has been great interest \nin automatically inducing the structure of Bayesian nets automatically from data, preferably \nalso preserving the independence relationships in the process. \nTwo research approaches have emerged. The first employs independence properties of \nthe underlying network that produced the data in order to discover parts of its structure. \nThis approach is mainly exemplified by the SGS and PC algorithms in [Spirtes93], as well \n\n\f506 \n\nD. Margaritis and S. Thrun \n\nFigure 1: On the left, an example of a Markov blanket of variable X is shown. The members of \nthe blanket are shown shaded. On the right, an example reconstruction of a 5 x 5 rectangular net of \nbranching factor 3 by the algorithm presented in this paper using 20000 samples. Indicated by dotted \nlines are 3 directionality errors. \n\nas for restricted classes such as trees [Chow68] and poly trees [Rebane87]. The second \napproach is concerned more with data prediction, disregarding independencies in the data. 
\nIt is typically identified with a greedy hill-climbing or best-first beam search in the space \nof legal structures, employing as a scoring function a form of data likelihood, sometimes \npenalized for network complexity. The result is a local maximum score network structure \nfor representing the data, and is one of the more popular techniques used today. \nThis paper presents an approach that belongs in the first category. It addresses the two main \nshortcomings of the prior work which, we believe, are preventing its use from becoming \nmore widespread. These two disadvantages are: exponential execution times, and proneness \nto errors in dependence tests used. The former problem is addressed in this paper in two \nways. One is by identifying the local neighborhood of each variable in the Bayesian net \nas a preprocessing step, in order to facilitate the recovery of the local structure around \neach variable in polynomial time under the assumption of bounded neighborhood size. The \nsecond, randomized version goes one step further, employing a user-specified number of \nrandomized tests (constant or logarithmic) in order to ascertain the same result with high \nprobability. The second disadvantage of this research approach, namely proneness to errors, \nis also addressed by the randomized version, by using multiple data sets (if available) and \nBayesian accumulation of evidence. \n\n2 The Grow-Shrink Markov Blanket Algorithm \n\nThe concept of the Markov blanket of a variable or a set of variables is central to this paper. \nThe concept itself is not new. For example, see [PearI88]. It is surprising, however, how \nlittle attention it has attracted for all its being a fundamental property of a Bayesian net. \nWhat is new in this paper is the introduction of the explicit use of this idea to effectively \nlimit unnecessary computation, as well as a simple algorithm to compute it. 
The definition of a Markov blanket is as follows: denoting by V the set of variables and by X ↔_S Y the conditional dependence of X and Y given the set S (with X ↮_S Y denoting conditional independence), the Markov blanket BL(X) ⊆ V of X ∈ V is any set of variables such that for any Y ∈ V − BL(X) − {X}, X ↮_BL(X) Y. In other words, BL(X) completely shields variable X from any other variable in V. The notion of a minimal Markov blanket, called a Markov boundary, is also introduced in [Pearl88] and its uniqueness shown under certain conditions. The Markov boundary is not unique in certain pathological situations, such as the equality of two variables. In our following discussion we will assume that the conditions necessary for its existence and uniqueness are satisfied, and we will identify the Markov blanket with the Markov boundary, using the notation B(X) for the blanket of variable X from now on. It is also illuminating to mention that, in the Bayesian net framework, the Markov blanket of a node X is easily identifiable from the graph: it consists of all parents, children, and parents of children of X. An example Markov blanket is shown in Fig. 1. Note that any of these nodes, say Y, is dependent with X given B(X) − {Y}.

1. S ← ∅.
2. While ∃ Y ∈ V − {X} such that Y ↔_S X, do S ← S ∪ {Y}. [Growing phase]
3. While ∃ Y ∈ S such that Y ↮_(S−{Y}) X, do S ← S − {Y}. [Shrinking phase]
4. B(X) ← S.

Figure 2: The basic Markov blanket algorithm.

The algorithm for the recovery of the Markov blanket of X is shown in Fig. 2. The idea behind step 2 is simple: as long as the Markov blanket property of X is violated (i.e., there exists a variable in V that is dependent with X given S), we add it to the current set S, until there are no more such variables. In this process, however, there may be some variables that were added to S that were really outside the blanket. 
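The grow-shrink loop of Fig. 2 can be sketched in Python as follows. The conditional dependence test `dep(Y, X, S)` is treated as a black box (in practice it would be a statistical test on data, as in section 5), and the toy oracle `chain_dep` for a three-node chain is purely illustrative.

```python
def grow_shrink_blanket(X, variables, dep):
    """Sketch of the grow-shrink Markov blanket algorithm of Fig. 2.

    dep(Y, X, S) is an assumed black-box test returning True when
    Y and X are conditionally dependent given the set S.
    """
    S = set()
    # Growing phase: add any variable still dependent with X given S.
    changed = True
    while changed:
        changed = False
        for Y in variables:
            if Y != X and Y not in S and dep(Y, X, frozenset(S)):
                S.add(Y)
                changed = True
    # Shrinking phase: drop variables rendered independent by the rest of S.
    for Y in list(S):
        if not dep(Y, X, frozenset(S - {Y})):
            S.remove(Y)
    return S

def chain_dep(Y, X, S):
    # Illustrative independence oracle for the chain A -> B -> C:
    # A and C are independent exactly when B is in the conditioning set.
    if {X, Y} == {"A", "C"}:
        return "B" not in S
    return True  # adjacent pairs are always dependent
```

With this oracle, `grow_shrink_blanket("B", ["A", "B", "C"], chain_dep)` returns the blanket {A, C} of the middle node, and the shrinking phase correctly excludes C from the blanket of A.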
Such variables would have been rendered independent from X at a later point, when "intervening" nodes of the underlying Bayesian net were added to S. This observation necessitates step 3, which identifies and removes those variables. The algorithm is efficient, requiring only O(n) conditional tests, making its running time O(n |D|), where n = |V| and D is the set of examples. For a detailed derivation of this bound as well as a formal proof of correctness, see [Margaritis99]. In practice one may try to minimize the number of tests in step 3 by heuristically ordering the variables in the loop of step 2, for example by ascending mutual information or probability of dependence between X and Y (as computed using the χ² test; see section 5).

3 Grow-Shrink (GS) Algorithm for Bayesian Net Induction

The recovery of the local structure around each node is greatly facilitated by the knowledge of the nodes' Markov blankets. What would normally be a daunting task of employing dependence tests conditioned on an exponential number of subsets of large sets of variables (even though most of their members may be irrelevant) can now be focused on the Markov blankets of the nodes involved, making structure discovery much faster and more reliable. We present below the plain version of the GS algorithm that utilizes blanket information for inducing the structure of a Bayesian net. At a later point of this paper we will present a robust, randomized version that has the potential of being faster and more reliable, as well as being able to operate in an "anytime" manner.
In the following, N(X) represents the direct neighbors of X.

[ Compute Markov Blankets ]
For all X ∈ V, compute the Markov blanket B(X).

[ Compute Graph Structure ]
For all X ∈ V and Y ∈ B(X), determine Y to be a direct neighbor of X if X and Y are dependent given S for all S ⊆ T, where T is the smaller of B(X) − {Y} and B(Y) − {X}. 
\n[Orient Edges] \n\nFor all X E V and YEN (X), orient Y -+ X if there exists a variable Z E \nN (X) - N (Y) - {Y} such that Y and Z are dependent given S U {X} for all S ~ U, \nwhere U is the smaller of B (Y) - {Z} and B (Z) - {Y}. \n\n[ Remove Cycles] \n\nDo the following while there exist cycles in the graph: \n\n1. Compute the set of edges C = {X -+ Y such that X -+ Y is part of a cycle}. \n2. Remove the edge in C that is part of the greatest number of cycles, and put it in \n\nR. \n\n\f508 \n\nD. Margaritis and S. Thrun \n\n[ Reverse Edges] \n\nInsert each edge from R in the graph, reversed. \n\n[ Propagate Directions] \n\nFor all X E V and Y E N(X) such that neither Y ~ X nor X ~ Y, execute the \nfollowing rule until it no longer applies: If there exists a directed path from X to Y, \norient X ~ Y . \n\nIn the algorithm description above, step 2 determines which of the members of the blanket \nof each node are actually direct neighbors (parents and children). Assuming, without loss of \ngenerality, that B (X) - {Y} is the smaller set, if any of the tests are successful in separating \n(making independent) X from Y, the algorithm determines that there is no direct connection \nbetween them. That would happen when the conditioning set S includes all parents of X \nand no common children of X and Y. It is interesting to note that the motivation behind \nselecting the smaller set to condition on stems not only from computational efficiency \nbut from reliability as well: a conditioning set S causes the data set to be split into 21 S 1 \npartitions; smaller conditioning sets cause the data set to be split into larger partitions and \nmake dependence tests more reliable. \nStep 3 exploits the fact that two variables that have a common descendant become dependent \nwhen conditioning on a set that includes any such descendant. 
Since the direct neighbors of X and Y are known from step 2, we can determine whether a direct neighbor Y is a parent of X if there exists another node Z (which, coincidentally, is also a parent) such that any attempt to separate Y and Z by conditioning on a subset of the blanket of Y that includes X fails (assuming that B(Y) is smaller than B(Z)). If the directionality is indeed Y → X ← Z, there should be no such subset since, by conditioning on X, a permanent dependency path between Y and Z is created. This would not be the case if Y were a child of X.
It is straightforward to show that the algorithm requires O(n² + nb²2^b) conditional independence tests, where b = max_X(|B(X)|). Under the assumption that b is bounded by a constant, this algorithm is O(n²) in the number of conditional independence tests. It is worthwhile to note that the time to compute a conditional independence test by a pass over the data set D is O(n |D|) and not O(2^|V|). An analysis and a formal proof of correctness of the algorithm is presented in [Margaritis99].

Discussion
The main advantage of the algorithm comes through the use of Markov blankets to restrict the size of the conditioning sets. The Markov blankets usually err on the side of including too many nodes, because they are determined by a disjunction of tests over all values of the conditioning set, on the same data. This emphasizes the importance of the "direct neighbors" step, which removes nodes that were incorrectly added during the Markov blanket computation step, by admitting only variables whose dependence is shown with high confidence in a large number of different tests.
It is also possible that an edge direction is wrongly determined during step 3 due to non-representative or noisy data. This may lead to directed cycles in the resulting graph. 
It is \ntherefore necessary to remove those cycles by identifying the minimum set of edges than \nneed to be reversed for all cycles to disappear. This problem is closely related [Margaritis99] \nto the Minimum Feedback Arc Set problem, which is concerned with identifying a minimum \nset of edges that need to be removed from a graph that possibly contains directed cycles, \nin order for all such cycles to disappear. Unfortunately, this problem is NP-complete in its \ngenerality [Junger85]. We introduce here a reasonable heuristic for its solution that is based \non the number of cycles that an edge that is part of a cycle is involved in. \nNot all edge directions can be determined during the last two steps. For example, nodes with \na single parent or multi-parent nodes (called colliders) whose parents are directly connected \ndo not apply to step 3, and steps 4 and 5 are only concerned with already directed edges. \nStep 6 attempts to ameliorate that, through orienting edges in a way that does not introduce \n\n\fBayesian Network Induction via Local Neighborhoods \n\n509 \n\na cycle, if the reverse direction necessarily does. It is not obvious that, for example, if the \ndirection X -t Y produces a cycle in an otherwise acyclic graph, the opposite direction \nY -t X will not also. However, this is the case. For the proof of this, see [Margaritis99]. \nThe algorithm is similar to the SGS algorithm presented in [Spirtes93], but differs in a \nnumber of ways. Its main difference lies in the use of Markov blankets to dramatically \nimprove performance (in many cases where the bounded blanket size assumptions hold). \nIts structure is similar to SGS, and the stability (frequently referred to as robustness in the \nfollowing discussion) arguments presented in [Spirtes93] apply. Increased reliability stems \nfrom the use of smaller conditioning sets, leading to greater number of examples per test. 
\nThe PC algorithm, also in [Spirtes93], differs from the GS algorithm in that it involves \nlinear probing for a separator set, which makes it unnecessarily inefficient. \n\n4 Randomized Version of the GS Algorithm \n\nThe GS algorithm, as presented above, is appropriate for situations where the maximum \nMarkov blanket of each of a set of variables is small. While it is reasonable to assume \nthat in many real-life problems where high-level variables are involved this may be the \ncase, other problems such as Bayesian image retrieval in computer vision, may employ \nfiner representations. In these cases the variables used may depend in a direct manner on \nmany others. For example, we may choose to use variables to characterize local texture in \ndifferent parts of an image. If the resolution of the mapping from textures to variables is \nincreasingly fine, direct dependencies among those variables may be plentiful and therefore \nthe maximum Markov blanket size may be significant. \nAnother problem that has plagued independence-test based algorithms for Bayesian net \nstructure induction in general is that their decisions are based on a single or a few tests \n(\"hard\" decisions), making them prone to errors due to noise in the data. This also applies \nto the the GS algorithm. It would therefore be advantageous to employ multiple tests before \ndeciding on a direct neighbor or the direction of an edge. \nThe randomized version of the GS algorithm addresses these two problems. Both of \nthem are tackled through randomized testing and Bayesian evidence accumulation. The \nproblem of exponential running times in the maximum blanket size of steps 2 and 3 of the \nplain algorithm is overcome by replacing them by a series of tests, whose number may be \nspecified by the user, with the members of the conditioning set chosen randomly from the \nsmallest blanket of the two variables. 
Each such test provides evidence for or against the direct connection between the two variables, appropriately weighted by the probability that circumstances causing that event occur or not, and by the fact that connectedness is the conjunction of more elementary events.
This version of the algorithm is not shown here in detail due to space restrictions. Its operation follows closely that of the plain GS version. The main difference lies in the use of Bayesian updating of the posterior probability of a direct link (or a dependence through a collider) between a pair of variables X and Y, using conditional dependence tests that take into account independent evidence. The posterior probability p_i of a link between X and Y after executing i dependence tests d_j, j = 1, ..., i is

p_i = p_{i−1} d_i / [ p_{i−1} d_i + (1 − p_{i−1})(G + 1 − d_i) ]

where G ≡ G(X, Y) = 1 − (3/4)^|T| is a factor that takes values in the interval [0, 1) and can be interpreted as the "(un)importance" of the truth of each test d_i, while T is the smaller of B(X) − {Y} and B(Y) − {X}. We can use this accumulated evidence to guide our decisions to the hypothesis that we feel most confident about. Besides being able to do that in a timely manner due to the user-specified number of tests, we also note how this approach addresses the robustness problem mentioned above through the use of multiple weighted tests, leaving for the end the "hard" decisions that involve a threshold (i.e., comparing the posterior probability with a threshold, which in our case is 1/2). 
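A single step of this evidence-accumulation rule can be written directly in Python. In this sketch G is passed in as a parameter rather than computed from |T|, since only its range [0, 1) matters here.

```python
def update_posterior(p_prev, d_i, G):
    """One step of the Bayesian evidence-accumulation rule: p_prev is the
    posterior probability of a direct link after the previous tests, d_i the
    probability of dependence reported by the i-th randomized test, and
    G in [0, 1) the "(un)importance" factor of the test."""
    num = p_prev * d_i
    return num / (num + (1.0 - p_prev) * (G + 1.0 - d_i))

# Starting from an uninformative prior of 1/2, a run of strongly
# "dependent" test outcomes pushes the posterior toward 1.
p = 0.5
for _ in range(10):
    p = update_posterior(p, 0.9, 0.5)
```

In odds form each step multiplies the odds of a link by d_i / (G + 1 − d_i), so repeated confident tests drive the posterior quickly past the decision threshold of 1/2.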
\u2022 ,. \n.Q---\n\n00001 \n\n5e-05 \n\n4000 \n\nBOOO \n12000 \nNurrber of sarrples \n\n16000 \n\n20000 \n\n1~r---_--_-__ - __ -~ \n\nEdge errors versus number of sarrples \n\nPlainGSBN -\n\nHill-Clirrtling, score data likelihood \n\nRandomized GSBN .... ~ ... \n. , .. \nB -\n\nHill-Climbing, soore: BIC -\n\n~.~ \n\n' .................................................... III- ...... .\n\n....... ... . \n\nl00r-----------::p~lai~nG~Sr-BN~-~ \n\nDirection errors versus number of sarrplss \n\nRandomized GSBN .... ~ ... \nHill-Clirrbi~h~~i~~~~h~ .~_ \n\n8000 \n12000 \nNurrber of sarrples \n\n18000 \n\n20000 \n\nOL---~--~--~---~-~ \n20000 \n\n18000 \n\n4000 \n\n0 \n\n8000 \n12000 \nNumber of 5a!1l)les \n\nFigure 3: Results for a 5 x 5 rectangular net with branching factor 2 (in both directions, blanket size \n8) as a function of the number of samples. On the top, KL-divergence is depicted for the plain GS, \nrandomized GS, and hill-climbing algorithms. On the bottom, the percentage of edge and direction \nerrors are shown. Note that certain edge error rates for the hill-climbing algorithm exceed 100%. \n\n5 Results \n\nThroughout the algorithms presented in this paper we employ standard chi-square (X 2) con(cid:173)\nditional dependence tests (as is done also in [Spirtes93]) in order to compare the histograms \nP(X) and P(X I Y). The X2 test gives us a probability of the error of assuming that the \ntwo variables are dependent when in fact they are not (type II error of a dependence test), \nfrom which we can easily derive the probability that X and Y are dependent. There is an \nimplicit confidence threshold T involved in each dependence test, indicating how certain \nwe wish to be about the correctness of the test without unduly rejecting dependent pairs, \nsomething that is always possible in reality due to the presence of noise. In all experiments \nwe used T = 0.95, which corresponds to a 95% confidence test. 
\nWe test the effectiveness of the algorithms through the following procedure: we generate a \nrandom rectangular net of specified dimensions and up/down branching factor. A number \nof examples are drawn from that net using logic sampling and they are used as input to \nthe algorithm under test. The resulting nets can be compared with the original ones along \ndimensions of KL-divergence and difference in edges and edge directionality. The KL(cid:173)\ndivergence was estimated using a Monte Carlo procedure. An example reconstruction was \nshown in the beginning of the paper, Fig. 1. \nFig. 3 shows how the KL-divergence between the original and the reconstructed net as well \nas edge omissions/false additions/reversals as a function of number of samples used. It \ndemonstrates two facts. First, that typical KL-divergence for both GS and hill-climbing \nalgorithms is low (with hill-climbing slightly lower), which shows good performance for \napplications where prediction is of prime concern. Second, the number of incorrect edges \nand the errors in the directionality of the edges present is much higher for the hill-climbing \nalgorithm, making it unsuitable for accurate Bayesian net reconstruction. \nFig. 4 shows the effects of increasing the Markov blanket through an increasing branching \nfactor. As expected, we see a dramatic (exponential) increase in execution time of the plain \n\n\fBayesian Network Induction via Local Neighborhoods \n\n511 \n\nEdge I Direction Errors versus Branching Factor \n\n100r-------~~~----~~~----, \nEdge errors, plain GSBN ~ \n\nEdge errors, randomized GSBN ---~---\n\n90 \n80 \n\nDirection errors, plain GSBN \nDirection errors, randomized GSBN \n\n- &---\n\n70 \n60 \n50 \n40 \n30 ___ __ ___ _ __ _ ... ___ ___ ___ .... ___ __ _ \n\n2\u00b0L==~==~~\u00b7----=----=-----=----~-------------- ----l \n10 -.,,\"\"'-.. -----...... .. _ . 
\u2022 \n\n_______ .olII ___ ____ __ \u2022 __ _ \n\n_ _ \" \n\n22000 \n20000 \n18000 \n16000 \n~ 14000 \n~ 12000 \n~ 10000 \ni= \n8000 \n6000 \n4000 \n2000 \n\nExecution Time versus Branching Factor \n\nPlain GSBN - (cid:173)\n\nRandomized GSBN ----K----\n\n.-----\n\nO~------~--------~--------~ \n5 \n\n2 \n\n3 \n\n4 \nBranching Factor \n\nO~------~--------~------~ \n5 \n\n2 \n\n3 \n4 \nBranching Factor \n\nFigure 4: Results for a 5 x 5 rectangular net from which 10000 samples were generated and used \nfor reconstruction, versus increasing branching factor. On the left, errors are slowly increasing as \nexpected, but comparable for the plain and randomized versions of the GS algorithm. On the right, \ncorresponding execution times are shown. \nGS algorithm, though only a mild increase of the randomized version. The latter uses 200 \n(constant) conditional tests per decision, and its execution time increase can be attributed \nto the (quadratic) increase in the number of decisions. Note that the error percentages \nbetween the plain and the randomized version remain relatively close. The number of \ndirection errors for the GS algorithm actually decreases due to the larger number of parents \nfor each node (more \"V\" structures), which allows a greater number of opportunities to \nrecover the directionality of an edge (using an increased number of tests). \n6 Discussion \nIn this paper we presented an efficient algorithm for computing the Markov blanket of \na node and then used it in the two versions of the GS algorithm (plain and randomized) \nby exploiting the properties of the Markov blanket to facilitate fast reconstruction of the \nlocal neighborhood around each node, under assumptions of bounded neighborhood size. \nWe also presented a randomized variant that has the advantages of faster execution speeds \nand added reconstruction robustness due to multiple tests and Bayesian accumulation of \nevidence. 
Simulation results demonstrate the reconstruction accuracy advantages of the algorithms presented here over hill-climbing methods. Additional results also show that the randomized version has a dramatic execution speed benefit over the plain one in cases where the assumption of bounded neighborhood size does not hold, without significantly affecting the reconstruction error rate.

References

[Chow68] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 1968.
[Herskovits90] E.H. Herskovits and G.F. Cooper. Kutató: An entropy-driven system for construction of probabilistic expert systems from databases. UAI-90.
[Spirtes93] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer, 1993.
[Pearl88] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[Rebane87] G. Rebane and J. Pearl. The recovery of causal poly-trees from statistical data. UAI-87.
[Verma90] T.S. Verma and J. Pearl. Equivalence and synthesis of causal models. UAI-90.
[Agosta88] J.M. Agosta. The structure of Bayes networks for visual recognition. UAI-88.
[Cheng97] J. Cheng, D.A. Bell, and W. Liu. An algorithm for Bayesian network construction from data. AI and Statistics, 1997.
[Margaritis99] D. Margaritis and S. Thrun. Bayesian Network Induction via Local Neighborhoods. TR CMU-CS-99-134, forthcoming.
[Junger85] M. Jünger. Polyhedral Combinatorics and the Acyclic Subdigraph Problem. Heldermann, 1985.
", "award": [], "sourceid": 1685, "authors": [{"given_name": "Dimitris", "family_name": "Margaritis", "institution": null}, {"given_name": "Sebastian", "family_name": "Thrun", "institution": null}]}