{"title": "Bayesian Network Induction via Local Neighborhoods", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 511, "abstract": null, "full_text": "Bayesian Network Induction via Local Neighborhoods \n\nDimitris Margaritis \n\nDepartment of Computer Science \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nD.Margaritis@cs.cmu.edu \n\nSebastian Thrun \n\nDepartment of Computer Science \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \nS.Thrun@cs.cmu.edu \n\nAbstract \n\nIn recent years, Bayesian networks have become a highly successful tool for diagnosis, \nanalysis, and decision making in real-world domains. We present an \nefficient algorithm for learning Bayes networks from data. Our approach constructs \nBayesian networks by first identifying each node's Markov blanket, then \nconnecting nodes in a maximally consistent way. In contrast to the majority of \nwork, which typically uses hill-climbing approaches that may produce dense and \ncausally incorrect nets, our approach yields much more compact causal networks \nby heeding independencies in the data. Compact causal networks facilitate fast inference \nand are also easier to understand. We prove that under mild assumptions, \nour approach requires time polynomial in the size of the data and the number of \nnodes. A randomized variant, also presented here, yields comparable results at \nmuch higher speeds. \n\n1 Introduction \n\nA great number of scientific fields today benefit from being able to automatically estimate \nthe probability of certain quantities of interest that may be difficult or expensive to observe \ndirectly. For example, a doctor may be interested in estimating the probability of heart \ndisease from indications of high blood pressure and other directly measurable quantities. 
\nA computer vision system may benefit from a probability distribution of buildings based \non indicators of horizontal and vertical straight lines. Probability densities pervade the \nsciences today, and advances in their estimation are likely to have a wide impact on many \ndifferent fields. \nBayesian networks are a succinct and efficient way to represent a joint probability distribution \namong a set of variables. As such, they have been applied to fields such as those \nmentioned [Herskovits90][Agosta88]. Besides their ability for density estimation, their \nsemantics lend them to what is sometimes loosely referred to as causal discovery, namely \nthe discovery of directional relationships among the quantities involved. It has been widely accepted that the \nmost parsimonious representation for a Bayesian net is one that closely represents the causal \nindependence relationships that may exist. For these reasons, there has been great interest \nin automatically inducing the structure of Bayesian nets from data, preferably \nwhile also preserving the independence relationships in the process. \n\nFigure 1: On the left, an example of a Markov blanket of variable X is shown. The members of \nthe blanket are shown shaded. On the right, an example reconstruction of a 5 x 5 rectangular net of \nbranching factor 3 by the algorithm presented in this paper using 20000 samples. Indicated by dotted \nlines are 3 directionality errors. \n\nTwo research approaches have emerged. The first employs independence properties of \nthe underlying network that produced the data in order to discover parts of its structure. \nThis approach is mainly exemplified by the SGS and PC algorithms in [Spirtes93], as well \nas by algorithms for restricted classes such as trees [Chow68] and polytrees [Rebane87]. 
The second \napproach is concerned more with data prediction, disregarding independencies in the data. \nIt is typically identified with a greedy hill-climbing or best-first beam search in the space \nof legal structures, employing as a scoring function a form of data likelihood, sometimes \npenalized for network complexity. The result is a network structure that is a local maximum of the score \nfor representing the data; this is one of the more popular techniques used today. \nThis paper presents an approach that belongs in the first category. It addresses the two main \nshortcomings of the prior work which, we believe, are preventing its use from becoming \nmore widespread. These two disadvantages are exponential execution times and proneness \nto errors in the dependence tests used. The former problem is addressed in this paper in two \nways. One is by identifying the local neighborhood of each variable in the Bayesian net \nas a preprocessing step, in order to facilitate the recovery of the local structure around \neach variable in polynomial time under the assumption of bounded neighborhood size. The \nsecond, randomized version goes one step further, employing a user-specified number of \nrandomized tests (constant or logarithmic) in order to ascertain the same result with high \nprobability. The second disadvantage of this research approach, namely proneness to errors, \nis also addressed by the randomized version, by using multiple data sets (if available) and \nBayesian accumulation of evidence. \n\n2 The Grow-Shrink Markov Blanket Algorithm \n\nThe concept of the Markov blanket of a variable or a set of variables is central to this paper. \nThe concept itself is not new; see, for example, [Pearl88]. It is surprising, however, how \nlittle attention it has attracted for all its being a fundamental property of a Bayesian net. 
\nWhat is new in this paper is the introduction of the explicit use of this idea to effectively \nlimit unnecessary computation, as well as a simple algorithm to compute it. The definition \nof a Markov blanket is as follows: denoting V as the set of variables and X ↔_S Y as the \nconditional dependence of X and Y given the set S, the Markov blanket BL(X) ⊆ V of \nX ∈ V is any set of variables such that for any Y ∈ V - BL(X) - {X}, X ↮_BL(X) Y. \nIn other words, BL(X) completely shields variable X from any other variable in V. The \nnotion of a minimal Markov blanket, called a Markov boundary, is also introduced in \n[Pearl88] and its uniqueness shown under certain conditions. The Markov boundary is \nnot unique in certain pathological situations, such as the equality of two variables. In \nour following discussion we will assume that the conditions necessary for its existence and \nuniqueness are satisfied and we will identify the Markov blanket with the Markov boundary, \nusing the notation B(X) for the blanket of variable X from now on. It is also illuminating \nto mention that, in the Bayesian net framework, the Markov blanket of a node X is easily \nidentifiable from the graph: it consists of all parents, children and parents of children of \nX. An example Markov blanket is shown in Fig. 1. Note that any of these nodes, say Y, is \ndependent on X given B(X) - {Y}. \n\n1. S ← ∅. \n\n2. While ∃ Y ∈ V - {X} such that Y ↔_S X, do S ← S ∪ {Y}. [Growing phase] \n\n3. While ∃ Y ∈ S such that Y ↮_{S-{Y}} X, do S ← S - {Y}. [Shrinking phase] \n\n4. B(X) ← S. \n\nFigure 2: The basic Markov blanket algorithm. \n\nThe algorithm for the recovery of the Markov blanket of X is shown in Fig. 2. The idea \nbehind step 2 is simple: as long as the Markov blanket property of X is violated (i.e., 
there \nexists a variable in V that is dependent on X given S), we add it to the current set S until there are \nno more such variables. In this process, however, there may be some variables that were \nadded to S that were really outside the blanket. Such variables would have been rendered \nindependent from X at a later point when \"intervening\" nodes of the underlying Bayesian \nnet were added to S. This observation necessitates step 3, which identifies and removes \nthose variables. The algorithm is efficient, requiring only O(n) conditional tests, making \nits running time O(n |D|), where n = |V| and D is the set of examples. For a detailed \nderivation of this bound as well as a formal proof of correctness, see [Margaritis99]. In \npractice one may try to minimize the number of tests in step 3 by heuristically ordering the \nvariables in the loop of step 2, for example by ascending mutual information or probability \nof dependence between X and Y (as computed using the χ² test, see section 5). \n\n3 Grow-Shrink (GS) Algorithm for Bayesian Net Induction \n\nThe recovery of the local structure around each node is greatly facilitated by the knowledge \nof the nodes' Markov blankets. What would normally be a daunting task of employing \ndependence tests conditioned on an exponential number of subsets of large sets of \nvariables, even though most of their members may be irrelevant, can now be focused on \nthe Markov blankets of the nodes involved, making structure discovery much faster and \nmore reliable. We present below the plain version of the GS algorithm that utilizes blanket \ninformation for inducing the structure of a Bayesian net. At a later point of this paper, we \nwill present a robust, randomized version that has the potential of being faster and more \nreliable, as well as being able to operate in an \"anytime\" manner. 
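
The grow and shrink phases of the blanket algorithm in Fig. 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name grow_shrink_blanket and the dependent(x, y, s) callback are hypothetical, and the callback stands in for whatever conditional dependence test is used (the paper uses a chi-square test on data). The toy oracle below simply hard-codes the dependencies of a three-node chain A → B → C.

```python
# Sketch of the blanket algorithm of Fig. 2 (all names are illustrative assumptions).
# dependent(x, y, s) must return True when x and y are judged dependent given the set s.

def grow_shrink_blanket(x, variables, dependent):
    s = []
    # Growing phase: add any variable that is still dependent on x given the current s.
    changed = True
    while changed:
        changed = False
        for y in variables:
            if y != x and y not in s and dependent(x, y, frozenset(s)):
                s.append(y)
                changed = True
    # Shrinking phase: drop variables that are independent of x given the rest of s.
    changed = True
    while changed:
        changed = False
        for y in list(s):
            rest = frozenset(v for v in s if v != y)
            if not dependent(x, y, rest):
                s.remove(y)
                changed = True
    return set(s)


def chain_oracle(x, y, s):
    # Toy oracle for the chain A -> B -> C: A and C are separated only by B.
    if {x, y} == {'A', 'C'}:
        return 'B' not in s
    return True


blanket_a = grow_shrink_blanket('A', ['C', 'B', 'A'], chain_oracle)
blanket_b = grow_shrink_blanket('B', ['C', 'B', 'A'], chain_oracle)
```

With the ordering C, B, A, the growing phase for A first admits C and then B, and the shrinking phase removes C once B is in the set, which is the situation step 3 is designed to handle.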
\nIn the following, N(X) represents the direct neighbors of X. \n\n1. [ Compute Markov Blankets ] \n\nFor all X ∈ V, compute the Markov blanket B(X). \n\n2. [ Compute Graph Structure ] \n\nFor all X ∈ V and Y ∈ B(X), determine Y to be a direct neighbor of X if X and \nY are dependent given S for all S ⊆ T, where T is the smaller of B(X) - {Y} and \nB(Y) - {X}. \n\n3. [ Orient Edges ] \n\nFor all X ∈ V and Y ∈ N(X), orient Y → X if there exists a variable Z ∈ \nN(X) - N(Y) - {Y} such that Y and Z are dependent given S ∪ {X} for all S ⊆ U, \nwhere U is the smaller of B(Y) - {Z} and B(Z) - {Y}. \n\n4. [ Remove Cycles ] \n\nDo the following while there exist cycles in the graph: \n\n(a) Compute the set of edges C = {X → Y such that X → Y is part of a cycle}. \n(b) Remove the edge in C that is part of the greatest number of cycles, and put it in R. \n\n5. [ Reverse Edges ] \n\nInsert each edge from R in the graph, reversed. \n\n6. [ Propagate Directions ] \n\nFor all X ∈ V and Y ∈ N(X) such that neither Y → X nor X → Y, execute the \nfollowing rule until it no longer applies: If there exists a directed path from X to Y, \norient X → Y. \n\nIn the algorithm description above, step 2 determines which of the members of the blanket \nof each node are actually direct neighbors (parents and children). Assuming, without loss of \ngenerality, that B(X) - {Y} is the smaller set, if any of the tests are successful in separating \n(making independent) X from Y, the algorithm determines that there is no direct connection \nbetween them. That would happen when the conditioning set S includes all parents of X \nand no common children of X and Y. 
It is interesting to note that the motivation behind \nselecting the smaller set to condition on stems not only from computational efficiency \nbut from reliability as well: a conditioning set S causes the data set to be split into 2^|S| \npartitions; smaller conditioning sets cause the data set to be split into larger partitions and \nmake dependence tests more reliable. \nStep 3 exploits the fact that two variables that have a common descendant become dependent \nwhen conditioning on a set that includes any such descendant. Since the direct neighbors \nof X and Y are known from step 2, we can determine that a direct neighbor Y is a \nparent of X if there exists another node Z (which, coincidentally, is also a parent) such \nthat any attempt to separate Y and Z by conditioning on a subset of the blanket of Y that \nincludes X fails (assuming that B(Y) is smaller than B(Z)). If the directionality is indeed \nY → X ← Z, there should be no such subset since, by conditioning on X, a permanent \ndependency path between Y and Z is created. This would not be the case if Y were a child \nof X. \nIt is straightforward to show that the algorithm requires O(n^2 + n b^2 2^b) conditional independence \ntests, where b = max_X(|B(X)|). Under the assumption that b is bounded by a \nconstant, this algorithm is O(n^2) in the number of conditional independence tests. It is \nworthwhile to note that the time to compute a conditional independence test by a pass over \nthe data set D is O(n |D|) and not O(2^|V|). An analysis and a formal proof of correctness \nof the algorithm are presented in [Margaritis99]. \n\nDiscussion \nThe main advantage of the algorithm comes through the use of Markov blankets to restrict \nthe size of the conditioning sets. 
The Markov blankets may usually err on the side \nof including too many nodes because they are represented by a disjunction of tests for \nall values of the conditioning set, on the same data. This emphasizes the importance of \nthe \"direct neighbors\" step, which removes nodes that were incorrectly added during the \nMarkov blanket computation step by admitting only variables whose dependence was shown with \nhigh confidence in a large number of different tests. \nIt is also possible that an edge direction is wrongly determined during step 3 due to nonrepresentative \nor noisy data. This may lead to directed cycles in the resulting graph. It is \ntherefore necessary to remove those cycles by identifying the minimum set of edges that \nneed to be reversed for all cycles to disappear. This problem is closely related [Margaritis99] \nto the Minimum Feedback Arc Set problem, which is concerned with identifying a minimum \nset of edges that need to be removed from a graph that possibly contains directed cycles, \nin order for all such cycles to disappear. Unfortunately, this problem is NP-complete in its \ngenerality [Jünger85]. We introduce here a reasonable heuristic for its solution that is based \non the number of cycles that an edge that is part of a cycle is involved in. \nNot all edge directions can be determined during the last two steps. For example, nodes with \na single parent or multi-parent nodes (called colliders) whose parents are directly connected \nare not handled by step 3, and steps 4 and 5 are only concerned with already directed edges. \nStep 6 attempts to ameliorate that by orienting edges in a way that does not introduce \na cycle, if the reverse direction necessarily does. 
It is not obvious that, for example, if the \ndirection X → Y produces a cycle in an otherwise acyclic graph, the opposite direction \nY → X will not also. However, this is the case. For the proof of this, see [Margaritis99]. \nThe algorithm is similar to the SGS algorithm presented in [Spirtes93], but differs in a \nnumber of ways. Its main difference lies in the use of Markov blankets to dramatically \nimprove performance (in many cases where the bounded blanket size assumption holds). \nIts structure is similar to SGS, and the stability (frequently referred to as robustness in the \nfollowing discussion) arguments presented in [Spirtes93] apply. Increased reliability stems \nfrom the use of smaller conditioning sets, leading to a greater number of examples per test. \nThe PC algorithm, also in [Spirtes93], differs from the GS algorithm in that it involves \nlinear probing for a separator set, which makes it unnecessarily inefficient. \n\n4 Randomized Version of the GS Algorithm \n\nThe GS algorithm, as presented above, is appropriate for situations where the maximum \nMarkov blanket size over the set of variables is small. While it is reasonable to assume \nthat in many real-life problems where high-level variables are involved this may be the \ncase, other problems, such as Bayesian image retrieval in computer vision, may employ \nfiner representations. In these cases the variables used may depend in a direct manner on \nmany others. For example, we may choose to use variables to characterize local texture in \ndifferent parts of an image. If the resolution of the mapping from textures to variables is \nincreasingly fine, direct dependencies among those variables may be plentiful and therefore \nthe maximum Markov blanket size may be significant. 
\nAnother problem that has plagued independence-test based algorithms for Bayesian net \nstructure induction in general is that their decisions are based on a single or a few tests \n(\"hard\" decisions), making them prone to errors due to noise in the data. This also applies \nto the GS algorithm. It would therefore be advantageous to employ multiple tests before \ndeciding on a direct neighbor or the direction of an edge. \nThe randomized version of the GS algorithm addresses these two problems. Both of \nthem are tackled through randomized testing and Bayesian evidence accumulation. The \nproblem of running times exponential in the maximum blanket size in steps 2 and 3 of the \nplain algorithm is overcome by replacing them by a series of tests, whose number may be \nspecified by the user, with the members of the conditioning set chosen randomly from the \nsmaller blanket of the two variables. Each such test provides evidence for or against the \ndirect connection between the two variables, appropriately weighted by the probability that \ncircumstances causing that event occur or not, and due to the fact that connectedness is the \nconjunction of more elementary events. \nThis version of the algorithm is not shown here in detail due to space restrictions. Its \noperation closely follows that of the plain GS version. The main difference lies in the \nuse of Bayesian updating of the posterior probability of a direct link (or a dependence \nthrough a collider) between a pair of variables X and Y using conditional dependence tests \nthat take into account independent evidence. The posterior probability p_i of a link between \nX and Y after executing i dependence tests d_j, j = 1, ...
, i is \n\np_i = p_{i-1} d_i / (p_{i-1} d_i + (1 - p_{i-1})(G + 1 - d_i)) \n\nwhere G ≡ G(X, Y) = 1 - (1/2)^|T| is a factor that takes values in the interval [0,1) and \ncan be interpreted as the \"(un)importance\" of the truth of each test d_i, while T is the \nsmaller of B(X) - {Y} and B(Y) - {X}. We can use this accumulated evidence to guide \nour decisions to the hypothesis that we feel most confident about. Besides being able to \ndo that in a timely manner due to the user-specified number of tests, we also note how \nthis approach addresses the robustness problem mentioned above through the use of \nmultiple weighted tests, leaving for the end the \"hard\" decisions that involve a threshold \n(i.e., comparing the posterior probability with a threshold, which in our case is 1/2). \n\n[Figure 3 plots. Top: KL-divergence versus number of samples. Bottom: Edge errors versus number of samples; Direction errors versus number of samples. Curves: Plain GSBN; Randomized GSBN; Hill-Climbing, score: data likelihood; Hill-Climbing, score: BIC. x-axis: Number of samples.] \n\nFigure 3: Results for a 5 x 5 rectangular net with branching factor 2 (in both directions, blanket size \n8) as a function of the number of samples. On the top, KL-divergence is depicted for the plain GS, \nrandomized GS, and hill-climbing algorithms. On the bottom, the percentages of edge and direction \nerrors are shown. Note that certain edge error rates for the hill-climbing algorithm exceed 100%. \n\n5 Results \n\nThroughout the algorithms presented in this paper we employ standard chi-square (χ²) conditional \ndependence tests (as is done also in [Spirtes93]) in order to compare the histograms \nP(X) and P(X | Y). The χ² test gives us a probability of the error of assuming that the \ntwo variables are dependent when in fact they are not (type II error of a dependence test), \nfrom which we can easily derive the probability that X and Y are dependent. There is an \nimplicit confidence threshold T involved in each dependence test, indicating how certain \nwe wish to be about the correctness of the test without unduly rejecting dependent pairs, \nsomething that is always possible in reality due to the presence of noise. In all experiments \nwe used T = 0.95, which corresponds to a 95% confidence test. \nWe test the effectiveness of the algorithms through the following procedure: we generate a \nrandom rectangular net of specified dimensions and up/down branching factor. A number \nof examples are drawn from that net using logic sampling and they are used as input to \nthe algorithm under test. The resulting nets can be compared with the original ones along \ndimensions of KL-divergence and difference in edges and edge directionality. The KL-divergence \nwas estimated using a Monte Carlo procedure. 
An example reconstruction was \nshown in the beginning of the paper, Fig. 1. \nFig. 3 shows the KL-divergence between the original and the reconstructed net, as well \nas edge omissions/false additions/reversals, as a function of the number of samples used. It \ndemonstrates two facts. First, the typical KL-divergence for both GS and hill-climbing \nalgorithms is low (with hill-climbing slightly lower), which shows good performance for \napplications where prediction is of prime concern. Second, the number of incorrect edges \nand the errors in the directionality of the edges present is much higher for the hill-climbing \nalgorithm, making it unsuitable for accurate Bayesian net reconstruction. \nFig. 4 shows the effects of increasing the Markov blanket through an increasing branching \nfactor. As expected, we see a dramatic (exponential) increase in execution time of the plain \nGS algorithm, though only a mild increase for the randomized version. The latter uses 200 \n(constant) conditional tests per decision, and its execution time increase can be attributed \nto the (quadratic) increase in the number of decisions. Note that the error percentages \nof the plain and the randomized version remain relatively close. The number of \ndirection errors for the GS algorithm actually decreases due to the larger number of parents \nfor each node (more \"V\" structures), which allows a greater number of opportunities to \nrecover the directionality of an edge (using an increased number of tests). \n\n[Figure 4 plots. Left: Edge / Direction Errors versus Branching Factor (edge errors and direction errors, each for plain and randomized GSBN). Right: Execution Time versus Branching Factor (plain and randomized GSBN). x-axis: Branching Factor.] \n\nFigure 4: Results for a 5 x 5 rectangular net from which 10000 samples were generated and used \nfor reconstruction, versus increasing branching factor. On the left, errors are slowly increasing as \nexpected, but comparable for the plain and randomized versions of the GS algorithm. On the right, \ncorresponding execution times are shown. \n\n6 Discussion \nIn this paper we presented an efficient algorithm for computing the Markov blanket of \na node and then used it in the two versions of the GS algorithm (plain and randomized), \nexploiting the properties of the Markov blanket to facilitate fast reconstruction of the \nlocal neighborhood around each node, under assumptions of bounded neighborhood size. \nWe also presented a randomized variant that has the advantages of faster execution speeds \nand added reconstruction robustness due to multiple tests and Bayesian accumulation of \nevidence. Simulation results demonstrate the reconstruction accuracy advantages of the \nalgorithms presented here over hill-climbing methods. Additional results also show that \nthe randomized version has a dramatic execution speed benefit over the plain one in \ncases where the assumption of bounded neighborhood does not hold, without significantly \naffecting the reconstruction error rate. 
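
The Bayesian accumulation of evidence used by the randomized version (Section 4) can be made concrete with a short Python sketch. It assumes the update rule p_i = p_{i-1} d_i / (p_{i-1} d_i + (1 - p_{i-1})(G + 1 - d_i)) with G = 1 - (1/2)^|T|. The function names, the starting prior of 0.5, the decision threshold of 1/2, and the example test outcomes are illustrative assumptions, not values or code from the paper.

```python
# Sketch of the posterior update for a link between X and Y (illustrative names).

def importance_factor(t_size):
    # G = 1 - (1/2)^|T|, where T is the smaller of B(X) - {Y} and B(Y) - {X}.
    return 1.0 - 0.5 ** t_size


def update_posterior(prior, d, g):
    # prior: posterior after the previous test; d: dependence probability of this test.
    numerator = prior * d
    denominator = numerator + (1.0 - prior) * (g + 1.0 - d)
    if denominator == 0.0:
        return prior  # degenerate case (e.g. prior = 1 and d = 0): leave unchanged
    return numerator / denominator


def accumulate(tests, t_size, prior=0.5):
    # Fold a sequence of test outcomes into a single posterior probability.
    g = importance_factor(t_size)
    p = prior
    for d in tests:
        p = update_posterior(p, d, g)
    return p


posterior = accumulate([0.9, 0.8, 0.95], t_size=2)
link_present = posterior > 0.5  # the final hard decision against a threshold
```

Deferring the comparison with the threshold until all tests have been folded in is what lets a single noisy test be outweighed by the others.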
\nReferences \n\n[Chow68] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 1968. \n[Herskovits90] E.H. Herskovits and G.F. Cooper. Kutató: An entropy-driven system for construction of probabilistic expert systems from databases. UAI-90. \n[Spirtes93] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search, Springer, 1993. \n[Pearl88] J. Pearl. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, 1988. \n[Rebane87] G. Rebane and J. Pearl. The recovery of causal poly-trees from statistical data. UAI-87. \n[Verma90] T.S. Verma and J. Pearl. Equivalence and Synthesis of Causal Models. UAI-90. \n[Agosta88] J.M. Agosta. The structure of Bayes networks for visual recognition. UAI-88. \n[Cheng97] J. Cheng, D.A. Bell, and W. Liu. An algorithm for Bayesian network construction from data. AI and Statistics, 1997. \n[Margaritis99] D. Margaritis and S. Thrun. Bayesian Network Induction via Local Neighborhoods. TR CMU-CS-99-134, forthcoming. \n[Jünger85] M. Jünger. Polyhedral combinatorics and the acyclic subdigraph problem, Heldermann, 1985. \n", "award": [], "sourceid": 1685, "authors": [{"given_name": "Dimitris", "family_name": "Margaritis", "institution": null}, {"given_name": "Sebastian", "family_name": "Thrun", "institution": null}]}