{"title": "Discovering Structure in Continuous Variables Using Bayesian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 500, "page_last": 506, "abstract": null, "full_text": "Discovering Structure in Continuous \nVariables Using  Bayesian Networks \n\nReimar Hofmann and  Volker  Tresp* \n\nSiemens AG,  Central Research \n\nOtto-Hahn-Ring 6 \n\n81730  Munchen,  Germany \n\nAbstract \n\nWe  study  Bayesian  networks  for  continuous  variables  using  non(cid:173)\nlinear  conditional  density  estimators.  We  demonstrate  that  use(cid:173)\nful  structures  can  be extracted  from  a  data set  in  a  self-organized \nway and we  present sampling techniques for  belief update based on \nMarkov  blanket conditional density models. \n\n1 \n\nIntroduction \n\nOne  of the  strongest  types  of information that  can  be  learned  about  an  unknown \nprocess  is  the discovery  of dependencies  and -even more important- of indepen(cid:173)\ndencies.  A  superior example is  medical epidemiology where  the goal is  to find  the \ncauses  of  a  disease  and  exclude  factors  which  are  irrelevant.  Whereas  complete \nindependence  between  two  variables  in  a  domain might  be  rare  in  reality  (which \nwould mean that the joint probability density of variables A  and B  can be factored: \np(A, B)  = p(A)p(B)),  conditional  independence  is  more  common  and  is  often  a \nresult  from  true or apparent  causality:  consider  the  case  that  A  is  the  cause  of B \nand B  is  the  cause  of C,  then p(CIA, B) = p(CIB)  and  A  and  C  are  independent \nunder the condition  that B  is  known.  Precisely  this notion of cause and effect  and \nthe  resulting  independence  between  variables  is  represented  explicitly  in  Bayesian \nnetworks.  Pearl  (1988)  has convincingly argued  that causal  thinking leads  to clear \nknowledge  representation  in form  of conditional probabilities and  to efficient  local \nbelief propagating rules. \n\nBayesian networks form a  complete probabilistic model in the sense  that they repre(cid:173)\nsent the joint probability distribution of all variables involved.  Two of the powerful \n\nReimar.Hofmann@zfe.siemens.de  Volker.Tresp@zfe.siemens.de \n\n\fDiscovering Structure in Continuous Variables  Using  Bayesian Networks \n\n501 \n\nfeatures  of Bayesian networks are that any variable can be predicted from any sub(cid:173)\nset  of known other variables and  that Bayesian  networks  make explicit statements \nabout the certainty of the estimate of the state of a  variable.  Both aspects  are par(cid:173)\nticularly important for  medical or fault  diagnosis systems.  More  recently,  learning \nof structure  and  of parameters  in  Bayesian  networks  has  been addressed  allowing \nfor  the discovery of structure  between  variables  (Buntine,  1994, Heckerman,  1995). \n\nMost  of the  research  on  Bayesian  networks  has  focused  on  systems  with  discrete \nvariables,  linear Gaussian models or combinations of both.  Except  for  linear mod(cid:173)\nels,  continuous  variables  pose  a  problem for  Bayesian  networks.  In  Pearl's  words \n(Pearl,  1988):  \"representing each  [continuous]  quantity by an estimated magnitude \nand a  range of uncertainty,  we  quickly produce a  computational mess.  
Most of the research on Bayesian networks has focused on systems with discrete variables, linear Gaussian models, or combinations of both. Except for linear models, continuous variables pose a problem for Bayesian networks. In Pearl's words (Pearl, 1988): "representing each [continuous] quantity by an estimated magnitude and a range of uncertainty, we quickly produce a computational mess. [Continuous variables] actually impose a computational tyranny of their own." In this paper we present approaches to applying the concept of Bayesian networks to arbitrary nonlinear relations between continuous variables. Because they are fast learners, we use Parzen-window-based conditional density estimators for modeling local dependencies. We demonstrate how a parsimonious Bayesian network can be extracted from a data set using unsupervised, self-organized learning. For belief update we use local Markov blanket conditional density models which, in combination with Gibbs sampling, allow relatively efficient sampling from the conditional density of an unknown variable.

2 Bayesian Networks

This brief introduction to Bayesian networks closely follows Heckerman (1995). Considering a joint probability density p(x) over a set of variables {x_1, ..., x_N} (for simplicity of notation we treat only the continuous case; handling mixtures of continuous and discrete variables does not impose any additional difficulties), we can decompose it using the chain rule of probability

    p(x) = \prod_{i=1}^{N} p(x_i | x_1, ..., x_{i-1}).    (1)

For each variable x_i, let the parents of x_i, denoted by P_i ⊆ {x_1, ..., x_{i-1}}, be a set of variables that renders x_i and {x_1, ..., x_{i-1}} independent, that is

    p(x_i | x_1, ..., x_{i-1}) = p(x_i | P_i).    (2)

Usually the smallest such set is used; note that P_i is defined with respect to a given ordering of the variables. P_i does not need to include all elements of {x_1, ..., x_{i-1}}, which indicates conditional independence between x_i and those variables not included in P_i, given that the variables in P_i are known. The dependencies between the variables are often depicted as directed acyclic graphs (DAGs), i.e. graphs not containing any directed loops, with directed arcs from the members of P_i (the parents) to x_i (the child). Bayesian networks are a natural description of dependencies between variables if they depict causal relationships between variables. Bayesian networks are commonly used as a representation of the knowledge of domain experts. Experts both define the structure of the Bayesian network and the local conditional probabilities. Recently there has been great emphasis on learning structure and parameters in Bayesian networks (Heckerman, 1995). Most previous work concentrated on models with only discrete variables, or on linear models of continuous variables in which the distribution of all continuous variables given all discrete variables is a multidimensional Gaussian. In this paper we use these ideas in the context of continuous variables and nonlinear dependencies.
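To make the factorization in Equations 1 and 2 concrete, here is a minimal sketch that evaluates a joint density as the product of local conditional densities over a hypothetical three-variable chain x1 → x2 → x3. The linear-Gaussian conditionals and all numbers are made up for illustration; the local models actually used in this paper are the nonlinear conditional Parzen estimators of Section 5.

```python
import numpy as np

def gauss(x, mean, var):
    """Univariate normal density G(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def joint_density(x):
    """p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2) for the chain x1 -> x2 -> x3,
    i.e. parent sets P_1 = {}, P_2 = {x1}, P_3 = {x2} in Equations 1 and 2."""
    x1, x2, x3 = x
    p1 = gauss(x1, 0.0, 1.0)          # root node, no parents
    p2 = gauss(x2, 0.8 * x1, 0.5)     # x2 depends only on its parent x1
    p3 = gauss(x3, -0.3 * x2, 0.7)    # x3 depends only on its parent x2
    return p1 * p2 * p3

print(joint_density([0.1, -0.2, 0.4]))
```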
3 Learning Structure and Parameters in Nonlinear Continuous Bayesian Networks

Many of the structures developed in the neural network community can be used to model the conditional density distribution of continuous variables p(x_i | P_i). Under the usual signal-plus-independent-Gaussian-noise model, a feedforward neural network NN(.) is a conditional density model such that p(x_i | P_i) = G(x_i; NN(P_i), σ²), where G(x; c, σ²) is our notation for a normal density centered at c and with variance σ². More complex conditional densities can, for example, be modeled by mixtures of experts or by Parzen-window-based density estimators, which we used in our experiments (Section 5). We will write p^M(x_i | P_i) for a generic conditional probability model. Following Equations 1 and 2, the joint probability model is then

    p^M(x) = \prod_{i=1}^{N} p^M(x_i | P_i).    (3)

Learning Bayesian networks is usually decomposed into the problems of learning structure (that is, the arcs in the network) and of learning the conditional density models p^M(x_i | P_i) given the structure. (Differing from Heckerman, we do not follow a fully Bayesian approach in which priors are defined on parameters and structure; a fully Bayesian approach is elegant if the occurring integrals can be solved in closed form, which is not the case for general nonlinear models or if data are incomplete.) First assume the structure of the network is given. If the data set contains only complete data, we can train the conditional density models p^M(x_i | P_i) independently of each other, since the log-likelihood of the model decomposes conveniently into the individual likelihoods of the models for the conditional probabilities. Next, consider two competing network structures. We are basically faced with the well-known bias-variance dilemma: if we choose a network with too many arcs, we introduce large parameter variance, and if we remove too many arcs we introduce bias. Here, the problem is even more complex since we also have the freedom to reverse arcs. In our experiments we evaluate different network structures based on the model likelihood using leave-one-out cross-validation, which defines our scoring function for different network structures. More explicitly, the score for network structure S is

    Score = log p(S) + L_cv,

where p(S) is a prior over the network structures and

    L_cv = \sum_{k=1}^{K} log p^M(x^k | S, X \ {x^k})

is the leave-one-out cross-validation log-likelihood (later referred to as cv-log-likelihood). Here X = {x^k}_{k=1}^{K} is the set of training samples, and p^M(x^k | S, X \ {x^k}) is the probability density of sample x^k given the structure S and all other samples. Each of the terms p^M(x^k | S, X \ {x^k}) can be computed from local densities using Equation 3.

Even for small networks it is computationally impossible to calculate the score for all possible network structures, and the search for the globally optimal network structure is NP-hard. In Section 5 we describe a heuristic search which is closely related to search strategies commonly used in discrete Bayesian networks (Heckerman, 1995).
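As a rough illustration of this scoring function, the following sketch computes Score = log p(S) + L_cv for a structure given as parent sets. The structure prior is taken to be a simple per-arc penalty (one of the options discussed in Section 4), and the local models are linear-Gaussian conditionals refit for each held-out sample, chosen only to keep the example short; the paper's experiments use conditional Parzen-window models instead, and the toy data below are made up.

```python
import numpy as np

def loo_cv_score(X, parents, arc_penalty=0.1):
    """Score = log p(S) + L_cv for a structure S given as {node: parent indices}.

    log p(S) is taken proportional to -arc_penalty * (number of arcs), and
    L_cv = sum_k log p^M(x^k | S, X \\ {x^k}) decomposes over nodes via the
    factorization of Equation 3.  The local models here are linear-Gaussian
    conditionals refit on X \\ {x^k} (an illustrative stand-in)."""
    K, N = X.shape
    l_cv = 0.0
    for k in range(K):
        train = np.delete(X, k, axis=0)          # leave sample k out
        for i in range(N):
            pa = list(parents[i])
            # design matrix: the parents of node i plus a constant term
            A = np.column_stack([train[:, pa], np.ones(len(train))])
            w, *_ = np.linalg.lstsq(A, train[:, i], rcond=None)
            resid = train[:, i] - A @ w
            var = max(resid.var(), 1e-6)
            mean_k = np.concatenate([X[k, pa], [1.0]]) @ w
            l_cv += -0.5 * np.log(2 * np.pi * var) - 0.5 * (X[k, i] - mean_k) ** 2 / var
    n_arcs = sum(len(p) for p in parents.values())
    return -arc_penalty * n_arcs + l_cv

# toy usage: 50 samples of the chain x0 -> x1 -> x2
rng = np.random.default_rng(0)
x0 = rng.normal(size=50)
x1 = 0.8 * x0 + 0.3 * rng.normal(size=50)
x2 = -0.5 * x1 + 0.3 * rng.normal(size=50)
X = np.column_stack([x0, x1, x2])
print(loo_cv_score(X, {0: (), 1: (0,), 2: (1,)}))
```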
4 Prior Models

In a Bayesian framework it is useful to provide means for exploiting prior knowledge, typically by introducing a bias towards simple structures. Biasing models towards simple structures is also useful if the model selection criterion is based on cross-validation, as in our case, because of the variance in this score. In the experiments we added a penalty per arc to the log-likelihood, i.e. log p(S) ∝ -α N_A, where N_A is the number of arcs and the parameter α determines the weight of the penalty. Given more specific knowledge in the form of a structure defined by a domain expert, we can alternatively penalize the deviation from that arc structure (Heckerman, 1995). Furthermore, prior knowledge can be introduced in the form of a set of artificial training data. These can be treated identically to real data and loosely correspond to the concept of a conjugate prior.

5 Experiment

In the experiment we used Parzen-window-based conditional density estimators to model the conditional densities p^M(x_i | P_i) from Equation 2, i.e.

    p^M(x_i | P_i) = \frac{\sum_{k=1}^{K} G((x_i, P_i); (x_i^k, P_i^k), σ_i²)}{\sum_{k=1}^{K} G(P_i; P_i^k, σ_i²)},    (4)

where {x^k}_{k=1}^{K} is the training set. The Gaussians in the numerator are centered at (x_i^k, P_i^k), the location of the k-th sample in the joint input/output (or parent/child) space, and the Gaussians in the denominator are centered at P_i^k, the location of the k-th sample in the input (or parent) space. For each conditional model, σ_i was optimized using leave-one-out cross-validation. (Note that if we maintained a global σ for all density estimators, we would maintain likelihood equivalence, which means that each network displaying the same independence model gets the same score on any test set.)
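A minimal sketch of such a conditional Parzen-window estimator, under the assumptions just stated (spherical Gaussian kernels of a single width σ, numerator over the joint child/parent space, denominator over the parent space), might look as follows. The leave-one-out choice of σ is omitted, and the data in the usage lines are made up.

```python
import numpy as np

def gauss_nd(u, centers, var):
    """Spherical Gaussian density with the same variance `var` in every dimension,
    evaluated at the point u for each row of `centers`."""
    d = u.shape[-1]
    diff = u - centers
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / var) / (2.0 * np.pi * var) ** (d / 2.0)

def conditional_parzen(x_i, p_i, child_samples, parent_samples, sigma):
    """Conditional Parzen-window density p^M(x_i | P_i) in the spirit of Equation 4:
    a ratio of kernel sums in the joint (child, parents) space and in the parent
    space, all kernels sharing one width sigma."""
    joint_query = np.concatenate([[x_i], p_i])
    joint_centers = np.column_stack([child_samples, parent_samples])
    numerator = gauss_nd(joint_query, joint_centers, sigma ** 2).sum()
    denominator = gauss_nd(p_i, parent_samples, sigma ** 2).sum()
    return numerator / denominator

# toy usage with one hypothetical parent variable
rng = np.random.default_rng(1)
parent = rng.normal(size=100).reshape(-1, 1)
child = np.sin(parent[:, 0]) + 0.1 * rng.normal(size=100)
print(conditional_parzen(0.8, np.array([1.0]), child, parent, sigma=0.2))
```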
The unsupervised structure optimization procedure starts with a complete Bayesian model corresponding to Equation 1, i.e. a model in which there is an arc between every pair of variables (the order of the nodes, which determines the direction of the initial arcs, is random). Next, we tentatively try all possible arc direction changes, arc removals, and arc additions which do not produce directed loops, and evaluate the change in score. After evaluating all legal single modifications, we accept the change which improves the score the most. The procedure stops when every arc change decreases the score. This greedy strategy can get stuck in local minima, which could in principle be avoided if changes that result in worse performance are also accepted with a nonzero probability, as in annealing strategies (Heckerman, 1995). (In our experiments we treated very small changes in score as if they were exactly zero, thus allowing small decreases in score.) Calculating the new score at each step requires only local computation. The removal or addition of an arc corresponds to a simple removal or addition of the corresponding dimension in the Gaussians of the local density model. However, after each such operation the widths σ_i of the Gaussians in the affected local models have to be re-optimized. An arc reversal is simply the execution of an arc removal followed by an arc addition.
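The following sketch shows the overall shape of this greedy loop under simplified assumptions: the DAG is a dict mapping each node to a tuple of parents, candidate moves are single arc additions, removals, and reversals that keep the graph acyclic, and the score function is passed in. The stand-in score in the usage line merely penalizes arcs; in the paper the score would be the cv-log-likelihood of Section 3 plus the arc penalty of Section 4, and the re-optimization of the affected kernel widths would happen inside score_fn.

```python
from itertools import permutations

def is_acyclic(parents):
    """Check that the parent sets define a directed acyclic graph."""
    visited, stack = set(), set()
    def visit(node):
        if node in stack:
            return False          # back edge: directed loop
        if node in visited:
            return True
        stack.add(node)
        ok = all(visit(p) for p in parents[node])
        stack.discard(node)
        visited.add(node)
        return ok
    return all(visit(n) for n in parents)

def single_arc_changes(parents, i, j):
    """Candidate structures obtained by removing, reversing, or adding the arc i -> j."""
    cands = []
    if i in parents[j]:
        removal = dict(parents)
        removal[j] = tuple(p for p in parents[j] if p != i)
        cands.append(removal)                       # remove i -> j
        reversal = dict(removal)
        reversal[i] = tuple(parents[i]) + (j,)
        cands.append(reversal)                      # reverse to j -> i
    else:
        addition = dict(parents)
        addition[j] = tuple(parents[j]) + (i,)
        cands.append(addition)                      # add i -> j
    return cands

def greedy_search(n_vars, score_fn, initial_parents):
    """Greedy structure search: repeatedly apply the single legal arc change
    (addition, removal, or reversal) that improves score_fn the most; stop when
    no single change improves the score."""
    parents = {i: tuple(initial_parents[i]) for i in range(n_vars)}
    best = score_fn(parents)
    while True:
        best_change, best_score = None, best
        for i, j in permutations(range(n_vars), 2):
            for cand in single_arc_changes(parents, i, j):
                if not is_acyclic(cand):
                    continue
                s = score_fn(cand)
                if s > best_score:
                    best_change, best_score = cand, s
        if best_change is None:
            return parents, best
        parents, best = best_change, best_score

# toy usage with a stand-in score that merely penalizes arcs
toy_score = lambda parents: -sum(len(v) for v in parents.values())
print(greedy_search(3, toy_score, {0: (1, 2), 1: (2,), 2: ()}))
```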
[Figure 1: two-panel plot; left panel x-axis "Number of iterations", right panel x-axis "Number of inputs".]

Figure 1: Left: evolution of the cv-log-likelihood (dashed) and of the log-likelihood on the test set (continuous) during structure optimization. The curves are averages over 20 runs with different partitions of training and test sets, and the likelihoods are normalized with respect to the number of cv- or test-samples, respectively. The penalty per arc was α = 0.1. The dotted line shows the Parzen joint density model commonly used in statistics, i.e. assuming no independencies and using the same width for all Gaussians in all conditional density models. Right: log-likelihood of the local conditional Parzen model for variable 3 (p^M(x_3 | P_3)) on the test set (continuous) and the corresponding cv-log-likelihood (dashed) as a function of the number of parents (inputs).

1  crime rate
2  percent land zoned for lots
3  percent nonretail business
4  located on Charles river?
5  nitrogen oxide concentration
6  average number of rooms
7  percent built before 1940
8  weighted distance to employment center
9  access to radial highways
10 tax rate
11 pupil/teacher ratio
12 percent black
13 percent lower-status population
14 median value of homes

Figure 2: Final structure of a run on the full data set, over the variables numbered above. [The arcs of the learned structure are shown only graphically in the original figure.]

In our experiment, we used the Boston housing data set, which contains 506 samples. Each sample consists of the housing price and 13 further variables which supposedly influence the housing price in a Boston neighborhood (Figure 2). Figure 1 (left) shows an experiment in which one third of the samples was reserved as a test set to monitor the process. Since the algorithm never sees the test data, the increase in likelihood of the model on the test data is an unbiased estimate of how much the model has improved by the extraction of structure from the data. The large increase in the log-likelihood can be understood by studying Figure 1 (right). Here we picked a single variable (node 3) and formed a density model to predict this variable from the remaining 13 variables. Then we removed input variables in the order of their significance; after the removal of a variable, σ_3 is re-optimized. Note that the cv-log-likelihood increases until only three input variables are left, because irrelevant variables, or variables which are well represented by the remaining input variables, are removed. The log-likelihood of the fully connected initial model is therefore low (Figure 1, left).

We did a second set of 15 runs with no test set. The scores of the final structures had a standard deviation of only 0.4. However, comparing the final structures in terms of undirected arcs, the difference was 18% on average. (Since the direction of arcs is not unique, we used the difference in undirected arcs to compare two structures: the number of arcs present in one and only one of the structures, normalized with respect to the number of arcs in a fully connected network.) The structure from one of these runs is depicted in Figure 2. In comparison to the initial complete structure with 91 arcs, only 18 arcs are left and 8 arcs have changed direction.

One of the advantages of Bayesian networks is that they can be easily interpreted. The goal of the original Boston housing data experiment was to examine whether the nitrogen oxide concentration (5) influences the housing price (14). Under the structure extracted by the algorithm, 5 and 14 are dependent given all other variables because they have a common child, 13. However, if all variables except 13 are known, then they are independent. Another interesting question is what the relevant quantities are for predicting the housing price, i.e. which variables have to be known to render the housing price independent of all other variables. These are the parents, children, and children's parents of variable 14, that is, variables 8, 10, 11, 6, 13 and 5. It is well known that in Bayesian networks different constellations of arc directions may induce the same independencies, i.e. that the direction of arcs is not uniquely determined. It can therefore not be expected that the arcs actually reflect the direction of causality.

6 Missing Data and Markov Blanket Conditional Density Models

Bayesian networks are typically used in applications where variables might be missing. Given partial information (i.e. the states of a subset of the variables), the goal is to update the beliefs (i.e. the probabilities) of all unknown variables. Whereas there are powerful local update rules for networks of discrete variables without (undirected) loops, belief update in networks with loops is in general NP-hard. A generally applicable update rule for the unknown variables in networks of discrete or continuous variables is Gibbs sampling. Gibbs sampling can be roughly described as follows: for all variables whose state is known, fix their states to the known values; for all unknown variables, choose some initial states. Then pick an unknown variable x_i and update its value by drawing from the probability distribution

    p(x_i | {x_1, ..., x_N} \ {x_i}) ∝ p(x_i | P_i) \prod_{j: x_i ∈ P_j} p(x_j | P_j).    (5)

Do this repeatedly for all unknown variables and discard the first samples. The samples generated thereafter are drawn from the probability distribution of the unknown variables given the known variables. Using these samples it is easy to calculate the expected value of any of the unknown variables, and to estimate variances, covariances, and other statistical measures such as the mutual information between variables.
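A rough sketch of this scheme is given below. Since drawing directly from Equation 5 is the difficult part (taken up next), the sketch simply evaluates the right-hand side of Equation 5 on a fixed grid and samples from the normalized values; this discretization is purely illustrative and is not the Markov-blanket approach proposed in the paper. The local_density callback, the children table, the toy chain model, and all numerical settings are assumptions made for the example.

```python
import numpy as np

def gibbs_missing(x_obs, known, local_density, children,
                  n_iter=500, burn_in=100, grid=np.linspace(-4.0, 4.0, 201), rng=None):
    """Gibbs sampling for belief update with missing values (Section 6).

    Each unknown variable x_i is repeatedly resampled from Equation 5,
        p(x_i | rest) ∝ p(x_i | P_i) * prod over children j of p(x_j | P_j),
    approximated here by evaluating the right-hand side on a fixed grid and
    sampling from the normalized values.  local_density(i, x) must return
    p^M(x_i | P_i), up to a constant factor, for the full assignment x.
    """
    rng = rng or np.random.default_rng()
    x = np.array(x_obs, dtype=float)
    x[~known] = 0.0                                  # arbitrary initial states
    samples = []
    for it in range(n_iter):
        for i in np.where(~known)[0]:
            dens = np.empty_like(grid)
            for g, val in enumerate(grid):
                x[i] = val
                d = local_density(i, x)
                for j in children[i]:                # children contribute p(x_j | P_j)
                    d *= local_density(j, x)
                dens[g] = d
            dens /= dens.sum()
            x[i] = rng.choice(grid, p=dens)
        if it >= burn_in:
            samples.append(x.copy())
    return np.array(samples)                         # draws from p(unknowns | knowns)

# toy usage: chain x0 -> x1 with linear-Gaussian conditionals, x1 observed at 1.0
def local_density(i, x):
    if i == 0:
        return np.exp(-0.5 * x[0] ** 2)              # p(x0), up to a constant
    return np.exp(-0.5 * (x[1] - 0.8 * x[0]) ** 2)   # p(x1 | x0), up to a constant

s = gibbs_missing(np.array([0.0, 1.0]), np.array([False, True]), local_density,
                  children={0: [1], 1: []}, n_iter=300, burn_in=50)
print(s[:, 0].mean())                                # posterior mean of x0 given x1 = 1
```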
Gibbs sampling requires sampling from the univariate probability distribution in Equation 5, which is not straightforward in our model since the conditional density does not have a convenient form. Therefore, sampling techniques such as importance sampling have to be used; in our case they typically produce many rejected samples and are therefore inefficient. An alternative is sampling based on Markov blanket conditional density models. The Markov blanket M_i of x_i is the smallest set of variables such that p(x_i | {x_1, ..., x_N} \ {x_i}) = p(x_i | M_i); given a Bayesian network, the Markov blanket of a variable consists of its parents, its children, and its children's parents. The idea is to form a conditional density model p^M(x_i | M_i) ≈ p(x_i | M_i) for each variable in the network instead of computing the conditional according to Equation 5. Sampling from this model is simple using conditional Parzen models: the conditional density is a mixture of Gaussians from which we can sample without rejection. (There are, however, several open issues concerning consistency between the conditional models.) Markov blanket conditional density models are also interesting if we are only interested in always predicting one particular variable, as in most neural network applications. Assuming that a signal-plus-noise model is a reasonably good model for the conditional density, we can train an ordinary neural network to predict the variable of interest. In addition, we train a model for each input variable, predicting it from the remaining variables. Besides having obtained a model for the complete-data case, we can now also handle missing inputs and do backward inference using Gibbs sampling.

7 Conclusions

We demonstrated that Bayesian networks built from local conditional density estimators form promising nonlinear dependency models for continuous variables. The conditional density models can be trained locally if the training data are complete. In this paper we focused on the self-organized extraction of structure. Bayesian networks can also serve as a framework for the modular construction of large systems out of smaller conditional density models; the Bayesian framework provides consistent update rules for the probabilities, i.e. for the communication between modules. Finally, consider input pruning or variable selection in neural networks. Our pruning strategy in Figure 1 can be considered a form of variable selection which removes not only variables that are statistically independent of the output variable but also variables that are well represented by the remaining variables. This way we obtain more compact models. If input values are missing, the indirect influence of the pruned variables on the output can be recovered by the sampling mechanism.

References

Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research 2: 159-225.

Heckerman, D. (1995). A tutorial on learning Bayesian networks. Microsoft Research, Technical Report MSR-TR-95-06.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann.