{"title": "Discovering Structure in Continuous Variables Using Bayesian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 500, "page_last": 506, "abstract": null, "full_text": "Discovering Structure in Continuous Variables Using Bayesian Networks \n\nReimar Hofmann and Volker Tresp* \n\nSiemens AG, Central Research \nOtto-Hahn-Ring 6 \n81730 Munchen, Germany \n\nAbstract \n\nWe study Bayesian networks for continuous variables using nonlinear conditional density estimators. We demonstrate that useful structures can be extracted from a data set in a self-organized way, and we present sampling techniques for belief update based on Markov blanket conditional density models. \n\n1 Introduction \n\nOne of the strongest types of information that can be learned about an unknown process is the discovery of dependencies and, even more important, of independencies. A prime example is medical epidemiology, where the goal is to find the causes of a disease and to exclude factors which are irrelevant. Whereas complete independence between two variables in a domain might be rare in reality (it would mean that the joint probability density of variables A and B can be factored: p(A, B) = p(A)p(B)), conditional independence is more common and often results from true or apparent causality: if A is the cause of B and B is the cause of C, then p(C|A, B) = p(C|B), and A and C are independent under the condition that B is known. Precisely this notion of cause and effect and the resulting independence between variables is represented explicitly in Bayesian networks. Pearl (1988) has convincingly argued that causal thinking leads to clear knowledge representation in the form of conditional probabilities and to efficient local belief propagation rules. 
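The chain example above is easy to verify numerically. The following sketch (our illustration, not from the paper; all names and numbers are invented) builds the joint density of a small discrete chain A -> B -> C from its causal factorization and checks that p(C|A,B) = p(C|B):

```python
# Toy discrete chain A -> B -> C: if A causes B and B causes C,
# then C is independent of A given B, i.e. p(C|A,B) = p(C|B).
import itertools

p_A = {0: 0.3, 1: 0.7}
p_B_given_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p(B=b | A=a)
p_C_given_B = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # p(C=c | B=b)

# Joint distribution built from the causal factorization p(A)p(B|A)p(C|B).
joint = {(a, b, c): p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]
         for a, b, c in itertools.product([0, 1], repeat=3)}

def cond_C(a, b):
    """p(C=1 | A=a, B=b) recovered from the joint by normalization."""
    num = joint[(a, b, 1)]
    den = joint[(a, b, 0)] + joint[(a, b, 1)]
    return num / den

# p(C | A, B) does not depend on A: conditional independence holds.
for b in (0, 1):
    assert abs(cond_C(0, b) - cond_C(1, b)) < 1e-12
```

The same check, applied to estimated densities instead of exact tables, is the basic intuition behind testing conditional independencies in data.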
\n\nBayesian networks form a complete probabilistic model in the sense that they represent the joint probability distribution of all variables involved. Two of the powerful features of Bayesian networks are that any variable can be predicted from any subset of known other variables and that Bayesian networks make explicit statements about the certainty of the estimate of the state of a variable. Both aspects are particularly important for medical or fault diagnosis systems. More recently, learning of structure and of parameters in Bayesian networks has been addressed, allowing for the discovery of structure between variables (Buntine, 1994; Heckerman, 1995). \n\n*Reimar.Hofmann@zfe.siemens.de, Volker.Tresp@zfe.siemens.de \n\nMost of the research on Bayesian networks has focused on systems with discrete variables, linear Gaussian models, or combinations of both. Except for linear models, continuous variables pose a problem for Bayesian networks. In Pearl's words (Pearl, 1988): \"representing each [continuous] quantity by an estimated magnitude and a range of uncertainty, we quickly produce a computational mess. [Continuous variables] actually impose a computational tyranny of their own.\" In this paper we present approaches to applying the concept of Bayesian networks to arbitrary nonlinear relations between continuous variables. Because they are fast learners, we use Parzen-window-based conditional density estimators for modeling local dependencies. We demonstrate how a parsimonious Bayesian network can be extracted from a data set using unsupervised self-organized learning. For belief update we use local Markov blanket conditional density models which, in combination with Gibbs sampling, allow relatively efficient sampling from the conditional density of an unknown variable. 
\n\n2 Bayesian Networks \n\nThis brief introduction to Bayesian networks follows closely Heckerman, 1995. Considering a joint probability density^1 p(x) over a set of variables {x_1, ..., x_N}, we can decompose it using the chain rule of probability \n\np(x) = ∏_{i=1}^{N} p(x_i | x_1, ..., x_{i-1}).    (1) \n\nFor each variable x_i, let the parents of x_i, denoted by P_i ⊆ {x_1, ..., x_{i-1}}, be a set of variables^2 that renders x_i and {x_1, ..., x_{i-1}} independent, that is \n\np(x_i | x_1, ..., x_{i-1}) = p(x_i | P_i).    (2) \n\nNote that P_i does not need to include all elements of {x_1, ..., x_{i-1}}, which indicates conditional independence between those variables not included in P_i and x_i, given that the variables in P_i are known. The dependencies between the variables are often depicted as directed acyclic^3 graphs (DAGs) with directed arcs from the members of P_i (the parents) to x_i (the child). Bayesian networks are a natural description of dependencies between variables if they depict causal relationships between variables. Bayesian networks are commonly used as a representation of the knowledge of domain experts. Experts both define the structure of the Bayesian network and the local conditional probabilities. Recently there has been great emphasis on learning structure and parameters in Bayesian networks (Heckerman, 1995). Most previous work concentrated on models with only discrete variables or on linear models of continuous variables, where the probability distribution of all continuous variables given all discrete variables is a multidimensional Gaussian. \n\n^1 For simplicity of notation we will only treat the continuous case. Handling mixtures of continuous and discrete variables does not impose any additional difficulties. \n\n^2 Usually the smallest set will be used. Note that P_i is defined with respect to a given ordering of the variables. \n\n^3 I.e. not containing any directed loops. 
In this paper we use these ideas in the context of continuous variables and nonlinear dependencies. \n\n3 Learning Structure and Parameters in Nonlinear Continuous Bayesian Networks \n\nMany of the structures developed in the neural network community can be used to model the conditional density distribution of continuous variables p(x_i|P_i). Under the usual signal-plus-independent-Gaussian-noise model, a feedforward neural network NN(.) is a conditional density model such that p(x_i|P_i) = G(x_i; NN(P_i), σ^2), where G(x; c, σ^2) is our notation for a normal density centered at c and with variance σ^2. More complex conditional densities can, for example, be modeled by mixtures of experts or by Parzen-window-based density estimators, which we used in our experiments (Section 5). We will use p^M(x_i|P_i) for a generic conditional probability model. The joint probability model is then \n\np^M(x) = ∏_{i=1}^{N} p^M(x_i|P_i),    (3) \n\nfollowing Equations 1 and 2. Learning Bayesian networks is usually decomposed into the problems of learning structure (that is, the arcs in the network) and of learning the conditional density models p^M(x_i|P_i) given the structure^4. First assume the structure of the network is given. If the data set only contains complete data, we can train the conditional density models p^M(x_i|P_i) independently of each other, since the log-likelihood of the model decomposes conveniently into the individual likelihoods of the models for the conditional probabilities. Next, consider two competing network structures. We are basically faced with the well-known bias-variance dilemma: if we choose a network with too many arcs, we introduce large parameter variance, and if we remove too many arcs we introduce bias. Here, the problem is even more complex since we also have the freedom to reverse arcs. 
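The factorization in Equation 3 is straightforward to evaluate once the local models are fixed. A minimal sketch, assuming a hypothetical three-node network with simple linear predictors standing in for the fitted model NN(.) (the paper would use a neural network or a Parzen-window estimator):

```python
import math

def gauss_logpdf(x, mean, var):
    """log G(x; mean, var) for a univariate normal density."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Hypothetical tiny network: parents[i] lists the parent indices of node i,
# predict[i] is a stand-in for the conditional mean model of node i.
parents = {0: [], 1: [0], 2: [1]}
predict = {
    0: lambda p: 0.0,          # root node: constant mean
    1: lambda p: 2.0 * p[0],   # x1 depends on x0
    2: lambda p: p[0] - 1.0,   # x2 depends on x1
}
var = {0: 1.0, 1: 0.5, 2: 0.25}  # noise variances sigma_i^2

def joint_loglik(x):
    """log p^M(x) = sum_i log p^M(x_i | P_i), as in Equation 3."""
    total = 0.0
    for i, pa in parents.items():
        mean = predict[i]([x[j] for j in pa])
        total += gauss_logpdf(x[i], mean, var[i])
    return total
```

Because the log-likelihood is a sum of local terms, each conditional model can be fit independently when the data are complete, exactly as stated above.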
\nIn our experiments we evaluate different network structures based on the model likelihood using leave-one-out cross-validation, which defines our scoring function for different network structures. More explicitly, the score for network structure S is Score = log(p(S)) + L_cv, where p(S) is a prior over the network structures and L_cv = Σ_{k=1}^{K} log(p^M(x^k|S, X - {x^k})) is the leave-one-out cross-validation log-likelihood (later referred to as cv-log-likelihood). X = {x^k}_{k=1}^{K} is the set of training samples, and p^M(x^k|S, X - {x^k}) is the probability density of sample x^k given the structure S and all other samples. Each of the terms p^M(x^k|S, X - {x^k}) can be computed from local densities using Equation 3. \n\nEven for small networks it is computationally impossible to calculate the score for all possible network structures, and the search for the globally optimal network structure is NP-hard. In Section 5 we describe a heuristic search which is closely related to search strategies commonly used in discrete Bayesian networks (Heckerman, 1995). \n\n^4 Differing from Heckerman, we do not follow a fully Bayesian approach in which priors are defined on parameters and structure; a fully Bayesian approach is elegant if the occurring integrals can be solved in closed form, which is not the case for general nonlinear models or if data are incomplete. \n\n4 Prior Models \n\nIn a Bayesian framework it is useful to provide means for exploiting prior knowledge, typically introducing a bias for simple structures. Biasing models towards simple structures is also useful if the model selection criterion is based on cross-validation, as in our case, because of the variance in this score. In the experiments we added a penalty per arc to the log-likelihood, i.e. 
log p(S) ∝ -α N_A, where N_A is the number of arcs and the parameter α determines the weight of the penalty. Given more specific knowledge in the form of a structure defined by a domain expert, we can alternatively penalize the deviation in the arc structure (Heckerman, 1995). Furthermore, prior knowledge can be introduced in the form of a set of artificial training data. These can be treated identically to real data and loosely correspond to the concept of a conjugate prior. \n\n5 Experiment \n\nIn the experiment we used Parzen-window-based conditional density estimators to model the conditional densities p^M(x_i|P_i) from Equation 2, i.e. \n\np^M(x_i|P_i) = Σ_{k=1}^{K} G((x_i, P_i); (x_i^k, P_i^k), σ_i^2) / Σ_{k=1}^{K} G(P_i; P_i^k, σ_i^2),    (4) \n\nwhere {x^k}_{k=1}^{K} is the training set. The Gaussians in the numerator are centered at (x_i^k, P_i^k), which is the location of the k-th sample in the joint input/output (or parent/child) space, and the Gaussians in the denominator are centered at P_i^k, which is the location of the k-th sample in the input (or parent) space. For each conditional model, σ_i was optimized using leave-one-out cross-validation^5. \n\nThe unsupervised structure optimization procedure starts with a complete Bayesian model corresponding to Equation 1, i.e. a model where there is an arc between any pair of variables^6. Next, we tentatively try all possible arc direction changes, arc removals and arc additions which do not produce directed loops, and evaluate the change in score. After evaluating all legal single modifications, we accept the change which improves the score the most. The procedure stops if every arc change decreases the score. This greedy strategy can get stuck in local minima, which could in principle be avoided if changes which result in worse performance are also accepted with a nonzero probability^7 (as in annealing strategies, Heckerman, 1995). Calculating the new score at each step requires only local computation. 
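The conditional Parzen model of Equation 4 can be written down in a few lines. A minimal, unoptimized sketch (ours, not the authors' code), assuming isotropic Gaussians with a single shared width σ_i^2 across dimensions; all variable names are our own:

```python
import math

def gauss(u, c, var):
    """Isotropic multivariate Gaussian density G(u; c, var)."""
    d = len(u)
    sq = sum((ui - ci) ** 2 for ui, ci in zip(u, c))
    return (2 * math.pi * var) ** (-d / 2) * math.exp(-sq / (2 * var))

def parzen_conditional(x_i, p_i, samples, var):
    """Conditional Parzen estimate p^M(x_i | P_i), as in Equation 4.

    samples: list of (child value, parent vector) training pairs.
    var: Gaussian width sigma_i^2; in the paper it is chosen by
         leave-one-out cross-validation.
    """
    # Numerator: kernels in the joint (child, parents) space.
    num = sum(gauss([xc] + list(pc), [x_i] + list(p_i), var)
              for xc, pc in samples)
    # Denominator: kernels in the parent space only.
    den = sum(gauss(list(pc), list(p_i), var) for xc, pc in samples)
    return num / den
```

With a single training sample the ratio collapses to a univariate Gaussian in x_i centered at that sample's child value, which is a quick sanity check on the formula.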
\nThe removal or addition of an arc corresponds to a simple removal or addition of the corresponding dimension in the Gaussians of the local density model. However, \n\n^5 Note that if we maintained a global σ for all density estimators, we would maintain likelihood equivalence, which means that each network displaying the same independence model gets the same score on any test set. \n\n^6 The order of nodes determining the direction of initial arcs is random. \n\n^7 In our experiments we treated very small changes in score as if they were exactly zero, thus allowing small decreases in score. \n\nFigure 1: Left: evolution of the cv-log-likelihood (dashed) and of the log-likelihood on the test set (continuous) during structure optimization. The curves are averages over 20 runs with different partitions of training and test sets, and the likelihoods are normalized with respect to the number of cv- or test-samples, respectively. The penalty per arc was α = 0.1. The dotted line shows the Parzen joint density model commonly used in statistics, i.e. assuming no independencies and using the same width for all Gaussians in all conditional density models. Right: log-likelihood of the local conditional Parzen model for variable 3 (p^M(x_3|P_3)) on the test set (continuous) and the corresponding cv-log-likelihood (dashed) as a function of the number of parents (inputs). \n\n1 crime rate \n2 percent land zoned for lots \n3 percent nonretail business \n4 located on Charles river? 
\n5 nitrogen oxide concentration \n6 average number of rooms \n7 percent built before 1940 \n8 weighted distance to employment center \n9 access to radial highways \n10 tax rate \n11 pupil/teacher ratio \n12 percent black \n13 percent lower-status population \n14 median value of homes \n\nFigure 2: Final structure of a run on the full data set. \n\nafter each such operation the widths of the Gaussians σ_i in the affected local models have to be optimized. An arc reversal is simply the execution of an arc removal followed by an arc addition. \n\nIn our experiment, we used the Boston housing data set, which contains 506 samples. Each sample consists of the housing price and 13 variables which supposedly influence the housing price in a Boston neighborhood (Figure 2). Figure 1 (left) shows an experiment where one third of the samples was reserved as a test set to monitor the process. Since the algorithm never sees the test data, the increase in likelihood of the model on the test data is an unbiased estimate of how much the model has improved by the extraction of structure from the data. The large increase in the log-likelihood can be understood by studying Figure 1 (right). Here we picked a single variable (node 3) and formed a density model to predict this variable from the remaining 13 variables. Then we removed input variables in the order of their significance. After the removal of a variable, σ_3 is optimized. Note that the cv-log-likelihood increases until only three input variables are left, due to the fact that irrelevant variables, or variables which are well represented by the remaining input variables, are removed. The log-likelihood of the fully connected initial model is therefore low (Figure 1, left). \n\nWe did a second set of 15 runs with no test set. 
The scores of the final structures had a standard deviation of only 0.4. However, comparing the final structures in terms of undirected arcs^8, the difference was 18% on average. The structure from one of these runs is depicted in Figure 2. In comparison to the initial complete structure with 91 arcs, only 18 arcs are left and 8 arcs have changed direction. \n\nOne of the advantages of Bayesian networks is that they can be easily interpreted. The goal of the original Boston housing data experiment was to examine whether the nitrogen oxide concentration (5) influences the housing price (14). Under the structure extracted by the algorithm, 5 and 14 are dependent given all other variables because they have a common child, 13. However, if all variables except 13 are known, then they are independent. Another interesting question is what the relevant quantities are for predicting the housing price, i.e. which variables have to be known to render the housing price independent from all other variables. These are the parents, children, and children's parents of variable 14, that is, variables 8, 10, 11, 6, 13 and 5. It is well known that in Bayesian networks different constellations of arc directions may induce the same independencies, i.e. that the direction of arcs is not uniquely determined. It can therefore not be expected that the arcs actually reflect the direction of causality. \n\n6 Missing Data and Markov Blanket Conditional Density Model \n\nBayesian networks are typically used in applications where variables might be missing. Given partial information (i.e. the states of a subset of the variables), the goal is to update the beliefs (i.e. the probabilities) of all unknown variables. Whereas there are powerful local update rules for networks of discrete variables without (undirected) loops, the belief update in networks with loops is in general NP-hard. 
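The set of relevant variables described above (parents, children, and children's parents, i.e. the Markov blanket introduced in the next section) can be read off the graph mechanically. A minimal sketch on a hypothetical toy DAG, not the learned Boston housing structure:

```python
def markov_blanket(i, parents):
    """Markov blanket of node i in a DAG given as parents[i] = list of
    parent indices: parents, children, and children's other parents."""
    children = [j for j, pa in parents.items() if i in pa]
    blanket = set(parents[i]) | set(children)
    for j in children:
        blanket |= set(parents[j])  # co-parents enter via shared children
    blanket.discard(i)
    return blanket

# Toy DAG: 0 -> 2 <- 1, 2 -> 3
parents = {0: [], 1: [], 2: [0, 1], 3: [2]}
assert markov_blanket(2, parents) == {0, 1, 3}
assert markov_blanket(0, parents) == {1, 2}  # co-parent 1 enters via child 2
```

The second assertion illustrates why, as argued above, a variable with no direct arc to the target (here node 1) can still be needed to render the target independent of the rest.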
\nA generally applicable update rule for the unknown variables in networks of discrete or continuous variables is Gibbs sampling. Gibbs sampling can be roughly described as follows: for all variables whose state is known, fix their states to the known values. For all unknown variables, choose some initial states. Then pick a variable x_i which is not known and update its value following the probability distribution \n\np(x_i | {x_1, ..., x_N} \\ {x_i}) ∝ p(x_i|P_i) ∏_{j: x_i ∈ P_j} p(x_j|P_j).    (5) \n\nDo this repeatedly for all unknown variables. Discard the first samples. Then the samples which are generated are drawn from the probability distribution of the unknown variables given the known variables. Using these samples it is easy to calculate the expected value of any of the unknown variables, and to estimate variances, covariances and other statistical measures such as the mutual information between variables. \n\n^8 Since the direction of arcs is not unique, we used the difference in undirected arcs to compare two structures. We used the number of arcs present in one and only one of the structures, normalized with respect to the number of arcs in a fully connected network. \n\nGibbs sampling requires sampling from the univariate probability distribution in Equation 5, which is not straightforward in our model since the conditional density does not have a convenient form. Therefore, sampling techniques such as importance sampling have to be used. In our case they typically produce many rejected samples and are therefore inefficient. An alternative is sampling based on Markov blanket conditional density models. The Markov blanket of x_i, M_i, is the smallest set of variables such that p(x_i | {x_1, ..., x_N} \\ {x_i}) = p(x_i|M_i) (given a Bayesian network, the Markov blanket of a variable consists of its parents, its children, and its children's parents). 
The idea is to form a conditional density model p^M(x_i|M_i) ≈ p(x_i|M_i) for each variable in the network instead of computing it according to Equation 5. Sampling from this model is simple using conditional Parzen models: the conditional density is a mixture of Gaussians from which we can sample without rejection^9. Markov blanket conditional density models are also interesting if we are only interested in always predicting one particular variable, as in most neural network applications. Assuming that a signal-plus-noise model is a reasonably good model for the conditional density, we can train an ordinary neural network to predict the variable of interest. In addition, we train a model for each input variable predicting it from the remaining variables. Besides having obtained a model for the complete data case, we can now also handle missing inputs and do backward inference using Gibbs sampling. \n\n7 Conclusions \n\nWe demonstrated that Bayesian models of local conditional density estimators form promising nonlinear dependency models for continuous variables. The conditional density models can be trained locally if the training data are complete. In this paper we focused on the self-organized extraction of structure. Bayesian networks can also serve as a framework for the modular construction of large systems out of smaller conditional density models. The Bayesian framework provides consistent update rules for the probabilities, i.e. for the communication between modules. Finally, consider input pruning or variable selection in neural networks. Note that our pruning strategy in Figure 1 can be considered a form of variable selection which not only removes variables that are statistically independent of the output variable but also removes variables which are represented well by the remaining variables. This way we obtain more compact models. 
If input values are missing, the indirect influence of the pruned variables on the output will be recovered by the sampling mechanism. \n\nReferences \n\nBuntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research 2: 159-225. \n\nHeckerman, D. (1995). A tutorial on learning Bayesian networks. Microsoft Research, TR MSR-TR-95-06, 1995. \n\nPearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann. \n\n^9 There are, however, several open issues concerning consistency between the conditional models. \n", "award": [], "sourceid": 1098, "authors": [{"given_name": "Reimar", "family_name": "Hofmann", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}