{"title": "Nonlinear Markov Networks for Continuous Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 521, "page_last": 527, "abstract": null, "full_text": "Nonlinear Markov Networks for Continuous \n\nVariables \n\nReimar Hofmann and Volker Tresp* \nSiemens AG, Corporate Technology \nInformation and Communications \n\n81730 Munchen, Germany \n\nAbstract \n\nWe address the problem oflearning structure in nonlinear Markov networks \nwith continuous variables. This can be viewed as non-Gaussian multidi(cid:173)\nmensional density estimation exploiting certain conditional independencies \nin the variables. Markov networks are a graphical way of describing con(cid:173)\nditional independencies well suited to model relationships which do not ex(cid:173)\nhibit a natural causal ordering. We use neural network structures to model \nthe quantitative relationships between variables. The main focus in this pa(cid:173)\nper will be on learning the structure for the purpose of gaining insight into \nthe underlying process. Using two data sets we show that interesting struc(cid:173)\ntures can be found using our approach. Inference will be briefly addressed. \n\n1 \n\nIntroduction \n\nKnowledge about independence or conditional independence between variables is most help(cid:173)\nful in ''understanding'' a domain. An intuitive representation of independencies is achieved by \ngraphical models in which independency statements can be extracted from the structure of the \ngraph. The two most popular types of graphical stochastical models are Bayesian networks \nwhich use a directed graph, and Markov networks which use an undirected graph. Whereas \nBayesian networks are well suited to represent causal relationships, Markov networks are \nmostly used in cases where the user wants to express statistical correlation between variables. 
\nThis is the case in image processing where the variables typically represent the grey levels \nof pixels and the graph encourages smootheness in the values of neighboring pixels (Markov \nrandom fields, Geman and Geman, 1984). We believe that Markov networks might be a useful \nrepresentation in many domains where the concept of cause and effect is somewhat artificial. \nThe learned structure of a Markov network also seems to be more easily communicated to \nnon-experts; in a Bayesian network not all arc directions can be uniquely identified based on \ntraining data alone which makes a meaningful interpretation for the non-expert rather difficult. \n\nAs in Bayesian networks, direct dependencies between variables in Markov networks are rep(cid:173)\nresented by an arc between those variables and missing edges represent independencies (in \nSection 2 we will be more precise about the independencies represented in Markov networks). \nWhereas the graphical structure in Markov networks might be known a priori in some cases, \n\nf{eimar.Hofinann@mchp.siemens.de Volker. Tresp@mchp.siemens.de \n\n\f522 \n\nR. Hofmann and V. Tresp \n\nthe focus of this work is the case that structure is unknown and must be inferred from data. \nFor both discrete variables and linear relationships between continuous variables algorithms \nfor structure learning exist (Whittaker, 1990). Here we address the problem of learning struc(cid:173)\nture for Markov networks of continuous variables where the relationships between variables \nare nonlinear. In particular we use neural networks for approximating the dependency be(cid:173)\ntween a variable and its Markov boundary. We demonstrate that structural learning can be \nachieved without a direct reference to a likelihood function and show how inference in such \nnetworks can be perfonned using Gibbs sampling. 
From a technical point of view, these Markov boundary networks perform multi-dimensional density estimation for a very general class of non-Gaussian densities. \n\nIn the next section we give a mathematical description of Markov networks and a formulation of the joint probability density as a product of compatibility functions. In Section 3.1 we discuss structural learning in Markov networks based on a maximum likelihood approach and show that this approach is in general infeasible. We then introduce our approach, which is based on learning the Markov boundary of each variable. We also show how belief update can be performed using Gibbs sampling. In Section 4 we demonstrate that useful structures can be extracted from two data sets (Boston housing data, financial market) using our approach. \n\n2 Markov Networks \n\nThe following brief introduction to Markov networks is adapted from Pearl (1988). Consider a strictly positive(1) joint probability density p(x) over a set of variables X := {x_1, ..., x_N}. For each variable x_i, let the Markov boundary of x_i, B_i ⊆ X - {x_i}, be the smallest set of variables that renders x_i and X - ({x_i} ∪ B_i) independent under p(x) (the Markov boundary is unique for strictly positive distributions). Let the Markov network G be the undirected graph with nodes x_1, ..., x_N and edges between x_i and x_j if and only if x_i ∈ B_j (which also implies x_j ∈ B_i). In other words, a Markov network is generated by connecting each node to the nodes in its Markov boundary. Then for any set Z ⊆ (X - {x_i, x_j}), x_i is independent of x_j given Z if and only if every path from x_i to x_j goes through at least one node in Z. In other words, two variables are independent if every path between those variables is \"blocked\" by a known variable. In particular, a variable is independent of the remaining variables if the variables in its Markov boundary are known. 
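The blocking criterion just described can be checked mechanically: for strictly positive distributions, x_i and x_j are independent given Z exactly when deleting the nodes in Z disconnects x_i from x_j in the undirected graph. A minimal sketch of this separation test (the three-node chain below is a hypothetical example, not a model from the paper):

```python
from collections import deque

def separated(adj, i, j, Z):
    # True iff every path from i to j passes through a node in Z,
    # i.e. j is unreachable from i once the nodes in Z are deleted.
    blocked = set(Z)
    if i in blocked or j in blocked:
        raise ValueError('query nodes must not be in the separating set')
    seen, queue = {i}, deque([i])
    while queue:
        u = queue.popleft()
        if u == j:
            return False  # found an unblocked path
        for v in adj[u]:
            if v not in seen and v not in blocked:
                seen.add(v)
                queue.append(v)
    return True

# chain a - b - c: a and c are independent given b, dependent otherwise
adj = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}
assert separated(adj, 'a', 'c', {'b'})
assert not separated(adj, 'a', 'c', set())
```

A breadth-first search suffices here because in an undirected Markov network separation is ordinary graph reachability; no d-separation rules as in Bayesian networks are needed.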
\n\nA clique in G is a maximal fully connected sub graph. Given a Markov Network G for p( x) it \ncan be shown that p can be factorized as a product of positive functions on the cliques of G, \ni.e. \n\n(1) \n\nwhere the product is over all cliques in the graph. Xclique, is the projection of X to the \nvariables of the i-th clique and the gi are the compatibility functions w.r.t. cliquej. K = \nJ fli gi(Xclique.)dx is the normalization constant. Note, that a state whose clique functions \nhave large values has high probability. The theorem of Hammersley and Clifford states that \nthe nonnalized product in equation 1 embodies all the conditional independencies portrayed \nby the graph (Pearl, 1988? for any choice of the gi . \n\nIf the graph is sparse, i.e. if many conditional independencies exist then the cliques might \n\n1 To simplify the discussion we will assume strict positivity for the rest of this paper. For some of the \nstatements weaker conditions may also be sufficient. Note that strict positivity implies that functional \nconstraints (for example, a = b) are excluded. \n\n2 In terms of graphical models: The graph G is an I-map of p. \n\n\fNonlinear Markov Networks for Continuous Variables \n\n523 \n\nbe small and the product will be over low dimensional functions. Similar to Bayesian net(cid:173)\nworks where the complexity of describing a joint probability density is greatly reduced by \ndecomposing the joint density in a product of ideally low-dimensional conditional densities, \nequation 1 describes the decomposition of a joint probability density function into a product \nof ideally low-dimensional compatibility functions. It should be noted that Bayesian networks \nand Markov networks differ in which specific independencies they can represent (Pearl, 1988). 
\n\n3 Learning the Markov Network \n\n3.1 Likelihood Function Based Learning \n\nLearning graphical stochastical models is usually decomposed into the problems of learning \nstructure (that is the edges in the graph) and of learning the parameters of the joint density \nfunction under the constraint that it obeys the independence statements made by the graph. \nThe idea is to generate candidate structures according to some search strategy, learn the param(cid:173)\neters for this structure and then judge the structure on the basis of the (penalized) likelihood \nof the model or, in a fully Bayesian approach, using a Bayesian scoring metric. \n\nAssume that the compatibility functions in equation 1 are approximated using a function ap(cid:173)\nproximator such as a neural network gi 0 ~ 9 i (x). Let {x P}:= 1 be a training set. With \nlikelihood L = I1;=1 pM (xP) (where the M in pM indicates a probability density model in \ncontrast to the true distribution), the gradient of the log-likelihood with respect to weight W i \nin gi (.) becomes \n\n~~I M( P)-~~l ~(P \na L-0gp \n\nx -L-a \n\noggl Xclique, \n\np=l Wi \n\nWi p=l \n\n)_NI(i!v;loggi(Xclique,))I1jgj(XcliqueJ)dX \n\nII1 W( \n\nj gj Xclique) \n\n)d \nX \n\n(2) \nwhere the sums are over N training patterns. The gradient decomposes into two terms. Note, \nthat only in the first term the training patterns appear explicitly and that, conveniently, the first \nterm is only dependent on the clique i which contains parameter Wi. The second term emerges \nfrom the normalization constant K in equation I. The difficulty is that the integrals in the \nsecond term can not be solved in closed form for universal types of compatibility functions gi \nand have to be approximated numerically, typically using a form of Monte Carlo integration. 
\nThis is exactly what is done in the Boltzmann machine, which is a special case of a Markov \nnetwork with discrete variables.3 \n\nCurrently, we consider maximum likelihood learning based on the compatibility functions un(cid:173)\nsuitable, considering the complexity and slowness of Monte Carlo integration (Le. stochastic \nsampling). Note, that for structural learning the maximum likelihood learning is in the inner \nloop and would have to be executed repeatedly for a large number of structures. \n\n3.2 Markov Boundary Learning \n\nThe difficulties in using maximum likelihood learning for finding optimal structures motivated \nthe approach pursued in this paper. If the underlying true probability density is known the \nstructure in a Markov network can be found using either the edge deletion method or the \n\n3 A fully connected Boltzmann machine does not display any independencies and we only have one \nclique consisting of all variables. The compatibility function is gO = exp (- L: WijSiSj). The Boltz(cid:173)\nmann machine typically contains hidden variables, such that not only the second tenn (corresponding to \nthe unclamped phase) in equation 2 has to be approximated using stochastic sampling but also the first \ntenn. (In this paper we only consider the case that data are complete). \n\n\f524 \n\nR. Hofmann and V Tresp \n\nMarkov boundary method (Pearl, 1988). The edge deletion method uses the fact that variables \na and b are not connected by an edge if and only if a and b are independent given all other \nvariables. Evaluating this test for each pair of variables reveals the structure of the network. \nThe Markov boundary method consists of determining - for each variable a - its Markov \nboundary and connecting a to each variable in its Markov boundary. Both approaches are \nsimple if we have a reliable test for true conditional independence. 
\n\nBoth methods cannot be applied directly for learning structure from data since here tests \nfor conditional independence cannot be based on the true underlying probability distribution \n(which is unknown) but has to be inferred from a finite data set. The hope is that dependen(cid:173)\ncies which are strong enough to be supported by the data can still be reliably identified. It is, \nhowever not difficult to construct cases where simply using an (unreliable) statistical test for \nconditional independence with the edge deletion method does not work wel1. 4 \n\nWe now describe our approach, which is motivated by the Markov boundary method. First, \nwe start with a fully connected graph. We train a model ptt to approximate the conditional \ndensity of each variable i, given the current candidate variables for its Markov boundary Bi \nwhich initially are all other variables. For this we can use a wide variety of neural networks. \nWe use conditional Parzen windows \n\n(3) \n\nwhere {XP};'=l is the training set and G(x; J-l, 1:) is our notation for a multidimensional Gaus(cid:173)\nsian centered at J-l with covariance matrix 1: evaluated at x. The Gaussians in the nominator are \ncentered at X~i}U8: which is the location of the p-th sample in the jointinput!output( {x;} UBi) \nspace and the Gaussians in the denominator are centered at x~: which is the location of the \np-th sample in the input space (Bi). There is one covariance matrix 1:i for each conditional \ndensity model which is shared between all the Gaussians in that model. 1:i is restricted to a \ndiagonal matrix where the diagonal elements in all dimensions except the output dimension i, \nare the same. So there are only two free parameters in the matrix: The variance in the output \ndimension and the variance in all input dimensions. Ei 8' is equal to 1:i except that the row \nand column corresponding to the output dimension ha~e been deleted. 
For each conditional model p^M_i, Σ_i was optimized on the basis of the leave-one-out cross validation log-likelihood. \n\n(4) The problem is that in the edge deletion method the decision is made independently for each edge whether or not it should be present. There are, however, cases where it is obvious that at least one of two edges must be present, although the edge deletion method, which tests each edge individually, removes both. \n\nOur approach is based on tentatively removing edges from the model. Removing an edge decreases the size of the Markov boundary candidates of both affected variables and thus decreases the number of inputs of the corresponding two conditional density models. With the inputs removed, we retrain the two models (in our case, we simply find the optimal Σ_i for the two conditional Parzen windows). If the removal of the edge was correct, the leave-one-out cross validation log-likelihood (model-score) of the two models should improve, since an unnecessary input is removed. (Removing an unnecessary input typically decreases model variance.) We therefore remove an edge if the model-scores of both models improve. Let's define as edge-removal-score the smaller of the two improvements in model-score. \n\nHere is the algorithm in pseudo code: \n\n\u2022 Start with a fully connected network \n\n\u2022 Until no edge-removal-score is positive: \n- for all edges edge_ij in the network: \n* calculate the model-scores of the reduced models p^M_i(x_i | B_i - {j}) and p^M_j(x_j | B_j - {i}) \n* compare with the model-scores of the current models p^M_i(x_i | B_i) and p^M_j(x_j | B_j) \n* set the edge-removal-score to the smaller of both model-score improvements \n- remove the edge for which the edge-removal-score is maximal. 
\n\n\u2022 end \n\n3.3 \n\nInference \n\nNote that we have learned the structure of the Markov network without an explicit representa(cid:173)\ntion of the probability density. Although the conditional densities p(.r i IBi) provide sufficient \ninformation to calculate the joint probability density the latter can not be easily computed. \nMore precisely, the conditional densities overdetermine the joint density which might lead \nto problems if the conditional densities are estimated from data. For inference, we are typi(cid:173)\ncally interested in the expected value of an unknown variable, given an arbitrary set of known \nvariables, which can be calculated using Gibbs sampling. Note, that the conditional densi(cid:173)\nties pM (Xi IBi) which are required for Gibbs sampling are explicitly modeled in our approach \nby the conditional Parzen windows. Also note, that sampling from the conditional Parzen \nmodel (as well as many other neural networks, such as mixture of experts models) is easy.5 \nIn Hofmann (1997) we show that Gibbs sampling from the conditional Parzen models gives \nsignificantly better results than running inference using either a kernel estimator or a Gaussian \nmixture model of the joint density. \n\n4 Experiments \n\nIn our first experiment we used the Boston housing data set, which contains 506 samples. \nEach sample consists of the housing price and 13 other variables which supposedly influence \nthe housing price in a Boston neighborhood. Maximizing the cross validation log-likelihood \nas score as described in the previous chapters results in a Markov network with 68 edges. \n\nWhile cross validation gives an unbiased estimate of whether a direct dependency exists be(cid:173)\ntween two variables the estimate can have a large variance depending on the size of the given \ndata set. 
If the goal of the experiment is to interpret the resulting structure, one would prefer to see only those edges corresponding to direct dependencies which can be clearly identified from the given data set. In other words, if the relationship between two variables observed on the given data set is so weak that we cannot be sure that it is not just an effect of the finite data set size, then we do not want to display the corresponding edge. This can be achieved by adding a penalty per edge to the score of the conditional density models (figure 1). \n\nFigure 2 shows the resulting Markov network for a penalty per edge of 0.2. The goal of the original experiment for which the Boston housing data were collected was to examine whether the air quality (5) has a direct influence on the housing price (14). Our algorithm did not find such an influence - in accordance with the original study. It found that the percentage of low status population (13) and the average number of rooms (6) are in direct relationship with the housing price. The pairwise relationships between these three variables are displayed in figure 3. \n\n(5) Readers not familiar with Gibbs sampling, please consult Geman and Geman (1984). \n\nFigure 1: Number of edges in the Markov network for the Boston housing data as a function of the penalty per edge. \n\nVariables in the Boston housing data set: 1 crime rate; 2 percent land zoned for lots; 3 percent non-retail business; 4 located on Charles river?; 5 nitrogen oxide concentration; 6 average number of rooms; 7 percent built before 1940; 8 weighted distance to employment centers; 9 access to radial highways; 10 tax rate; 11 pupil/teacher ratio; 12 percent black; 13 percent lower-status population; 14 median value of homes \n\nFigure 2: Final structure of a run on the full Boston housing data set (penalty = 0.2). \n\nThe scatter plots visualize the relationships between variables 13 and 14, 6 and 14, and between 6 and 13 (from left to right). The left and the middle plots correspond to edges in the Markov network, whereas for the right diagram the corresponding edge (6-13) is missing even though both variables are clearly dependent. The reason is that the dependency between 6 and 13 can be explained as an indirect relationship via variable 14. The Markov network tells us that 13 and 6 are independent given 14, but dependent if 14 is unknown. \n\nIn a second experiment we used a financial data set. Each pattern corresponds to one business day. The variables in our model are relative changes in certain economic variables from the last business day to the present day which were expected to possibly influence the development of the German stock index DAX and the composite DAX, which contains a larger selection of stocks than the DAX. We used 500 training patterns consisting of 12 variables (figure 4). In comparison to the Boston housing data set, most relationships are very weak. Using a penalty per edge of 0.2 leads to a very sparse model with only three edges (2-12, 12-1, 5-11) (not shown). A penalty of 0.025 results in the model shown in figure 4. \n\nFigure 3: Pairwise relationships between the variables 6, 13 and 14 (axes: percent lower-status population, average number of rooms). Displayed are all data points in the Boston housing data set. 
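The conditional Parzen windows of equation 3, which serve as the conditional density models in both experiments, can be sketched compactly. The example below is a one-input, one-output illustration on hypothetical data with hand-picked variances; the leave-one-out bandwidth optimization of Section 3.2 is omitted:

```python
import math
import random

def gauss(x, mu, var):
    # one-dimensional Gaussian density
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def conditional_parzen(y, x, data, var_in, var_out):
    # p(y | x) as a ratio of Parzen estimates: a sum of joint (input, output)
    # kernels over a sum of input-only kernels, one Gaussian per training sample.
    # The diagonal covariance makes each joint kernel a product of 1-D Gaussians.
    num = sum(gauss(x, xp, var_in) * gauss(y, yp, var_out) for xp, yp in data)
    den = sum(gauss(x, xp, var_in) for xp, yp in data)
    return num / den

random.seed(1)
# hypothetical training set: y roughly equal to x plus noise
data = [(x, x + random.gauss(0, 0.3)) for x in
        [random.uniform(-2, 2) for _ in range(400)]]

# the conditional density should peak near y = x
p_near = conditional_parzen(1.0, 1.0, data, 0.1, 0.1)
p_far = conditional_parzen(-1.0, 1.0, data, 0.1, 0.1)
assert p_near > p_far
```

Sampling from this model, as needed for Gibbs sampling in Section 3.3, is equally simple: pick a training sample with probability proportional to its input kernel weight, then draw from the corresponding output Gaussian.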
\n\n\fNonlinear Markov Networks/or Continuous Variables \n\n527 \n\nDAX \ncomposite DAX \n3 month interest rates Gennany \nrerum Gennany \nMorgan Stanley tndex Germany \nI)(JW' Jones mdustrial index \nDM-USD exchange rate \nUS treasury bonds \ngold price in DM \nN.kkei index Japan \nMorgan Stanley index Europe \nprice earning ratio (DAX stocks) \n\n4 \n5 \n6 \n7 \n8 \n9 \n10 \nII \n12 \n\nFigure 4: Final structure of a run on the financial data set with a penalty of 0.025. The small \nnumbers next to the edges indicate the strength of the connection, i.e. the decrease in score \n(excluding the penalty) when the edge is removed. All variables are relative changes - not \nabsolute values. \n\nDAX is connected to the DAX mainly through the price earning ratio. While the DAX has \ndirect connections to the Nikkei index and to the DM-USD exchange rate the composite DAX \nhas a direct connection to the Morgan Stanley index for Germany. Recall, that composite \nDAX contains the stocks of many smaller companies in addition to the DAX stocks. The \ngraph structure might be interpreted (with all caution) in the way that the composite DAX \n(including small companies) has a stronger dependency on national business whereas the DAX \n(only including the stock of major companies) reacts more to international indicators. \n\n5 Conclusions \n\nWe have demonstrated, to our knowledge for the first time, how nonlinear Markov networks \ncan be learned for continuous variables and we have shown that the resulting structures can \ngive interesting insights into the underlying process. We used a representation based on mod(cid:173)\nels of the conditional probability density of each variable given its Markov boundary. These \nmodels can be trained locally. We showed how searching in the space of all possible structures \ncan be done using this representation. \n\nWe suggest to use the conditional densities of each variable given its Markov boundary also for \ninference by Gibbs sampling. 
Since the required conditional densities are modeled explicitly by our approach and sampling from them is easy, Gibbs sampling is easier and faster to realize than with a direct representation of the joint density. \n\nA topic of further research is the variance in the resulting structures, i.e. the fact that different structures can lead to almost equally good models. It would, for example, be desirable to indicate to the user in a principled way the certainty of the existence or nonexistence of edges. \n\nReferences \n\nGeman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721-741. \n\nHofmann, R. (1997). Inference in Markov blanket models. Technical report, in preparation. \n\nMonti, S., and Cooper, G. (1997). Learning Bayesian belief networks with neural network estimators. In Advances in Neural Information Processing Systems 9, MIT Press. \n\nPearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo: Morgan Kaufmann. \n\nWhittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Chichester, UK: John Wiley and Sons. \n", "award": [], "sourceid": 1341, "authors": [{"given_name": "Reimar", "family_name": "Hofmann", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}