{"title": "Gaussian Fields for Approximate Inference in Layered Sigmoid Belief Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 393, "page_last": 399, "abstract": null, "full_text": "Correctness of belief propagation in Gaussian \n\ngraphical models of arbitrary topology \n\nYair Weiss \n\nComputer Science Division \nUC Berkeley, 485 Soda Hall \nBerkeley, CA 94720-1776 \n\nPhone: 510-642-5029 \n\nyweiss@cs.berkeley.edu \n\nWilliam T. Freeman \n\nMitsubishi Electric Research Lab \n\n201 Broadway \n\nCambridge, MA 02139 \nPhone: 617-621-7527 \nfreeman @merl.com \n\nAbstract \n\nLocal \"belief propagation\" rules of the sort proposed by Pearl [15] are \nguaranteed to converge to the correct posterior probabilities in singly \nconnected graphical models. Recently, a number of researchers have em(cid:173)\npirically demonstrated good performance of \"loopy belief propagation\"(cid:173)\nusing these same rules on graphs with loops. Perhaps the most dramatic \ninstance is the near Shannon-limit performance of \"Turbo codes\", whose \ndecoding algorithm is equivalent to loopy belief propagation. \nExcept for the case of graphs with a single loop, there has been little theo(cid:173)\nretical understanding of the performance of loopy propagation. Here we \nanalyze belief propagation in networks with arbitrary topologies when \nthe nodes in the graph describe jointly Gaussian random variables. We \ngive an analytical formula relating the true posterior probabilities with \nthose calculated using loopy propagation. We give sufficient conditions \nfor convergence and show that when belief propagation converges it gives \nthe correct posterior means for all graph topologies, not just networks \nwith a single loop. \nThe related \"max-product\" belief propagation algorithm finds the max(cid:173)\nimum posterior probability estimate for singly connected networks. 
We show that, even for non-Gaussian probability distributions, the convergence points of the max-product algorithm in loopy networks are maxima over a particular large local neighborhood of the posterior probability. These results help clarify the empirical performance results and motivate using the powerful belief propagation algorithm in a broader class of networks.

Problems involving probabilistic belief propagation arise in a wide variety of applications, including error correcting codes, speech recognition and medical diagnosis. If the graph is singly connected, there exist local message-passing schemes to calculate the posterior probability of an unobserved variable given the observed variables. Pearl [15] derived such a scheme for singly connected Bayesian networks and showed that this "belief propagation" algorithm is guaranteed to converge to the correct posterior probabilities (or "beliefs").

Several groups have recently reported excellent experimental results by running algorithms equivalent to Pearl's algorithm on networks with loops [8, 13, 6]. Perhaps the most dramatic instance of this performance is for "Turbo code" [2] error correcting codes. These codes have been described as "the most exciting and potentially important development in coding theory in many years" [12] and have recently been shown [10, 11] to utilize an algorithm equivalent to belief propagation in a network with loops.

Progress in the analysis of loopy belief propagation has been made for the case of networks with a single loop [17, 18, 4, 1]. For these networks, it can be shown that (1) unless all the compatibilities are deterministic, loopy belief propagation will converge.
(2) The difference between the loopy beliefs and the true beliefs is related to the convergence rate of the messages: the faster the convergence, the more exact the approximation. (3) If the hidden nodes are binary, then the loopy beliefs and the true beliefs are both maximized by the same assignments, although the confidence in that assignment is wrong for the loopy beliefs.

In this paper we analyze belief propagation in graphs of arbitrary topology, for nodes describing jointly Gaussian random variables. We give an exact formula relating the correct marginal posterior probabilities with the ones calculated using loopy belief propagation. We show that if belief propagation converges, then it will give the correct posterior means for all graph topologies, not just networks with a single loop. We show that the covariance estimates will generally be incorrect but present a relationship between the error in the covariance estimates and the convergence speed. For Gaussian or non-Gaussian variables, we show that the "max-product" algorithm, which calculates the MAP estimate in singly connected networks, only converges to points that are maxima over a particular large neighborhood of the posterior probability of loopy networks.

1 Analysis

To simplify the notation, we assume the graphical model has been preprocessed into an undirected graphical model with pairwise potentials. Any graphical model can be converted into this form, and running belief propagation on the pairwise graph is equivalent to running belief propagation on the original graph [18]. We assume each node x_i has a local observation y_i. In each iteration of belief propagation, each node x_i sends a message to each neighboring x_j that is based on the messages it received from the other neighbors, its local observation y_i, and the pairwise potentials \Psi_{ij}(x_i, x_j) and \Psi_{ii}(x_i, y_i).
We assume the message-passing occurs in parallel.

The idea behind the analysis is to build an unwrapped tree. The unwrapped tree is the graphical model which belief propagation is solving exactly when one applies the belief propagation rules in a loopy network [9, 20, 18]. It is constructed by maintaining the same local neighborhood structure as the loopy network, but nodes are replicated so there are no loops. The potentials and the observations are replicated from the loopy graph. Figure 1(a) shows an unwrapped tree for the diamond shaped graph in (b). By construction, the belief at the root node \tilde{x}_1 is identical to that at node x_1 in the loopy graph after four iterations of belief propagation. Each node has a shaded observed node attached to it, omitted here for clarity.

Figure 1: Left: A Markov network with multiple loops. Right: The unwrapped network corresponding to this structure.

Because the original network represents jointly Gaussian variables, so will the unwrapped tree. Since it is a tree, belief propagation is guaranteed to give the correct answer for the unwrapped graph. We can thus use Gaussian marginalization formulae to calculate the true means and variances in both the original and the unwrapped networks. In this way, we calculate the accuracy of belief propagation for Gaussian networks of arbitrary topology.

We assume that the joint mean is zero (the means can be added in later). The joint distribution of z = (x; y) is given by P(z) = \alpha e^{-\frac{1}{2} z^T V z}, where

V = \begin{pmatrix} V_{xx} & V_{xy} \\ V_{yx} & V_{yy} \end{pmatrix}.

It is straightforward to construct the inverse covariance matrix V of the joint Gaussian that describes a given Gaussian graphical model [3].

Writing out the exponent of the joint and completing the square shows that the mean \mu of x, given the observations y, is given by:

\mu = -V_{xx}^{-1} V_{xy} y    (1)

and the covariance matrix C_{x|y} of x given y is: C_{x|y} = V_{xx}^{-1}.
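These conditioning formulas can be checked numerically. The following is a minimal sketch (the matrix values are illustrative, not from the paper): it computes the conditional mean and covariance from the information form of equation 1, and cross-checks them against the equivalent covariance-form expressions.

```python
import numpy as np

# Joint zero-mean Gaussian over z = (x; y) in information form,
# P(z) ∝ exp(-0.5 z^T V z); an illustrative 2+2-dimensional example.
Vxx = np.array([[2.0, 0.6], [0.6, 2.0]])
Vxy = np.array([[0.3, 0.0], [0.0, 0.3]])
Vyy = np.array([[1.5, 0.0], [0.0, 1.5]])
V = np.block([[Vxx, Vxy], [Vxy.T, Vyy]])

y = np.array([1.0, -2.0])                 # observed values

# Equation (1): mean of x given y, and posterior covariance V_xx^{-1}.
mu = -np.linalg.solve(Vxx, Vxy @ y)
C_x_given_y = np.linalg.inv(Vxx)

# Cross-check in covariance form: with C = V^{-1},
# E[x|y] = C_xy C_yy^{-1} y.
C = np.linalg.inv(V)
mu_check = C[:2, 2:] @ np.linalg.solve(C[2:, 2:], y)
```

Both routes give the same conditional mean, and C_x_given_y equals the Schur complement of the covariance matrix, which is the standard identity underlying equation 1.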
We will denote by C_{x_i|y} the ith row of C_{x|y}, so the marginal posterior variance of x_i given the data is \sigma^2(i) = C_{x_i|y}(i).

We will use a tilde for unwrapped quantities. We scan the tree in breadth first order and denote by \tilde{x} the vector of values in the hidden nodes of the tree when so scanned. Similarly, we denote by \tilde{y} the observed nodes scanned in the same order, and \tilde{V}_{xx}, \tilde{V}_{xy} the inverse covariance matrices. Since we are scanning in breadth first order, the last nodes are the leaf nodes, and we denote by L the number of leaf nodes. By the nature of unwrapping, \tilde{\mu}(1) is the mean of the belief at node x_1 after t iterations of belief propagation, where t is the number of unwrappings. Similarly \tilde{\sigma}^2(1) = \tilde{C}_{\tilde{x}_1|\tilde{y}}(1) is the variance of the belief at node x_1 after t iterations.

Because the data is replicated, we can write \tilde{y} = O y, where O(i,j) = 1 if \tilde{y}_i is a replica of y_j and 0 otherwise. Since the potentials \Psi(x_i, y_i) are replicated, we can write \tilde{V}_{xy} O = O V_{xy}. Since the \Psi(x_i, x_j) are also replicated and all non-leaf \tilde{x}_i have the same connectivity as the corresponding x_i, we can write \tilde{V}_{xx} O = O V_{xx} + E, where E is zero in all but the last L rows. When these relationships between the loopy and unwrapped inverse covariance matrices are substituted into the loopy and unwrapped versions of equation 1, one obtains the following expression, true for any iteration [19]:

\tilde{\mu}(1) = \mu(1) + \tilde{C}_{\tilde{x}_1|\tilde{y}} e    (2)

where e is a vector that is zero everywhere but the last L components (corresponding to the leaf nodes). Our choice of the node for the root of the tree is arbitrary, so this applies to all nodes of the loopy network. This formula relates, for any node of a network with loops, the means calculated at each iteration by belief propagation with the true posterior means.
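For the simplest loopy graph, a single cycle, the unwrapped tree rooted at a node is a chain of replicas, and the relation above can be checked numerically. The following minimal sketch (the 3-node model and its parameter values are illustrative assumptions, not from the paper) solves the unwrapped chain exactly: the root mean approaches the true posterior mean, while the root variance settles on a different value.

```python
import numpy as np

# Loopy model: 3-node cycle with p(x|y) ∝ exp(-0.5 x^T A x + b^T x),
# A_ii = 2 and A_ij = 0.5 on cycle edges (illustrative values).
A = np.array([[2.0, 0.5, 0.5], [0.5, 2.0, 0.5], [0.5, 0.5, 2.0]])
b = np.array([1.0, 0.0, -1.0])
mu_true = np.linalg.solve(A, b)
var_true = np.diag(np.linalg.inv(A))

# Unwrapped tree rooted at x_1: for a single cycle this is a chain of
# replicas ..., x_3, x_2, x_1, x_2, x_3, x_1, ... with the diagonal and
# coupling terms replicated from the loopy graph.
d = 20                                   # unwrapping depth on each side
n = 2 * d + 1
At = np.diag(np.full(n, 2.0))
for k in range(n - 1):
    At[k, k + 1] = At[k + 1, k] = 0.5
# Position d is the root replica of x_1; walking the cycle 1 -> 2 -> 3 -> 1
# makes the local potential b repeat with period 3 along the chain.
bt = np.array([[1.0, 0.0, -1.0][(k - d) % 3] for k in range(n)])

mu_root = np.linalg.solve(At, bt)[d]      # belief mean at the root
var_root = np.linalg.inv(At)[d, d]        # belief variance at the root
```

Here mu_root matches mu_true[0] to high precision (the means are exact), while var_root converges to a value that differs from var_true[0], illustrating the variance error discussed next.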
\n\nSimilarly when the relationship between the loopy and unwrapped inverse covariance ma(cid:173)\ntrices is substituted into the loopy and unwrapped definitions of C~IY we can relate the \n\n(2) \n\n\f676 \n\nY Weiss and W T Freeman \n\n0.5 \n\n0.4 \n\n~ 0.3 \n~ \n.~ 0.2 \nn; \n~ 0.1 \n8 \n\"t:> \u00a7 \n\n0 \n\n-0.1 \n\n-0.2 \n\n0 \n\n20 \n\n40 \n\nnode \n\n60 \n\n80 \n\n100 \n\nFigure 2: The conditional correlation between the root node and all other nodes in the \nunwrapped tree of Fig. 1 after eight iterations. Potentials were chosen randomly. Nodes \nare presented in breadth first order so the last elements are the correlations between the root \nnode and the leaf nodes. We show that if this correlation goes to zero, belief propagation \nconverges and the loopy means are exact. Symbols plotted with a star denote correlations \nwith nodes that correspond to the node Xl in the loopy graph. The sum of these correlations \ngives the correct variance of node Xl while loopy propagation uses only the first correlation. \n\nmarginalized covariances calculated by belief propagation to the true ones [19]: \n\n-2 \n\na (1) = a (1) + CZllyel - Czt/ye2 \n\n(3) \nwhere el is a vector that is zero everywhere but the last L components while e2 is equal \nto 1 for all nodes in the unwrapped tree that are replicas of Xl except for Xl. All other \ncomponents of e2 are zero, \n\n2 \n\n-\n\n-\n\nFigure 2 shows Cz1lY for the diamond network in Fig. 1. We generated random potential \nfunctions and observations and calculated the conditional correlations in the unwrapped \ntree. Note that the conditional correlation decreases with distance in the tree - we are \nscanning in breadth first order so the last L components correspond to the leaf nodes. \nAs the number of iterations of loopy propagation is increased the size of the unwrapped \ntree increases and the conditional correlation between the leaf nodes and the root node \ndecreases. 
\n\nFrom equations 2-3 it is clear that if the conditional correlation between the leaf nodes and \nthe root nodes are zero for all sufficiently large unwrappings then (1) belief propagation \nconverges (2) the means are exact and (3) the variances may be incorrect. In practice the \nconditional correlations will not actually be equal to zero for any finite unwrapping. In [19] \nwe give a more precise statement: if the conditional correlation of the root node and the \nleaf nodes decreases rapidly enough then (1) belief propagation converges (2) the means \nare exact and (3) the variances may be incorrect. We also show sufficient conditions on the \npotentials III (Xi, X j) for the correlation to decrease rapidly enough: the rate at which the \ncorrelation decreases is determined by the ratio of off-diagonal and diagonal components \nin the quadratic fonn defining the potentials [19]. \n\nHow wrong will the variances be? The tenn CZllye2 in equation 3 is simply the sum of \nmany components of Cz11y. Figure 2 shows these components. The correct variance is \nthe sum of all the components witHe the belief propagation variance approximates this sum \nwith the first (and dominant) tenn. Whenever there is a positive correlation between the \nroot node and other replicas of Xl the loopy variance is strictly less than the true variance \n-\n\nthe loopy estimate is overconfident. \n\n\fCorrectness of Belief Propagation \n\n677 \n\n~07 \ne \niDO.6 \n\" ., \n;;;05 \nfr \n~04 \n'\" ~03 \n0.2 \n\n0.1 \n\nSOR \n\n(a) \n\n40 \n\n50 \n\n60 \n\n20 \n\n30 \n\niterations \n(b) \n\nFigure 3: (a) 25 x 25 graphical model for simulation. The unobserved nodes (unfilled) were \nconnected to their four nearest neighbors and to an observation node (filled). (b) The error \nof the estimates of loopy propagation and successive over-relaxation (SOR) as a function \nof iteration. Note that belief propagation converges much faster than SOR. 
\n\nNote that when the conditional correlation decreases rapidly to zero two things happen. \nFirst, the convergence is faster (because CZdyel approaches zero faster) . Second, the ap(cid:173)\nproximation error of the variances is smaller (because CZ1 /y e2 is smaller). Thus we have \nshown, as in the single loop case, quick convergence is correlated with good approximation. \n\n2 Simulations \n\nWe ran belief propagation on the 25 x 25 2D grid of Fig. 3 a. The joint probability was: \n\n(4) \n\nwhere Wij = 0 if nodes Xi, Xj are not neighbors and 0.01 otherwise and Wii was randomly \nselected to be 0 or 1 for all i with probability of 1 set to 0.2. The observations Yi were \nchosen randomly. This problem corresponds to an approximation problem from sparse \ndata where only 20% of the points are visible. \n\nWe found the exact posterior by solving equation 1. We also ran belief propagation and \nfound that when it converged, the calculated means were identical to the true means up \nto machine precision. Also, as predicted by the theory, the calculated variances were too \nsmall -\n\nthe belief propagation estimate was overconfident. \n\nIn many applications, the solution of equation 1 by matrix inversion is intractable and iter(cid:173)\native methods are used. Figure 3 compares the error in the means as a function of iterations \nfor loopy propagation and successive-over-relaxation (SOR), considered one of the best \nrelaxation methods [16]. Note that after essentially five iterations loopy propagation gives \nthe right answer while SOR requires many more. As expected by the fast convergence, the \napproximation error in the variances was quite small. The median error was 0.018. For \ncomparison the true variances ranged from 0.01 to 0.94 with a mean of 0.322. Also, the \nnodes for which the approximation error was worse were indeed the nodes that converged \nslower. \n\n\f678 \n\n3 Discussion \n\nY. 
Independently, two other groups have recently analyzed special cases of Gaussian graphical models. Frey [7] analyzed the graphical model corresponding to factor analysis and gave conditions for the existence of a stable fixed-point. Rusmevichientong and Van Roy [14] analyzed a graphical model with the topology of turbo decoding but a Gaussian joint density. For this specific graph they gave sufficient conditions for convergence and showed that the means are exact.

Our main interest in the Gaussian case is to understand the performance of belief propagation in general networks with multiple loops. We are struck by the similarity of our results for Gaussians in arbitrary networks and the results for single loops of arbitrary distributions [18]. First, in single loop networks with binary nodes, the loopy belief at a node and the true belief at a node are maximized by the same assignment, while the confidence in that assignment is incorrect. In Gaussian networks with multiple loops, the mean at each node is correct but the confidence around that mean may be incorrect. Second, for both single-loop and Gaussian networks, fast belief propagation convergence correlates with accurate beliefs. Third, in both Gaussian and discrete valued single loop networks, the statistical dependence between root and leaf nodes governs the convergence rate and accuracy.

The two models are quite different. Mean field approximations are exact for Gaussian MRFs, while they work poorly in sparsely connected discrete networks with a single loop. The results for the Gaussian and single-loop cases lead us to believe that similar results may hold for a larger class of networks.

Can our analysis be extended to non-Gaussian distributions?
The basic idea applies to arbitrary graphs and arbitrary potentials: belief propagation is performing exact inference on a tree that has the same local neighborhood structure as the loopy graph. However, the linear algebra that we used to calculate exact expressions for the error in belief propagation at any iteration holds only for Gaussian variables.

We have used a similar approach to analyze the related "max-product" belief propagation algorithm on arbitrary graphs with arbitrary distributions [5] (both discrete and continuous valued nodes). We show that if the max-product algorithm converges, the max-product assignment has greater posterior probability than any assignment in a particular large region around that assignment. While this is a weaker condition than a global maximum, it is much stronger than a simple local maximum of the posterior probability.

The sum-product and max-product belief propagation algorithms are fast and parallelizable. Due to the well known hardness of probabilistic inference in graphical models, belief propagation will obviously not work for arbitrary networks and distributions. Nevertheless, a growing body of empirical evidence shows its success in many networks with loops. Our results justify applying belief propagation in certain networks with multiple loops. This may enable fast, approximate probabilistic inference in a range of new applications.

References

[1] S.M. Aji, G.B. Horn, and R.J. McEliece. On the convergence of iterative decoding on graphs with a single cycle. In Proc. 1998 ISIT, 1998.

[2] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo codes. In Proc. IEEE International Communications Conference '93, 1993.

[3] R. Cowell. Advanced inference in Bayesian networks. In M.I. Jordan, editor, Learning in Graphical Models. MIT Press, 1998.

[4] G.D. Forney, F.R. Kschischang, and B. Marcus.
\n\nIterative decoding of tail-biting trellisses. \n\npreprint presented at 1998 Information Theory Workshop in San Diego, 1998. \n\n\fCorrectness of Belief Propagation \n\n679 \n\n[5] W. T. Freeman and Y. Weiss. On the fixed points of the max-product algorithm. Technical \n\nReport 99-39, MERL, 201 Broadway, Cambridge, MA 02139, 1999. \n\n[6] W.T. Freeman and E.C. Pasztor. Learning to estimate scenes from images. In M.S. Kearns, \nS.A. SoUa, and D.A. Cohn, editors, Adv. Neural Information Processing Systems I I. MIT Press, \n1999. \n\n[7] B.J. Frey. Turbo factor analysis. In Adv. Neural Information Processing Systems 12. 2000. to \n\nappear. \n\n[8) Brendan J. Frey. Bayesian Networksfor Pattern Classification, Data Compression and Channel \n\nCoding. MIT Press, 1998. \n\n[9) R.G. Gallager. Low Density Parity Check Codes. MIT Press, 1963. \n[10) F. R. Kschischang and B. J. Frey. Iterative decoding of compound codes by probability propaga(cid:173)\n\ntion in graphical models. IEEE Journal on Selected Areas in Communication , 16(2):219-230, \n1998. \n\n[11] R.J. McEliece, D.J .C. MackKay, and J.F. Cheng. Turbo decoding as as an instance of Pearl's \n'belief propagation' algorithm. IEEE Journal on Selected Areas in Communication, 16(2): 140-\n152,1998. \n\n[12J R.J. McEliece, E. Rodemich, and J.F. Cheng. The Turbo decision algorithm. In Proc. 33rd \nAllerton Conference on Communications, Control and Computing, pages 366-379, Monticello, \nIL, 1995. \n\n[I3J K.P. Murphy, Y. Weiss, and M.1. Jordan. Loopy belief propagation for approximate inference: \n\nan empirical study. In Proceedings of Uncertainty in AI, 1999. \n\n[14] Rusmevichientong P. and Van Roy B. An analysis of Turbo decoding with Gaussian densities. \n\nIn Adv. Neural Information Processing Systems I2 . 2000. to appear. \n\n[15) Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. \n\nMorgan Kaufmann, 1988. \n\n[16J Gilbert Strang. 
Introduction to Applied Mathematics. Wellesley-Cambridge, 1986.

[17] Y. Weiss. Belief propagation and revision in networks with loops. Technical Report 1616, MIT AI Lab, 1997.

[18] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, to appear, 2000.

[19] Y. Weiss and W.T. Freeman. Loopy propagation gives the correct posterior means for Gaussians. Technical Report UCB.CSD-99-1046, Berkeley Computer Science Dept., 1999. www.cs.berkeley.edu/~yweiss/.

[20] N. Wiberg. Codes and decoding on general graphs. PhD thesis, Department of Electrical Engineering, U. Linkoping, Sweden, 1996.

Gaussian Fields for Approximate Inference in Layered Sigmoid Belief Networks

David Barber*
Stichting Neurale Netwerken
Medical Physics and Biophysics
Nijmegen University, The Netherlands
barberd@aston.ac.uk

Peter Sollich
Department of Mathematics
King's College, University of London
London WC2R 2LS, U.K.
peter.sollich@kcl.ac.uk

Abstract

Layered Sigmoid Belief Networks are directed graphical models in which the local conditional probabilities are parameterised by weighted sums of parental states. Learning and inference in such networks are generally intractable, and approximations need to be considered. Progress in learning these networks has been made by using variational procedures. We demonstrate, however, that variational procedures can be inappropriate for the equally important issue of inference, that is, calculating marginals of the network. We introduce an alternative procedure, based on assuming that the weighted input to a node is approximately Gaussian distributed. Our approach goes beyond previous Gaussian field assumptions in that we take into account correlations between parents of nodes.
\nThis procedure is specialized for calculating marginals and is sig(cid:173)\nnificantly faster and simpler than the variational procedure. \n\n1 \n\nIntroduction \n\nLayered Sigmoid Belief Networks [1] are directed graphical models [2] in which \nthe local conditional probabilities are parameterised by weighted sums of parental \nstates, see fig ( 1). This is a graphical representation of a distribution over a set of \nbinary variables Si E {a, I}. Typically, one supposes that the states of the nodes \nat the bottom of the network are generated by states in previous layers. Whilst, in \nprinciple, there is no restriction on the number of nodes in any layer, typically, one \nconsiders structures similar to the \"fan out\" in fig(l) in which higher level layers \nprovide an \"explanation\" for patterns generated in lower layers. Such graphical \nmodels are attractive since they correspond to layers of information processors, of \npotentially increasing complexity. Unfortunately, learning and inference in such net(cid:173)\nworks is generally intractable, and approximations need to be considered. Progress \nin learning has been made by using variational procedures [3,4, 5]. However, an(cid:173)\nother crucial aspect remains inference [2]. That is, given some evidence ( or none), \ncalculate the marginal of a variable, conditional on this evidence. This assumes \nthat we have found a suitable network from some learning procedure, and now wish \n\n\u00b7Present Address: NCRG, Aston University, Birmingham B4 7ET, U.K. \n\n\f394 \n\nD. Barber and P. Sollich \n\nto query this network. Whilst the variational procedure is attractive for learning, \nsince it generally provides a bound on the likelihood of the visible units, we demon(cid:173)\nstrate that it may not always be equally appropriate for the inference problem. \n\nA directed graphical model defines a distribution over \na set of variables s = (S1 ... 
s_n) that factorises into the local conditional distributions,

p(s_1 ... s_n) = \prod_{i=1}^{n} p(s_i | \pi_i)    (1)

where \pi_i denotes the parent nodes of node i. In a layered network, these are the nodes in the preceding layer that feed into node i. In a sigmoid belief network the local probabilities are defined as

p(s_i = 1 | \pi_i) = \sigma\left( \sum_j w_{ij} s_j + \theta_i \right) = \sigma(h_i)    (2)

where the "field" at node i is defined as h_i = \sum_j w_{ij} s_j + \theta_i and \sigma(h) = 1/(1 + e^{-h}). w_{ij} is the strength of the connection between node i and its parent node j; if j is not a parent of i we set w_{ij} = 0. \theta_i is a bias term that gives a parent-independent bias to the state of node i.

Figure 1: A Layered Sigmoid Belief Network.

We are interested in inference, in particular calculating marginals of the network for cases with and without evidential nodes. In section (2) we describe how to approximate the quantities p(s_i = 1) and discuss in section (2.1) why our method can improve on the standard variational mean field theory. Conditional marginals, such as p(s_i = 1 | s_j = 1, s_k = 0), are considered in section (3).

2 Gaussian Field Distributions

Under the 0/1 coding for the variables s_i, the mean of a variable, m_i, is given by the probability that it is in state 1. Using the fact from (2) that the local conditional distribution of node i depends on its parents only through its field h_i, we have

m_i = \langle \sigma(h_i) \rangle_{p(h_i)}    (3)

where we use the notation \langle \cdot \rangle_p to denote an average with respect to the distribution p. If there are many parents of node i, a reasonable assumption is that the distribution of the field h_i will be Gaussian, p(h_i) \approx N(\mu_i, \sigma_i^2). Under this Gaussian Field (GF) assumption, we need to work out the mean and variance, which are given by

\mu_i = \langle h_i \rangle = \sum_j w_{ij} m_j + \theta_i    (4)

\sigma_i^2 = \langle (\Delta h_i)^2 \rangle = \sum_{j,k} w_{ij} w_{ik} R_{jk}    (5)

where R_{jk} = \langle \Delta s_j \Delta s_k \rangle. We use the notation \Delta(\cdot) \equiv (\cdot) - \langle \cdot \rangle.
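As a concrete sketch of equations (3)-(5), the following computes the Gaussian-field estimate of a child marginal in a toy two-parent network and compares it with exact enumeration. The network and its weights are illustrative assumptions; since the two parents are independent roots here, the diagonal R is exact and only the Gaussian field itself is an approximation.

```python
import numpy as np
from itertools import product

sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))

def gauss_average(f, mu, var, n=32):
    """One-dimensional field average <f(h)> for h ~ N(mu, var),
    computed with Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n)
    return np.dot(w, f(mu + np.sqrt(2.0 * var) * x)) / np.sqrt(np.pi)

# Tiny layered SBN: two root nodes feeding one child (assumed weights).
theta1, theta2 = 0.3, -0.2                 # biases of the two parents
w, theta_c = np.array([1.0, 1.0]), 0.0     # child weights and bias

m = np.array([sigmoid(theta1), sigmoid(theta2)])   # exact root means

# Gaussian-field approximation of the child marginal, eqs. (3)-(5);
# independent parents give a diagonal R with R_jj = m_j (1 - m_j).
mu_c = np.dot(w, m) + theta_c
var_c = np.dot(w ** 2, m * (1.0 - m))
m_child_gf = gauss_average(sigmoid, mu_c, var_c)

# Exact child marginal by enumerating the 4 parent configurations.
m_child_exact = sum(
    (m[0] if s1 else 1 - m[0]) * (m[1] if s2 else 1 - m[1])
    * sigmoid(w[0] * s1 + w[1] * s2 + theta_c)
    for s1, s2 in product([0, 1], repeat=2))
```

With these small weights the Gaussian-field estimate lands within a fraction of a percent of the enumerated marginal, matching the behavior reported for the small-weight regime.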
\nThe diagonal terms of the node covariance matrix are ~i = mi (1- mi)' In contrast \nto previous studies, we include off diagonal terms in the calculation of R [4] . From \n\n\fGaussian Fields for Approximate Inference \n\n395 \n\n(5) we only need to find correlations between parents i and j of a node. These are \neasy to calculate in the layered networks that we are considering, because neither i \nnor j is a descendant of the other: \n\nRjj = p(Sj = 1, Sj = 1) - mjmj \n\n= J p(Si = Ilhj)p(Sj = Ilhj)p(hj, hj)dh - mimj \n\n= (0\" (hd 0\" (h j ) \n\n(h h) - mjmj \nP \n\nJ, J \n\n(6) \n\n(7) \n\n(8) \n\nAssuming that the joint distribution p( h j , hj ) is Gaussian, we again need its mean \nand covariance, given by \n\n~ij = (D.hjD.hj) = L WjkWjl (D.skD.SI) = L WikWjlRkl \n\nkl \n\nkl \n\n(10) \n\nUnder this scheme, we have a closed set of equations, (4,5,8,10) for the means \nmj and covariance matrix Rij which can be solved by forward propagation of the \nequations. That is, we start from nodes without parents, and then consider the \nnext layer of nodes, repeating the procedure until a full sweep through the network \nhas been completed. The one and two dimensional field averages, equations (3) \nand (8), are computed using Gaussian Quadrature. This results in an extremely \nfast procedure for approximating the marginals mi, requiring only a single sweep \nthrough the network. \nOur approach is related to that of [6] by the common motivating assumption that \neach node has a large number of parents. This is used in [6] to obtain actual \nbounds on quantities of interest such as joint marginals. Our approach does not \ngive bounds. Its advantage, however, is that it allows fluctuations in the fields hi, \nwhich are effectively excluded in [6] by the assumed scaling of the weights Wij with \nthe number of parents per node. 
\n\n2.1 Relation to Variational Mean Field Theory \n\nIn the variational approach, one fits a tractable approximating distribution Q to \nthe SBN. Taking Q factorised, Q(s) = Dj m:' (1 - md l - 3 \u2022 we have the bound \nIn p (Sl ... sn) 2: L {-mj In mj -\n\n(1 - md In (1 - md} \n\ni \n\nThe final term in (11) causes some difficulty even in the case in which Q is a fac(cid:173)\ntorised model. Formally, this is because this term does not have the same graphical \nstructure as the tractable model Q. One way around around this difficulty is to em(cid:173)\nploy a further bound, with associated variational parameters [7]. Another approach \nis to make the Gaussian assumption for the field hi as in section (2). Because Q is \nfactorised, corresponding to a diagonal correlation matrix R, this gives [4] \n\n(12) \n\n\f396 \n\nD. Barber and P Sollich \n\nwhere Pi = ~j Wijmj + Oi and (1[ = ~j w[jmj(l - mj). Note that this is a one \ndimensional integral of a smooth function. In contrast to [4] we therefore evaluate \nthis quantity using Gaussian Quadrature. This has the advantage that no extra \nvariational parameters need to be introduced. Technically, the assumption of a \nGaussian field distribution means that (11) is no longer a bound. Nevertheless, in \npractice it is found that this has little effect on the quality of the resulting solution. \nIn our implementation of the variational approach, we find the optimal parameters \nmi by maximising the above equation for each component mi separately, cycling \nthrough the nodes until the parameters mi do not change by more than 10- 1\u00b0. \nThis is repeated 5 times, and the solution with the highest bound score is chosen. \nNote that these equations cannot be solved by forward propagation alone since the \nfinal term contains contributions from all the nodes in the network. This is in \ncontrast to the GF approach of section (2) . 
Finding appropriate parameters m_i by the variational approach is therefore rather slower than using the GF method.

In arriving at the above equations, we have made two assumptions. The first is that the intractable distribution is well approximated by a factorised model. The second is that the field distribution is Gaussian. The first step is necessary in order to obtain a bound on the likelihood of the model (although this is slightly compromised by the Gaussian field assumption). In the GF approach we dispense with this assumption of an effectively factorised network (partially because, if we are only interested in inference, a bound on the model likelihood is less relevant). The GF method may therefore prove useful for a broader class of networks than the variational approach.

2.2 Results for unconditional marginals

We compared three procedures for estimating the marginal values p(s_i = 1) for all the nodes in the network, namely the variational theory, as described in section (2.1), the diagonal Gaussian field theory, and the non-diagonal Gaussian field theory which includes correlation effects between parents. Results for small weight values w_{ij} are shown in fig(2). In this case, all three methods perform reasonably well, although there is a significant improvement in using the GF methods over the variational procedure; parental correlations are not important (compare figs(2b) and (2c)). In fig(3) the weights and biases are chosen such that the exact mean variables m_i are roughly 0.5, with non-trivial correlation effects between parents. Note that the variational mean field theory now provides a poor solution, whereas the GF methods are relatively accurate. The effect of using the non-diagonal R terms is beneficial, although not dramatically so.

3 Calculating Conditional Marginals

We consider now how to calculate conditional marginals, given some evidential nodes.
(In contrast to [6], any set of nodes in the network, not just output nodes, can be considered evidential.) We write the evidence in the following manner:

E = {s_{c1} = s̄_{c1}, ..., s_{ck} = s̄_{ck}} = {E_{c1}, ..., E_{ck}}

The quantities that we are interested in are conditional marginals which, from Bayes' rule, are related to the joint distribution by

P(S_i = 1 | E) = P(S_i = 1, E) / (P(S_i = 0, E) + P(S_i = 1, E))    (13)

That is, provided that we have a procedure for estimating joint marginals, we can obtain conditional marginals too. Without loss of generality, we therefore consider

Gaussian Fields for Approximate Inference    397

Figure 3: Errors in the estimated marginals using (a) the mean field model (mean error = 0.4188), (b) the Gaussian field with diagonal covariance (mean error = 0.0253), and (c) the Gaussian field with non-diagonal covariance (mean error = 0.0198). All weights are set uniformly from 0 to 50. Biases are set to -0.5 of the summed parental weights plus a uniform random number from -2.5 to 2.5. The root node is set to be 1 with probability 0.5. This has the effect of making all the nodes in the exact network roughly 0.5 in mean, with non-negligible correlations between parental nodes. 160 simulations were made.

in different layers, we require the correlations between their fields h to evaluate (17). Such 'inter-layer' correlations were not required in section (2), and to be able to use the same calculational scheme we simply neglect them. (We leave a study of the effects of this assumption for future work.) The average in (17) then factors into groups, where each group contains evidential terms in a particular layer.
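The Bayes-rule identity (13) is simple to apply once joint marginals are available. A minimal sketch, in which `joint_marginal` is a placeholder for any estimator of P(s_i = v, E), for instance one forward sweep of the field equations with node i clamped to the value v (the interface is our own illustration, not the paper's code):

```python
def conditional_marginal(joint_marginal, i, evidence):
    """P(s_i = 1 | E) via equation (13):
    P(s_i = 1, E) / (P(s_i = 0, E) + P(s_i = 1, E)).

    `joint_marginal(i, v, evidence)` is a placeholder for any procedure that
    returns an estimate of the joint marginal P(s_i = v, E).
    """
    p1 = joint_marginal(i, 1, evidence)
    p0 = joint_marginal(i, 0, evidence)
    return p1 / (p0 + p1)

# Toy check against an exact two-node joint distribution p(s0, s1):
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
exact = lambda i, v, e: joint[(v, e)]   # evidence e is the observed value of s1
p = conditional_marginal(exact, 0, 1)   # P(s0 = 1 | s1 = 1) = 0.4 / 0.6
```

Any approximate joint-marginal estimator can be substituted for `exact` without changing the surrounding logic, which is the point made in the text.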
The conditional marginal for node i is obtained by repeating the above procedure with the desired marginal node clamped to its opposite value, and then using these results in (13). The above procedure is repeated for each conditional marginal that we are interested in. Although this may seem computationally expensive, the marginal for each node is computed quickly, since the equations are solved by one forward propagation sweep only.

Figure 4: Estimating the conditional marginal of the top node being in state 1, given that the four bottom nodes are in state 1: (a) mean error = 0.1534; (b) Gaussian field, diagonal covariance, mean error = 0.0931; (c) Gaussian field, non-diagonal covariance, mean error = 0.0865. Weights were drawn from a zero mean Gaussian with variance 5, with biases set to -0.5 of the summed parental weights plus a uniform random number from -2.5 to 2.5. Results of 160 simulations.

3.1 Results for conditional marginals

We used the same structure as in the previous experiments, as shown in fig(1). We are interested here in calculating the probability that the top node is in state 1, given that the four bottom nodes are in state 1. Weights were chosen from a zero mean Gaussian with variance 5. Biases were set to negative half of the summed parent weights, plus a uniform random value from -2.5 to 2.5. Correlation effects in these networks are not as strong as in the experiments in section (2.2), although the improvement of the GF theory over the variational theory seen in fig(4) remains clear. The improvement from the off-diagonal terms in R is minimal.

4 Conclusion

Despite their appropriateness for learning, variational methods may not be equally suited to inference, making more tailored methods attractive.
We have considered an approximation procedure that is based on assuming that the distribution of the weighted input to a node is approximately Gaussian. Correlation effects between parents of a node were taken into account to improve the Gaussian theory, although in our examples this gave only relatively modest improvements.

The variational mean field theory performs poorly in networks with strong correlation effects between nodes. On the other hand, one may conjecture that the Gaussian field approach will not generally perform catastrophically worse than the factorised variational mean field theory. One advantage of the variational theory is the presence of an objective function against which competing solutions can be compared. However, finding an optimum solution for the mean parameters m_j from this function is numerically complex. Since the Gaussian field theory is extremely fast to solve, an interesting compromise might be to prime the variational solution with the results from the Gaussian field theory.

Acknowledgments

DB would like to thank Bert Kappen and Wim Wiegerinck for stimulating and helpful discussions. PS thanks the Royal Society for financial support.

References

[1] R. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56:71-113, 1992.

[2] E. Castillo, J. M. Gutierrez, and A. S. Hadi. Expert Systems and Probabilistic Network Models. Springer, 1997.

[3] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An Introduction to Variational Methods for Graphical Models. In M. I. Jordan, editor, Learning in Graphical Models, pages 105-161. Kluwer, 1998.

[4] L. Saul and M. I. Jordan. A mean field learning algorithm for unsupervised neural networks. In M. I. Jordan, editor, Learning in Graphical Models, 1998.

[5] D. Barber and W. Wiegerinck. Tractable variational structures for approximating graphical models. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. MIT Press, 1999.

[6] M. Kearns and L. Saul. Inference in Multilayer Networks via Large Deviation Bounds. In Advances in Neural Information Processing Systems 11, 1999.

[7] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean Field Theory for Sigmoid Belief Networks. Journal of Artificial Intelligence Research, 4:61-76, 1996.