{"title": "Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology", "book": "Advances in Neural Information Processing Systems", "page_first": 673, "page_last": 679, "abstract": null, "full_text": "Correctness of belief propagation in Gaussian graphical models of arbitrary topology\n\nYair Weiss\nComputer Science Division\nUC Berkeley, 485 Soda Hall\nBerkeley, CA 94720-1776\nPhone: 510-642-5029\nyweiss@cs.berkeley.edu\n\nWilliam T. Freeman\nMitsubishi Electric Research Lab\n201 Broadway\nCambridge, MA 02139\nPhone: 617-621-7527\nfreeman@merl.com\n\nAbstract\n\nLocal \"belief propagation\" rules of the sort proposed by Pearl [15] are guaranteed to converge to the correct posterior probabilities in singly connected graphical models. Recently, a number of researchers have empirically demonstrated good performance of \"loopy belief propagation\", using these same rules on graphs with loops. Perhaps the most dramatic instance is the near Shannon-limit performance of \"Turbo codes\", whose decoding algorithm is equivalent to loopy belief propagation. Except for the case of graphs with a single loop, there has been little theoretical understanding of the performance of loopy propagation. Here we analyze belief propagation in networks with arbitrary topologies when the nodes in the graph describe jointly Gaussian random variables. We give an analytical formula relating the true posterior probabilities with those calculated using loopy propagation. We give sufficient conditions for convergence and show that when belief propagation converges it gives the correct posterior means for all graph topologies, not just networks with a single loop. The related \"max-product\" belief propagation algorithm finds the maximum posterior probability estimate for singly connected networks. 
We show that, even for non-Gaussian probability distributions, the convergence points of the max-product algorithm in loopy networks are maxima over a particular large local neighborhood of the posterior probability. These results help clarify the empirical performance results and motivate using the powerful belief propagation algorithm in a broader class of networks.\n\nProblems involving probabilistic belief propagation arise in a wide variety of applications, including error correcting codes, speech recognition and medical diagnosis. If the graph is singly connected, there exist local message-passing schemes to calculate the posterior probability of an unobserved variable given the observed variables. Pearl [15] derived such a scheme for singly connected Bayesian networks and showed that this \"belief propagation\" algorithm is guaranteed to converge to the correct posterior probabilities (or \"beliefs\").\n\nSeveral groups have recently reported excellent experimental results by running algorithms equivalent to Pearl's algorithm on networks with loops [8, 13, 6]. Perhaps the most dramatic instance of this performance is for \"Turbo code\" [2] error correcting codes. These codes have been described as \"the most exciting and potentially important development in coding theory in many years\" [12] and have recently been shown [10, 11] to utilize an algorithm equivalent to belief propagation in a network with loops.\n\nProgress in the analysis of loopy belief propagation has been made for the case of networks with a single loop [17, 18, 4, 1]. For these networks, it can be shown that (1) unless all the compatibilities are deterministic, loopy belief propagation will converge. 
(2) The difference between the loopy beliefs and the true beliefs is related to the convergence rate of the messages - the faster the convergence, the more exact the approximation - and (3) if the hidden nodes are binary, then the loopy beliefs and the true beliefs are both maximized by the same assignments, although the confidence in that assignment is wrong for the loopy beliefs.\n\nIn this paper we analyze belief propagation in graphs of arbitrary topology, for nodes describing jointly Gaussian random variables. We give an exact formula relating the correct marginal posterior probabilities with the ones calculated using loopy belief propagation. We show that if belief propagation converges, then it will give the correct posterior means for all graph topologies, not just networks with a single loop. We show that the covariance estimates will generally be incorrect but present a relationship between the error in the covariance estimates and the convergence speed. For Gaussian or non-Gaussian variables, we show that the \"max-product\" algorithm, which calculates the MAP estimate in singly connected networks, only converges to points that are maxima over a particular large neighborhood of the posterior probability of loopy networks.\n\n1 Analysis\n\nTo simplify the notation, we assume the graphical model has been preprocessed into an undirected graphical model with pairwise potentials. Any graphical model can be converted into this form, and running belief propagation on the pairwise graph is equivalent to running belief propagation on the original graph [18]. We assume each node xi has a local observation yi. In each iteration of belief propagation, each node xi sends a message to each neighboring xj that is based on the messages it received from the other neighbors, its local observation yi and the pairwise potentials Ψij(xi, xj) and Ψii(xi, yi). 
We assume the message-passing occurs in parallel.\n\nThe idea behind the analysis is to build an unwrapped tree. The unwrapped tree is the graphical model which belief propagation is solving exactly when one applies the belief propagation rules in a loopy network [9, 20, 18]. It is constructed by maintaining the same local neighborhood structure as the loopy network but replicating nodes so there are no loops. The potentials and the observations are replicated from the loopy graph. Figure 1(a) shows an unwrapped tree for the diamond shaped graph in (b). By construction, the belief at the root node x̃1 is identical to that at node x1 in the loopy graph after four iterations of belief propagation. Each node has a shaded observed node attached to it, omitted here for clarity.\n\nFigure 1: Left: A Markov network with multiple loops. Right: The unwrapped network corresponding to this structure.\n\nBecause the original network represents jointly Gaussian variables, so will the unwrapped tree. Since it is a tree, belief propagation is guaranteed to give the correct answer for the unwrapped graph. We can thus use Gaussian marginalization formulae to calculate the true means and variances in both the original and the unwrapped networks. In this way, we calculate the accuracy of belief propagation for Gaussian networks of arbitrary topology.\n\nWe assume that the joint mean is zero (the means can be added in later). The joint distribution of z = (x; y) is given by P(z) = α exp(-½ zᵀVz), where V = (Vxx Vxy; Vyx Vyy). It is straightforward to construct the inverse covariance matrix V of the joint Gaussian that describes a given Gaussian graphical model [3].\n\nWriting out the exponent of the joint and completing the square shows that the mean µ of x, given the observations y, is given by:\n\nµ = -Vxx⁻¹ Vxy y    (1)\n\nand the covariance matrix Cx|y of x given y is: Cx|y = Vxx⁻¹. 
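As a quick numerical check of the conditioning formulas above (our own sketch, not part of the paper; the matrix sizes and random values are arbitrary), the mean and covariance computed from the joint inverse covariance agree with the standard covariance-form conditioning identities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint inverse covariance V over z = (x; y): 3 hidden, 2 observed nodes.
M = rng.standard_normal((5, 5))
V = M @ M.T + 5 * np.eye(5)        # symmetric positive definite precision

nx = 3
Vxx, Vxy = V[:nx, :nx], V[:nx, nx:]
y = rng.standard_normal(5 - nx)    # observed values

# Equation 1 and the conditional covariance of x given y.
mu = -np.linalg.solve(Vxx, Vxy @ y)
C_xy = np.linalg.inv(Vxx)

# Cross-check against conditioning the covariance matrix S = V^-1 directly.
S = np.linalg.inv(V)
Sxx, Sxy, Syy = S[:nx, :nx], S[:nx, nx:], S[nx:, nx:]
mu_ref = Sxy @ np.linalg.solve(Syy, y)
C_ref = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)

assert np.allclose(mu, mu_ref) and np.allclose(C_xy, C_ref)
```

Either form yields the same posterior; the inverse-covariance form is the one the analysis works with.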
We will denote by Cxi|y the ith row of Cx|y, so the marginal posterior variance of xi given the data is σ²(i) = Cxi|y(i).\n\nWe will use a tilde for unwrapped quantities. We scan the tree in breadth first order and denote by x̃ the vector of values in the hidden nodes of the tree when so scanned. Similarly, we denote by ỹ the observed nodes scanned in the same order and by Ṽxx, Ṽxy the inverse covariance matrices. Since we are scanning in breadth first order the last nodes are the leaf nodes, and we denote by L the number of leaf nodes. By the nature of unwrapping, µ̃(1) is the mean of the belief at node x1 after t iterations of belief propagation, where t is the number of unwrappings. Similarly, σ̃²(1) = C̃x1|y(1) is the variance of the belief at node x1 after t iterations.\n\nBecause the data is replicated we can write ỹ = Oy, where O(i, j) = 1 if ỹi is a replica of yj and 0 otherwise. Since the potentials Ψ(xi, yi) are replicated, we can write ṼxyO = OVxy. Since the Ψ(xi, xj) are also replicated and all non-leaf x̃i have the same connectivity as the corresponding xi, we can write ṼxxO = OVxx + E, where E is zero in all but the last L rows. When these relationships between the loopy and unwrapped inverse covariance matrices are substituted into the loopy and unwrapped versions of equation 1, one obtains the following expression, true for any iteration [19]:\n\nµ̃ = Oµ + C̃x|y e    (2)\n\nwhere e is a vector that is zero everywhere but the last L components (corresponding to the leaf nodes). Our choice of the node for the root of the tree is arbitrary, so this applies to all nodes of the loopy network. This formula relates, for any node of a network with loops, the means calculated at each iteration by belief propagation with the true posterior means. 
\n\nSimilarly, when the relationship between the loopy and unwrapped inverse covariance matrices is substituted into the loopy and unwrapped definitions of Cx|y, we can relate the marginalized covariances calculated by belief propagation to the true ones [19]:\n\nσ̃²(1) = σ²(1) + C̃x1|y e1 - C̃x1|y e2    (3)\n\nwhere e1 is a vector that is zero everywhere but the last L components, while e2 is equal to 1 for all nodes in the unwrapped tree that are replicas of x1 except for the root x̃1 itself. All other components of e2 are zero.\n\nFigure 2: The conditional correlation between the root node and all other nodes in the unwrapped tree of Fig. 1 after eight iterations. Potentials were chosen randomly. Nodes are presented in breadth first order so the last elements are the correlations between the root node and the leaf nodes. We show that if this correlation goes to zero, belief propagation converges and the loopy means are exact. Symbols plotted with a star denote correlations with nodes that correspond to the node x1 in the loopy graph. The sum of these correlations gives the correct variance of node x1 while loopy propagation uses only the first correlation.\n\nFigure 2 shows C̃x1|y for the diamond network in Fig. 1. We generated random potential functions and observations and calculated the conditional correlations in the unwrapped tree. Note that the conditional correlation decreases with distance in the tree - we are scanning in breadth first order so the last L components correspond to the leaf nodes. As the number of iterations of loopy propagation is increased, the size of the unwrapped tree increases and the conditional correlation between the leaf nodes and the root node decreases. 
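As a numerical illustration of these relationships (our own self-contained sketch, not code from the paper), scalar Gaussian belief propagation in information form can be run on a three-node loop with attractive, diagonally dominant potentials: the converged means match the exact posterior means while the loopy variances are overconfident. The matrix A below plays the role of Vxx in equation 1 and b that of -Vxy y; all values are our choices.

```python
import numpy as np

# Posterior in information form: p(x) prop. to exp(-x'Ax/2 + b'x).
A = np.array([[2.0, -0.3, -0.4],
              [-0.3, 2.0, -0.5],
              [-0.4, -0.5, 2.0]])   # attractive, diagonally dominant loop
b = np.array([1.0, 2.0, 3.0])
n = len(b)
edges = [(i, j) for i in range(n) for j in range(n) if i != j]

P = {e: 0.0 for e in edges}          # message precision components
h = {e: 0.0 for e in edges}          # message information components
for _ in range(200):                 # parallel message updates
    Pn, hn = {}, {}
    for (i, j) in edges:
        # Combine node i's local terms with messages from neighbors k != j.
        alpha = A[i, i] + sum(P[k, i] for k in range(n) if k not in (i, j))
        beta = b[i] + sum(h[k, i] for k in range(n) if k not in (i, j))
        Pn[i, j] = -A[i, j] ** 2 / alpha
        hn[i, j] = -A[i, j] * beta / alpha
    P, h = Pn, hn

# Beliefs: precision and mean at each node.
prec = np.array([A[i, i] + sum(P[k, i] for k in range(n) if k != i)
                 for i in range(n)])
mean = np.array([b[i] + sum(h[k, i] for k in range(n) if k != i)
                 for i in range(n)]) / prec

mu_true = np.linalg.solve(A, b)
var_true = np.diag(np.linalg.inv(A))

assert np.allclose(mean, mu_true)    # means exact at convergence
assert np.all(1.0 / prec < var_true) # loopy variances are overconfident
```

Replacing the off-diagonal signs or strengths changes the convergence speed, consistent with the role of the off-diagonal to diagonal ratio described below.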
\n\nFrom equations 2-3 it is clear that if the conditional correlations between the leaf nodes and the root node are zero for all sufficiently large unwrappings then (1) belief propagation converges, (2) the means are exact and (3) the variances may be incorrect. In practice the conditional correlations will not actually be equal to zero for any finite unwrapping. In [19] we give a more precise statement: if the conditional correlation of the root node and the leaf nodes decreases rapidly enough then (1) belief propagation converges, (2) the means are exact and (3) the variances may be incorrect. We also show sufficient conditions on the potentials Ψ(xi, xj) for the correlation to decrease rapidly enough: the rate at which the correlation decreases is determined by the ratio of off-diagonal and diagonal components in the quadratic form defining the potentials [19].\n\nHow wrong will the variances be? The term C̃x1|y e2 in equation 3 is simply the sum of many components of C̃x1|y. Figure 2 shows these components. The correct variance is the sum of all the components while the belief propagation variance approximates this sum with the first (and dominant) term. Whenever there is a positive correlation between the root node and other replicas of x1, the loopy variance is strictly less than the true variance - the loopy estimate is overconfident.\n\nFigure 3: (a) 25 x 25 graphical model for simulation. The unobserved nodes (unfilled) were connected to their four nearest neighbors and to an observation node (filled). (b) The error of the estimates of loopy propagation and successive over-relaxation (SOR) as a function of iteration. Note that belief propagation converges much faster than SOR. 
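The experiment of Figure 3 can be reproduced at small scale. The sketch below (our own; the paper uses a 25 x 25 grid with roughly 20% of nodes observed) builds the quadratic grid model used in the simulations (equation 4 of section 2), solves equation 1 directly, and runs SOR sweeps until they agree. The grid size, observed set, and relaxation parameter omega are our choices.

```python
import numpy as np

n = 5
N = n * n
w_pair = 0.01                                  # neighbor coupling, as in the paper
observed = {0, 7, 12, 18, 24}                  # ~20% of nodes (our fixed choice)
rng = np.random.default_rng(1)
y = rng.standard_normal(N)                     # random observations

# Inverse covariance A (= Vxx) and information vector b (= -Vxy y) from the
# exponent -1/2 [sum_ij wij (xi-xj)^2 + sum_i wii (xi-yi)^2], wii = 1 if observed.
A = np.zeros((N, N))
b = np.zeros(N)
for i in range(N):
    r, c = divmod(i, n)
    for (rr, cc) in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= rr < n and 0 <= cc < n:
            j = rr * n + cc
            A[i, i] += w_pair
            A[i, j] -= w_pair
    if i in observed:
        A[i, i] += 1.0
        b[i] = y[i]

mu_exact = np.linalg.solve(A, b)               # equation 1, direct solve

# SOR sweeps: x_i <- (1-omega) x_i + omega (b_i - sum_{j!=i} A_ij x_j) / A_ii
omega = 1.2
x = np.zeros(N)
for _ in range(10000):
    for i in range(N):
        sigma = A[i] @ x - A[i, i] * x[i]      # uses already-updated entries
        x[i] = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]

assert np.max(np.abs(x - mu_exact)) < 1e-6
```

Since A is symmetric positive definite, SOR is guaranteed to converge for 0 < omega < 2, but as the paper's comparison shows, it typically needs far more iterations than loopy propagation on the same model.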
\n\nNote that when the conditional correlation decreases rapidly to zero two things happen. First, the convergence is faster (because C̃x1|y e1 approaches zero faster). Second, the approximation error of the variances is smaller (because C̃x1|y e2 is smaller). Thus we have shown that, as in the single loop case, quick convergence is correlated with good approximation.\n\n2 Simulations\n\nWe ran belief propagation on the 25 x 25 2D grid of Fig. 3(a). The joint probability was:\n\nP(x, y) = α exp(-½ Σij wij (xi - xj)² - ½ Σi wii (xi - yi)²)    (4)\n\nwhere wij = 0 if nodes xi, xj are not neighbors and 0.01 otherwise, and wii was randomly selected to be 0 or 1 for all i, with the probability of 1 set to 0.2. The observations yi were chosen randomly. This problem corresponds to an approximation problem from sparse data where only 20% of the points are visible.\n\nWe found the exact posterior by solving equation 1. We also ran belief propagation and found that when it converged, the calculated means were identical to the true means up to machine precision. Also, as predicted by the theory, the calculated variances were too small - the belief propagation estimate was overconfident.\n\nIn many applications, the solution of equation 1 by matrix inversion is intractable and iterative methods are used. Figure 3 compares the error in the means as a function of iterations for loopy propagation and successive over-relaxation (SOR), considered one of the best relaxation methods [16]. Note that after essentially five iterations loopy propagation gives the right answer while SOR requires many more. As expected from the fast convergence, the approximation error in the variances was quite small. The median error was 0.018. For comparison the true variances ranged from 0.01 to 0.94 with a mean of 0.322. Also, the nodes for which the approximation error was worst were indeed the nodes that converged slowest.\n\n3 Discussion\n\n
Weiss and W T Freeman \n\nIndependently, two other groups have recently analyzed special cases of Gaussian graphical \nmodels. Frey [7] analyzed the graphical model corresponding to factor analysis and gave \nconditions for the existence of a stable fixed-point. Rusmevichientong and Van Roy [14] \nanalyzed a graphical model with the topology of turbo decoding but a Gaussian joint den(cid:173)\nsity. For this specific graph they gave sufficient conditions for convergence and showed \nthat the means are exact. \n\nOur main interest in the Gaussian case is to understand the performance of belief propaga(cid:173)\ntion in general networks with multiple loops. We are struck by the similarity of our results \nfor Gaussians in arbitrary networks and the results for single loops of arbitrary distribu(cid:173)\ntions [18]. First, in single loop networks with binary nodes, loopy belief at a node and the \ntrue belief at a node are maximized by the same assignment while the confidence in that \nassignment is incorrect. In Gaussian networks with multiple loops, the mean at each node \nis correct but the confidence around that mean may be incorrect. Second, for both single(cid:173)\nloop and Gaussian networks, fast belief propagation convergence correlates with accurate \nbeliefs. Third, in both Gaussians and discrete valued single loop networks, the statistical \ndependence between root and leaf nodes governs the convergence rate and accuracy. \n\nThe two models are quite different. Mean field approximations are exact for Gaussian \nMRFs while they work poorly in sparsely connected discrete networks with a single loop. \nThe results for the Gaussian and single-loop cases lead us to believe that similar results \nmay hold for a larger class of networks. \n\nCan our analysis be extended to non-Gaussian distributions? 
The basic idea applies to arbitrary graphs and arbitrary potentials: belief propagation is performing exact inference on a tree that has the same local neighborhood structure as the loopy graph. However, the linear algebra that we used to calculate exact expressions for the error in belief propagation at any iteration holds only for Gaussian variables.\n\nWe have used a similar approach to analyze the related \"max-product\" belief propagation algorithm on arbitrary graphs with arbitrary distributions [5] (both discrete and continuous valued nodes). We show that if the max-product algorithm converges, the max-product assignment has greater posterior probability than any assignment in a particular large region around that assignment. While this is a weaker condition than a global maximum, it is much stronger than a simple local maximum of the posterior probability.\n\nThe sum-product and max-product belief propagation algorithms are fast and parallelizable. Due to the well known hardness of probabilistic inference in graphical models, belief propagation will obviously not work for arbitrary networks and distributions. Nevertheless, a growing body of empirical evidence shows its success in many networks with loops. Our results justify applying belief propagation in certain networks with multiple loops. This may enable fast, approximate probabilistic inference in a range of new applications.\n\nReferences\n\n[1] S.M. Aji, G.B. Horn, and R.J. McEliece. On the convergence of iterative decoding on graphs with a single cycle. In Proc. 1998 ISIT, 1998.\n\n[2] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo codes. In Proc. IEEE International Communications Conference '93, 1993.\n\n[3] R. Cowell. Advanced inference in Bayesian networks. In M.I. Jordan, editor, Learning in Graphical Models. MIT Press, 1998.\n\n[4] G.D. Forney, F.R. Kschischang, and B. Marcus. 
Iterative decoding of tail-biting trellises. Preprint presented at the 1998 Information Theory Workshop, San Diego, 1998.\n\n[5] W.T. Freeman and Y. Weiss. On the fixed points of the max-product algorithm. Technical Report 99-39, MERL, 201 Broadway, Cambridge, MA 02139, 1999.\n\n[6] W.T. Freeman and E.C. Pasztor. Learning to estimate scenes from images. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Adv. Neural Information Processing Systems 11. MIT Press, 1999.\n\n[7] B.J. Frey. Turbo factor analysis. In Adv. Neural Information Processing Systems 12. 2000. To appear.\n\n[8] Brendan J. Frey. Bayesian Networks for Pattern Classification, Data Compression and Channel Coding. MIT Press, 1998.\n\n[9] R.G. Gallager. Low Density Parity Check Codes. MIT Press, 1963.\n\n[10] F.R. Kschischang and B.J. Frey. Iterative decoding of compound codes by probability propagation in graphical models. IEEE Journal on Selected Areas in Communication, 16(2):219-230, 1998.\n\n[11] R.J. McEliece, D.J.C. MacKay, and J.F. Cheng. Turbo decoding as an instance of Pearl's 'belief propagation' algorithm. IEEE Journal on Selected Areas in Communication, 16(2):140-152, 1998.\n\n[12] R.J. McEliece, E. Rodemich, and J.F. Cheng. The Turbo decision algorithm. In Proc. 33rd Allerton Conference on Communications, Control and Computing, pages 366-379, Monticello, IL, 1995.\n\n[13] K.P. Murphy, Y. Weiss, and M.I. Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proceedings of Uncertainty in AI, 1999.\n\n[14] P. Rusmevichientong and B. Van Roy. An analysis of Turbo decoding with Gaussian densities. In Adv. Neural Information Processing Systems 12. 2000. To appear.\n\n[15] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.\n\n[16] Gilbert Strang. 
Introduction to Applied Mathel1Ultics. Wellesley-Cambridge, 1986. \n[I7J Y. Weiss. Belief propagation and revision in networks with loops. Technical Report 1616, MIT \n\nAI lab, 1997. \n\n[18J Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural \n\nComputation, to appear, 2000. \n\n[19] Y. Weiss and W. T. Freeman. Loopy propagation gives the correct posterior means for \nGaussians. Technical Report UCB.CSD-99-1046, Berkeley Computer Science Dept., 1999. \nwww.cs.berkeley.edu yweiss/. \n\n[20J N. Wiberg. Codes and decoding on general graphs. PhD thesis, Department of Electrical \n\nEngineering, U. Linkoping, Sweden, 1996. \n\n\f", "award": [], "sourceid": 1641, "authors": [{"given_name": "Yair", "family_name": "Weiss", "institution": null}, {"given_name": "William", "family_name": "Freeman", "institution": null}]}