{"title": "Tree-based reparameterization for approximate inference on loopy graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1008, "abstract": null, "full_text": "Tree-based reparameterization for approximate inference on loopy graphs \n\nMartin J. Wainwright, Tommi Jaakkola, and Alan S. Willsky \n\nDepartment of Electrical Engineering and Computer Science \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nmjwain@mit.edu \ntommi@ai.mit.edu \nwillsky@mit.edu \n\nAbstract \n\nWe develop a tree-based reparameterization framework that provides \na new conceptual view of a large class of iterative algorithms \nfor computing approximate marginals in graphs with cycles. It \nincludes belief propagation (BP), which can be reformulated as a \nvery local form of reparameterization. More generally, we consider \nalgorithms that perform exact computations over spanning trees \nof the full graph. On the practical side, we find that such tree \nreparameterization (TRP) algorithms have convergence properties \nsuperior to BP. The reparameterization perspective also provides \na number of theoretical insights into approximate inference, including \na new characterization of fixed points and an invariance \nintrinsic to TRP/BP. These two properties enable us to analyze \nand bound the error between the TRP/BP approximations and \nthe actual marginals. While our results arise naturally from the \nTRP perspective, most of them apply in an algorithm-independent \nmanner to any local minimum of the Bethe free energy. Our results \nalso have natural extensions to more structured approximations \n[e.g., 1, 2]. \n\n1 Introduction \n\nGiven a graphical model, one important problem is the computation of marginal \ndistributions of variables at each node. 
Although highly efficient algorithms exist \nfor this task on trees, exact solutions are prohibitively complex for more general \ngraphs of any substantial size. This difficulty motivates the use of approximate \ninference algorithms, of which one of the best-known and most widely studied is \nbelief propagation [3], also known as the sum-product algorithm in coding [e.g., 4]. \n\nRecent work has yielded some insight into belief propagation (BP). Several researchers \n[e.g., 5, 6] have analyzed the single loop case, where BP can be reformulated \nas a matrix powering method. For Gaussian processes on arbitrary graphs, \ntwo groups [7, 8] have shown that the means are exact when BP converges. For \ngraphs corresponding to turbo codes, Richardson [9] established the existence of \nfixed points, and gave conditions for their stability. More recently, Yedidia et al. [1] \nshowed that BP corresponds to constrained minimization of the Bethe free energy, \nand proposed extensions based on Kikuchi expansions [10]. Related extensions to \nBP were proposed in [2]. The paper [1] has inspired other researchers [e.g., 11, 12] to \ndevelop more sophisticated algorithms for minimizing the Bethe free energy. These \nadvances notwithstanding, much remains to be understood about the behavior of \nBP. \n\nThe framework of this paper provides a new conceptual view of various algorithms \nfor approximate inference, including BP. The basic idea is to seek a reparameterization \nof the distribution that yields factors which correspond, either exactly or \napproximately, to the desired marginal distributions. If the graph is acyclic (i.e., \na tree), then there exists a unique reparameterization specified by exact marginal \ndistributions over cliques. For a graph with cycles, we consider the idea of iteratively \nreparameterizing different parts of the distribution, each corresponding to an \nacyclic subgraph. 
As we will show, BP can be interpreted in exactly this manner, \nin which each reparameterization takes place over a pair of neighboring nodes. One \nof the consequences of this interpretation is a more storage-efficient \"message-free\" \nimplementation of BP. More significantly, this interpretation leads to more general \nupdates in which reparameterization is performed over arbitrary acyclic subgraphs, \nwhich we refer to as tree-based reparameterization (TRP) algorithms. \n\nAt a low level, the more global TRP updates can be viewed as a tree-based schedule \nfor message-passing. Indeed, a practical contribution of this paper is to demonstrate \nthat TRP updates tend to have better convergence properties than local \nBP updates. At a more abstract level, the reparameterization perspective provides \nvaluable conceptual insight, including a simple tree-consistency characterization of \nfixed points, as well as an invariance intrinsic to TRP/BP. These properties allow \nus to derive an exact expression for the error between the TRP/BP approximations \nand the actual marginals. Based on this exact expression, we derive computable \nbounds on the error. Most of these results, though they emerge very naturally in \nthe TRP framework, apply in an algorithm-independent manner to any constrained \nlocal minimum of the Bethe free energy, whether obtained by TRP/BP or an alternative \nmethod [e.g., 11, 12]. More details of our work can be found in [13, 14]. \n\n1.1 Basic notation \n\nAn undirected graph $\\mathcal{G} = (V, \\mathcal{E})$ consists of a set of nodes or vertices $V = \\{1, \\ldots, N\\}$ \nthat are joined by a set of edges $\\mathcal{E}$. Lying at each node $s \\in V$ is a discrete \nrandom variable $x_s \\in \\{0, \\ldots, m-1\\}$. The underlying sample space $\\mathcal{X}^N$ is the \nset of all $N$-vectors $x = \\{x_s \\mid s \\in V\\}$ over $m$ symbols, so that $|\\mathcal{X}^N| = m^N$. 
\nWe focus on stochastic processes that are Markov with respect to $\\mathcal{G}$, so that the \nHammersley-Clifford theorem [e.g., 3] guarantees that the distribution factorizes \nas $p(x) \\propto \\prod_{C \\in \\mathbf{C}} \\psi_C(x_C)$, where $\\psi_C(x_C)$ is a compatibility function depending only \non the subvector $x_C = \\{x_s \\mid s \\in C\\}$ of nodes in a particular clique $C$. Note that \neach individual node forms a singleton clique, so that some of the factors $\\psi_C$ may \ninvolve functions of each individual variable. As a consequence, if we have independent \nmeasurements $y_s$ of $x_s$ at some (or all) of the nodes, then Bayes' rule \nimplies that the effect of including these measurements (i.e., the transformation \nfrom the prior distribution $p(x)$ to the conditional distribution $p(x \\mid y)$) is simply \nto modify the singleton factors. As a result, throughout this paper, we suppress \nexplicit mention of measurements, since the problems of computing marginals for \n$p(x)$ and for $p(x \\mid y)$ are of identical structure and complexity. The analysis of \nthis paper is restricted to graphs with singleton ($\\psi_s$) and pairwise ($\\psi_{st}$) cliques. \nHowever, it is straightforward to extend reparameterization to larger cliques, as in \ncluster variational methods [e.g., 10]. \n\n1.2 Exact tree inference as reparameterization \n\nAlgorithms for optimal inference on trees have appeared in the literature of various \nfields [e.g., 4, 3]. One important consequence of the junction tree representation \n[15] is that any exact algorithm for optimal inference on trees actually computes \nmarginal distributions for pairs $(s, t)$ of neighboring nodes. In doing so, it produces \nan alternative factorization $p(x) = \\prod_{s \\in V} P_s \\prod_{(s,t) \\in \\mathcal{E}} \\frac{P_{st}}{P_s P_t}$, where $P_s$ and $P_{st}$ \nare the single-node and pairwise marginals respectively. This $\\{P_s, P_{st}\\}$ representation \ncan be deduced from a more general factorization result on junction trees [e.g., 15]. 
Thus, exact inference on trees can be viewed as computing a reparameterized \nfactorization of the distribution $p(x)$ that explicitly exposes the local marginal \ndistributions. \n\n2 Tree-based reparameterization for graphs with cycles \n\nThe basic idea of a TRP algorithm is to perform successive reparameterization updates \non trees embedded within the original graph. Although such updates are \napplicable to arbitrary acyclic substructures, here we focus on a set $T^1, \\ldots, T^L$ \nof embedded spanning trees. To describe TRP updates, let $T$ be a pseudomarginal \nprobability vector consisting of single-node marginals $T_s(x_s)$ for $s \\in V$ \nand pairwise joint distributions $T_{st}(x_s, x_t)$ for edges $(s, t) \\in \\mathcal{E}$. Aside from positivity \nand normalization ($\\sum_{x_s} T_s = 1$; $\\sum_{x_s, x_t} T_{st} = 1$) constraints, a given vector \n$T$ is arbitrary, and gives rise to a parameterization of the distribution as \n$p(x; T) \\propto \\prod_{s \\in V} T_s \\prod_{(s,t) \\in \\mathcal{E}} \\frac{T_{st}}{(\\sum_{x_s} T_{st})(\\sum_{x_t} T_{st})}$, where the dependence of $T_s$ \nand $T_{st}$ on $x$ is omitted for notational simplicity. Ultimately, we shall seek vectors \ni.e., that belong to as well as upper and lower bounds on the \nactual marginals. (a) For weak potentials, TRP/BP approximation is excellent; \nbounds on exact marginals are tight. (b) For strong mixed potentials, approximation \nis poor. Bounds are looser, and for certain nodes, the TRP/BP approximation \nlies above the upper bounds on the actual marginal $P_{s;1}$. \n\nMuch of the analysis of this paper, including reparameterization, invariance, \nand error analysis, can be extended [see 14] to more structured approximation \nalgorithms [e.g., 1, 2]. Figure 3 illustrates the use of bounds in assessing when to \nuse a more structured approximation. For strong attractive potentials on the 3 x 3 \ngrid, the TRP/BP approximation in panel (a) is very poor, as reflected by relatively \nloose bounds on the actual marginals. 
In contrast, the Kikuchi approximation in \n(b) is excellent, as revealed by the tightness of the bounds. \n\n4 Discussion \n\nThe TRP framework of this paper provides a new view of approximate inference \nand makes both practical and conceptual contributions. On the practical side, we \nfind that more global TRP updates tend to have better convergence properties than \nlocal BP updates. The freedom in tree choice leads to open problems of a graph-theoretic \nnature: e.g., how to choose trees so as to guarantee convergence, or to \noptimize the rate of convergence? \n\nAmong the conceptual insights provided by the reparameterization perspective are \na new characterization of fixed points, an intrinsic invariance, and analysis of the \napproximation error. Importantly, most of these results apply to any constrained \nlocal minimum of the Bethe free energy, and have natural extensions [see 14] to \nmore structured approximations [e.g., 1, 2]. \n\nAcknowledgments \n\nThis work was partially funded by ODDR&E MURI Grant DAAD19-00-1-0466, by ONR \nGrant N00014-00-1-0089, and by AFOSR Grant F49620-00-1-0362; MJW was also supported \nby an NSERC 1967 fellowship. \n\nReferences \n\n[1] J. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS \n13, pages 689-695. MIT Press, 2001. \n\n[Figure 3 plots: two panels titled \"Bounds on single node marginals\", with node number on the horizontal axis. Panel (a) TRP/BP: legend Actual, TRP/BP, Bounds. Panel (b) Kikuchi: legend Actual, Structured approx., Bounds.] \n\nFigure 3. When to use a more structured approximation? 
(a) For strong attractive \npotentials on the 3 x 3 grid, BP approximation is poor, as reflected by loose \nbounds on the actual marginal. (b) Kikuchi approximation [1] for the same problem \nis excellent; corresponding bounds are tight. \n\n[2] T. P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, \nMIT Media Lab, 2001. \n\n[3] J. Pearl. Probabilistic reasoning in intelligent systems. Morgan Kaufmann, San Mateo, \n1988. \n\n[4] F. Kschischang and B. Frey. Iterative decoding of compound codes by probability \npropagation in graphical models. IEEE Sel. Areas Comm., 16(2):219-230, February \n1998. \n\n[5] J. B. Anderson and S. M. Hladnik. Tailbiting MAP decoders. IEEE Sel. Areas Comm., \n16:297-302, February 1998. \n\n[6] Y. Weiss. Correctness of local probability propagation in graphical models with loops. \nNeural Computation, 12:1-41, 2000. \n\n[7] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical \nmodels of arbitrary topology. In NIPS 12, pages 673-679. MIT Press, 2000. \n\n[8] P. Rusmevichientong and B. Van Roy. An analysis of turbo decoding with Gaussian \ndensities. In NIPS 12, pages 575-581. MIT Press, 2000. \n\n[9] T. Richardson. The geometry of turbo-decoding dynamics. IEEE Trans. Info. Theory, \n46(1):9-23, January 2000. \n\n[10] R. Kikuchi. The theory of cooperative phenomena. Physical Review, 81:988-1003, \n1951. \n\n[11] M. Welling and Y. Teh. Belief optimization: A stable alternative to loopy belief \npropagation. In Uncertainty in Artificial Intelligence, July 2001. \n\n[12] A. Yuille. A double-loop algorithm to minimize the Bethe and Kikuchi free energies. \nNeural Computation, To appear, 2001. \n\n[13] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization for \napproximate estimation on graphs with cycles. LIDS Tech. report P-2510: available \nat http://ssg.mit.edu/group/mjwain/mjwain.shtml, May 2001. \n\n[14] M. Wainwright. 
Stochastic processes on graphs with cycles: geometric and variational \napproaches. PhD thesis, MIT, Laboratory for Information and Decision Systems, \nJanuary 2002. \n\n[15] S. L. Lauritzen. Graphical models. Oxford University Press, Oxford, 1996. \n\n[16] W. Freeman and Y. Weiss. On the optimality of solutions of the max-product belief \npropagation algorithm in arbitrary graphs. IEEE Trans. Info. Theory, 47:736-744, \n2001. \n", "award": [], "sourceid": 2107, "authors": [{"given_name": "Martin", "family_name": "Wainwright", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Alan", "family_name": "Willsky", "institution": null}]}