{"title": "Loop Series and Bethe Variational Bounds in Attractive Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1425, "page_last": 1432, "abstract": "Variational methods are frequently used to approximate or bound the partition or likelihood function of a Markov random field. Methods based on mean field theory are guaranteed to provide lower bounds, whereas certain types of convex relaxations provide upper bounds. In general, loopy belief propagation (BP) provides (often accurate) approximations, but not bounds. We prove that for a class of attractive binary models, the value specified by any fixed point of loopy BP always provides a lower bound on the true likelihood. Empirically, this bound is much better than the naive mean field bound, and requires no further work than running BP. We establish these lower bounds using a loop series expansion due to Chertkov and Chernyak, which we show can be derived as a consequence of the tree reparameterization characterization of BP fixed points.", "full_text": "Loop Series and Bethe Variational Bounds\n\nin Attractive Graphical Models\n\nErik B. Sudderth and Martin J. Wainwright\n\nElectrical Engineering & Computer Science, University of California, Berkeley\n\nsudderth@eecs.berkeley.edu, wainwrig@eecs.berkeley.edu\n\nElectrical Engineering & Computer Science, Massachusetts Institute of Technology\n\nAlan S. Willsky\n\nwillsky@mit.edu\n\nAbstract\n\nVariational methods are frequently used to approximate or bound the partition\nor likelihood function of a Markov random \ufb01eld. Methods based on mean \ufb01eld\ntheory are guaranteed to provide lower bounds, whereas certain types of convex\nrelaxations provide upper bounds. In general, loopy belief propagation (BP) pro-\nvides often accurate approximations, but not bounds. 
We prove that for a class of\nattractive binary models, the so\u2013called Bethe approximation associated with any\nfixed point of loopy BP always lower bounds the true likelihood. Empirically,\nthis bound is much tighter than the naive mean field bound, and requires no\nfurther work than running BP. We establish these lower bounds using a loop series\nexpansion due to Chertkov and Chernyak, which we show can be derived as a\nconsequence of the tree reparameterization characterization of BP fixed points.\n\n1 Introduction\nGraphical models are widely used in many areas, including statistical machine learning, computer\nvision, bioinformatics, and communications. Such applications typically require computationally\nefficient methods for (approximately) solving various problems, including computing marginal\ndistributions and likelihood functions. The variational framework provides a suite of candidate\nmethods, including mean field approximations [3, 9], the sum\u2013product or belief propagation (BP)\nalgorithm [11, 14], Kikuchi and cluster variational methods [23], and related convex relaxations [21].\n\nThe likelihood or partition function of an undirected graphical model is of fundamental interest in\nmany contexts, including parameter estimation, error bounds in hypothesis testing, and\ncombinatorial enumeration. In rough terms, particular variational methods can be understood as solving\noptimization problems whose optima approximate the log partition function. For mean field\nmethods, this optimal value is desirably guaranteed to lower bound the true likelihood [9]. For other\nmethods, including the Bethe variational problem underlying loopy BP [23], optima may either\nover\u2013estimate or under\u2013estimate the truth. Although \u201cconvexified\u201d relaxations of the Bethe problem\nyield upper bounds [21], to date the best known lower bounds on the partition function are based on\nmean field theory. 
Recent work has studied loop series expansions [2, 4] of the partition function,\nwhich generate better approximations but not, in general, bounds.\n\nSeveral existing theoretical results show that loopy BP, and the corresponding Bethe approximation,\nhave desirable properties for graphical models with long cycles [15] or sufficiently weak\ndependencies [6, 7, 12, 19]. However, these results do not explain the excellent empirical performance\nof BP in many graphs with short cycles, like the nearest\u2013neighbor grids arising in spatial statistics\nand low\u2013level vision [3, 18, 22]. Such models often encode \u201csmoothness\u201d priors, and thus have\nattractive interactions which encourage connected variables to share common values. The first main\ncontribution of this paper is to demonstrate a family of attractive models for which the Bethe\nvariational method always yields lower bounds on the true likelihood. Although we focus on models with\nbinary variables (but arbitrary order of interactions), we suspect that some ideas are more generally\napplicable. For such models, these lower bounds are easily computed from any fixed point of loopy\nBP, and empirically improve substantially on naive mean field bounds.\n\nOur second main contribution lies in the route used to establish the Bethe lower bounds. In\nparticular, Sec. 3 uses the reparameterization characterization of BP fixed points [20] to provide a simple\nderivation for the loop series expansion of Chertkov and Chernyak [2]. The Bethe approximation\nis the first term in this representation of the true partition function. Sec. 4 then identifies\nattractive models for which all terms in this expansion are non\u2013negative, thus establishing the Bethe lower\nbound. 
We conclude with empirical results demonstrating the accuracy of this bound, and discuss\nimplications for future analysis and applications of loopy BP.\n\n2 Undirected Graphical Models\nGiven an undirected graph G = (V, E), with edges (s, t) \u2208 E connecting n vertices s \u2208 V, a\ngraphical model associates each node with a random variable Xs taking values xs \u2208 X. For pairwise\nMarkov random fields (MRFs) as in Fig. 1, the joint distribution of x := {xs | s \u2208 V} is specified\nvia a normalized product of local compatibility functions:\n\np(x) = (1/Z(\u03c8)) \u220f_{s\u2208V} \u03c8s(xs) \u220f_{(s,t)\u2208E} \u03c8st(xs, xt)    (1)\n\nThe partition function Z(\u03c8) := \u2211_{x\u2208X^n} \u220f_s \u03c8s(xs) \u220f_{(s,t)} \u03c8st(xs, xt), whose value depends on\nthe compatibilities \u03c8, is defined so that p(x) is properly normalized. We also consider distributions\ndefined by hypergraphs G = (V, C), where each hyperedge c \u2208 C connects some subset of the\nvertices (c \u2282 V). Letting xc := {xs | s \u2208 c}, the corresponding joint distribution equals\n\np(x) = (1/Z(\u03c8)) \u220f_{s\u2208V} \u03c8s(xs) \u220f_{c\u2208C} \u03c8c(xc)    (2)\n\nwhere as before Z(\u03c8) = \u2211_{x\u2208X^n} \u220f_s \u03c8s(xs) \u220f_c \u03c8c(xc). Such higher\u2013order random fields are\nconveniently described by the bipartite factor graphs [11] of Fig. 2.\n\nIn statistical physics, the partition function arises in the study of how physical systems respond to\nchanges in external stimuli or temperature [23]. Alternatively, when compatibility functions are\nparameterized by exponential families [20], log Z(\u03c8) is the family\u2019s cumulant generating function,\nand thus intrinsically related to the model\u2019s marginal statistics. For directed Bayesian networks\n(which can be factored as in eq. (2)), Z(\u03c8) is the marginal likelihood of observed data, and plays a\ncentral role in learning and model selection [9]. 
However, for general graphs coupling discrete\nrandom variables, the cost of exactly evaluating Z(\u03c8) grows exponentially with n [8]. Computationally\ntractable families of bounds on the true partition function are thus of great practical interest.\n\n2.1 Attractive Discrete Random Fields\nIn this paper, we focus on binary random vectors x \u2208 {0, 1}^n. We say that a pairwise MRF, with\ncompatibility functions \u03c8st : {0, 1}^2 \u2192 R+, has attractive interactions if\n\n\u03c8st(0, 0) \u03c8st(1, 1) \u2265 \u03c8st(0, 1) \u03c8st(1, 0)    (3)\n\nfor each edge (s, t) \u2208 E. Intuitively, this condition requires all potentials to place greater weight\non configurations where neighboring variables take the same value. Our later analysis is based on\npairwise marginal distributions \u03c4st(xs, xt), which we parameterize as follows, with rows indexed by\nxs \u2208 {0, 1} and columns by xt \u2208 {0, 1}:\n\n\u03c4st(xs, xt) = [ 1 \u2212 \u03c4s \u2212 \u03c4t + \u03c4st ,  \u03c4t \u2212 \u03c4st ;  \u03c4s \u2212 \u03c4st ,  \u03c4st ]    \u03c4s := E\u03c4st[Xs]    \u03c4st := E\u03c4st[XsXt]    (4)\n\nWe let E\u03c4st[\u00b7] denote expectation with respect to \u03c4st(xs, xt), so that \u03c4st is the probability that\nXs = Xt = 1. This normalized matrix is attractive, satisfying eq. (3), if and only if \u03c4st \u2265 \u03c4s\u03c4t.\nFor binary variables, the pairwise MRF of eq. (1) provides one representation of a general,\ninhomogeneous Ising model. In the statistical physics literature, Ising models are typically expressed\nby coupling random spins zs \u2208 {\u22121, +1} with symmetric potentials log \u03c8st(zs, zt) = \u03b8st zs zt. The\nattractiveness condition of eq. (3) then becomes \u03b8st \u2265 0, and the resulting model has ferromagnetic\ninteractions. Furthermore, pairwise MRFs satisfy the regularity condition of [10], and thus allow\ntractable MAP estimation via graph cuts [5], if and only if they are attractive. 
Even for attractive\nmodels, however, calculation of the partition function in non\u2013planar graphs is #P\u2013complete [8].\n\nTo define families of higher\u2013order attractive potentials, we first consider a probability distribution\n\u03c4c(xc) on k = |c| binary variables. Generalizing eq. (4), we parameterize such distributions by the\nfollowing collection of 2^k \u2212 1 mean parameters:\n\n\u03c4a := E\u03c4c[\u220f_{s\u2208a} Xs],    \u2205 \u2260 a \u2286 c    (5)\n\nFor example, \u03c4stu(xs, xt, xu) would be parameterized by {\u03c4s, \u03c4t, \u03c4u, \u03c4st, \u03c4su, \u03c4tu, \u03c4stu}. For any\nsubset a \u2286 c, we then define the following central moment statistic:\n\n\u03baa := E\u03c4c[\u220f_{s\u2208a} (Xs \u2212 \u03c4s)],    \u2205 \u2260 a \u2286 c    (6)\n\nNote that \u03bas = 0, while \u03bast = Cov\u03c4(Xs, Xt) = \u03c4st \u2212 \u03c4s\u03c4t. The third\u2013order central moment then\nequals the cumulant \u03bastu = \u03c4stu \u2212 \u03c4st\u03c4u \u2212 \u03c4su\u03c4t \u2212 \u03c4tu\u03c4s + 2\u03c4s\u03c4t\u03c4u.\nGiven these definitions, we say that a probability distribution \u03c4c(xc) is attractive if the central\nmoments associated with all subsets a \u2286 c of binary variables are non\u2013negative (\u03baa \u2265 0). Similarly, a\ncompatibility function \u03c8c(xc) is attractive if the probability distribution attained by normalizing its\nvalues has non\u2013negative central moments. For example, the following potential is easily shown to\nsatisfy this condition for all degrees k = |c|, and any scalar \u03b8c > 0:\n\nlog \u03c8c(x1, . . . , xk) = \u03b8c if x1 = x2 = \u00b7 \u00b7 \u00b7 = xk, and \u2212\u03b8c otherwise    (7)\n\n2.2 Belief Propagation and the Bethe Variational Principle\nMany applications of graphical models require estimates of the posterior marginal distributions of\nindividual variables \u03c4s(xs) or factors \u03c4c(xc). 
Loopy belief propagation (BP) approximates these\nmarginals via a series of messages passed among nodes of the graphical model [14, 23]. Let \u0393(s)\ndenote the set of factors which depend on Xs, or equivalently the neighbors of node s in the\ncorresponding factor graph. The BP algorithm then iterates the following message updates:\n\nm\u0304sc(xs) \u2190 \u03c8s(xs) \u220f_{d\u2208\u0393(s)\\c} mds(xs)        mcs(xs) \u2190 \u2211_{xc\\s} \u03c8c(xc) \u220f_{t\u2208c\\s} m\u0304tc(xt)    (8)\n\nThe left\u2013hand expression updates the message m\u0304sc(xs) passed from variable node s to factor c. New\noutgoing messages mcs(xs) from factor c to each s \u2208 c are then determined by marginalizing the\nincoming messages from other nodes. At any iteration, appropriately normalized products of these\nmessages define estimates of the desired marginals:\n\n\u03c4s(xs) \u221d \u03c8s(xs) \u220f_{c\u2208\u0393(s)} mcs(xs)        \u03c4c(xc) \u221d \u03c8c(xc) \u220f_{t\u2208c} m\u0304tc(xt)    (9)\n\nIn tree\u2013structured graphs, BP defines a dynamic programming recursion which converges to the\nexact marginals after finitely many iterations [11, 14]. In graphs with cycles, however, convergence\nis not guaranteed, and pseudo\u2013marginals computed via eq. (9) are (often good) approximations.\n\nA wide range of inference algorithms can be derived via variational approximations [9] to the true\npartition function. 
Loopy BP is implicitly associated with the following Bethe approximation:\n\nlog Z\u03b2(\u03c8; \u03c4) = \u2211_{s\u2208V} \u2211_{xs} \u03c4s(xs) log \u03c8s(xs) + \u2211_{c\u2208C} \u2211_{xc} \u03c4c(xc) log \u03c8c(xc)\n    \u2212 \u2211_{s\u2208V} \u2211_{xs} \u03c4s(xs) log \u03c4s(xs) \u2212 \u2211_{c\u2208C} \u2211_{xc} \u03c4c(xc) log [\u03c4c(xc) / \u220f_{t\u2208c} \u03c4t(xt)]    (10)\n\nFixed points of loopy BP correspond to stationary points of this Bethe approximation [23], subject\nto the local marginalization constraints \u2211_{xc\\s} \u03c4c(xc) = \u03c4s(xs).\n\n3 Reparameterization and Loop Series Expansions\nAs discussed in Sec. 2.2, any BP fixed point is in one\u2013to\u2013one correspondence with a set {\u03c4s, \u03c4c}\nof pseudo\u2013marginals associated with each of the graph\u2019s nodes s \u2208 V and factors c \u2208 C. These\npseudo\u2013marginals then lead to an alternative parameterization [20] of the factor graph of eq. (2):\n\np(x) = (1/Z(\u03c4)) \u220f_{s\u2208V} \u03c4s(xs) \u220f_{c\u2208C} [\u03c4c(xc) / \u220f_{t\u2208c} \u03c4t(xt)]    (11)\n\nFor pairwise MRFs, the reparameterized compatibility functions equal \u03c4st(xs, xt)/(\u03c4s(xs)\u03c4t(xt)).\nThe BP algorithm effectively searches for reparameterizations which are tree\u2013consistent, so that\n\u03c4c(xc) is the exact marginal distribution of Xc for any tree (or forest) embedded in the original\ngraph [20]. In later sections, we take expectations with respect to \u03c4c(xc) of functions f(xc)\ndefined over individual factors. Although these pseudo\u2013marginals will in general not equal the true\nmarginals pc(xc), BP fixed points ensure local consistency so that E\u03c4c[f(Xc)] is well\u2013defined.\nUsing eq. (10), it is easily shown that the Bethe approximation Z\u03b2(\u03c4; \u03c4) = 1 for any joint\ndistribution defined by reparameterized potentials as in eq. (11). 
For simplicity, the remainder of this paper\nfocuses on reparameterized models of this form, and analyzes properties of the corresponding exact\npartition function Z(\u03c4). The resulting expansions and bounds are then related to the original MRF\u2019s\npartition function via the positive constant Z(\u03c8)/Z(\u03c4) = Z\u03b2(\u03c8; \u03c4) of eq. (10).\nRecently, Chertkov and Chernyak proposed a finite loop series expansion [2] of the partition\nfunction, whose first term coincides with the Bethe approximation. They provide two derivations: one\napplies a trigonometric identity to Fourier representations of binary variables, while the second\nemploys a saddle point approximation obtained via an auxiliary field of complex variables. The gauge\ntransformations underlying these derivations are a type of reparameterization, but their form is\ncomplicated by auxiliary variables and extraneous degrees of freedom. In this section, we show that the\nfixed point characterization of eq. (11) leads to a more direct, and arguably simpler, derivation.\n\n3.1 Pairwise Loop Series Expansions\nWe begin by developing a loop series expansion for pairwise MRFs. Given an undirected graph\nG = (V, E), and some subset F \u2286 E of the graph\u2019s edges, let ds(F) denote the degree (number of\nneighbors) of node s in the subgraph induced by F. As illustrated in Fig. 1, any subset F for which\nall nodes s \u2208 V have degree ds(F) \u2260 1 defines a generalized loop [2]. The partition function for\nany binary, pairwise MRF can then be expanded via an associated set of loop corrections.\nProposition 1. Consider a pairwise MRF defined on an undirected graph G = (V, E), with\nreparameterized potentials as in eq. (11). 
The associated partition function then equals\n\nZ(\u03c4) = 1 + \u2211_{\u2205\u2260F\u2286E} \u03b2F \u220f_{s\u2208V} E\u03c4s[(Xs \u2212 \u03c4s)^{ds(F)}]        \u03b2F := \u220f_{(s,t)\u2208F} \u03b2st    (12)\n\n\u03b2st := (\u03c4st \u2212 \u03c4s\u03c4t) / (\u03c4s(1 \u2212 \u03c4s)\u03c4t(1 \u2212 \u03c4t)) = Cov\u03c4st(Xs, Xt) / (Var\u03c4s(Xs) Var\u03c4t(Xt))    (13)\n\nwhere only generalized loops F lead to non\u2013zero terms in the sum of eq. (12), and\n\nE\u03c4s[(Xs \u2212 \u03c4s)^d] = \u03c4s(1 \u2212 \u03c4s)[(1 \u2212 \u03c4s)^{d\u22121} + (\u22121)^d (\u03c4s)^{d\u22121}]    (14)\n\nare central moments of the binary variables at individual nodes.\n\nProof. To establish the expansion of eq. (12), we exploit the following polynomial representation of\nreparameterized pairwise compatibility functions:\n\n\u03c4st(xs, xt) / (\u03c4s(xs)\u03c4t(xt)) = 1 + \u03b2st(xs \u2212 \u03c4s)(xt \u2212 \u03c4t)    (15)\n\nAs verified in [17], this expression is satisfied for any (xs, xt) \u2208 {0, 1}^2 if \u03b2st is defined as in\neq. (13). For attractive models satisfying eq. (3), \u03b2st \u2265 0 for all edges. Using E\u03c4\u0303[\u00b7] to denote\nexpectation with respect to the fully factorized distribution \u03c4\u0303(x) = \u220f_s \u03c4s(xs), we then have\n\nZ(\u03c4) = \u2211_{x\u2208{0,1}^n} \u220f_{s\u2208V} \u03c4s(xs) \u220f_{(s,t)\u2208E} [\u03c4st(xs, xt) / (\u03c4s(xs)\u03c4t(xt))]\n     = E\u03c4\u0303[\u220f_{(s,t)\u2208E} \u03c4st(Xs, Xt) / (\u03c4s(Xs)\u03c4t(Xt))] = E\u03c4\u0303[\u220f_{(s,t)\u2208E} (1 + \u03b2st(Xs \u2212 \u03c4s)(Xt \u2212 \u03c4t))]    (16)\n\nExpanding this polynomial via the expectation operator\u2019s linearity, we recover one term for each\nnon\u2013empty subset F \u2286 E of the graph\u2019s edges:\n\nZ(\u03c4) = 1 + \u2211_{\u2205\u2260F\u2286E} E\u03c4\u0303[\u220f_{(s,t)\u2208F} \u03b2st(Xs \u2212 \u03c4s)(Xt \u2212 \u03c4t)]    (17)\n\nThe expression in eq. 
(12) then follows from the independence structure of \u03c4\u0303(x), and standard\nformulas for the moments of Bernoulli random variables. To evaluate these terms, note that if\nds(F) = 1, it follows that E\u03c4s[Xs \u2212 \u03c4s] = 0. There is thus one loop correction for each generalized\nloop F, in which all connected nodes have degree at least two.\n\nFigure 1: A pairwise MRF coupling ten binary variables (left), and the nine generalized loops in its loop series\nexpansion (right). For attractive potentials, two of the generalized loops may have negative signs (second &\nthird from right), while the core graph of Thm. 1 contains eight variables (far right).\n\nFigure 1 illustrates the set of generalized loops associated with a particular pairwise MRF. These\nloops effectively define corrections to the Bethe estimate Z(\u03c4) \u2248 1 of the partition function for\nreparameterized models. Tree\u2013structured graphs do not contain any non\u2013trivial generalized loops,\nand the Bethe variational approximation is thus exact.\n\nThe loop expansion formulas of [2] can be precisely recovered by transforming binary variables to\na spin representation, and refactoring terms from the denominator of edge weights \u03b2st to adjacent\nvertices. Explicit computation of these loop corrections is in general intractable; for example, fully\nconnected graphs with n \u2265 5 nodes have more than 2^n generalized loops. In some cases, accounting\nfor a small set of significant loop corrections may lead to improved approximations to Z(\u03c8) [4], or\nmore accurate belief estimates for LDPC codes [1]. We instead use the series expansion of Prop. 
1\nto establish analytic properties of BP fixed points.\n\n3.2 Factor Graph Loop Series Expansions\nWe now extend the loop series expansion to higher\u2013order MRFs defined on hypergraphs G = (V, C).\nLet E = {(s, c) | c \u2208 C, s \u2208 c} denote the set of edges in the factor graph representation of this\nMRF. As illustrated in Fig. 2, we define a generalized loop to be a subset F \u2286 E of edges such that\nall connected factor and variable nodes have degree at least two.\nProposition 2. Consider any factor graph G = (V, C) with reparameterized potentials as in\neq. (11), and associated edges E. The partition function then equals\n\nZ(\u03c4) = 1 + \u2211_{\u2205\u2260F\u2286E} \u03b2F \u220f_{s\u2208V} E\u03c4s[(Xs \u2212 \u03c4s)^{ds(F)}]        \u03b2F := \u220f_{c\u2208C} \u03b2_{ac(F)}    (18)\n\n\u03b2a := \u03baa / \u220f_{t\u2208a} \u03c4t(1 \u2212 \u03c4t) = E\u03c4c[\u220f_{s\u2208a} (Xs \u2212 \u03c4s)] / \u220f_{t\u2208a} Var\u03c4t(Xt)    (19)\n\nwhere ac(F) := {s \u2208 c | (s, c) \u2208 F} denotes the subset of variables linked to factor node c by the\nedges in F. Only generalized loops F lead to non\u2013zero terms in the sum of eq. (18).\nProof. As before, we employ a polynomial representation of the reparameterized factors in eq. (11):\n\n\u03c4c(xc) / \u220f_{t\u2208c} \u03c4t(xt) = 1 + \u2211_{a\u2286c, |a|\u22652} \u03b2a \u220f_{s\u2208a} (xs \u2212 \u03c4s)    (20)\n\nFor factor graphs with attractive reparameterized potentials, the constant \u03b2a \u2265 0 for all a \u2286 c.\nNote that this representation, which is derived in [17], reduces to that of eq. (15) when c = {s, t}.\nSingle\u2013variable subsets are excluded in eq. (20) because \u03bas = E\u03c4s[Xs \u2212 \u03c4s] = 0.\nApplying eq. (20) as in our earlier derivation for pairwise MRFs (see eq. 
(16)), we may express the\npartition function of the reparameterized factor graph as follows:\n\nZ(\u03c4) = E\u03c4\u0303[\u220f_{c\u2208C} \u03c4c(Xc) / \u220f_{t\u2208c} \u03c4t(Xt)] = E\u03c4\u0303[\u220f_{c\u2208C} (1 + \u2211_{\u2205\u2260a\u2286c} \u03b2a \u220f_{s\u2208a} (Xs \u2212 \u03c4s))]    (21)\n\nNote that \u03b2a = 0 for any subset where |a| = 1. There is then a one\u2013to\u2013one correspondence between\nvariable node subsets a \u2286 c, and subsets {(s, c) | s \u2208 a} of the factor graph\u2019s edges E. Expanding\nthis expression by F \u2286 E, it follows that each factor c \u2208 C contributes a term corresponding to the\nchosen subset ac(F) of its edges:\n\nZ(\u03c4) = 1 + \u2211_{\u2205\u2260F\u2286E} E\u03c4\u0303[\u220f_{c\u2208C} \u03b2_{ac(F)} \u220f_{s\u2208ac(F)} (Xs \u2212 \u03c4s)]    (22)\n\nNote that \u03b2\u2205 = 1. Equation (18) then follows from the independence properties of \u03c4\u0303(x). For a term\nin this loop series to be non\u2013zero, there must be no degree one variables, since E\u03c4s[Xs \u2212 \u03c4s] = 0.\nIn addition, the definition of \u03b2a implies that there can be no degree one factor nodes.\n\nFigure 2: A factor graph (left) with three binary variables (circles) and four factor nodes (squares), and the\nthirteen generalized loops in its loop series expansion (right, along with the full graph).\n\n4 Lower Bounds in Attractive Binary Models\nThe Bethe approximation underlying loopy BP differs from mean field methods [9], which lower\nbound the true log partition function log Z(\u03c8), in two key ways. First, while the Bethe entropy (second\nline of eq. (10)) is exact for tree\u2013structured graphs, it approximates (rather than bounds) the true\nentropy in graphs with cycles. Second, the marginalization condition imposed by loopy BP relaxes\n(rather than strengthens) the global constraints characterizing valid distributions [21]. 
Nevertheless,\nwe now show that for a large family of attractive graphical models, the Bethe approximation\nZ\u03b2(\u03c8; \u03c4) of eq. (10) lower bounds Z(\u03c8). In contrast with mean field methods, these bounds hold\nonly at appropriate BP fixed points, not for arbitrarily chosen pseudo\u2013marginals \u03c4c(xc).\n\n4.1 Partition Function Bounds for Pairwise Graphical Models\nConsider a pairwise MRF defined on G = (V, E), as in eq. (1). Let VH \u2286 V denote the set of\nnodes which either belong to some cycle in G, or lie on a path (sequence of edges) connecting two\ncycles. We then define the core graph H = (VH, EH) as the node\u2013induced subgraph obtained by\ndiscarding edges from nodes outside VH, so that EH = {(s, t) \u2208 E | s, t \u2208 VH}. The unique core\ngraph H underlying any graph G can be efficiently constructed by iteratively pruning degree one\nnodes, or leaves, until all remaining nodes have two or more neighbors. The following theorem\nidentifies conditions under which all terms in the loop series expansion must be non\u2013negative.\nTheorem 1. Let H = (VH, EH) be the core graph for a pairwise binary MRF, with attractive\npotentials satisfying eq. (3). Consider any BP fixed point for which all nodes s \u2208 VH with three or\nmore neighbors in H have marginals \u03c4s \u2264 1/2 (or equivalently, \u03c4s \u2265 1/2). The corresponding Bethe\nvariational approximation Z\u03b2(\u03c8; \u03c4) then lower bounds the true partition function Z(\u03c8).\n\nProof. It is sufficient to show that Z(\u03c4) \u2265 1 for any reparameterized pairwise MRF, as in eq. (11).\nFrom eq. (9), note that loopy BP estimates the pseudo\u2013marginal \u03c4st(xs, xt) via the product of\n\u03c8st(xs, xt) with message functions of single variables. 
For this reason, attractive pairwise\ncompatibilities always lead to BP fixed points with attractive pseudo\u2013marginals satisfying \u03c4st \u2265 \u03c4s\u03c4t.\nConsider the pairwise loop series expansion of eq. (12). As shown by eq. (13), attractive models\nlead to edge weights \u03b2st \u2265 0. It is thus sufficient to show that \u220f_s E\u03c4s[(Xs \u2212 \u03c4s)^{ds(F)}] \u2265 0 for\neach generalized loop F \u2286 E. Suppose first that the graph has a single cycle, and thus exactly one\nnon\u2013zero generalized loop F. Because all connected nodes in this cycle have degree two, the bound\nfollows because E\u03c4s[(Xs \u2212 \u03c4s)^2] \u2265 0. More generally, we clearly have Z(\u03c4) \u2265 1 in graphs where\nevery generalized loop F associates an even number of neighbors ds(F) with each node.\nFocusing on generalized loops containing nodes with odd degree d \u2265 3, eq. (14) implies that\nE\u03c4s[(Xs \u2212 \u03c4s)^d] \u2265 0 for marginals satisfying 1 \u2212 \u03c4s \u2265 \u03c4s. For BP fixed points in which \u03c4s \u2264 1/2\nfor all nodes, we thus have Z(\u03c4) \u2265 1. In particular, the symmetric fixed point \u03c4s = 1/2 leads to\nuniformly positive generalized loop corrections. More generally, the marginals of nodes s for which\nds(F) \u2264 2 for every generalized loop F do not influence the expansion\u2019s positivity. Theorem 1\ndiscards these nodes by examining the topology of the core graph H (see Fig. 1 for an example).\nFor fixed points where \u03c4s \u2265 1/2 for all nodes, we rewrite the polynomial in the loop expansion of\neq. (15) as (1 + \u03b2st(\u03c4s \u2212 xs)(\u03c4t \u2212 xt)), and employ an analogous line of reasoning.\nIn addition to establishing Thm. 1, our arguments show that the true partition function monotonically\nincreases as additional edges, with attractive reparameterized potentials as in eq. 
(11), are added to\na graph with fixed pseudo\u2013marginals \u03c4s \u2264 1/2. For such models, the accumulation of particular\nloop corrections, as explored by [4], produces a sequence of increasingly tight bounds on Z(\u03c8). In\naddition, we note that the conditions required by Thm. 1 are similar to those underlying classical\ncorrelation inequalities [16] from the statistical physics literature. Indeed, the Griffiths\u2013Kelly\u2013Sherman\n(GKS) inequality leads to an alternative proof in cases where \u03c4s = 1/2 for all nodes.\n\nFor attractive Ising models in which some nodes have marginals \u03c4s > 1/2 and others \u03c4t < 1/2, the loop\nseries expansion may contain negative terms. For small graphs like that in Fig. 1, it is possible to\nuse upper bounds on the edge weights \u03b2st, which follow from \u03c4st \u2264 min(\u03c4s, \u03c4t), to cancel negative\nloop corrections with larger positive terms. As confirmed by the empirical results in Sec. 4.3, the\nlower bound Z(\u03c8) \u2265 Z\u03b2(\u03c8; \u03c4) thus continues to hold for many (perhaps all) attractive Ising models\nwith less homogeneous marginal biases.\n\n4.2 Partition Function Bounds for Factor Graphs\nGiven a factor graph G = (V, C) relating binary variables, define a core graph H = (VH, CH) by\nexcluding variable and factor nodes which are not members of any generalized loops. As in Sec. 2.2,\nlet \u0393(s) denote the set of factor nodes neighboring variable node s in the core graph H.\nTheorem 2. 
Let H = (VH, CH) be the core graph for a binary factor graph, and consider an\nattractive BP fixed point for which one of the following conditions holds:\n\n(i) \u03c4s \u2264 1/2 for all nodes s \u2208 VH with |\u0393(s)| \u2265 3, and \u03baa \u2265 0 for all a \u2286 c, c \u2208 CH.\n(ii) \u03c4s \u2265 1/2 for all nodes s \u2208 VH with |\u0393(s)| \u2265 3, and (\u22121)^{|a|} \u03baa \u2265 0 for all a \u2286 c, c \u2208 CH.\n\nThe Bethe approximation Z\u03b2(\u03c8; \u03c4) then lower bounds the true partition function Z(\u03c8).\n\nFor the case where \u03c4s \u2264 1/2, the proof of this theorem is a straightforward generalization of the\narguments in Sec. 4.1. When \u03c4s \u2265 1/2, we replace all (xs \u2212 \u03c4s) terms by (\u03c4s \u2212 xs) in the expansion\nof eq. (20), and again recover uniformly positive loop corrections.\n\nFor any given BP fixed point, the conditions of Thm. 2 are easy to verify. For factor graphs, it is\nmore challenging to determine which compatibility functions \u03c8c(xc) necessarily lead to attractive\nfixed points. For symmetric potentials as in eq. (7), however, one can show that the conditions on\n\u03baa, a \u2286 c are necessarily satisfied whenever all variable nodes s \u2208 VH have the same bias.\n\n4.3 Empirical Comparison of Mean Field and Bethe Lower Bounds\nIn this section, we compare the accuracy of the Bethe variational bounds established by Thm. 1\nto those produced by a naive, fully factored mean field approximation [3, 9]. Using the\nspin representation zs \u2208 {\u22121, +1}, we examine Ising models with attractive pairwise potentials\nlog \u03c8st(zs, zt) = \u03b8st zs zt of varying strengths \u03b8st \u2265 0. We first examine a 2D torus, with potentials\nof uniform strength \u03b8st = \u03b8\u0304 and no local observations. For such MRFs, the exact partition\nfunction may be computed via Onsager\u2019s classical eigenvector method [13]. As shown in Fig. 
3(a), for\nmoderate \u03b8\u0304 the Bethe bound Z\u03b2(\u03c8; \u03c4) is substantially tighter than mean field. For large \u03b8\u0304, only two\nstates (all spins \u201cup\u201d or \u201cdown\u201d) have significant probability, so that Z(\u03c8) \u2248 2 exp(\u03b8\u0304|E|). In this\nregime, loopy BP exhibits \u201csymmetry breaking\u201d [6], and converges to one of these states at random\nwith corresponding bound Z\u03b2(\u03c8; \u03c4) \u2248 exp(\u03b8\u0304|E|). As verified in Fig. 3(a), as \u03b8\u0304 \u2192 \u221e the difference\nlog Z(\u03c8) \u2212 log Z\u03b2(\u03c8; \u03c4) \u2248 log 2 \u2248 0.69 thus remains bounded.\nWe also consider a set of random 10 \u00d7 10 nearest\u2013neighbor grids, with inhomogeneous pairwise\npotentials sampled according to |\u03b8st| \u223c N(0, \u03b8\u0304^2), and observation potentials log \u03c8s(zs) = \u03b8s zs,\n|\u03b8s| \u223c N(0, 0.1^2). For each candidate \u03b8\u0304, we sample 100 random MRFs, and plot the average\ndifference log Z\u03b2(\u03c8; \u03c4) \u2212 log Z(\u03c8) between the true partition function and the BP (or mean field) fixed\npoint reached from a random initialization. Fig. 3(b) first considers MRFs where \u03b8s > 0 for all\nnodes, so that the conditions of Thm. 1 are satisfied for all BP fixed points. For these models, the\nBethe bound is extremely accurate.\nIn Fig. 3(c), we also consider MRFs where the observation\npotentials \u03b8s are of mixed signs. Although this sometimes leads to BP fixed points with negative\nassociated loop corrections, the Bethe variational approximation nevertheless always lower bounds\nthe true partition function in these examples. 
We hypothesize that this bound in fact holds for all attractive, binary pairwise MRFs, regardless of the observation potentials.\n\n[Figure 3, panels (a)\u2013(c). Vertical axis: Difference from True Log Partition. Horizontal axis: Edge Strength. Curves: Belief Propagation, Mean Field.]\n\nFigure 3: Bethe (dark blue, top) and naive mean field (light green, bottom) lower bounds on log Z(\u03c8) for three families of attractive, pairwise Ising models. (a) 30 \u00d7 30 torus with no local observations and homogeneous potentials. (b) 10 \u00d7 10 grid with random, inhomogeneous potentials and all pseudo\u2013marginals \u03c4s > 1/2, satisfying the conditions of Thm. 1. (c) 10 \u00d7 10 grid with random, inhomogeneous potentials and pseudo\u2013marginals of mixed biases. Empirically, the Bethe lower bound also holds for these models.\n\n5 Discussion\nWe have provided an alternative, direct derivation of the partition function\u2019s loop series expansion, based on the reparameterization characterization of BP fixed points. We use this expansion to prove that the Bethe approximation lower bounds the true partition function in a family of binary attractive models. 
These results have potential implications for the suitability of loopy BP in approximate parameter estimation [3], as well as its convergence dynamics. We are currently exploring generalizations of our results to other families of attractive, or \u201cnearly\u201d attractive, graphical models.\nAcknowledgments The authors thank Yair Weiss for suggesting connections to loop series expansions, and helpful conversations. Funding provided by Army Research Office Grant W911NF-05-1-0207, National Science Foundation Grant DMS-0528488, and NSF Career Grant CCF-0545862.\nReferences\n[1] M. Chertkov and V. Y. Chernyak. Loop calculus helps to improve belief propagation and linear programming decodings of low density parity check codes. In Allerton Conf., 2006.\n[2] M. Chertkov and V. Y. Chernyak. Loop series for discrete statistical models on graphs. J. Stat. Mech., 2006:P06009, June 2006.\n[3] B. J. Frey and N. Jojic. A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Trans. PAMI, 27(9):1392\u20131416, Sept. 2005.\n[4] V. G\u00f3mez, J. M. Mooij, and H. J. Kappen. Truncating the loop series expansion for BP. JMLR, 8:1987\u20132016, 2007.\n[5] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. J. R. Stat. Soc. B, 51(2):271\u2013279, 1989.\n[6] T. Heskes. On the uniqueness of loopy belief propagation fixed points. Neural Comp., 16:2379\u20132413, 2004.\n[7] A. T. Ihler, J. W. Fisher, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. JMLR, 6:905\u2013936, 2005.\n[8] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM J. Comput., 22(5):1087\u20131116, Oct. 1993.\n[9] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183\u2013233, 1999.\n[10] V. 
Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Trans. PAMI, 26(2):147\u2013159, Feb. 2004.\n[11] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum\u2013product algorithm. IEEE Trans. IT, 47(2):498\u2013519, Feb. 2001.\n[12] J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of loopy belief propagation. In UAI 21, pages 396\u2013403. AUAI Press, 2005.\n[13] L. Onsager. Crystal statistics I: A two\u2013dimensional model with an order\u2013disorder transition. Physical Review, 65:117\u2013149, 1944.\n[14] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.\n[15] T. J. Richardson and R. L. Urbanke. The capacity of low-density parity-check codes under message-passing decoding. IEEE Trans. IT, 47(2):599\u2013618, Feb. 2001.\n[16] S. B. Shlosman. Correlation inequalities and their applications. J. Math. Sci., 15(2):79\u2013101, Jan. 1981.\n[17] E. B. Sudderth, M. J. Wainwright, and A. S. Willsky. Loop series and Bethe variational bounds in attractive graphical models. UC Berkeley, EECS department technical report, in preparation, 2008.\n[18] M. F. Tappen and W. T. Freeman. Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In ICCV, volume 2, pages 900\u2013907, 2003.\n[19] S. C. Tatikonda and M. I. Jordan. Loopy belief propagation and Gibbs measures. In UAI 18, pages 493\u2013500. Morgan Kaufmann, 2002.\n[20] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree\u2013based reparameterization framework for analysis of sum\u2013product and related algorithms. IEEE Trans. IT, 49(5):1120\u20131146, May 2003.\n[21] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. IT, 51(7):2313\u20132335, July 2005.\n[22] Y. Weiss. 
Comparing the mean field method and belief propagation for approximate inference in MRFs. In D. Saad and M. Opper, editors, Advanced Mean Field Methods. MIT Press, 2001.\n[23] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. IEEE Trans. IT, 51(7):2282\u20132312, July 2005.", "award": [], "sourceid": 1077, "authors": [{"given_name": "Alan", "family_name": "Willsky", "institution": null}, {"given_name": "Erik", "family_name": "Sudderth", "institution": null}, {"given_name": "Martin", "family_name": "Wainwright", "institution": null}]}