{"title": "Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy", "book": "Advances in Neural Information Processing Systems", "page_first": 359, "page_last": 366, "abstract": null, "full_text": "Stable Fixed Points of Loopy Belief Propagation Are Minima of the Bethe Free Energy\n\nTom Heskes\n\nSNN, University of Nijmegen\n\nGeert Grooteplein 21, 6525 EZ, Nijmegen, The Netherlands\n\nAbstract\n\nWe extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications.\n\n1 Introduction\n\nPearl\u2019s belief propagation [1] is a popular algorithm for inference in Bayesian networks. It is exact in special cases, e.g., for tree-structured (singly-connected) networks with just Gaussian or just discrete nodes. But also on networks containing cycles, so-called loopy belief propagation often leads to good performance (approximate marginals close to exact marginals) [2]. The notion that fixed points of loopy belief propagation correspond to extrema of the so-called Bethe free energy [3] has been an important step in the theoretical understanding of this success. Empirically it has further been observed that loopy belief propagation, when it converges, converges to a minimum. 
The main goal of this article is to understand why.\n\nIn Section 2 we will introduce loopy belief propagation in terms of a sum-product algorithm on factor graphs [4]. The corresponding Bethe free energy is derived in Section 3 from a variational point of view, indicating that we should be particularly interested in minima. In Section 4 we show that minimization of the Bethe free energy under the appropriate constraints is equivalent to an unconstrained saddle-point problem. The converging double-loop algorithm, described in Section 3, as well as the standard sum-product algorithm are in fact attempts to solve this saddle-point problem. More specifically, (a damped version of) the sum-product algorithm has the same local stability properties as a gradient descent-ascent procedure. Stability analysis of this gradient descent-ascent procedure then leads to the conclusion in the title. With an example we illustrate that the converse need not be the case. In Section 5 we discuss further implications and relations to other studies.\n\n[Figure 1, panel (a): graphical model of $P(x_1, \\ldots, x_n) \\propto \\exp\\left[\\sum_{ij} w_{ij} x_i x_j + \\sum_i \\theta_i x_i\\right]$ on the nodes $x_1, \\ldots, x_4$. Panel (b): factor graph with variable nodes 1 to 4, pairwise factor nodes (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), and potentials $\\Psi_{ij}(x_i, x_j) = \\exp\\left[w_{ij} x_i x_j + \\frac{1}{n-1} \\theta_i x_i + \\frac{1}{n-1} \\theta_j x_j\\right]$.]\n\nFigure 1: A Boltzmann machine. (a) Graphical representation of the probability distribution. (b) Corresponding factor graph with a factor for each pair of nodes.\n\n2 The sum-product algorithm on factor graphs\n\nWe start with a description of (loopy) belief propagation as the sum-product algorithm on factor graphs [4]. 
We assume that the probability distribution over (disjoint subsets of) variables $x_\\beta$ factorizes over \"factors\" $\\Psi_\\alpha(X_\\alpha)$:\n\n$$P(x_1, \\ldots, x_\\beta, \\ldots, x_N) = \\frac{1}{Z} \\prod_\\alpha \\Psi_\\alpha(X_\\alpha) , \\qquad (1)$$\n\nwith $Z$ a proper normalization constant. We will use notation similar to [4]: uppercase $X_\\alpha$ for the factors (\"local function nodes\") and lowercase $x_\\beta$ for the variables. $\\beta \\subset \\alpha$ means that $x_\\beta$ is a neighbor of $X_\\alpha$ in the factor graph, i.e., is included in the potential $\\Psi_\\alpha(X_\\alpha)$. An example of the transformation of a Markov network into a factor graph is shown in Figure 1. In a similar manner one can transform Bayesian networks into factor graphs, where each factor contains the child and its parents [4].\n\nOn singly-connected structures, Pearl\u2019s belief propagation algorithm [1] can be applied to compute the exact marginals (\"beliefs\")\n\n$$P(X_\\alpha) = \\sum_{X \\setminus \\alpha} P(X) \\quad \\text{and} \\quad P(x_\\beta) = \\sum_{X \\setminus \\beta} P(X) .$$\n\nIf the structure contains cycles, one can still apply (loopy) belief propagation, in an attempt to obtain accurate approximations $P_\\alpha(X_\\alpha)$ and $P_\\beta(x_\\beta)$.\n\nPseudo-code for the sum-product algorithm is given in Algorithm 1. In the factor-graph representation we distinguish messages from factor $\\alpha$ to variable $\\beta$, $\\mu_{\\alpha \\to \\beta}(x_\\beta)$, and vice versa, $\\mu_{\\beta \\to \\alpha}(x_\\beta)$. The beliefs follow by multiplying the potential, a mere 1 for the variables and $\\Psi_\\alpha(X_\\alpha)$ for the factors, with the incoming messages, see (1.2) and (1.3) in Algorithm 1. 
The update for an outgoing message is the variable belief, either calculated with the definition (1.2) or through the marginalization (1.6), divided by the incoming message, see (1.4) and (1.5).\n\nWe interpret the update of the factor-variable message $\\mu_{\\alpha \\to \\beta}$ in line 8 of Algorithm 1 as the only actual update: beliefs and variable-factor messages directly follow from definitions in lines 11 to 15. For later reference we introduce the damped update\n\n$$\\log \\mu^{\\mathrm{new}}_{\\alpha \\to \\beta}(x_\\beta) = \\log \\mu_{\\alpha \\to \\beta}(x_\\beta) + \\epsilon \\left[ \\log \\mu^{\\mathrm{full}}_{\\alpha \\to \\beta}(x_\\beta) - \\log \\mu_{\\alpha \\to \\beta}(x_\\beta) \\right] , \\qquad (2)$$\n\nwhere $\\mu^{\\mathrm{full}}$ refers to the result of the full update (1.5) and $\\mu$ to the previous message. These and other seemingly arbitrary choices, among which the particular ordering of updates, follow naturally from the analysis below. Besides, for the results on local stability we will consider the limit of small step sizes $\\epsilon$, where any effects of the ordering disappear. Last but not least, the description in Algorithm 1 is mainly pedagogical and can be made more efficient in several ways.\n\n 1: repeat\n 2:   for all variables $\\beta$ do\n 3:     for all factors $\\alpha \\supset \\beta$ do\n 4:       if initial then\n 5:         initialize message (1.1)\n 6:       else\n 7:         marginalize (1.6)\n 8:         update message (1.5)\n 9:       end if\n10:     end for\n11:     compute variable belief (1.2)\n12:     for all factors $\\alpha \\supset \\beta$ do\n13:       compute message (1.4)\n14:       compute factor belief (1.3)\n15:     end for\n16:   end for\n17: until convergence\n\nInitial messages:\n\n$$\\mu_{\\alpha \\to \\beta}(x_\\beta) = 1 \\qquad (1.1)$$\n\nBeliefs:\n\n$$P_\\beta(x_\\beta) = \\frac{1}{Z_\\beta} \\prod_{\\alpha \\supset \\beta} \\mu_{\\alpha \\to \\beta}(x_\\beta) \\qquad (1.2)$$\n\n$$P_\\alpha(X_\\alpha) = \\frac{1}{Z_\\alpha} \\Psi_\\alpha(X_\\alpha) \\prod_{\\beta \\subset \\alpha} \\mu_{\\beta \\to \\alpha}(x_\\beta) \\qquad (1.3)$$\n\nMessages:\n\n$$\\mu_{\\beta \\to \\alpha}(x_\\beta) = \\frac{P_\\beta(x_\\beta)}{\\mu_{\\alpha \\to \\beta}(x_\\beta)} \\qquad (1.4)$$\n\n$$\\mu_{\\alpha \\to \\beta}(x_\\beta) = \\frac{P_\\alpha(x_\\beta)}{\\mu_{\\beta \\to \\alpha}(x_\\beta)} \\qquad (1.5)$$\n\nwith\n\n$$P_\\alpha(x_\\beta) \\equiv \\sum_{X_\\alpha \\setminus \\beta} P_\\alpha(X_\\alpha) \\qquad (1.6)$$\n\nAlgorithm 1: The sum-product algorithm on factor graphs.\n\n3 The Bethe free energy\n\nThe exact distribution (1) can be written as the result of the variational problem\n\n$$P(X) = \\arg\\min_{\\hat P} \\sum_X \\hat P(X) \\log \\left[ \\frac{\\hat P(X)}{\\prod_\\alpha \\Psi_\\alpha(X_\\alpha)} \\right] , \\qquad (3)$$\n\nwhere here and in the following normalization and positivity constraints on probabilities are implicitly assumed. Next we confine our search to \"tree-like\" probability distributions of the form\n\n$$\\hat P(X) \\propto \\frac{\\prod_\\alpha P_\\alpha(X_\\alpha)}{\\prod_\\beta P_\\beta(x_\\beta)^{n_\\beta - 1}} \\quad \\text{with} \\quad n_\\beta \\equiv \\sum_{\\alpha \\supset \\beta} 1 , \\qquad (4)$$\n\nthe number of neighboring factors of variable $\\beta$. Here $P_\\alpha(X_\\alpha)$ and $P_\\beta(x_\\beta)$ are interpreted as (approximate) local marginals that should normalize to 1, but should also be consistent, i.e., obey\n\n$$\\forall_\\beta \\forall_{\\alpha \\supset \\beta} \\quad P_\\alpha(x_\\beta) = P_\\beta(x_\\beta) , \\qquad (5)$$\n\nwith $P_\\alpha(x_\\beta)$ as in (1.6). The denominator in (4) prevents double-counting. For singly-connected structures, it can be shown that the exact solution $P(X)$ is of this form, with proportionality constant equal to 1 and where $P_\\alpha(X_\\alpha) = P(X_\\alpha)$ and $P_\\beta(x_\\beta) = P(x_\\beta)$. For structures containing cycles, this need not be the case, but we can still assume it to be true approximately. 
Plugging (4) into the objective (3) and implementing the above assumptions, we obtain the Bethe free energy\n\n$$F(P) = \\sum_\\alpha \\sum_{X_\\alpha} P_\\alpha(X_\\alpha) \\log \\left[ \\frac{P_\\alpha(X_\\alpha)}{\\Psi_\\alpha(X_\\alpha)} \\right] - \\sum_\\beta (n_\\beta - 1) \\sum_{x_\\beta} P_\\beta(x_\\beta) \\log P_\\beta(x_\\beta) . \\qquad (6)$$\n\n 1: for all $\\alpha$ and $\\beta \\subset \\alpha$ do\n 2:   initialize (2.1)\n 3: end for\n 4: repeat\n 5:   for all factors $\\alpha$ do\n 6:     update potential (2.4)\n 7:     update variable belief (2.3)\n 8:   end for\n 9:   inner loop with (2.2) and (2.3)\n10: until convergence\n\nInitial messages and beliefs:\n\n$$\\mu_{\\beta \\to \\alpha}(x_\\beta) = 1 \\quad \\text{and} \\quad P_\\alpha(x_\\beta) = 1 \\qquad (2.1)$$\n\nBeliefs:\n\n$$P_\\beta(x_\\beta) = \\frac{1}{Z_\\beta} \\left[ \\prod_{\\alpha \\supset \\beta} \\mu_{\\alpha \\to \\beta}(x_\\beta) \\right]^{\\frac{1}{n_\\beta}} \\qquad (2.2)$$\n\n$$P_\\alpha(X_\\alpha) = \\frac{1}{Z_\\alpha} \\hat\\Psi_\\alpha(X_\\alpha) \\prod_{\\beta \\subset \\alpha} \\mu_{\\beta \\to \\alpha}(x_\\beta) \\qquad (2.3)$$\n\nPotential update:\n\n$$\\log \\hat\\Psi_\\alpha(X_\\alpha) = \\log \\Psi_\\alpha(X_\\alpha) + \\sum_{\\beta \\subset \\alpha} \\frac{n_\\beta - 1}{n_\\beta} \\log P^{\\mathrm{old}}_\\alpha(x_\\beta) \\qquad (2.4)$$\n\nAlgorithm 2: Double-loop algorithm for minimizing the Bethe free energy. The inner loop is Algorithm 1 with redefinitions of the factor and variable beliefs.\n\nMinus the Bethe free energy is an approximation of, but not a bound on, the log-likelihood $\\log Z$. A key observation in [3] is that the fixed points of the sum-product algorithm, described in the previous section, correspond to extrema of the Bethe free energy under the constraints (5).\n\nThe above derivation suggests that we should be specifically interested in minima of the Bethe free energy, not \"just\" stationary points. 
The resulting constrained minimization problem is well-defined (the Bethe free energy is bounded from below), but not necessarily convex, mainly because of the negative $P_\\beta \\log P_\\beta$ terms. The crucial trick, implicit or explicit in recently suggested procedures, is to bound [5] or clamp [6] the possibly concave part (outer loop: recompute the bound) and solve the remaining convex problem (inner loop: maximization with respect to Lagrange multipliers; see below). Here we propose to use the linear bound\n\n$$- \\sum_{x_\\beta} P_\\beta(x_\\beta) \\log P_\\beta(x_\\beta) \\le - \\sum_{x_\\beta} P_\\beta(x_\\beta) \\log P^{\\mathrm{old}}_\\beta(x_\\beta) , \\qquad (7)$$\n\nwith $P^{\\mathrm{old}}_\\beta(x_\\beta)$ from the result of the previous inner loop. The (convex) bound of the Bethe free energy then boils down to\n\n$$F_{\\mathrm{bound}}(P) = \\sum_\\alpha \\sum_{X_\\alpha} P_\\alpha(X_\\alpha) \\log \\left[ \\frac{P_\\alpha(X_\\alpha)}{\\hat\\Psi_\\alpha(X_\\alpha)} \\right] \\ge F(P) ,$$\n\nif we define $\\hat\\Psi_\\alpha$ as in (2.4). The outer loop corresponds to a reset of the bound, i.e., at the start of the inner loop we have $F_{\\mathrm{bound}}(P) = F(P)$. In the inner loop (see the next section for its derivation), we solve the remaining convex constrained minimization problem with the method of Lagrange multipliers. At the end of the inner loop, we then have $F(P^{\\mathrm{new}}) \\le F_{\\mathrm{bound}}(P^{\\mathrm{new}}) \\le F_{\\mathrm{bound}}(P) = F(P)$.\n\n4 Saddle-point problem\n\nIn this section we will translate the (non-convex) minimization of the Bethe free energy under linear constraints into an equivalent (non-convex/concave) saddle-point problem. 
We replace the bound (7) with an explicit minimization over auxiliary variables $\\gamma$ (see also [7]; an alternative interpretation is a Legendre transform):\n\n$$- \\sum_{x_\\beta} P_\\beta(x_\\beta) \\log P_\\beta(x_\\beta) = \\min_{\\gamma_\\beta} \\left\\{ - \\sum_{x_\\beta} \\gamma_\\beta(x_\\beta) P_\\beta(x_\\beta) + \\log \\left[ \\sum_{x_\\beta} e^{\\gamma_\\beta(x_\\beta)} \\right] \\right\\} . \\qquad (8)$$\n\nSubstitution into (6) then yields a constrained minimization problem, where the minimization is w.r.t. $\\{P_\\alpha, P_\\beta, \\gamma_\\beta\\}$ under constraints (5). Using (any other convex combination will work as well, but this symmetric one is most convenient)\n\n$$P_\\beta(x_\\beta) = \\frac{1}{n_\\beta} \\sum_{\\alpha \\supset \\beta} P_\\alpha(x_\\beta)$$\n\nwe can get rid of all dependencies on $P_\\beta$, both in (8) and in the constraints (5), which simplifies the following analysis and derivations considerably. For fixed $\\gamma_\\beta$, the remaining minimization problem is convex in $P_\\alpha$ with linear constraints and can thus be solved with the method of Lagrange multipliers. In terms of these multipliers $\\lambda$ and the auxiliary variables $\\gamma$, the solution for $P_\\alpha$ reads\n\n$$P_\\alpha(X_\\alpha) = \\frac{1}{Z_\\alpha(\\lambda, \\gamma)} \\Psi_\\alpha(X_\\alpha) \\exp \\left[ \\sum_{\\beta \\subset \\alpha} \\left( \\bar\\lambda_{\\alpha\\beta}(x_\\beta) + \\frac{n_\\beta - 1}{n_\\beta} \\gamma_\\beta(x_\\beta) \\right) \\right] , \\qquad (9)$$\n\nwith $Z_\\alpha(\\lambda, \\gamma)$ the proper normalization and\n\n$$\\bar\\lambda_{\\alpha\\beta}(x_\\beta) \\equiv \\lambda_{\\alpha\\beta}(x_\\beta) - \\frac{1}{n_\\beta} \\sum_{\\alpha' \\supset \\beta} \\lambda_{\\alpha'\\beta}(x_\\beta) .$$\n\nSubstituting this back into the Lagrangian, we end up with an unconstrained saddle-point problem of the type $\\min_\\gamma \\max_\\lambda F(\\lambda, \\gamma)$ with\n\n$$F(\\lambda, \\gamma) = - \\sum_\\alpha \\log Z_\\alpha(\\lambda, \\gamma) + \\sum_\\beta (n_\\beta - 1) \\log \\left[ \\sum_{x_\\beta} e^{\\gamma_\\beta(x_\\beta)} \\right] .$$\n\nFrom the fixed-point equations we derive the updates\n\n$$\\lambda^{\\mathrm{new}}_{\\alpha\\beta}(x_\\beta) = \\lambda_{\\alpha\\beta}(x_\\beta) - \\log P_\\alpha(x_\\beta) + \\frac{1}{n_\\beta} \\sum_{\\alpha' \\supset \\beta} \\log P_{\\alpha'}(x_\\beta) , \\qquad (10)$$\n\n$$\\gamma^{\\mathrm{new}}_\\beta(x_\\beta) = \\log \\left[ \\frac{1}{n_\\beta} \\sum_{\\alpha \\supset \\beta} P_\\alpha(x_\\beta) \\right] , \\qquad (11)$$\n\nwith $P_\\alpha(x_\\beta)$ the marginal computed from $P_\\alpha(X_\\alpha)$ as in (9).\n\nProof. Introduce a new set of auxiliary variables $\\hat Z_\\alpha$ by writing\n\n$$- \\log Z_\\alpha = \\max_{\\hat Z_\\alpha} \\left\\{ - \\log \\hat Z_\\alpha + 1 - \\frac{1}{\\hat Z_\\alpha} \\sum_{X_\\alpha} P_\\alpha(X_\\alpha) Z_\\alpha \\right\\} .$$\n\nNext consider maximizing with respect to $\\lambda_{\\alpha\\beta}(x_\\beta)$ for a particular variable $\\beta$ and all $\\alpha \\supset \\beta$, while keeping all others as well as all $\\hat Z_\\alpha$ fixed (by convention, we update $\\hat Z_\\alpha$ to $Z_\\alpha$ after each update of the $\\lambda$\u2019s). 
Taking derivatives, we find that the new $\\bar\\lambda^{\\mathrm{new}}$ should satisfy\n\n$$\\frac{e^{\\bar\\lambda^{\\mathrm{new}}_{\\alpha\\beta}(x_\\beta)} P_\\alpha(x_\\beta)}{e^{\\bar\\lambda_{\\alpha\\beta}(x_\\beta)}} = \\frac{1}{n_\\beta} \\sum_{\\alpha' \\supset \\beta} \\frac{e^{\\bar\\lambda^{\\mathrm{new}}_{\\alpha'\\beta}(x_\\beta)} P_{\\alpha'}(x_\\beta)}{e^{\\bar\\lambda_{\\alpha'\\beta}(x_\\beta)}} .$$\n\nAny update of the form $\\lambda^{\\mathrm{new}}_{\\alpha\\beta}(x_\\beta) = - \\log P_\\alpha(x_\\beta) + \\lambda_{\\alpha\\beta}(x_\\beta) + \\nu_\\beta(x_\\beta)$ will do, where choosing $\\nu_\\beta(x_\\beta)$ such that $\\lambda^{\\mathrm{new}}_{\\alpha\\beta} = \\bar\\lambda^{\\mathrm{new}}_{\\alpha\\beta}$ yields (10).\n\nThe updates (10) and (11) are properly aligned with the respective gradients and satisfy the saddle-point equations\n\n$$F(\\lambda^{\\mathrm{new}}, \\gamma) \\ge F(\\lambda, \\gamma) \\ge F(\\lambda, \\gamma^{\\mathrm{new}}) . \\qquad (12)$$\n\nThis saddle-point problem is concave in $\\lambda$, but not necessarily convex in $\\gamma$. One way to guarantee convergence to a \"correct\" saddle point is then to solve the maximization with respect to $\\lambda$ (unique up to irrelevant linear translations) in an inner loop, followed by an update of $\\gamma$ in the outer loop. This is precisely the double-loop algorithm sketched in the previous section. We obtain the description given in Algorithm 2 if we substitute (up to irrelevant constants)\n\n$$\\gamma_\\beta(x_\\beta) = \\log P^{\\mathrm{old}}_\\beta(x_\\beta), \\quad \\bar\\lambda_{\\alpha\\beta}(x_\\beta) = \\log \\mu_{\\beta \\to \\alpha}(x_\\beta), \\quad \\text{and} \\quad \\lambda_{\\alpha\\beta}(x_\\beta) = - \\log \\mu_{\\alpha \\to \\beta}(x_\\beta) .$$\n\nNote that in the inner loop of the double-loop algorithm the scheduling does matter. 
The ordering described in Algorithm 1 (run over variables $\\beta$ and update all corresponding messages from and to neighboring factors before moving on to the next variable) satisfies (12) without damping.\n\nAn alternative approach is to apply (damped versions of) the updates (10) and (11) in parallel. This can be loosely interpreted as doing gradient descent-ascent. Gradient descent-ascent is a standard procedure for solving saddle-point problems and guaranteed to converge to the correct solution if the saddle-point problem is indeed convex/concave (see e.g. [8]). Similarly, it is easy to show that gradient descent-ascent applied to a non-convex/concave problem is locally stable at a particular saddle point $\\{\\lambda^*, \\gamma^*\\}$, if and only if the objective is locally convex/concave. The statement in the title now follows from two observations.\n\n1. The damped version (2) of the sum-product algorithm has the same local stability properties as a gradient descent-ascent procedure derived from (10) and (11).\n\nProof. We replace (11) with\n\n$$\\gamma^{\\mathrm{new}}_\\beta(x_\\beta) = \\frac{1}{n_\\beta} \\sum_{\\alpha \\supset \\beta} \\log P_\\alpha(x_\\beta) . \\qquad (13)$$\n\nAt a saddle point $P_\\alpha(x_\\beta) = P_\\beta(x_\\beta)$ $\\forall_{\\alpha \\supset \\beta}$ and thus the difference between the logarithmic average (13) and the linear average (11) as well as its derivatives vanish. Consequently, (13) has the same local stability properties as (11). Now consider parallel application of a damped version of (10), with step size $\\epsilon$, and (13), with step size $n_\\beta \\epsilon$. 
We obtain the damped version (2) of the standard sum-product algorithm, in combination with the other definitions in Algorithm 1, when we apply the definitions\n\n$$\\log \\mu_{\\beta \\to \\alpha}(x_\\beta) = \\bar\\lambda_{\\alpha\\beta}(x_\\beta) + \\frac{n_\\beta - 1}{n_\\beta} \\gamma_\\beta(x_\\beta) \\quad \\text{and} \\quad \\log \\mu_{\\alpha \\to \\beta}(x_\\beta) = \\frac{1}{n_\\beta} \\gamma_\\beta(x_\\beta) - \\lambda_{\\alpha\\beta}(x_\\beta) .$$\n\n[Figure 2: four panels (a) to (d), each plotting the KL-divergence (logarithmic vertical axis, $10^{-1}$ to $10^{1}$) against #iterations; the horizontal axes run up to about 50, 500, 20, and 2000 iterations, respectively.]\n\nFigure 2: Loopy belief propagation on a Boltzmann machine with 4 nodes, weights (upper diagonal) $(3, 2, 2, 1, 3, -3)$, and thresholds $(0, 0, 1, 1)$. Plotted is the Kullback-Leibler divergence between the exact and the approximate single-node marginals. (a) No damping leads to somewhat erratic cyclic behavior. (b) Damping with step size 0.1 yields a smoother cycle, but no convergence. (c) The double-loop algorithm does converge to a stable solution. (d) This solution is unstable under standard loopy belief propagation (here again with step size 0.1).\n\n2. Local stability of the gradient descent-ascent procedure at $\\{\\lambda^*, \\gamma^*\\}$ implies that the corresponding $P_\\alpha$ is at a minimum of the Bethe free energy and that all constraints are satisfied. The converse need not be the case.\n\nProof. Local stability of the gradient descent-ascent procedure and thus the sum-product algorithm depends on the local curvature of $F(\\lambda, \\gamma)$, defined through the Hessian matrices\n\n$$H_{\\gamma\\gamma} \\equiv \\left. \\frac{\\partial^2 F(\\lambda, \\gamma)}{\\partial \\gamma \\, \\partial \\gamma^T} \\right|_{\\{\\lambda^*, \\gamma^*\\}}$$\n\nand $H_{\\lambda\\lambda}$. 
Gradient descent-ascent is locally stable iff $H_{\\gamma\\gamma}$ is positive and $H_{\\lambda\\lambda}$ negative (semi-)definite. The latter is true by construction. The \"total\" curvature, defined through\n\n$$H^*_{\\gamma\\gamma} \\equiv \\left. \\frac{\\partial^2 F^*(\\gamma)}{\\partial \\gamma \\, \\partial \\gamma^T} \\right|_{\\gamma^*} \\quad \\text{with} \\quad F^*(\\gamma) \\equiv \\max_\\lambda F(\\lambda, \\gamma) ,$$\n\ncan be shown to obey\n\n$$H^*_{\\gamma\\gamma} = H_{\\gamma\\gamma} - H_{\\gamma\\lambda} H^{-1}_{\\lambda\\lambda} H_{\\lambda\\gamma} .$$\n\nWith $H_{\\lambda\\lambda}$ negative definite, we then conclude that if $H_{\\gamma\\gamma}$ is positive definite (gradient descent-ascent locally stable), then so is $H^*_{\\gamma\\gamma}$ (local minimum). The converse, however, need not be the case: $H^*_{\\gamma\\gamma}$ can be positive definite (minimum) where $H_{\\gamma\\gamma}$ has one or more negative eigenvalues (gradient descent-ascent unstable). An example of this phenomenon is $F(\\lambda, \\gamma) = - \\lambda^2 - \\gamma^2 + 4 \\lambda \\gamma$, for which $H_{\\gamma\\gamma} = H_{\\lambda\\lambda} = -2$ and $H_{\\gamma\\lambda} = H_{\\lambda\\gamma} = 4$, so that $H^*_{\\gamma\\gamma} = -2 - 4 \\cdot (-2)^{-1} \\cdot 4 = 6 > 0$.\n\nNon-convergence of loopy belief propagation on a Boltzmann machine is shown in Figure 2. Typically, standard loopy belief propagation converges to a stable solution without damping. In rare cases, damping is required to obtain convergence and in very rare cases, even considerable damping does not help, as in Figure 2. The double-loop algorithm does converge and the solution obtained is indeed unstable under standard belief propagation, even with damping. The larger the weights, the more often these instabilities seem to occur. This is consistent with the empirical observation that the max-product algorithm (\"belief revision\") is typically less stable than the sum-product algorithm: max-product on a Boltzmann machine corresponds to (a properly scaled version of) the sum-product algorithm in the limit of infinite weights. 
The example in Figure 2 is about the smallest that we have found: we have observed these instabilities in many other (larger) instances of Markov networks, as well as directed Bayesian networks, yet not in structures with just a single loop. The latter seems consistent with the notion that not only for trees, but also for networks with a single loop, the Bethe free energy is still convex.\n\n5 Discussion\n\nThe above gradient descent-ascent interpretation shows that loopy belief propagation is more than just fixed-point iteration: the updates tend to move in the right uphill-downhill directions, which might explain its success in practical applications. Still, loopy belief propagation can fail to converge, and apparently for two different reasons. The first, rather innocent one is too large a step size, similar to taking too large a \"learning parameter\" in gradient-descent learning. Straightforwardly damping the updates, as in (2), is then sufficient to converge to a stable fixed point. Note that this damping is in the logarithmic domain and thus slightly different from the damping linear in the messages as described in [2]. The damping proposed in [7] is restricted to the Lagrange multipliers $\\lambda$ and may therefore not share the nice properties of the damping discussed here. Local stability in the limit of small step sizes is independent of the scheduling of messages, but in practice particular schedules can still be preferable to others and, for example, be stable with larger step sizes or converge more rapidly. For example, in [9] the message updates follow the structure of a spanning tree, which empirically seems to help a lot.\n\nThe other, more serious reason for non-convergence is inherent instability of the fixed point, even in the limit of infinitely small step sizes. 
In that case, loopy belief propagation just does not work and one can resort to a more tedious double-loop algorithm to guarantee convergence to a local minimum. The double-loop algorithm described here is similar to the CCCP algorithm of [5]. The latter implicitly uses a less strict bound, which makes it (slightly) less efficient and arguably a little more complicated. Whether double-loop algorithms are worth the effort is an open question: in several simulation studies a negative correlation between the quality of the approximation and the convergence of standard belief propagation has been found [6, 7, 10], but still without a convincing theoretical explanation.\n\nAcknowledgments\n\nI would like to thank Wim Wiegerink and Onno Zoeter for many helpful suggestions and interesting discussions and the Dutch Technology Foundation STW for support.\n\nReferences\n\n[1] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 1988.\n\n[2] K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: An empirical study. In UAI\u201999, pages 467-475, 1999.\n\n[3] J. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation. In NIPS 13, pages 689-695, 2001.\n\n[4] F. Kschischang, B. Frey, and H. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498-519, 2001.\n\n[5] A. Yuille. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14:1691-1722, 2002.\n\n[6] Y. Teh and M. Welling. The unified propagation and scaling algorithm. In NIPS 14, 2002.\n\n[7] T. Minka. The EP energy function and minimization schemes. Technical report, MIT Media Lab, 2001.\n\n[8] S. Seung, T. Richardson, J. Lagarias, and J. Hopfield. Minimax and Hamiltonian dynamics of excitatory-inhibitory networks. 
In NIPS 10, 1998.\n\n[9] M. Wainwright, T. Jaakola, and A. Willsky. Tree-based reparameterization for approximate estimation on loopy graphs. In NIPS 14, 2002.\n\n[10] T. Heskes and O. Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In UAI-2002, pages 216-223, 2002.\n", "award": [], "sourceid": 2220, "authors": [{"given_name": "Tom", "family_name": "Heskes", "institution": null}]}