{"title": "Linear Response for Approximate Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 361, "page_last": 368, "abstract": "", "full_text": "Linear Response for Approximate Inference

Max Welling
Department of Computer Science
University of Toronto
Toronto M5S 3G4 Canada
welling@cs.utoronto.ca

Yee Whye Teh
Computer Science Division
University of California at Berkeley
Berkeley CA 94720 USA
ywteh@eecs.berkeley.edu

Abstract

Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.

1 Introduction

Belief propagation (BP) has become an important tool for approximate inference on graphs with cycles. Especially in the field of "error correction decoding", it has brought performance very close to the Shannon limit. BP was studied in a number of papers which have gradually increased our understanding of the convergence properties and accuracy of the algorithm. In particular, recent developments show that the stable fixed points are local minima of the Bethe free energy [10, 1], which paved the way for more accurate "generalized belief propagation" algorithms and convergent alternatives to BP [11, 6].

Despite its success, BP does not provide a prescription to compute joint probabilities over pairs of non-neighboring nodes in the graph.
When the graph is a tree, there is a single chain connecting any two nodes, and dynamic programming can be used to efficiently integrate out the internal variables. However, when cycles exist, it is not clear what approximate procedure is appropriate. It is precisely this problem that we will address in this paper. We show that the required estimates can be obtained by computing the "sensitivity" of the node marginals to small changes in the node potentials. Based on this idea, we present two algorithms to estimate the joint probabilities of arbitrary pairs of nodes.

These results are interesting in the inference domain but may also have future applications to learning graphical models from data. For instance, information about dependencies between random variables is relevant for learning the structure of a graph and the parameters encoding the interactions.

2 Belief Propagation on Factor Graphs

Let V index a collection of random variables \{X_i\}_{i \in V} and let x_i denote values of X_i. For a subset of nodes \alpha \subset V let X_\alpha = \{X_i\}_{i \in \alpha} be the variable associated with that subset, and x_\alpha be values of X_\alpha. Let A be a family of such subsets of V. The probability distribution over X \doteq X_V is assumed to have the following form,

    P_X(X = x) = \frac{1}{Z} \prod_{\alpha \in A} \psi_\alpha(x_\alpha) \prod_{i \in V} \psi_i(x_i)    (1)

where Z is the normalization constant (the partition function) and \psi_\alpha, \psi_i are positive potential functions defined on subsets and single nodes respectively. In the following we will write P(x) \doteq P_X(X = x) for notational simplicity. The decomposition of (1) is consistent with a factor graph with function nodes over X_\alpha and variable nodes X_i.
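For very small graphs the factorization (1) can be evaluated by brute force, which is useful as ground truth when checking approximate methods later in the paper. The sketch below is ours, not part of the paper; the function name and the toy potentials are hypothetical, and the enumeration is only feasible for a handful of nodes.

```python
import itertools
import numpy as np

def joint_table(n_states, factors, unaries):
    """Brute-force the factorization P(x) = (1/Z) prod_a psi_a(x_a) prod_i psi_i(x_i).

    factors: dict mapping a tuple of node indices alpha to an array psi_alpha;
    unaries: list of 1-D arrays psi_i. Only feasible for tiny graphs."""
    n = len(unaries)
    p = np.zeros((n_states,) * n)
    for x in itertools.product(range(n_states), repeat=n):
        val = 1.0
        for i in range(n):
            val *= unaries[i][x[i]]
        for alpha, psi in factors.items():
            val *= psi[tuple(x[i] for i in alpha)]
        p[x] = val
    Z = p.sum()           # the partition function
    return p / Z, Z

# Toy example: a 3-node chain with pairwise factors (hypothetical numbers).
rng = np.random.default_rng(0)
unaries = [rng.uniform(0.5, 1.5, size=2) for _ in range(3)]
factors = {(0, 1): rng.uniform(0.5, 1.5, size=(2, 2)),
           (1, 2): rng.uniform(0.5, 1.5, size=(2, 2))}
p, Z = joint_table(2, factors, unaries)
```

Exact pairwise marginals, and hence exact covariances, follow by summing `p` over the remaining axes.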
For each i \in V denote its neighbors by N_i = \{\alpha \in A : \alpha \ni i\} and for each subset \alpha its neighbors are simply N_\alpha = \{i \in \alpha\}.

Factor graphs are a convenient representation for structured probabilistic models and subsume undirected graphical models and acyclic directed graphical models [3]. Further, there is a simple message passing algorithm for approximate inference that generalizes the belief propagation algorithms on both undirected and acyclic directed graphical models,

    n_{i\alpha}(x_i) \leftarrow \psi_i(x_i) \prod_{\beta \in N_i \setminus \alpha} m_{\beta i}(x_i), \qquad m_{\alpha i}(x_i) \leftarrow \sum_{x_{\alpha \setminus i}} \psi_\alpha(x_\alpha) \prod_{j \in N_\alpha \setminus i} n_{j\alpha}(x_j)    (2)

where n_{i\alpha}(x_i) represents a message from variable node i to factor node \alpha and vice versa for message m_{\alpha i}(x_i). Marginal distributions over factor nodes and variable nodes are expressed in terms of these messages as follows,

    b_\alpha(x_\alpha) = \frac{1}{\gamma_\alpha} \psi_\alpha(x_\alpha) \prod_{i \in N_\alpha} n_{i\alpha}(x_i), \qquad b_i(x_i) = \frac{1}{\gamma_i} \psi_i(x_i) \prod_{\alpha \in N_i} m_{\alpha i}(x_i)    (3)

where \gamma_i, \gamma_\alpha are normalization constants.
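A minimal sketch of the updates (2)-(3), restricted for brevity to pairwise factors \alpha = (i, j); the function name and toy numbers below are ours, and damping, scheduling, and log-domain arithmetic are omitted.

```python
import numpy as np

def loopy_bp(unaries, factors, n_iter=200, tol=1e-10):
    """Loopy BP for pairwise factors. unaries: list of psi_i arrays;
    factors: dict (i, j) -> psi_ij matrix indexed [x_i, x_j]."""
    D, n = len(unaries[0]), len(unaries)
    # m[(a, i)]: message from factor a to variable i, as in (2)
    m = {(a, i): np.ones(D) / D for a in factors for i in a}
    for _ in range(n_iter):
        delta = 0.0
        for a, psi in factors.items():
            i, j = a
            for s, t, ps in ((i, j, psi), (j, i, psi.T)):
                # variable-to-factor message n_{t,a}(x_t), first update in (2)
                n_ta = unaries[t].copy()
                for b in factors:
                    if t in b and b != a:
                        n_ta *= m[(b, t)]
                new = ps @ n_ta           # sum over x_t, second update in (2)
                new /= new.sum()          # normalize against under/overflow
                delta = max(delta, np.abs(new - m[(a, s)]).max())
                m[(a, s)] = new
        if delta < tol:
            break
    # node beliefs (3): psi_i times all incoming factor messages
    beliefs = []
    for i in range(n):
        b = unaries[i].copy()
        for a in factors:
            if i in a:
                b *= m[(a, i)]
        beliefs.append(b / b.sum())
    return beliefs, m
```

On a tree the beliefs equal the exact single-node marginals; on loopy graphs they are the approximations studied in this paper.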
It was recently established in [10, 1] that stable fixed points of these update equations correspond to local minima of the Bethe-Gibbs free energy given by,

    G_{BP}(\{b_i^{BP}, b_\alpha^{BP}\}) = \sum_\alpha \sum_{x_\alpha} b_\alpha^{BP}(x_\alpha) \log \frac{b_\alpha^{BP}(x_\alpha)}{\psi_\alpha(x_\alpha)} + \sum_i \sum_{x_i} b_i^{BP}(x_i) \log \frac{b_i^{BP}(x_i)^{c_i}}{\psi_i(x_i)}    (4)

with c_i = 1 - |N_i|, and the marginals are subject to the following local constraints:

    \sum_{x_{\alpha \setminus i}} b_\alpha^{BP}(x_\alpha) = b_i^{BP}(x_i), \qquad \sum_{x_\alpha} b_\alpha(x_\alpha) = 1, \qquad \forall \alpha \in A, \; i \in \alpha    (5)

Since only local constraints are enforced it is no longer guaranteed that the set of marginals \{b_i^{BP}, b_\alpha^{BP}\} is consistent with a single joint distribution B(x).

3 Linear Response

In the following we will be interested in computing estimates of joint probability distributions for arbitrary pairs of nodes. We propose a method based on the linear response theorem. The idea is to study changes in the system when we perturb the single node potentials,

    \log \psi_i(x_i) = \log \psi_i^0(x_i) + \theta_i(x_i)    (6)

The superscript 0 indicates unperturbed quantities in (6) and the following.
Let \theta = \{\theta_i\} and define the cumulant generating function of P(x) (up to a constant) as,

    F(\theta) = -\log \sum_x \prod_{\alpha \in A} \psi_\alpha(x_\alpha) \prod_{i \in V} \psi_i^0(x_i) e^{\theta_i(x_i)}    (7)

Differentiating F(\theta) with respect to \theta gives the cumulants of P(x),

    -\frac{\partial F(\theta)}{\partial \theta_j(x_j)} \Big|_{\theta=0} = p_j(x_j)    (8)

    -\frac{\partial^2 F(\theta)}{\partial \theta_i(x_i) \partial \theta_j(x_j)} \Big|_{\theta=0} = \frac{\partial p_j(x_j)}{\partial \theta_i(x_i)} \Big|_{\theta=0} = \begin{cases} p_{ij}(x_i, x_j) - p_i(x_i) p_j(x_j) & \text{if } i \neq j \\ p_i(x_i) \delta_{x_i, x_j} - p_i(x_i) p_j(x_j) & \text{if } i = j \end{cases}    (9)

where p_i, p_{ij} are single node and pairwise marginals of P(x). Expressions for higher order cumulants can be derived by taking further derivatives of -F(\theta).

Notice from (9) that the covariance estimates are obtained by studying the perturbations in p_j(x_j) as we vary \theta_i(x_i). This is not practical in general since calculating p_j(x_j) itself is intractable. Instead, we consider perturbations of approximate marginal distributions \{b_j\}. In the following we will assume that b_j(x_j; \theta) (with the dependence on \theta made explicit) are the beliefs at a local minimum of the Bethe-Gibbs free energy (subject to constraints). In analogy to (9), let C_{ij}(x_i, x_j) = \frac{\partial b_j(x_j; \theta)}{\partial \theta_i(x_i)} \big|_{\theta=0} be the linear response estimated covariance, and define the linear response estimated joint pairwise marginal as

    b_{ij}^{LR}(x_i, x_j) = b_i^0(x_i) b_j^0(x_j) + C_{ij}(x_i, x_j)    (10)

where b_i^0(x_i) \doteq b_i(x_i; \theta = 0).
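The identities (8)-(9) can be verified numerically on a model small enough to enumerate: evaluate F(\theta) by brute force and compare a central finite difference of its mixed second derivative against the exact covariance. The model and tolerances below are our toy choices, not from the paper.

```python
import itertools
import numpy as np

# Tiny 2-node model with one pairwise potential (toy numbers).
psi = {0: np.array([1.0, 2.0]), 1: np.array([1.5, 0.5])}
psi01 = np.array([[1.0, 0.3], [0.4, 2.0]])

def F(theta):
    """Cumulant generating function (7), computed by enumeration."""
    s = 0.0
    for x0, x1 in itertools.product(range(2), repeat=2):
        s += (psi[0][x0] * np.exp(theta[0][x0]) *
              psi[1][x1] * np.exp(theta[1][x1]) * psi01[x0, x1])
    return -np.log(s)

# Exact joint and its marginals for the comparison on the RHS of (9).
joint = np.outer(psi[0], psi[1]) * psi01
joint /= joint.sum()
p0, p1 = joint.sum(1), joint.sum(0)

eps = 1e-3
for xi in range(2):
    for xj in range(2):
        def Fpm(si, sj):
            th = [np.zeros(2), np.zeros(2)]
            th[0][xi] += si * eps
            th[1][xj] += sj * eps
            return F(th)
        # central difference for -d^2 F / d theta_0(xi) d theta_1(xj)
        d2 = -(Fpm(1, 1) - Fpm(1, -1) - Fpm(-1, 1) + Fpm(-1, -1)) / (4 * eps**2)
        cov = joint[xi, xj] - p0[xi] * p1[xj]
        assert abs(d2 - cov) < 1e-4
```

The same finite-difference scheme applied to the approximate beliefs is what the linear response estimate C_{ij} computes analytically.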
We will show that b_{ij}^{LR} and C_{ij} satisfy a number of important properties which make them suitable as approximations of joint marginals and covariances.

First we show that C_{ij}(x_i, x_j) can be interpreted as the Hessian of a well-behaved convex function. Let \mathcal{C} be the set of beliefs that satisfy the constraints (5). The approximate marginals \{b_i^0\} along with the joint marginals \{b_\alpha^0\} form a local minimum of the Bethe-Gibbs free energy (subject to b^0 \doteq \{b_i^0, b_\alpha^0\} \in \mathcal{C}). Assume that b^0 is a strict local minimum of G_{BP} (the strict local minimality is in fact attained if we use loopy belief propagation [1]). That is, there is an open domain D containing b^0 such that G_{BP}(b^0) < G_{BP}(b) for each b \in D \cap \mathcal{C} \setminus b^0. Now we can define

    G^*(\theta) = \inf_{b \in D \cap \mathcal{C}} \Big[ G_{BP}(b) - \sum_{i, x_i} b_i(x_i) \theta_i(x_i) \Big]    (11)

G^*(\theta) is a concave function since it is the infimum of a set of linear functions in \theta. Further G^*(0) = G_{BP}(b^0). Since b^0 is a strict local minimum when \theta = 0, small perturbations in \theta will result in small perturbations in b^0, so that G^* is well-behaved on an open neighborhood around \theta = 0. Differentiating G^*(\theta), we get \frac{\partial G^*(\theta)}{\partial \theta_j(x_j)} = -b_j(x_j; \theta), so we now have

    C_{ij}(x_i, x_j) = \frac{\partial b_j(x_j; \theta)}{\partial \theta_i(x_i)} \Big|_{\theta=0} = -\frac{\partial^2 G^*(\theta)}{\partial \theta_i(x_i) \partial \theta_j(x_j)} \Big|_{\theta=0}    (12)

In essence, we can interpret G^*(\theta) as a local convex dual of G_{BP}(b) (by restricting attention to D). Since G_{BP} is an approximation to the exact Gibbs free energy [8], which is in turn dual to F(\theta) [4], G^*(\theta) can be seen as an approximation to F(\theta) for small values of \theta.
For that reason we can take its second derivatives C_{ij}(x_i, x_j) as approximations to the exact covariances (which are second derivatives of -F(\theta)).

Theorem 1 The approximate covariance satisfies the following symmetry:

    C_{ij}(x_i, x_j) = C_{ji}(x_j, x_i)    (13)

Proof: The covariances are second derivatives of -G^*(\theta) at \theta = 0, so we can interchange the order of the derivatives since G^*(\theta) is well-behaved on a neighborhood around \theta = 0. □

Theorem 2 The approximate covariance satisfies the following "marginalization" conditions for each x_i, x_j:

    \sum_{x_i'} C_{ij}(x_i', x_j) = \sum_{x_j'} C_{ij}(x_i, x_j') = 0    (14)

As a result the approximate joint marginals satisfy local marginalization constraints:

    \sum_{x_i'} b_{ij}^{LR}(x_i', x_j) = b_j^0(x_j), \qquad \sum_{x_j'} b_{ij}^{LR}(x_i, x_j') = b_i^0(x_i)    (15)

Proof: Using the definition of C_{ij}(x_i, x_j) and the marginalization constraints for b_j^0,

    \sum_{x_j'} \frac{\partial b_j(x_j'; \theta)}{\partial \theta_i(x_i)} \Big|_{\theta=0} = \frac{\partial \sum_{x_j'} b_j(x_j'; \theta)}{\partial \theta_i(x_i)} \Big|_{\theta=0} = \frac{\partial \, 1}{\partial \theta_i(x_i)} \Big|_{\theta=0} = 0    (16)

The constraint \sum_{x_i'} C_{ij}(x_i', x_j) = 0 follows from the symmetry (13), while the corresponding marginalization (15) follows from (14) and the definition of b_{ij}^{LR}. □

Since -F(\theta) is convex, its Hessian matrix with entries given in (9) is positive semi-definite. Similarly, since the approximate covariances C_{ij}(x_i, x_j) are second derivatives of a convex function -G^*(\theta), we have:

Theorem 3 The matrix formed from
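The exact covariance matrix of (9) satisfies the same three properties that theorems 1-3 establish for the linear response estimate: symmetry, zero row and column sums within each block, and positive semi-definiteness of the flattened matrix. A check on a toy 3-node model (the construction below uses the exact joint, so it illustrates the properties rather than the LR estimate itself):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, D = 3, 2
logits = rng.normal(size=(D,) * n)
p = np.exp(logits)
p /= p.sum()                      # an arbitrary exact joint over 3 binary nodes

def marg(keep):
    """Marginalize p onto the axes in `keep` (in increasing order)."""
    other = tuple(a for a in range(n) if a not in keep)
    return p.sum(axis=other)

# Flatten (i, x_i) over rows and (j, x_j) over columns, as in Theorem 3.
C = np.zeros((n * D, n * D))
for i, j in itertools.product(range(n), repeat=2):
    if i == j:
        pi = marg((i,))
        block = np.diag(pi) - np.outer(pi, pi)        # i = j case of (9)
    else:
        pij = marg((i, j)) if i < j else marg((j, i)).T
        block = pij - np.outer(marg((i,)), marg((j,)))  # i != j case of (9)
    C[i*D:(i+1)*D, j*D:(j+1)*D] = block

assert np.allclose(C, C.T)                    # symmetry (13)
assert np.allclose(C.sum(axis=1), 0)          # marginalization (14)
assert np.linalg.eigvalsh(C).min() > -1e-10   # positive semi-definite
```

The point of theorems 1-3 is that the approximate C_{ij}, unlike estimates obtained by conditioning, is guaranteed to inherit exactly these properties.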
the approximate covariances C_{ij}(x_i, x_j) by varying i and x_i over the rows and varying j, x_j over the columns is positive semi-definite.

Using the above results we can reinterpret the linear response correction as a "projection" of the (only locally consistent) beliefs \{b_i^0, b_\alpha^0\} onto a set of beliefs \{b_i^0, b_{ij}^{LR}\} that is both locally consistent (theorem 2) and satisfies the global constraint of being positive semi-definite (theorem 3)^1.

4 Propagating Perturbations for Linear Response

Recall from (10) that we need the first derivative of b_i(x_i; \theta) with respect to \theta_j(x_j) at \theta = 0. This does not automatically imply that we need an analytic expression for b_i(x_i; \theta) in terms of \theta. In this section we show how we may compute these first derivatives by expanding all quantities and equations up to first order in \theta and keeping track of first order dependencies.

First we assume that belief propagation has converged to a stable fixed point.
We expand the beliefs and messages up to first order as^2

    b_i(x_i; \theta) = b_i^0(x_i) \Big( 1 + \sum_{j, y_j} R_{ij}(x_i, y_j) \theta_j(y_j) \Big)    (17)

    n_{i\alpha}(x_i) = n_{i\alpha}^0(x_i) \Big( 1 + \sum_{k, y_k} N_{i\alpha,k}(x_i, y_k) \theta_k(y_k) \Big)    (18)

    m_{\alpha i}(x_i) = m_{\alpha i}^0(x_i) \Big( 1 + \sum_{k, y_k} M_{\alpha i,k}(x_i, y_k) \theta_k(y_k) \Big)    (19)

The "response matrices" R_{ij}, N_{i\alpha,j}, M_{\alpha i,j} measure the sensitivities of the corresponding logarithms of beliefs and messages to changes in the log potentials \log \psi_j(y_j) at node j. Next, inserting the expansions (6,18,19) into the belief propagation equations (2) and matching first order terms, we arrive at the following update equations for the "super-messages" M_{\alpha i,k}(x_i, y_k) and N_{i\alpha,k}(x_i, y_k),

    N_{i\alpha,k}(x_i, y_k) \leftarrow \delta_{ik} \delta_{x_i y_k} + \sum_{\beta \in N_i \setminus \alpha} M_{\beta i,k}(x_i, y_k)    (20)

    M_{\alpha i,k}(x_i, y_k) \leftarrow \sum_{x_{\alpha \setminus i}} \frac{\psi_\alpha(x_\alpha)}{m_{\alpha i}^0(x_i)} \prod_{j \in N_\alpha \setminus i} n_{j\alpha}^0(x_j) \sum_{j \in N_\alpha \setminus i} N_{j\alpha,k}(x_j, y_k)    (21)

The super-messages are initialized at M_{\alpha i,k} = N_{i\alpha,k} = 0 and updated using (20,21) until convergence. Just as for belief propagation, where messages are normalized to avoid numerical over- or underflow, after each update the super-messages are "normalized" as follows,

    M_{\alpha i,k}(x_i, y_k) \leftarrow M_{\alpha i,k}(x_i, y_k) - \sum_{x_i'} M_{\alpha i,k}(x_i', y_k)    (22)

and similarly for N_{i\alpha,k}.

^1 In extreme cases it is however possible that some entries of b_{ij}^{LR} become negative.
^2 The unconventional form of this expansion will make subsequent derivations more transparent.
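The quantity that the super-message recursion (20)-(22) computes analytically, C_{ij}(x_i, x_j) = \partial b_j(x_j)/\partial \theta_i(x_i) at \theta = 0, can also be approximated by finite differences: perturb one log node potential as in (6), rerun BP, and difference the resulting beliefs. A self-contained sketch for pairwise factors (the helper names and tolerances are ours; this is the numerical counterpart of the propagation algorithm, not the algorithm itself):

```python
import numpy as np

def bp_beliefs(unaries, factors, n_iter=500):
    """Minimal loopy BP for pairwise factors (i, j) -> psi[x_i, x_j]."""
    D, n = len(unaries[0]), len(unaries)
    m = {(a, i): np.ones(D) for a in factors for i in a}
    for _ in range(n_iter):
        for a, psi in factors.items():
            i, j = a
            for s, t, ps in ((i, j, psi), (j, i, psi.T)):
                n_ta = unaries[t].copy()
                for b in factors:
                    if t in b and b != a:
                        n_ta *= m[(b, t)]
                new = ps @ n_ta
                m[(a, s)] = new / new.sum()
    out = []
    for i in range(n):
        b = unaries[i].copy()
        for a in factors:
            if i in a:
                b *= m[(a, i)]
        out.append(b / b.sum())
    return out

def linear_response_fd(unaries, factors, i, xi, eps=1e-6):
    """Finite-difference estimate of C_{ij}(xi, .) = d b_j / d theta_i(xi)."""
    up = [u.copy() for u in unaries]
    dn = [u.copy() for u in unaries]
    up[i][xi] *= np.exp(eps)    # log psi_i(xi) -> log psi_i(xi) + eps, as in (6)
    dn[i][xi] *= np.exp(-eps)
    bp, bm = bp_beliefs(up, factors), bp_beliefs(dn, factors)
    return [(hi - lo) / (2 * eps) for hi, lo in zip(bp, bm)]
```

On a tree this recovers the exact covariances, consistent with Theorem 4; the propagation algorithm of this section obtains the same derivatives in a single pass of super-messages rather than by rerunning BP per perturbation.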
After the above fixed point equations have converged, we compute the response matrix R_{ij}(x_i, x_j) by again inserting the expansions (6,17,19) into (3) and matching first order terms,

    R_{ij}(x_i, x_j) = \delta_{ij} \delta_{x_i x_j} + \sum_{\alpha \in N_i} M_{\alpha i,j}(x_i, x_j)    (23)

The constraints (14) (which follow from the normalization of b_i(x_i; \theta) and b_i^0(x_i)) translate into \sum_{x_i} b_i^0(x_i) R_{ij}(x_i, y_j) = 0, and it is not hard to verify that the following shift can be applied to accomplish this,

    R_{ij}(x_i, y_j) \leftarrow R_{ij}(x_i, y_j) - \sum_{x_i'} b_i^0(x_i') R_{ij}(x_i', y_j)    (24)

Finally, combining (17) with (12), we get

    C_{ij}(x_i, x_j) = b_i^0(x_i) R_{ij}(x_i, x_j)    (25)

Theorem 4 If the factor graph has no loops then the linear response estimates defined in (25) are exact. Moreover, there exists a scheduling of the super-messages such that the algorithm converges after just one iteration (i.e. every message is updated just once).

Sketch of Proof: Both results follow from the fact that belief propagation on tree structured factor graphs computes the exact single node marginals for arbitrary \theta. Since the super-messages are the first order terms of the BP updates with arbitrary \theta, we can invoke the exact linear response theorem given by (8) and (9) to claim that the algorithm converges to the exact joint pairwise marginal distributions. □

For graphs with cycles, BP is not guaranteed to converge. We can however still prove the following strong result.

Theorem 5 If the messages \{m_{\alpha i}^0(x_i), n_{i\alpha}^0(x_i)\} have converged to a stable fixed point, then the update equations for the super-messages (20,21,22) will also converge to a unique stable fixed point, using any scheduling of the super-messages.

Sketch of Proof^3: We first note that the updates (20,21,22) form a linear system of equations which can only have one stable fixed point.
The existence and stability of this fixed point is proven by observing that the first order term is identical to the one obtained from a linear expansion of the BP equations (2) around its stable fixed point. Finally, the Stein-Rosenberg theorem guarantees that any scheduling will converge to the same fixed point. □

^3 For a more detailed proof of the above two theorems we refer to [9].

5 Inverting Matrices for Linear Response

In this section we describe an alternative method to compute \partial b_i(x_i)/\partial \theta_k(x_k) by first computing \partial \theta_i(x_i)/\partial b_k(x_k) and then inverting the matrix formed by flattening \{i, x_i\} into a row index and \{k, x_k\} into a column index. This method is a direct extension of [2]. The intuition is that while perturbations in a single \theta_i(x_i) affect the whole system, perturbations in a single b_i(x_i) (while keeping the others fixed) affect each subsystem \alpha \in A independently (see [8]). This makes it easier to compute \partial \theta_i(x_i)/\partial b_k(x_k) than to compute \partial b_i(x_i)/\partial \theta_k(x_k).

First we propose minimal representations for b_i, \theta_i and the messages. We assume that for each node i there is a distinguished value x_i = 0. Set \theta_i(0) = 0 while functionally defining b_i(0) = 1 - \sum_{x_i \neq 0} b_i(x_i). Now the matrix formed by \partial \theta_i(x_i)/\partial b_k(x_k) for each i, k and x_i, x_k \neq 0 is invertible and its inverse gives us the desired covariances for x_i, x_k \neq 0. Values for x_i = 0 or x_k = 0 can then be computed using (14). We will also need minimal representations for the messages. This can be achieved by defining new quantities \lambda_{i\alpha}(x_i) = \log \frac{n_{i\alpha}(x_i)}{n_{i\alpha}(0)} for all i and x_i \neq 0. The \lambda_{i\alpha}'s can be interpreted as Lagrange multipliers that enforce the consistency constraints (5) [10].
We will use these multipliers instead of the messages in this section.

Re-expressing the fixed point equations (2,3) in terms of the b_i's and \lambda_{i\alpha}'s only, and introducing the perturbations \theta_i, we get:

    \Big( \frac{b_i(x_i)}{b_i(0)} \Big)^{c_i} = \frac{\psi_i(x_i)}{\psi_i(0)} \, e^{\theta_i(x_i)} \prod_{\alpha \in N_i} e^{-\lambda_{i\alpha}(x_i)} \qquad \text{for all } i, \; x_i \neq 0    (26)

    b_i(x_i) = \frac{\sum_{x_{\alpha \setminus i}} \psi_\alpha(x_\alpha) \prod_{j \in N_\alpha} e^{\lambda_{j\alpha}(x_j)}}{\sum_{x_\alpha} \psi_\alpha(x_\alpha) \prod_{j \in N_\alpha} e^{\lambda_{j\alpha}(x_j)}} \qquad \text{for all } i, \; \alpha \in N_i, \; x_i \neq 0    (27)

Differentiating the logarithm of (26) with respect to b_k(x_k), we get

    \frac{\partial \theta_i(x_i)}{\partial b_k(x_k)} = c_i \delta_{ik} \Big( \frac{\delta_{x_i x_k}}{b_i(x_i)} + \frac{1}{b_i(0)} \Big) + \sum_{\alpha \in N_i} \frac{\partial \lambda_{i\alpha}(x_i)}{\partial b_k(x_k)}    (28)

remembering that b_i(0) is a function of b_i(x_i), x_i \neq 0. Notice that we need values for \partial \lambda_{i\alpha}(x_i)/\partial b_k(x_k) in order to solve for \partial \theta_i(x_i)/\partial b_k(x_k). Since perturbations in b_k(x_k) (while keeping the other b_j's fixed) do not affect nodes not directly connected to k, we have \partial \lambda_{i\alpha}(x_i)/\partial b_k(x_k) = 0 for k \notin \alpha. When k \in \alpha, these can in turn be obtained by solving, for each \alpha, a matrix inverse. Differentiating (27) with respect to b_k(x_k), we obtain

    \delta_{ik} \delta_{x_i x_k} = \sum_{j \in \alpha, \, x_j \neq 0} C_{ij}^\alpha(x_i, x_j) \frac{\partial \lambda_{j\alpha}(x_j)}{\partial b_k(x_k)}    (29)

    C_{ij}^\alpha(x_i, x_j) = \begin{cases} b_\alpha(x_i, x_j) - b_i(x_i) b_j(x_j) & \text{if } i \neq j \\ b_i(x_i) \delta_{x_i x_j} - b_i(x_i) b_j(x_j) & \text{if } i = j \end{cases}    (30)

for each i, k \in N_\alpha and x_i, x_k \neq 0.
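The distinguished-state construction can be checked in the simplest possible case, a single multinomial node: with \theta(x) = \log b(x)/b(0) and b(0) = 1 - \sum_{x \neq 0} b(x), the matrix \partial \theta(x)/\partial b(y) = \delta_{xy}/b(x) + 1/b(0) (for x, y \neq 0, the single-node analogue of (28) with c_i = 1 and no \lambda terms) is exactly the inverse of the restricted covariance b(x)\delta_{xy} - b(x)b(y). A numerical check (toy numbers):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4
b = rng.dirichlet(np.ones(D))          # b[0] plays the role of the state x = 0
# d theta(x) / d b(y) for x, y != 0, flattened over the non-zero states
H = np.diag(1.0 / b[1:]) + 1.0 / b[0]
# restricted covariance: b(x) delta_{xy} - b(x) b(y), x, y != 0
C = np.diag(b[1:]) - np.outer(b[1:], b[1:])
assert np.allclose(np.linalg.inv(H), C)
```

This is the "invert \partial\theta/\partial b to obtain covariances" step of the section in miniature; in the full algorithm H additionally contains the \partial \lambda/\partial b terms supplied by the per-factor inverses of (29,30).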
Flattening the indices in (29) (varying i, x_i over rows and k, x_k over columns), the LHS becomes the identity matrix, while the RHS is a product of two matrices. The first is a covariance matrix C_\alpha whose ij-th block is C_{ij}^\alpha(x_i, x_j), while the second matrix consists of all the desired derivatives \partial \lambda_{j\alpha}(x_j)/\partial b_k(x_k). Hence the derivatives are given as elements of the inverse covariance matrix C_\alpha^{-1}. Finally, plugging the values of \partial \lambda_{j\alpha}(x_j)/\partial b_k(x_k) into (28) now gives \partial \theta_i(x_i)/\partial b_k(x_k), and inverting that matrix will give us the desired approximate covariances over the whole graph. Interestingly, the method only requires access to the beliefs at the local minimum, not to the potentials or Lagrange multipliers.

Figure 1: L1-error in covariances for MF+LR, BP, BP+LR and "conditioning". Dashed line is baseline (C = 0). The results are separately plotted for neighboring nodes (a), next-to-nearest neighboring nodes (b) and the remaining nodes (c).

6 Experiments

The accuracy of the estimated covariances C_{ij}(x_i, x_j) in the LR approximation was studied on a 6 x 6 square grid with only nearest neighbors connected and 3 states per node. The solid curves in figure 1 represent the error in the estimates for: 1) mean field + LR approximation [2, 9], 2) BP estimates for neighboring nodes with b_{EDGE} = b_\alpha in equation (3), 3) BP+LR, and 4) "conditioning", where b_{ij}(x_i, x_j) = b_{i|j}(x_i|x_j) b_j^{BP}(x_j) and b_{i|j}(x_i|x_j) is computed by running BP N \cdot D times with x_j clamped at a specific state (this has the same computational complexity as BP+LR). C was computed as C_{ij} = b_{ij} - b_i b_j, with \{b_i, b_j\} the marginals of b_{ij}, and symmetrizing the result.
The error was computed as the absolute difference between the estimated and the true values, averaged over pairs of nodes and their possible states, and averaged over 25 random draws of the network. An instantiation of a network was generated by randomly drawing the logarithm of the edge potentials from a zero mean Gaussian with a standard deviation ranging over [0, 2]. The node potentials were set to 1.

From these experiments we conclude that "conditioning" and BP+LR have similar accuracy and significantly outperform MF+LR and BP, while "conditioning" performs slightly better than BP+LR. The latter does however satisfy some desirable properties which are violated by conditioning (see section 7 for further discussion).

7 Discussion

In this paper we propose to estimate covariances as follows: first observe that the log partition function is the cumulant generating function; next define its conjugate dual – the Gibbs free energy – and approximate it; finally transform back to obtain a local convex approximation to the log partition function, from which the covariances can be estimated.

The computational complexity of the iterative linear response algorithm scales as O(N \cdot E \cdot D^3) per iteration (N = #nodes, E = #edges, D = #states per node). The non-iterative algorithm scales slightly worse, O(N^3 \cdot D^3), but is based on a matrix inverse for which very efficient implementations exist.
A question that remains open is whether we can improve the efficiency of the iterative algorithm when we are only interested in the joint distributions of neighboring nodes.

There are still a number of generalizations worth mentioning. Firstly, the same ideas can be applied to the MF approximation [9] and the Kikuchi approximation (see also [5]). Secondly, the presented method easily generalizes to the computation of higher order cumulants. Thirdly, when applying the same techniques to Gaussian random fields, a propagation algorithm results that computes the inverse of the weight matrix exactly [9]. In the case of more general continuous random field models, we are investigating whether linear response algorithms can be applied to the fixed points of expectation propagation.

The most important distinguishing feature between the proposed LR algorithm and the conditioning procedure described in section 6 is the fact that the covariance estimate is automatically positive semi-definite. Indeed, the idea to include global constraints such as positive semi-definiteness in approximate inference algorithms was proposed in [7]. Other differences include automatic consistency between joint pairwise marginals from LR and node marginals from BP (not true for conditioning) and a convergence proof for the LR algorithm (absent for conditioning, but not observed to be a problem experimentally). Finally, the non-iterative algorithm is applicable to all local minima of the Bethe-Gibbs free energy, even those that correspond to unstable fixed points of BP.

Acknowledgements

We would like to thank Martin Wainwright for discussion. MW would like to thank Geoffrey Hinton for support. YWT would like to thank Mike Jordan for support.

References

[1] T. Heskes.
Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In Advances in Neural Information Processing Systems, volume 15, Vancouver, CA, 2003.

[2] H.J. Kappen and F.B. Rodriguez. Efficient learning in Boltzmann machines using linear response theory. Neural Computation, 10:1137–1156, 1998.

[3] F.R. Kschischang, B. Frey, and H.A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

[4] M. Opper and O. Winther. From naive mean field theory to the TAP equations. In Advanced Mean Field Methods – Theory and Practice. MIT Press, 2001.

[5] K. Tanaka. Probabilistic inference by means of cluster variation method and linear response theory. IEICE Transactions in Information and Systems, E86-D(7):1228–1242, 2003.

[6] Y.W. Teh and M. Welling. The unified propagation and scaling algorithm. In Advances in Neural Information Processing Systems, 2001.

[7] M.J. Wainwright and M.I. Jordan. Semidefinite relaxations for approximate inference on graphs with cycles. Technical report, Computer Science Division, University of California Berkeley, 2003. Rep. No. UCB/CSD-3-1226.

[8] M. Welling and Y.W. Teh. Approximate inference in Boltzmann machines. Artificial Intelligence, 143:19–50, 2003.

[9] M. Welling and Y.W. Teh. Linear response algorithms for approximate inference in graphical models. Neural Computation, 16:197–221, 2004.

[10] J.S. Yedidia, W. Freeman, and Y. Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems, volume 13, 2000.

[11] A.L. Yuille. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation.
Neural Computation, 14(7):1691\u20131722, 2002.\n\n\f", "award": [], "sourceid": 2419, "authors": [{"given_name": "Max", "family_name": "Welling", "institution": null}, {"given_name": "Yee", "family_name": "Teh", "institution": null}]}