{"title": "The Bethe Partition Function of Log-supermodular Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 117, "page_last": 125, "abstract": "Sudderth, Wainwright, and Willsky conjectured that the Bethe approximation corresponding to any fixed point of the belief propagation algorithm over an attractive, pairwise binary graphical model provides a lower bound on the true partition function. In this work, we resolve this conjecture in the affirmative by demonstrating that, for any graphical model with binary variables whose potential functions (not necessarily pairwise) are all log-supermodular, the Bethe partition function always lower bounds the true partition function. The proof of this result follows from a new variant of the \u201cfour functions\u201d theorem that may be of independent interest.", "full_text": "The Bethe Partition Function of Log-supermodular Graphical Models\n\nNicholas Ruozzi\nCommunication Theory Laboratory\nEPFL\nLausanne, Switzerland\nnicholas.ruozzi@epfl.ch\n\nAbstract\n\nSudderth, Wainwright, and Willsky conjectured that the Bethe approximation corresponding to any fixed point of the belief propagation algorithm over an attractive, pairwise binary graphical model provides a lower bound on the true partition function. In this work, we resolve this conjecture in the affirmative by demonstrating that, for any graphical model with binary variables whose potential functions (not necessarily pairwise) are all log-supermodular, the Bethe partition function always lower bounds the true partition function. 
The proof of this result follows from a new variant of the \u201cfour functions\u201d theorem that may be of independent interest.\n\n1 Introduction\n\nGraphical models have proven to be a useful tool for performing approximate inference in a wide variety of application areas including computer vision, combinatorial optimization, statistical physics, and wireless networking. Computing the partition function of a given graphical model, a typical inference problem, is NP-hard in general. Because of this, the inference problem is often replaced by a variational approximation that is, hopefully, easier to solve. The Bethe approximation, one such standard approximation, is of great interest both because of its practical performance and because of its relationship to the belief propagation (BP) algorithm: stationary points of the Bethe free energy function correspond to fixed points of belief propagation [1]. However, the Bethe partition function is only an approximation to the true partition function and need not provide an upper or lower bound.\n\nIn certain special cases, the Bethe approximation is conjectured to provide a lower bound on the true partition function. One such example is the class of attractive pairwise graphical models: models in which the interaction between any two neighboring variables places a greater weight on assignments in which the two variables agree. Many applications in computer vision and statistical physics can be expressed as attractive pairwise graphical models (e.g., the ferromagnetic Ising model). Sudderth, Wainwright, and Willsky [2] used a loop series expansion of Chertkov and Chernyak [3, 4] in order to study the fixed points of BP over attractive graphical models. They provided conditions on the fixed points of BP under which the Bethe approximation at the stationary points of the Bethe free energy function corresponding to these fixed points is a lower bound on the true partition function. 
Empirically, they observed that, even when their conditions were not satisfied, the Bethe partition function appeared to lower bound the true partition function, and they conjectured that this is always the case for attractive pairwise binary graphical models.\n\nRecent work on the relationship between the Bethe partition function and the graph covers of a given graphical model has suggested a new approach to resolving this conjecture. Vontobel [5] demonstrated that the Bethe partition function can be precisely characterized by the average of the true partition functions corresponding to graph covers of the base graphical model. The primary contribution of the present work is to show that, for graphical models with log-supermodular potentials, the partition function associated with any graph cover of the base graph, appropriately normalized, must lower bound the true partition function. As pairwise binary graphical models are log-supermodular if and only if they are attractive, combining our result with the observations of [5] resolves the conjecture of [2].\n\nThe key element in our proof, and the second contribution of this work, is a new variant of the \u201cfour functions\u201d theorem that is specific to log-supermodular functions. We state and prove this variant in Section 3.1, and in Section 4.1, we use it to resolve the conjecture. As a final contribution, we demonstrate that our variant of the \u201cfour functions\u201d theorem has applications beyond log-supermodular functions: as an example, we use it to show that the Bethe partition function can also provide a lower bound on the number of independent sets in a bipartite graph.\n\n2 Undirected Graphical Models\n\nLet $f : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ be a non-negative function. 
We say that $f$ factors with respect to a hypergraph $G = (V, \\mathcal{A})$, where $\\mathcal{A} \\subseteq 2^V$, if there exist potential functions $\\phi_i : \\{0,1\\} \\to \\mathbb{R}_{\\ge 0}$ for each $i \\in V$ and $\\psi_\\alpha : \\{0,1\\}^{|\\alpha|} \\to \\mathbb{R}_{\\ge 0}$ for each $\\alpha \\in \\mathcal{A}$ such that\n\n$$f(x) = \\prod_{i \\in V} \\phi_i(x_i) \\prod_{\\alpha \\in \\mathcal{A}} \\psi_\\alpha(x_\\alpha)$$\n\nwhere $x_\\alpha$ is the subvector of the vector $x$ indexed by the set $\\alpha$.\n\nWe will express the hypergraph $G$ as a bipartite graph that consists of a variable node for each $i \\in V$, a factor node for each $\\alpha \\in \\mathcal{A}$, and an edge joining the factor node corresponding to $\\alpha$ to the variable node representing $i$ if $i \\in \\alpha$. This is typically referred to as the factor graph representation of $G$.\n\nDefinition 2.1. A function $f : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ is log-supermodular if for all $x, y \\in \\{0,1\\}^n$\n\n$$f(x) f(y) \\le f(x \\wedge y) f(x \\vee y)$$\n\nwhere $(x \\wedge y)_i = \\min\\{x_i, y_i\\}$ and $(x \\vee y)_i = \\max\\{x_i, y_i\\}$. Similarly, a function $f : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ is log-submodular if for all $x, y \\in \\{0,1\\}^n$\n\n$$f(x) f(y) \\ge f(x \\wedge y) f(x \\vee y).$$\n\nDefinition 2.2. A factorization of a function $f : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ over $G = (V, \\mathcal{A})$ is log-supermodular if for all $\\alpha \\in \\mathcal{A}$, $\\psi_\\alpha(x_\\alpha)$ is log-supermodular.\n\nEvery function that admits a log-supermodular factorization is necessarily log-supermodular (products of log-supermodular functions are easily seen to be log-supermodular), but the converse may not be true outside of special cases. If $|\\alpha| \\le 2$ for each $\\alpha \\in \\mathcal{A}$, then we call the factorization pairwise. For any pairwise factorization, $f$ is log-supermodular if and only if $\\psi_{ij}$ is log-supermodular for each $i$ and $j$.\n\nPairwise graphical models such that $\\psi_\\alpha(x_\\alpha)$ is log-supermodular for all $\\alpha \\in \\mathcal{A}$ are referred to as attractive graphical models. 
A generalization of attractive interactions to the non-pairwise case is presented in [2]: for all $\\alpha \\in \\mathcal{A}$, $\\psi_\\alpha$, when appropriately normalized, has non-negative central moments. However, the relationship between this generalization and log-supermodularity remains unclear.\n\n2.1 Graph Covers\n\nGraph covers have played an important role in our understanding of graphical models [5, 6]. Roughly, if a graph $H$ covers a graph $G$, then $H$ looks locally the same as $G$.\n\nDefinition 2.3. A graph $H$ covers a graph $G = (V, E)$ if there exists a graph homomorphism $h : H \\to G$ such that for all vertices $v \\in G$ and all $w \\in h^{-1}(v)$, $h$ maps the neighborhood $\\partial w$ of $w$ in $H$ bijectively to the neighborhood $\\partial v$ of $v$ in $G$. If $h(w) = v$, then we say that $w \\in H$ is a copy of $v \\in G$. Further, $H$ is a $k$-cover of $G$ if every vertex of $G$ has exactly $k$ copies in $H$.\n\nFigure 1: An example of a graph cover. (a) A graph, $G$. (b) One possible cover of $G$. The nodes in the cover are labeled with the node that they copy in the base graph.\n\nFor an example of a graph cover, see Figure 1.\n\nFor the factor graph corresponding to $G = (V, \\mathcal{A})$, each $k$-cover consists of a variable node for each of the $k|V|$ variables, a factor node for each of the $k|\\mathcal{A}|$ factors, and an edge joining each copy of $\\alpha \\in \\mathcal{A}$ to a distinct copy of each $i \\in \\alpha$. To any $k$-cover $H = (V_H, \\mathcal{A}_H)$ of $G$ given by the homomorphism $h$, we can associate a collection of potentials: the potential at node $i \\in V_H$ is equal to $\\phi_{h(i)}$, the potential at node $h(i) \\in G$, and for each $\\alpha \\in \\mathcal{A}_H$, we associate the potential $\\psi_{h(\\alpha)}$. 
In this way, we can construct a function $f^H : \\{0,1\\}^{kn} \\to \\mathbb{R}_{\\ge 0}$ such that $f^H$ factorizes over $H$. Notice that if $f^G$ admits a log-supermodular factorization over $G$ and $H$ is a $k$-cover of $G$, then $f^H$ admits a log-supermodular factorization over $H$.\n\n2.2 Bethe Approximations\n\nFor a function $f : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ that factorizes over $G = (V, \\mathcal{A})$, we are interested in computing the partition function $Z(G) = \\sum_x f(x)$. In general, this is an NP-hard problem, but in practice, algorithms, such as belief propagation, based on variational approximations produce reasonable estimates in many settings. One such variational approximation, the Bethe approximation at temperature $T = 1$, is defined as follows:\n\n$$\\log Z_B(G, \\tau) = \\sum_{i \\in V} \\sum_{x_i} \\tau_i(x_i) \\log \\phi_i(x_i) + \\sum_{\\alpha \\in \\mathcal{A}} \\sum_{x_\\alpha} \\tau_\\alpha(x_\\alpha) \\log \\psi_\\alpha(x_\\alpha) - \\sum_{i \\in V} \\sum_{x_i} \\tau_i(x_i) \\log \\tau_i(x_i) - \\sum_{\\alpha \\in \\mathcal{A}} \\sum_{x_\\alpha} \\tau_\\alpha(x_\\alpha) \\log \\frac{\\tau_\\alpha(x_\\alpha)}{\\prod_{i \\in \\alpha} \\tau_i(x_i)}$$\n\nfor $\\tau$ in the local marginal polytope,\n\n$$\\mathcal{T} \\triangleq \\Big\\{\\tau \\ge 0 \\;\\Big|\\; \\forall \\alpha \\in \\mathcal{A}, i \\in \\alpha, \\sum_{x_{\\alpha \\setminus i}} \\tau_\\alpha(x_\\alpha) = \\tau_i(x_i) \\text{ and } \\forall i \\in V, \\sum_{x_i} \\tau_i(x_i) = 1\\Big\\}.$$\n\nThe fixed points of the belief propagation algorithm correspond to stationary points of $\\log Z_B(G, \\tau)$ over $\\mathcal{T}$, the set of pseudomarginals [1], and the Bethe partition function is defined to be the maximum value achieved by this approximation over $\\mathcal{T}$:\n\n$$Z_B(G) = \\max_{\\tau \\in \\mathcal{T}} Z_B(G, \\tau).$$\n\nFor a fixed factor graph $G$, we are interested in the relationship between the true partition function, $Z(G)$, and the Bethe approximation corresponding to $G$, $Z_B(G)$. 
While, in general, $Z_B(G)$ can be either an upper or a lower bound on the true partition function, in this work, we address the following conjecture of [2]:\n\nConjecture 2.4. If $f : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ admits a pairwise, log-supermodular factorization over $G = (V, \\mathcal{A})$, then $Z_B(G) \\le Z(G)$.\n\nWe resolve this conjecture in the affirmative, and show that it continues to hold for a larger class of log-supermodular functions. Our results are based, primarily, on two observations: a variant of the \u201cfour functions\u201d theorem [7] and the following, recent theorem of Vontobel [5]:\n\nTheorem 2.5.\n\n$$Z_B(G) = \\limsup_{k \\to \\infty} \\sqrt[k]{\\sum_{H \\in \\mathcal{C}_k(G)} Z(H) / |\\mathcal{C}_k(G)|}$$\n\nwhere $\\mathcal{C}_k(G)$ is the set of all $k$-covers of $G$ (see footnote 1).\n\nProof. See Theorem 27 of [5].\n\nTheorem 2.5 suggests that a reasonable strategy for proving that $Z_B(G) \\le Z(G)$ would be to show that $Z(H) \\le Z(G)^k$ for any $k$-cover $H$ of $G$. This is the strategy that we adopt in the remainder of this work.\n\n3 The \u201cFour Functions\u201d Theorem and Related Results\n\nThe \u201cfour functions\u201d theorem [7] is a general result concerning nonnegative functions over distributive lattices. Many correlation inequalities from statistical physics, such as the FKG inequality, can be seen as special cases of this theorem [8].\n\nTheorem 3.1 (\u201cFour Functions\u201d Theorem). Let $f_1, f_2, f_3, f_4 : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ be nonnegative real-valued functions. If for all $x, y \\in \\{0,1\\}^n$,\n\n$$f_1(x) f_2(y) \\le f_3(x \\wedge y) f_4(x \\vee y),$$\n\nthen\n\n$$\\Big[\\sum_{x \\in \\{0,1\\}^n} f_1(x)\\Big] \\Big[\\sum_{x \\in \\{0,1\\}^n} f_2(x)\\Big] \\le \\Big[\\sum_{x \\in \\{0,1\\}^n} f_3(x)\\Big] \\Big[\\sum_{x \\in \\{0,1\\}^n} f_4(x)\\Big].$$\n\nThe following lemma is a direct consequence of the four functions theorem:\n\nLemma 3.2. 
If $f : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ is log-supermodular, then every marginal of $f$ is also log-supermodular.\n\nThe four functions theorem can be extended to more than four functions by generalizing $\\wedge$ and $\\vee$. For any collection of vectors $x^1, \\ldots, x^k \\in \\mathbb{R}^n$, let $z^i(x^1, \\ldots, x^k)$ be the vector whose $j$th component is the $i$th largest element of $x^1_j, \\ldots, x^k_j$ for each $j \\in \\{1, \\ldots, n\\}$. As an example, for vectors $x^1, \\ldots, x^k \\in \\{0,1\\}^n$, $z^i(x^1, \\ldots, x^k)_j = \\{\\sum_{a=1}^k x^a_j \\ge i\\}$ where $\\{\\cdot \\ge \\cdot\\}$ is one if the inequality is satisfied and zero otherwise. The \u201cfour functions\u201d theorem is then a special case of the more general \u201c2k functions\u201d theorem [9, 10, 11]:\n\nTheorem 3.3 (\u201c2k Functions\u201d Theorem). Let $f_1, \\ldots, f_k : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ and $g_1, \\ldots, g_k : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ be nonnegative real-valued functions. If for all $x^1, \\ldots, x^k \\in \\{0,1\\}^n$,\n\n$$\\prod_{i=1}^k g_i(x^i) \\le \\prod_{i=1}^k f_i(z^i(x^1, \\ldots, x^k)), \\tag{1}$$\n\nthen\n\n$$\\prod_{i=1}^k \\Big[\\sum_{x \\in \\{0,1\\}^n} g_i(x)\\Big] \\le \\prod_{i=1}^k \\Big[\\sum_{x \\in \\{0,1\\}^n} f_i(x)\\Big].$$\n\nFootnote 1 (to Theorem 2.5): The proof of the theorem is demonstrated for \u201cnormal\u201d factor graphs, but it easily extends to the factor graphs described above by replacing variable nodes with equality constraints.\n\n3.1 A Variant of the \u201cFour Functions\u201d Theorem\n\nA natural generalization of Theorem 3.3 would be to replace the product of functions on the left-hand side of Equation 1 with an arbitrary function over $x^1, \\ldots, x^k$: we will show that we can replace this product with an arbitrary log-supermodular function while preserving the conclusion of the theorem. The key property of log-supermodular functions that makes this possible is the following lemma:\n\nLemma 3.4. 
If $g : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ is log-supermodular, then for any integer $k \\ge 1$ and $x^1, \\ldots, x^k \\in \\{0,1\\}^n$, $\\prod_{i=1}^k g(x^i) \\le \\prod_{i=1}^k g(z^i(x^1, \\ldots, x^k))$.\n\nProof. This follows directly from the log-supermodularity of $g$.\n\nThe proof of our variant of the \u201c2k functions\u201d theorem uses the properties of weak majorizations:\n\nDefinition 3.5. A vector $x \\in \\mathbb{R}^n$ is weakly majorized by a vector $y \\in \\mathbb{R}^n$, denoted $x \\prec_w y$, if $\\sum_{i=1}^t z^i(x_1, \\ldots, x_n) \\le \\sum_{i=1}^t z^i(y_1, \\ldots, y_n)$ for all $t \\in \\{1, \\ldots, n\\}$.\n\nFor the purposes of this paper, we will only need the following result concerning weak majorizations:\n\nTheorem 3.6. For $x, y \\in \\mathbb{R}^n$, $x \\prec_w y$ if and only if $\\sum_{i=1}^n g(x_i) \\le \\sum_{i=1}^n g(y_i)$ for all continuous, increasing, and convex functions $g : \\mathbb{R} \\to \\mathbb{R}$.\n\nProof. See 3.C.1.b and 4.B.2 of [12].\n\nWe now state and prove our variant of the 2k functions theorem in two pieces. First, we consider the case where $n = 1$:\n\nLemma 3.7. Let $f_1, \\ldots, f_k : \\{0,1\\} \\to \\mathbb{R}_{\\ge 0}$ and $g : \\{0,1\\}^k \\to \\mathbb{R}_{\\ge 0}$ be nonnegative real-valued functions such that $g$ is log-supermodular. If for all $x^1, \\ldots, x^k \\in \\{0,1\\}$,\n\n$$g(x^1, \\ldots, x^k) \\le \\prod_{i=1}^k f_i(z^i(x^1, \\ldots, x^k)),$$\n\nthen\n\n$$\\sum_{x^1, \\ldots, x^k \\in \\{0,1\\}} g(x^1, \\ldots, x^k) \\le \\prod_{i=1}^k \\Big[\\sum_{x \\in \\{0,1\\}} f_i(x)\\Big].$$\n\nProof. For each $c \\in \\{0, \\ldots, k\\}$, define $X^c = \\{(x^1, \\ldots, x^k) : x^1 + \\ldots + x^k = c\\}$. Let $G^c \\in \\mathbb{R}^{\\binom{k}{c}}$ be the vector obtained by evaluating $g$ at each element of $X^c$, and define $F^c$ similarly for $f(x^1, \\ldots, x^k) \\triangleq \\prod_{i=1}^k f_i(x^i)$.\n\nOur strategy will be to show that $\\log G^c \\prec_w \\log F^c$ for each $c$ or, equivalently, that $\\prod_{t=1}^T z^t(G^c_1, \\ldots, G^c_{\\binom{k}{c}}) \\le \\prod_{t=1}^T z^t(F^c_1, \\ldots, F^c_{\\binom{k}{c}})$ for all $c \\in \\{0, \\ldots, k\\}$ and $T \\le \\binom{k}{c}$. Then, by Theorem 3.6 and the fact that $2^x$ is convex and increasing, we will have\n\n$$\\sum_{(x^1, \\ldots, x^k) \\in X^c} g(x^1, \\ldots, x^k) = \\sum_{t=1}^{\\binom{k}{c}} 2^{\\log G^c_t} \\le \\sum_{t=1}^{\\binom{k}{c}} 2^{\\log F^c_t} = \\sum_{(x^1, \\ldots, x^k) \\in X^c} \\prod_{i=1}^k f_i(x^i)$$\n\nfor all $c$. As the $X^c$ are disjoint, this will complete the proof. We note that, by continuity arguments, this analysis holds even when some values of $g$ and $f$ are equal to zero.\n\nNow, fix $c \\in \\{0, \\ldots, k\\}$ and $T \\in \\{1, \\ldots, \\binom{k}{c}\\}$. Suppose $v^1, \\ldots, v^T \\in X^c$ are $T$ distinct vectors. By Lemma 3.4, we must have\n\n$$\\prod_{t=1}^T g(v^t) \\le \\prod_{t=1}^T g(z^t(v^1, \\ldots, v^T)) \\le \\prod_{t=1}^T f(w^t)$$\n\nwhere $w^t_j = z^j(z^t(v^1, \\ldots, v^T)_1, \\ldots, z^t(v^1, \\ldots, v^T)_k)$ for each $j \\in \\{1, \\ldots, k\\}$. Given any such $v^1, \\ldots, v^T \\in X^c$, we will show how to construct distinct vectors $\\bar{v}^1, \\ldots, \\bar{v}^T \\in X^c$ such that $\\prod_{t=1}^T f(w^t) \\le \\prod_{t=1}^T f(\\bar{v}^t)$. Consequently, we will have\n\n$$\\prod_{t=1}^T g(v^t) \\le \\prod_{t=1}^T f(\\bar{v}^t) \\le \\prod_{t=1}^T z^t(F^c_1, \\ldots, F^c_{\\binom{k}{c}}).$$\n\nAs our construction will work for any choice of distinct vectors $v^1, \\ldots, v^T \\in X^c$, it will work, in particular, for the $T$ distinct vectors in $X^c$ that maximize $\\prod_{t=1}^T g(v^t)$, and the lemma will then follow as a consequence of our previous arguments.\n\nWe now describe how to construct the vectors $\\bar{v}^1, \\ldots, \\bar{v}^T$ from the vectors $v^1, \\ldots, v^T$. Let $A \\in \\mathbb{R}^{k \\times T}$ be the matrix whose $i$th column is given by the vector $v^i$. Construct $\\bar{A} \\in \\mathbb{R}^{k \\times T}$ from $A$ by swapping the rows of $A$ so that for each $i < j \\in \\{1, \\ldots, k\\}$, $\\sum_p \\bar{A}_{ip} \\ge \\sum_p \\bar{A}_{jp}$. Intuitively, the first row of $\\bar{A}$ corresponds to the row of $A$ with the most nonzero elements, the second row of $\\bar{A}$ corresponds to the row of $A$ with the second largest number of nonzero elements, and so on. Let $\\bar{v}^1, \\ldots, \\bar{v}^T$ be the columns of $\\bar{A}$. Notice that $\\bar{v}^1, \\ldots, \\bar{v}^T$ are distinct vectors in $X^c$ and that, by construction, $z^j(z^t(\\bar{v}^1, \\ldots, \\bar{v}^T)_1, \\ldots, z^t(\\bar{v}^1, \\ldots, \\bar{v}^T)_k) = z^t(\\bar{v}^1, \\ldots, \\bar{v}^T)_j$ for each $j \\in \\{1, \\ldots, k\\}$ and $t \\in \\{1, \\ldots, T\\}$. Therefore, we must have\n\n$$\\prod_{t=1}^T g(\\bar{v}^t) \\le \\prod_{t=1}^T g(z^t(\\bar{v}^1, \\ldots, \\bar{v}^T)) \\le \\prod_{t=1}^T f(z^t(\\bar{v}^1, \\ldots, \\bar{v}^T)) = \\prod_{t=1}^T f(\\bar{v}^t)$$\n\nwhere the equality follows from the definition of $f$ as a product of the $f_i$. In addition, the vector $z^t(\\bar{v}^1, \\ldots, \\bar{v}^T)$ is simply a permuted version of the vector $z^t(v^1, \\ldots, v^T)$, which means that their $j$th largest elements must agree:\n\n$$w^t_j = z^j(z^t(v^1, \\ldots, v^T)_1, \\ldots, z^t(v^1, \\ldots, v^T)_k) = z^j(z^t(\\bar{v}^1, \\ldots, \\bar{v}^T)_1, \\ldots, z^t(\\bar{v}^1, \\ldots, \\bar{v}^T)_k) = z^t(\\bar{v}^1, \\ldots, \\bar{v}^T)_j.$$\n\nTherefore,\n\n$$\\prod_{t=1}^T g(v^t) \\le \\prod_{t=1}^T f(w^t) = \\prod_{t=1}^T f(z^t(\\bar{v}^1, \\ldots, \\bar{v}^T)) = \\prod_{t=1}^T f(\\bar{v}^t)$$\n\nand the lemma follows as a consequence.\n\nRemark. In the case that $n = 1$ and $k \\ge 1$, this lemma is a more general result than the 2k functions theorem: if $g(x^1, \\ldots, x^k) = \\prod_i g_i(x^i)$ for $g_1, \\ldots, g_k : \\{0,1\\} \\to \\mathbb{R}_{\\ge 0}$, then $g$ is log-supermodular.\n\nAs in the proof of the 2k functions theorem, the general theorem for $n \\ge 1$ follows by induction on $n$. This inductive proof closely follows the inductive argument in the proof of the \u201cfour functions\u201d theorem described in [8] with the added observation that marginals of log-supermodular functions continue to be log-supermodular.\n\nTheorem 3.8. Let $f_1, \\ldots, f_k : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ and $g : \\{0,1\\}^{kn} \\to \\mathbb{R}_{\\ge 0}$ be nonnegative real-valued functions such that $g$ is log-supermodular. If for all $x^1, \\ldots, x^k \\in \\{0,1\\}^n$,\n\n$$g(x^1, \\ldots, x^k) \\le \\prod_{i=1}^k f_i(z^i(x^1, \\ldots, x^k)),$$\n\nthen\n\n$$\\sum_{x^1, \\ldots, x^k \\in \\{0,1\\}^n} g(x^1, \\ldots, x^k) \\le \\prod_{i=1}^k \\Big[\\sum_{x \\in \\{0,1\\}^n} f_i(x)\\Big].$$\n\nProof. We will prove the result for general $k$ and $n$ by induction on $n$. The base case of $n = 1$ follows from Lemma 3.7. Now, for $n \\ge 2$, suppose that the result holds for $k \\ge 1$ and $n - 1$, and let $f_1, \\ldots, f_k : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ and $g : \\{0,1\\}^{kn} \\to \\mathbb{R}_{\\ge 0}$ be nonnegative real-valued functions such that $g$ is log-supermodular.\n\nDefine $f'_i : \\{0,1\\}^{n-1} \\to \\mathbb{R}_{\\ge 0}$ and $g' : \\{0,1\\}^{k(n-1)} \\to \\mathbb{R}_{\\ge 0}$ as\n\n$$f'_i(y) = f_i(y, 0) + f_i(y, 1)$$\n\n$$g'(y^1, \\ldots, y^k) = \\sum_{s^1, \\ldots, s^k \\in \\{0,1\\}} g(y^1, s^1, \\ldots, y^k, s^k).$$\n\nNotice that $g'$ is log-supermodular because it is the marginal of a log-supermodular function (see Lemma 3.2). If we can show that\n\n$$g'(y^1, \\ldots, y^k) \\le \\prod_{i=1}^k f'_i(z^i(y^1, \\ldots, y^k))$$\n\nfor all $y^1, \\ldots, y^k \\in \\{0,1\\}^{n-1}$, then the result will follow by induction on $n$. To show this, fix $y^1, \\ldots, y^k \\in \\{0,1\\}^{n-1}$ and define $\\bar{f}_i : \\{0,1\\} \\to \\mathbb{R}_{\\ge 0}$ and $\\bar{g} : \\{0,1\\}^k \\to \\mathbb{R}_{\\ge 0}$ as\n\n$$\\bar{f}_i(s) = f_i(z^i(y^1, \\ldots, y^k), s)$$\n\n$$\\bar{g}(s^1, \\ldots, s^k) = g(y^1, s^1, \\ldots, y^k, s^k).$$\n\nWe can easily check that $\\bar{g}(s^1, \\ldots, s^k)$ is log-supermodular and that $\\bar{g}(s^1, \\ldots, s^k) \\le \\prod_{i=1}^k \\bar{f}_i(z^i(s^1, \\ldots, s^k))$ for all $s^1, \\ldots, s^k \\in \\{0,1\\}$. Hence, by Lemma 3.7,\n\n$$g'(y^1, \\ldots, y^k) = \\sum_{s^1, \\ldots, s^k} \\bar{g}(s^1, \\ldots, s^k) \\le \\prod_{i=1}^k \\sum_{s \\in \\{0,1\\}} \\bar{f}_i(s) = \\prod_{i=1}^k f'_i(z^i(y^1, \\ldots, y^k)),$$\n\nwhich completes the proof of the theorem.\n\n4 Graph Covers and the Partition Function\n\nIn this section, we show how to apply Theorem 3.8 in order to resolve Conjecture 2.4. In addition, we show that the theorem can be applied more generally to yield similar results for a class of functions that can be converted into log-supermodular functions by a change of variables.\n\n4.1 Log-supermodularity and Graph Covers\n\nThe following theorem follows easily from Theorem 3.8:\n\nTheorem 4.1. If $f^G : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ admits a log-supermodular factorization over $G = (V, \\mathcal{A})$, then for any $k$-cover, $H$, of $G$, $Z(H) \\le Z(G)^k$.\n\nProof. Let $H$ be a $k$-cover of $G$. Divide the vertices of $H$ into $k$ sets $S_1, \\ldots, S_k$ such that each set contains exactly one copy of each vertex $i \\in V$. Let the assignments to the variables in the set $S_i$ be denoted by the vector $x^i$.\n\nFor each $\\alpha \\in \\mathcal{A}$, let $y^i_\\alpha$ denote the assignment to the $i$th copy of $\\alpha$ by the elements of $x^1, \\ldots, x^k$. By Lemma 3.4,\n\n$$\\prod_{i=1}^k \\psi_\\alpha(y^i_\\alpha) \\le \\prod_{i=1}^k \\psi_\\alpha(z^i(y^1_\\alpha, \\ldots, y^k_\\alpha)) = \\prod_{i=1}^k \\psi_\\alpha(z^i(x^1_\\alpha, \\ldots, x^k_\\alpha)) = \\prod_{i=1}^k \\psi_\\alpha(z^i(x^1, \\ldots, x^k)_\\alpha).$$\n\nFrom this, we can conclude that $f^H(x^1, \\ldots, x^k) \\le \\prod_{i=1}^k f^G(z^i(x^1, \\ldots, x^k))$. Now, by Theorem 3.8,\n\n$$Z(H) = \\sum_{x^1, \\ldots, x^k} f^H(x^1, \\ldots, x^k) \\le \\prod_{i=1}^k \\Big[\\sum_{x^i} f^G(x^i)\\Big] = Z(G)^k.$$\n\nThis theorem settles the conjecture of [2] for any log-supermodular function that admits a pairwise binary factorization, and the conjecture continues to hold for any graphical model that admits a log-supermodular factorization.\n\nCorollary 4.2. 
If $f : \\{0,1\\}^n \\to \\mathbb{R}_{\\ge 0}$ admits a log-supermodular factorization over $G = (V, \\mathcal{A})$, then $Z_B(G) \\le Z(G)$.\n\nProof. This follows directly from Theorem 4.1 and Theorem 2.5.\n\nAs the value of the Bethe approximation at any of the fixed points of BP is always a lower bound on $Z_B(G)$, the conclusion of the corollary holds for any fixed point of the BP algorithm as well.\n\n4.2 Beyond Log-supermodularity\n\nWhile Theorem 4.1 is a statement only about log-supermodular functions, we can use it to infer similar results even when the function under consideration is not log-supermodular. As an example of such an application, we consider the problem of counting the number of independent sets in a given graph, $G = (V, E)$. An independent set, $I \\subseteq V$, in $G$ is a subset of the vertices such that no two adjacent vertices are in $I$. We define the following function:\n\n$$I^G(x_1, \\ldots, x_{|V|}) = \\prod_{(i,j) \\in E} (1 - x_i x_j)$$\n\nwhich is equal to one if the nonzero $x_i$\u2019s define an independent set and zero otherwise. As every potential function depends on at most two variables, $I^G$ factorizes over the graph $G = (V, E)$. Notice that $I^G$ is log-submodular, not log-supermodular.\n\nIn this section, we will focus on bipartite graphs: $G = (V, E)$ is bipartite if we can partition the vertex set into two sets $A \\subseteq V$ and $B = V \\setminus A$ such that $A$ and $B$ are independent sets. Examples of bipartite graphs include cycles of even length, trees, and grid graphs. We will denote bipartite graphs as $G = (A, B, E)$.\n\nFor any bipartite graph $G = (A, B, E)$, $I^G$ can be converted into a log-supermodular graphical model by a simple change of variables. Define $y_a = x_a$ for all $a \\in A$ and $y_b = 1 - x_b$ for all $b \\in B$. We then have\n\n$$I^G(x_1, \\ldots, x_{|V|}) = \\prod_{(i,j) \\in E} (1 - x_i x_j) = \\prod_{(a,b) \\in E, a \\in A, b \\in B} (1 - y_a(1 - y_b)) \\triangleq \\bar{I}^G(y_1, \\ldots, y_{|V|}).$$\n\n$\\bar{I}^G$ admits a log-supermodular factorization over $G$ and $\\sum_y \\bar{I}^G(y) = \\sum_x I^G(x)$. Similarly, for any graph cover $H$ of $G$, we have $\\sum_y \\bar{I}^H(y) = \\sum_x I^H(x)$. Consequently, by Theorem 4.1 together with Theorem 2.5, we can conclude that $Z(G) \\ge Z_B(G)$. Similar observations can be used to show that the Bethe partition function provides a lower bound on the true partition function for other problems that factor over pairwise bipartite graphical models (e.g., the antiferromagnetic Ising model on a grid).\n\n5 Conclusions\n\nWhile the results presented above were discussed in the case that the temperature parameter, $T$, was equal to one, they easily extend to any $T \\ge 0$ (as exponentiation preserves log-supermodularity in this case). Hence, all of the bounds discussed above can be extended to the problem of maximizing a log-supermodular function. In particular, the inequality in Theorem 4.1 shows that the maximum value of the objective function on any graph cover is achieved by a lift of a maximizing assignment on the base graph.\n\nThis work also suggests a number of directions for future research. Related work on the Bethe approximation for permanents suggests that conjectures similar to those discussed above can be made for other classes of functions [13, 14]. While, like the \u201cfour functions\u201d theorem, many of the above results can be extended to general distributive lattices, understanding when similar results may hold for non-binary problems may be of interest for graphical models that arise in certain application areas such as computer vision.\n\nAcknowledgments\n\nThe author would like to thank Pascal Vontobel and Nicolas Macris for useful discussions and suggestions during the preparation of this work. This work was supported by EC grant FP7-265496, \u201cSTAMINA\u201d.\n\nReferences\n\n[1] J. S. Yedidia, W. T. Freeman, and Y. Weiss. 
Constructing free-energy approximations and generalized belief propagation algorithms. Information Theory, IEEE Transactions on, 51(7):2282\u20132312, July 2005.\n\n[2] E. B. Sudderth, M. J. Wainwright, and A. S. Willsky. Loop series and Bethe variational bounds in attractive graphical models. In Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, Dec. 2007.\n\n[3] M. Chertkov and V. Y. Chernyak. Loop series for discrete statistical models on graphs. J. Stat. Mech., 2006.\n\n[4] V. G\u00f3mez, J. M. Mooij, and H. J. Kappen. Truncating the loop series expansion for BP. Journal of Machine Learning Research (JMLR), 2007.\n\n[5] P. O. Vontobel. Counting in graph covers: A combinatorial characterization of the Bethe entropy function. CoRR, abs/1012.0065, 2010.\n\n[6] P. O. Vontobel and R. Koetter. Graph-cover decoding and finite-length analysis of message-passing iterative decoding of LDPC codes. CoRR, abs/cs/0512078, 2005.\n\n[7] R. Ahlswede and D. E. Daykin. An inequality for the weights of two families of sets, their unions and intersections. Probability Theory and Related Fields, 43:183\u2013185, 1978.\n\n[8] N. Alon and J. H. Spencer. The probabilistic method. Wiley-Interscience series in discrete mathematics and optimization. Wiley, 2000.\n\n[9] R. Aharoni and U. Keich. A generalization of the Ahlswede-Daykin inequality. Discrete Mathematics, 152(13):1\u201312, 1996.\n\n[10] Y. Rinott and M. Saks. Correlation inequalities and a conjecture for permanents. Combinatorica, 13:269\u2013277, 1993.\n\n[11] Y. Rinott and M. Saks. On FKG-type and permanental inequalities. Lecture Notes-Monograph Series, 22:332\u2013342, 1992.\n\n[12] A. W. Marshall and I. Olkin. Inequalities: Theory of Majorization and its Applications. Academic Press, New York, 1979.\n\n[13] P. O. Vontobel. The Bethe permanent of a non-negative matrix. 
In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 341\u2013346, Oct. 2010.\n\n[14] L. Gurvits. Unleashing the power of Schrijver\u2019s permanental inequality with the help of the Bethe approximation. ArXiv e-prints, June 2011.\n", "award": [], "sourceid": 63, "authors": [{"given_name": "Nicholas", "family_name": "Ruozzi", "institution": null}]}