{"title": "Clamping Variables and Approximate Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 909, "page_last": 917, "abstract": "It was recently proved using graph covers (Ruozzi, 2012) that the Bethe partition function is upper bounded by the true partition function for a binary pairwise model that is attractive. Here we provide a new, arguably simpler proof from first principles. We make use of the idea of clamping a variable to a particular value. For an attractive model, we show that summing over the Bethe partition functions for each sub-model obtained after clamping any variable can only raise (and hence improve) the approximation. In fact, we derive a stronger result that may have other useful implications. Repeatedly clamping until we obtain a model with no cycles, where the Bethe approximation is exact, yields the result. We also provide a related lower bound on a broad class of approximate partition functions of general pairwise multi-label models that depends only on the topology. We demonstrate that clamping a few wisely chosen variables can be of practical value by dramatically reducing approximation error.", "full_text": "Clamping Variables and Approximate Inference\n\nAdrian Weller\n\nTony Jebara\n\nColumbia University, New York, NY 10027\n\nColumbia University, New York, NY 10027\n\nadrian@cs.columbia.edu\n\njebara@cs.columbia.edu\n\nAbstract\n\nIt was recently proved using graph covers (Ruozzi, 2012) that the Bethe partition\nfunction is upper bounded by the true partition function for a binary pairwise\nmodel that is attractive. Here we provide a new, arguably simpler proof from\n\ufb01rst principles. We make use of the idea of clamping a variable to a particular\nvalue. For an attractive model, we show that summing over the Bethe partition\nfunctions for each sub-model obtained after clamping any variable can only raise\n(and hence improve) the approximation. 
In fact, we derive a stronger result that may have other useful implications. Repeatedly clamping until we obtain a model with no cycles, where the Bethe approximation is exact, yields the result. We also provide a related lower bound on a broad class of approximate partition functions of general pairwise multi-label models that depends only on the topology. We demonstrate that clamping a few wisely chosen variables can be of practical value by dramatically reducing approximation error.\n\n1 Introduction\n\nMarginal inference and estimating the partition function for undirected graphical models, also called Markov random fields (MRFs), are fundamental problems in machine learning. Exact solutions may be obtained via variable elimination or the junction tree method, but unless the treewidth is bounded, this can take exponential time (Pearl, 1988; Lauritzen and Spiegelhalter, 1988; Wainwright and Jordan, 2008). Hence, many approximate methods have been developed.\n\nOf particular note is the Bethe approximation, which is widely used via the loopy belief propagation algorithm (LBP). Though this is typically fast and results are often accurate, in general it may converge only to a local optimum of the Bethe free energy, or may not converge at all (McEliece et al., 1998; Murphy et al., 1999). Another drawback is that, until recently, there were no guarantees on whether the returned approximation to the partition function was higher or lower than the true value. Both aspects are in contrast to methods such as the tree-reweighted approximation (TRW, Wainwright et al., 2005), which features a convex free energy and is guaranteed to return an upper bound on the true partition function. 
Nevertheless, empirically, LBP or convergent implementations of the Bethe approximation often outperform other methods (Meshi et al., 2009; Weller et al., 2014).\n\nUsing the method of graph covers (Vontobel, 2013), Ruozzi (2012) recently proved that the optimum Bethe partition function provides a lower bound for the true value, i.e. ZB \u2264 Z, for discrete binary MRFs with submodular log potential cost functions of any order. Here we provide an alternative proof for attractive binary pairwise models. Our proof does not rely on any methods of loop series (Sudderth et al., 2007) or graph covers, but rather builds on fundamental properties of the derivatives of the Bethe free energy. Our approach applies only to pairwise models (whereas Ruozzi, 2012 applies to any order), but we obtain stronger results for this class, from which ZB \u2264 Z easily follows. We use the idea of clamping a variable and considering the approximate sub-partition functions over the remaining variables, as the clamped variable takes each of its possible values.\n\nNotation and preliminaries are presented in \u00a72. In \u00a73, we derive a lower bound, not just for the standard Bethe partition function, but for a range of approximate partition functions over multi-label variables that may be defined from a variational perspective as an optimization problem, based only on the topology of the model. In \u00a74, we consider the Bethe approximation for attractive binary pairwise models. We show that clamping any variable and summing the Bethe sub-partition functions over the remaining variables can only increase (hence improve) the approximation. Together with a similar argument to that used in \u00a73, this proves that ZB \u2264 Z for this class of model. To derive the result, we analyze how the optimum of the Bethe free energy varies as the singleton marginal of one particular variable is fixed to different values in [0, 1]. 
Remarkably, we show that the negative of this optimum, less the singleton entropy of the variable, is a convex function of the singleton marginal. This may have further interesting implications. We present experiments in \u00a75, demonstrating that clamping even a single variable selected using a simple heuristic can be very beneficial.\n\n1.1 Related work\n\nBranching or conditioning on a variable (or set of variables) and approximating over the remaining variables has a fruitful history in algorithms such as branch-and-cut (Padberg and Rinaldi, 1991; Mitchell, 2002), work on resolution versus search (Rish and Dechter, 2000) and various approaches of (Darwiche, 2009, Chapter 8). Cutset conditioning was discussed by Pearl (1988) and refined by Peot and Shachter (1991) as a method to render the remaining topology acyclic in preparation for belief propagation. Eaton and Ghahramani (2009) developed this further, introducing the conditioned belief propagation algorithm together with back-belief-propagation as a way to help identify which variables to clamp. Liu et al. (2012) discussed feedback message passing for inference in Gaussian (not discrete) models, deriving strong results for the particular class of attractive models. Choi and Darwiche (2008) examined methods to approximate the partition function by deleting edges.\n\n2 Preliminaries\n\nWe consider a pairwise model with n variables X1, . . . , Xn and graph topology (V,E): V contains nodes {1, . . . , n} where i corresponds to Xi, and E \u2286 V \u00d7 V contains an edge for each pairwise relationship. We sometimes consider multi-label models where each variable Xi takes values in {0, . . . , Li \u2212 1}, and sometimes restrict attention to binary models where Xi \u2208 B = {0, 1} \u2200i. Let x = (x1, . . . , xn) be a configuration of all the variables, and N (i) be the neighbors of i. 
For all analysis of binary models, to be consistent with Welling and Teh (2001) and Weller and Jebara (2013), we assume a reparameterization such that $p(x) = e^{-E(x)} / Z$, where the energy of a configuration is $E = -\\sum_{i \\in V} \\theta_i x_i - \\sum_{(i,j) \\in E} W_{ij} x_i x_j$, with singleton potentials \u03b8i and edge weights Wij.\n\n2.1 Clamping a variable and related definitions\n\nWe shall find it useful to examine sub-partition functions obtained by clamping one particular variable Xi, that is we consider the model on the n \u2212 1 variables X1, . . . , Xi\u22121, Xi+1, . . . , Xn obtained by setting Xi equal to one of its possible values.\n\nLet Z|Xi=a be the sub-partition function on the model obtained by setting Xi = a, a \u2208 {0, . . . , Li \u2212 1}. Observe that true partition functions and marginals are self-consistent in the following sense:\n\n$$Z = \\sum_{j=0}^{L_i - 1} Z|_{X_i = j} \\;\\; \\forall i \\in V, \\qquad p(X_i = a) = \\frac{Z|_{X_i = a}}{\\sum_{j=0}^{L_i - 1} Z|_{X_i = j}}. \\qquad (1)$$\n\nThis is not true in general for approximate forms of inference,\u00b9 but if the model has no cycles, then in many cases of interest, (1) does hold, motivating the following definition.\n\nDefinition 1. We say an approximation to the log-partition function ZA is ExactOnTrees if it may be specified by the variational formula \u2212 log ZA = min_{q \u2208 Q} FA(q) where: (1) Q is some compact space that includes the marginal polytope; (2) FA is a function of the (pseudo-)distribution q (typically a free energy approximation); and (3) For any model, whenever a subset of variables V' \u2286 V is clamped to particular values P = {pi \u2208 {0, . . . , Li \u2212 1}, \u2200Xi \u2208 V'}, i.e. \u2200Xi \u2208 V', we constrain Xi = pi, which we write as V' \u2190 P, and the remaining induced graph on V \\ V' is acyclic, then the approximation is exact, i.e. ZA|V'\u2190P = Z|V'\u2190P. Similarly, define an approximation to be in the broader class of NotSmallerOnTrees if it satisfies all of the above properties except that condition (3) is relaxed to ZA|V'\u2190P \u2265 Z|V'\u2190P. Note that the Bethe approximation is ExactOnTrees, and approximations such as TRW are NotSmallerOnTrees, in both cases whether using the marginal polytope or relaxations thereof, such as the cycle or local polytope (Weller et al., 2014).\n\n\u00b9 For example, consider a single cycle with positive edge weights. This has ZB < Z (Weller et al., 2014), yet after clamping any variable, each resulting sub-model is a tree hence the Bethe approximation is exact.\n\nWe shall derive bounds on ZA with the following idea: (i) Obtain upper or lower bounds on the approximation achieved by clamping and summing over the approximate sub-partition functions; (ii) Repeat until an acyclic graph is reached, where the approximation is either exact or bounded. We introduce the following related concept from graph theory.\n\nDefinition 2. A feedback vertex set (FVS) of a graph is a set of vertices whose removal leaves a graph without cycles. Determining if there exists a feedback vertex set of a given size is a classical NP-hard problem (Karp, 1972). There is a significant literature on determining the minimum cardinality of an FVS of a graph G, which we write as \u03bd(G). Further, if vertices are assigned non-negative weights, then a natural problem is to find an FVS with minimum weight, which we write as \u03bdw(G). An FVS with a factor 2 approximation to \u03bdw(G) may be found in time O(|V| + |E| log |E|) (Bafna et al., 1999). 
For pairwise multi-label MRFs, we may create a weighted graph from the topology by assigning each node i a weight of log Li, and then compute the corresponding \u03bdw(G).\n\n3 Lower Bound on Approximate Partition Functions\n\nWe obtain a lower bound on any approximation that is NotSmallerOnTrees by observing that ZA \u2265 ZA|Xn=j \u2200j from the definition (the sub-partition functions optimize over a subset).\n\nTheorem 3. If a pairwise MRF has topology with an FVS of size k and corresponding values L1, . . . , Lk, then for any approximation that is NotSmallerOnTrees, $Z_A \\geq Z / \\prod_{i=1}^{k} L_i$.\n\nProof. We proceed by induction on k. The base case k = 0 holds by the assumption that ZA is NotSmallerOnTrees. Now assume the result holds for k \u2212 1 and consider an MRF which requires k vertices to be deleted to become acyclic. Clamp variable Xk at each of its Lk values to create the approximation $Z_A^{(k)} := \\sum_{j=0}^{L_k - 1} Z_A|_{X_k = j}$. By the definition of NotSmallerOnTrees, ZA \u2265 ZA|Xk=j \u2200j; and by the inductive hypothesis, $Z_A|_{X_k = j} \\geq Z|_{X_k = j} / \\prod_{i=1}^{k-1} L_i$. Hence,\n\n$$L_k Z_A \\geq Z_A^{(k)} = \\sum_{j=0}^{L_k - 1} Z_A|_{X_k = j} \\geq \\frac{1}{\\prod_{i=1}^{k-1} L_i} \\sum_{j=0}^{L_k - 1} Z|_{X_k = j} = \\frac{Z}{\\prod_{i=1}^{k-1} L_i}.$$\n\nBy considering an FVS with minimum $\\prod_{i=1}^{k} L_i$, Theorem 3 is equivalent to the following result.\n\nTheorem 4. For any approximation that is NotSmallerOnTrees, $Z_A \\geq Z e^{-\\nu_w}$.\n\nThis bound applies to general multi-label models with any pairwise and singleton potentials (no need for attractive). The bound is trivial for a tree, but already for a binary model with one cycle we obtain that ZB \u2265 Z/2 for any potentials, even over the marginal polytope. 
The bound is tight, at least for uniform Li = L \u2200i.\u00b2 The bound depends only on the vertices that must be deleted to yield a graph with no cycles, not on the number of cycles (which clearly upper bounds \u03bd(G)). For binary models, exact inference takes time $\\Theta((|V| - |\\nu(G)|) 2^{\\nu(G)})$. Note that treewidth \u2264 \u03bd + 1.\n\n\u00b2 Given \u03bd, we can construct a model such that the bound is tight. For example, in the binary case: consider a sub-MRF on a cycle with no singleton potentials and uniform, very high attractive edge weights. This can be shown to have ZB \u2248 Z/2 (Weller et al., 2014). Now connect \u03bd of these together in a chain using very weak edges (this construction is due to N. Ruozzi).\n\n4 Attractive Binary Pairwise Models\n\nIn this Section, we restrict attention to the standard Bethe approximation. We shall use results derived in (Welling and Teh, 2001) and (Weller and Jebara, 2013), and adopt similar notation. The Bethe partition function, ZB, is defined as in Definition 1, where Q is set as the local polytope relaxation and FA is the Bethe free energy, given by F(q) = E_q(E) \u2212 S_B(q), where E is the energy and S_B is the Bethe pairwise entropy approximation (see Wainwright and Jordan, 2008 for details). We consider attractive binary pairwise models and apply similar clamping ideas to those used in \u00a73. In \u00a74.1 we show that clamping can never decrease the approximate Bethe partition function, then use this result in \u00a74.2 to prove that ZB \u2264 Z for this class of model. In deriving the clamping result of \u00a74.1, in Theorem 7 we show an interesting, stronger result on how the optimum Bethe free energy changes as the singleton marginal qi is varied over [0, 1].\n\n4.1 Clamping a variable can only increase the Bethe partition function\n\nLet ZB be the Bethe partition function for the original model. Clamp variable Xi and form the new approximation $Z_B^{(i)} = \\sum_{j=0}^{1} Z_B|_{X_i = j}$. In this Section, we shall prove the following Theorem.\n\nTheorem 5. For an attractive binary pairwise model and any variable Xi, $Z_B^{(i)} \\geq Z_B$.\n\nWe first introduce notation and derive preliminary results, which build to Theorem 7, our strongest result, from which Theorem 5 easily follows. Let q = (q1, . . . , qn) be a location in n-dimensional pseudomarginal space, i.e. qi is the singleton pseudomarginal q(Xi = 1) in the local polytope. Let F(q) be the Bethe free energy computed at q using Bethe optimum pairwise pseudomarginals given by the formula for q(Xi = 1, Xj = 1) = \u03beij(qi, qj, Wij) in (Welling and Teh, 2001), i.e. for an attractive model, for edge (i, j), \u03beij is the lower root of\n\n$$\\alpha_{ij} \\xi_{ij}^2 - [1 + \\alpha_{ij}(q_i + q_j)] \\xi_{ij} + (1 + \\alpha_{ij}) q_i q_j = 0, \\qquad (2)$$\n\nwhere $\\alpha_{ij} = e^{W_{ij}} - 1$, and Wij > 0 is the strength (associativity) of the log-potential edge weight.\n\nLet G(q) = \u2212F(q). Note that $\\log Z_B = \\max_{q \\in [0,1]^n} G(q)$. For any x \u2208 [0, 1], consider the optimum constrained by holding qi = x fixed, i.e. let $\\log Z_{Bi}(x) = \\max_{q \\in [0,1]^n : q_i = x} G(q)$. Let $r^*(x) = (r_1^*(x), \\ldots, r_{i-1}^*(x), r_{i+1}^*(x), \\ldots, r_n^*(x))$, with corresponding pairwise terms $\\{\\xi_{ij}^*\\}$, be an arg max for where this optimum occurs. Observe that ZBi(x) is the \u2018Bethe partition function constrained to qi = x\u2019, with log ZBi(0) = log ZB|Xi=0, log ZBi(1) = log ZB|Xi=1 and $\\log Z_B = \\log Z_{Bi}(q_i^*) = \\max_{q \\in [0,1]^n} G(q)$, where $q_i^*$ is a marginal of Xi at which the global optimum is achieved.\n\nTo prove Theorem 5, we need a sufficiently good upper bound on $\\log Z_{Bi}(q_i^*)$ compared to log ZBi(0) and log ZBi(1). First we demonstrate what such a bound could be, then prove that this holds. Let Si(x) = \u2212x log x \u2212 (1 \u2212 x) log(1 \u2212 x) be the standard singleton entropy.\n\nLemma 6 (Demonstrating what would be a sufficiently good upper bound on log ZB). If \u2203x \u2208 [0, 1] such that log ZB \u2264 x log ZBi(1) + (1 \u2212 x) log ZBi(0) + Si(x), then:\n(i) $Z_{Bi}(0) + Z_{Bi}(1) - Z_B \\geq e^m f_c(x)$ where $f_c(x) = 1 + e^c - e^{xc + S_i(x)}$, m = min(log ZBi(0), log ZBi(1)) and c = | log ZBi(1) \u2212 log ZBi(0)|; and\n(ii) \u2200x \u2208 [0, 1], fc(x) \u2265 0 with equality iff x = \u03c3(c) = 1/(1 + exp(\u2212c)), the sigmoid function.\n\nProof. (i) This follows easily from the assumption. (ii) This is easily checked by differentiating. It is also given in (Koller and Friedman, 2009, Proposition 11.8).\n\nSee Figure 6 in the Supplement for example plots of the function fc(x). Lemma 6 motivates us to consider if perhaps log ZBi(x) might be upper bounded by x log ZBi(1) + (1 \u2212 x) log ZBi(0) + Si(x), i.e. the linear interpolation between log ZBi(0) and log ZBi(1), plus the singleton entropy term Si(x). It is easily seen that this would be true if r\u2217(qi) were constant. In fact, we shall show that r\u2217(qi) varies in a particular way which yields the following, stronger result, which, together with Lemma 6, will prove Theorem 5.\n\nTheorem 7. Let Ai(qi) = log ZBi(qi) \u2212 Si(qi). For an attractive binary pairwise model, Ai(qi) is a convex function.\n\nProof. We outline the main points of the proof. Observe that $A_i(x) = \\max_{q \\in [0,1]^n : q_i = x} G(q) - S_i(x)$, where G(q) = \u2212F(q). Note that there may be multiple arg max locations r\u2217(x). As shown in (Weller and Jebara, 2013), F is at least thrice differentiable in (0, 1)^n and all stationary points lie in the interior (0, 1)^n. 
Given our conditions, the \u2018envelope theorem\u2019 of (Milgrom, 1999, Theorem 1) applies, showing that Ai is continuous in [0, 1] with right derivative\u00b3\n\n$$A'_{i+}(x) = \\max_{r^*(q_i = x)} \\frac{\\partial}{\\partial x} [G(q_i = x, r^*(x)) - S_i(x)] = \\max_{r^*(q_i = x)} \\frac{\\partial}{\\partial x} [G(q_i = x, r^*(x))] - \\frac{dS_i(x)}{dx}. \\qquad (3)$$\n\nWe shall show that this is non-decreasing, which is sufficient to show the convexity result of Theorem 7. To evaluate the right hand side of (3), we use the derivative shown by Welling and Teh (2001):\n\n$$\\frac{\\partial F}{\\partial q_i} = -\\theta_i + \\log Q_i, \\quad \\text{where } \\log Q_i = \\log \\frac{(1 - q_i)^{d_i - 1} \\prod_{j \\in N(i)} (q_i - \\xi_{ij})}{q_i^{d_i - 1} \\prod_{j \\in N(i)} (1 + \\xi_{ij} - q_i - q_j)} \\quad \\text{(as in Weller and Jebara, 2013)}$$\n\n$$= \\log \\frac{q_i}{1 - q_i} + \\log \\prod_{j \\in N(i)} Q_{ij}, \\quad \\text{here defining } Q_{ij} = \\left( \\frac{q_i - \\xi_{ij}}{1 + \\xi_{ij} - q_i - q_j} \\right) \\left( \\frac{1 - q_i}{q_i} \\right).$$\n\nFigure 1: 3d plots of $v_{ij} = Q_{ij}^{-1}$, using \u03beij(qi, qj, W) from (Welling and Teh, 2001). Panels: (a) W=1, (b) W=3, (c) W=10.\n\nA key observation is that the $\\log \\frac{q_i}{1 - q_i}$ term is exactly $-\\frac{dS_i(q_i)}{dq_i}$, and thus cancels the $-\\frac{dS_i(x)}{dx}$ term at the end of (3). Hence, $A'_{i+}(q_i) = \\max_{r^*(q_i)} [ -\\sum_{j \\in N(i)} \\log Q_{ij}(q_i, r_j^*, \\xi_{ij}^*) ]$.\u2074\n\nIt remains to show that this expression is non-decreasing with qi. We shall show something stronger, that at every arg max r\u2217(qi), and for all j \u2208 N (i), \u2212 log Qij is non-decreasing \u21d4 $v_{ij} = Q_{ij}^{-1}$ is non-decreasing. The result then follows since the max of non-decreasing functions is non-decreasing. See Figure 1 for example plots of the vij function, and observe that vij appears to decrease with qi (which is unhelpful here) while it increases with qj. 
Now, in an attractive model, the Bethe free energy is submodular, i.e. $\\frac{\\partial^2 F}{\\partial q_i \\partial q_j} \\leq 0$ (Weller and Jebara, 2013; Kor\u010d et al., 2012), hence as qi increases, $r_j^*(q_i)$ can only increase (Topkis, 1978). For our purpose, we must show that $\\frac{dr_j^*}{dq_i}$ is sufficiently large such that $\\frac{dv_{ij}}{dq_i} \\geq 0$. This forms the remainder of the proof.\n\nAt any particular arg max r\u2217(qi), writing $v = v_{ij}[q_i, r_j^*(q_i), \\xi_{ij}^*(q_i, r_j^*(q_i))]$, we have\n\n$$\\frac{dv}{dq_i} = \\frac{\\partial v}{\\partial q_i} + \\frac{\\partial v}{\\partial \\xi_{ij}} \\frac{d\\xi_{ij}^*}{dq_i} + \\frac{\\partial v}{\\partial q_j} \\frac{dr_j^*}{dq_i} = \\frac{\\partial v}{\\partial q_i} + \\frac{\\partial v}{\\partial \\xi_{ij}} \\frac{\\partial \\xi_{ij}^*}{\\partial q_i} + \\frac{dr_j^*}{dq_i} \\left( \\frac{\\partial v}{\\partial \\xi_{ij}} \\frac{\\partial \\xi_{ij}^*}{\\partial q_j} + \\frac{\\partial v}{\\partial q_j} \\right). \\qquad (4)$$\n\nFrom (Weller and Jebara, 2013), $\\frac{\\partial \\xi_{ij}}{\\partial q_i} = \\frac{\\alpha_{ij}(q_j - \\xi_{ij}) + q_j}{1 + \\alpha_{ij}(q_i - \\xi_{ij} + q_j - \\xi_{ij})}$ and similarly, $\\frac{\\partial \\xi_{ij}}{\\partial q_j} = \\frac{\\alpha_{ij}(q_i - \\xi_{ij}) + q_i}{1 + \\alpha_{ij}(q_j - \\xi_{ij} + q_i - \\xi_{ij})}$, where $\\alpha_{ij} = e^{W_{ij}} - 1$. The other partial derivatives are easily derived: $\\frac{\\partial v}{\\partial q_i} = \\frac{q_i(q_j - 1)(1 - q_i) + (1 + \\xi_{ij} - q_i - q_j)(q_i - \\xi_{ij})}{(1 - q_i)^2 (q_i - \\xi_{ij})^2}$, $\\frac{\\partial v}{\\partial \\xi_{ij}} = \\frac{q_i(1 - q_j)}{(1 - q_i)(q_i - \\xi_{ij})^2}$, and $\\frac{\\partial v}{\\partial q_j} = \\frac{-q_i}{(1 - q_i)(q_i - \\xi_{ij})}$.\n\nThe only remaining term needed for (4) is $\\frac{dr_j^*}{dq_i}$. The following results are proved in the Appendix, subject to a technical requirement that at an arg max, the reduced Hessian H\\i, i.e. the matrix of second partial derivatives of F after removing the ith row and column, must be non-singular in order to have an invertible locally linear function. Call this required property P. By nature, each H\\i is positive semi-definite. If needed, a small perturbation argument allows us to assume that no eigenvalue is 0, then in the limit as the perturbation tends to 0, Theorem 7 holds since the limit of convex functions is convex. Let [n] = {1, . . . , n} and G be the topology of the MRF.\n\n\u00b3 This result is similar to Danskin\u2019s theorem (Bertsekas, 1995). Intuitively, for multiple arg max locations, each may increase at a different rate, so here we must take the max of the derivatives over all the arg max.\n\n\u2074 We remark that Qij is the ratio $\\left( \\frac{p(X_i = 1, X_j = 0)}{p(X_i = 0, X_j = 0)} \\right) \\Big/ \\left( \\frac{p(X_i = 1)}{p(X_i = 0)} \\right) = \\frac{p(X_j = 0 | X_i = 1)}{p(X_j = 0 | X_i = 0)}$.\n\nTheorem 8. For any k \u2208 [n] \\ i, let Ck be the connected component of G \\ i that contains Xk. If Ck + i is a tree, then $\\frac{dr_k^*}{dq_i} = \\prod_{(s \\to t) \\in P(i \\leadsto k)} \\frac{\\xi_{st}^* - r_s^* r_t^*}{r_s^*(1 - r_s^*)}$, where $P(i \\leadsto k)$ is the unique path from i to k in Ck + i, and for notational convenience, define $r_i^* = q_i$. Proof in Appendix (subject to P).\n\nIndeed, Theorem 8 applies for any combination of attractive and repulsive edges. The result is remarkable, yet also intuitive. 
In the numerator, \u03best \u2212 qsqt = Cov_q(Xs, Xt), increasing with Wij and equal to 0 at Wij = 0 (Weller and Jebara, 2013), and in the denominator, qs(1 \u2212 qs) = Var_q(Xs), hence the ratio is exactly what is called in finance the beta of Xt with respect to Xs.\u2075\n\nIn particular, Theorem 8 shows that for any j \u2208 N (i) whose component is a tree, $\\frac{dr_j^*}{dq_i} = \\frac{\\xi_{ij}^* - q_i r_j^*}{q_i(1 - q_i)}$. The next result shows that in an attractive model, additional edges can only reinforce this sensitivity.\n\nTheorem 9. In an attractive model with edge (i, j), $\\frac{dr_j^*(q_i)}{dq_i} \\geq \\frac{\\xi_{ij}^* - q_i r_j^*}{q_i(1 - q_i)}$. Proof in Appendix (subject to P).\n\nNow collecting all terms, substituting into (4), and using (2), after some algebra yields that $\\frac{dv}{dq_i} \\geq 0$, as required to prove Theorem 7. This now also proves Theorem 5.\n\n\u2075 Sudderth et al. (2007) defined a different, symmetric $\\beta_{st} = \\frac{\\xi_{st} - q_s q_t}{q_s(1 - q_s) q_t(1 - q_t)}$ for analyzing loop series. In our context, we suggest that the ratio defined above may be a better Bethe beta.\n\n4.2 The Bethe partition function lower bounds the true partition function\n\nTheorem 5, together with an argument similar to the proof of Theorem 3, easily yields a new proof that ZB \u2264 Z for an attractive binary pairwise model.\n\nTheorem 10 (first proved by Ruozzi, 2012). For an attractive binary pairwise model, ZB \u2264 Z.\n\nProof. We shall use induction on k to show that the following statement holds for all k: If an MRF may be rendered acyclic by deleting k vertices v1, . . . , vk, then ZB \u2264 Z. The base case k = 0 holds since the Bethe approximation is ExactOnTrees. Now assume the result holds for k \u2212 1 and consider an MRF which requires k vertices to be deleted to become acyclic. Clamp variable Xk and consider $Z_B^{(k)} = \\sum_{j=0}^{1} Z_B|_{X_k = j}$. By Theorem 5, $Z_B \\leq Z_B^{(k)}$; and by the inductive hypothesis, ZB|Xk=j \u2264 Z|Xk=j \u2200j. Hence, $Z_B \\leq \\sum_{j=0}^{1} Z_B|_{X_k = j} \\leq \\sum_{j=0}^{1} Z|_{X_k = j} = Z$.\n\n5 Experiments\n\nFor an approximation which is ExactOnTrees, it is natural to try clamping a few variables to remove cycles from the topology. Here we run experiments on binary pairwise models to explore the potential benefit of clamping even just one variable, though the procedure can be repeated. For exact inference, we used the junction tree algorithm. For approximate inference, we used Frank-Wolfe (FW, Frank and Wolfe, 1956): At each iteration, a tangent hyperplane to the approximate free energy is computed at the current point, then a move is made to the best computed point along the line to the vertex of the local polytope with the optimum score on the hyperplane. This proceeds monotonically, even on a non-convex surface, hence will converge (since it is bounded), though it may be only to a local optimum and runtime is not guaranteed. This method typically produces good solutions in reasonable time compared to other approaches (Belanger et al., 2013; Weller et al., 2014) and allows direct comparison to earlier results (Meshi et al., 2009; Weller et al., 2014). To further facilitate comparison, in this Section we use the same unbiased reparameterization used by Weller et al. (2014), with $E = -\\sum_{i \\in V} \\theta_i x_i - \\sum_{(i,j) \\in E} \\frac{W_{ij}}{2} [x_i x_j + (1 - x_i)(1 - x_j)]$.\n\nTest models were constructed as follows: For n variables, singleton potentials were drawn \u03b8i \u223c U [\u2212Tmax, Tmax]; edge weights were drawn Wij \u223c U [0, Wmax] for attractive models, or Wij \u223c U [\u2212Wmax, Wmax] for general models. 
For models with random edges, we constructed Erd\u0151s-R\u00e9nyi random graphs (rejecting disconnected samples), where each edge has independent probability p of being present. To observe the effect of increasing n while maintaining approximately the same average degree, we examined n = 10, p = 0.5 and n = 50, p = 0.1. We also examined models on a complete graph topology with 10 variables for comparison with TRW in (Weller et al., 2014). 100 models were generated for each set of parameters with varying Tmax and Wmax values.\n\nResults are displayed in Figures 2 to 4 showing average absolute error of log ZB vs log Z and average \u21131 error of singleton marginals. The legend indicates the different methods used: Original is FW on the initial model; then various methods were used to select the variable to clamp, before running FW on the 2 resulting submodels and combining those results. avg Clamp for log Z means average over all possible clampings, whereas all Clamp for marginals computes each singleton marginal as the estimated $\\hat{p}_i = Z_B|_{X_i = 1} / (Z_B|_{X_i = 0} + Z_B|_{X_i = 1})$. best Clamp uses the variable which with hindsight gave the best improvement in log Z estimate, thereby showing the best possible result for log Z. Similarly, worst Clamp picks the variable which showed worst performance. Where one variable is clamped, the respective marginals are computed thus: for the clamped variable Xi, use $\\hat{p}_i$ as before; for all others, take the weighted average over the estimated Bethe pseudomarginals on each sub-model using weights $1 - \\hat{p}_i$ and $\\hat{p}_i$ for sub-models with Xi = 0 and Xi = 1 respectively.\n\nmaxW and Mpower are heuristics to try to pick a good variable in advance. Ideally, we would like to break heavy cycles, but searching for these is NP-hard. maxW is a simple O(|E|) method which picks a variable Xi with $\\max_{i \\in V} \\sum_{j \\in N(i)} |W_{ij}|$, and can be seen to perform well (Liu et al., 2012 proposed the same maxW approach for inference in Gaussian models). One way in which maxW can make a poor selection is to choose a variable at the centre of a large star configuration but far from any cycle. Mpower attempts to avoid this by considering the convergent series of powers of a modified W matrix, but on the examples shown, this did not perform significantly better. See \u00a78.1 in the Appendix for more details on Mpower and further experimental results.\n\nFW provides no runtime guarantee when optimizing over a non-convex surface such as the Bethe free energy, but across all parameters, the average combined runtime on the two clamped sub-models was the same order of magnitude as that for the original model, see Figure 5.\n\n6 Discussion\n\nThe results of \u00a74 immediately also apply to any binary pairwise model where a subset of variables may be flipped to yield an attractive model, i.e. where the topology is balanced with no frustrated cycles (Harary, 1953; Weller et al., 2014). For this class, together with the lower bound of \u00a73, we have sandwiched the range of ZB (equivalently, given ZB, we have sandwiched the range of the true partition function Z) and bounded its error; further, clamping any variable, solving for optimum log ZB on sub-models and summing is guaranteed to be more accurate than solving on the original model. In some cases, it may also be faster; indeed, some algorithms such as LBP may fail on the original model but perform well on clamped sub-models.\n\nMethods presented may prove useful for analyzing general (non-attractive) models, or for other applications. As one example, it is known that the Bethe free energy is convex for an MRF whose topology has at most one cycle (Pakzad and Anantharam, 2002). In analyzing the Hessian of the Bethe free energy, we are able to leverage this to show the following result, which may be useful for optimization (proof in Appendix; this result was conjectured by N. Ruozzi).\n\nLemma 11. 
In a binary pairwise MRF (attractive or repulsive edges, any topology), for any subset of variables S \u2286 V whose induced topology contains at most one cycle, the Bethe free energy (using optimum pairwise marginals) over S, holding variables V \\ S at fixed singleton marginals, is convex.\n\nIn \u00a75, clamping appears to be very helpful, especially for attractive models with low singleton potentials where results are excellent (overcoming TRW\u2019s advantage in this context), but also for general models, particularly with the simple maxW selection heuristic. We can observe some decline in benefit as n grows but this is not surprising when clamping just a single variable. Note, however, that non-attractive models exist such that clamping and summing over any variable can lead to a worse Bethe approximation of log Z, see Figure 5c for a simple example on four variables.\n\n(a) attractive log Z, Tmax = 0.1 (b) attractive margs, Tmax = 0.1 (c) general log Z, Tmax = 2 (d) general margs, Tmax = 2\n\nFigure 2: Average errors vs true, complete graph on n = 10. TRW in pink. Consistent legend throughout.\n\n(a) attractive log Z, Tmax = 0.1 (b) attractive margs, Tmax = 0.1 (c) general log Z, Tmax = 2 (d) general margs, Tmax = 2\n\nFigure 3: Average errors vs true, random graph on n = 10, p = 0.5. Consistent legend throughout.\n\n(a) attractive log Z, Tmax = 0.1 (b) attractive margs, Tmax = 0.1 (c) general log Z, Tmax = 2 (d) general margs, Tmax = 2\n\nFigure 4: Average errors vs true, random graph on n = 50, p = 0.1. Consistent legend throughout.\n\n(a) attractive random graphs (b) general random graphs (c) Blue (dashed red) edges are attractive (repulsive) with edge weight +2 (\u22122). No singleton potentials.\n\nFigure 5: Left: Average ratio of combined sub-model runtimes to original runtime (using maxW, other choices are similar). 
Right: Example model where clamping any variable worsens the Bethe approximation to log Z.\n\nIt will be interesting to explore the extent to which our results may be generalized beyond binary\npairwise models. Further, it is tempting to speculate that similar results may be found for other\napproximations. For example, some methods that upper bound the partition function, such as TRW,\nmight always yield a lower (hence better) approximation when a variable is clamped.\n\nAcknowledgments. We thank Nicholas Ruozzi for careful reading, and Nicholas, David Sontag,\nAryeh Kontorovich, David Yao, Frederik Eaton and Toma\u017e Slivnik for helpful discussion and comments.\nThis work was supported in part by NSF grants IIS-1117631 and CCF-1302269.\n\nReferences\nV. Bafna, P. Berman, and T. Fujito. A 2-approximation algorithm for the undirected feedback vertex set problem. SIAM Journal on Discrete Mathematics, 12(3):289\u20139, 1999.\nD. Belanger, D. Sheldon, and A. McCallum. Marginal inference in MRFs using Frank-Wolfe. In NIPS Workshop on Greedy Optimization, Frank-Wolfe and Friends, December 2013.\nD. Bertsekas. Nonlinear Programming. Athena Scientific, 1995.\nA. Choi and A. Darwiche. Approximating the partition function by deleting and then correcting for model edges. 
In Uncertainty in Artificial Intelligence (UAI), 2008.\n\n[Plots for Figures 2 to 5. x-axis: max interaction strength W. Legend: Original, all Clamp, avg Clamp, maxW Clamp, best Clamp, worst Clamp, Mpower, TRW. Figure 5 runtime curves: Random n=10, Tmax=2; Random n=10, Tmax=0.1; Random n=50, Tmax=2; Random n=50, Tmax=0.1.]\n\nA. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.\nF. Eaton and Z. Ghahramani. Choosing a variable to clamp: Approximate inference using conditioned belief propagation. In Artificial Intelligence and Statistics, 2009.\nK. Fan. Topological proofs for certain theorems on matrices with non-negative elements. Monatshefte f\u00fcr Mathematik, 62:219\u2013237, 1958.\nM. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95\u2013110, 1956. ISSN 1931-9193. doi: 10.1002/nav.3800030109.\nF. Harary. 
On the notion of balance of a signed graph. Michigan Mathematical Journal, 2:143\u2013146, 1953.\nR. Karp. Complexity of Computer Computations, chapter Reducibility Among Combinatorial Problems, pages 85\u2013103. New York: Plenum, 1972.\nD. Koller and N. Friedman. Probabilistic Graphical Models - Principles and Techniques. MIT Press, 2009.\nF. Kor\u010d, V. Kolmogorov, and C. Lampert. Approximating marginals using discrete energy minimization. Technical report, IST Austria, 2012.\nS. Lauritzen and D. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society series B, 50:157\u2013224, 1988.\nY. Liu, V. Chandrasekaran, A. Anandkumar, and A. Willsky. Feedback message passing for inference in Gaussian graphical models. IEEE Transactions on Signal Processing, 60(8):4135\u20134150, 2012.\nR. McEliece, D. MacKay, and J. Cheng. Turbo decoding as an instance of Pearl\u2019s \u201cBelief Propagation\u201d algorithm. IEEE Journal on Selected Areas in Communications, 16(2):140\u2013152, 1998.\nO. Meshi, A. Jaimovich, A. Globerson, and N. Friedman. Convexifying the Bethe free energy. In UAI, 2009.\nP. Milgrom. The envelope theorems. Department of Economics, Stanford University, Mimeo, 1999. URL http://www-siepr.stanford.edu/workp/swp99016.pdf.\nJ. Mitchell. Branch-and-cut algorithms for combinatorial optimization problems. Handbook of Applied Optimization, pages 65\u201377, 2002.\nK. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence (UAI), 1999.\nM. Padberg and G. Rinaldi. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Review, 33(1):60\u2013100, 1991.\nP. Pakzad and V. Anantharam. Belief propagation and statistical physics. In Princeton University, 2002.\nJ. Pearl. 
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.\nM. Peot and R. Shachter. Fusion and propagation with multiple observations in belief networks. Artificial Intelligence, 48(3):299\u2013318, 1991.\nI. Rish and R. Dechter. Resolution versus search: Two strategies for SAT. Journal of Automated Reasoning, 24(1-2):225\u2013275, 2000.\nN. Ruozzi. The Bethe partition function of log-supermodular graphical models. In Neural Information Processing Systems, 2012.\nE. Sudderth, M. Wainwright, and A. Willsky. Loop series and Bethe variational bounds in attractive graphical models. In NIPS, 2007.\nD. Topkis. Minimizing a submodular function on a lattice. Operations Research, 26(2):305\u2013321, 1978.\nP. Vontobel. Counting in graph covers: A combinatorial characterization of the Bethe entropy function. Information Theory, IEEE Transactions on, 59(9):6018\u20136048, Sept 2013. ISSN 0018-9448.\nM. Wainwright. Stochastic Processes on Graphs: Geometric and Variational Approaches. PhD thesis, MIT, EECS, 2002.\nM. Wainwright and M. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1\u2013305, 2008.\nM. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313\u20132335, 2005.\nA. Weller and T. Jebara. Bethe bounds and approximating the global optimum. In AISTATS, 2013.\nA. Weller and T. Jebara. Approximating the Bethe partition function. In UAI, 2014.\nA. Weller, K. Tang, D. Sontag, and T. Jebara. Understanding the Bethe approximation: When and how can it go wrong? In Uncertainty in Artificial Intelligence (UAI), 2014.\nM. Welling and Y. Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. 
In Uncertainty in Artificial Intelligence (UAI), 2001.\n", "award": [], "sourceid": 577, "authors": [{"given_name": "Adrian", "family_name": "Weller", "institution": "Columbia University"}, {"given_name": "Tony", "family_name": "Jebara", "institution": "Columbia University"}]}