{"title": "Uprooting and Rerooting Higher-Order Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 209, "page_last": 218, "abstract": "The idea of uprooting and rerooting graphical models was introduced specifically for binary pairwise models by Weller (2016) as a way to transform a model to any of a whole equivalence class of related models, such that inference on any one model yields inference results for all others. This is very helpful since inference, or relevant bounds, may be much easier to obtain or more accurate for some model in the class. Here we introduce methods to extend the approach to models with higher-order potentials and develop theoretical insights. In particular, we show that the triplet-consistent polytope TRI is unique in being `universally rooted'. We demonstrate empirically that rerooting can significantly improve accuracy of methods of inference for higher-order models at negligible computational cost.", "full_text": "Uprooting and Rerooting Higher-Order Graphical\n\nModels\n\nMark Rowland\u2217\n\nUniversity of Cambridge\n\nmr504@cam.ac.uk\n\nAdrian Weller\u2217\n\nUniversity of Cambridge and Alan Turing Institute\n\naw665@cam.ac.uk\n\nAbstract\n\nThe idea of uprooting and rerooting graphical models was introduced speci\ufb01cally\nfor binary pairwise models by Weller [19] as a way to transform a model to any\nof a whole equivalence class of related models, such that inference on any one\nmodel yields inference results for all others. This is very helpful since inference, or\nrelevant bounds, may be much easier to obtain or more accurate for some model\nin the class. Here we introduce methods to extend the approach to models with\nhigher-order potentials and develop theoretical insights. 
In particular, we show that the triplet-consistent polytope TRI is unique in being 'universally rooted'. We demonstrate empirically that rerooting can significantly improve accuracy of methods of inference for higher-order models at negligible computational cost.

1 Introduction

Undirected graphical models with discrete variables are a central tool in machine learning. In this paper, we focus on three canonical tasks of inference: identifying a configuration with highest probability (termed maximum a posteriori or MAP inference), computing marginal probabilities of subsets of variables (marginal inference) and calculating the normalizing constant (partition function). All three tasks are typically computationally intractable, leading to much work to identify settings where exact polynomial-time methods apply, or to develop approximate algorithms that perform well.
Weller [19] introduced an elegant method which first uproots and then reroots a given model M to any of a whole class of rerooted models {Mi}. The method relies on specific properties of binary pairwise models and makes use of an earlier construction which reduced MAP inference to the MAXCUT problem on the suspension graph ∇G [1; 2; 12; 19] (see §3 for details). For many important inference tasks, the rerooted models are equivalent in the sense that results for any one model yield results for all others with negligible computational cost. This can be very helpful since various models in the class may present very different computational difficulties for inference.
Here we show how the idea may be generalized to apply to models with higher-order potentials over any number of variables. Such models have many important applications, for example in computer vision [6] or modeling protein interactions [5]. As for pairwise models, we again obtain significant benefits for inference. 
We also develop a deeper theoretical understanding and derive important new results. We highlight the following contributions:
• In §3–§4, we show how to achieve efficient uprooting and rerooting of binary graphical models with potentials of any order, while still allowing easy recovery of inference results.
• In §5, to simplify the subsequent analysis, we introduce pure k-potentials for any order k, which may be of independent interest. We show that there is essentially only one pure k-potential, which we call the even k-potential, and that even k-potentials form a basis for all model potentials.
• In §6, we carefully analyze the effect of uprooting and rerooting on Sherali-Adams [11] relaxations Lr of the marginal polytope, for any order r. One surprising observation in §6.2 is that L3 (the triplet-consistent polytope or TRI) is unique in being universally rooted, in the sense that there is an affine score-preserving bijection between L3 for a model and L3 for each of its rerootings.
• In §7, our empirical results demonstrate that rerooting can significantly improve accuracy of inference in higher-order models. We introduce effective heuristics to choose a helpful rerooting.

∗Authors contributed equally.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Our observations have further implications for the many variational methods of marginal inference which optimize the sum of score and an entropy approximation over a Sherali-Adams polytope relaxation. 
These include the Bethe approximation (intimately related to belief propagation) and cluster extensions, tree-reweighted (TRW) approaches and logdet methods [12; 14; 16; 22; 24].

1.1 Background and discussion of theoretical contributions

Based on earlier connections in [2], [19] showed the remarkable result for pairwise models that the triplet-consistent polytope (L3 or TRI) is universally rooted (in the restricted sense defined in [19, Theorem 3]). This observation allowed straightforward strengthening of previously known results, for example: it was previously shown [23] that the LP relaxation on TRI (LP+TRI) is always tight for an 'almost-balanced' binary pairwise model, that is a model which can be rendered balanced by removing one variable [17]. Given [19, Theorem 3], this earlier result could immediately be significantly strengthened to [19, Theorem 4], which showed that LP+TRI is tight for a binary pairwise model provided only that some rerooting exists such that the rerooted model is almost balanced.
Following [19], it was natural to suspect that the universal rootedness property might hold for all (or at least some) Lr, r ≥ 3. This would have impact on work such as [10] which examines which signed minors must be forbidden to guarantee tightness of LP+L4. If L4 were universally rooted, then it would be possible to simplify significantly the analysis in [10].
Considering this issue led to our analysis of the mappings to symmetrized uprooted polytopes given in our Theorem 17. We believe this is the natural generalization of the lower order relationships of L2 and L3 to RMET and MET described in [2], though this direction was not clear initially.
With this formalism, together with the use of even potentials, we demonstrate our Theorems 20 and 21, showing that in fact TRI is unique in being universally rooted (and indeed in a stronger sense than given in [19]). 
We suggest that this result is surprising and may have further implications.
As a consequence, it is not possible to generate some quick theoretical wins by generalizing previous results as [19] did to derive their Theorem 4, but on the other hand we observe that rerooting may be helpful in practice for any approach using a Sherali-Adams relaxation other than L3. We verify the potential for significant benefits experimentally in §7.

2 Graphical models

A discrete graphical model M[G(V, E), (θE)E∈E] consists of: a hypergraph G = (V, E), which has n vertices V = {1, . . . , n} corresponding to the variables of the model, and hyperedges E ⊆ P(V), where P(V) is the powerset of V; together with potential functions (θE)E∈E over the hyperedges E ∈ E. We consider binary random variables (Xv)v∈V with each Xv ∈ Xv = {0, 1}. For a subset U ⊆ V, xU ∈ {0, 1}^U is a configuration of those variables (Xv)v∈U. We write x̄U for the flipping of xU, defined by x̄i = 1 − xi ∀i ∈ U. The joint probability mass function factors as follows, where the normalizing constant Z = \sum_{x_V \in \{0,1\}^V} \exp(\mathrm{score}(x_V)) is the partition function:

    p(x_V) = \frac{1}{Z} \exp(\mathrm{score}(x_V)), \qquad \mathrm{score}(x_V) = \sum_{E \in \mathcal{E}} \theta_E(x_E).    (1)

3 Uprooting and rerooting

Our goal is to map a model M to any of a whole family of models {Mi} in such a way that inference on any Mi will allow us easily to recover inference results on the original model M. 
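Before developing the mapping, the factorization (1) can be exercised by brute force on a tiny model. The sketch below (our own helper names, not code from the paper) stores each potential θE as a table over the configurations of E and enumerates all 2^|V| configurations to obtain Z:

```python
from itertools import product
import math

def score(x, potentials):
    """Score of a full configuration x (dict: variable -> {0,1}):
    the sum of each potential table over its hyperedge, as in eq. (1)."""
    return sum(table[tuple(x[v] for v in E)] for E, table in potentials.items())

def partition_function(variables, potentials):
    """Z = sum over all 2^|V| configurations of exp(score), as in eq. (1)."""
    Z = 0.0
    for values in product([0, 1], repeat=len(variables)):
        x = dict(zip(variables, values))
        Z += math.exp(score(x, potentials))
    return Z

# Tiny model on V = {1, 2, 3}: a pairwise potential on {1, 2}
# and a triplet potential on {1, 2, 3}.
V = [1, 2, 3]
theta = {
    (1, 2): {c: (0.5 if c[0] == c[1] else 0.0)
             for c in product([0, 1], repeat=2)},
    (1, 2, 3): {c: (1.0 if sum(c) % 2 == 0 else 0.0)
                for c in product([0, 1], repeat=3)},
}
Z = partition_function(V, theta)
```

This brute-force Z is of course exponential in |V|; it serves only as a ground-truth oracle against which the transformations of this section can be checked.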
In this section we provide our mapping, then in §4 we explain how to recover inference results for M.
The uprooting mechanism used by Weller [19] first reparametrizes edge potentials to the form θij(xi, xj) = −(1/2) Wij 1[xi ≠ xj], where 1[·] is the indicator function (a reparameterization modifies potential functions such that the complete score of each configuration is unchanged, see [15] for details). Next, singleton potentials are converted to edge potentials with this same form by connecting to an added variable X0. This mechanism had been used previously to reduce MAP inference on M to MAXCUT on the converted model [1; 12], and applies specifically only to binary pairwise models.
We introduce a generalized construction which applies to models with potentials of any order. We first uproot a model M to a highly symmetric uprooted model M+ where an extra variable X0 is added, in such a way that the original model M is exactly M+ with X0 clamped to the value 0. Since X0 is clamped to retrieve M, we may write M = M0 := M+|X0=0. Alternatively, we can choose instead to clamp a different variable Xi in M+, which will lead to the rerooted model Mi := M+|Xi=0.

[Figure 1 graphic: four hypergraph panels.] Figure 1: Left: The hypergraph G of a graphical model M over 4 variables, with potentials on the hyperedges {1, 2}, {1, 3, 4}, and {2, 4}. Center-left: The suspension hypergraph ∇G of the uprooted model M+. Center-right: The hypergraph ∇G \ {4} of the rerooted model M4 = M+|X4=0, i.e. M+ with X4 clamped to 0. Right: The hypergraph ∇G \ {2} of the rerooted model M2 = M+|X2=0, i.e. M+ with X2 clamped to 0.

Definition 1 (Clamping). 
For a graphical model M[G = (V, E), (θE)E∈E], and i ∈ V, the model M|Xi=a obtained by clamping the variable Xi to the value a ∈ Xi is given by: the hypergraph (V \ {i}, Ei), where Ei = {E \ {i} | E ∈ E}; and potentials which are unchanged for hyperedges which do not contain i, while if i ∈ E then θE\{i}(xE\{i}) = θE(xE\{i}, xi = a).
Definition 2 (Uprooting, suspension hypergraph). Given a model M[G(V, E), (θE)E∈E], the uprooted model M+ adds a variable X0, which is added to every hyperedge of the original model. M+ has hypergraph ∇G, with vertex set V+ = V ∪ {0} and hyperedge set E+ = {E+ = E ∪ {0} | E ∈ E}. ∇G is the suspension hypergraph of G. M+ has potential functions (θ+_{E∪{0}})E∈E given by

    \theta^+_{E \cup \{0\}}(x_{E \cup \{0\}}) = \begin{cases} \theta_E(x_E) & \text{if } x_0 = 0 \\ \theta_E(\bar{x}_E) & \text{if } x_0 = 1. \end{cases}

With this definition, all uprooted potentials are symmetric in that θ+_{E+}(x_{E+}) = θ+_{E+}(x̄_{E+}) ∀E+ ∈ E+.
Definition 3 (Rerooting). From Definition 2, we see that given a model M, if we uproot to M+ then clamp X0 = 0, we recover the original model M. If instead in M+ we clamp Xi = 0 for any i = 1, . . . , n, then we obtain the rerooted model Mi := M+|Xi=0.
See Figure 1 and Table 1 for examples of uprooting and rerooting. We explore the question of how to choose a good variable for rerooting (i.e. how to choose a good variable to clamp in M+) in §7.

4 Recovery of inference tasks

Here we demonstrate that the partition function, MAP score and configuration, and marginal distributions for a model M, can all be recovered from its uprooted model M+ or any rerooted model Mi, i ∈ V, with negligible computational cost. We write Vi = {0, 1, . . . , n} \ {i} for the variable set of rerooted model Mi; scorei(xVi) for the score of xVi in Mi; and pi for the probability distribution for Mi. 
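The constructions of Definitions 1–3, and the recovery properties demonstrated in this section, can be checked by brute-force enumeration on the Figure 1 model. A sketch, assuming table-valued potentials and our own helper names (the potential values are arbitrary):

```python
from itertools import product
import math

def uproot(potentials):
    """Definition 2 sketch: add variable 0 to every hyperedge; the new
    potential copies theta_E when x0 = 0 and theta_E of the flipped
    configuration when x0 = 1."""
    up = {}
    for E, table in potentials.items():
        Ep = (0,) + E
        up[Ep] = {}
        for c in product([0, 1], repeat=len(Ep)):
            rest = c[1:] if c[0] == 0 else tuple(1 - v for v in c[1:])
            up[Ep][c] = table[rest]
    return up

def clamp(potentials, i, a):
    """Definition 1 sketch: clamp variable i to value a.  Assumes the
    reduced hyperedges E \\ {i} stay distinct, as in the Figure 1 model."""
    out = {}
    for E, table in potentials.items():
        if i not in E:
            out[E] = dict(table)
            continue
        pos = E.index(i)
        Ei = E[:pos] + E[pos + 1:]
        out[Ei] = {c: table[c[:pos] + (a,) + c[pos:]]
                   for c in product([0, 1], repeat=len(Ei))}
    return out

def partition_function(variables, potentials):
    """Z by brute-force enumeration, as in eq. (1)."""
    Z = 0.0
    for vals in product([0, 1], repeat=len(variables)):
        x = dict(zip(variables, vals))
        Z += math.exp(sum(t[tuple(x[v] for v in E)]
                          for E, t in potentials.items()))
    return Z

# Model of Figure 1: variables 1..4, hyperedges {1,2}, {1,3,4}, {2,4},
# with arbitrary fixed potential tables.
theta = {E: {c: 0.2 * sum(c) + 0.3 * c[0]
             for c in product([0, 1], repeat=len(E))}
         for E in [(1, 2), (1, 3, 4), (2, 4)]}
theta_plus = uproot(theta)

# Clamping X0 = 0 in M+ recovers M exactly (Definition 3).
assert clamp(theta_plus, 0, 0) == theta

Z = partition_function([1, 2, 3, 4], theta)
Z_plus = partition_function([0, 1, 2, 3, 4], theta_plus)
```

Rerooting at any variable i ∈ V+ returns exactly the same partition function, matching the recovery claims of this section.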
We use superscript + to indicate the uprooted model. For example, the probability distribution for M+ is given by p^+(x_{V^+}) = \frac{1}{Z^+} \exp\big(\sum_{E \in \mathcal{E}^+} \theta_E(x_E)\big). From the definitions of §3, we obtain the following key lemma, which is critical to enable recovery of inference results.
Lemma 4 (Score-preserving map). Each configuration xV of M maps to 2 configurations of the uprooted M+ with the same score, i.e. from M, xV → in M+, both of (x0 = 0, xV) and (x0 = 1, x̄V), with score(xV) = score+(x0 = 0, xV) = score+(x0 = 1, x̄V). For any i ∈ V+, exactly one of the two uprooted configurations has xi = 0, and just this one will be selected in Mi. Hence, there is a score-preserving bijection between configurations of M and those of Mi:

    For any i ∈ V+:  in M, x_V ↔ in M_i, \begin{cases} (x_0 = 0, x_{V \setminus \{i\}}) & \text{if } x_i = 0 \\ (x_0 = 1, \bar{x}_{V \setminus \{i\}}) & \text{if } x_i = 1. \end{cases}    (2)

    M config        M+ configurations            M4 config
    (x1 x3 x4)      (x0 x1 x3 x4)                (x0 x1 x3)
    0 0 0           0 0 0 0   and   1 1 1 1      0 0 0
    0 0 1           0 0 0 1   and   1 1 1 0      1 1 1
    0 1 0           0 0 1 0   and   1 1 0 1      0 0 1
    0 1 1           0 0 1 1   and   1 1 0 0      1 1 0
    1 0 0           0 1 0 0   and   1 0 1 1      0 1 0
    1 0 1           0 1 0 1   and   1 0 1 0      1 0 1
    1 1 0           0 1 1 0   and   1 0 0 1      0 1 1
    1 1 1           0 1 1 1   and   1 0 0 0      1 0 0

Table 1: An illustration of how scores of potential θ134 on hyperedge {1, 3, 4} in an original model M map to potential θ0134 in M+ and then to θ013 in M4. See Figure 1 for the hypergraphs. Each row corresponds to one value of θ134(x1, x3, x4) for a different configuration (x1, x3, x4). Note that M+ has 2 configurations with each value, while after rerooting to M4, we again have exactly one configuration with each value. 
The 1-1 score-preserving map between configurations of M and any Mi is critical to enable recovery of inference results; see Lemma 4.

Table 1 illustrates this perhaps surprising result, from which the next two propositions follow.
Proposition 5 (Recovering the partition function). Given a model M[G(V, E), (θE)E∈E] with partition function Z as in (1), the partition function Z+ of the uprooted model M+ is twice Z, and the partition function of each rerooted model Mi is exactly Z, for any i ∈ V.
Proposition 6 (Recovering a MAP configuration). From M+: xV is an arg max for p iff (x0 = 0, xV) is an arg max for p+ iff (x0 = 1, x̄V) is an arg max for p+. From a rerooted model Mi: (xV\{i}, xi = 0) is an arg max for p iff (x0 = 0, xV\{i}) is an arg max for pi; (xV\{i}, xi = 1) is an arg max for p iff (x0 = 1, x̄V\{i}) is an arg max for pi.
We can recover marginals as shown in the following proposition, proof in the Appendix §9.1.
Proposition 7 (Recovering marginals). For a subset ∅ ≠ U ⊆ V, we can recover from M+:

    p(x_U) = p^+(x_0 = 0, x_U) + p^+(x_0 = 1, \bar{x}_U) = 2 p^+(x_0 = 0, x_U) = 2 p^+(x_0 = 1, \bar{x}_U).

To recover from a rerooted Mi: (i) For any i ∈ V \ U, p(xU) = pi(x0 = 0, xU) + pi(x0 = 1, x̄U). (ii) For any i ∈ U,

    p(x_U) = \begin{cases} p_i(x_0 = 0, x_{U \setminus \{i\}}) & x_i = 0 \\ p_i(x_0 = 1, \bar{x}_{U \setminus \{i\}}) & x_i = 1. \end{cases}

In §6, we provide a careful analysis of the impact of uprooting and rerooting on the Sherali-Adams hierarchy of relaxations of the marginal polytope [11]. We first introduce a way to parametrize potentials which will be particularly useful, and which may be of independent interest.

5 Pure k-potentials

We introduce the notion of pure k-potentials. 
These allow the specification of interactions which act 'purely' over a set of variables of a given size k, without influencing the distribution of any subsets. We show that in fact, there is essentially only one pure k-potential. Further, we show that one can express any θE potential in terms of pure potentials over E and subsets of E, and that pure potentials have appealing properties when uprooted and rerooted which help our subsequent analysis.
We say that a potential is a k-potential if k is the smallest number such that the score of the potential may be determined by considering the configuration of k variables. Usually a potential θE is a k-potential with k = |E|. For example, typically a singleton potential is a 1-potential, and an edge potential is a 2-potential. However, note that k < |E| is possible if one or more variables in E are not needed to establish the score (a simple example is θ12(x1, x2) = x1, which clearly is a 1-potential).
In general, a k-potential will affect the marginal distributions of all subsets of the k variables. For example, one popular form of 2-potential is θij(xi, xj) = Wij xi xj, which tends to pull Xi and Xj toward the same value, but also tends to increase each of p(Xi = 1) and p(Xj = 1). For pairwise models, a different reparameterization of potentials instead writes the score as

    \mathrm{score}(x_V) = \sum_{i \in V} \theta_i x_i + \sum_{(i,j) \in \mathcal{E}} \tfrac{1}{2} W_{ij} \mathbb{1}[x_i = x_j].    (3)

Expression (3) has the desirable feature that the θij(xi, xj) = (1/2) Wij 1[xi = xj] edge potentials affect only the pairwise marginals, without disturbing singleton marginals. This motivates the following definition.
Definition 8. Let k ≥ 2, and let U be a set of size k. 
We say that a k-potential θU : {0, 1}^U → R is a pure k-potential if the distribution induced by the potential, p(xU) ∝ exp(θU(xU)), has the property that for any ∅ ≠ W ⊊ U, the marginal distribution p(xW) is uniform.
We shall see in Proposition 10 that a pure k-potential must essentially be an even k-potential.
Definition 9. Let k ∈ N, and |U| = k. An even k-potential is a k-potential θU : {0, 1}^U → R of the form θU(xU) = a 1[ |{i ∈ U | xi = 1}| is even ], for some a ∈ R which is its coefficient. In words, θU(xU) takes value a if xU has an even number of 1s, else it takes value 0.
As an example, the 2-potential θij(xi, xj) = (1/2) Wij 1[xi = xj] in (3) is an even 2-potential with U = {i, j} and coefficient Wij/2. The next two propositions are proved in the Appendix §9.2.
Proposition 10 (All pure potentials are essentially even potentials). Let k ≥ 2, and |U| = k. If θU : {0, 1}^U → R is a pure k-potential then θU must be an affine function of the even k-potential, i.e. ∃ a, b ∈ R s.t. θU(xU) = a 1[ |{i ∈ U | xi = 1}| is even ] + b.
Proposition 11 (Even k-potentials form a basis). For a finite set U, the set of even k-potentials (1[ |{i ∈ W | Xi = 1}| is even ])_{W⊆U}, indexed by subsets W ⊆ U, forms a basis for the vector space of all potential functions θ : {0, 1}^U → R.
Any constant in a potential will be absorbed into the partition function Z and does not affect the probability distribution, see (1). An even 2-potential with positive coefficient, e.g. as in (3) if Wij > 0, is supermodular. Models with only supermodular potentials (equivalently, submodular cost functions) typically admit easier inference [3; 7]; if such a model is binary pairwise then it is called attractive. However, for k > 2, even k-potentials θE are neither supermodular nor submodular. Yet if 
Yet if\nk is an even number, observe that \u03b8E (xE ) = \u03b8E (xE ). We discuss this further in Appendix \u00a710.4.\nWhen a k-potential is uprooted, in general it may become a (k + 1)-potential (recall De\ufb01nition 2).\nThe following property of even k-potentials is helpful for our analysis in \u00a76, and is easily checked.\nLemma 12 (Uprooting an even k-potential). When an even k-potential \u03b8E with |E| = k is uprooted:\nif k is an even number, then the uprooted potential is exactly the same even k-potential; if k is odd,\nthen we obtain the even (k + 1)-potential over E \u222a {0} with the same coef\ufb01cient as the original \u03b8E.\n\n6 Marginal polytope and Sherali-Adams relaxations\n\nWe saw in Lemma 4 that there is a score-preserving 1-2 mapping from con\ufb01gurations of M to those\nof M +, and a bijection between con\ufb01gurations of M and any Mi. Here we examine the extent to\nwhich these score-preserving mappings extend to (pseudo-)marginal probability distributions over\nvariables by considering the Sherali-Adams relaxations [11] of the respective marginal polytopes.\nThese relaxations feature prominently in many approaches for MAP and marginal inference.\nFor U \u2286 V , we write \u00b5U for a probability distribution in P({0, 1}U ), the set of all probability\ndistributions on {0, 1}U . Bold \u00b5 will represent a collection of measures over various subsets of\nvariables. Given (1), to compute an expected score, we need (\u00b5E )E\u2208E. This motivates the following.\nDe\ufb01nition 13. The marginal polytope M(G(V, E)) = {(\u00b5E )E\u2208E\nwhere for U1 \u2286 U2 \u2286 V , \u00b5U2\u2193U1 denotes the marginalization of \u00b5U2 \u2208 P({0, 1}U2) onto {0, 1}U1.\nM(G) consists of marginal distributions for every hyperedge E \u2208 E such that all the marginals are\nconsistent with a global distribution over all variables V . Methods of variational inference typically\n\n(cid:12)(cid:12)\u2203\u00b5V s.t. 
\u00b5V E = \u00b5E \u2200E \u2208 E},\n\n5\n\n\foptimize either the score (for MAP inference) or the score plus an entropy term (for marginal\ninference) over a relaxation of the marginal polytope [15]. This is because M(G) is computationally\nintractable, with an exponential number of facets [2]. Relaxations from the Sherali-Adams hierarchy\n[11] are often used, requiring consistency only over smaller clusters of variables.\nDe\ufb01nition 14. Given an integer r \u2265 2, if a hypergraph G(V, E) satis\ufb01es maxE\u2208E |E| \u2264 r \u2264 |V |,\nthen we say that G is r-admissible, and de\ufb01ne the Sherali-Adams polytope of order r on G by\n\nLr(G) =\n\n(\u00b5E )E\u2208E\n\nlocally consistent, s.t. \u00b5U\u2193E = \u00b5E \u2200 E \u2286 U \u2286 V,\n\n|U| = r\n\n,\n\n(cid:26)\n\n(cid:12)(cid:12)(cid:12)(cid:12)\u2203(\u00b5U ) U\u2286V\n\n|U|=r\n\n(cid:27)\n\nwhere a collection of measures (\u00b5A)A\u2208I (for some set I of subsets of V ) is locally consistent, or l.c.,\nif for any A1, A2 \u2208 I, we have \u00b5A1\u2193A1\u2229A2 = \u00b5A2\u2193A1\u2229A2. Each element of Lr(G) is a set of locally\nconsistent probability measures over the hyperedges. Note that M(G) \u2286 Lr(G) \u2286 Lr\u22121(G). The\npairwise relaxation L2(G) is commonly used but higher-order relaxations achieve greater accuracy,\nhave received signi\ufb01cant attention [10; 13; 18; 22; 23], and are required for higher-order potentials.\n\n6.1 The impact of uprooting and rerooting on Sherali-Adams polytopes\n\nWe introduce two variants of the Sherali-Adams polytopes which will be helpful in analyzing\nuprooted models. For a measure \u00b5U \u2208 P({0, 1}U ), we de\ufb01ne the \ufb02ipped measure \u00b5U as \u00b5U (xU ) =\n\u00b5U (xU ) \u2200xU \u2208 {0, 1}U . A measure \u00b5U is \ufb02ipping-invariant if \u00b5U = \u00b5U .\nDe\ufb01nition 15. 
The symmetrized Sherali-Adams polytope for an uprooted hypergraph ∇G(V+, E+) (as given in Definition 2) is:

    \tilde{L}_r(\nabla G) = \big\{ (\mu_E)_{E \in \mathcal{E}^+} \in L_r(\nabla G) \,\big|\, \mu_E = \bar{\mu}_E \;\; \forall E \in \mathcal{E}^+ \big\}.

Definition 16. For any i ∈ V+, and any integer r ≥ 2 such that max_{E∈E+} |E| ≤ r ≤ |V+|, we define the symmetrized Sherali-Adams polytope of order r uprooted at i to be

    \tilde{L}^i_r(\nabla G) = \big\{ (\mu_E)_{E \in \mathcal{E}^+} \,\big|\, \exists (\mu_U)_{i \in U \subseteq V^+, |U| = r} \text{ l.c., s.t. } \mu_{U \downarrow E} = \mu_E \;\; \forall\, E \subseteq U \subseteq V^+, |U| = r, i \in U; \;\; \mu_U = \bar{\mu}_U \;\; \forall\, U \subseteq V^+, |U| = r, i \in U \big\}.

Thus, for each collection of measures over hyperedges in L̃^i_r(∇G), there exist corresponding flipping-invariant, locally consistent measures on sets of size r which contain i (and their subsets). Note that for any hypergraph G(V, E) and any i ∈ V+, we have L̃_{r+1}(∇G) ⊆ L̃^i_{r+1}(∇G) ⊆ L̃_r(∇G).
We next extend the correspondence of Lemma 4 to collections of locally-consistent probability distributions on the hyperedges of G, see the Appendix §9.3 for proof.
Theorem 17. 
For a hypergraph G(V, E), and integer r such that max_{E∈E} |E| ≤ r ≤ |V|, there is an affine score-preserving bijection

    L_r(G) \;\rightleftarrows\; \tilde{L}^0_{r+1}(\nabla G),

with the forward map Uproot and its inverse RootAt0.
Theorem 17 establishes the following diagram of polytope inclusions and affine bijections:

    For M = M0:   L_{r+1}(G)            ⊆   (unnamed)            ⊆   L_r(G)
                  ↕ Uproot / RootAt0        ↕ Uproot / RootAt0       ↕ Uproot / RootAt0
    For M+:       L̃^0_{r+2}(∇G)         ⊆   L̃_{r+1}(∇G)          ⊆   L̃^0_{r+1}(∇G).    (4)

A question of theoretical interest and practical importance is which of the inclusions in (4) are strict. Our perspective here generalizes earlier work. Using different language, Deza and Laurent [2] identified L2(G) with L̃^0_3(∇G), which was termed RMET, the rooted semimetric polytope; and L̃_3(∇G) with MET, the semimetric polytope. Building on this, Weller [19] considered L3(G), the triplet-consistent polytope or TRI, though only in the context of pairwise potentials, and showed that L3(G) has the remarkable property that if it is used to optimize an LP for a model M on G, the exact same optimum is achieved for L3(Gi) for any rerooting Mi. It was natural to conjecture that Lr(G) might have this same property for all r > 3, yet this was left as an open question.

6.2 L3 is unique in being universally rooted

We shall first strengthen [19] to show that L3 is universally rooted in the following stronger sense.
Definition 18. 
We say that the rth-order Sherali-Adams relaxation is universally rooted (and write "Lr is universally rooted" for short) if for all admissible hypergraphs G, there is an affine score-preserving bijection between Lr(G) and Lr(Gi), for each rerooted hypergraph (Gi)i∈V.
If Lr is universally rooted, this applies for potentials over up to r variables (the maximum which makes sense in this context), and clearly it implies that optimizing score over any rerooting (as in MAP inference) will attain the same objective. The following result is proved in the Appendix §9.3.
Lemma 19. If Lr is universally rooted for hypergraphs of maximum hyperedge degree p < r with p even, then Lr is also universally rooted for r-admissible hypergraphs with maximum degree p + 1.
The proof relies on mapping to the symmetrized uprooted polytope L̃^0_{r+1}(∇G). Then by considering marginals using a basis equivalent to that described in Proposition 11 for even k-potentials, we observe that the symmetry of the polytope enforces only one possible marginal for (p + 1)-clusters.
Combining Lemma 19 with arguments which extend those used by [19] demonstrates the following result, proved in the Appendix.
Theorem 20. L3 is universally rooted.

We next provide a striking and rather surprising result, see the Appendix for proof and details.
Theorem 21. L3 is unique in being universally rooted. Specifically, for any integer r > 1 other than r = 3, we constructively demonstrate a hypergraph G(V, E) with |V| = r + 1 variables for which L̃^0_{r+1}(∇G) ≠ L̃^i_{r+1}(∇G) for any i ∈ V.
Theorem 21 examines L̃^0_{r+1}(∇G) and L̃^i_{r+1}(∇G), which by Theorem 17 are the uprooted equivalents of Lr(G) and Lr(Gi). It might appear more satisfying to try to demonstrate the result directly for the rooted polytopes, i.e. to show Lr(G) ≠ Lr(Gi). 
However, in general the rooted polytopes are not comparable: an r-potential in M can map to an (r + 1)-potential in M+ and then to an (r + 1)-potential in Mi which cannot be evaluated for an Lr polytope.
Theorem 21 shows that we may hope for benefits from rerooting for any inference method based on a Sherali-Adams relaxed polytope Lr, unless r = 3.

7 Experiments

Here we show empirically the benefits of uprooting and rerooting for approximate inference methods in models with higher-order potentials. We introduce an efficient heuristic which can be used in practice to select a variable for rerooting, and demonstrate its effectiveness.
We compared performance after different rerootings of marginal inference (to guarantee convergence we used the double loop method of Heskes et al. [4], which relates to generalized belief propagation [24]) and MAP inference (using loopy belief propagation, LBP [9]). For true values, we used the junction tree algorithm. All methods were implemented using libDAI [8]. We ran experiments on complete hypergraphs (with 8 variables) and toroidal grid models (5 × 5 variables). Potentials up to order 4 were selected randomly, by drawing even k-potentials from Unif([−Wmax, Wmax]) distributions for a variety of Wmax parameters, as shown in Figure 2, which highlights results for estimating log Z. For each regime of maximum potential values, we plot results averaged over 20 runs. For additional details and results, including marginals, other potential choices and larger models, see Appendix §10.
We display average error of the inference method applied to: the original model M; the uprooted model M+; then rerootings at: the worst variable, the best variable, the K heuristic variable, and the G heuristic variable. 
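The random test models just described can be generated in outline as follows. This is our own sketch, not the paper's experiment code; the helper names and seed are ours, and the default Wmax values mirror the defaults stated in the Figure 2 caption (0 for k = 1, 3 and 8 for k = 2, 4):

```python
import random
from itertools import combinations, product

def even_potential(E, a):
    """Even k-potential on hyperedge E (Definition 9): value a when the
    configuration has an even number of 1s, else 0."""
    return {c: (a if sum(c) % 2 == 0 else 0.0)
            for c in product([0, 1], repeat=len(E))}

def random_complete_model(n=8, wmax=None, seed=0):
    """Random model on the complete hypergraph over n variables: one even
    k-potential per hyperedge for k = 1..4, with coefficient drawn from
    Unif([-Wmax, Wmax]).  Hypothetical helper, not the paper's code."""
    if wmax is None:
        wmax = {1: 0.0, 2: 8.0, 3: 0.0, 4: 8.0}  # Figure 2 defaults
    rng = random.Random(seed)
    pots = {}
    for k, w in sorted(wmax.items()):
        for E in combinations(range(1, n + 1), k):
            pots[E] = even_potential(E, rng.uniform(-w, w))
    return pots

model = random_complete_model()
```

A toroidal 5 × 5 grid variant would differ only in which hyperedges are enumerated.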
Best and worst always refer to the variable at which rerooting gave with hindsight the best and worst error for the partition function (even in plots for other measures).

7.1 Heuristics to pick a good variable for rerooting

From our Definition 3, a rerooted model Mi is obtained by clamping the uprooted model M+ at variable Xi. Hence, selecting a good variable for rerooting is exactly the choice of a good variable to clamp in M+. Considering pairwise models, Weller [19] refined the maxW method [20; 21] to introduce the maxtW heuristic, and showed that it was very effective empirically. maxtW selects the variable Xi with max_i \sum_{j \in N(i)} \tanh |W_{ij}/4|, where N(i) is the set of neighbors of i in the model graph, and Wij is the strength of the pairwise interaction.
The intuition for maxtW is as follows. Pairwise methods of approximate inference such as Bethe are exact for models with no cycles. If we could, we would like to 'break' tight cycles with strong edge weights, since these lead to error. When a variable is clamped, it is effectively removed from the model. Hence, we would like to reroot at a variable that sits on many cycles with strong edge weights. Identifying such cycles is NP-hard, but the maxtW heuristic attempts to do this by looking only locally around each variable. Further, the effect of a strong edge weight saturates [21]: a very strong edge weight Wij effectively 'locks' its end variables (either together or opposite depending on the sign of Wij), and this effect cannot be significantly increased even by an extremely strong edge. Hence the tanh function was introduced to the earlier maxW method, leading to the maxtW heuristic.
As observed in §5, if we express our model potentials in terms of pure k-potentials, then the uprooted model will only have pure k-potentials for various values of k which are even numbers. 
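Expressing model potentials in the even-potential basis is exactly what Proposition 11 guarantees is possible. One way to compute the coefficients (our sketch; the paper does not spell out an algorithm) uses the identity 1[|x restricted to W| has even parity] = (1 + Π_{i∈W}(−1)^{x_i})/2, which turns the change of basis into a parity (Walsh–Hadamard-style) transform:

```python
from itertools import product, combinations

def even_basis_coefficients(theta, k):
    """Expand a potential table (config tuple -> value, over k binary
    variables) in the even-potential basis of Proposition 11, i.e. find
    a_W with theta(x) = sum_W a_W * 1[#1s of x restricted to W is even].
    For W != {} the coefficient is twice the Fourier coefficient of theta
    at W; the empty subset's basis element is the constant 1."""
    n = 2 ** k
    configs = list(product([0, 1], repeat=k))
    subsets = [W for r in range(k + 1) for W in combinations(range(k), r)]

    def fhat(W):  # Fourier coefficient of theta at W
        return sum(theta[x] * (-1) ** sum(x[i] for i in W)
                   for x in configs) / n

    coeffs = {W: 2 * fhat(W) for W in subsets if W}
    coeffs[()] = fhat(()) - sum(fhat(W) for W in subsets if W)
    return coeffs

def reconstruct(coeffs, k):
    """Inverse: rebuild the potential table from even-basis coefficients."""
    return {x: sum(a * (1 if sum(x[i] for i in W) % 2 == 0 else 0)
                   for W, a in coeffs.items())
            for x in product([0, 1], repeat=k)}

# The even 2-potential (1/2) W_ij 1[x_i = x_j] from (3), here with
# coefficient 1.5, should expand with a single nonzero a_W on W = {0, 1}.
theta2 = {c: (1.5 if c[0] == c[1] else 0.0)
          for c in product([0, 1], repeat=2)}
coeffs2 = even_basis_coefficients(theta2, 2)
```

The round trip `reconstruct(even_basis_coefficients(t, k), k)` recovers any table `t`, which is one concrete check of the basis property.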
Intuitively, the higher the coefficients on these potentials, the more tightly connected the model is, leading to more challenging inference. Hence, a natural way to generalize the maxtW approach to handle higher-order potentials is to pick the variable Xi in M+ which maximizes the following measure:

clamp-heuristic-measure(i) = Σ_{E : i∈E, |E|=2} c2 tanh|t2 aE| + Σ_{E : i∈E, |E|=4} c4 tanh|t4 aE|,   (5)

where aE is the coefficient (weight) of the relevant pure k-potential (see Definition 9), and the {c2, t2}, {c4, t4} terms are constants for pure 2-potentials and for pure 4-potentials respectively. This approach extends to potentials of higher orders by adding similar further terms. Since our goal is to rank the measures for each i ∈ V+, without loss of generality we take c2 = 1. We fit the t2, c4 and t4 constants to the data from our experimental runs; see the Appendix for details. Our K heuristic was fit only to runs for complete hypergraphs, while the G heuristic was fit only to runs for models on grids.

7.2 Observations on results

Considering all results across models and approximate methods for estimating log Z, marginals and MAP inference (see Figure 2 and Appendix §10.3), we make the following observations. Both K and G heuristics perform well (in and out of sample): they never hurt materially and often significantly improve accuracy, attaining results close to the best possible rerooting. Since our two heuristics achieve similar performance, sensitivity to the exact constants in (5) appears low. We verified this by comparing to maxtW for pairwise models as in [19]: both K and G heuristics performed just slightly better than maxtW.
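The generalized measure (5) can be sketched directly. In this illustration the constants default to placeholder values, not the fitted K or G values from the paper's experiments; potentials are given as a dict from variable subsets to pure k-potential coefficients aE, as in the paper's even k-potential representation:

```python
import math

def clamp_heuristic(potentials, variables, c2=1.0, t2=0.25, c4=1.0, t4=0.25):
    """Score each variable i by measure (5): the sum, over pure
    k-potentials E containing i with k in {2, 4}, of c_k * tanh|t_k * a_E|;
    return the argmax. Constants here are illustrative placeholders."""
    consts = {2: (c2, t2), 4: (c4, t4)}

    def score(i):
        return sum(c * math.tanh(abs(t * a))
                   for E, a in potentials.items()
                   if i in E and len(E) in consts
                   for c, t in (consts[len(E)],))

    return max(variables, key=score)
```

Extending to higher even orders just means adding further (c_k, t_k) entries to `consts`, mirroring the "similar further terms" mentioned above.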
For all our runs, inference on rerooted models took similar time as on the original model (the time required to reroot and later to map back inference results is negligible); see §10.3.1.

Observe that stronger 1-potentials tend to make inference easier, pulling each variable toward a specific setting, and reducing the benefits from rerooting (left column of Figure 2). Stronger pure k-potentials for k > 1 intertwine variables more tightly: this typically makes inference harder and increases the gains in accuracy from rerooting. The pure k-potential perspective facilitates this analysis.

When we examine larger models, or models with still higher-order potentials, we observe qualitatively similar results; see Appendix §10.3.4 and §10.3.6.

8 Conclusion

We introduced methods which broaden the application of the uprooting and rerooting approach to binary models with higher-order potentials of any order. We demonstrated several important theoretical insights, including Theorems 20 and 21 which show that L3 is unique in being universally rooted. We developed the helpful tool of even k-potentials in §5, which may be of independent interest.

Figure 2: Error in estimating log Z for random models with various pure k-potentials over 20 runs. Top row: average abs(error) in log Z for K8 complete hypergraphs (fully connected) on 8 variables; bottom row: grids on 5 × 5 variables (toroidal). Columns vary Wmax for 1-potentials, 2-potentials, 3-potentials and 4-potentials respectively; legends are consistent across all plots. If not shown, the Wmax coefficients for pure k-potentials are 0 for k = 1, 8 for k = 2, 0 for k = 3, 8 for k = 4. Where the red K heuristic curve is not visible, it coincides with the green G heuristic. Both K and G heuristics for selecting a rerooting work well: they never hurt and often yield large benefits. See §7 for details.
We empirically demonstrated significant benefits from rerooting in higher-order models – particularly for the hard case of strong cluster potentials and weak 1-potentials – and provided an efficient heuristic to select a variable for rerooting. This heuristic is also useful to indicate when rerooting is unlikely to be helpful for a given model (if (5) is maximized by taking i = 0).

It is natural to compare the effect of rerooting M to Mi against simply clamping Xi in the original model M. A key difference is that rerooting achieves the clamping at Xi for negligible computational cost. In contrast, if Xi is clamped in the original model then the inference method must be run twice, once clamping Xi = 0 and once clamping Xi = 1, and the results must then be combined. This is avoided with rerooting, given the symmetry of M+. Rerooting effectively replaces what may be a poor initial implicit choice of clamping at X0 with a carefully selected choice of clamping variable, almost for free. This is true even for large models where it may be advantageous to clamp a series of variables: by rerooting, one of the series is obtained for free, potentially gaining significant benefit with little work required. Note that each separate connected component may be handled independently, with its own added variable. This could be useful for (repeatedly) composing clamping and then rerooting each separated component, to obtain an almost free clamping in each.

Acknowledgements

We thank Aldo Pacchiano for helpful discussions, and the anonymous reviewers for helpful comments. MR acknowledges support from the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis. AW acknowledges support from the Alan Turing Institute under the EPSRC grant EP/N510129/1, and from the Leverhulme Trust via the CFI.

References

[1] F.
Barahona, M. Grötschel, M. Jünger, and G. Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493–513, 1988.

[2] M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springer Publishing Company, Incorporated, 1st edition, 1997. ISBN 978-3-642-04294-2.

[3] J. Djolonga and A. Krause. Scalable variational inference in log-supermodular models. In ICML, pages 1804–1813, 2015.

[4] T. Heskes, K. Albers, and B. Kappen. Approximate inference and constrained optimization. In UAI, pages 313–320, 2003.

[5] A. Jaimovich, G. Elidan, H. Margalit, and N. Friedman. Towards an integrated protein–protein interaction network: A relational Markov network approach. Journal of Computational Biology, 13(2):145–164, 2006.

[6] P. Kohli, L. Ladicky, and P. Torr. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009.

[7] V. Kolmogorov, J. Thapper, and S. Živný. The power of linear programming for general-valued CSPs. SIAM Journal on Computing, 44(1):1–36, 2015.

[8] J. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169–2173, August 2010. URL http://www.jmlr.org/papers/volume11/mooij10a/mooij10a.pdf.

[9] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[10] M. Rowland, A. Pacchiano, and A. Weller. Conditions beyond treewidth for tightness of higher-order LP relaxations. In Artificial Intelligence and Statistics (AISTATS), 2017.

[11] H. Sherali and W. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411–430, 1990.

[12] D.
Sontag. Cutting plane algorithms for variational inference in graphical models. Master's thesis, MIT, EECS, 2007.

[13] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In NIPS, 2007.

[14] M. Wainwright and M. Jordan. Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 2006.

[15] M. Wainwright and M. Jordan. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[16] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.

[17] A. Weller. Revisiting the limits of MAP inference by MWSS on perfect graphs. In AISTATS, 2015.

[18] A. Weller. Characterizing tightness of LP relaxations by forbidding signed minors. In UAI, 2016.

[19] A. Weller. Uprooting and rerooting graphical models. In International Conference on Machine Learning (ICML), 2016.

[20] A. Weller and J. Domke. Clamping improves TRW and mean field approximations. In Artificial Intelligence and Statistics (AISTATS), 2016.

[21] A. Weller and T. Jebara. Clamping variables and approximate inference. In Neural Information Processing Systems (NIPS), 2014.

[22] A. Weller, K. Tang, D. Sontag, and T. Jebara. Understanding the Bethe approximation: When and how can it go wrong? In Uncertainty in Artificial Intelligence (UAI), 2014.

[23] A. Weller, M. Rowland, and D. Sontag. Tightness of LP relaxations for almost balanced models. In Artificial Intelligence and Statistics (AISTATS), 2016.

[24] J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans.
Information Theory, pages 2282–2312, 2005.