{"title": "Inference in Graphical Models via Semidefinite Programming Hierarchies", "book": "Advances in Neural Information Processing Systems", "page_first": 417, "page_last": 425, "abstract": "Maximum A posteriori Probability (MAP) inference in graphical models amounts to solving a graph-structured combinatorial optimization problem. Popular inference algorithms such as belief propagation (BP) and generalized belief propagation (GBP) are intimately related to linear programming (LP) relaxation within the Sherali-Adams hierarchy. Despite the popularity of these algorithms, it is well understood that the Sum-of-Squares (SOS) hierarchy based on semidefinite programming (SDP) can provide superior guarantees. Unfortunately, SOS relaxations for a graph with $n$ vertices require solving an SDP with $n^{\\Theta(d)}$ variables where $d$ is the degree in the hierarchy. In practice, for $d\\ge 4$, this approach does not scale beyond a few tens of variables. In this paper, we propose binary SDP relaxations for MAP inference using the SOS hierarchy with two innovations focused on computational efficiency. Firstly, in analogy to BP and its variants, we only introduce decision variables corresponding to contiguous regions in the graphical model. Secondly, we solve the resulting SDP using a non-convex Burer-Monteiro style method, and develop a sequential rounding procedure. We demonstrate that the resulting algorithm can solve problems with tens of thousands of variables within minutes, and outperforms BP and GBP on practical problems such as image denoising and Ising spin glasses. Finally, for specific graph types, we establish a sufficient condition for the tightness of the proposed partial SOS relaxation.", "full_text": "Inference in Graphical Models\n\nvia Semide\ufb01nite Programming Hierarchies\n\nMurat A. 
Erdogdu
Microsoft Research
erdogdu@cs.toronto.edu

Yash Deshpande
MIT and Microsoft Research
yash@mit.edu

Andrea Montanari
Stanford University
montanari@stanford.edu

Abstract

Maximum A posteriori Probability (MAP) inference in graphical models amounts to solving a graph-structured combinatorial optimization problem. Popular inference algorithms such as belief propagation (BP) and generalized belief propagation (GBP) are intimately related to linear programming (LP) relaxation within the Sherali-Adams hierarchy. Despite the popularity of these algorithms, it is well understood that the Sum-of-Squares (SOS) hierarchy based on semidefinite programming (SDP) can provide superior guarantees. Unfortunately, SOS relaxations for a graph with n vertices require solving an SDP with n^{Θ(d)} variables, where d is the degree in the hierarchy. In practice, for d ≥ 4, this approach does not scale beyond a few tens of variables. In this paper, we propose binary SDP relaxations for MAP inference using the SOS hierarchy with two innovations focused on computational efficiency. Firstly, in analogy to BP and its variants, we only introduce decision variables corresponding to contiguous regions in the graphical model. Secondly, we solve the resulting SDP using a non-convex Burer-Monteiro style method, and develop a sequential rounding procedure. We demonstrate that the resulting algorithm can solve problems with tens of thousands of variables within minutes, and outperforms BP and GBP on practical problems such as image denoising and Ising spin glasses. Finally, for specific graph types, we establish a sufficient condition for the tightness of the proposed partial SOS relaxation.

1 Introduction

Graphical models provide a powerful framework for analyzing systems comprising a large number of interacting variables.
Inference in graphical models is crucial in scientific methodology, with countless applications in a variety of fields including causal inference, computer vision, statistical physics, information theory, and genome research [WJ08, KF09, MM09].

In this paper, we propose a class of inference algorithms for pairwise undirected graphical models. Such models are fully specified by assigning: (i) a finite domain X for the variables; (ii) a finite graph G = (V, E) with V = [n] ≡ {1, ..., n} capturing the interactions of the basic variables; (iii) a collection of functions θ = ({θ^v_i}_{i∈V}, {θ^e_{ij}}_{(i,j)∈E}) that quantify the vertex potentials and interactions between the variables; whereby for each vertex i ∈ V we have θ^v_i : X → R and for each edge (i, j) ∈ E we have θ^e_{ij} : X × X → R (an arbitrary ordering is fixed on the pair of vertices {i, j}). These parameters can be used to form a probability distribution on X^V for the random vector x = (x1, x2, ..., xn) ∈ X^V by letting

    p(x|θ) = e^{U(x;θ)} / Z(θ),    U(x;θ) = Σ_{(i,j)∈E} θ^e_{ij}(x_i, x_j) + Σ_{i∈V} θ^v_i(x_i),    (1.1)

where Z(θ) is the normalization constant commonly referred to as the partition function. While such models can encode a rich class of multivariate probability distributions, basic inference tasks are intractable except for very special graph structures such as trees or small-treewidth graphs [CD+06]. In this paper, we will focus on MAP estimation, which amounts to solving the combinatorial optimization problem

    x̂(θ) ≡ arg max_{x∈X^V} U(x;θ).

Intractability plagues other classes of graphical models as well (e.g. Bayesian networks, factor graphs), and has motivated the development of a wide array of heuristics.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
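To make the setup concrete, here is a minimal sketch (our own toy potentials, not from the paper) that evaluates the objective U(x; θ) of Eq. (1.1) on a 3-cycle and solves the MAP problem by exhaustive search, which is only feasible for tiny n and is exactly what the relaxations discussed next try to avoid:

```python
from itertools import product

# Toy pairwise binary model on a 3-cycle: V = {0,1,2}, X = {-1,+1}.
# theta_v[i] is the vertex potential; theta_e[(i,j)] is the edge potential.
theta_v = {0: 0.5, 1: -0.2, 2: 0.1}
theta_e = {(0, 1): 1.0, (1, 2): -0.7, (0, 2): 0.3}

def U(x):
    """U(x; theta) = sum_{(i,j) in E} theta_e_ij x_i x_j + sum_{i in V} theta_v_i x_i."""
    return (sum(t * x[i] * x[j] for (i, j), t in theta_e.items())
            + sum(t * x[i] for i, t in theta_v.items()))

# MAP estimation by exhaustive enumeration over the 2^n assignments.
x_map = max(product([-1, +1], repeat=3), key=U)
print(x_map, U(x_map))
```

On this instance the frustrated edge (1, 2) makes the optimum non-obvious even at n = 3; at realistic sizes the search space is astronomically large, motivating LP and SDP relaxations.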
One of the simplest such heuristics is loopy belief propagation (BP) [WJ08, KF09, MM09]. In its max-product version (which is well suited for MAP estimation), BP is intimately related to the linear programming (LP) relaxation of the combinatorial problem max_{x∈X^V} U(x; θ). Denoting the decision variables by b = ({b_i}_{i∈V}, {b_ij}_{(i,j)∈E}), the LP relaxation form of BP can be written as

    maximize_b   Σ_{(i,j)∈E} Σ_{x_i,x_j∈X} θ^e_{ij}(x_i, x_j) b_ij(x_i, x_j) + Σ_{i∈V} Σ_{x_i∈X} θ^v_i(x_i) b_i(x_i),    (1.2)
    subject to   Σ_{x_j∈X} b_ij(x_i, x_j) = b_i(x_i)   ∀(i, j) ∈ E,    (1.3)
                 b_ij ∈ Δ_{X×X}   ∀(i, j) ∈ E,    (1.4)
                 b_i ∈ Δ_X   ∀i ∈ V,    (1.5)

where Δ_S denotes the simplex of probability distributions over the set S. The decision variables are referred to as 'beliefs', and their feasible set is a relaxation of the polytope of marginals of distributions. The beliefs satisfy the constraints on marginals involving at most two variables connected by an edge. Loopy belief propagation is successful in some applications, e.g. on the sparse, locally tree-like graphs that arise, for instance, in decoding modern error-correcting codes [RU08] or in random constraint satisfaction problems [MM09]. However, on more structured instances (arising, for example, in computer vision) BP can be substantially improved by accounting for local dependencies within subsets of more than two variables. This is achieved by generalized belief propagation (GBP) [YFW05], where the decision variables are beliefs b_R that are defined on subsets of vertices (a 'region') R ⊆ [n], and that represent the marginal distributions of the variables in that region. The basic constraint on the beliefs is the linear marginalization constraint Σ_{x_{R\S}} b_R(x_R) = b_S(x_S), holding whenever S ⊆ R. Hence GBP itself is closely related to an LP relaxation over the polytope of marginals of probability distributions. The relaxation becomes tighter as larger regions are incorporated.
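The pairwise LP relaxation of Eqs. (1.2)-(1.5) can be written down directly with an off-the-shelf solver. The sketch below is a small illustration (assuming SciPy is available; the variable layout and toy potentials are our own, not the paper's implementation): it encodes vertex beliefs b_i over {-1,+1} and edge beliefs b_ij over {-1,+1}², with normalization and marginalization as equality constraints, on a frustrated 3-cycle where the LP upper bound can exceed the integral optimum:

```python
import numpy as np
from scipy.optimize import linprog

# Toy model (same form as Eq. (1.1)): 3-cycle, binary domain X = {-1, +1}.
V = [0, 1, 2]
E = [(0, 1), (1, 2), (0, 2)]
theta_v = {0: 0.5, 1: -0.2, 2: 0.1}
theta_e = {(0, 1): 1.0, (1, 2): -0.7, (0, 2): 0.3}
X = [-1, +1]

# Flat variable layout: vertex beliefs b_i(x), then edge beliefs b_ij(x, y).
vid = {(i, x): k for k, (i, x) in enumerate((i, x) for i in V for x in X)}
off = len(vid)
eid = {(e, x, y): off + k
       for k, (e, x, y) in enumerate((e, x, y) for e in E for x in X for y in X)}
n_var = off + len(eid)

c = np.zeros(n_var)                        # linprog minimizes, so negate Eq. (1.2)
for i in V:
    for x in X:
        c[vid[i, x]] -= theta_v[i] * x
for (i, j) in E:
    for x in X:
        for y in X:
            c[eid[(i, j), x, y]] -= theta_e[i, j] * x * y

rows, rhs = [], []
for i in V:                                # b_i in the simplex: sum_x b_i(x) = 1
    r = np.zeros(n_var)
    for x in X:
        r[vid[i, x]] = 1.0
    rows.append(r); rhs.append(1.0)
for (i, j) in E:                           # marginalization constraints, Eq. (1.3)
    for x in X:
        r = np.zeros(n_var)
        for y in X:
            r[eid[(i, j), x, y]] = 1.0
        r[vid[i, x]] = -1.0
        rows.append(r); rhs.append(0.0)
        r = np.zeros(n_var)
        for y in X:
            r[eid[(i, j), y, x]] = 1.0
        r[vid[j, x]] = -1.0
        rows.append(r); rhs.append(0.0)

res = linprog(c, A_eq=np.array(rows), b_eq=np.array(rhs),
              bounds=[(0, 1)] * n_var, method="highs")
lp_value = -res.fun                        # upper bound on max_x U(x; theta)
print(lp_value)
```

Because every integral assignment is feasible for the LP, `lp_value` always upper-bounds the MAP value; on frustrated cycles the optimum is typically attained at a fractional (half-integral) belief vector, which is the looseness that larger regions, and the SDP relaxations below, address.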
In a prototypical application, G is a two-dimensional grid, and regions are squares induced by four contiguous vertices (plaquettes), see Figure 1, left frame. Alternatively, in the right frame of the same figure, the regions correspond to triangles.

The LP relaxations that correspond to GBP are closely related to the Sherali-Adams hierarchy [SA90]. Similar to GBP, the variables within this hierarchy are beliefs over subsets of variables, b_R = (b_R(x_R))_{x_R∈X^R}, which are consistent under marginalization: Σ_{x_{R\S}} b_R(x_R) = b_S(x_S). However, these two approaches differ in an important point: the Sherali-Adams hierarchy uses beliefs over all subsets of |R| ≤ d variables, where d is the degree in the hierarchy; this leads to an LP of size Θ(n^d). In contrast, GBP only retains regions that are contiguous in G. If G has maximum degree k, this produces an LP of size O(n k^d), a reduction which is significant for large-scale problems.

Given the broad empirical success of GBP, it is natural to develop better methods for inference in graphical models using tighter convex relaxations. Within combinatorial optimization, it is well understood that semidefinite programming (SDP) relaxations provide superior approximation guarantees with respect to LP [GW95]. Nevertheless, SDP has found limited application in inference tasks for graphical models, for at least two reasons. A structural reason: standard SDP relaxations (e.g. [GW95]) do not account exactly for correlations between neighboring vertices in the graph, which is essential for structured graphical models; as a consequence, BP or GBP often outperforms basic SDPs. A computational reason: basic SDP relaxations involve Θ(n²) decision variables, and generic interior-point solvers do not scale well to large-scale applications.
An exception is [WJ04], which employs the simplest SDP relaxation (degree-2 Sum-Of-Squares, see below) in conjunction with a relaxation of the entropy and interior-point methods; higher-order relaxations are briefly discussed there without implementation, as the resulting program suffers from the aforementioned limitations.

In this paper, we revisit MAP inference in graphical models via SDPs, and propose an approach that carries over the favorable performance guarantees of SDPs to inference tasks. For simplicity, we focus on models with binary variables, but we believe that many of the ideas developed here can be naturally extended to other finite domains. We present the following contributions:

Figure 1: A two-dimensional grid, and two typical choices of regions for GBP and PSOS. Left: Regions are plaquettes comprising four vertices. Right: Regions are triangles.

Partial Sum-Of-Squares relaxations. We use SDP hierarchies, specifically the Sum-Of-Squares (SOS) hierarchy [Sho87, Las01, Par03], to formulate tighter SDP relaxations for binary MAP inference that account exactly for the joint distributions of small subsets of variables x_R, for R ⊆ V. However, SOS introduces decision variables for all subsets R ⊆ V with |R| ≤ d/2 (d is a fixed even integer), and hence scales poorly to large-scale inference problems. We propose a modification similar to that in GBP: instead of accounting for all subsets R with |R| ≤ d/2, we only introduce decision variables to represent a certain family of such subsets (regions) of vertices in G. The resulting SDP has (for d and the maximum degree of G bounded) only O(n²) decision variables, which is suitable for practical implementations. We refer to these relaxations as Partial Sum-Of-Squares (PSOS), cf. Section 2.

Theoretical analysis. In Section 2.1, we prove that suitable PSOS relaxations are tight for certain classes of graphs, including planar graphs, with θ^v = 0.
While this falls short of explaining the empirical results (which use simpler relaxations, and θ^v ≠ 0), it points in the right direction.

Optimization algorithm and rounding. Despite the simplification afforded by PSOS, interior-point solvers still scale poorly to large instances. In order to overcome this problem, we adopt the non-convex approach proposed by Burer and Monteiro [BM03]: we constrain the rank of the SDP matrix in PSOS to be at most r, and solve the resulting non-convex problem using a trust-region coordinate ascent method, cf. Section 3.1. Further, we develop a rounding procedure called Confidence Lift and Project (CLAP), which iteratively uses PSOS relaxations to obtain an integer solution, cf. Section 3.2.

Numerical experiments. In Section 4, we present numerical experiments with PSOS, solving problems of size up to 10,000 within several minutes. While additional work is required to scale this approach to massive sizes, we view this as an exciting proof of concept. To the best of our knowledge, no earlier attempt was successful in scaling higher-order SOS relaxations beyond tens of dimensions. More specifically, we carry out experiments with two-dimensional grids: an image denoising problem, and Ising spin glasses. We demonstrate through extensive numerical studies that PSOS significantly outperforms BP and GBP on the inference tasks we consider.

2 Partial Sum-Of-Squares Relaxations

For concreteness, throughout the paper we focus on pairwise models with binary variables. We do not expect fundamental problems in extending the same approach to other domains.
For binary variables x = (x1, x2, ..., xn), MAP estimation amounts to solving the following optimization problem:

    maximize_x   Σ_{(i,j)∈E} θ^e_{ij} x_i x_j + Σ_{i∈V} θ^v_i x_i,
    subject to   x_i ∈ {+1, −1}   ∀i ∈ V,    (INT)

where θ^e = (θ^e_{ij})_{1≤i,j≤n} and θ^v = (θ^v_i)_{1≤i≤n} are the parameters of the graphical model.

For the reader's convenience, we recall a few basic facts about SOS relaxations, referring to [BS16] for further details. For an even integer d, SOS(d) is an SDP relaxation of INT with decision variable X : [n]_{≤d} → R, where [n]_{≤d} denotes the set of subsets S ⊆ [n] of size |S| ≤ d; it is given as

    maximize_X   Σ_{(i,j)∈E} θ^e_{ij} X({i,j}) + Σ_{i∈V} θ^v_i X({i}),
    subject to   X(∅) = 1,  M(X) ⪰ 0.    (SOS)

The moment matrix M(X) is indexed by sets S, T ⊆ [n], |S|, |T| ≤ d/2, and has entries M(X)_{S,T} = X(S △ T), with △ denoting the symmetric difference of two sets. Note that M(X)_{S,S} = X(∅) = 1.

Figure 2: Effect of the rank constraint r on an n = 400 square lattice (20 × 20), for r ∈ {2, 3, 5, 10, 20}. Left plot: the change in the value of the objective at each iteration. Right plot: the duality gap of the Lagrangian.

We can equivalently represent M(X) as a Gram matrix by letting M(X)_{S,T} = ⟨σ_S, σ_T⟩ for a collection of vectors σ_S ∈ R^r indexed by S ∈ [n]_{≤d/2}. The case r = |[n]_{≤d/2}| can represent any semidefinite matrix; however, in what follows it is convenient from a computational perspective to consider smaller choices of r.
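As a small numerical illustration of this moment-matrix structure (our own construction, for d = 4 and n = 3): for any integral assignment x ∈ {±1}³, setting X(S) = Π_{i∈S} x_i makes M(X) a rank-one Gram matrix with unit diagonal, i.e. a feasible, PSD point of the SOS(4) relaxation:

```python
import numpy as np
from itertools import combinations

n, d = 3, 4
# All subsets S of [n] with |S| <= d/2 index the rows/columns of M(X).
subsets = [frozenset(c) for k in range(d // 2 + 1)
           for c in combinations(range(n), k)]

x = np.array([1, -1, 1])                   # any integral assignment in {+1,-1}^3
X_mom = lambda S: np.prod(x[list(S)])      # X(S) = prod_{i in S} x_i; X(emptyset) = 1

# M(X)_{S,T} = X(S symmetric-difference T), as in the definition above.
M = np.array([[X_mom(S ^ T) for T in subsets] for S in subsets])

assert np.allclose(np.diag(M), 1.0)        # M_{S,S} = X(emptyset) = 1
assert np.min(np.linalg.eigvalsh(M)) >= -1e-9   # M(X) is PSD
print(M.shape)
```

Here M is 7 × 7 (one row for ∅, three singletons, three pairs); since x_i² = 1, M(X)_{S,T} factors as (Π_{i∈S} x_i)(Π_{i∈T} x_i), which is why integral points yield rank-one moment matrices and why a small Gram rank r can still capture near-optimal solutions.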
The constraint M(X)_{S,S} = 1 is equivalent to ‖σ_S‖ = 1, and the condition M(X)_{S,T} = X(S △ T) can be equivalently written as

    ⟨σ_{S1}, σ_{T1}⟩ = ⟨σ_{S2}, σ_{T2}⟩   whenever S1 △ T1 = S2 △ T2.    (2.1)

In the case d = 2, SOS(2) recovers the classical Goemans-Williamson SDP relaxation [GW95]. In the following, we consider the simplest higher-order SDP, namely SOS(4), for which the general constraints in Eq. (2.1) can be listed explicitly. Fixing a region R ⊆ V, and defining the Gram vectors σ_∅, (σ_i)_{i∈V}, (σ_ij)_{{i,j}⊆V}, we list the constraints that involve the vectors σ_S for S ⊆ R and |S| = 1, 2:

    (Sphere)      ‖σ_i‖ = 1                      ∀i ∈ S ∪ {∅},
    (Undirected)  ⟨σ_i, σ_j⟩ = ⟨σ_ij, σ_∅⟩       ∀i, j ∈ S,
    (Directed)    ⟨σ_i, σ_ij⟩ = ⟨σ_j, σ_∅⟩       ∀i, j ∈ S,
    (V-shaped)    ⟨σ_i, σ_jk⟩ = ⟨σ_k, σ_ij⟩      ∀i, j, k ∈ S,
    (Triangle)    ⟨σ_ij, σ_jk⟩ = ⟨σ_ik, σ_∅⟩     ∀i, j, k ∈ S,
    (Loop)        ⟨σ_ij, σ_kl⟩ = ⟨σ_ik, σ_jl⟩    ∀i, j, k, l ∈ S.

Given an assignment of the Gram vectors σ = (σ_∅, (σ_i)_{i∈V}, (σ_ij)_{{i,j}⊆V}), we denote by σ|_R its restriction to R, namely σ|_R = (σ_∅, (σ_i)_{i∈R}, (σ_ij)_{{i,j}⊆R}). We denote by Ω(R) the set of vectors σ|_R that satisfy the above constraints. With these notations, the SOS(4) SDP can be written as

    maximize_σ   Σ_{(i,j)∈E} θ^e_{ij} ⟨σ_i, σ_j⟩ + Σ_{i∈V} θ^v_i ⟨σ_i, σ_∅⟩,
    subject to   σ ∈ Ω(V).    (SOS(4))

A specific Partial SOS (PSOS) relaxation is defined by a collection of regions R = {R_1, R_2, ..., R_m}, R_i ⊆ V. We will require R to be a covering, i.e. ∪_{i=1}^m R_i = V and for each (i, j) ∈ E there exists ℓ ∈ [m] such that {i, j} ⊆ R_ℓ. Given such a covering, the PSOS(4) relaxation is

    maximize_σ   Σ_{(i,j)∈E} θ^e_{ij} ⟨σ_i, σ_j⟩ + Σ_{i∈V} θ^v_i ⟨σ_i, σ_∅⟩,
    subject to   σ|_{R_i} ∈ Ω(R_i)   ∀i ∈ {1, 2, ..., m}.    (PSOS(4))

Notice that the variables σ_ij only enter the above program if {i, j} ⊆ R_ℓ for some ℓ.
As a consequence, the dimension of the above optimization problem is O(r Σ_{ℓ=1}^m |R_ℓ|²), which is O(n r) if the regions have bounded size; this will be the case in our implementation. Of course, the specific choice of regions R is crucial for the quality of this relaxation. A natural heuristic is to choose each region R_ℓ to be a subset of contiguous vertices in G, which is generally the case for GBP algorithms.

Algorithm 1: Partial-SOS
  Input: G = (V, E), θ^e ∈ R^{n×n}, θ^v ∈ R^n, σ ∈ R^{r×(1+|V|+|E|)}, Reliables = ∅
  Actives = V ∪ E \ Reliables, and Δ = 1
  while Δ > tol do
    Δ = 0
    for s ∈ Actives do
      if s ∈ V then                         /* s ∈ V is a vertex */
        c_s = Σ_{t∈∂s} θ^e_{st} σ_t + θ^v_s σ_∅
      else                                  /* s = (s1, s2) ∈ E is an edge */
        c_s = θ^e_{s1s2} σ_∅ + θ^v_{s1} σ_{s2} + θ^v_{s2} σ_{s1}
      Form the matrix A_s, the vector b_s, and the corresponding Lagrange multipliers λ_s (see text).
      σ_s^new ← arg max_{‖σ‖=1} ⟨c_s, σ⟩ − (ρ/2)‖A_s σ − b_s + λ_s‖²   /* sub-problem */
      Δ ← Δ + ‖σ_s^new − σ_s‖² + ‖A_s σ_s − b_s‖²
      λ_s ← λ_s + A_s σ_s − b_s, and σ_s ← σ_s^new   /* update variables */

2.1 Tightness guarantees

Solving INT exactly is NP-hard even if G is a three-dimensional grid [Bar82]. Therefore, we do not expect PSOS(4) to be tight for general graphs G. On the other hand, in our experiments (cf. Section 4), PSOS(4) systematically achieves the exact maximum of INT for two-dimensional grids with random edge and vertex parameters (θ^e_{ij})_{(i,j)∈E}, (θ^v_i)_{i∈V}. This finding is quite surprising and calls for a theoretical explanation. While a full understanding remains an open problem, we present here partial results in that direction.

Recall that a cycle in G is a sequence of distinct vertices (i_1, ..., i_ℓ) such that, for each j ∈ [ℓ] ≡ {1, 2, ..., ℓ}, (i_j, i_{j+1}) ∈ E (where ℓ + 1 is identified with 1). The cycle is chordless if there is no j, k ∈ [ℓ], with j − k ≠ ±1 mod ℓ, such that (i_j, i_k) ∈ E.
We say that a collection of regions R on a graph G is circular if for each chordless cycle in G there exists a region R ∈ R such that all vertices of the cycle belong to R. We also need the following straightforward notion of contractibility: a contraction of G is a new graph obtained by identifying two vertices connected by an edge in G, and G is contractible to H if there exists a sequence of contractions transforming G into H.

The following theorem is a direct consequence of a result of Barahona and Mahjoub [BM86] (see Supplement for a proof).

Theorem 1. Consider the problem INT with θ^v = 0. If G is not contractible to K5 (the complete graph over 5 vertices), then PSOS(4) with a circular covering R is tight.

The assumption that θ^v = 0 can be made without loss of generality (see Supplement for the reduction from the general case). Furthermore, INT can be solved in polynomial time if G is planar and θ^v = 0 [Bar82]. Note, however, that the reduction from θ^v ≠ 0 to θ^v = 0 can transform a planar graph into a non-planar graph. This theorem implies that (full) SOS(4) is also tight if G is not contractible to K5. Since planar graphs are not contractible to K5, we recover the fact that INT can be solved in polynomial time if θ^v = 0. This result falls short of explaining the empirical findings in Section 4, for at least two reasons. Firstly, the reduction to θ^v = 0 induces K5 subhomomorphisms for grids. Secondly, the collection of regions R described in the previous section does not include all chordless cycles. Theoretically understanding the empirical performance of PSOS(4) as stated remains open. However, similar cycle constraints have proved useful in analyzing LP relaxations [WRS16].

3 Optimization Algorithm and Rounding

3.1 Solving PSOS(4) via Trust-Region Coordinate Ascent

We will approximately solve PSOS(4) while keeping r = O(1).
Earlier work implies that (under suitable genericity conditions on the SDP) there exists an optimal solution of rank at most √(2 · #constraints) [Pat98]. Recent work [BVB16] shows that for r > √(2 · #constraints), the non-convex optimization problem has no non-global local maxima. For SOS(2), [MM+17] proves that setting r = O(1) is sufficient for achieving O(1/r) relative error from the global maximum for specific choices of the potentials θ^e, θ^v. We find that there is little or no improvement beyond r = 10 (cf. Figure 2).

Algorithm 2: CLAP: Confidence Lift And Project
  Input: G = (V, E), θ^e ∈ R^{n×n}, θ^v ∈ R^n, regions R = {R_1, ..., R_m}
  Initialize the variable matrix σ ∈ R^{r×(1+|V|+|E|)} and set Reliables = ∅.
  while Reliables ≠ V ∪ E do
    Run Partial-SOS on inputs G = (V, E), θ^e, θ^v, σ, Reliables   /* lift procedure */
    Promotions = ∅ and Confidence = 0.9
    while Confidence > 0 and Promotions = ∅ do
      for s ∈ V ∪ E \ Reliables do                                 /* find promotions */
        if |⟨σ_∅, σ_s⟩| > Confidence then
          σ_s = sign(⟨σ_∅, σ_s⟩) · σ_∅                             /* project procedure */
          Promotions ← Promotions ∪ {s}
      if Promotions = ∅ then
        Confidence ← Confidence − 0.1                              /* decrease confidence level */
    Reliables ← Reliables ∪ Promotions                             /* update Reliables */
  Output: (⟨σ_i, σ_∅⟩)_{i∈V} ∈ {−1, +1}^n

We will assume that R = (R_1, ..., R_m) is a covering of G (in the sense introduced in the previous section), and, without loss of generality, we will assume that the edge set is

    E = {(i, j) ∈ V × V : ∃ℓ ∈ [m] such that {i, j} ⊆ R_ℓ}.    (3.1)

In other words, E is the maximal set of edges that is compatible with R being a covering. This can always be achieved by adding new edges (i, j) to the original edge set with θ^e_{ij} = 0. Hence, the decision variables σ_s are indexed by s ∈ S = {∅} ∪ V ∪ E. Apart from the norm constraints, all other consistency constraints take the form ⟨σ_s, σ_r⟩ = ⟨σ_t, σ_p⟩ for some 4-tuple of indices (s, r, t, p). We denote the set of all such 4-tuples by C, and construct the augmented Lagrangian of PSOS(4) as

    L(σ, λ) = Σ_{i∈V} θ^v_i ⟨σ_i, σ_∅⟩ + Σ_{(i,j)∈E} θ^e_{ij} ⟨σ_i, σ_j⟩ − (ρ/2) Σ_{(s,r,t,p)∈C} ( ⟨σ_s, σ_r⟩ − ⟨σ_t, σ_p⟩ + λ_{s,r,t,p} )².

At each step, our algorithm executes two operations: (i) maximize the cost function with respect to one of the vectors σ_s; (ii) perform one step of gradient descent with respect to the corresponding subset of Lagrangian parameters, denoted by λ_s. More precisely, fixing s ∈ S \ {∅} (by rotational invariance, it is not necessary to update σ_∅), we note that σ_s appears in the constraints linearly (or does not appear at all). Hence, we can write these constraints in the form A_s σ_s = b_s, where A_s and b_s depend on (σ_r)_{r≠s} but not on σ_s. We stack the corresponding Lagrangian parameters in a vector λ_s; the Lagrangian term involving σ_s therefore reads −(ρ/2)‖A_s σ_s − b_s + λ_s‖². On the other hand, the graphical model contribution is that the first two terms in L(σ, λ) are linear in σ_s, and hence can be written as ⟨c_s, σ_s⟩. Summarizing, we have

    L(σ, λ) = ⟨c_s, σ_s⟩ − (ρ/2)‖A_s σ_s − b_s + λ_s‖² + L̃((σ_r)_{r≠s}, λ).    (3.2)

It is straightforward to compute A_s, b_s, c_s; in particular, for (s, r, t, p) ∈ C, the rows of A_s and the entries of b_s are indexed by r: the vectors σ_r form the rows of A_s, and ⟨σ_t, σ_p⟩ forms the corresponding entry of b_s. Further, if s is a vertex and ∂s are its neighbors, we set c_s = Σ_{t∈∂s} θ^e_{st} σ_t + θ^v_s σ_∅, while if s = (s1, s2) is an edge, we set c_s = θ^e_{s1s2} σ_∅ + θ^v_{s1} σ_{s2} + θ^v_{s2} σ_{s1}. Note that we are using the equivalent representations ⟨σ_i, σ_j⟩ = ⟨σ_ij, σ_∅⟩, ⟨σ_ij, σ_j⟩ = ⟨σ_i, σ_∅⟩, and ⟨σ_ij, σ_i⟩ = ⟨σ_j, σ_∅⟩. Finally, we maximize Eq. (3.2) with respect to σ_s by a Moré-Sorensen style method [MS83].

3.2 Rounding via Confidence Lift and Project

After Algorithm 1 generates an approximate optimizer σ for PSOS(4), we reduce its rank to produce a solution of the original combinatorial optimization problem INT.
To this end, we interpret ⟨σ_i, σ_∅⟩ as our belief about the value of x_i in the optimal solution of INT, and ⟨σ_ij, σ_∅⟩ as our belief about the value of x_i x_j. This intuition can be formalized using the notion of pseudo-probability [BS16]. We then recursively round the variables about which we have strong beliefs; we fix rounded variables in the next iteration, and solve the induced PSOS(4) on the remaining ones.

More precisely, we set a confidence threshold Confidence. For any variable s such that |⟨σ_s, σ_∅⟩| > Confidence, we let x_s = sign(⟨σ_s, σ_∅⟩) and fix σ_s = x_s σ_∅. These variables σ_s are no longer

[Figure 3 shows the true and noisy images together with each algorithm's reconstruction; the achieved objective values and running times are:

                          True    Noisy   BP-SP   BP-MP   GBP     PSOS(2)  PSOS(4)
  Bernoulli, p = 0.2
    U(x):                 25815   19237   26165   26134   26161   26015    26194
    Time:                 -       -       2826s   2150s   5059s   454s     7894s
  Blockwise, p = 0.006
    U(x):                 27010   26808   27230   27012   27232   26942    27252
    Time:                 -       -       1674s   729s    4457s   248s     8844s  ]

Figure 3: Denoising a binary image by maximizing 
the objective function Eq. (4.1). Top row: i.i.d. Bernoulli noise with flip probability p = 0.2 and θ_0 = 1.26. Bottom row: blockwise noise, where each pixel is the center of a 3 × 3 error block independently with probability p = 0.006, and θ_0 = 1.

updated, and instead the reduced SDP is solved. If no variable satisfies the confidence condition, the threshold is reduced until variables are found that satisfy it. After the first iteration, most variables yield strong beliefs and are fixed; hence the subsequent iterations have fewer variables and are faster.

4 Numerical Experiments

In this section, we validate the performance of the Partial SOS relaxation and the CLAP rounding scheme on models defined on two-dimensional grids. Grid-like graphical models are common in a variety of fields such as computer vision [SSZ02] and statistical physics [MM09]. In Section 4.1, we study an image denoising example, and in Section 4.2 we consider the Ising spin glass, a model in statistical mechanics that has been used as a benchmark for inference in graphical models. Our main objective is to demonstrate that Partial SOS can be used successfully on large-scale graphical models, and is competitive with the following popular inference methods:

• Belief Propagation - Sum Product (BP-SP): Pearl's belief propagation computes exact marginal distributions on trees [Pea86]. Given a graph-structured objective function U(x), we apply BP-SP to the Gibbs-Boltzmann distribution p(x) = exp{U(x)}/Z using the standard sum-product update rules with an inertia of 0.5 to help convergence [YFW05], and threshold the marginals at 0.5.

• Belief Propagation - Max Product (BP-MP): By replacing the marginal probabilities in the sum-product updates with max-marginals, we obtain BP-MP, which can be used for exact inference on trees [MM09].
For general graphs, BP-MP is closely related to an LP relaxation of the combinatorial problem INT [YFW05, WF01]. As for BP-SP, we use an inertia of 0.5. Note that the max-product updates can be equivalently written as min-sum updates [MM09].

• Generalized Belief Propagation (GBP): The decision variables in GBP are beliefs (joint probability distributions) over larger subsets of variables in the graph G, and they are updated in a message-passing fashion [YFW00, YFW05]. We use plaquettes in the grid (contiguous groups of four vertices) as the largest regions, and apply message passing with inertia 0.1 [WF01].

• Partial SOS - Degree 2 (PSOS(2)): By defining the regions as single vertices and enforcing only the sphere constraints, we recover the classical Goemans-Williamson SDP relaxation [GW95]. The non-convex Burer-Monteiro approach is extremely efficient in this case [BM03]. We round the SDP solution by x̂_i = sign(⟨σ_i, σ_∅⟩), which is closely related to the classical approach of [GW95].

• Partial SOS - Degree 4 (PSOS(4)): This is the algorithm developed in the present paper. We take the regions R_ℓ to be triangles, cf. Figure 1, right frame. In a √n × √n grid, we have 2(√n − 1)² such regions, resulting in O(n) constraints. In Figures 3 and 4, PSOS(4) refers to the CLAP rounding scheme applied together with PSOS(4) in the lift procedure.

4.1 Image Denoising via Markov Random Fields

Given a √n × √n binary image x_0 ∈ {+1, −1}^n, we generate a corrupted version y ∈ {+1, −1}^n of the same image. We then try to denoise y by maximizing the following objective function:

    U(x) = Σ_{(i,j)∈E} x_i x_j + θ_0 Σ_{i∈V} y_i x_i,    (4.1)

Figure 4: Solving the MAP inference problem INT for Ising spin glasses on two-dimensional grids. U and N represent the uniform and normal distributions.
Each bar contains 100 independent realizations. We plot the ratio between the objective value achieved by each algorithm and the exact optimum for n ∈ {16, 25}, or the best value achieved by any of the 5 algorithms for n ∈ {100, 400, 900}.

where the graph G is the √n × √n grid, i.e., V = {i = (i1, i2) : i1, i2 ∈ {1, ..., √n}} and E = {(i, j) : ‖i − j‖_1 = 1}. In applying Algorithm 1, we add diagonals to the grid (see the right plot in Figure 1) in order to satisfy the condition (3.1), with corresponding weights θ^e_{ij} = 0.

In Figure 3, we report the output of the various algorithms for a 100 × 100 binary image. We are not aware of any earlier implementation of SOS(4) beyond tens of variables, while PSOS(4) is applied here to n = 10,000 variables. Running times for the CLAP rounding scheme (which requires several runs of PSOS(4)) are of the order of an hour, and are reported in Figure 3. We consider two noise models: i.i.d. Bernoulli noise and blockwise noise. The model parameter θ_0 is chosen in each case so as to approximately optimize the performance under BP denoising. In these (as well as in 4 other experiments of the same type reported in the supplement), PSOS(4) consistently gives the best reconstruction (often tied with GBP), in reasonable time. It also consistently achieves the largest value of the objective function among all algorithms.

4.2 Ising Spin Glass

The Ising spin glass (also known as the Edwards-Anderson model [EA75]) is one of the most studied models in statistical physics. It is given by an objective function of the form INT with G a d-dimensional grid, and i.i.d. parameters {θ^e_{ij}}_{(i,j)∈E}, {θ^v_i}_{i∈V}. Following earlier work [YFW05], we use Ising spin glasses as a testing ground for our algorithm. Denoting the uniform and normal distributions by U and N respectively, we consider two-dimensional grids (i.e. 
d = 2), and the\nfollowing parameter distributions: (i) \u2713e\nij \u21e0\nU({+1,1}) and \u2713v\ni \u21e0 N(0, 2) with = 0.1\n(this is the setting considered in [YFW05]), and (iv) \u2713e\ni \u21e0 N(0, 2) with = 1.\nFor each of these settings, we considered grids of size n 2{ 16, 25, 100, 400, 900}.\nIn Figure 4, we report the results of 8 experiments as a box plot. We ran the \ufb01ve inference algorithms\ndescribed above on 100 realizations; a total of 800 experiments are reported in Figure 4. For each of\nthe realizations, we record the ratio of the achieved value of an algorithm to the exact maximum (for\nn 2{ 16, 25}), or to the best value achieved among these algorithms (for n 2{ 100, 400, 900}). This\nis because for lattices of size 16 and 25, we are able to run an exhaustive search to determine the true\nmaximizer of the integer program. Further details are reported in the supplement.\nIn every single instance of 800 experiments, PSOS(4) achieved the largest objective value, and\nwhenever this could be veri\ufb01ed by exhaustive search (i.e. for n 2{ 16, 25}) it achieved an exact\nmaximizer of the integer program.\n\ni \u21e0U ({+1,1}), (ii) \u2713e\n\nij \u21e0U ({+1,1}) and \u2713v\n\nij \u21e0 N(0, 1) and \u2713v\n\nij \u21e0 N(0, 1) and \u2713v\n\n8\n\nRatio to the best algorithmPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MPPSOS(4)PSOS(2)GBPBP-SPBP-MP\fReferences\n\n[Bar82] Francisco Barahona. On the computational complexity of Ising spin glass models. Journal of Physics\n\nA: Mathematical and General, 15(10):3241, 1982.\n\n[BM86] Francisco Barahona and Ali Ridha Mahjoub. On the cut polytope. 
Mathematical Programming, 36(2):157–173, 1986.

[BM03] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

[BS16] Boaz Barak and David Steurer. Proofs, beliefs, and algorithms through the lens of sum-of-squares. Course notes: http://www.sumofsquares.org/public/index.html, 2016.

[BVB16] Nicolas Boumal, Vlad Voroninski, and Afonso Bandeira. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pages 2757–2765, 2016.

[CD+06] Robert G Cowell, Philip Dawid, Steffen L Lauritzen, and David J Spiegelhalter. Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks. Springer Science & Business Media, 2006.

[EA75] Samuel Frederick Edwards and Phil W Anderson. Theory of spin glasses. Journal of Physics F: Metal Physics, 5(5):965, 1975.

[EM15] Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems, pages 3052–3060, 2015.

[GW95] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.

[KF09] Daphne Koller and Nir Friedman. Probabilistic Graphical Models. MIT Press, 2009.

[Las01] Jean B Lasserre. An explicit exact SDP relaxation for nonlinear 0-1 programs. In International Conference on Integer Programming and Combinatorial Optimization, pages 293–303, 2001.

[MM09] Marc Mézard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, 2009.

[MM+17] Song Mei, Theodor Misiakiewicz, Andrea Montanari, and Roberto I Oliveira.
Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. arXiv preprint arXiv:1703.08729, 2017.

[MS83] Jorge J Moré and Danny C Sorensen. Computing a trust region step. SIAM Journal on Scientific and Statistical Computing, 4(3):553–572, 1983.

[Par03] Pablo A Parrilo. Semidefinite programming relaxations for semialgebraic problems. Mathematical Programming, 96(2):293–320, 2003.

[Pat98] Gábor Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339–358, 1998.

[Pea86] Judea Pearl. Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3):241–288, 1986.

[RU08] Tom Richardson and Ruediger Urbanke. Modern Coding Theory. Cambridge University Press, 2008.

[SA90] Hanif D Sherali and Warren P Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411–430, 1990.

[Sho87] Naum Z Shor. Class of global minimum bounds of polynomial functions. Cybernetics and Systems Analysis, 23(6):731–734, 1987.

[SSZ02] Jian Sun, Heung-Yeung Shum, and Nan-Ning Zheng. Stereo matching using belief propagation. In European Conference on Computer Vision, pages 510–524. Springer, 2002.

[WF01] Yair Weiss and William T Freeman. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47(2):736–744, 2001.

[WJ04] Martin J Wainwright and Michael I Jordan. Semidefinite relaxations for approximate inference on graphs with cycles. In Advances in Neural Information Processing Systems, pages 369–376, 2004.

[WJ08] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference.
Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[WRS16] Adrian Weller, Mark Rowland, and David Sontag. Tightness of LP relaxations for almost balanced models. In Artificial Intelligence and Statistics, pages 47–55, 2016.

[YFW00] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems, pages 689–695, 2000.

[YFW05] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, 2005.