{"title": "Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins", "book": "Advances in Neural Information Processing Systems", "page_first": 2384, "page_last": 2392, "abstract": "While finding the exact solution for the MAP inference problem is intractable for many real-world tasks, MAP LP relaxations have been shown to be very effective in practice. However, the most efficient methods that perform block coordinate descent can get stuck in sub-optimal points as they are not globally convergent. In this work we propose to augment these algorithms with an $\\epsilon$-descent approach and present a method to efficiently optimize for a descent direction in the subdifferential using a margin-based extension of the Fenchel-Young duality theorem. Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. We demonstrate the efficiency of the presented approach on spin glass models and protein interactions problems and show that our approach outperforms state-of-the-art solvers.", "full_text": "Globally Convergent Dual MAP LP Relaxation\n\nSolvers using Fenchel-Young Margins\n\nAlexander G. Schwing\n\nETH Zurich\n\naschwing@inf.ethz.ch\n\nMarc Pollefeys\n\nETH Zurich\n\npomarc@inf.ethz.ch\n\nTamir Hazan\nTTI Chicago\n\ntamir@ttic.edu\n\nRaquel Urtasun\n\nTTI Chicago\n\nrurtasun@ttic.edu\n\nAbstract\n\nWhile \ufb01nding the exact solution for the MAP inference problem is intractable for\nmany real-world tasks, MAP LP relaxations have been shown to be very effective\nin practice. 
However, the most efficient methods that perform block coordinate descent can get stuck in sub-optimal points as they are not globally convergent. In this work we propose to augment these algorithms with an ε-descent approach and present a method to efficiently optimize for a descent direction in the subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. Furthermore, the presented approach provides a methodology to construct a primal optimal solution from its dual optimal counterpart. We demonstrate the efficiency of the presented approach on spin glass models and protein interaction problems and show that our approach outperforms state-of-the-art solvers.

1 Introduction

Graphical models are a common method to describe the dependencies of a joint probability distribution over a set of discrete random variables. Finding the most likely configuration of a distribution defined by such a model, i.e., the maximum a-posteriori (MAP) assignment, is one of the most important inference tasks. Unfortunately, it is a computationally hard problem for many interesting applications. However, it has been shown that linear programming (LP) relaxations recover the MAP assignment in many cases of interest (e.g., [13, 23]).

Due to the large number of variables and constraints, solving inference problems in practice still remains a challenge for standard LP solvers. The development of specifically tailored algorithms has since become a growing area of research. Many of these designed solvers consider the dual program, thus they are based on local updates that follow the graphical model structure, which ensures suitability for very large problems. Unfortunately, the dual program is non-smooth, hence introducing difficulties to existing solvers.
For example, block coordinate descent algorithms, typically referred to as convex max-product, monotonically decrease the dual objective and converge very fast, but are not guaranteed to reach the global optimum of the dual program [3, 6, 11, 14, 17, 20, 22, 24, 25]. Different approaches to overcome the sub-optimality of the convex max-product introduced different perturbed programs for which convergence to the dual optimum is guaranteed, e.g., smoothing, proximal methods and augmented Lagrangian methods [6, 7, 8, 16, 18, 19, 27]. However, since these algorithms consider a perturbed program they are typically slower than the convex max-product variants [8, 18].

In this work we propose to augment the convex max-product algorithm with a steepest ε-descent approach to monotonically decrease the dual objective until reaching the global optimum of the dual program. To perform the ε-descent we explore the ε-subgradients of the dual program, and provide a method to search for a descent direction in the ε-subdifferential using a margin-based formulation of the Fenchel-Young duality theorem. This characterization also provides a new algorithm to construct a primal optimal solution for the LP relaxation from a dual optimal solution. We demonstrate the effectiveness of our approach on spin glass models and protein-protein interactions taken from the probabilistic inference challenge (PIC 2011)¹. We illustrate that the method exhibits nice convergence properties while possessing optimality certificates.

We begin by introducing the notation, MAP LP relaxations and their dual programs. We subsequently describe the subgradients of the dual and provide an efficient procedure to recover a primal optimal solution. We explore the ε-subgradients of the dual objective, and introduce an efficient globally convergent dual solver based on the ε-margin of the Fenchel-Young duality theorem.
Finally, we extend our approach to graphical models over general region graphs.

2 Background

Graphical models encode joint distributions over discrete product spaces X = X_1 × ··· × X_n. The joint probability is defined by combining energy functions over subsets of variables. Throughout this work we consider two types of functions: single variable functions θ_i(x_i), which correspond to the n vertices in the graph, i ∈ {1, ..., n}, and functions over subsets of variables θ_α(x_α), for α ⊂ {1, ..., n}, that correspond to the graph hyperedges. The joint distribution is then given by p(x) ∝ exp(Σ_{i∈V} θ_i(x_i) + Σ_{α∈E} θ_α(x_α)). In this paper we focus on estimating the MAP, i.e., finding the assignment that maximizes the probability or, equivalently, minimizes the energy, which is the negative log-probability. Estimating the MAP can be written as a program of the form [10]:

argmax_{x_1,...,x_n} Σ_{i∈V} θ_i(x_i) + Σ_{α∈E} θ_α(x_α).    (1)

Due to its combinatorial nature, this problem is NP-hard for general graphical models. It is tractable only in some special cases such as tree-structured graphs, where specialized dynamic programming algorithms (e.g., max-product belief propagation) are guaranteed to recover the optimum.

The MAP program in Eq. (1) has a linear form, thus it is naturally represented as an integer linear program. Its tractable relaxation is obtained by replacing the integral constraints with non-negativity constraints as follows:

max_{b_i,b_α} Σ_{i,x_i} b_i(x_i) θ_i(x_i) + Σ_{α,x_α} b_α(x_α) θ_α(x_α)
s.t.  b_i(x_i), b_α(x_α) ≥ 0,  Σ_{x_i} b_i(x_i) = 1,  Σ_{x_α} b_α(x_α) = 1,  Σ_{x_α\x_i} b_α(x_α) = b_i(x_i).    (2)

Whenever the maximizing argument of the above linear program happens to be integral, i.e., the optimal beliefs satisfy b_i(x_i), b_α(x_α) ∈ {0, 1}, the program value equals the MAP value. Moreover, the maximizing arguments of the optimal beliefs point toward the MAP assignment [26].

We denote by N(i) the edges that contain vertex i and by N(α) the vertices in the edge α. Following [22, 27] we consider the re-parametrized dual

q(λ) = Σ_i max_{x_i} { θ_i(x_i) − Σ_{α∈N(i)} λ_{i→α}(x_i) } + Σ_α max_{x_α} { θ_α(x_α) + Σ_{i∈N(α)} λ_{i→α}(x_i) }.    (3)

The dual program value upper bounds the primal program described in Eq. (2). Therefore, to compute the primal optimal value one can minimize the dual upper bound. Using block coordinate descent on the dual objective amounts to optimizing blocks of dual variables while holding the remaining ones fixed.
This results in the convex max-product message-passing update rules [6, 17]:

Repeat until convergence, for every i = 1, ..., n:
∀x_i, α ∈ N(i):  μ_{α→i}(x_i) = max_{x_α\x_i} { θ_α(x_α) + Σ_{j∈N(α)\i} λ_{j→α}(x_j) }
∀x_i, α ∈ N(i):  λ_{i→α}(x_i) = (1 / (1 + |N(i)|)) ( θ_i(x_i) + Σ_{β∈N(i)} μ_{β→i}(x_i) ) − μ_{α→i}(x_i)

¹http://www.cs.huji.ac.il/project/PASCAL/index.php

The convex max-product algorithm is guaranteed to converge since it minimizes the dual function, which is lower bounded by the primal program. Interestingly, the convex max-product shares the same complexity as max-product belief propagation, which is attained by replacing the coefficient 1/(1 + |N(i)|) by 1. It has, however, two fundamental problems. First, it can get stuck in non-optimal stationary points. This happens since the dual objective is non-smooth, thus the algorithm can reach a corner at which the dual objective stays fixed when changing only a few variables. For example, consider the case of a minimization problem where we try to descend from a pyramid while taking only horizontal and vertical paths: we eventually stay at the same height. The second drawback of convex max-product is that it does not always produce a primal optimal solution, b_i(x_i), b_α(x_α), even when it reaches a dual optimal solution.

In the next section, we consider the dual subgradients, and provide an efficient algorithm for detecting corners, as well as for decoding a primal optimal solution from a dual optimal solution. This is an intermediate step which facilitates the margin analysis of the Fenchel-Young duality theorem in Sec. 4.
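To make the update rules above concrete, the following is a minimal sketch of convex max-product on a toy tree-structured model (two binary variables, one edge; all potential values are made up for illustration and are not from the paper). Since the example is a tree, block coordinate descent on the dual of Eq. (3) decreases monotonically down to the MAP value; on loopy graphs the same sweeps can stop at a non-optimal corner, which is exactly the failure mode discussed above.

```python
import itertools

# Toy pairwise model: two binary variables and one edge a = (0, 1);
# all numbers are made up for illustration.
theta_i = {0: [0.0, 1.0], 1: [0.5, 0.0]}
theta_a = {(0, 1): {(0, 0): 2.0, (0, 1): -1.0, (1, 0): 0.0, (1, 1): 0.5}}
N_i = {0: [(0, 1)], 1: [(0, 1)]}              # edges containing each vertex

# lam[(i, a)][x_i] holds the dual variable lambda_{i->alpha}(x_i)
lam = {(i, a): [0.0, 0.0] for i in N_i for a in N_i[i]}

def dual(lam):
    """q(lambda): per-vertex and per-edge max terms of the dual."""
    val = sum(max(theta_i[i][x] - sum(lam[(i, a)][x] for a in N_i[i])
                  for x in range(2)) for i in theta_i)
    for a, tab in theta_a.items():
        val += max(v + sum(lam[(i, a)][xs[k]] for k, i in enumerate(a))
                   for xs, v in tab.items())
    return val

def cmp_pass(lam):
    """One sweep of the convex max-product block coordinate updates."""
    for i in N_i:
        mu = {}
        for a in N_i[i]:
            k = a.index(i)
            for xi in range(2):
                mu[(a, xi)] = max(
                    v + sum(lam[(j, a)][xs[m]] for m, j in enumerate(a) if j != i)
                    for xs, v in theta_a[a].items() if xs[k] == xi)
        c = 1.0 / (1 + len(N_i[i]))
        for a in N_i[i]:
            for xi in range(2):
                lam[(i, a)][xi] = (c * (theta_i[i][xi]
                                        + sum(mu[(b, xi)] for b in N_i[i]))
                                   - mu[(a, xi)])

vals = [dual(lam)]
for _ in range(20):
    cmp_pass(lam)
    vals.append(dual(lam))               # monotonically non-increasing

map_val = max(theta_i[0][x[0]] + theta_i[1][x[1]] + theta_a[(0, 1)][x]
              for x in itertools.product(range(2), repeat=2))
print(vals[-1], map_val)                 # on a tree the dual optimum is the MAP value
```

Running the sweeps here reproduces the two guarantees stated above: each pass can only lower q(λ), and the limiting dual value coincides with the exact MAP value because the graph is a tree.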
It provides an efficient way to get out of corners, and to reach the optimal dual value.

3 The Subgradients of the Dual Objective and Steepest Descent

Subgradients are generalizations of gradients for non-smooth convex functions. Consider the function q(λ) in Eq. (3). A vector d is called a subgradient of q(λ) if it supports the epigraph of q(λ) at λ, i.e.,

∀λ̂:  q(λ̂) − d⊤λ̂ ≥ q(λ) − d⊤λ.    (4)

The supporting hyperplane at (λ, q(λ)) with slope d takes the form d⊤λ − q*(d), when defining the conjugate dual as q*(d) = max_λ { d⊤λ − q(λ) }. From the definition of q*(d) one can derive the Fenchel-Young duality theorem: q(λ) + q*(d) ≥ d⊤λ, where equality holds if and only if d is the slope of a supporting hyperplane at (λ, q(λ)). The set of all subgradients is called the subdifferential, denoted by ∂q(λ), which can be characterized using the Fenchel-Young theorem as ∂q(λ) = { d : q(λ) + q*(d) = λ⊤d }. The subdifferential provides a way to reason about the optimal solutions of q(λ). Using Eq. (4) we can verify that λ is dual optimal if and only if 0 ∈ ∂q(λ). In the following claim we characterize the subdifferential of the dual function q(λ) using the Fenchel-Young duality theorem.

Claim 1. Consider the dual function q(λ) given in Eq. (3). Let X*_i = argmax_{x_i} { θ_i(x_i) − Σ_{α∈N(i)} λ_{i→α}(x_i) } and X*_α = argmax_{x_α} { θ_α(x_α) + Σ_{i∈N(α)} λ_{i→α}(x_i) }. Then d ∈ ∂q(λ) if and only if d_{i→α}(x_i) = Σ_{x_α\x_i} b_α(x_α) − b_i(x_i) for probability distributions b_i(x_i), b_α(x_α) whose nonzero entries belong to X*_i and X*_α respectively.

Proof: Using the Fenchel-Young characterization of Eq. (4) for the max-function we obtain the sets of maximizing elements X*_i, X*_α.
Summing over all regions r ∈ {i, α} while noticing the change of sign, we obtain the marginalization disagreements d_{i→α}(x_i).

The convex max-product algorithm performs block coordinate descent updates. Thus it iterates over vertices i and computes optimal solutions λ_{i→α}(x_i) for every x_i, α ∈ N(i) analytically, while holding the rest of the variables fixed. The claim above implies that the convex max-product iterates over i and generates beliefs b_i(x_i), b_α(x_α) for every x_i, α ∈ N(i) that agree on their marginal probabilities. This interpretation provides an insight into the non-optimal stationary points of the convex max-product, i.e., points for which it is not able to generate consistent beliefs b_α(x_α) when it iterates over i = 1, ..., n. The representation of the subdifferential as the amount of disagreement between the marginalization constraints provides a simple procedure to verify dual optimality, as well as to construct primal optimal solutions. This is summarized in the corollary below.

Corollary 1. Given a point λ and sets X*_i, X*_α as defined in Claim 1, let x*_i, x*_α be elements in X*_i, X*_α respectively. Consider the quadratic program

min_{b_i,b_α} Σ_{i, x*_i, α∈N(i)} ( Σ_{x*_α\x*_i} b_α(x*_α) − b_i(x*_i) )²
s.t.  b_i(x*_i), b_α(x*_α) ≥ 0,  Σ_{x*_α} b_α(x*_α) = 1,  Σ_{x*_i} b_i(x*_i) = 1.

λ is a dual optimal solution if and only if the value of the above program equals zero.
Moreover, if λ is a dual optimal solution, then the optimal beliefs b*_α(x_α), b*_i(x_i) are also an optimal solution of the primal program in Eq. (2). However, if λ is not dual optimal, then the vector d*_{i→α}(x_i) = Σ_{x*_α\x*_i} b*_α(x*_α) − b*_i(x*_i) points towards the steepest descent direction of the dual function, i.e.,

d* = argmin_{‖d‖≤1} lim_{α↓0} ( q(λ + αd) − q(λ) ) / α.

Proof: The steepest descent direction d of q is given by minimizing the directional derivative q'_d(λ):

min_{‖d‖≤1} q'_d(λ) = min_{‖d‖≤1} max_{y∈∂q} d⊤y = max_{y∈∂q} min_{‖d‖≤1} d⊤y = max_{y∈∂q} −‖y‖₂,

which yields the above program (cf. [2], Chapter 4). If the zero vector is part of the subdifferential, we are dual optimal. Primal optimality follows from Claim 1.

One can monotonically decrease the dual objective by minimizing it along the steepest descent direction. Unfortunately, following the steepest descent direction does not guarantee convergence to the global minimum of the dual function [28]: performing steepest descent might keep decreasing the dual objective by smaller and smaller increments, thus converging to a suboptimal solution. The main drawback of steepest descent, as well as of block coordinate descent, when applied to the dual objective in Eq. (3) is that both procedures only consider the supports X*_i, X*_α defined in Claim 1. In the following we show that by considering the ε-margin of these supports we can guarantee that at every iteration we decrease the dual value by at least ε.
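The subgradient characterization of Claim 1 is easy to check numerically. The sketch below (again on a made-up two-variable, single-edge model, not one from the paper) places unit mass on one maximizing assignment per max-term, reads off the marginalization disagreements d_{i→α}(x_i), and verifies the supporting-hyperplane inequality of Eq. (4) at random points:

```python
import random

random.seed(0)

# Toy model: two binary variables, one edge; numbers are made up.
th_i = [[0.0, 1.0], [0.5, 0.0]]
th_a = {(0, 0): 2.0, (0, 1): -1.0, (1, 0): 0.0, (1, 1): 0.5}

def q(lam):
    # Vertex terms use theta_i - lambda_i, the edge term theta_a plus both
    # lambdas, matching the sign convention of Claim 1.
    v = sum(max(th_i[i][x] - lam[i][x] for x in range(2)) for i in range(2))
    return v + max(va + lam[0][x0] + lam[1][x1]
                   for (x0, x1), va in th_a.items())

def subgradient(lam):
    # Unit mass on one maximizer per max-term (Claim 1):
    # d_{i->a}(x_i) = sum_{x_a \ x_i} b_a(x_a) - b_i(x_i).
    xi = [max(range(2), key=lambda x: th_i[i][x] - lam[i][x]) for i in range(2)]
    xa = max(th_a, key=lambda s: th_a[s] + lam[0][s[0]] + lam[1][s[1]])
    d = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        d[i][xa[i]] += 1.0               # marginal of the edge belief
        d[i][xi[i]] -= 1.0               # minus the vertex belief
    return d

lam0 = [[0.3, -0.2], [0.1, 0.4]]
d, q0 = subgradient(lam0), q(lam0)
for _ in range(100):
    hat = [[random.uniform(-2, 2) for _ in range(2)] for _ in range(2)]
    inner = sum(d[i][x] * (hat[i][x] - lam0[i][x])
                for i in range(2) for x in range(2))
    assert q(hat) >= q0 + inner - 1e-9   # Eq. (4), rearranged
print("subgradient inequality verified at 100 random points")
```

Note that each entry block of d sums to zero: the disagreement between an edge marginal and a vertex belief is the difference of two probability distributions, which is what makes d = 0 equivalent to marginal consistency.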
This procedure results in an efficient algorithm that reaches both dual and primal optimal solutions.

4 The ε-Subgradients of the Dual Objective and Steepest ε-Descent

To monotonically decrease the dual value while converging to the optimum, we suggest to explore the ε-neighborhood of the dual objective in Eq. (3) around the current iterate λ. For this purpose, we explore its family of ε-subgradients. Given our convex dual function q(λ) and a positive scalar ε, we say that a vector d is an ε-subgradient at λ if it supports the epigraph of q(λ) with an ε-margin:

∀λ̂:  q(λ̂) − d⊤λ̂ ≥ q(λ) − d⊤λ − ε.    (5)

The subgradients of a convex function are also ε-subgradients. The family of ε-subgradients is called the ε-subdifferential and is denoted by ∂_ε q(λ). Using the conjugate dual q*(d), we can characterize the ε-subdifferential by employing the ε-margin Fenchel-Young duality theorem:

(ε-margin Fenchel-Young duality)   ∂_ε q(λ) = { d : 0 ≤ q(λ) + q*(d) − d⊤λ ≤ ε }.    (6)

The ε-subdifferential augments the subdifferential of q(λ) with additional directions d which control the ε-neighborhood of the function. Whenever one finds a steepest descent direction within ∂_ε q(λ), it is guaranteed to improve the dual objective by at least ε. Moreover, if one cannot find such a direction within the ε-subdifferential, then q(λ) is guaranteed to be ε-close to the dual optimum. This is summarized in the following claim.

Claim 2. Let q(λ) be a convex function and let ε be a positive scalar. The ε-subdifferential ∂_ε q(λ) is a convex and compact set.
If 0 ∉ ∂_ε q(λ), then the direction d* = argmin_{d∈∂_ε q(λ)} ‖d‖ is a descent direction and inf_{α>0} q(λ − αd*) < q(λ) − ε. On the other hand, if 0 ∈ ∂_ε q(λ), then q(λ) ≤ inf_λ q(λ) + ε.

Proof: See [2], Proposition 4.3.1.

Although ∂_ε q(λ) is a convex and compact set, finding its direction of descent is computationally challenging. Fortunately, it can be approximated whenever the convex function is a sum of simple convex functions, i.e., q(λ) = Σ_{r=1}^m q_r(λ). The approximation ∂̃_ε q(λ) = Σ_r ∂_ε q_r(λ) satisfies ∂_ε q(λ) ⊂ ∂̃_ε q(λ) ⊂ ∂_{mε} q(λ) (see, e.g., [2]). On the one hand, if 0 ∉ ∂̃_ε q(λ) then the direction of steepest descent taken from ∂̃_ε q(λ) reduces the dual objective by at least ε. On the other hand, if 0 ∈ ∂̃_ε q(λ) then q(λ) is mε-close to the dual optimum. In the following claim we use the ε-margin Fenchel-Young duality in Eq. (6) to characterize the approximated ε-subdifferential of the dual function.

Claim 3. Consider the dual function q(λ) in Eq. (3). Then the approximated ε-subdifferential consists of vectors d whose entries correspond to marginalization disagreements, i.e., d ∈ ∂̃_ε q(λ) if and only if d_{i→α}(x_i) = Σ_{x_α\x_i} b_α(x_α) − b_i(x_i) for probability distributions b_i(x_i), b_α(x_α) that satisfy

∀i:  max_{x_i} { θ_i(x_i) − Σ_{α∈N(i)} λ_{i→α}(x_i) } − ε ≤ Σ_{x_i} b_i(x_i) ( θ_i(x_i) − Σ_{α∈N(i)} λ_{i→α}(x_i) ),
∀α:  max_{x_α} { θ_α(x_α) + Σ_{i∈N(α)} λ_{i→α}(x_i) } − ε ≤ Σ_{x_α} b_α(x_α) ( θ_α(x_α) + Σ_{i∈N(α)} λ_{i→α}(x_i) ).

Proof: Eq. (6) implies that b ∈ ∂_ε q_r(θ̂) if and only if q_r(θ̂) + q*_r(b) − b⊤θ̂ ≤ ε, with q*_r(b) denoting the conjugate dual of q_r(θ̂). Plugging in q_r, q*_r we obtain not only the maximizing beliefs but all beliefs within an ε-margin. Summing over r ∈ {i, α} while noticing that the λ_{i→α}(x_i) change signs between q_α and q_i, we obtain the marginalization disagreements d_{i→α}(x_i) = Σ_{x_α\x_i} b_α(x_α) − b_i(x_i).

∂̃_ε q(λ) is described using beliefs b_i(x_i), b_α(x_α) that satisfy linear constraints, therefore finding a direction of ε-descent can be done efficiently.
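A one-dimensional toy example illustrates the ε-margin: for a max of affine functions, the slopes of all lines whose value at the current point lies within ε of the maximum are ε-subgradients, and their convex hull fills out the ε-subdifferential. The function and numbers below are invented for illustration only:

```python
# Toy: q(l) = max(f1(l), f2(l)) with affine f1, f2 -- a stand-in for a single
# max-term of the dual; all numbers are invented.
f = [(1.0, -1.0), (0.0, 1.0)]              # (offset, slope) pairs

def q(l):
    return max(off + s * l for off, s in f)

eps, l0 = 0.5, 0.45                        # f1(l0) = 0.55 is the unique max
# Slopes of all lines within eps of the max are eps-subgradients at l0.
eps_slopes = [s for off, s in f if off + s * l0 >= q(l0) - eps]

for s in eps_slopes:                       # check the eps-margin of Eq. (5)
    for t in range(-50, 51):
        lhat = t / 10.0
        assert q(lhat) - s * lhat >= q(l0) - s * l0 - eps - 1e-12

# The exact subdifferential at l0 is {-1}, which says nothing about how close
# l0 is to optimal. The eps-subdifferential is the interval [-1, 1]; it
# contains 0, certifying that q(l0) <= min q + eps (here q(0.5) = 0.5).
assert q(l0) - q(0.5) <= eps
print(eps_slopes)
```

The contrast mirrors the discussion above: the exact subdifferential only sees the active line, while the ε-subdifferential also sees the nearly active one, which is what allows either a guaranteed ε-improvement or an ε-optimality certificate.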
Claim 2 ensures that minimizing the dual objective along a direction of descent decreases its value by at least ε. Moreover, we are guaranteed to be (|V| + |E|)ε-close to a dual optimal solution if no direction of descent is found in ∂̃_ε q(λ). Therefore, we are able to get out of corners and efficiently reach an approximated dual optimal solution. The interpretation of the Fenchel-Young margin as the amount of disagreement between the marginalization constraints also provides a simple way to reconstruct an approximately optimal primal solution. This is summarized in the following corollary.

Corollary 2. Given a point λ, set θ̂_i(x_i) = θ_i(x_i) − Σ_{α∈N(i)} λ_{i→α}(x_i) and θ̂_α(x_α) = θ_α(x_α) + Σ_{i∈N(α)} λ_{i→α}(x_i). Consider the quadratic program

min_{b_i,b_α} Σ_{i, x_i, α∈N(i)} ( Σ_{x_α\x_i} b_α(x_α) − b_i(x_i) )²
s.t.  b_i(x_i), b_α(x_α) ≥ 0,  Σ_{x_i} b_i(x_i) = 1,  Σ_{x_α} b_α(x_α) = 1,
      Σ_{x_i} b_i(x_i) θ̂_i(x_i) ≥ max_{x_i} { θ̂_i(x_i) } − ε,  Σ_{x_α} b_α(x_α) θ̂_α(x_α) ≥ max_{x_α} { θ̂_α(x_α) } − ε.

q(λ) is (|V| + |E|)ε-close to the dual optimal value if and only if the value of the above program equals zero. Moreover, the primal value of the optimal beliefs b*_α(x_α), b*_i(x_i) is then (|V| + |E|)ε-close to the optimal primal value in Eq. (2).
However, if q(λ) is not (|V| + |E|)ε-close to the dual optimal value, then the vector d*_{i→α}(x_i) = Σ_{x_α\x_i} b*_α(x_α) − b*_i(x_i) points towards the steepest ε-descent direction of the function, namely

d* = argmin_{‖d‖≤1} lim_{α↓0} ( q(λ + αd) − q(λ) + ε ) / α.

Proof: The steepest ε-descent direction is given by the minimum norm element of the ε-subdifferential described in Claim 3. (|V| + |E|)ε-closeness to the dual optimum is given by ([2], Proposition 4.3.1) once we find the value of the quadratic program to be zero. Note that the superset ∂̃_ε is composed of |V| + |E| subdifferentials. If the value of the above program equals zero, the beliefs fulfill the marginalization constraints and they denote probability distributions. Summing both ε-margin inequalities w.r.t. i, α, we obtain

Σ_{i,x_i} b_i(x_i) θ̂_i(x_i) + Σ_{α,x_α} b_α(x_α) θ̂_α(x_α) ≥ Σ_i max_{x_i} θ̂_i(x_i) + Σ_α max_{x_α} θ̂_α(x_α) − (|V| + |E|)ε,

where the primal value on the left hand side of the resulting inequality is larger than the dual value decreased by (|V| + |E|)ε. With the dual itself upper bounding the primal, the corollary follows.

Thus, we can construct an algorithm that performs ε improvements over the dual function in each iteration. We can either perform block-coordinate dual descent steps (i.e., convex max-product updates) or steepest ε-descent steps.
Since both methods monotonically improve the same dual function, our approach is guaranteed to reach the optimal dual solution and to recover the primal optimal solution.

Figure 1: (a) Difference between the minimal dual value attained by convex max-product, q(λ_CMP), and by our approach, q(λ_ε). Convex max-product gets stuck in about 20% of all cases. (b) Dual value achieved after a certain amount of time for cases where convex max-product gets stuck.

5 High-Order Region Graphs

Graphical models naturally describe probability distributions with different types of regions r ⊂ {1, ..., n}. However, the linear program relaxation described in Eq. (2) considers interactions between regions which correspond to variables i and regions that correspond to cliques α. In the following we extend the ε-descent framework to linear programming relaxations without constraining the region interactions. Since we allow any regions to interact, we describe these interactions through a region graph [29]. A region graph is a directed graph whose nodes represent the regions and whose directed edges correspond to the inclusion relation, i.e., a directed edge from node r to node s is possible only if s ⊂ r. We adopt the terminology where P(r) and C(r) stand for all nodes that are parents and children of the node r, respectively. Thus we consider the linear programming relaxation of a general high-order graphical model as follows:

max_b Σ_{r,x_r} b_r(x_r) θ_r(x_r)
s.t.  b_r(x_r) ≥ 0,  Σ_{x_r} b_r(x_r) = 1,  ∀r, s ∈ P(r):  Σ_{x_s\x_r} b_s(x_s) = b_r(x_r).    (7)

Following [5, 22, 27] we consider the re-parametrized dual program

q(λ) = Σ_r max_{x_r} { θ_r(x_r) + Σ_{c∈C(r)} λ_{c→r}(x_c) − Σ_{p∈P(r)} λ_{r→p}(x_r) },

which is a sum of max-functions. Its approximated ε-subdifferential is described with respect to their Fenchel-Young margins. Using the same reasoning as in Sec. 4 we present a simple way to recover a steepest ε-descent direction, as well as to reconstruct an approximated optimal primal solution.

Corollary 3. Given a point λ, set θ̂_r(x_r) = θ_r(x_r) + Σ_{c∈C(r)} λ_{c→r}(x_c) − Σ_{p∈P(r)} λ_{r→p}(x_r). Consider the quadratic program

min_b Σ_{r, x_r, p∈P(r)} ( Σ_{x_p\x_r} b_p(x_p) − b_r(x_r) )²
s.t.  b_r(x_r) ≥ 0,  Σ_{x_r} b_r(x_r) = 1,  Σ_{x_r} b_r(x_r) θ̂_r(x_r) ≥ max_{x_r} { θ̂_r(x_r) } − ε.

Let |R| be the total number of regions in the graph; then λ is |R|ε-close to the dual optimal solution if and only if the value of the above program equals zero. Moreover, the optimal beliefs b*_r(x_r) are also |R|ε-close to the optimal solution of the primal program in Eq. (7). However, if q(λ) is not |R|ε-close to the dual optimal solution, then the vector d*_{r→p}(x_r) = Σ_{x_p\x_r} b*_p(x_p) − b*_r(x_r) points towards the steepest ε-descent direction of the dual function.

Proof: It is a straightforward generalization of Corollary 2.

When dealing with high-order region graphs, one can choose a region graph, e.g., the Hasse diagram, that has significantly fewer edges than a region graph that connects variables i to cliques α.
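As a sanity check of the region-graph dual above, the following sketch evaluates q(λ) on a small hand-built region graph (a triplet region with two child regions; all potentials and the chosen λ are arbitrary illustrative numbers) and verifies weak duality, i.e., that q(λ) upper bounds the MAP value for any λ, since the λ terms cancel along every region-graph edge for a consistent joint assignment:

```python
import itertools

# Toy region graph: regions r0 = {0,1,2}, r1 = {0,1}, r2 = {2};
# edges r0 -> r1 and r0 -> r2 (parent -> child). Binary variables;
# all potential values below are arbitrary illustrative numbers.
regions = {"r0": (0, 1, 2), "r1": (0, 1), "r2": (2,)}
children = {"r0": ["r1", "r2"], "r1": [], "r2": []}
parents = {"r0": [], "r1": ["r0"], "r2": ["r0"]}
theta = {
    "r0": {x: 0.2 * sum(x) for x in itertools.product((0, 1), repeat=3)},
    "r1": {(0, 0): 1.0, (0, 1): -0.5, (1, 0): 0.3, (1, 1): 0.0},
    "r2": {(0,): 0.7, (1,): 0.1},
}

def restrict(x, vars_all, vars_sub):
    """Restriction of assignment x over vars_all to the subset vars_sub."""
    return tuple(x[vars_all.index(v)] for v in vars_sub)

def dual(lam):
    # q(lambda) = sum_r max_{x_r} [ theta_r(x_r)
    #   + sum_{c in C(r)} lambda_{c->r}(x_c) - sum_{p in P(r)} lambda_{r->p}(x_r) ]
    val = 0.0
    for r, vs in regions.items():
        val += max(
            theta[r][x]
            + sum(lam[(c, r)][restrict(x, vs, regions[c])] for c in children[r])
            - sum(lam[(r, p)][x] for p in parents[r])
            for x in theta[r])
    return val

# lambda_{c->r} is indexed by the child's assignment; any finite value is
# dual feasible.
lam = {(c, r): {x: 0.0 for x in theta[c]}
       for r in regions for c in children[r]}
lam[("r1", "r0")][(0, 0)] = 0.4

map_val = max(theta["r0"][x] + theta["r1"][x[:2]] + theta["r2"][(x[2],)]
              for x in itertools.product((0, 1), repeat=3))
assert dual(lam) >= map_val - 1e-9        # weak duality: q upper-bounds MAP
print(dual(lam), map_val)
```

Minimizing this upper bound over λ, e.g., with the ε-descent machinery of Corollary 3, would tighten q(λ) toward the value of the relaxation in Eq. (7).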
Therefore, when considering many high-order regions, the formulation in the above corollary is more efficient than the one in Corollary 2.

Figure 2: Average time required for different solvers to achieve a specified accuracy on 30 spin glass models: (a) solvers applied to "hard" problems only, i.e., those where CMP gets stuck far from the optimum; (b) average results over 30 models; (c) decrease of the dual value over time for ADLP and our ε-descent approach.

6 Experimental Evaluation

To benefit from the efficiency of convex max-product, our implementation starts by employing block-coordinate descent iterations before switching to the globally convergent ε-descent approach once the dual value decreases by less than ε = 0.01. As we always optimize the same cost function, switching the gradient computation is possible. We employ a backtracking line search in our ε-descent approach. In the following we demonstrate the effectiveness of our approach on synthetic 10x10 spin glass models as well as protein interactions from the probabilistic inference challenge (PIC 2011). We consider spin glass models that consist of local factors, each having 3 states with values randomly chosen according to N(0, 1). We use three states as convex max-product is optimal for pairwise spin glass models with only two states per random variable. The pairwise factors of the regular grid are weighted potentials with +1 on the diagonal and −1 on the off-diagonal entries. The weights are again independently drawn from N(0, 1). In the first experiment we are interested in estimating how often convex max-product gets stuck in corners.
We generate a set of 1000 spin glass models and estimate the distribution of the dual value difference between the ε-descent approach and the convex max-product result after 10,000 iterations. We observe in Fig. 1(a) that about 20% of the spin glass models have a dual value difference larger than zero.

Having observed that convex max-product does not achieve optimality for 20% of the models, we now turn our attention to evaluating the run-time of different algorithms. We compare our implementation of the ε-steepest descent algorithm with the alternating direction method for dual MAP-LP relaxations (ADLP) [18]. In addition, we illustrate the performance of convex max-product (CMP) [6] and compare against the dual-decomposition work of [12], provided in a generic (DDG) and a re-weighted (DDR) version in the STAIR library [4]. Note that ADLP is also implemented in this library. All algorithms are restricted to at most 20,000 iterations. We draw the reader's attention to Fig. 1(b), where we evaluate a single spin glass model and illustrate the dual value obtained after a certain amount of time. As given by the derivations, CMP is a monotonically decreasing algorithm that can get stuck in corners. It is important to note that our ε-descent approach is monotonically decreasing as well, which contrasts with all the other investigated algorithms (ADLP, DDG, DDR).

We evaluate the time required by the different algorithms to achieve a given accuracy. We first focus on "hard" problems, where we define "hard" as those spin glass models whose difference between convex max-product and the ε-descent method is larger than 0.2. To obtain statistically meaningful results we average over 30 hard problems and report the time to achieve a given accuracy in Fig. 2(a). We used the minimum across all dual values found by all algorithms as the optimum.
If an algorithm does not achieve ε-close accuracy within 20,000 iterations we set its time to the arbitrarily chosen value of 10^5. We note that CMP is very fast for low accuracies (high ε) but gets stuck in corners, never achieving high accuracies (low ε). This is also the case for DDG and DDR. ADLP achieves significantly lower ε-closeness, but the 20,000 iteration limit stops it from reaching 10^-3. The previous experiment focused on hard problems. In order to evaluate the average case, we randomly generate 30 spin glass models. The results are provided in Fig. 2(b). As expected, the ε-descent approach again performs well, while ADLP achieves lower accuracies on more samples. The step apparent for CMP, DDG and DDR is not as sharp, but still very significant.

Protein interactions: We rely on the data provided by the PIC 2011 and compare the ε-descent approach to ADLP, as it is the most competitive method in the previous experiments. The dual energy obtained after a given amount of time is illustrated in Fig. 2(c).

7 Related Work

We explore methods to solve LP relaxations by monotonically decreasing the value of the dual objective and reconstructing a primal optimal solution. For this purpose we investigate approximated subgradients of the dual program using Fenchel-Young margins, and provide a method that reduces the dual objective in every step by a constant value until convergence to the optimum. Efficient dual solvers have been studied extensively in the context of LP relaxations for the MAP problem [14, 20, 25]. Although the dual program is non-smooth, subgradient descent algorithms are guaranteed to reach the dual optimum as well as to recover the primal optimum [12]. Despite their theoretical guarantees, subgradient methods are typically slow.
Dual block coordinate descent methods, typically referred to as convex max-product algorithms, are monotonically decreasing and have been shown to be faster than subgradient methods [3, 6, 11, 17, 22, 24, 27]. However, since the dual program is non-smooth, these algorithms can get stuck in non-optimal stationary points and cannot, in general, recover a primal optimal solution [26]. Our work specifically addresses these drawbacks.

Recently, several methods have been devised to overcome the sub-optimality of convex max-product algorithms. Unlike our approach, all of them optimize a perturbed program. Some methods use the soft-max with low temperature to smooth the dual objective, both to avoid corners and to recover primal optimal solutions [6, 7, 8]. However, these methods are typically slower, as computing the low-temperature soft-max is more expensive than the max-computation. [19] applied the proximal method, employing a strictly concave primal perturbation, which results in a smooth dual approximation that is temperature independent. This approach converges to the dual optimum and recovers the primal optimal solution, but it uses a double-loop scheme in which every update involves executing a convex sum-product algorithm. Alternative methods apply augmented Lagrangian techniques to the primal [16] and dual [18] programs. The augmented Lagrangian method is guaranteed to reach the global optimum and to recover the dual and primal solutions. Unlike our approach, however, it is not monotonically decreasing and works on a perturbed objective; it therefore cannot be efficiently integrated with convex max-product updates that perform block coordinate descent on the dual of the LP relaxation.

Our approach is based on the ε-descent algorithm for convex functions [2].
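For intuition, here is a minimal one-dimensional sketch of that generic ε-descent scheme applied to a piecewise-linear function f(x) = max_i (a_i x + b_i): the ε-subdifferential is the interval spanned by the ε-active slopes, the method stops once 0 lies inside it (certifying ε-optimality), and otherwise steps along the negative minimum-norm element. Everything here (function, step rule, tolerances) is illustrative and not the paper's dual solver:

```python
import numpy as np

def eps_descent_1d(a, b, x, eps=0.1, max_iter=100):
    """Minimize f(x) = max_i (a[i]*x + b[i]) by eps-descent: follow the
    negative minimum-norm element of the eps-subdifferential until 0 lies
    inside it (toy 1-D sketch)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    for _ in range(max_iter):
        vals = a * x + b
        f = vals.max()
        # eps-active slopes span the eps-subdifferential (an interval in 1-D).
        act = a[vals >= f - eps]
        lo, hi = act.min(), act.max()
        if lo <= 0.0 <= hi:            # 0 in eps-subdifferential: eps-optimal
            return x, f
        g = lo if abs(lo) < abs(hi) else hi   # minimum-norm element
        # Backtracking line search; the eps/2 decrease target is arbitrary.
        t = 1.0
        while (a * (x - t * g) + b).max() > f - eps / 2 and t > 1e-12:
            t *= 0.5
        x -= t * g
    return x, (a * x + b).max()

# f(x) = max(x, -x) = |x|, minimized at 0.
x_star, f_star = eps_descent_1d([1.0, -1.0], [0.0, 0.0], x=3.0, eps=0.2)
```

On termination f(x) is within eps of the minimum, which mirrors the ε-optimality certificate used above.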
We use the ε-margin of the Fenchel-Young duality theorem to adapt the ε-subdifferential to the dual objective of the LP relaxation, thus augmenting convex max-product with the ability to get out of corners. We also construct an efficient method to recover a primal optimal solution. Our approach is related to the bundle method [15, 9], which performs an ε-subgradient descent in cases where efficient search in the ε-subdifferential is impossible. The graphical model structure in our setting makes searching in the ε-subdifferential easy, thus our approach is significantly faster. Our algorithm satisfies ε-complementary slackness while performing ε-descent steps, similarly to the auction algorithm. However, our algorithm is monotonically decreasing and can be used for general graphical models, while the auction algorithm might increase its dual, and its convergence properties hold only for network flow problems.

8 Conclusions and Discussion

Evaluating the MAP assignment and solving its LP relaxations are key problems in approximate inference. Some of the existing solvers, such as convex max-product, have limitations: they can get stuck in a non-optimal stationary point, and in that case they cannot recover the primal optimal solution. We explore the properties of subgradients of the dual objective and construct a simple algorithm that determines whether a dual stationary point is optimal and, if so, recovers the primal optimal solution (Corollary 1). Moreover, we investigate the family of ε-subgradients using Fenchel-Young margins and construct a monotonically decreasing algorithm that is guaranteed to achieve optimal dual and primal solutions (Corollary 2), including for general region graphs (Corollary 3). We show that our algorithm compares favorably with previous methods on spin glass models and protein interaction problems.
The approximated steepest descent direction is recovered by solving a quadratic program subject to linear constraints. We used the Gurobi solver^2, which ignores the graphical structure of the linear constraints. We believe that constructing a message-passing solver for this sub-problem will significantly speed up our approach. Further extensions, e.g., enforcing constraints over messages such as those arising from cloud computing, are also applicable to our setting [1, 21].

^2 http://www.gurobi.com

References

[1] A. Auslender and M. Teboulle. Interior gradient and epsilon-subgradient descent methods for constrained convex minimization. Mathematics of Operations Research, 2004.
[2] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.
[3] A. Globerson and T. S. Jaakkola. Fixing max-product: convergent message passing algorithms for MAP LP-relaxations. In Proc. NIPS, 2007.
[4] S. Gould, O. Russakovsky, I. Goodfellow, P. Baumstarck, A. Y. Ng, and D. Koller. The STAIR Vision Library (v2.4), 2011. http://ai.stanford.edu/~sgould/svl.
[5] T. Hazan, J. Peng, and A. Shashua. Tightening fractional covering upper bounds on the partition function for high-order region graphs. In Proc. UAI, 2012.
[6] T. Hazan and A. Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. Trans. on Information Theory, 2010.
[7] J. K. Johnson. Convex relaxation methods for graphical models: Lagrangian and maximum entropy approaches. PhD thesis, Massachusetts Institute of Technology, 2008.
[8] V. Jojic, S. Gould, and D. Koller. Accelerated dual decomposition for MAP inference. In Proc. ICML, 2010.
[9] J. H. Kappes, B. Savchynskyy, and C. Schnörr. A bundle approach to efficient MAP-inference by Lagrangian relaxation. In Proc. CVPR, 2012.
[10] D. Koller and N. Friedman. Probabilistic Graphical Models.
MIT Press, 2009.
[11] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. PAMI, 2006.
[12] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization & beyond via dual decomposition. PAMI, 2010.
[13] T. Koo, A. M. Rush, M. Collins, T. Jaakkola, and D. Sontag. Dual decomposition for parsing with non-projective head automata. In Proc. EMNLP, 2010.
[14] A. M. C. A. Koster, S. P. M. van Hoesel, and A. W. J. Kolen. The partial constraint satisfaction problem: Facets and lifting theorems. Operations Research Letters, 1998.
[15] C. Lemaréchal. An algorithm for minimizing convex functions. Information Processing, 1974.
[16] A. F. T. Martins, M. A. T. Figueiredo, P. M. Q. Aguiar, N. A. Smith, and E. P. Xing. An augmented Lagrangian approach to constrained MAP inference. In Proc. ICML, 2011.
[17] T. Meltzer, A. Globerson, and Y. Weiss. Convergent message passing algorithms – a unifying view. In Proc. UAI, 2009.
[18] O. Meshi and A. Globerson. An alternating direction method for dual MAP LP relaxation. In Proc. ECML PKDD, 2011.
[19] P. Ravikumar, A. Agarwal, and M. J. Wainwright. Message-passing for graph-structured linear programs: Proximal methods and rounding schemes. JMLR, 2010.
[20] M. Schlesinger. Syntactic analysis of two-dimensional visual signals in noisy conditions. Kibernetika, 1976.
[21] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed message passing for large scale graphical models. In Proc. CVPR, 2011.
[22] D. Sontag and T. S. Jaakkola. Tree block coordinate descent for MAP in graphical models. In Proc. AISTATS, 2009.
[23] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proc. UAI, 2008.
[24] D. Tarlow, D. Batra, P. Kohli, and V. Kolmogorov. Dynamic tree block coordinate ascent. In Proc. ICML, 2011.
[25] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky.
MAP estimation via agreement on trees: message-passing and linear programming. Trans. on Information Theory, 2005.
[26] Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In Proc. UAI, 2007.
[27] T. Werner. Revisiting the linear programming relaxation approach to Gibbs energy minimization and weighted constraint satisfaction. PAMI, 2010.
[28] P. Wolfe. A method of conjugate subgradients for minimizing nondifferentiable functions. Nondifferentiable Optimization, 1975.
[29] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. Trans. on Information Theory, 2005.