{"title": "Minimizing Sparse High-Order Energies by Submodular Vertex-Cover", "book": "Advances in Neural Information Processing Systems", "page_first": 962, "page_last": 970, "abstract": "Inference on high-order graphical models has become increasingly important in recent years. We consider energies with simple 'sparse' high-order potentials. Previous work in this area uses either specialized message-passing or transforms each high-order potential to the pairwise case. We take a fundamentally different approach, transforming the entire original problem into a comparatively small instance of a submodular vertex-cover problem. These vertex-cover instances can then be attacked by standard pairwise methods, where they run much faster (4--15 times) and are often more effective than on the original problem. We evaluate our approach on synthetic data, and we show that our algorithm can be useful in a fast hierarchical clustering and model estimation framework.", "full_text": "Minimizing Sparse High-Order Energies by\n\nSubmodular Vertex-Cover\n\nAndrew Delong\n\nUniversity of Toronto\n\nandrew.delong@gmail.com\n\nOlga Veksler\n\nWestern University\nolga@csd.uwo.ca\n\nAnton Osokin\n\nMoscow State University\nanton.osokin@gmail.com\n\nYuri Boykov\n\nWestern University\nyuri@csd.uwo.ca\n\nAbstract\n\nInference in high-order graphical models has become important in recent years.\nSeveral approaches are based, for example, on generalized message-passing, or\non transformation to a pairwise model with extra \u2018auxiliary\u2019 variables. We focus\non a special case where a much more ef\ufb01cient transformation is possible. Instead\nof adding variables, we transform the original problem into a comparatively small\ninstance of submodular vertex-cover. These vertex-cover instances can then be\nattacked by existing algorithms (e.g. belief propagation, QPBO), where they often\nrun 4\u201315 times faster and \ufb01nd better solutions than when applied to the original\nproblem. 
We evaluate our approach on synthetic data, then we show applications\nwithin a fast hierarchical clustering and model-fitting framework.\n\n1 Introduction\n\nMAP inference on graphical models is a central problem in machine learning, pattern recognition,\nand computer vision. Several algorithms have emerged as practical tools for inference, especially\nfor graphs containing only unary and pairwise factors. Prominent examples include belief\npropagation [30], more advanced message passing methods like TRW-S [21] or MPLP [33], combinatorial\nmethods like \u03b1-expansion [6] (for \u2018metric\u2019 factors) and QPBO [32] (mainly for binary problems).\nIn terms of optimization, these algorithms are designed to minimize objective functions (energies)\ncontaining unary and pairwise terms.\nMany inference problems must be modeled using high-order terms, not just pairwise, and such\nproblems are increasingly important for many applications. Recent developments in high-order\ninference include, for example, high-arity CRF potentials [19, 38, 25, 31], cardinality-based potentials\n[13, 34], global potentials controlling the appearance of labels [24, 26, 7], learning with high-order\nloss functions [35], among many others.\nOne standard approach to high-order inference is to transform the problem to the pairwise case and\nthen simply apply one of the aforementioned \u2018pairwise\u2019 algorithms. These transformations add many\n\u2018auxiliary\u2019 variables to the problem but, if the high-order terms are sparse in the sense suggested\nby Rother et al. [31], this can still be a very efficient approach. There can be several equivalent\nhigh-order-to-pairwise transformations, and this choice affects the difficulty of the resulting\npairwise inference problem. Choosing the \u2018easiest\u2019 transformation is not trivial and has been explicitly\nstudied, for example, by Gallagher et al. 
[11].\nOur work is about fast energy minimization (MAP inference) for particularly sparse, high-order\n\u201cpattern potentials\u201d used in [25, 31, 29]: each energy term prefers a specific (but arbitrary) assignment\nto its subset of variables. Instead of directly transforming the high-order problem to pairwise, we\ntransform the entire problem to a comparatively small instance of submodular vertex-cover (SVC).\nThe vertex-cover implicitly provides a solution to the original high-order problem. The SVC\ninstance can itself be converted to pairwise, and standard inference techniques run much faster and are\noften more effective on this compact representation.\nWe also show that our \u2018sparse\u2019 high-order energies naturally appear when trying to solve hierarchical\nclustering problems using the algorithmic approach called fusion moves [27], also conceptually\nknown as optimized crossover [1]. Fusion is a powerful very large-scale neighborhood search\ntechnique [3] that in some sense generalizes \u03b1-expansion. The fusion approach is not standard for the\nkind of clustering objective we will consider, but we believe it is an interesting optimization strategy.\nThe remainder of the paper is organized as follows. Section 2 introduces the class of high-order\nenergies we consider, then derives the transformation to SVC and the subsequent decoding. Section 3\ncontains experiments that suggest significant speedups, and discusses possible applications.\n\n2 Sparse High-Order Energies Reducible to SVC\n\nIn what follows we use x to denote a vector of binary variables, $x_P$ to denote the product $\\prod_{i \\in P} x_i$, and\n$\\bar{x}_Q$ to denote $\\prod_{i \\in Q} \\bar{x}_i$, where $\\bar{x}_i = 1 - x_i$. It will be convenient to adopt the convention that $x_{\\{\\}} = 1$ and $\\bar{x}_{\\{\\}} = 1$. 
We always use i to denote a variable index from I, and j to denote a clique index from V.\nIt is well-known that any pseudo-boolean function (binary energy) can be written in the form\n\n$$F(x) = \\sum_{i \\in I} a_i x_i - \\sum_{j \\in V} b_j \\, x_{P_j} \\bar{x}_{Q_j} \\qquad (1)$$\n\nwhere each clique j has coefficient \u2212bj with bj \u2265 0, and is defined over variables in sets Pj, Qj \u2286 I.\nOur approach will be of practical interest only when, roughly speaking, |V| \u226a |I|.\nFor example, if x = (x1, ..., x7) then a clique j with Pj = {2, 3} and Qj = {4, 5, 6} will explicitly\nreward binary configuration (\u00b7, 1, 1, 0, 0, 0, \u00b7) by the amount bj (depicted as b1 in Figure 1). If there\nare several overlapping (and conflicting) cliques, then the minimization problem can be difficult.\nA standard way to minimize F(x) would be to substitute each $-b_j x_{P_j} \\bar{x}_{Q_j}$ term with a collection\nof equivalent pairwise terms. In our experiments, we used the substitution\n\n$$-x_{P_j} \\bar{x}_{Q_j} = -1 + \\min_{y \\in \\{0,1\\}} \\Big( y + \\sum_{i \\in P_j} \\bar{x}_i \\bar{y} + \\sum_{i \\in Q_j} x_i \\bar{y} \\Big)$$\n\nwhere y is an auxiliary variable. This is like the Type-II\ntransformation in [31], and we found that it worked better than Type-I for our experiments. However,\nwe aim to minimize F(x) in a novel way, so first we review the submodular vertex-cover problem.\n\n2.1 Review of Submodular Vertex-Cover\n\nThe classic minimum-weighted vertex-cover (VC) problem can be stated as a 0-1 integer program\nwhere variable uj = 1 if and only if vertex j is included in the cover.\n\n$$\\text{(VC)} \\quad \\text{minimize} \\; \\sum_{j \\in V} w_j u_j \\qquad (2)$$\n$$\\text{subject to} \\; u_j + u_{j'} \\ge 1 \\;\\; \\forall \\{j, j'\\} \\in E \\qquad (3)$$\n$$u_j \\in \\{0, 1\\}.$$\n\nWithout loss of generality one can assume wj > 0 and j \u2260 j\u2032 for all {j, j\u2032} \u2208 E. 
If the graph\n(V, E) is bipartite, then we call the specialized problem VC-B and it can be solved very efficiently\nby specialized bipartite maximum flow algorithms such as [2].\nA function f(x) is called submodular if f(x \u2227 y) + f(x \u2228 y) \u2264 f(x) + f(y) for all x, y \u2208 {0, 1}^V\nwhere $(x \\wedge y)_j = x_j y_j$ and $(x \\vee y)_j = 1 - \\bar{x}_j \\bar{y}_j$. A submodular function can be minimized in\nstrongly polynomial time by combinatorial methods [17], but becomes NP-hard when subject to\narbitrary covering constraints like (3).\nThe submodular vertex-cover (SVC) problem generalizes VC by replacing the linear (modular)\nobjective (2) with an arbitrary submodular objective,\n\n$$\\text{(SVC)} \\quad \\text{minimize} \\; f(u) \\qquad (4)$$\n$$\\text{subject to} \\; u_j + u_{j'} \\ge 1 \\;\\; \\forall \\{j, j'\\} \\in E$$\n$$u_j \\in \\{0, 1\\}.$$\n\nIwata & Nagano [18] recently showed that when f(\u00b7) \u2265 0 a 2-approximation can be found in\npolynomial time and that this is the best constant-ratio bound achievable. It turns out that a half-integral\nrelaxation $u_j \\in \\{0, \\frac{1}{2}, 1\\}$ (call this problem SVC-H), followed by upward rounding, gives\na 2-approximation much like for standard VC. They also show how to transform any SVC-H\ninstance into a bipartite instance of SVC (see below); this extends a classic result by Nemhauser &\nTrotter [28], allowing specialized combinatorial algorithms like [17] to solve the relaxation.\nIn the bipartite submodular vertex-cover (SVC-B) problem, the graph nodes V can be partitioned\ninto sets J, K so the binary variables are u \u2208 {0, 1}^J, v \u2208 {0, 1}^K and we solve\n\n$$\\text{(SVC-B)} \\quad \\text{minimize} \\; f(u) + g(v) \\qquad (5)$$\n$$\\text{subject to} \\; u_j + v_k \\ge 1 \\;\\; \\forall \\{j, k\\} \\in E$$\n$$u_j, v_k \\in \\{0, 1\\} \\;\\; \\forall j \\in J, \\, k \\in K$$\n\nwhere both f(\u00b7) and g(\u00b7) are submodular functions. 
This SVC-B formulation is a trivial extension\nof the construction in [18] (they assume g = f), and their proof of tractability extends easily to (5).\n\n2.2 Solving Bipartite SVC with Min-Cut\nIt will be useful to note that if f and g above can be written in a special manner, SVC-B can\nbe solved by fast s-t minimum cut instead of by [17, 15]. Suppose we have an SVC-B instance\n(J, K, E, f, g) where we can write submodular f and g as\n\n$$f(u) = \\sum_{S \\in \\mathcal{S}^0} w_S u_S, \\quad \\text{and} \\quad g(v) = \\sum_{S \\in \\mathcal{S}^1} w_S v_S. \\qquad (6)$$\n\nHere $\\mathcal{S}^0$ and $\\mathcal{S}^1$ are collections of subsets of J and K respectively, and $u_S$ denotes the\nproduct $\\prod_{j \\in S} u_j$ throughout (as distinct from u, which denotes a vector).\nProposition 1. If wS \u2264 0 for all |S| \u2265 2 in (6), then SVC-B reduces to s-t minimum cut.\nProof. We can define an equivalent problem over variables $u_j$ and $z_k = \\bar{v}_k$. With this substitution,\nthe covering constraints become $u_j \\ge z_k$. Since \u201cg(v) submodular in v\u201d implies \u201cg(1\u2212v) submodular\nin v,\u201d letting $\\bar{g}(z) = g(\\bar{z}) = g(v)$ means $\\bar{g}(z)$ is submodular as a function of z. Minimizing\n$f(u) + \\bar{g}(z)$ subject to $u_j \\ge z_k$ is equivalent to our original problem. Since $u_j \\ge z_k$ can be enforced\nby a large (submodular) penalty on assignment $\\bar{u}_j z_k$, SVC-B is equivalent to\n\n$$\\text{minimize} \\; f(u) + \\bar{g}(z) + \\sum_{(j,k) \\in E} \\eta \\, \\bar{u}_j z_k \\quad \\text{where } \\eta = \\infty. \\qquad (7)$$\n\nWhen f and g take the form (6), we have $\\bar{g}(z) = \\sum_{S \\in \\mathcal{S}^1} w_S \\bar{z}_S$ where $\\bar{z}_S$ denotes product $\\prod_{k \\in S} \\bar{z}_k$.\nIf wS \u2264 0 for all |S| \u2265 2, we can build an s-t minimum cut graph corresponding to (7) by directly\napplying the constructions in [23, 10]. We can do this because each term has coefficient wS \u2264 0\nwhen written as $u_1 \\cdots u_{|S|}$ or $\\bar{z}_1 \\cdots \\bar{z}_{|S|}$, i.e. 
either all complemented or all uncomplemented.\n\n2.3 Transforming F(x) to SVC\n\nTo get a sense for how our transformation works, see Figure 1. The transformation is reminiscent of\nthe binary dual of a Constraint Satisfaction Problem (CSP) [37]. The vertex-cover construction of\n[4] is actually a special linear (modular) case of our transformation (details in Proposition 2).\n\nFigure 1: Left: factor graph of $F(x) = \\sum_{i=1}^{7} a_i x_i - b_1 x_{P_1} \\bar{x}_{Q_1} - b_2 x_{P_2} \\bar{x}_{Q_2} - b_3 x_{P_3} \\bar{x}_{Q_3}$, where the three cliques are defined over variables {2, ..., 6}, {1, ..., 5}, and {3, ..., 7} respectively.\nA small white square indicates ai > 0, a black square ai < 0. A hollow edge connecting xi to factor\nj indicates i \u2208 Pj, and a filled-in edge indicates i \u2208 Qj. Right: factor graph of our corresponding\nSVC instance. High-order factors of the original problem, shown with gray squares on the left, are\ntransformed into variables of the SVC problem. Covering constraints are shown as dashed lines. Two\npairwise factors are formed with coefficients $w_{\\{1,3\\}} = -a_3$ and $w_{\\{1,2\\}} = a_4 + a_5$, both \u2264 0.\n\nTheorem 1. For any F(x) there exists an instance of SVC such that an optimum $x^* \\in \\{0,1\\}^I$ for\nF can be computed from an optimal vertex-cover $u^* \\in \\{0,1\\}^V$.\nProof. First we give the construction for SVC instance (V, E, f). Introduce auxiliary binary\nvariables $u \\in \\{0,1\\}^V$ where $\\bar{u}_j = x_{P_j} \\bar{x}_{Q_j}$. Because each bj \u2265 0, minimizing F(x) is equivalent to the\n0-1 integer program with non-linear constraints\n\n$$\\text{minimize} \\; F(x, u) = \\sum_{i \\in I} a_i x_i - \\sum_{j \\in V} b_j \\bar{u}_j$$\n$$\\text{subject to} \\; \\bar{u}_j \\le x_{P_j} \\bar{x}_{Q_j} \\;\\; \\forall j \\in V. \\qquad (8)$$\n\nInequality (8) is sufficient if bj \u2265 0 because, for any fixed x, equality $\\bar{u}_j = x_{P_j} \\bar{x}_{Q_j}$ holds for some\nu that minimizes F(x, u).\nWe try to formulate a minimization problem solely over u. As a consequence of (8) we have $u_j = 0 \\Rightarrow \\mathtt{x}_{P_j} = \\mathbf{1}, \\; \\mathtt{x}_{Q_j} = \\mathbf{0}$. (We use typescript $\\mathtt{x}_S$ to denote the vector $(x_i)_{i \\in S}$, whereas $x_S$ denotes a\nproduct, a scalar value.) 
Notice that, when some Pj and Qj\u2032 overlap, not all $u \\in \\{0,1\\}^V$ can be\nfeasible with respect to assignments $x \\in \\{0,1\\}^I$. For each i \u2208 I, let us collect the cliques that i\nparticipates in: define sets Ji, Ki \u2286 V where Ji = { j | i \u2208 Pj } and Ki = { j | i \u2208 Qj }. We show\nthat u can be feasible if and only if $u_{J_i} + u_{K_i} \\ge 1$ for all i \u2208 I, where $u_S$ denotes a product. In\nother words, u can be feasible if and only if, for each i,\n\n$$\\exists\\, u_j = 0, \\; j \\in J_i \\implies u_k = 1 \\;\\; \\forall k \\in K_i$$\n$$\\exists\\, u_k = 0, \\; k \\in K_i \\implies u_j = 1 \\;\\; \\forall j \\in J_i. \\qquad (9)$$\n\n(\u21d2) If $\\bar{u}_j \\le x_{P_j} \\bar{x}_{Q_j}$ for all j \u2208 V, then having $u_{J_i} + u_{K_i} \\ge 1$ is necessary: if both $u_{J_i} = 0$ and\n$u_{K_i} = 0$ for any i it would mean there exists j \u2208 Ji and k \u2208 Ki for which $\\mathtt{x}_{P_j} = \\mathbf{1}$ and $\\mathtt{x}_{Q_k} = \\mathbf{0}$,\ncontradicting any unique assignment to xi.\n(\u21d0) If $u_{J_i} + u_{K_i} \\ge 1$ for all i \u2208 I, then we can always choose some $x \\in \\{0,1\\}^I$ for which every\n$\\bar{u}_j \\le x_{P_j} \\bar{x}_{Q_j}$. It will be convenient to choose a minimum cost assignment for each xi, subject to the\nconstraints $u_{J_i} = 0 \\Rightarrow x_i = 1$ and $u_{K_i} = 0 \\Rightarrow x_i = 0$. If both $u_{J_i} = u_{K_i} = 1$ then xi could be\neither 0 or 1 so choose the best, giving\n\n$$x(u)_i = \\begin{cases} 0 & \\text{if } u_{K_i} = 0 \\\\ 1 & \\text{if } u_{J_i} = 0 \\\\ [a_i < 0] & \\text{otherwise.} \\end{cases} \\qquad (10)$$\n\nThe assignment x(u) is feasible with respect to (8) because for any $\\bar{u}_j = 1$ we have $\\mathtt{x}(u)_{P_j} = \\mathbf{1}$\nand $\\mathtt{x}(u)_{Q_j} = \\mathbf{0}$.\nWe have completed the proof that u can be feasible if and only if $u_{J_i} + u_{K_i} \\ge 1$. To express\nminimization of F solely in terms of u, first write (10) in equivalent form\n\n$$x(u)_i = \\begin{cases} u_{K_i} & \\text{if } a_i < 0 \\\\ 1 - u_{J_i} & \\text{otherwise.} \\end{cases} \\qquad (11)$$\n\nAgain, this definition of x(u) minimizes F(x, u) over all x satisfying inequality (8). 
Use (11) to\nwrite new SVC objective f(u) = F(x(u), u), which becomes\n\n$$f(u) = \\sum_{i : a_i > 0} a_i (1 - u_{J_i}) + \\sum_{i : a_i < 0} a_i u_{K_i} - \\sum_{j \\in V} b_j (1 - u_j)$$\n$$= \\sum_{i : a_i > 0} -a_i u_{J_i} + \\sum_{i : a_i < 0} a_i u_{K_i} + \\sum_{j \\in V} b_j u_j + \\text{const}. \\qquad (12)$$\n\nTo collect coefficients in the first two summands of (12), we must group them by each unique clique\nthat appears. We define set $\\mathcal{S} = \\{ S \\subseteq V \\mid (\\exists J_i = S) \\vee (\\exists K_i = S) \\}$ and write\n\n$$f(u) = \\sum_{S \\in \\mathcal{S}} w_S u_S + \\text{const} \\qquad (13)$$\n$$\\text{where} \\quad w_S = \\sum_{i : a_i > 0,\\, J_i = S} -a_i + \\sum_{i : a_i < 0,\\, K_i = S} a_i \\;\\; \\big( + b_j \\text{ if } S = \\{j\\} \\big). \\qquad (14)$$\n\nSince the high-order terms $u_S$ in (13) have non-positive coefficients wS \u2264 0, f(u) is submodular\n[5]. Also note that for each i at most one of Ji or Ki contributes to the sum, so there are at most\n$|\\mathcal{S}| \\le |I|$ unique terms $u_S$ with wS \u2260 0. If $|\\mathcal{S}|, |V| \\ll |I|$ then our SVC instance will be small.\nFinally, to ensure (9) holds we add a covering constraint uj + uk \u2265 1 whenever there exists i such\nthat j \u2208 Ji, k \u2208 Ki. 
For this SVC instance, an optimal covering u minimizes F(x(u), u).\nThe construction in Theorem 1 suggests the entire minimization procedure below.\n\nMINIMIZE-BY-SVC(F) where F is a pseudo-boolean function in the form of (1)\n  $w_{\\{j\\}} := b_j \\;\\; \\forall j \\in V$\n  for $i \\in I$ do\n    if $a_i > 0$ then $w_{J_i} := w_{J_i} - a_i$    (distribute $a_i$ to high-order SVC coefficients)\n    else if $a_i < 0$ then $w_{K_i} := w_{K_i} + a_i$    (where index sets $J_i$ and $K_i$ defined in Theorem 1)\n    $E := E \\cup \\{\\{j, k\\}\\} \\;\\; \\forall j \\in J_i, k \\in K_i$    (add covering constraints to enforce $u_{J_i} + u_{K_i} \\ge 1$)\n  let $f(u) = \\sum_{S \\in \\mathcal{S}} w_S u_S$    (define SVC objective over clique indices V)\n  $u^* :=$ SOLVE-SVC(V, E, f)    (solve with BP, QPBO, Iwata, etc.)\n  return $x(u^*)$    (decode the covering as in (10))\n\nOne reviewer suggested an extension that scales better with the number of overlapping cliques. The\nidea is to formulate SVC over the elements of $\\mathcal{S}$ rather than V. Specifically, let $y \\in \\{0,1\\}^{\\mathcal{S}}$ and use\nsubmodular objective $f(y) = \\sum_{S \\in \\mathcal{S}} w_S y_S + \\sum_{S \\in \\mathcal{S}} \\sum_{j \\in S} (b_j + 1) \\, y_S \\bar{y}_{\\{j\\}}$, where the inner sum ensures\n$y_S = \\prod_{j \\in S} y_{\\{j\\}}$ at a local minimum because $w_{\\{j\\}} \\le b_j$. For each unique pair {Ji, Ki}, add a\ncovering constraint $y_{J_i} + y_{K_i} \\ge 1$ (instead of O(|Ji|\u00b7|Ki|) constraints). An optimal covering $y^*$ of $\\mathcal{S}$\nthen gives an optimal covering of V by assigning $u_j = y^*_{\\{j\\}}$. Here we use the original construction,\nand still report significant speedups. See [8] for discussion of efficient implementation, and an\nalternate proof of Theorem 1 based on LP relaxation.\n\n2.4 Special Cases of Note\nProposition 2. If {Pj}j\u2208V are disjoint and, separately, {Qj}j\u2208V are disjoint (equivalently each\n|Ji|, |Ki| \u2264 1), then the SVC instance in Theorem 1 reduces to standard VC.\nProof. 
Each $S \\in \\mathcal{S}$ in objective (13) must be S = {j} for some j \u2208 V. The objective then becomes\n$f(u) = \\sum_{j \\in V} w_{\\{j\\}} u_j + \\text{const}$, a form of standard VC.\n\nProposition 2 shows that the main result of [4] is a special case of our Theorem 1 when Ji = {j}\nand Ki = {k} with j, k determined by two labelings being \u2018fused\u2019. In Section 3, this generalization\nof [4] will allow us to apply a similar fusion-based algorithm to hierarchical clustering problems.\nProposition 3. If each particular j \u2208 V has either Pj = {} or Qj = {}, then the construction in\nTheorem 1 is an instance of SVC-B. Moreover, it is reducible to s-t minimum cut.\nProof. In this case Ji is disjoint with Ki\u2032 for any i, i\u2032 \u2208 I, so sets J = {j : |Pj| \u2265 1} and\nK = {j : |Qj| \u2265 1} are disjoint. Since E contains pairs (j, k) with j \u2208 J and k \u2208 K, graph (V, E)\nis bipartite. By the disjointness of any Ji and Ki\u2032, the unique clique sets $\\mathcal{S}$ can be partitioned into\n$\\mathcal{S}^0 = \\{ S \\subseteq J \\mid \\exists J_i = S \\}$ and $\\mathcal{S}^1 = \\{ S \\subseteq K \\mid \\exists K_i = S \\}$ so that (13) can be written as in\nProposition 1 and thereby reduced to s-t minimum cut.\nCorollary 1. If sets {Pj}j\u2208V and {Qj}j\u2208V satisfy the conditions of Propositions 2 and 3, then\nminimizing F(x) reduces to an instance of VC-B and can be solved by bipartite maximum flow.\n\nWe should note that even though SVC has a 2-approximation algorithm [18], this does not give us\na 2-approximation for minimizing F in general. Even if F(x) \u2265 0 for all x, it does not imply\nf(u) \u2265 0 for configurations of u that violate the covering constraints, as would be required.\n\n3 Applications\n\nEven though any pseudo-boolean function can be expressed in form (1), many interesting problems\nwould require an exponential number of terms to be expressed in that form. Only certain specific\napplications will naturally have |V| \u226a |I|, so this is the main limitation of our approach. 
There may\nbe applications in high-order segmentation. For example, when $P^n$-Potts potentials [19] are\nincorporated into \u03b1-expansion, the resulting expansion step contains high-order terms that are compact\nin this form; in the absence of pairwise CRF terms, Proposition 3 would apply.\nThe \u03b1-expansion algorithm has also been extended to optimize the facility location objective [7]\ncommonly used for clustering (e.g. [24]). The resulting high-order terms inside the expansion step\nalso take the form (1) (in fact, Corollary 1 applies here); with no need to build the \u2018full\u2019 high-order\ngraph, this would allow \u03b1-expansion to work as a fast alternative to the classic greedy algorithm\nfor facility location, very similar to the fusion-based algorithm in [4]. However, in Section 3.2 we\nshow that our generalized transformation allows for a novel way to optimize a hierarchical facility\nlocation objective. We will use a recent geometric image parsing model [36] as a specific example.\nFirst, Section 3.1 compares a number of methods on synthetic instances of energy (1).\n\nFigure 2: Effectiveness of each algorithm as strength of high-order coefficients is increased by factor\nof \u03bb \u2208 {1..16}. For a fixed \u03bb, the final energy of each algorithm was normalized between 0.0 (best\nlower bound) and 1.0 (baseline ICM energy); the true energy gap between lower bound and baseline\nis indicated at top, e.g. for \u03bb = 1 the \u201clb+5\u201d means ICM was typically within 5 of the lower bound.\n\n3.1 Results on Synthetic Instances\nEach instance is a function F(x) where x represents a 100 \u00d7 100 grid of binary variables with\nrandom unary coefficients ai \u2208 [\u221210, 10]. 
Each instance also has |J| = 50 high-order cliques with\nbj \u2208 [250\u03bb, 500\u03bb] (we will vary \u03bb), where variable sets Pj and Qj each cover a random nj \u00d7 nj and\nmj \u00d7 mj region respectively (here the region size nj, mj \u2208 {10, ..., 15} is chosen randomly). If\nPj and Qj are not disjoint, then either Pj := Pj \\ Qj or Qj := Qj \\ Pj, as determined by a coin flip.\nWe tested the following algorithms: BP [30], TRW-S [21], MPLP [33], QPBO [14], and extensions\nQPBO-P and QPBO-I [32]. For BP we actually used the implementation provided by [21] which is\nvery fast but, we should note, does not support message-damping; convergence of BP might be more\nreliable if this were supported. Algorithms were configured as follows: BP for 25 iterations (more\ndid not help); TRW-S for 800 iterations (epsilon 1); MPLP for 2000 initial iterations + 20 clusters\nadded + 100 iterations per tightening; QPBO-I with 5 random improve steps. We ran MPLP for a\nparticularly long time to ensure it had ample time to tighten and converge; indeed, it always yielded\nthe best lower bound. We also tested MINIMIZE-BY-SVC by applying each of these algorithms to\nsolve the resulting SVC problem, and in this case also tried the Iwata-Nagano construction [18].\nTo transform high-order potentials to quadratic, we report results using Type-II binary reduction [31]\nbecause for TRW-S/MPLP it dominated the Type-I reduction in our experiments, and for BP and the\nothers it made no difference. This runs counter to the conventional use of \u201cnumber of supermodular\nterms\u201d as an estimate of difficulty: the Type-I reduction would generate one supermodular edge per\nhigh-order term, whereas Type-II generates |Pj| supermodular edges for each term ($\\sum_{i \\in P_j} \\bar{x}_i \\bar{y}$).\nOne minor detail is how to evaluate the \u2018partial\u2019 labelings returned by QPBO and QPBO-P. In the\ncase of minimizing F directly, we simply assigned such variables xi = [ai < 0]. 
In the case of\nMINIMIZE-BY-SVC we included all unlabeled nodes in the cover, which means a variable xi with\nuJi and uKi all unlabeled will similarly be assigned xi = [ai < 0].\nFigure 2 shows the relative performance of each algorithm, on average. When \u03bb = 1 the high-order\ncoefficients are relatively weak compared to the unary terms, so even ICM succeeds at finding a\nnear-optimal energy. For larger \u03bb the high-order terms become more important, and we make a\nnumber of observations:\n\n\u2013 ICM, BP, TRW-S, MPLP all perform much better when applied to the SVC problem.\n\u2013 QPBO-based methods do not perform better when applied to the SVC problem.\n\u2013 QPBO-I consistently gives good results; BP also gives good results if applied to SVC.\n\u2013 The Iwata-Nagano construction is effectively the same as QPBO applied to SVC.\n\n[Figure 2 plots. Horizontal axis: \u03bb = 1, 2, 4, 8, 16; energy gaps indicated at top: lb+5, +7300, +20000, +56000, +120000. Left panel: ICM, BP, TRWS, MPLP, QPBO, QPBOP, QPBOI. Right panel: ICM, SVC-ICM, SVC-BP, SVC-TRWS, SVC-MPLP, SVC-QPBO, SVC-QPBOP, SVC-QPBOI, SVC-Iwata.]\n\nWe also observed that the TRW-S lower bound was the same with or without transformation to\nSVC, but convergence took far fewer iterations when applied to SVC. In principle, TRW on\nbinary problems solves the same LP relaxation as QPBO [22]. The TRW-S code finds much better\nsolutions because it uses the final messages as hints to decode a good solution, unlike for QPBO.\nTable 1 gives typical running times for each of the cases in Figure 2 on a 2.66 GHz Intel Core2\nprocessor. Code was written in C++, but the SVC transformation was not optimized at all. Still,\nSVC-QPBOI is 20 times faster than QPBOI while giving similar energies on average. The overall\nresults suggest that SVC-BP or SVC-QPBOI are the fastest ways to find a low-energy solution (bold\nin Table 1) on problems containing many conflicting high-order terms of the form (1). 
Running\ntimes were relatively consistent for all \u03bb \u2265 2.\n\nTable 1: Typical running times of each algorithm. First row uses Type-II binary reduction on F,\nthen directly runs each algorithm. Second row first transforms to SVC, does Type-II reduction, runs\nthe algorithm, and decodes the result; times shown include all these steps.\n\n                      BP      TRW-S   MPLP    QPBO    QPBO-P  QPBO-I  Iwata\ndirectly minimize F   22ms    670ms   25min   30ms    25sec   140ms   N/A\nMINIMIZE-BY-SVC(F)    5.2ms   19ms    80sec   5.4ms   99ms    7.2ms   5ms\n\n3.2 Application: Hierarchical Model-Estimation / Clustering\n\nIn clustering and multi-model estimation, it is quite common to either explicitly constrain the\nnumber of clusters, or (more relevant to our work) to penalize the number of clusters in a solution.\nPenalizing the number of clusters is a kind of complexity penalty on the solution. Recent examples\ninclude [24, 7, 26], but the basic idea has been used in many contexts over a long period. A classic\noperations research problem with the same fundamental components is facility location: the clients\n(data points) must be assigned to a nearby facility (cluster) but each facility costs money to open.\nThis can be thought of as a labeling problem, where each data point is a variable, and there is a label\nfor each cluster.\nFor hard optimization problems there is a particular algorithmic approach called fusion [27] or\noptimized crossover [1]. The basic idea is to take two candidate solutions (e.g. two attempts at\nclustering), and to \u2018fuse\u2019 the best parts of each solution, effectively stitching them together. To see this\nmore concretely, imagine a labeling problem where we wish to minimize E(l) where $l = (l_i)_{i \\in I}$\nis a vector of label assignments. 
If $l^0$ is the first candidate labeling, and $l^1$ is the second candidate\nlabeling, a fusion operation seeks a binary string x\u2217 such that the crossover labeling $l(x) = (l_i^{x_i})_{i \\in I}$\nminimizes E(l(x)). In other words, x\u2217 identifies the best possible \u2018stitching\u2019 of the two candidate\nsolutions with respect to the energy.\nIn [4] we derived a fusion operation based on the greedy formulation of facility location, and found\nthat the subproblem reduced to minimum-weighted vertex-cover. We will now show that the fusion\noperation for hierarchical facility location objectives requires minimizing an energy of the form (1),\nwhich we have already shown can be transformed to a submodular vertex-cover problem. Givoni\net al. [12] recently proposed a message-passing scheme for hierarchical facility location, with\nexperiments on synthetic and HIV strain data. We focus on a more computer vision-centric application:\ndetecting a hierarchy of lines and vanishing points in images using the geometric image parsing\nobjective proposed by Tretyak et al. [36].\nThe hierarchical energy proposed by [36] contains five \u2018layers\u2019: edges, line segments, lines,\nvanishing points, and horizon. Each layer provides evidence for subsequent (higher) layers, and at each\nlevel there is a complexity cost that regulates how much evidence is needed to detect a line, to detect\na vanishing point, etc. For simplicity we only model edges, lines, and vanishing points, but our\nfusion-based framework easily extends to the full model. 
The purpose of our experiments is, first\nand foremost, to demonstrate that MINIMIZE-BY-SVC speeds up inference and, secondly, to\nsuggest that a hierarchical clustering framework based on fusion operations (similar to non-hierarchical\n[4]) is an interesting and potentially worthwhile alternative to the greedy and local optimization used\nin state-of-the-art methods like [36].\nLet $\\{\\mathbf{y}_i\\}_{i \\in I}$ be a set of oriented edges, where each $\\mathbf{y}_i$ consists of a position $(x_i, y_i)$ in the image and\nan orientation angle; these bottom-level features are generated by a Canny edge detector. Let L be a set of\ncandidate lines, and let V be a set of candidate vanishing points. These sets are built by randomly\nsampling: one oriented edge to generate each candidate line, and pairs of lines to generate each\ncandidate vanishing point. Each line j \u2208 L is associated with one vanishing point kj \u2208 V. (If a line\npasses close to multiple vanishing points, a copy of the line is made for each.) We seek a labeling l\nwhere li \u2208 L \u222a \u2298 identifies the line (and vanishing point) that edge i belongs to, or assigns outlier\nlabel \u2298. Let $D_i(j)$ denote the sum of the spatial distance $\\mathrm{dist}_j(x_i, y_i)$ and the angular deviation of\nedge $\\mathbf{y}_i$ to line j, and let the outlier cost be Di(\u2298) = const. Similarly, let Dj = distj(kj) be the\ndistance of line j and its associated vanishing point projected onto the Gaussian sphere (see [36]).\nFinally let Cl and Cv denote positive constants that penalize the detection of a line and a vanishing\npoint respectively. The hierarchical energy we minimize is\n\n$$E(l) = \\sum_{i \\in I} D_i(l_i) + \\sum_{j \\in L} (C_l + D_j) \\cdot [\\exists\\, l_i = j] + \\sum_{k \\in V} C_v \\cdot [\\exists\\, k_{l_i} = k]. \\qquad (15)$$\n\nThis energy penalizes the number of unique lines, and the number of unique vanishing points that\nlabeling l depends on. 
Given two candidate labelings $l^0, l^1$, writing the fusion energy for (15) gives\n\n$$E(l(x)) = \\sum_{i \\in I} \\big( D_i^0 + (D_i^1 - D_i^0) x_i \\big) + \\sum_{j \\in L} (C_l + D_j) \\cdot (1 - x_{P_j} \\bar{x}_{Q_j}) + \\sum_{k \\in V} C_v \\cdot (1 - x_{P_k} \\bar{x}_{Q_k}) \\qquad (16)$$\n\nwhere $D_i^0 = D_i(l_i^0)$, $D_i^1 = D_i(l_i^1)$, $P_j = \\{ i \\mid l_i^0 = j \\}$, $Q_j = \\{ i \\mid l_i^1 = j \\}$, $P_k = \\{ i \\mid k_{l_i^0} = k \\}$, and $Q_k = \\{ i \\mid k_{l_i^1} = k \\}$.\nNotice that sets {Pj} are disjoint with each other, but each $P_j$ is nested in subset $P_{k_j}$, so overall\nProposition 2 does not apply, and so neither does the algorithm in [4].\nFor each image we used 10,000 edges, generated 8,000 candidate lines and 150 candidate vanishing\npoints. We then generated 4 candidate labelings, each by allowing vanishing points to be detected\nin randomized order, and their associated lines to be detected in greedy order, and then we fused\nthe labelings together by minimizing (16). Overall inference with QPBOI took 2\u20136 seconds per\nimage, whereas SVC-QPBOI took 0.5\u20130.9 seconds per image, a relative speedup of 4\u20136 times.\nThe simplified model is enough to show that hierarchical clustering can be done in this new and\npotentially powerful way. As argued in [27], fusion is a robust approach because it combines the\nstrengths, quite literally, of all methods used to generate candidates.\n\nFigure 3: (Best seen in color.) Edge features color-coded by their detected vanishing point. Not\nshown are the detected lines that make up the intermediate layer of inference (similar to [36]).\nImages taken from York [9] and Eurasia [36] datasets.\nAcknowledgements We thank Danny Tarlow for helpful discussion regarding MPLP, and an anonymous\nreviewer for suggesting a more efficient way to enforce covering constraints(!). This work was supported by NSERC\nDiscovery Grant R3584A02, Canadian Foundation for Innovation (CFI), and Early Researcher Award (ERA).\n\nReferences\n\n[1] Aggarwal, C.C., Orlin, J.B., & Tai, R.P. 
(1997) Optimized Crossover for the Independent Set Problem. Operations Research 45(2):226\u2013234.\n\n[2] Ahuja, R.K., Orlin, J.B., Stein, C. & Tarjan, R.E. (1994) Improved algorithms for bipartite network flow. SIAM Journal on Computing 23(5):906\u2013933.\n\n[3] Ahuja, R.K., Ergun, \u00d6., Orlin, J.B., & Punnen, A.P. (2002) A survey of very large-scale neighborhood search techniques. Discrete Applied Mathematics 123(1\u20133):75\u2013202.\n\n[4] Delong, A., Veksler, O. & Boykov, Y. (2012) Fast Fusion Moves for Multi-Model Estimation. European Conference on Computer Vision.\n\n[5] Boros, E. & Hammer, P.L. (2002) Pseudo-Boolean Optimization. Discrete App. Math. 123(1\u20133):155\u2013225.\n\n[6] Boykov, Y., Veksler, O., & Zabih, R. (2001) Fast Approximate Energy Minimization via Graph Cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11):1222\u20131239.\n\n[7] Delong, A., Osokin, A., Isack, H.N., & Boykov, Y. (2012) Fast Approximate Energy Minimization with Label Costs. International Journal of Computer Vision 96(1):1\u201327. Earlier version in CVPR 2010.\n\n[8] Delong, A., Veksler, O., Osokin, A., & Boykov, Y. (2012) Minimizing Sparse High-Order Energies by Submodular Vertex-Cover. Technical Report, Western University.\n\n[9] Denis, P., Elder, J., & Estrada, F. (2008) Efficient Edge-Based Methods for Estimating Manhattan Frames in Urban Imagery. European Conference on Computer Vision.\n\n[10] Freedman, D. & Drineas, P. (2005) Energy minimization via graph cuts: settling what is possible. IEEE Conference on Computer Vision and Pattern Recognition.\n\n[11] Gallagher, A.C., Batra, D., & Parikh, D. (2011) Inference for order reduction in Markov random fields. IEEE Conference on Computer Vision and Pattern Recognition.\n\n[12] Givoni, I.E., Chung, C., & Frey, B.J. (2011) Hierarchical Affinity Propagation. Uncertainty in AI.\n\n[13] Gupta, R., Diwan, A., & Sarawagi, S. 
(2007) Efficient inference with cardinality-based clique potentials. International Conference on Machine Learning.\n\n[14] Hammer, P.L., Hansen, P., & Simeone, B. (1984) Roof duality, complementation and persistency in quadratic 0-1 optimization. Mathematical Programming 28:121\u2013155.\n\n[15] Hochbaum, D.S. (2010) Submodular problems \u2013 approximations and algorithms. Arxiv preprint arXiv:1010.1945.\n\n[16] Iwata, S., Fleischer, L. & Fujishige, S. (2001) A combinatorial, strongly polynomial-time algorithm for minimizing submodular functions. Journal of the ACM 48:761\u2013777.\n\n[17] Iwata, S. & Orlin, J.B. (2009) A simple combinatorial algorithm for submodular function minimization. ACM-SIAM Symposium on Discrete Algorithms.\n\n[18] Iwata, S. & Nagano, K. (2009) Submodular Function Minimization under Covering Constraints. IEEE Symposium on Foundations of Computer Science.\n\n[19] Kohli, P., Kumar, M.P. & Torr, P.H.S. (2007) P^3 & Beyond: Solving Energies with Higher Order Cliques. IEEE Conference on Computer Vision and Pattern Recognition.\n\n[20] Kolmogorov, V. (2010) Minimizing a sum of submodular functions. Arxiv preprint arXiv:1006.1990.\n\n[21] Kolmogorov, V. (2006) Convergent Tree-Reweighted Message Passing for Energy Minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(10):1568\u20131583.\n\n[22] Kolmogorov, V., & Wainwright, M.J. (2005) On the optimality of tree-reweighted max-product message-passing. Uncertainty in Artificial Intelligence.\n\n[23] Kolmogorov, V. & Zabih, R. (2004) What Energy Functions Can Be Optimized via Graph Cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence 26(2):147\u2013159.\n\n[24] Komodakis, N., Paragios, N., & Tziritas, G. (2008) Clustering via LP-based Stabilities. Neural Information Processing Systems.\n\n[25] Komodakis, N., & Paragios, N. (2009) Beyond pairwise energies: Efficient optimization for higher-order MRFs. 
IEEE Conference on Computer Vision and Pattern Recognition.\n\n[26] Ladick\u00fd, L., Russell, C., Kohli, P., & Torr, P.H.S. (2010) Graph Cut based Inference with Co-occurrence Statistics. European Conference on Computer Vision.\n\n[27] Lempitsky, V., Rother, C., Roth, S., & Blake, A. (2010) Fusion Moves for Markov Random Field Optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9):1392\u20131405.\n\n[28] Nemhauser, G.L. & Trotter, L.E. (1975) Vertex packings: Structural properties and algorithms. Mathematical Programming 8(1):232\u2013248.\n\n[29] Osokin, A., & Vetrov, D. (2012) Submodular relaxations for MRFs with high-order potentials. HiPot: ECCV Workshop on Higher-Order Models and Global Constraints in Computer Vision.\n\n[30] Pearl, J. (1988) Fusion, propagation, and structuring in belief networks. Artificial Intell. 29(3):251\u2013288.\n\n[31] Rother, C., Kohli, P., Feng, W., & Jia, J. (2009) Minimizing sparse higher order energy functions of discrete variables. IEEE Conference on Computer Vision and Pattern Recognition.\n\n[32] Rother, C., Kolmogorov, V., Lempitsky, V., & Szummer, M. (2007) Optimizing Binary MRFs via Extended Roof Duality. IEEE Conference on Computer Vision and Pattern Recognition.\n\n[33] Sontag, D., Meltzer, T., Globerson, A., Jaakkola, T., & Weiss, Y. (2008) Tightening LP relaxations for MAP using message passing. Uncertainty in Artificial Intelligence.\n\n[34] Tarlow, D., Givoni, I.E., & Zemel, R.S. (2010) HOPMAP: Efficient message passing with high order potentials. International Conference on Artificial Intelligence and Statistics.\n\n[35] Tarlow, D., & Zemel, R. (2012) Structured Output Learning with High Order Loss Functions. International Conference on Artificial Intelligence and Statistics.\n\n[36] Tretyak, E., Barinova, O., Kohli, P., & Lempitsky, V. (2011) Geometric Image Parsing in Man-Made Environments. 
International Journal of Computer Vision 97(3):305\u2013321.\n\n[37] Tsang, E. (1993) Foundations of constraint satisfaction. Academic Press, London.\n\n[38] Werner, T. (2008) High-arity Interactions, Polyhedral Relaxations, and Cutting Plane Algorithm for Soft Constraint Optimisation (MAP-MRF). IEEE Conference on Computer Vision and Pattern Recognition.\n", "award": [], "sourceid": 460, "authors": [{"given_name": "Andrew", "family_name": "Delong", "institution": null}, {"given_name": "Olga", "family_name": "Veksler", "institution": null}, {"given_name": "Anton", "family_name": "Osokin", "institution": null}, {"given_name": "Yuri", "family_name": "Boykov", "institution": null}]}