{"title": "Joint M-Best-Diverse Labelings as a Parametric Submodular Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 334, "page_last": 342, "abstract": "We consider the problem of jointly inferring the $M$-best diverse labelings for a binary (high-order) submodular energy of a graphical model. Recently, it was shown that this problem can be solved to a global optimum, for many practically interesting diversity measures. It was noted that the labelings are, so-called, nested. This nestedness property also holds for labelings of a class of parametric submodular minimization problems, where different values of the global parameter $\\gamma$ give rise to different solutions. The popular example of the parametric submodular minimization is the monotonic parametric max-flow problem, which is also widely used for computing multiple labelings. As the main contribution of this work we establish a close relationship between diversity with submodular energies and the parametric submodular minimization. In particular, the joint $M$-best diverse labelings can be obtained by running a non-parametric submodular minimization (in the special case - max-flow) solver for $M$ different values of $\\gamma$ in parallel, for certain diversity measures. Importantly, the values for~$\\gamma$ can be computed in a closed form in advance, prior to any optimization. These theoretical results suggest two simple yet efficient algorithms for the joint $M$-best diverse problem, which outperform competitors in terms of runtime and quality of results. 
In particular, as we show in the paper, the new methods compute the exact $M$-best diverse labelings faster than a popular method of Batra et al., which in some sense only obtains approximate solutions.", "full_text": "Joint M-Best-Diverse Labelings as a Parametric\n\nSubmodular Minimization\n\nAlexander Kirillov1 Alexander Shekhovtsov2 Carsten Rother1 Bogdan Savchynskyy1\n\nalexander.kirillov@tu-dresden.de\n\n1 TU Dresden, Dresden, Germany\n\n2 TU Graz, Graz, Austria\n\nAbstract\n\nWe consider the problem of jointly inferring the M-best diverse labelings for a\nbinary (high-order) submodular energy of a graphical model. Recently, it was\nshown that this problem can be solved to a global optimum, for many practically\ninteresting diversity measures. It was noted that the labelings are, so-called, nested.\nThis nestedness property also holds for labelings of a class of parametric submodu-\nlar minimization problems, where different values of the global parameter \u03b3 give\nrise to different solutions. The popular example of the parametric submodular\nminimization is the monotonic parametric max-\ufb02ow problem, which is also widely\nused for computing multiple labelings. As the main contribution of this work\nwe establish a close relationship between diversity with submodular energies and\nthe parametric submodular minimization. In particular, the joint M-best diverse\nlabelings can be obtained by running a non-parametric submodular minimization\n(in the special case - max-\ufb02ow) solver for M different values of \u03b3 in parallel, for\ncertain diversity measures. Importantly, the values for \u03b3 can be computed in a\nclosed form in advance, prior to any optimization. These theoretical results suggest\ntwo simple yet ef\ufb01cient algorithms for the joint M-best diverse problem, which\noutperform competitors in terms of runtime and quality of results. 
In particular, as we show in the paper, the new methods compute the exact M-best diverse labelings faster than a popular method of Batra et al., which in some sense only obtains approximate solutions.

1 Introduction

A variety of tasks in machine learning, computer vision and other disciplines can be formulated as energy minimization problems, also known as Maximum-a-Posteriori (MAP) or Maximum Likelihood (ML) estimation problems in undirected graphical models (Markov or Conditional Random Fields). The importance of this problem is well-recognized, as can be seen from the many specialized benchmarks [36, 21] and computational challenges [10] for its solvers. This motivates the task of finding the most probable solution. Recently, a slightly different task has gained popularity, from both a practical and a theoretical perspective: to find not only the most probable solution but multiple diverse solutions, all with low energy, see e.g., [4, 31, 22, 23]. This task is referred to as the "M-best-diverse problem" [4], and it has been used in a variety of scenarios, such as: (a) expressing uncertainty of the computed solutions [33]; (b) faster training of model parameters [16]; (c) ranking of inference results [40]; (d) empirical risk minimization [32]; (e) loss-aware optimization [1]; (f) using diverse proposals in a cascading framework [39, 35].

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 647769). A. Shekhovtsov was supported by ERC starting grant agreement 640156.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In this work we build on the recently proposed formulation of [22] for the M-best-diverse problem.
In this formulation all M configurations are inferred jointly, contrary to the well-known method [4, 31], where a sequential, greedy procedure is used. Hence, we term it the "joint M-best-diverse problem". As shown in [22, 23], the joint approach qualitatively outperforms the sequential approach [4, 31] in a number of applications. This is explained by the fact that the sequential method [4] can be considered an approximate, greedy optimization technique for solving the joint M-best-diverse problem. While the joint approach is superior with respect to the quality of its results, it is inferior to the sequential method [4] with respect to runtime. For the case of binary submodular energies, the approximate solver in [22] and the exact solver in [23] are several times slower than the sequential technique [4] for a normally sized image. Obviously, this is a major limitation when using it in a practical setting. Furthermore, the difference in runtime grows with the number M of configurations.

In this work, we show that in the case of binary submodular energies an exact solution to the joint M-best-diverse problem can be obtained significantly faster than the approximate one with the sequential method [4]. Moreover, the difference in runtime grows with the number M of configurations.

Related work

The importance of the considered problem is underlined by the fact that a procedure for computing M-best solutions to discrete optimization problems was proposed over 40 years ago, in [28]. Later, more efficient specialized procedures were introduced for MAP-inference on a tree [34], junction-trees [30] and general graphical models [41, 13, 3]. However, such methods are not suited for the scenario where diversity of the solutions is required, since they do not enforce it explicitly.

Structural Determinant Point Processes [27] is a tool for modelling probability distributions of structured models.
Unfortunately, an efficient sampling procedure to obtain diverse configurations is feasible only for tree-structured graphical models. The recently proposed algorithm [8] to find the M best modes of a distribution is also limited to chains and junction chains of bounded width.

Training M independent graphical models to produce multiple diverse solutions was proposed in [15], and was further explored in [17, 9]. In contrast, we assume a single fixed model where configurations with low energy (hopefully) correspond to the desired output.

The work [4] defines the M-best-diverse problem and proposes a solver for it. However, the diversity of the solutions is defined sequentially, with respect to already extracted labelings. In contrast to [4], the work [22] defined the "joint M-best-diverse problem" as an optimization problem for a single joint energy. The work most closely related to ours is [23], where an efficient method for the joint M-best-diverse problem was proposed for submodular energies. The method is based on the fact that for submodular energies and a family of diversity measures (which includes, e.g., the Hamming distance) the set of M diverse solutions can be totally ordered with respect to the partial labeling order. In the binary labeling case, the M-best-diverse solutions form a nested set. However, although the method of [23] is a considerably more efficient way to solve the problem than the general algorithm proposed in [22], it is still considerably slower than the sequential method [4]. Furthermore, the runtime difference grows with the number M of configurations.

Interestingly, the above-mentioned "nestedness property" is also fulfilled by minimizers of a parametric submodular minimization problem [12]. In particular, it holds for the monotonic max-flow method [25], which is also widely used for obtaining diverse labelings in practice [7, 20, 19].
Naturally, we would like to ask questions about the relationship of these two techniques, such as: "Do the joint M-best-diverse configurations form a subset of the configurations returned by a parametric submodular minimization problem?", and conversely, "Can parametric submodular minimization be used to (efficiently) produce the M-best-diverse configurations?" We give positive answers to both these questions.

Contribution

• For binary submodular energies we establish a relationship between the joint M-best-diverse and the parametric submodular minimization problems. In the case of "concave node-wise diversity measures" 1 we give a closed-form formula for the parameter values which correspond to the joint M-best-diverse labelings. These values can be computed in advance, prior to any optimization, which allows each labeling to be obtained independently.
• Our theoretical results suggest a number of efficient algorithms for the joint M-best-diverse problem. We describe and experimentally evaluate the two simplest of them, sequential and parallel. Both are considerably faster than the popular technique [4] and are as easy to implement. We demonstrate the effectiveness of these algorithms on two publicly available datasets.

1 Precise definition is given in Sections 2 and 3.

2 Background and Problem Definition

Energy Minimization Let 2^A denote the powerset of a set A. The pair G = (V, F) is called a hyper-graph and has V as a finite set of variable nodes and F ⊆ 2^V as a set of factors. Each variable node v ∈ V is associated with a variable y_v taking its values in a finite set of labels L_v. The set L_A = \prod_{v \in A} L_v denotes the Cartesian product of the sets of labels corresponding to a subset A ⊆ V of variables.
Functions θ_f : L_f → R, associated with factors f ∈ F, are called potentials and define local costs on the values of variables and their combinations. Potentials θ_f with |f| = 1 are called unary, with |f| = 2 pairwise and with |f| > 2 high-order. Without loss of generality we will assume that there is a unary potential θ_v assigned to each variable v ∈ V. This implies that the factor set splits as F = V ∪ F', where F' = {f ∈ F : |f| ≥ 2}. For any non-unary factor f ∈ F' the corresponding set of variables {y_v : v ∈ f} will be denoted by y_f. The energy minimization problem consists in finding a labeling y* = (y_v : v ∈ V) ∈ L_V which minimizes the total sum of the corresponding potentials:

\arg\min_{y \in L_V} E(y) = \arg\min_{y \in L_V} \Big[ \sum_{v \in V} \theta_v(y_v) + \sum_{f \in F'} \theta_f(y_f) \Big].    (1)

Problem (1) is also known as MAP-inference. A labeling y* satisfying (1) will later be called a solution of the energy-minimization or MAP-inference problem, shortly a MAP-labeling or MAP-solution. Unless otherwise specified, we will assume that L_v = {0, 1}, v ∈ V, i.e. each variable may take only two values. Such energies will be called binary. We also assume that the order relations ≤ and ≥ are defined in a natural way on the sets L_v. The case when the energy E decomposes into unary and pairwise potentials only will be termed the pairwise case or pairwise energy.

In the following, we use brackets to distinguish between an upper index and a power, i.e. (A)^n means the n-th power of A, whereas n is an upper index in the expression A^n.
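To make the notation concrete, here is a minimal Python sketch of problem (1) on a hypothetical three-node binary chain; all potential values are made up for illustration, and brute-force enumeration stands in for the min-cut/max-flow solvers used later in the paper:

```python
import itertools

# Toy instance of problem (1): a binary chain with unary potentials theta_v
# and pairwise potentials theta_f (illustrative numbers only).
theta_v = [          # theta_v[v][label]
    [0.0, 1.5],
    [0.8, 0.2],
    [0.6, 0.4],
]
pairwise_factors = {  # f = (u, v) -> theta_f[y_u][y_v], Potts-like, submodular
    (0, 1): [[0.0, 1.0], [1.0, 0.0]],
    (1, 2): [[0.0, 1.0], [1.0, 0.0]],
}

def energy(y):
    """E(y) = sum_v theta_v(y_v) + sum_f theta_f(y_f), eq. (1)."""
    e = sum(theta_v[v][y[v]] for v in range(len(theta_v)))
    e += sum(t[y[u]][y[v]] for (u, v), t in pairwise_factors.items())
    return e

# Exhaustive MAP-inference over all 2^3 labelings.
y_star = min(itertools.product((0, 1), repeat=3), key=energy)
```

For this toy instance the minimizer is the all-zeros labeling; a real solver would obtain it via a single max-flow computation.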
We will keep, however, the standard notation R^n for the n-dimensional vector space and skip the brackets if an upper index does not make mathematical sense, such as in the expression {0, 1}^{|V|}.

Joint-DivMBest Problem Instead of searching for a single labeling with the lowest energy, one might ask for a set of labelings with low energies, yet significantly different from each other. In [22] it was proposed to infer such M diverse labelings {y^1, . . . , y^M} ∈ (L_V)^M jointly by minimizing

E^M({y}) = \sum_{i=1}^{M} E(y^i) - \lambda \Delta^M({y})    (2)

w.r.t. {y} := y^1, . . . , y^M for some λ > 0. Following [22] we use the notation {y} and {y}_v as shortcuts for y^1, . . . , y^M and y^1_v, . . . , y^M_v correspondingly. The function Δ^M defines the diversity of arbitrary M labelings, i.e. Δ^M({y}) takes a large value if the labelings {y} are in a certain sense diverse, and a small value otherwise.

In the following, we will refer to the problem (1) of minimizing the energy E itself as the master problem for (2).

Node-Wise Diversity In what follows we will consider only node-wise diversity measures, i.e. those which can be represented in the form

\Delta^M({y}) = \sum_{v \in V} \Delta^M_v({y}_v)    (3)

for some node diversity measure Δ^M_v : {0, 1}^M → R. Moreover, we will stick to permutation invariant diversity measures, in other words, measures such that Δ^M_v({y}_v) = Δ^M_v(π({y}_v)) for any permutation π of the variables {y}_v.

Let the expression ⟦A⟧ be equal to 1 if A is true and 0 otherwise. Let also m^0_v = \sum_{m=1}^{M} ⟦y^m_v = 0⟧ count the number of 0's in {y}_v.
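As an illustration (not the authors' code), the objective (2) with a node-wise diversity measure (3)-(4) can be evaluated directly. The sketch below uses a toy energy that merely counts 1-labels and the pairwise-Hamming node measure Δ_v(m) = m(M − m) of Example 1:

```python
def m0(labelings, v):
    """m0_v: how many of the M labelings assign label 0 to node v."""
    return sum(1 for y in labelings if y[v] == 0)

def joint_objective(labelings, energy, delta, lam):
    """E^M({y}) = sum_i E(y^i) - lam * sum_v Delta_v(m0_v), eqs. (2)-(4)."""
    n_nodes = len(labelings[0])
    data_term = sum(energy(y) for y in labelings)
    diversity = sum(delta(m0(labelings, v)) for v in range(n_nodes))
    return data_term - lam * diversity

# Toy usage: M = 2 labelings of 2 nodes; the energy just counts 1-labels.
M = 2
energy = lambda y: float(sum(y))
delta = lambda m: m * (M - m)      # Hamming node measure, Example 1
value = joint_objective([(0, 0), (1, 1)], energy, delta, lam=0.5)
```

For the two maximally different labelings above, each node contributes Δ(1) = 1 to the diversity, so the objective is 2.0 − 0.5 · 2 = 1.0; two identical labelings would get no diversity reward at all.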
In the binary case L_v = {0, 1}, any permutation invariant measure can be represented as

\Delta^M_v({y}_v) = \bar{\Delta}^M_v(m^0_v).    (4)

To keep notation simple, we will use Δ^M_v for both representations: Δ^M_v({y}_v) and \bar{Δ}^M_v(m^0_v).

Example 1 (Hamming distance diversity). Consider the common node diversity measure, the sum of Hamming distances between each pair of labels:

\Delta^M_v({y}_v) = \sum_{i=1}^{M} \sum_{j=i+1}^{M} ⟦y^i_v \neq y^j_v⟧.    (5)

This measure is permutation invariant. Therefore, it can be written as a function of the number m^0_v:

\Delta^M_v(m^0_v) = m^0_v \cdot (M - m^0_v).    (6)

Minimization Techniques for the Joint-DivMBest Problem Direct minimization of (2) has so far been considered a difficult problem, even when the master problem (1) is easy to solve. We refer to [22] for a detailed investigation of using general MAP-inference solvers for (2). In this paragraph we briefly summarize existing efficient minimization approaches for (2).

As shown in [22], the sequential method DivMBest [4] can be seen as a greedy algorithm for approximate minimization of (2), finding one solution after another. The sequential method [4] is used for diversity measures that can be represented as a sum of diversity measures between each pair of solutions, i.e. \Delta^M({y}) = \sum_{m_1=1}^{M} \sum_{m_2=m_1+1}^{M} \Delta^2(y^{m_1}, y^{m_2}). For each m = 1, . . . , M the method sequentially computes

y^m = \arg\min_{y \in L_V} \Big[ E(y) - \lambda \sum_{i=1}^{m-1} \Delta^2(y, y^i) \Big].    (7)

In the case of a node diversity measure (3), this algorithm requires sequentially solving M energy minimization problems (1), with only the unary potentials modified compared to the master problem (1). This typically implies that an efficient solver for the master problem can also be used to obtain its diverse solutions.

In [23] an efficient approach to (2) was proposed for submodular energies E. An energy E(y) is called submodular if for any two labelings y, y' ∈ L_V it holds that

E(y \vee y') + E(y \wedge y') \leq E(y) + E(y'),    (8)

where y ∨ y' and y ∧ y' denote correspondingly the node-wise maximum and minimum with respect to the natural order on the label set L_v.

In the following, we will use the term the higher labeling: the labeling y is higher than the labeling y' if y_v ≥ y'_v for all v ∈ V. So, the labeling y ∨ y' is higher than both y and y'. Since the set of all labelings is a lattice w.r.t. the operation ≥, we will also speak about the highest labeling.

It was shown in [23] that for submodular energies, under certain technical conditions on the diversity measure Δ^M_v (see Lemma 2 in [23]), the problem (2) can be reformulated as a submodular energy minimization and, therefore, can be solved either exactly or approximately by efficient optimization techniques (e.g., by reduction to min-cut/max-flow in the pairwise case). However, the size of the reformulated problem grows at least linearly with M (quadratically in the case of the Hamming distance diversity (5)) and therefore even approximate algorithms require more time than the DivMBest method (7).
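The greedy scheme (7) is easy to state in code. The following sketch (our illustration, with brute-force enumeration in place of the max-flow solver used by [4]) extracts M labelings one by one, penalizing the Hamming distance to the previously found ones; for measure (5) this penalty only perturbs the unary potentials:

```python
import itertools

def divmbest(energy, n_nodes, M, lam):
    """Greedy DivMBest, eq. (7): each new problem is the master energy
    penalized by -lam * (Hamming distance to all previously found labelings).
    Brute force stands in for the max-flow solver of the real method."""
    found = []
    for _ in range(M):
        def perturbed(y):
            # For the Hamming measure the penalty modifies only unary terms.
            return energy(y) - lam * sum(sum(a != b for a, b in zip(y, prev))
                                         for prev in found)
        found.append(min(itertools.product((0, 1), repeat=n_nodes),
                         key=perturbed))
    return found

# Toy usage: an energy that prefers all-zeros; with a large lam the second
# labeling is pushed maximally far from the first.
labelings = divmbest(lambda y: float(sum(y)), n_nodes=2, M=2, lam=2.0)
```

On this toy instance the method first returns the all-zeros MAP-labeling and then its complement, since the diversity penalty dominates the energy.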
Moreover, this difference in runtime grows with M.

The mentioned transformation of (2) into a submodular energy minimization problem is based on the theorem below, which plays a crucial role in obtaining the main results of this work. We first give a definition of the "nestedness property", which is also important for the rest of the paper.

Definition 1. An M-tuple (y^1, . . . , y^M) ∈ (L_V)^M is called nested if for each v ∈ V the inequality y^i_v ≤ y^j_v holds for 1 ≤ i ≤ j ≤ M, i.e. for L_v = {0, 1}, y^i_v = 1 implies y^j_v = 1 for j > i.

Theorem 1. [Special case of Thm. 1 of [23]] For a binary submodular energy and a node-wise permutation invariant diversity, there exists a nested minimizer of the Joint-DivMBest objective (2).

Parametric submodular minimization Let γ ∈ R^{|V|} be a vector of parameters with coordinates γ_v indexed by the node index v ∈ V. We define parametric energy minimization as the problem of evaluating the function

\min_{y \in L_V} E_\gamma(y) := \min_{y \in L_V} \Big[ E(y) + \sum_{v \in V} \gamma_v y_v \Big]    (9)

for all values of the parameter γ ∈ Γ ⊆ R^{|V|}. The most important cases of parametric energy minimization are

• the monotonic parametric max-flow problem [14, 18], which corresponds to the case when E is a binary submodular pairwise energy, Γ = {ν ∈ R^{|V|} : ν_v = γ_v(λ)} and the functions γ_v : Λ → R are non-increasing for Λ ⊆ R;
• a subclass of parametric submodular minimization [12, 2], where E is submodular and Γ = {γ^1, γ^2, . . . , γ^k ∈ R^{|V|} : γ^1 ≥ γ^2 ≥ . . . ≥ γ^k}, where the operation ≥ is applied coordinate-wise.

It is known [38] that in these two cases, (i) the highest minimizers y^1, . . . , y^k ∈ L_V of E_{γ^i}, i ∈ {1, . . . , k}, are nested, and (ii) the parametric problem (9) is solvable efficiently by the respective algorithms [14, 18, 12]. In the following, we will show that for a submodular energy E the Joint-DivMBest problem (2) reduces to parametric submodular minimization with the values γ^1 ≥ γ^2 ≥ . . . ≥ γ^M ∈ R^{|V|} given in closed form.

3 Joint M-Best-Diverse Problem as a Parametric Submodular Minimization

Our results hold for the following subclass of permutation invariant node-wise diversity measures:

Definition 2. A node-wise diversity measure Δ^M_v(m) is called concave if for any 1 ≤ i ≤ j ≤ M it holds that

\Delta^M_v(i) - \Delta^M_v(i-1) \geq \Delta^M_v(j) - \Delta^M_v(j-1).    (10)

There are a number of practically relevant concave diversity measures:

Example 2. The Hamming distance diversity (6) is concave, see Fig. 1 for an illustration.

Example 3. Diversity measures of the form

\Delta^M_v(m^0_v) = -\big(|m^0_v - (M - m^0_v)|\big)^p = -\big(|2m^0_v - M|\big)^p    (11)

are concave for any p ≥ 1. Here M − m^0_v is the number of the variables {y}_v labeled as 1. Hence, |m^0_v − (M − m^0_v)| is the absolute value of the difference between the numbers of variables labeled as 0 and 1. It expresses the natural fact that a distribution of 0's and 1's is more diverse when their amounts are similar.

For p = 1 we call the measure (11) linear; for p = 2 the measure (11) coincides with the Hamming distance diversity (6). An illustration of these two cases is given in Fig. 1.

Figure 1: Hamming distance (left) and linear (right) diversity measures for M = 5. The value m is defined as \sum_{m=1}^{M} ⟦y^m_v = 0⟧. Both diversity measures are concave.

Our main theoretical result is given by the following theorem:

Theorem 2.
Let E be binary submodular and Δ^M be a node-wise diversity measure with each component Δ^M_v, v ∈ V, being permutation invariant and concave. Then a nested M-tuple (y^m)_{m=1}^{M} minimizing the Joint-DivMBest objective (2) can be found as the solutions of the following M problems:

y^m = \arg\min_{y \in L_V} \Big[ E(y) + \sum_{v \in V} \gamma^m_v y_v \Big],    (12)

where γ^m_v = λ(Δ^M_v(m) − Δ^M_v(m−1)). In the case of multiple solutions of (12) the highest minimizer must be selected.

We refer to the supplement for the proof of Theorem 2 and discuss its practical consequences below. First note that the sequence (γ^m)_{m=1}^{M} is monotone due to the concavity of Δ^M_v. Each of the M optimization problems (12) has the same size as the master problem (1) and differs from it by unary potentials only. Theorem 2 implies that the γ^m in (12) satisfy the monotonicity condition γ^1 ≥ γ^2 ≥ . . . ≥ γ^M. Therefore, the equations (12) constitute a parametric submodular minimization problem as defined above, which reduces to the monotonic parametric max-flow problem for pairwise E. Let ⌊·⌋ denote the largest integer not exceeding the argument of the operation.

Corollary 1. Let Δ^M_v in Theorem 2 be the Hamming distance diversity (6). Then it holds:
1. γ^m_v = λ(M − 2m + 1).
2. The values γ^m_v, m = 1, . . . , M, are symmetrically distributed around 0: −γ^m_v = γ^{M+1−m}_v and γ^m_v ≥ 0 for m ≤ ⌊(M + 1)/2⌋, and γ^m_v = 0 if m = (M + 1)/2.
3. Moreover, this distribution is uniform, that is, γ^m_v − γ^{m+1}_v = 2λ for m = 1, . . . , M − 1.
4. When M is odd, the MAP-solution (corresponding to γ^{(M+1)/2} = 0) is always among the M-best-diverse labelings minimizing (2).

Corollary 2. Implications 2 and 4 of Corollary 1 hold for any symmetrical concave Δ^M_v, i.e., those where Δ^M_v(m) = Δ^M_v(M + 1 − m) for m ≤ ⌊(M + 1)/2⌋.

Corollary 3. For the linear diversity measure the value γ^m_v in (12) is equal to λ · sgn(M/2 − m), where sgn(x) is the sign function, i.e., sgn(x) = ⟦x > 0⟧ − ⟦x < 0⟧. Since all γ^m_v for m < M/2 are the same, this diversity measure can give only up to 3 different diverse labelings. Therefore, this diversity measure is not useful for M > 3, and can be seen as a limit of useful concave diversity measures.

4 Efficient Algorithmic Solutions

Theorem 2 suggests several new computational methods for minimizing the Joint-DivMBest objective (2). All of them are more efficient than those proposed in [23]. Indeed, as we show experimentally in Section 5, they outperform even the sequential DivMBest method (7).

The simplest algorithm applies a MAP-inference solver to each of the M problems (12) sequentially and independently. This algorithm has the same computational cost as DivMBest (7), since it also sequentially solves M problems of the same size. However, already its slightly improved version, described below, performs faster than DivMBest (7).

Sequential Algorithm Theorem 2 states that the solutions of (12) are nested. Therefore, y^{m−1}_v = 1 implies y^m_v = 1 for the labelings y^{m−1} and y^m obtained according to (12). This allows reducing the size and computing time of each subsequent problem in the sequence.² Reusing the flow from the previous step gives an additional speedup.
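The closed-form parameters of Theorem 2 and Corollary 1 can be computed and sanity-checked in a few lines. The sketch below is our illustration, not the authors' code: the tiny two-node energy is made up, and exhaustive search replaces a max-flow solver. It builds γ^m = λ(Δ(m) − Δ(m−1)) for the Hamming measure, solves the M perturbed problems (12), and verifies against brute-force minimization of the joint objective (2) that the resulting tuple is optimal and nested:

```python
import itertools

M, lam = 3, 0.2
delta = lambda m: m * (M - m)          # Hamming diversity, eq. (6)

# gamma^m = lam * (Delta(m) - Delta(m-1)), m = 1..M   (Theorem 2)
gamma = [lam * (delta(m) - delta(m - 1)) for m in range(1, M + 1)]

def energy(y):
    """Tiny submodular two-node pairwise energy (illustrative numbers)."""
    unary = [(0.0, 0.4), (0.3, 0.0)]
    return unary[0][y[0]] + unary[1][y[1]] + 0.5 * (y[0] != y[1])

labelings = list(itertools.product((0, 1), repeat=2))

def solve(g):
    """Highest minimizer of E(y) + g * sum_v y_v, as Theorem 2 requires."""
    best = min(energy(y) + g * sum(y) for y in labelings)
    return max((y for y in labelings
                if abs(energy(y) + g * sum(y) - best) < 1e-12), key=sum)

diverse = [solve(g) for g in gamma]     # the M independent problems (12)

def joint(tup):
    """Joint objective (2) with the node-wise diversity (3)."""
    div = sum(delta(sum(y[v] == 0 for y in tup)) for v in range(2))
    return sum(energy(y) for y in tup) - lam * div

best_joint = min(joint(t) for t in itertools.product(labelings, repeat=M))
```

Here `gamma` comes out as [0.4, 0.0, −0.4], matching Corollary 1's formula λ(M − 2m + 1), and the tuple assembled from the M independent solutions attains the brute-force optimum of (2) while being nested.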
In fact, when applying a push-relabel or pseudoflow algorithm in this fashion, the total work complexity is asymptotically the same as that of a single minimum cut [14, 18] of the master problem. In practice, this strategy is efficient with other min-cut solvers (without theoretical guarantees) as well. In our experiments we evaluated it with the dynamic augmenting path method [6, 24].

Parallel Algorithm The M problems (12) are completely independent, and their highest minimizers recover the optimal M-tuple (y^m)_m according to Theorem 2. They can be solved fully in parallel or, using p < M processors, in parallel groups of M/p problems per processor, incrementally within each group. The overhead is only in copying the data costs and sharing the memory bandwidth.

Alternative approaches One may suggest that for large M it would be more efficient to solve the full parametric max-flow problem [18, 14] and then "read out" the solutions corresponding to the desired values γ^m. However, the known algorithms [18, 14] would perform exactly the incremental computation described in the sequential approach above, plus the extra work of identifying all breakpoints. This is only sensible when M is larger than the number of breakpoints or when the diversity measure is not known in advance (e.g., is itself parametric). Similarly, parametric submodular function minimization can be solved in the same worst-case complexity [12] as the non-parametric problem, but the algorithm is again incremental and would simply perform less work when the parameters of interest are known in advance.

5 Experimental Evaluation

We base our experiments on two datasets: (i) the interactive foreground/background image segmentation dataset utilized in several papers [4, 31, 22, 23] for comparing diversity techniques; (ii) a new

² By applying "symmetric reasoning" for the label 0, further speed-ups can be achieved.
However, we stick to the first variant in our experiments.

Table 1: Interactive segmentation. The quality measure is the per-pixel accuracy of the best segmentation, out of M, averaged over all test images. The runtime is in milliseconds (ms). The quality for M = 1 is 91.57. Parametric-parallel is the fastest method, followed by Parametric-sequential. Both achieve higher quality than DivMBest, and return the same solution as Joint-DivMBest.

                                       M=2                M=6                M=10
                                 quality  time (ms)  quality  time (ms)  quality  time (ms)
DivMBest [4]                      93.16      2.6      95.02     11.6      95.16     15.4
Joint-DivMBest [23]               95.13      5.5      96.01     17.2      96.19     80.3
Parametric-sequential (1 core)    95.13      2.2      96.01      5.5      96.19      8.4
Parametric-parallel (6 cores)     95.13      1.9      96.01      4.3      96.19      6.2

dataset for foreground/background image segmentation with binary pairwise energies derived from the well-known PASCAL VOC 2012 dataset [11]. Energies of the master problem (1) in both cases are binary and pairwise, therefore we use their reduction [26] to the min-cut/max-flow problem to obtain solutions efficiently.

Baselines Our main competitor is the fastest known approach for inferring M diverse solutions, the DivMBest method [4]. We made an efficient re-implementation of it using dynamic graph-cut [24]. We also compare our method with Joint-DivMBest [23], which provides an exact minimum of (2), as our method does.

Diversity Measure In all of our experiments we use the Hamming distance diversity measure (5). Note that in [31] more sophisticated diversity measures were used, e.g., the Hamming Ball. However, the DivMBest method (7) with this measure requires running the very time-consuming HOP-MAP [37] inference technique.
Moreover, the experimental evaluation in [23] suggests that the exact minimum of (2) with the Hamming distance diversity (5) outperforms DivMBest with a Hamming Ball distance diversity.

Our Method We evaluate both algorithms described in Section 4, i.e., sequential and parallel. We refer to them as Parametric-sequential and Parametric-parallel respectively. We utilize the dynamic graph-cut [24] technique for Parametric-sequential, which makes it comparable to our implementation of DivMBest. The max-flow solver of [6] is used for Parametric-parallel, together with OpenMP directives. For the experiments we use a computer with 6 physical cores (12 virtual cores), and run Parametric-parallel with M threads.

Parameters λ (from (7) and (2)) were tuned via cross-validation for each algorithm and each experiment separately.

5.1 Interactive Segmentation

The basic idea is that after a user interaction, the system provides the user with M diverse segmentations, instead of a single one. The user can then manually select the best one and add more user scribbles, if necessary. Following [4] we consider only the first iteration of such an interactive procedure, i.e., we consider the user scribbles to be given and compare the sets of segmentations returned by the system.

The authors of [4] kindly provided us their 50 super-pixel graphical model instances. They are based on a subset of the PASCAL VOC 2010 [11] segmentation challenge with manually added scribbles. An instance has on average 3000 nodes. Pairwise potentials are given by contrast-sensitive Potts terms [5], which are submodular in the binary case. This implies that Theorem 2 is applicable.

Quantitative comparison and runtime of the different algorithms are presented in Table 1. As in [4], our quality measure is the per-pixel accuracy of the best solution for each test image, averaged over all test images. As expected, Joint-DivMBest and Parametric-* return the same, exact solution of (2). The measured runtime is also averaged over all test images. Parametric-parallel is the fastest method, followed by Parametric-sequential. Note that on a computer with fewer cores, Parametric-sequential may even outperform Parametric-parallel because of the parallelization overheads.

5.2 Foreground/Background Segmentation

The Pascal VOC 2012 [11] segmentation dataset has 21 labels. We selected all those 451 images from the validation set for which the ground truth labeling has only two labels (background and one of the 20 object classes) and which were not used for training. As unary potentials we use the output probabilities of the publicly available fully convolutional neural network FCN-8s [29], which is trained for the Pascal VOC 2012 challenge. This CNN gives unary terms for all 21 classes. For each image we pick only two classes: the background and the class-label that is present in the ground truth. As pairwise potentials we use contrast-sensitive Potts terms [5] with a 4-connected grid structure.

Figure 2: Foreground/background segmentation. (a) Intersection-over-union (IoU) score for the best segmentation, out of M. Parametric represents a curve which is the same for Parametric-sequential, Parametric-parallel and Joint-DivMBest, since they exactly solve the same Joint-DivMBest problem. (b) Runtime in seconds. DivMBest uses dynamic graph-cut [24]. Parametric-sequential uses dynamic graph-cut and a reduced-size graph for each consecutive labeling problem. Parametric-parallel solves M problems in parallel using OpenMP.

Quantitative Comparison and Runtime As quality measure we use the standard Pascal VOC measure for semantic segmentation, average intersection-over-union (IoU) [11]. The unary potentials alone, i.e., the output of FCN-8s, give 82.12 IoU.
The single best labeling, returned by the MAP-inference problem, improves this to 83.23 IoU.
The comparisons with respect to accuracy and runtime are presented in Fig. 2a and Fig. 2b, respectively. The increase in runtime with respect to M for Parametric-parallel is due to parallelization overhead costs, which grow with M. Parametric-parallel is the clear winner in this experiment, both in terms of quality and runtime. Parametric-sequential is slower than Parametric-parallel but faster than DivMBest. The difference in runtime between these three algorithms grows with M.
6 Conclusion and Outlook
We have shown that the M labelings that constitute a solution to the Joint-DivMBest problem with binary submodular energies and concave node-wise permutation-invariant diversity measures can be computed in parallel, independently from each other, as solutions of the master energy minimization problem with modified unary costs. This allows us to build solvers that run even faster than the approximate method of Batra et al. [4]. Furthermore, we have shown that such Joint-DivMBest problems reduce to parametric submodular minimization. This establishes a clear connection between these two practical approaches to obtaining diverse solutions to the energy minimization problem.
References
[1] Ahmed, F., Tarlow, D., Batra, D.: Optimizing expected Intersection-over-Union with candidate-constrained CRFs. In: ICCV (2015)
[2] Bach, F.: Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning 6(2-3), 145–373 (2013)
[3] Batra, D.: An efficient message-passing algorithm for the M-best MAP problem. In: UAI (2012)
[4] Batra, D., Yadollahpour, P., Guzman-Rivera, A., Shakhnarovich, G.: Diverse M-best solutions in Markov random fields. In: ECCV.
Springer Berlin/Heidelberg (2012)
[5] Boykov, Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: ICCV (2001)
[6] Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. TPAMI 26(9), 1124–1137 (2004)
[7] Carreira, J., Sminchisescu, C.: Constrained parametric min-cuts for automatic object segmentation. In: CVPR. pp. 3241–3248 (2010)
[8] Chen, C., Kolmogorov, V., Zhu, Y., Metaxas, D.N., Lampert, C.H.: Computing the M most probable modes of a graphical model. In: AISTATS (2013)
[9] Dey, D., Ramakrishna, V., Hebert, M., Andrew Bagnell, J.: Predicting multiple structured visual interpretations. In: ICCV (2015)
[10] Elidan, G., Globerson, A.: The probabilistic inference challenge (PIC2011) (2011)
[11] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results (2012)
[12] Fleischer, L., Iwata, S.: A push-relabel framework for submodular function minimization and applications to parametric optimization. Discrete Applied Mathematics 131(2), 311–322 (2003)
[13] Fromer, M., Globerson, A.: An LP view of the M-best MAP problem.
In: NIPS 22 (2009)
[14] Gallo, G., Grigoriadis, M.D., Tarjan, R.E.: A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing 18(1), 30–55 (1989)
[15] Guzman-Rivera, A., Batra, D., Kohli, P.: Multiple choice learning: Learning to produce multiple structured outputs. In: NIPS 25 (2012)
[16] Guzman-Rivera, A., Kohli, P., Batra, D.: DivMCuts: Faster training of structural SVMs with diverse M-best cutting-planes. In: AISTATS (2013)
[17] Guzman-Rivera, A., Kohli, P., Batra, D., Rutenbar, R.A.: Efficiently enforcing diversity in multi-output structured prediction. In: AISTATS (2014)
[18] Hochbaum, D.S.: The pseudoflow algorithm: A new algorithm for the maximum-flow problem. Operations Research 56(4), 992–1009 (2008)
[19] Hower, D., Singh, V., Johnson, S.C.: Label set perturbation for MRF based neuroimaging segmentation. In: ICCV. pp. 849–856. IEEE (2009)
[20] Jug, F., Pietzsch, T., Kainmüller, D., Funke, J., Kaiser, M., van Nimwegen, E., Rother, C., Myers, G.: Optimal joint segmentation and tracking of Escherichia coli in the mother machine. In: BAMBI (2014)
[21] Kappes, J.H., Andres, B., Hamprecht, F.A., Schnörr, C., Nowozin, S., Batra, D., Kim, S., Kausler, B.X., Kröger, T., Lellmann, J., Komodakis, N., Savchynskyy, B., Rother, C.: A comparative study of modern inference techniques for structured discrete energy minimization problems. IJCV pp. 1–30 (2015)
[22] Kirillov, A., Savchynskyy, B., Schlesinger, D., Vetrov, D., Rother, C.: Inferring M-best diverse labelings in a single one. In: ICCV (2015)
[23] Kirillov, A., Schlesinger, D., Vetrov, D.P., Rother, C., Savchynskyy, B.: M-best-diverse labelings for submodular energies and beyond. In: NIPS (2015)
[24] Kohli, P., Torr, P.H.: Dynamic graph cuts for efficient inference in Markov random fields.
TPAMI (2007)
[25] Kolmogorov, V., Boykov, Y., Rother, C.: Applications of parametric maxflow in computer vision. In: ICCV. pp. 1–8. IEEE (2007)
[26] Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? TPAMI (2004)
[27] Kulesza, A., Taskar, B.: Structured determinantal point processes. In: NIPS 23 (2010)
[28] Lawler, E.L.: A procedure for computing the K best solutions to discrete optimization problems and its application to the shortest path problem. Management Science 18(7) (1972)
[29] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
[30] Nilsson, D.: An efficient algorithm for finding the M most probable configurations in probabilistic expert systems. Statistics and Computing 8(2), 159–173 (1998)
[31] Prasad, A., Jegelka, S., Batra, D.: Submodular meets structured: Finding diverse subsets in exponentially-large structured item sets. In: NIPS 27 (2014)
[32] Premachandran, V., Tarlow, D., Batra, D.: Empirical minimum Bayes risk prediction: How to extract an extra few % performance from vision models with just three more parameters. In: CVPR (2014)
[33] Ramakrishna, V., Batra, D.: Mode-marginals: Expressing uncertainty via diverse M-best solutions. In: NIPS Workshop on Perturbations, Optimization, and Statistics (2012)
[34] Schlesinger, M.I., Hlavac, V.: Ten lectures on statistical and structural pattern recognition (2002)
[35] Sener, O., Saxena, A.: rCRF: Recursive belief estimation over CRFs in RGB-D activity videos. In: Robotics Science and Systems (RSS) (2015)
[36] Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M.F., Rother, C.: A comparative study of energy minimization methods for Markov random fields with smoothness-based priors.
TPAMI 30(6), 1068–1080 (2008)
[37] Tarlow, D., Givoni, I.E., Zemel, R.S.: HOP-MAP: Efficient message passing with high order potentials. In: AISTATS (2010)
[38] Topkis, D.M.: Minimizing a submodular function on a lattice. Operations Research 26(2), 305–321 (1978)
[39] Wang, S., Fidler, S., Urtasun, R.: Lost shopping! Monocular localization in large indoor spaces. In: ICCV (2015)
[40] Yadollahpour, P., Batra, D., Shakhnarovich, G.: Discriminative re-ranking of diverse segmentations. In: CVPR (2013)
[41] Yanover, C., Weiss, Y.: Finding the M most probable configurations using loopy belief propagation. In: NIPS 17 (2004)