{"title": "Submodular meets Structured: Finding Diverse Subsets in Exponentially-Large Structured Item Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 2645, "page_last": 2653, "abstract": "To cope with the high level of ambiguity faced in domains such as Computer Vision or Natural Language processing, robust prediction methods often search for a diverse set of high-quality candidate solutions or proposals. In structured prediction problems, this becomes a daunting task, as the solution space (image labelings, sentence parses, etc.) is exponentially large. We study greedy algorithms for finding a diverse subset of solutions in structured-output spaces by drawing new connections between submodular functions over combinatorial item sets and High-Order Potentials (HOPs) studied for graphical models. Specifically, we show via examples that when marginal gains of submodular diversity functions allow structured representations, this enables efficient (sub-linear time) approximate maximization by reducing the greedy augmentation step to inference in a factor graph with appropriately constructed HOPs. We discuss benefits, tradeoffs, and show that our constructions lead to significantly better proposals.", "full_text": "Submodular meets Structured: Finding Diverse\n\nSubsets in Exponentially-Large Structured Item Sets\n\nAdarsh Prasad\n\nUT Austin\n\nadarsh@cs.utexas.edu\n\nStefanie Jegelka\n\nUC Berkeley\n\nstefje@eecs.berkeley.edu\n\nDhruv Batra\nVirginia Tech\n\ndbatra@vt.edu\n\nAbstract\n\nTo cope with the high level of ambiguity faced in domains such as Computer\nVision or Natural Language processing, robust prediction methods often search\nfor a diverse set of high-quality candidate solutions or proposals. In structured\nprediction problems, this becomes a daunting task, as the solution space (image\nlabelings, sentence parses, etc.) is exponentially large. 
We study greedy algorithms for finding a diverse subset of solutions in structured-output spaces by drawing new connections between submodular functions over combinatorial item sets and High-Order Potentials (HOPs) studied for graphical models. Specifically, we show via examples that when marginal gains of submodular diversity functions allow structured representations, this enables efficient (sub-linear time) approximate maximization by reducing the greedy augmentation step to inference in a factor graph with appropriately constructed HOPs. We discuss benefits, tradeoffs, and show that our constructions lead to significantly better proposals.

1 Introduction

Many problems in Computer Vision, Natural Language Processing and Computational Biology involve mappings from an input space X to an exponentially large space Y of structured outputs. For instance, Y may be the space of all segmentations of an image with n pixels, each of which may take L labels, so |Y| = L^n. Formulations such as Conditional Random Fields (CRFs) [24], Max-Margin Markov Networks (M3N) [31], and Structured Support Vector Machines (SSVMs) [32] have successfully provided principled ways of scoring all solutions y ∈ Y and predicting the single highest scoring or maximum a posteriori (MAP) configuration, by exploiting the factorization of a structured output into its constituent "parts".

In a number of scenarios, the posterior P(y|x) has several modes due to ambiguities, and we seek not only a single best prediction but a set of good predictions:

(1) Interactive Machine Learning. Systems like Google Translate (for machine translation) or Photoshop (for interactive image segmentation) solve structured prediction problems that are often ambiguous ("what did the user really mean?"). 
Generating a small set of relevant candidate solutions for the user to select from can greatly improve the results.

(2) M-Best hypotheses in cascades. Machine learning algorithms are often cascaded, with the output of one model being fed into another [33]. Hence, at the initial stages it is not necessary to make a single perfect prediction. We rather seek a set of plausible predictions that are subsequently re-ranked, combined or processed by a more sophisticated mechanism.

In both scenarios, we ideally want a small set of M plausible (i.e., high scoring) but non-redundant (i.e., diverse) structured outputs to hedge our bets.

Submodular Maximization and Diversity. The task of searching for a diverse high-quality subset of items from a ground set V has been well-studied in information retrieval [5], sensor placement [22], document summarization [26], viral marketing [17], and robotics [10]. Across these domains, submodularity has emerged as a fundamental and practical concept – a property of functions for measuring the diversity of a subset of items. Specifically, a set function F : 2^V → R is submodular if its marginal gains, F(a|S) ≡ F(S ∪ a) − F(S), are decreasing, i.e. F(a|S) ≥ F(a|T) for all S ⊆ T and a ∉ T.

Figure 1: (a) input image; (b) space of all possible object segmentations / labelings (each item is a segmentation); (c) we convert the problem of finding the item with the highest marginal gain F(a|S) to a MAP inference problem in a factor graph over base variables y with an appropriately defined HOP.

In addition, if F is monotone, i.e., F(S) ≤ F(T), ∀S ⊆ T, then a simple greedy algorithm (that in each iteration t adds to the current set S_t the item with the largest marginal gain F(a|S_t)) achieves an approximation factor of (1 − 1/e) [27]. 
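As a concrete (toy) illustration of this greedy scheme, the sketch below runs greedy selection with an exhaustive argmax over a small, explicit ground set. The coverage-style objective and all names here are illustrative; in the structured setting of this paper the inner argmax is replaced by a MAP query rather than a linear scan.

```python
def greedy_maximize(ground_set, marginal_gain, M):
    """Greedy selection: repeatedly add the item with the largest
    marginal gain F(a | S).  The argmax here is a linear scan, which
    is only feasible when the ground set is small and explicit."""
    S = []
    for _ in range(M):
        S.append(max(ground_set, key=lambda a: marginal_gain(a, S)))
    return S

# Toy monotone submodular F: coverage of a universe by sets of elements.
ground_set = [frozenset(s) for s in [{1, 2}, {2, 3}, {3, 4}, {5}]]

def marginal_gain(a, S):
    # Gain = number of elements of `a` not already covered by S.
    covered = set().union(*S) if S else set()
    return len(a - covered)
```

Running `greedy_maximize(ground_set, marginal_gain, 2)` first picks a two-element set and then the set contributing the most new elements, illustrating the diminishing-gains behavior the (1 − 1/e) guarantee relies on.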
This result has had significant practical impact [21]. Unfortunately, if the number of items |V| is exponentially large, then even a single linear scan for greedy augmentation is infeasible.

In this work, we study conditions under which it is feasible to greedily maximize a submodular function over an exponentially large ground set V = {v_1, ..., v_N} whose elements are combinatorial objects, i.e., labelings of a base set of n variables y = {y_1, y_2, ..., y_n}. For instance, in image segmentation, the base variables y_i are pixel labels, and each item a ∈ V is a particular labeling of the pixels. Or, if each base variable y_e indicates the presence or absence of an edge e in a graph, then each item may represent a spanning tree or a maximal matching. Our goal is to find a set of M plausible and diverse configurations efficiently, i.e. in time sub-linear in |V| (ideally scaling as a low-order polynomial in log |V|). We will assume F(·) to be monotone submodular, nonnegative and normalized (F(∅) = 0), and base our study on the greedy algorithm. As a running example, we focus on pixel labeling, where each base variable takes values in a set [L] = {1, ..., L} of labels.

Contributions. Our principal contribution is a conceptual one. We observe that marginal gains of a number of submodular functions allow structured representations, and this enables efficient greedy maximization over exponentially large ground sets – by reducing the greedy augmentation step to a MAP inference query in a discrete factor graph augmented with a suitably constructed High-Order Potential (HOP). Thus, our work draws new connections between two seemingly disparate but highly related areas in machine learning – submodular maximization and inference in graphical models with structured HOPs. 
As specific examples, we construct submodular functions for three different, task-dependent definitions of diversity, and provide reductions to three different HOPs for which efficient inference techniques have already been developed. Moreover, we present a generic recipe for constructing such submodular functions, which may be combined with efficient HOPs discovered in future work. Our empirical contribution is an efficient algorithm for producing a set of image segmentations with significantly higher oracle accuracy¹ than previous works. The algorithm is general enough to transfer to other applications. Fig. 1 shows an overview of our approach.

Related work: generating multiple solutions. Determinantal Point Processes are an elegant probabilistic model over sets of items with a preference for diversity. Their generalization to a structured setting [23] assumes a tree-structured model, an assumption that we do not make. Guzman-Rivera et al. [14, 15] learn a set of M models, each producing one solution, to form the set of solutions. Their approach requires access to the learning sub-routine and repeated re-training of the models, which is not always possible, as it may be expensive or proprietary. We assume to be given a single (pre-trained) model from which we must generate multiple diverse, good solutions. Perhaps the closest to our setting are recent techniques for finding diverse M-best solutions [2, 28] or modes [7, 8] in graphical models. While [7] and [8] are inapplicable since they are restricted to chain and tree graphs, we compare to other baselines in Sections 3.2 and 4.

1.1 Preliminaries and Notation

We select from a ground set V of N items. Each item is a labeling y = {y_1, y_2, ..., y_n} of n base variables. For clarity, we use non-bold letters a ∈ V for items, and boldface letters y for base set configurations. 
Uppercase letters refer to functions over the ground set items F(a|A), R(a|A), D(a|A), and lowercase letters to functions over base variables f(y), r(y), d(y).

¹The accuracy of the most accurate segmentation in the set.

Formally, there is a bijection φ : V → [L]^n that maps items a ∈ V to their representation as base variable labelings y = φ(a). For notational simplicity, we often use y ∈ S to mean φ⁻¹(y) ∈ S, i.e. the item corresponding to the labeling y is present in the set S ⊆ V. We write ℓ ∈ y if the label ℓ is used in y, i.e. ∃j s.t. y_j = ℓ. For a set c ⊆ [n], we use y_c to denote the tuple {y_i | i ∈ c}.

Our goal is to find an ordered set or list of items S ⊆ V that maximizes a scoring function F. Lists generalize the notion of sets, and allow for reasoning about item order and repetitions. More details about list vs. set prediction can be found in [29, 10].

Scoring Function. We trade off the relevance and diversity of a list S ⊆ V via a scoring function F : 2^V → R of the form

F(S) = R(S) + λ D(S),    (1)

where R(S) = Σ_{a∈S} R(a) is a modular nonnegative relevance function that aggregates the quality of all items in the list; D(S) is a monotone normalized submodular function that measures the diversity of items in S; and λ ≥ 0 is a trade-off parameter. Similar objective functions were used e.g. in [26]. They are reminiscent of the general paradigm in machine learning of combining a loss function that measures quality (e.g. training error) and a regularization term that encourages desirable properties (e.g. smoothness, sparsity, or "diversity").

Submodular Maximization. We aim to find a list S that maximizes F(S) subject to a cardinality constraint |S| ≤ M. For monotone submodular F, this may be done via a greedy algorithm that starts out with S_0 = ∅, and iteratively adds the next best item:

S_t = S_{t−1} ∪ a_t,   a_t ∈ argmax_{a∈V} F(a | S_{t−1}).    (2)

The final solution S_M is within a factor of (1 − 1/e) of the optimal solution S*: F(S_M) ≥ (1 − 1/e) F(S*) [27]. The computational bottleneck is that in each iteration, we must find the item with the largest marginal gain. Clearly, if |V| has exponential size, we cannot touch each item even once. Instead, we propose "augmentation sub-routines" that exploit the structure of V and maximize the marginal gain by solving an optimization problem over the base variables.

2 Marginal Gains in Configuration Space

To solve the greedy augmentation step via optimization over y, we transfer the marginal gain from the world of items to the world of base variables and derive functions on y from F:

f(y|S) ≡ F(φ⁻¹(y) | S) = R(φ⁻¹(y)) + λ D(φ⁻¹(y) | S) = r(y) + λ d(y|S).    (3)

Maximizing F(a|S) now means maximizing f(y|S) for y = φ(a). This can be a hard combinatorial optimization problem in general. However, as we will see, there is a broad class of useful functions F for which f inherits exploitable structure, and argmax_y f(y|S) can be solved efficiently, exactly or at least approximately.

Relevance Function. We use a structured relevance function R(a) that is the score of a factor graph defined over the base variables y. Let G = (V, E) be a graph defined over {y_1, y_2, ..., y_n}, i.e. V = [n] and E a set of unordered pairs of nodes. Let C = {C | C ⊆ V} be a set of cliques in the graph, and let θ_C : [L]^|C| → R be the log-potential functions (or factors) for these cliques. The quality of an item a = φ⁻¹(y) is then given by R(a) = r(y) = Σ_{C∈C} θ_C(y_C). For instance, with only node and edge factors, this quality becomes r(y) = Σ_{p∈V} θ_p(y_p) + Σ_{(p,q)∈E} θ_pq(y_p, y_q). In this model, finding the single highest quality item corresponds to maximum a posteriori (MAP) inference in the factor graph.

Although we refer to terms with probabilistic interpretations such as "MAP", we treat our relevance function as the output of an energy-based model [25] such as a Structured SVM [32]. For instance, r(y) = Σ_{C∈C} θ_C(y_C) = wᵀψ(y) for parameters w and feature vector ψ(y). Moreover, we assume that the relevance function r(y) is nonnegative². This assumption ensures that F(·) is monotone. If F is non-monotone, algorithms other than the greedy are needed [4, 12]. We leave this generalization for future work. In most application domains the relevance function is learned from data and thus our positivity assumption is not restrictive – one can simply learn a positive relevance function. For instance, in SSVMs, the relevance weights are learnt to maximize the margin between the correct labeling and all incorrect ones. We show in the supplement that SSVM parameters that assign nonnegative scores to all labelings achieve exactly the same hinge loss (and thus the same generalization error) as without the nonnegativity constraint.

²Strictly speaking, this condition is sufficient but not necessary. We only need nonnegative marginal gains.

Figure 2: Diversity via groups: (a) groups defined by the presence of labels (i.e. #groups = L); (b) groups defined by Hamming balls around each item/labeling (i.e. #groups = L^n). In each case, diversity is measured by how many groups are covered by a new item. 
See text for details.

(a) Label Groups   (b) Hamming Ball Groups

3 Structured Diversity Functions

We next discuss a general recipe for constructing monotone submodular diversity functions D(S), and for reducing their marginal gains to structured representations over the base variables d(y|S). Our scheme relies on constructing groups G_i that cover the ground set, i.e. V = ∪_i G_i. These groups will be defined by task-dependent characteristics – for instance, in image segmentation, G_ℓ can be the set of all segmentations that contain label ℓ. The groups can be overlapping. For instance, if a segmentation y contains pixels labeled "grass" and "cow", then y ∈ G_grass and y ∈ G_cow.

Group Coverage: Count Diversity. Given V and a set of groups {G_i}, we measure the diversity of a list S in terms of its group coverage, i.e., the number of groups covered jointly by items in S:

D(S) = |{i | G_i ∩ S ≠ ∅}|,    (4)

where we define G_i ∩ S as the intersection of G_i with the set of unique items in S. It is easy to show that this function is monotone submodular. If G_ℓ is the group of all segmentations that contain label ℓ, then the diversity measure of a list of segmentations S is the number of object labels that appear in any a ∈ S. The marginal gain is the number of new groups covered by a:

D(a | S) = |{i | a ∈ G_i and S ∩ G_i = ∅}|.    (5)

Thus, the greedy algorithm will try to find an item/segmentation that belongs to as many as yet unused groups as possible.

Group Coverage: General Diversity. More generally, instead of simply counting the number of groups covered by S, we can use a more refined decay:

D(S) = Σ_i h(|G_i ∩ S|),    (6)

where h is any nonnegative nondecreasing concave scalar function. This is a sum of submodular functions and hence submodular. Eqn. (4) is a special case of Eqn. (6) with h(y) = min{1, y}. Other possibilities are √· or log(1 + ·). For this general definition of diversity, the marginal gain is

D(a | S) = Σ_{i: G_i ∋ a} [h(1 + |G_i ∩ S|) − h(|G_i ∩ S|)].    (7)

Since h is concave, the gain h(1 + |G_i ∩ S|) − h(|G_i ∩ S|) decreases as S becomes larger. Thus, the marginal gain of an item a is proportional to how rare each group G_i ∋ a is in the list S.

In each step of the greedy algorithm, we maximize r(y) + λ d(y|S). We already established a structured representation of r(y) via a factor graph on y. In the next few subsections, we specify three example definitions of groups G_i that instantiate three diversity functions D(S). For each D(S), we show how the marginal gains D(a|S) can be expressed as a specific High-Order Potential (HOP) d(y|S) in the factor graph over y. These HOPs are known to be efficiently optimizable, and hence we can solve the augmentation step efficiently. Table 1 summarizes these connections.

Diversity and Parsimony. If the groups G_i are overlapping, some y can belong to many groups simultaneously. While such a y may offer an immediate large gain in diversity, in many applications it is more natural to seek a small list of complementary labelings rather than having all labels occur in the same y. For instance, in image segmentation with groups defined by label presence (Sec. 3.1), natural scenes are unlikely to contain many labels at the same time. Instead, the labels should be spread across the selected labelings y ∈ S. 
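For unstructured (explicitly enumerable) items, the group-coverage diversities of Eqs. (4)–(7) amount to a few lines of code. The sketch below assumes a task-specific helper `groups_of(a)` (hypothetical name) that returns the ids of the groups containing item a, e.g. the set of labels used in a segmentation.

```python
from collections import Counter

def diversity(S, groups_of, h=lambda c: min(1, c)):
    """D(S) = sum_i h(|G_i ∩ S|)  (Eq. 6).
    The default h = min{1, .} recovers the count diversity of Eq. 4."""
    counts = Counter(g for a in set(S) for g in groups_of(a))
    return sum(h(c) for c in counts.values())

def coverage_gain(a, S, groups_of, h=lambda c: min(1, c)):
    """D(a|S) = sum over groups G_i containing a of
    h(1 + |G_i ∩ S|) - h(|G_i ∩ S|)  (Eq. 7).
    Counter returns 0 for groups no item of S belongs to."""
    counts = Counter(g for b in set(S) for g in groups_of(b))
    return sum(h(counts[g] + 1) - h(counts[g]) for g in groups_of(a))
```

Under the count diversity, an item whose groups are all already covered has zero gain, so greedy prefers items that hit as yet unused groups, exactly as described above.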
Hence, we include a parsimony factor p(y) that biases towards simpler labelings y. This term is a modular function and does not affect the diversity functions directly. We next outline some example instantiations of the functions (4) and (6).

Groups (G_i)        Higher Order Potentials    Where
Labels              Label Cost                 Section 3.1
Label Transitions   Co-operative Cuts          Supplement
Hamming Balls       Cardinality Potentials     Section 3.2

Table 1: Different diversity functions and corresponding HOPs.

3.1 Diversity of Labels

For the first example, let G_ℓ be the set of all labelings y containing the label ℓ, i.e. y ∈ G_ℓ if and only if y_j = ℓ for some j ∈ [n]. Such a diversity function arises in multi-class image segmentation – if the highest scoring segmentation contains "sky" and "grass", then we would like to add complementary segmentations that contain an unused class label, say "sheep" or "cow".

Structured Representation of Marginal Gains. The marginal gain for this diversity function turns out to be a HOP called label cost [9]. It penalizes each label that occurs in a previous segmentation. Let lcount_S(ℓ) be the number of segmentations in S that contain label ℓ. 
In the simplest case of coverage diversity (4), the marginal gain provides a constant reward for every as yet unseen label ℓ:

d(y | S) = |{ℓ | y ∈ G_ℓ, S ∩ G_ℓ = ∅}| = Σ_{ℓ∈y: lcount_S(ℓ)=0} 1.    (8)

For the general group coverage diversity (6), the gain becomes

d(y|S) = Σ_{ℓ∈y} [h(1 + lcount_S(ℓ)) − h(lcount_S(ℓ))].

Thus, d(y|S) rewards the presence of a label ℓ in y by an amount proportional to how rare ℓ is in the segmentations already chosen in S. The parsimony factor in this setting is p(y) = Σ_{ℓ∈y} c(ℓ). In the simplest case, c(ℓ) = −1, i.e. we are charged a constant for every label used in y.

With this type of diversity (and parsimony terms), the greedy augmentation step is equivalent to performing MAP inference in a factor graph augmented with label reward HOPs: argmax_y r(y) + λ(d(y | S) + p(y)). Delong et al. [9] show how to perform approximate MAP inference with such label costs via an extension to the standard α-expansion [3] algorithm.

Label Transitions. Label diversity can be extended to reward not just the presence of previously unseen labels, but also the presence of previously unseen label transitions (e.g., a person in front of a car or a person in front of a house). Formally, we define one group G_{ℓ,ℓ′} per label pair (ℓ, ℓ′), and y ∈ G_{ℓ,ℓ′} if it contains two adjacent variables y_i, y_j with labels y_i = ℓ, y_j = ℓ′. 
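The label-cost gain and parsimony term can be written directly in terms of lcount_S. This is a minimal unstructured sketch (labelings as integer sequences, names illustrative); the paper instead folds these terms into the factor graph as label-cost HOPs and runs MAP inference.

```python
def label_cost_gain(y, S, h=lambda c: min(1, c)):
    """d(y|S) = sum_{l in y} [h(1 + lcount_S(l)) - h(lcount_S(l))].
    With the default h = min{1, .} this is Eq. (8): reward +1 for
    every label of y that no segmentation in S has used yet."""
    lcount = {}
    for prev in S:
        for l in set(prev):          # each previous labeling counts once per label
            lcount[l] = lcount.get(l, 0) + 1
    return sum(h(lcount.get(l, 0) + 1) - h(lcount.get(l, 0)) for l in set(y))

def parsimony(y, c=lambda l: -1.0):
    """p(y) = sum_{l in y} c(l); c = -1 charges a constant per used label."""
    return sum(c(l) for l in set(y))
```

The greedy score for a candidate labeling is then r(y) + λ·(label_cost_gain(y, S) + parsimony(y)), mirroring the augmented objective above.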
This diversity function rewards the presence of a label pair (ℓ, ℓ′) by an amount proportional to how rare this pair is in the segmentations that are part of S. For such functions, the marginal gain d(y|S) becomes a HOP called cooperative cuts [16]. The inference algorithm in [19] gives a fully polynomial-time approximation scheme for any nondecreasing, nonnegative h, and the exact gain maximizer for the count function h(y) = min{1, y}. Further details may be found in the supplement.

3.2 Diversity via Hamming Balls

The label diversity function simply rewarded the presence of a label ℓ, irrespective of which or how many variables y_i were assigned that label. The next diversity function rewards a large Hamming distance Ham(y, y′) = Σ_{i=1}^n [[y_i ≠ y′_i]] between configurations (where [[·]] is the Iverson bracket). Let B_k(y) denote the k-radius Hamming ball centered at y, i.e. B_k(y) = {y′ | Ham(y′, y) ≤ k}. The previous section constructed one group per label ℓ. Now, we construct one group G_y for each configuration y, which is the k-radius Hamming ball centered at y, i.e. G_y = B_k(y).

Structured Representation of Marginal Gains. For this diversity, the marginal gain d(y|S) becomes a HOP called a cardinality potential [30]. For count group coverage, this becomes

d(y|S) = |{y′ | G_{y′} ∩ (S ∪ y) ≠ ∅}| − |{y′ | G_{y′} ∩ S ≠ ∅}|    (9a)
       = |∪_{y′∈S∪y} B_k(y′)| − |∪_{y′∈S} B_k(y′)|,    (9b)

i.e., the marginal gain of adding y is the number of new configurations y′ covered by the Hamming ball centered at y. Since the size of the intersection of B_k(y) with a union of Hamming balls does not have a straightforward structured representation, we maximize a lower bound on d(y|S) instead:

d(y | S) = |B_k(y)| − |B_k(y) ∩ [∪_{y′∈S} B_k(y′)]| ≥ d_lb(y | S) ≡ |B_k(y)| − Σ_{y′∈S} |B_k(y) ∩ B_k(y′)|.    (10)

This lower bound d_lb(y|S) overcounts the intersection in Eqn. (9b) by summing the intersections with each B_k(y′) separately. We can also interpret this lower bound as clipping the series arising from the inclusion-exclusion principle to the first-order terms. Importantly, (10) depends on y only via its Hamming distance to each y′. This is a cardinality potential that depends only on the number of variables y_i assigned to a particular label. Specifically, ignoring constant terms, the lower bound can be written as a summation of cardinality factors (one for each previous solution y′ ∈ S): d_lb(y|S) = Σ_{y′∈S} θ_{y′}(y), where θ_{y′}(y) = b/|S| − I_{y′}(y), b is a constant (the size of a k-radius Hamming ball), and I_{y′}(y) is the number of points in the intersection of the k-radius Hamming balls centered at y′ and y.

With this approximation, the greedy step means performing MAP inference in a factor graph augmented with cardinality potentials: argmax_y r(y) + λ d_lb(y|S). This may be solved via message passing, and all outgoing messages from cardinality factors can be computed in O(n log n) time [30]. While this algorithm does not offer any approximation guarantees, it performs well in practice. A subtle point to note is that d_lb(y|S) is always decreasing w.r.t. |S| but may become negative due to over-counting. We can fix this by clamping d_lb(y|S) to be nonnegative, but in our experiments this was unnecessary – the greedy algorithm never chose a set where d_lb(y|S) was negative.

Comparison to DivMBest. The greedy algorithm for Hamming diversity is similar in spirit to the recent work of Batra et al. [2], who also proposed a greedy algorithm (DivMBest) for finding diverse MAP solutions in graphical models. They did not provide any justification for greedy, and our formulation sheds some light on their work. Similar to our approach, at each greedy step, DivMBest involves maximizing a diversity-augmented score: argmax_y r(y) + λ Σ_{y′∈S} θ_{y′}(y). However, their diversity function grows linearly with the Hamming distance, θ_{y′}(y) = Ham(y′, y) = Σ_{i=1}^n [[y′_i ≠ y_i]]. Linear diversity rewards are not robust, and tend to over-reward diversity. Our formulation uses a robust diversity function θ_{y′}(y) = b/|S| − I_{y′}(y) that saturates as y moves far away from y′.

In our experiments, we make the saturation behavior smoothly tunable via a parameter γ: I_{y′}(y) = e^{−γ Ham(y′, y)}. 
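A sketch of the smoothed cardinality-based diversity just defined, θ_{y′}(y) = b/|S| − I_{y′}(y) with the surrogate I_{y′}(y) = exp(−γ Ham(y′, y)); the value of `b` (standing in for the constant ball size |B_k(y)|) and all other settings are illustrative.

```python
import math

def ham(y1, y2):
    """Hamming distance between two equal-length labelings."""
    return sum(a != b for a, b in zip(y1, y2))

def dlb_smoothed(y, S, gamma=1.0, b=1.0):
    """d_lb(y|S) = sum_{y' in S} (b/|S| - exp(-gamma * Ham(y', y))).
    The reward saturates as y moves far from every previous y',
    unlike DivMBest's linearly growing Hamming reward."""
    if not S:
        return 0.0
    return sum(b / len(S) - math.exp(-gamma * ham(yp, y)) for yp in S)
```

Note the diminishing increments: moving from Hamming distance 2 to 3 from a previous solution adds far less diversity reward than moving from distance 0 to 1, which is exactly the saturation behavior γ tunes.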
A larger γ corresponds to Hamming balls of smaller radius, and can be set to optimize performance on validation data. We found this to work better than directly tuning the radius k.

4 Experiments

We apply our greedy maximization algorithms to two image segmentation problems: (1) interactive binary segmentation (object cutout) (Section 4.1); (2) category-level object segmentation on the PASCAL VOC 2012 dataset [11] (Section 4.2). We compare all methods by their respective oracle accuracies, i.e. the accuracy of the most accurate segmentation in the set of M diverse segmentations returned by that method. For a small value of M ≈ 5 to 10, a high oracle accuracy indicates that the algorithm has achieved high recall and has identified a good pool of candidate solutions for further processing in a cascaded pipeline. In both experiments, the label "background" is typically expected to appear somewhere in the image, and thus does not play a role in the label cost/transition diversity functions. Furthermore, in binary segmentation there is only one non-background label. Thus, we report results with Hamming diversity only (label cost and label transition diversities are not applicable). For the multi-class segmentation experiments, we report experiments with all three.

Baselines. We compare our proposed methods against DivMBest [2], which greedily produces diverse segmentations by explicitly adding a linear Hamming distance term to the factor graph. Each Hamming term is decomposable along the variables y_i and simply modifies the node potentials: θ̃(y_i) = θ(y_i) + λ Σ_{y′∈S} [[y_i ≠ y′_i]]. DivMBest has been shown to outperform techniques such as M-Best-MAP [34, 1], which produce high scoring solutions without a focus on diversity, and sampling-based techniques, which produce diverse solutions without a focus on the relevance term [2]. Hence, we do not include those methods here. We also report results for combining different diversity functions via two operators: (⊗), where we generate the top M/k solutions for each of k diversity functions and then concatenate these lists; and (⊕), where we linearly combine diversity functions (with coefficients chosen by k-D grid search) and generate M solutions using the combined diversity.

Label Cost (LC)            MAP     M=5     M=15
  √(·)                     42.35   45.43   45.58
  min{1, ·}                42.35   45.72   50.01
  log(1 + ·)               42.35   46.28   50.39

Label Transition (LT)      MAP     M=5     M=15
  √(·)                     42.35   44.26   44.78
  min{1, ·}                42.35   45.43   46.21
  log(1 + ·)               42.35   45.92   46.89

Hamming Ball (HB)          MAP     M=5     M=15
  DivMBest                 43.43   51.21   52.90
  HB                       43.43   51.71   55.32

⊗ Combined Diversity               M=15    M=16
  HB ⊗ LC ⊗ LT                     56.97   -
  DivMBest ⊗ HB ⊗ LC ⊗ LT          -       57.39

⊕ Combined Diversity               M=15
  DivMBest ⊕ HB                    55.89
  DivMBest ⊕ LC ⊕ LT               53.47

Table 2: PASCAL VOC 2012 val oracle accuracies for different diversity functions.

4.1 Interactive Segmentation

In interactive foreground-background segmentation, the user provides partial labels via scribbles. One way to minimize interactions is for the system to provide a set of candidate segmentations for the user to choose from. We replicate the experimental setup of [2], who curated 100 images from the PASCAL VOC 2012 dataset, and manually provided scribbles on objects contained in them. For each image, the relevance model r(y) is a 2-label pairwise CRF, with a node term for each superpixel in the image and an edge term for each adjacent pair of superpixels. At each superpixel, we extract colour and texture features. We train a Transductive SVM from the partial supervision provided by the user scribbles. 
The node potentials are derived from the scores of these TSVMs. The edge potentials are contrast-sensitive Potts. Fifty of the images were used for tuning the diversity parameters λ, γ, and the other 50 for reporting oracle accuracies. The 2-label contrast-sensitive Potts model results in a supermodular relevance function r(y), which can be efficiently maximized via graph cuts [20]. The Hamming ball diversity d_lb(y|S) is a collection of cardinality factors, which we optimize with the Cyborg implementation [30].

Results. For each of the 50 test images in our dataset we generated the single best y1 and 5 additional solutions {y2, ..., y6} using each method. Table 3 shows the average oracle accuracies for DivMBest, Hamming ball diversity, and their two combinations. We can see that the combinations slightly outperform both approaches.

                             MAP     M=2     M=6
DivMBest                     91.57   93.16   95.02
Hamming Ball                 91.57   93.95   94.86
DivMBest ⊗ Hamming Ball      -       -       95.16
DivMBest ⊕ Hamming Ball      -       -       95.14

Table 3: Interactive segmentation: oracle pixel accuracies averaged over 50 test images.

4.2 Category-level Segmentation

In category-level object segmentation, we label each pixel with one of 20 object categories or background. We construct a multi-label pairwise CRF on superpixels. Our node potentials are outputs of category-specific regressors trained by [6], and our edge potentials are multi-label Potts. Inference in the presence of diversity terms is performed with the implementations of Delong et al. [9] for label costs, Tarlow et al. [30] for Hamming ball diversity, and Boykov et al. [3] for label transitions.

Figure 3: Qualitative Results: each row shows the original image, ground-truth segmentation (GT) from PASCAL, the single-best segmentation y1, and the oracle segmentation from the M = 15 segmentations (excluding y1) for different definitions of diversity. Hamming typically performs the best. 
In certain situations (row 3), label transitions help, since the single-best segmentation y1 included a rare pair of labels (a dog-cat boundary).

Results. We evaluate all methods on the PASCAL VOC 2012 data [11], consisting of train, val and test partitions with about 1450 images each. We train the regressors of [6] on train, and report oracle accuracies of the different methods on val (we cannot report oracle results on test since those annotations are not publicly available). Diversity parameters (γ, λ) are chosen by performing cross-validation on val. The standard PASCAL accuracy is the corpus-level intersection-over-union measure, averaged over all categories. For both label cost and label transition, we try 3 different concave functions h(·) = min{1, ·}, √(·), and log(1 + ·). Table 2 shows the results.³ Hamming ball diversity performs the best, followed by DivMBest; label costs and label transitions are worse here. We found that, while worst on average, label transition diversity helps in an interesting scenario – when the first best segmentation y1 includes a pair of rare or mutually confusing labels (say dog-cat). Fig. 3 shows an example, and more illustrations are provided in the supplement. In these cases, searching for a different label transition produces a better segmentation. Finally, we note that lists produced with combined diversity significantly outperform any single method (including DivMBest).

5 Discussion and Conclusion
In this paper, we study greedy algorithms for maximizing scoring functions that promote diverse sets of combinatorial configurations.
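As context for the discussion that follows, the classical greedy scheme these results build on can be sketched in a few lines over a small explicit ground set. The coverage function below is a standard stand-in monotone submodular objective, not the paper's F; the sets and sizes are made up for illustration. The paper's contribution is making the argmax step tractable when V is exponentially large, via HOP inference:

```python
import itertools

def coverage(sets_, S):
    """Monotone submodular set-cover objective: # of ground elements covered."""
    return len(set().union(*(sets_[i] for i in S))) if S else 0

def greedy(sets_, M):
    """Greedy: repeatedly add the item with the largest marginal gain."""
    S = []
    for _ in range(M):
        gain = lambda i: coverage(sets_, S + [i]) - coverage(sets_, S)
        S.append(max((i for i in range(len(sets_)) if i not in S), key=gain))
    return S

# hypothetical toy ground set of |V| = 5 items
sets_ = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}, {2, 5}]
M = 2
S_greedy = greedy(sets_, M)
opt = max(coverage(sets_, list(c))
          for c in itertools.combinations(range(len(sets_)), M))
# Nemhauser et al. [27]: greedy achieves at least a (1 - 1/e) fraction of optimal
assert coverage(sets_, S_greedy) >= (1 - 1 / 2.718281828459045) * opt
```

In the structured setting, V is the set of all labelings, so the `max` over items above cannot be enumerated; the reduction to HOP inference replaces exactly that line.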
This problem arises naturally in domains such as Computer Vision, Natural Language Processing, or Computational Biology, where we want to search for a set of diverse high-quality solutions in a structured output space.
The diversity functions we propose are monotone submodular functions by construction. Thus, if r(y) + p(y) ≥ 0 for all y, then the entire scoring function F is monotone submodular. We showed that r(y) can simply be learned to be positive. The greedy algorithm for maximizing monotone submodular functions has proved useful in moderately-sized unstructured spaces. To the best of our knowledge, this is the first generalization to exponentially large structured output spaces. In particular, our contribution lies in reducing the greedy augmentation step to inference with structured, efficiently solvable HOPs. This insight makes new connections between submodular optimization and work on inference in graphical models. We now address some questions.
Can we sample? One question that may be posed is how random sampling would perform for large ground sets V. Unfortunately, the expected value of a random sample of M elements can be much worse than the optimal value F(S*), especially if the ground-set size N = |V| is large. Lemma 1 is proved in the supplement.
Lemma 1. Let S ⊆ V be a sample of size M taken uniformly at random. There exist monotone submodular functions for which E[F(S)] ≤ (M/N) max_{|S|=M} F(S).
Guarantees? If F is nonnegative, monotone submodular, then using an exact HOP inference algorithm will clearly result in an approximation factor of 1 − 1/e. But many HOP inference procedures are approximate. Lemma 2 formalizes how approximate inference affects the approximation bounds.
Lemma 2.
Let F ≥ 0 be monotone submodular. If each step of the greedy algorithm uses an approximate marginal-gain maximizer b_{t+1} with F(b_{t+1} | S_t) ≥ α max_{a∈V} F(a | S_t) − ε_{t+1}, then F(S_M) ≥ (1 − 1/e^α) max_{|S|≤M} F(S) − Σ_{t=1}^{M} ε_t.
Parts of Lemma 2 have been observed in previous work [13, 29]; we show the combination in the supplement. If F is monotone but not nonnegative, then Lemma 2 can be extended to a relative error bound (F(S_M) − F_min)/(F(S*) − F_min) ≥ (1 − 1/e^α) − Σ_t ε_t/(F(S*) − F_min), which refers to F_min = min_S F(S) and the optimal solution S*. While stating these results, we add that further additive approximation losses occur if the approximation bound for inference is computed on a shifted or reflected function (positive scores vs. positive energies). We pose theoretical improvements as an open question for future work. That said, our experiments convincingly show that the algorithms perform very well in practice, even when there are no guarantees (as with Hamming ball diversity).
Generalization. In addition to the three specific examples in Section 3, our constructions generalize to the broad HOP class of upper-envelope potentials [18]. The details are provided in the supplement.
Acknowledgements. We thank Xiao Lin for his help. The majority of this work was done while AP was an intern at Virginia Tech. AP and DB were partially supported by the National Science Foundation under Grants No. IIS-1353694 and IIS-1350553, the Army Research Office YIP Award W911NF-14-1-0180, and the Office of Naval Research Award N00014-14-1-0679, awarded to DB. SJ was supported by gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Apple, C3Energy, Cisco, Cloudera, EMC, Ericsson, Facebook, GameOnTalis, Guavus, HP, Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk, Virdata, VMware, WANdisco, and Yahoo!.
References
[1] D. Batra. An efficient message-passing algorithm for the M-best MAP problem. In UAI, 2012.
³ MAP accuracies in Table 2 differ because two different approximate MAP solvers are used: label cost/label transition use alpha-expansion, while Hamming ball/DivMBest use message-passing.

[2] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse M-best solutions in Markov random fields. In ECCV, 2012.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. PAMI, 23(11):1222–1239, 2001.
[4] N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. A tight (1/2) linear-time approximation to unconstrained submodular maximization. In FOCS, 2012.
[5] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, pages 335–336, 1998.
[6] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, pages 430–443, 2012.
[7] C. Chen, V. Kolmogorov, Y. Zhu, D. Metaxas, and C. H. Lampert. Computing the M most probable modes of a graphical model. In AISTATS, 2013.
[8] C. Chen, H. Liu, D. Metaxas, and T. Zhao. Mode estimation for high dimensional discrete tree graphical models. In NIPS, 2014.
[9] A. Delong, A. Osokin, H. N. Isack, and Y. Boykov. Fast approximate energy minimization with label costs. In CVPR, pages 2173–2180, 2010.
[10] D. Dey, T. Liu, M. Hebert, and J. A. Bagnell. Contextual sequence prediction with application to control library optimization. In Robotics: Science and Systems (RSS), 2012.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012).
[12] U. Feige, V. S. Mirrokni, and J. Vondrak. Maximizing non-monotone submodular functions. In FOCS, 2007.
[13] P. Goundan and A. Schulz. Revisiting the greedy approach to submodular set function maximization. Manuscript, 2009.
[14] A. Guzman-Rivera, D. Batra, and P. Kohli. Multiple choice learning: Learning to produce multiple structured outputs. In NIPS, 2012.
[15] A. Guzman-Rivera, P. Kohli, D. Batra, and R. Rutenbar. Efficiently enforcing diversity in multi-output structured prediction. In AISTATS, 2014.
[16] S. Jegelka and J. Bilmes. Submodularity beyond submodular energies: Coupling edges in graph cuts. In CVPR, 2011.
[17] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003.
[18] P. Kohli and M. P. Kumar. Energy minimization for linear envelope MRFs. In CVPR, 2010.
[19] P. Kohli, A. Osokin, and S. Jegelka. A principled deep random field model for image segmentation. In CVPR, 2013.
[20] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? PAMI, 26(2):147–159, 2004.
[21] A. Krause and S. Jegelka. Submodularity in machine learning: New directions. ICML Tutorial, 2013.
[22] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.
[23] A. Kulesza and B. Taskar. Structured determinantal point processes. In NIPS, 2010.
[24] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[25] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.
[26] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In ACL, 2011.
[27] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.
[28] D. Park and D. Ramanan. N-best maximal decoders for part models. In ICCV, 2011.
[29] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008.
[30] D. Tarlow, I. E. Givoni, and R. S. Zemel. HOP-MAP: Efficient message passing with high order potentials. In AISTATS, pages 812–819, 2010.
[31] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[32] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, 2005.
[33] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[34] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In NIPS, 2003.