{"title": "SubmodBoxes: Near-Optimal Search for a Set of Diverse Object Proposals", "book": "Advances in Neural Information Processing Systems", "page_first": 1378, "page_last": 1386, "abstract": "This paper formulates the search for a set of bounding boxes (as needed in object proposal generation) as a monotone submodular maximization problem over the space of all possible bounding boxes in an image. Since the number of possible bounding boxes in an image is very large $O(#pixels^2)$, even a single linear scan to perform the greedy augmentation for submodular maximization is intractable. Thus, we formulate the greedy augmentation step as a Branch-and-Bound scheme. In order to speed up repeated application of B\\&B, we propose a novel generalization of Minoux\u2019s \u2018lazy greedy\u2019 algorithm to the B\\&B tree. Theoretically, our proposed formulation provides a new understanding to the problem, and contains classic heuristic approaches such as Sliding Window+Non-Maximal Suppression (NMS) and and Efficient Subwindow Search (ESS) as special cases. Empirically, we show that our approach leads to a state-of-art performance on object proposal generation via a novel diversity measure.", "full_text": "SubmodBoxes: Near-Optimal Search for a Set of\n\nDiverse Object Proposals\n\nQing Sun\n\nVirginia Tech\n\nDhruv Batra\nVirginia Tech\n\nsunqing@vt.edu\n\nhttps://mlp.ece.vt.edu/\n\nAbstract\n\nThis paper formulates the search for a set of bounding boxes (as needed in object\nproposal generation) as a monotone submodular maximization problem over the\nspace of all possible bounding boxes in an image. 
Since the number of possible bounding boxes in an image is very large, O(#pixels^2), even a single linear scan to perform the greedy augmentation for submodular maximization is intractable. Thus, we formulate the greedy augmentation step as a Branch-and-Bound scheme. In order to speed up repeated application of B&B, we propose a novel generalization of Minoux’s ‘lazy greedy’ algorithm to the B&B tree. Theoretically, our proposed formulation provides a new understanding of the problem, and contains classic heuristic approaches such as Sliding Window + Non-Maximal Suppression (NMS) and Efficient Subwindow Search (ESS) as special cases. Empirically, we show that our approach leads to state-of-the-art performance on object proposal generation via a novel diversity measure.

1 Introduction

A number of problems in Computer Vision and Machine Learning involve searching for a set of bounding boxes or rectangular windows. For instance, in object detection [9, 16, 17, 19, 34, 36, 37], the goal is to output a set of bounding boxes localizing all instances of a particular object category. In object proposal generation [2, 7, 39, 41], the goal is to output a set of candidate bounding boxes that may potentially contain an object (of any category). Other scenarios include face detection, multi-object tracking and weakly supervised learning [10].

Classical Approach: Enumeration + Diverse Subset Selection.
In the context of object detection, the classical paradigm for searching for a set of bounding boxes used to be:

• Sliding Window [9, 16, 40]: i.e., enumeration over all windows in an image with some level of sub-sampling, followed by

• Non-Maximal Suppression (NMS): i.e., picking a spatially diverse set of windows by suppressing windows that are too close or overlapping.

As several previous works [3, 26, 40] have recognized, the problem with this approach is inefficiency – the number of possible bounding boxes or rectangular subwindows in an image is O(#pixels^2). Even a low-resolution (320 × 240) image contains more than one billion rectangular windows [26]! As a result, modern object detection pipelines [17, 19, 36] often rely on object proposals as a pre-processing step to reduce the number of candidate object locations to a few hundred or thousand (rather than billions).

Interestingly, this migration to object proposals has simply pushed the problem (of searching for a set of bounding boxes) upstream. Specifically, a number of object proposal techniques [8, 32, 41] involve the same enumeration + NMS approach – except that they typically use cheaper features, so that they can serve as a fast proposal generation step.

Goal. The goal of this paper is to formally study the search for a set of bounding boxes as an optimization problem. Clearly, enumeration + post-processing for diversity (via NMS) is one widely-used heuristic approach. Our goal is to formulate a formal optimization objective and propose an efficient algorithm, ideally with guarantees on optimization performance.

Challenge.
The key challenge is the exponentially large search space – the number of possible M-sized sets of bounding boxes is $\binom{O(\#pixels^2)}{M} = O(\#pixels^{2M})$ (assuming $M \leq \#pixels^2/2$).

Figure 1: Overview of our formulation: SubmodBoxes. We formulate the selection of a set of boxes as a constrained submodular maximization problem. The objective and marginal gains consist of two parts: relevance and diversity. Figure (b) shows two candidate windows ya and yb. Relevance is the sum of edge strength over all edge groups (black curves) wholly enclosed in the window. Figure (c) shows the diversity term. The marginal gain in diversity due to a new window (ya or yb) is the ability of the new window to cover the reference boxes that are currently not well-covered by the already chosen set Y = {y1, y2}. In this case, we can see that ya covers a new reference box b1. Thus, the marginal gain in diversity of ya will be larger than that of yb.

Overview of our formulation: SubmodBoxes. Let Y denote the set of all possible bounding boxes or rectangular subwindows in an image. This is a structured output space [4, 21, 38], with the size of this set growing quadratically with the size of the input image, |Y| = O(#pixels^2).

We formulate the selection of a set of boxes as a search problem on the power set 2^Y.
Specifically, given a budget of M windows, we search for a set Y of windows that are both relevant (e.g., have a high likelihood of containing an object) and diverse (to cover as many object instances as possible):

$$\operatorname*{argmax}_{Y \in 2^{\mathcal{Y}}} \; F(Y) = R(Y) + \lambda D(Y) \quad \text{s.t.} \quad |Y| \leq M, \qquad (1)$$

where the argmax denotes a search over the power set, F(Y) is the objective, R(Y) the relevance term, D(Y) the diversity term, λ a trade-off parameter, and |Y| ≤ M the budget constraint.

Crucially, when the objective function F : 2^Y → R is monotone and submodular, a simple greedy algorithm (that iteratively adds the window with the largest marginal gain [24]) achieves a near-optimal approximation factor of (1 − 1/e) [24, 30].

Unfortunately, although conceptually simple, this greedy augmentation step requires an enumeration over the space of all windows Y, and thus a naïve implementation is intractable.

In this work, we show that for a broad class of relevance and diversity functions, this greedy augmentation step may be efficiently formulated as a Branch-and-Bound (B&B) step [12, 26], with easily computable upper bounds. This enables an efficient implementation of greedy, with significantly fewer evaluations than a linear scan over Y.

Finally, in order to speed up repeated application of B&B across iterations of the greedy algorithm, we present a novel generalization of Minoux’s ‘lazy greedy’ algorithm [29] to the B&B tree, where different branches are explored in a lazy manner in each iteration.

We apply our proposed technique, SubmodBoxes, to the task of generating object proposals [2, 7, 39, 41] on the PASCAL VOC 2007 [13], PASCAL VOC 2012 [14], and MS COCO [28] datasets. Our results show that our approach outperforms all baselines.

Contributions. This paper makes the following contributions:

1.
We formulate the search for a set of bounding boxes or subwindows as the constrained maximization of a monotone submodular function. To the best of our knowledge, despite the popularity of object recognition and object proposal generation, this is the first such formal optimization treatment of the problem.

2. Our proposed formulation contains existing heuristics as special cases. Specifically, Sliding Window + NMS can be viewed as an instantiation of our approach under a specific definition of the diversity function D(·).

3. Our work can be viewed as a generalization of the ‘Efficient Subwindow Search (ESS)’ of Lampert et al. [26], who proposed a B&B scheme for finding the single best bounding box in an image. Their extension to detecting multiple objects consisted of a heuristic for ‘suppressing’ features extracted from the selected bounding box and re-running the procedure. We show that this heuristic is a special case of our formulation under a specific diversity function, thus providing theoretical justification for their intuitive heuristic.

4. To the best of our knowledge, our work presents the first generalization of Minoux’s ‘lazy greedy’ algorithm [29] to structured-output spaces (the space of bounding boxes).

5. Finally, our experimental contribution is a novel diversity measure which leads to state-of-the-art performance on the task of generating object proposals.

2 Related Work

Our work is related to a few different themes of research in Computer Vision and Machine Learning.

Submodular Maximization and Diversity. The task of searching for a diverse high-quality subset of items from a ground set has been well studied in a number of application domains [6, 11, 22, 25, 27, 31], and across these domains submodularity has emerged as a fundamental property of set functions for measuring the diversity of a subset of items.
Most previous work has focused on submodular maximization over unstructured spaces, where the ground set is efficiently enumerable. Our work is closest in spirit to Prasad et al. [31], who studied submodular maximization on structured output spaces, i.e., where each item in the ground set is itself a structured object (such as a segmentation of an image). Unlike [31], our ground set Y is not exponentially large, only ‘quadratically’ large. However, enumeration over the ground set for the greedy-augmentation step is still infeasible, and thus we use B&B. Such structured output spaces and greedy-augmentation oracles were not explored in [31].

Bounding Box Search in Object Detection and Object Proposals. As we mention in the introduction, the search for a set of bounding boxes via heuristics such as Sliding Window + NMS used to be the dominant paradigm in object recognition [9, 16, 40]. Modern pipelines have shifted that search step to object proposal algorithms [17, 19, 36]. A comparison and overview of object proposals may be found in [20]. Zitnick et al. [41] generate candidate bounding boxes via Sliding Window + NMS based on an “objectness” score, which is a function of the number of contours wholly enclosed by a bounding box. We use this objectness score as our relevance term, thus making SubmodBoxes directly comparable to NMS. Another closely related work is [18], which presents an ‘active search’ strategy for re-ranking Selective Search [39] object proposals based on contextual cues. Unlike this work, our formulation is not restricted to any pre-selected set of windows. We search over the entire power set 2^Y, and may generate any possible set of windows (up to the convergence tolerance in B&B).

Branch-and-Bound. One key building block of our work is the ‘Efficient Subwindow Search (ESS)’ B&B scheme of Lampert et al. [26]. ESS was originally proposed for single-instance object detection.
Their extension to detecting multiple objects consisted of a heuristic for ‘suppressing’ features extracted from the selected bounding box and re-running the procedure. In this work, we extend and generalize ESS in multiple ways. First, we show that the relevance (objectness scores) and diversity functions used in the object proposal literature are amenable to upper-bounding and thus to B&B optimization. We also show that the ‘suppression’ heuristic used by [26] is a special case of our formulation under a specific diversity function, thus providing theoretical justification for their intuitive heuristic. Finally, [3] also proposed the use of B&B for NMS in object detection. Unfortunately, as we explain later in the paper, the NMS objective is submodular but not monotone, and the classical greedy algorithm does not have approximation guarantees in this setting. In contrast, our work presents a general framework for bounding-box subset selection based on monotone submodular maximization.

3 SubmodBoxes: Formulation and Approach

We begin by establishing the notation used in the paper.

Preliminaries and Notation. For an input image x, let Yx denote the set of all possible bounding boxes or rectangular subwindows in this image. For simplicity, we drop the explicit dependence on x, and just use Y. Uppercase letters refer to set functions F(·), R(·), D(·), and lowercase letters refer to functions over individual items f(y), r(y).

A set function F : 2^Y → R is submodular if its marginal gains F(b | S) ≡ F(S ∪ b) − F(S) are decreasing, i.e., F(b | S) ≥ F(b | T) for all sets S ⊆ T ⊆ Y and items b ∉ T. The function F is called monotone if adding an item to a set does not hurt, i.e., F(S) ≤ F(T), ∀S ⊆ T.

Constrained Submodular Maximization.
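As a toy numerical illustration of these two definitions (ours, not from the paper; the coverage function and variable names are hypothetical), both properties can be checked directly:

```python
def F(S):
    """Toy monotone submodular function: coverage, |union of the sets in S|."""
    out = set()
    for s in S:
        out |= s
    return len(out)

def marginal_gain(F, b, S):
    """F(b | S) = F(S + b) - F(S)."""
    return F(list(S) + [b]) - F(S)

a, b, c = frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})
small, large = [a], [a, b]                 # small ⊆ large
g_small = marginal_gain(F, c, small)       # c contributes {3, 4}: gain 2
g_large = marginal_gain(F, c, large)       # c contributes only {4}: gain 1
assert g_small >= g_large                  # submodularity: gains diminish
assert F(small) <= F(large)                # monotonicity: adding never hurts
```

The same diminishing-gains check fails for the NMS diversity function discussed later, whose gains can jump to −∞.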
From the classical result of Nemhauser [30], it is known that cardinality-constrained maximization of a monotone submodular F can be performed near-optimally via a greedy algorithm. We start out with an empty set Y^0 = ∅, and iteratively add the next ‘best’ item with the largest marginal gain over the chosen set:

$$y^t = \operatorname*{argmax}_{y \in \mathcal{Y}} F(y \mid Y^{t-1}), \quad \text{where} \quad Y^t = Y^{t-1} \cup y^t. \qquad (2)$$

The score of the final solution Y^M is within a factor of (1 − 1/e) of the optimal solution. The computational bottleneck is that in each iteration, we must find the item with the largest marginal gain. In our case, Y is the space of all rectangular windows in an image, and exhaustive enumeration is intractable. Instead of exploring subsampling as is done in Sliding Window methods, we will formulate this greedy augmentation step as an optimization problem solved with B&B.

Figure 2: Priority queue in the B&B scheme. Each vertex in the tree represents a set of windows. Blue rectangles denote the largest and the smallest window in the set. The gray region denotes the rectangle set Yv. In each case, the priority queue consists of all leaves in the B&B tree ranked by the upper bound Uv. Left: vertex v is split along the right coordinate interval into equal halves v1 and v2. Middle: the highest-priority vertex v1 in Q1 is further split along the bottom coordinate into v3 and v4. Right: the highest-priority vertex v4 in Q2 is split along the right coordinate into v5 and v6. This procedure is repeated until the highest-priority vertex in the queue is a single rectangle.

Sets vs. Lists. For pedagogical reasons, our problem setup is motivated with the language of sets (Y, 2^Y) and subsets (Y). In practice, our work falls under submodular list prediction [11, 33, 35]. The generalization from sets to lists allows reasoning about an ordering of the items chosen and (potentially) repeated entries in the list.
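The greedy rule in Eq. (2) can be sketched as follows — a minimal Python implementation with our own naming, over a toy enumerable ground set. In the paper’s setting the ground set of windows cannot be enumerated, which is exactly why this argmax is later replaced by Branch-and-Bound:

```python
def greedy(ground_set, F, M):
    """Iteratively add the item with the largest marginal gain F(y | Y^{t-1})."""
    Y = []
    for _ in range(M):
        y_best = max(ground_set, key=lambda y: F(Y + [y]) - F(Y))
        Y.append(y_best)                      # Y^t = Y^{t-1} ∪ {y^t}
    return Y

# Toy monotone submodular F: coverage of integer sets (hypothetical data).
def coverage(S):
    return len(set().union(*S)) if S else 0

items = [frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({4, 5, 6})]
chosen = greedy(items, coverage, M=2)
assert coverage(chosen) == 6   # greedy picks the two non-overlapping sets
```

Each iteration scans the whole ground set once; with |Y| = O(#pixels^2) windows per scan and M scans, this is the cost that the B&B formulation below avoids.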
Our final solution Y^M is an (ordered) list, not an (unordered) set. All guarantees of greedy remain the same under this generalization [11, 33, 35].

3.1 Parameterization of Y and Branch-and-Bound Search

In this subsection, we briefly recap the Efficient Subwindow Search (ESS) of Lampert et al. [26], which is used as a key building block in this work. The goal of [26] is to maximize a (potentially non-smooth) objective function over the space of all rectangular windows: max_{y∈Y} f(y).

A rectangular window y ∈ Y is parameterized by its top, bottom, left, and right coordinates, y = (t, b, l, r). A set of windows is represented by using an interval for each coordinate instead of a single integer, e.g., [T, B, L, R], where T = [t_low, t_high] is a range. In this parameterization, the set of all possible boxes in an (h × w)-sized image can be written as Y = [[1, h], [1, h], [1, w], [1, w]].

Branch-and-Bound over Y. ESS creates a B&B tree, where each vertex v in the tree is a rectangle set Yv with an associated upper bound on the objective function achievable in this set, i.e., max_{y∈Yv} f(y) ≤ Uv. Initially, this tree consists of a single vertex, which is the entire search space Y and (typically) a loose upper bound. ESS proceeds in a best-first manner [26]. In each iteration, the vertex/set with the highest upper bound is chosen for branching, and new upper bounds are computed on each of the two children/subsets created. In practice, this is implemented with a priority queue over the vertices/sets that are currently leaves in the tree. Fig. 2 shows an illustration of this procedure. The parent rectangle set is split along its largest coordinate interval into two equal halves, thus forming disjoint children sets. B&B explores the tree in a best-first manner until a single rectangle is identified with a score equal to its upper bound, at which point we have found a global optimum.
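The best-first B&B loop just described can be sketched as follows. This is our simplified implementation, not the authors’ code: the interval parameterization and the split-largest-interval rule follow the text, while the toy objective and its (admissible, singleton-tight) upper bound are hypothetical stand-ins for the real scoring functions:

```python
import heapq

def branch_and_bound(root, upper):
    """Best-first B&B over rectangle sets [T, B, L, R] (each a (lo, hi) interval).
    `upper(Yv)` must satisfy max_{y in Yv} f(y) <= upper(Yv), with equality
    when Yv contains a single rectangle."""
    heap = [(-upper(root), root)]
    while heap:
        neg_u, Yv = heapq.heappop(heap)
        if all(lo == hi for lo, hi in Yv):       # single rectangle: optimum found
            return tuple(lo for lo, _ in Yv), -neg_u
        # branch: split the largest coordinate interval into two equal halves
        i = max(range(4), key=lambda k: Yv[k][1] - Yv[k][0])
        lo, hi = Yv[i]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):
            child = list(Yv)
            child[i] = part
            heapq.heappush(heap, (-upper(child), child))

# Toy objective (ours): f(y) = -sum of coordinate distances to a target box.
target = (2, 5, 3, 6)

def dist(iv, v):                 # distance from an interval to a value
    lo, hi = iv
    return 0 if lo <= v <= hi else min(abs(lo - v), abs(hi - v))

def upper(Yv):                   # admissible bound, tight at singletons
    return -sum(dist(iv, t) for iv, t in zip(Yv, target))

box, score = branch_and_bound([(0, 7)] * 4, upper)
assert box == (2, 5, 3, 6) and score == 0
```

Because the bound is admissible, the first singleton popped from the queue is a global optimum, typically after exploring far fewer vertices than the 8^4 boxes in this toy space.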
In our experiments, we show results with different convergence tolerances.

Objective. In our setup, the objective (at each greedy-augmentation step) is the marginal gain of the window y w.r.t. the currently chosen list of windows Y^{t−1}, i.e., f(y) = F(y | Y^{t−1}) = R(y | Y^{t−1}) + λD(y | Y^{t−1}). In the following subsections, we describe the relevance and diversity terms in detail, and show how upper bounds can be efficiently computed over sets of windows.

3.2 Relevance Function and Upper Bound

The goal of the relevance function R(Y) is to quantify the “quality” or “relevance” of the windows chosen in Y. In our work, we define R(Y) to be a modular function aggregating the quality of all chosen windows, i.e., R(Y) = Σ_{y∈Y} r(y). Thus, the marginal gain of window y is simply its individual quality regardless of what else has already been chosen, i.e., R(y | Y^{t−1}) = r(y).

In our application of object proposal generation, we use the objectness score produced by EdgeBoxes [41] as our relevance function. The main intuition of EdgeBoxes is that the number of contours or “edge groups” wholly contained in a box is indicative of its objectness score. Thus, it first creates a grouping of edge pixels called edge groups, each associated with a real-valued edge strength si.

Abstracting away some of the domain-specific details, EdgeBoxes essentially defines the score of a box as a weighted sum of the strengths of the edge groups contained in it, normalized by the size of the box, i.e.,

$$\text{EdgeBoxesScore}(y) = \frac{\sum_{\text{edge group } i \in y} w_i s_i}{\text{size-normalization}},$$

where with a slight abuse of notation, we use ‘edge group i ∈ y’ to mean the edge groups contained in the rectangle y. These weights and size normalizations were found to improve the performance of EdgeBoxes.
In our work, we use a simplification of the EdgeBoxesScore which allows for easy computation of upper bounds:

$$r(y) = \frac{\sum_{\text{edge group } i \in y} s_i}{\text{size-normalization}}, \qquad (3)$$

i.e., we ignore the weights. One simple upper bound for a set of windows Yv can be computed by accumulating all possible positive scores and the least necessary negative scores:

$$\max_{y \in \mathcal{Y}_v} r(y) \;\leq\; \frac{\sum_{\text{edge group } i \in y_{max}} s_i \cdot [[s_i \geq 0]] \;+\; \sum_{\text{edge group } i \in y_{min}} s_i \cdot [[s_i \leq 0]]}{\text{size-normalization}(y_{min})}, \qquad (4)$$

where y_max is the largest and y_min the smallest box in the set Yv, and [[·]] is the Iverson bracket.

Consistent with the experiments in [41], we found that this simplification indeed hurts performance in the EdgeBoxes Sliding Window + NMS pipeline. However, interestingly, we found that even with this weaker relevance term, SubmodBoxes was able to outperform EdgeBoxes. Thus, the drop in performance due to a weaker relevance term was more than compensated for by the ability to perform B&B jointly on the relevance and diversity terms.

3.3 Diversity Function and Upper Bound

The goal of the diversity function D(Y) is to encourage non-redundancy in the chosen set of windows and to potentially capture different objects in the image. Before we introduce our own diversity function, we show how existing heuristics in object detection and proposal generation can be written as special cases of this formulation, under specific diversity functions.

Sliding Window + NMS. Non-Maximal Suppression (NMS) is the most popular heuristic for selecting diverse boxes in computer vision.
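A sketch of the bound in Eq. (4), with hypothetical inputs: we assume the caller supplies the strengths of edge groups that could fall inside the largest box y_max, the strengths that necessarily fall inside the smallest box y_min, and y_min’s size normalizer (dividing by the smallest box’s normalizer can only increase the ratio):

```python
def relevance_upper_bound(strengths_ymax, strengths_ymin, size_norm_ymin):
    """Eq. (4) sketch: all non-negative strengths that *could* be inside y_max,
    plus all negative strengths that *must* be inside y_min, over y_min's
    normalization.  Every r(y) for y between y_min and y_max is <= this value."""
    pos = sum(s for s in strengths_ymax if s >= 0)
    neg = sum(s for s in strengths_ymin if s <= 0)
    return (pos + neg) / size_norm_ymin

# Toy numbers (ours): strengths {2, -1, 3} reachable in y_max, {-1} forced in y_min.
bound = relevance_upper_bound([2.0, -1.0, 3.0], [-1.0], 2.0)
assert bound == 2.0            # (2.0 + 3.0 - 1.0) / 2.0
```

As Yv shrinks toward a single rectangle, y_max and y_min coincide and the bound collapses to r(y) itself, which is what lets B&B certify optimality.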
NMS is typically explained procedurally – select the highest-scoring window y1, suppress all windows that overlap with y1 by more than some threshold, select the next highest-scoring window y2, rinse and repeat.

This procedure can be explained as a special case of our formulation. Sliding Window corresponds to enumeration over Y with some level of sub-sampling (or stride), typically with a fixed aspect ratio. Each step in NMS is precisely a greedy augmentation step under the following marginal gain:

$$\operatorname*{argmax}_{y \in \mathcal{Y}_{\text{sub-sampled}}} \; r(y) + \lambda D_{NMS}(y \mid Y^{t-1}), \quad \text{where} \qquad (5a)$$

$$D_{NMS}(y \mid Y^{t-1}) = \begin{cases} 0 & \text{if } \max_{y' \in Y^{t-1}} \text{IoU}(y', y) \leq \text{NMS-threshold} \\ -\infty & \text{else.} \end{cases} \qquad (5b)$$

Intuitively, the NMS diversity function imposes an infinite penalty if a new window y overlaps with a previously chosen y′ by more than a threshold, and offers no reward for diversity beyond that. This explains the NMS procedure of suppressing overlapping windows and picking the highest-scoring one among the unsuppressed ones. Notice that this diversity function is submodular but not monotone (the marginal gains may be negative). A similar observation was made in [3]. For such non-monotone functions, greedy does not have approximation guarantees and different techniques are needed [5, 15]. This is an interesting perspective on the classical NMS heuristic.

ESS Heuristic [26]. ESS was originally proposed for single-instance object detection. Their extension to detecting multiple instances consisted of a heuristic for suppressing the features extracted from the selected bounding box and re-running the procedure.
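Written as code, the greedy-with-D_NMS view of Eq. (5) reduces to the familiar NMS procedure — a minimal sketch (our implementation; boxes are (t, b, l, r) tuples with inclusive pixel coordinates):

```python
def iou(a, b):
    """Intersection-over-union of boxes (t, b, l, r), inclusive coordinates."""
    ih = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    iw = max(0, min(a[3], b[3]) - max(a[2], b[2]) + 1)
    inter = ih * iw
    area = lambda y: (y[1] - y[0] + 1) * (y[3] - y[2] + 1)
    return inter / (area(a) + area(b) - inter)

def nms(scored_boxes, M, nms_threshold=0.5):
    """Greedy under D_NMS: a candidate whose IoU with any chosen box exceeds
    the threshold has marginal gain -inf, so it is simply skipped."""
    chosen = []
    for y, _score in sorted(scored_boxes, key=lambda p: -p[1]):
        if all(iou(y, yp) <= nms_threshold for yp in chosen):
            chosen.append(y)
            if len(chosen) == M:
                break
    return chosen

# Toy boxes (ours): the second heavily overlaps the first and is suppressed.
boxes = [((0, 9, 0, 9), 1.0), ((0, 9, 1, 10), 0.9), ((20, 29, 20, 29), 0.8)]
assert nms(boxes, M=2) == [(0, 9, 0, 9), (20, 29, 20, 29)]
```

Sorting once by score is equivalent here to repeatedly taking the argmax of r(y) + λD_NMS, since the suppressed candidates have gain −∞ regardless of λ > 0.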
Since their scoring function was linear in the features, this heuristic of suppressing features and re-running B&B can be expressed as a greedy augmentation step under the following marginal gain:

$$\operatorname*{argmax}_{y \in \mathcal{Y}} \; r(y) + \lambda D_{ESS}(y \mid Y^{t-1}), \quad \text{where} \quad D_{ESS}(y \mid Y^{t-1}) = -r\big(y \cap (y^1 \cup y^2 \ldots y^{t-1})\big), \qquad (6)$$

i.e., the ESS diversity function subtracts the score contribution coming from the intersection region. If r(·) is non-negative, it is easy to see that this diversity function is monotone and submodular – adding a new window never hurts, and since the marginal gain is the score contribution of the new regions not covered by previous windows, it is naturally diminishing. Thus, even though this heuristic was not presented as such, the authors of [26] did in fact formulate a near-optimal greedy algorithm for maximizing a monotone submodular function. Unfortunately, while r(·) is always positive in our experiments, this was not the case in the experimental setup of [26].

Our Diversity Function. Instead of hand-designing an explicit diversity function, we use a function that implicitly measures diversity in terms of coverage of a reference set of bounding boxes B. This reference set of boxes may be a uniform sub-sampling of the space of windows as done in Sliding Window methods, or may itself be the output of another object proposal method such as Selective Search [39]. Specifically, each greedy augmentation step under our formulation is given by:

$$\operatorname*{argmax}_{y \in \mathcal{Y}} \; r(y) + \lambda D_{\text{coverage}}(y \mid Y^{t-1}), \quad \text{where} \quad D_{\text{coverage}}(y \mid Y^{t-1}) = \max_{b \in B} \; \delta\text{IoU}(y, b \mid Y^{t-1}), \qquad (7a)$$

$$\delta\text{IoU}(y, b \mid Y^{t-1}) = \max\Big\{\text{IoU}(y, b) - \max_{y' \in Y^{t-1}} \text{IoU}(y', b),\; 0\Big\}. \qquad (7b)$$

Intuitively speaking, the marginal gain of a new window y under our diversity function is the largest gain in coverage of exactly one of the reference boxes.
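The marginal gain in Eq. (7) can be sketched directly — our implementation, with the same inclusive (t, b, l, r) box convention; the top-k matching variant studied in the experiments would take the k best gains instead of the single max:

```python
def iou(a, b):
    """Intersection-over-union of boxes (t, b, l, r), inclusive coordinates."""
    ih = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    iw = max(0, min(a[3], b[3]) - max(a[2], b[2]) + 1)
    inter = ih * iw
    area = lambda y: (y[1] - y[0] + 1) * (y[3] - y[2] + 1)
    return inter / (area(a) + area(b) - inter)

def coverage_gain(y, refs, chosen):
    """delta-IoU gain of Eq. (7): best improvement, over all reference boxes b,
    of IoU(y, b) beyond the coverage already achieved by the chosen set."""
    gain = 0.0
    for b in refs:
        covered = max((iou(yp, b) for yp in chosen), default=0.0)
        gain = max(gain, max(iou(y, b) - covered, 0.0))
    return gain

ref = [(0, 9, 0, 9)]
assert coverage_gain((0, 9, 0, 9), ref, chosen=[]) == 1.0      # covers b fully
assert coverage_gain((0, 9, 0, 9), ref, chosen=[(0, 9, 0, 9)]) == 0.0
```

The clipping at zero in Eq. (7b) is what makes the gain non-negative, and the outer max over already-chosen windows is what makes it shrink as Y^{t−1} grows.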
We can also formulate this diversity function as a maximum bipartite matching problem between the proposal boxes Y and the reference boxes B (in our experiments, we also study performance under top-k matches). We show in the supplement that this marginal gain is always non-negative and decreasing with larger Y^{t−1}; thus the diversity function is monotone submodular. All that remains is to compute an upper bound on this marginal gain. Ignoring constants, the key term to bound is IoU(y, b). We can upper-bound this term by computing the intersection w.r.t. the largest window in the window set, y_max, and the union w.r.t. the smallest window, y_min, i.e.,

$$\max_{y \in \mathcal{Y}_v} \text{IoU}(y, b) \;\leq\; \frac{\text{area}(y_{max} \cap b)}{\text{area}(y_{min} \cup b)}.$$

4 Speeding up Greedy with Minoux’s ‘Lazy Greedy’

In order to speed up repeated application of B&B across iterations of the greedy algorithm, we now present an application of Minoux’s ‘lazy greedy’ algorithm [29] to the B&B tree.

The key insight of classical lazy greedy is that the marginal gain F(y | Y^t) is a non-increasing function of t (due to the submodularity of F). Thus, at time t − 1, we can cache the priority queue of marginal gains F(y | Y^{t−2}) for all items. At time t, lazy greedy does not recompute all marginal gains. Rather, the item at the front of the priority queue is picked, its marginal gain is updated to F(y | Y^{t−1}), and the item is reinserted into the queue. Crucially, if the item remains at the front of the priority queue, lazy greedy can stop: we have found the item with the largest marginal gain.

Interleaving Lazy Greedy with B&B. In our work, the priority queue does not contain single items, but rather sets of windows Yv corresponding to the vertices in the B&B tree. Thus, we must interleave the lazy updates with the Branch-and-Bound steps.
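Classical lazy greedy, as just described, can be sketched as follows (our minimal implementation for an enumerable ground set; in our actual setting the queue holds window sets rather than single items, as the interleaving below explains):

```python
import heapq

def lazy_greedy(items, F, M):
    """Minoux's lazy greedy (our sketch): keep stale marginal gains in a
    max-heap; refresh only the front item, and select it as soon as its fresh
    gain still beats the best cached gain (valid since gains only shrink)."""
    Y = []
    gain = lambda y: F(Y + [y]) - F(Y)
    heap = [(-gain(y), i) for i, y in enumerate(items)]
    heapq.heapify(heap)
    while len(Y) < M and heap:
        _stale, i = heapq.heappop(heap)
        g = gain(items[i])                    # refresh this item's stale gain
        if not heap or -g <= heap[0][0]:      # still at the front: greedy choice
            Y.append(items[i])
        else:
            heapq.heappush(heap, (-g, i))     # stale: reinsert with fresh gain
    return Y

# Toy coverage objective (ours); same answer as plain greedy, fewer evaluations.
def coverage(S):
    return len(set().union(*S)) if S else 0

items = [frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({4, 5, 6})]
assert coverage(lazy_greedy(items, coverage, M=2)) == 6
```

On this toy problem lazy greedy refreshes only two gains in the second iteration instead of all three; on large ground sets the savings are typically dramatic.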
Specifically, we pick a set from the front of the queue, recompute the upper bound on its marginal gain, and reinsert the set into the priority queue. Once a set remains at the front of the priority queue after reinsertion, we have found the set with the highest upper bound. This is when we perform a B&B step, i.e., split this set into two children, compute the upper bounds on the children, and insert them into the queue.

Figure 3: Interleaving Lazy Greedy with B&B. The first few steps update upper bounds, followed by finally branching on a set. Some sets, such as v2, are never updated or split, resulting in a speed-up.

Fig. 3 illustrates how the priority queue and B&B tree are updated in this process. Suppose at the end of iteration t − 1 and the beginning of iteration t, we have the priority queue shown on the left. The first few updates involve recomputing the upper bounds on the window sets (v6, v5, v3), followed by branching on v3 because it continues to stay at the top of the queue, creating new vertices v7, v8. Notice that v2 is never explored (updated or split), resulting in a speed-up.

5 Experiments

Setup. We evaluate SubmodBoxes for object proposal generation on three datasets: PASCAL VOC 2007 [13], PASCAL VOC 2012 [14], and MS COCO [28]. The goal of the experiments is to validate our approach by testing the accuracy of the generated object proposals and the ability to handle different kinds of reference boxes, and to observe trends as we vary multiple parameters.

Figure 4: ABO vs. no. of proposals. (a) PASCAL VOC 2007; (b) PASCAL VOC 2012; (c) MS COCO.

Evaluation. To evaluate the quality of our object proposals, we use the Mean Average Best Overlap (MABO) score.
Given a set of ground-truth boxes GT_c for a class c, ABO is calculated by averaging the best IoU between each ground-truth bounding box and all object proposals:

$$\text{ABO}_c = \frac{1}{|GT_c|} \sum_{g \in GT_c} \max_{y \in Y} \text{IoU}(g, y). \qquad (8)$$

MABO is the mean ABO over all classes.

Weighing the Reference Boxes. Recall that the marginal gain of our proposed diversity function rewards covering the reference boxes with the chosen set of boxes. Instead of weighing all reference boxes equally, we found it important to weigh different reference boxes differently. The exact form of the weighting rule is provided in the supplement. In our experiments, we present results with and without such a weighting to show the impact of our proposed scheme.

5.1 Accuracy of Object Proposals

In this section, we explore the performance of our proposed method in comparison to relevant object proposal generators. For the two PASCAL datasets, we perform cross-validation on the 2510 validation images of PASCAL VOC 2007 for the best parameter λ, then report accuracies on the 4952 test images of PASCAL VOC 2007 and the 5823 validation images of PASCAL VOC 2012. The MS COCO dataset is much larger, so we randomly select a subset of 5000 training images for tuning λ, and test on the complete validation set with 40138 images.

We use the 1000 top-ranked Selective Search windows [39] as reference boxes. In a manner similar to [23], we chose a different λ_M for M = 100, 200, 400, 600, 800, 1000 proposals. We compare our approach with several baselines: 1) λ = ∞, which essentially involves re-ranking Selective Search windows by considering their ability to cover other boxes. 2) Three variants of EdgeBoxes [41] at IoU = 0.5, 0.7 and 0.9, and the corresponding three variants without affinities in (3). 3) Selective Search: compute multiple hierarchical segmentations by grouping superpixels and placing bounding boxes around them.
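A sketch of the metric in Eq. (8) — our implementation, where `iou` uses inclusive (t, b, l, r) coordinates and MABO is simply the mean of `abo` over classes:

```python
def iou(a, b):
    """Intersection-over-union of boxes (t, b, l, r), inclusive coordinates."""
    ih = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    iw = max(0, min(a[3], b[3]) - max(a[2], b[2]) + 1)
    inter = ih * iw
    area = lambda y: (y[1] - y[0] + 1) * (y[3] - y[2] + 1)
    return inter / (area(a) + area(b) - inter)

def abo(gt_boxes, proposals):
    """Eq. (8): average, over ground-truth boxes of one class, of the best IoU
    achieved by any proposal."""
    return sum(max(iou(g, y) for y in proposals) for g in gt_boxes) / len(gt_boxes)

# Toy boxes (ours): one ground-truth box matched exactly, one matched loosely.
gt = [(0, 9, 0, 9), (20, 29, 20, 29)]
props = [(0, 9, 0, 9), (18, 29, 20, 29)]
```

Note that ABO rewards each ground-truth box for its single best proposal, which is exactly the behavior our coverage-based diversity term optimizes for.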
4) SS-EB: use the EdgeBoxesScore to re-rank Selective Search windows.

Fig. 4 shows that our approach, both at λ = ∞ and with validation-tuned λ, outperforms all baselines. At M = 25, 100, and 500, our approach is 20%, 11%, and 3% better than Selective Search and 14%, 10%, and 6% better than EdgeBoxes70, respectively.

5.2 Ablation Studies

We now study the performance of our system under different components and parameter settings.

Effect of λ and Reference Boxes. We test the performance of our approach as a function of λ using reference boxes from different object proposal generators (all reported at M = 200 on PASCAL VOC 2012). Our reference box generators are: 1) Selective Search [39]; 2) MCG [2]; 3) CPMC [7]; 4) EdgeBoxes [41] at IoU = 0.7; 5) Objectness [1]; and 6) Uniform-sampling [20], i.e., uniformly sample the bounding box center position, square-root area and log aspect ratio.

Table 1 shows the performance of SubmodBoxes when used with these different reference box generators. Our approach shows an improvement (over the corresponding method) for all reference boxes. Our approach outperforms the current state of the art, MCG, by 2% and Selective Search by 5%. This is significantly larger than previous improvements reported in the literature.

Fig. 5a shows more fine-grained behavior as λ is varied. At λ = 0, all methods produce the same (highest-weighted) box M times. At λ = ∞, they all perform a re-ranking of the reference set of boxes. In nearly all curves, there is a peak at some intermediate setting of λ. The only exception is EdgeBoxes, which is expected since it is being used in both the relevance and diversity terms.

Effect of No. of B&B Steps. We analyze the convergence trends of B&B. Fig. 5b shows that both the optimization objective value and the MABO increase with the number of B&B iterations.
[Figure 4: ABO vs. no. of proposals (three plots); legend: SubmodBoxes; SubmodBoxes, λ=∞; EB50; EB70; EB90; EB50/EB70/EB90 without affinities; SS; SS-EB.]

                             EB      Selective-Search  MCG     CPMC    Objectness  Uniform-sampling
λ ≈ 0.4, weighting           0.7342  0.7377            0.6747  0.7125  0.5937      0.6131
λ ≈ 0.4, without weighting   0.5697  0.6350            0.5042  0.5681  0.5136      0.6220
λ = 10, weighting            0.7233  0.6467            0.7417  0.7130  0.5478      0.5006
λ = 10, without weighting    0.5844  0.6232            0.5534  0.5849  0.5115      0.5920
λ = ∞, weighting             0.7222  0.6558            0.7409  0.7116  0.5453      0.4980
Original method              0.6817  0.6755            0.7206  0.7032  0.5295      0.6038

Table 1: Comparison of the with/without weighting scheme (rows) with different reference boxes (columns). The 'Original method' row shows the performance of directly using the object proposals from these proposal generators. '≈' means we report the best performance from λ = 0.3, 0.4 and 0.5, since the peak occurs at a different λ for different object proposal generators.

(a) Performance vs. λ with different reference box generators. (b) Objective and performance vs. no. of iterations. (c) Performance vs. no. of matching boxes.
Figure 5: Experiments on different parameter settings.

Effect of No. of Matching Boxes. Instead of allowing the chosen boxes to cover exactly one reference box, we analyze the effect of matching the top-k reference boxes. Fig. 5c shows that the performance decreases monotonically as more matches are allowed.

Figure 6: Comparison of the number of B&B iterations of our Lazy Greedy generalization and independent B&B runs.

Speed up via Lazy Greedy. Fig.
6 compares the number of B&B iterations required with and without our proposed Lazy Greedy generalization (averaged over 100 randomly chosen images): Lazy Greedy significantly reduces the number of B&B iterations required. The cost of each B&B evaluation is nearly the same, so the speed-up in iterations translates directly into a speed-up in time.
6 Conclusions
To summarize, we formally studied the search for a set of diverse bounding boxes as an optimization problem and provided theoretical justification for greedy and heuristic approaches used in prior work. The key challenge of this problem is the large search space. Thus, we proposed a generalization of Minoux's 'lazy greedy' algorithm to the B&B tree to speed up the classical greedy algorithm. We tested our formulation on three object detection datasets: PASCAL VOC 2007, PASCAL VOC 2012 and Microsoft COCO. Results show that our formulation outperforms all baselines under a novel diversity measure.
Acknowledgements. This work was partially supported by a National Science Foundation CAREER award, an Army Research Office YIP award, an Office of Naval Research grant, an AWS in Education Research Grant, and GPU support by NVIDIA. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government or any sponsor.
References
[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. PAMI, 34(11):2189–2202, Nov 2012.
[2] P. Arbelaez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
[3] M. Blaschko. Branch and bound strategies for non-maximal suppression in object detection. In EMMCVPR, pages 385–398, 2011.
[4] M. B. Blaschko and C. H. Lampert. Learning to localize objects with structured output regression. In ECCV, 2008.
[5] N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. A tight (1/2) linear-time approximation to unconstrained submodular maximization. In FOCS, 2012.
[6] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, pages 335–336, 1998.
[7] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, 2010.
[8] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014.
[9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[10] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, 2010.
[11] D. Dey, T. Liu, M. Hebert, and J. A. Bagnell. Contextual sequence prediction with application to control library optimization. In Robotics Science and Systems (RSS), 2012.
[12] E. L. Lawler and D. E. Wood. Branch-and-bound methods: A survey. Operations Research, 14(4):699–719, 1966.
[13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[15] U. Feige, V. Mirrokni, and J. Vondrák. Maximizing non-monotone submodular functions. In FOCS, 2007.
[16] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 32(9):1627–1645, 2010.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[18] A. Gonzalez-Garcia, A. Vezhnevets, and V. Ferrari. An active search strategy for efficient object detection. In CVPR, 2015.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[20] J. Hosang, R. Benenson, and B. Schiele. How good are detection proposals, really? In BMVC, 2014.
[21] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.
[22] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003.
[23] P. Krahenbuhl and V. Koltun. Learning to propose objects. In CVPR, 2015.
[24] A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.
[25] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.
[26] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwindow search: A branch and bound framework for object localization. TPAMI, 31(12):2129–2142, 2009.
[27] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In ACL, 2011.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[29] M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. Optimization Techniques, pages 234–243, 1978.
[30] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.
[31] A. Prasad, S. Jegelka, and D. Batra. Submodular meets structured: Finding diverse subsets in exponentially-large structured item sets. In NIPS, 2014.
[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[33] S. Ross, J. Zhou, Y. Yue, D. Dey, and J. A. Bagnell. Learning policies for contextual submodular prediction. In ICML, 2013.
[34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[35] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008.
[36] C. Szegedy, S. Reed, and D. Erhan. Scalable, high-quality object detection. In CVPR, 2014.
[37] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS, 2013.
[38] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[39] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
[40] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 57(2):137–154, May 2004.
[41] C. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. In ECCV, 2014.