{"title": "Weakly-supervised Discovery of Visual Pattern Configurations", "book": "Advances in Neural Information Processing Systems", "page_first": 1637, "page_last": 1645, "abstract": "The prominence of weakly labeled data gives rise to a growing demand for object detection methods that can cope with minimal supervision. We propose an approach that automatically identifies discriminative configurations of visual patterns that are characteristic of a given object class. We formulate the problem as a constrained submodular optimization problem and demonstrate the benefits of the discovered configurations in remedying mislocalizations and finding informative positive and negative training examples. Together, these lead to state-of-the-art weakly-supervised detection results on the challenging PASCAL VOC dataset.", "full_text": "Weakly-supervised Discovery of\nVisual Pattern Con\ufb01gurations\n\nHyun Oh Song\n\nYong Jae Lee*\n\nStefanie Jegelka\n\nTrevor Darrell\n\nUniversity of California, Berkeley\n\n*University of California, Davis\n\nAbstract\n\nThe prominence of weakly labeled data gives rise to a growing demand for ob-\nject detection methods that can cope with minimal supervision. We propose an\napproach that automatically identi\ufb01es discriminative con\ufb01gurations of visual pat-\nterns that are characteristic of a given object class. We formulate the problem as a\nconstrained submodular optimization problem and demonstrate the bene\ufb01ts of the\ndiscovered con\ufb01gurations in remedying mislocalizations and \ufb01nding informative\npositive and negative training examples. Together, these lead to state-of-the-art\nweakly-supervised detection results on the challenging PASCAL VOC dataset.\n\n1 Introduction\nThe growing amount of sparsely and noisily labeled image data demands robust detection methods\nthat can cope with a minimal amount of supervision. 
A prominent example of this scenario is the\nabundant availability of labels at the image level (i.e., whether a certain object is present or absent\nin the image); detailed annotations of the exact location of the object are tedious and expensive and,\nconsequently, scarce. Learning methods that can handle image-level labels circumvent the need\nfor such detailed annotations and therefore have the potential to effectively use the vast textually\nannotated visual data available on the Web. Moreover, if the detailed annotations happen to be noisy\nor erroneous, such weakly supervised methods can even be more robust than fully supervised ones.\nMotivated by these developments, recent work has explored learning methods that decreasingly\nrely on strong supervision. Early ideas for weakly supervised detection [11, 32] paved the way\nby successfully learning part-based object models, albeit on simple object-centric datasets (e.g.,\nCaltech-101). Since then, a number of approaches [21, 26, 29] have aimed at learning models from\nmore realistic and challenging data sets that feature large intra-category appearance variations and\nbackground clutter. These approaches typically generate multiple candidate regions and retain the\nones that occur most frequently in the positively-labeled images. However, due to intra-category\nvariations and deformations, the identi\ufb01ed (single) patches often correspond to only a part of the\nobject, such as a human face instead of the entire body. Such mislocalizations are a frequent problem\nfor weakly supervised detection methods.\nMislocalization and too large or too small bounding boxes are problematic in two respects. First,\ndetection is commonly phrased as multiple instance learning (MIL) and solved by non-convex op-\ntimization methods that alternatingly guess the location of the objects as positive examples (since\nthe true location is unknown) and train a detector based on those guesses. 
This procedure is heavily\naffected by the initial localizations. Second, the detector is often trained in stages; in each stage one\nadds informative \u201chard\u201d negative examples to the training data. If we are not given accurate true\nobject localizations in the training data, these hard examples must be derived from the detections\ninferred in earlier rounds. The higher the accuracy of the initial localizations, the more informative\nis the augmented training data \u2013 and this is key to the accuracy of the \ufb01nal learned model.\nIn this work, we address the issue of mislocalizations by identifying characteristic, discriminative\ncon\ufb01gurations of multiple patches (rather than a single one). This part-based approach is motivated\nby the observation that automatically discovered single \u201cdiscriminative\u201d patches often correspond\nto object parts. In addition, while background patches (e.g., of water or sky) can also occur through-\nout the positive images, they will re-occur in arbitrary rather than \u201ctypical\u201d con\ufb01gurations. We\ndevelop an effective method that takes as input a set of images with labels of the form \u201cthe object is\npresent/absent\u201d, and automatically identi\ufb01es characteristic part con\ufb01gurations of the given object.\nTo identify such con\ufb01gurations, we use two main criteria. First, useful patches are discriminative,\ni.e., they occur in many positively-labeled images, and rarely in the negatively labeled ones. To iden-\ntify such patches, we use a discriminative covering formulation similar to [29]. Second, the patches\nshould represent different parts, i.e., they may be close but should not overlap too much. In covering\nformulations, one may rule out overlaps by saying that for two overlapping regions, one \u201ccovers\u201d\nthe other, i.e., they are treated as identical and picking one is as good as picking both. 
But identity is\na transitive relation, and the density of possible regions in detection would imply that all regions are\nidentical, strongly discouraging the selection of more than one part per image. Partial covers face\nthe problem of scale invariance. Hence, we instead formulate an independence constraint. This sec-\nond criterion ensures that we select regions that may be close but are non-redundant and suf\ufb01ciently\nnon-overlapping. We show that this constrained selection problem corresponds to maximizing a\nsubmodular function subject to a matroid intersection constraint, which leads to approximation al-\ngorithms with theoretical worst-case bounds. Given candidate parts identi\ufb01ed by these two criteria,\nwe effectively \ufb01nd frequently co-occurring con\ufb01gurations that take into account relative position,\nscale, and viewpoint.\nWe demonstrate multiple bene\ufb01ts of the discovered con\ufb01gurations. First, we observe that con\ufb01gu-\nrations of patches can produce more accurate spatial coverage of the full object, especially when the\nmost discriminative pattern corresponds to an object part. Second, any overlapping region between\nco-occurring visual patterns is likely to cover a part (but not the full) of the object of interest. Thus,\nthey can be used to generate mis-localized positives as informative hard negatives for training (see\nwhite boxes in Figure 3), which can further reduce localization errors at test time.\nIn short, our main contribution is a weakly-supervised object detection method that automatically\ndiscovers frequent con\ufb01gurations of discriminative visual patterns to train robust object detectors.\nIn our experiments on the challenging PASCAL VOC dataset, we \ufb01nd the inclusion of our discrim-\ninative, automatically detected con\ufb01gurations to outperform all existing state-of-the-art methods.\n\n2 Related work\n\nWeakly-supervised object detection. 
Object detectors have commonly been trained in a fully-\nsupervised manner, using tight bounding box annotations that cover the object of interest (e.g., [10]).\nTo reduce laborious bounding box annotation costs, recent weakly-supervised approaches [3, 4, 11,\n21, 26, 29, 32] use image-level object-presence labels with no information on object location.\nEarly efforts [11, 32] focused on simple datasets that have a single prominent object in each image\n(e.g., Caltech-101). More recent approaches [21, 26, 29] work with the more challenging PASCAL\ndataset that contains multiple objects in each image and large intra-category appearance variations.\nOf these, Song et al. [29] achieve state-of-the-art results by \ufb01nding discriminative image patches\nthat occur frequently in the positive images but rarely in the negative images, using deep Convolu-\ntional Neural Network (CNN) features [17] and a submodular cover formulation. We build on their\napproach to identify discriminative patches. But, contrary to [29] which assumes patches to contain\nentire objects, we assume patches to contain either full objects or merely object parts, and automat-\nically piece together those patches to produce better full-object estimates. To this end, we change\nthe covering formulation and identify patches that are both representative and explicitly mutually\ndifferent. This leads to more robust object estimates and further allows our system to intelligently\nselect \u201chard negatives\u201d (mislocalized objects), both of which improve detection performance.\nVisual data mining. Existing approaches discover high-level object categories [14, 7, 28], mid-level\npatches [5, 16, 24], or low-level foreground features [18] by grouping similar visual patterns (i.e.,\nimages, patches, or contours) according to their texture, color, shape, etc. Recent methods [5, 16]\nuse weakly-supervised labels to discover discriminative visual patterns. 
We use related ideas, but\nformulate the problem as a submodular optimization over matroids, which leads to approximation\nalgorithms with theoretical worst-case guarantees. Covering formulations have also been used in\n[1, 2], but after running a trained object detector. An alternative discriminative approach is to use\nspectral methods [34].\nModeling co-occurring visual patterns. It is known that modeling the spatial and geometric rela-\ntionship between co-occurring visual patterns (objects or object-parts) often improves visual recog-\nnition performance [8, 18, 10, 11, 19, 23, 27, 24, 32, 33]. Co-occurring patterns are usually rep-\nresented as doublets [24], higher-order constellations [11, 32] or star-shaped models [10]. Among\nthese, our work is most inspired by [11, 32], which learn part-based models with weak supervi-\nsion. We use more informative deep CNN features and a different formulation, and show results on\nmore dif\ufb01cult datasets. Our work is also related to [19], which discovers high-level object composi-\ntions (\u201cvisual phrases\u201d [8]), but with ground-truth bounding box annotations. In contrast, we aim to\ndiscover part compositions to represent full objects and do so with less supervision.\n\n3 Approach\nOur goal is to \ufb01nd a discriminative set of patches that co-occur in the same con\ufb01guration in many\npositively-labeled images. We address this goal in two steps. First, we \ufb01nd a set of patches that are\ndiscriminative; i.e., they occur frequently in positive images and rarely in negative images. Second,\nwe ef\ufb01ciently \ufb01nd co-occurring con\ufb01gurations of pairs of such patches. Our approach easily extends\nbeyond pairs; for simplicity and to retain con\ufb01gurations that occur frequently enough, we here\nrestrict ourselves to pairs.\nDiscriminative candidate patches. For identifying discriminative patches, we begin with a con-\nstruction similar to that of Song et al. [29]. 
Let P be the set of positively-labeled images. Each\nimage I contains candidate boxes {bI,1, . . . , bI,m} found via selective search [30]. For each bI,i, we\n\ufb01nd its closest matching neighbor bI\u2032,j in each other image I\u2032 (regardless of the image label). The\nK closest of those neighbors form the neighborhood N (bI,i); the remaining ones are discarded.\nDiscriminative patches have neighborhoods mainly within images in P, i.e., if B(P) is the set of all\npatches from images in P, then |N (b)\u2229B(P)| \u2248 K. To identify a small, diverse and representative\nset of such patches, like [29], we construct a bipartite graph G = (U,V,E), where both U and V\ncontain copies of B(P). Each patch b \u2208 V is connected to the copy of its nearest neighbors in U (i.e.,\nN (b)\u2229B(P)). These will be K or fewer, depending on whether the K nearest neighbors of b occur\nin B(P) or in negatively-labeled images. The most representative patches maximize the covering\nfunction\n\nF(S) = |\u0393(S)|,    (1)\n\nwhere \u0393(S) = {u \u2208 U | (b, u) \u2208 E for some b \u2208 S} \u2286 U is the neighborhood of S \u2286 V in the\nbipartite graph. Figure 1 shows a cartoon illustration. The function F is monotone and submodular,\nand the C maximizing elements (for a given C) can be selected greedily [20].\nHowever, if we aim to \ufb01nd part con\ufb01gurations, we must select multiple, jointly informative patches\nper image. Patches selected to merely maximize coverage can still be redundant, since the most\nfrequently occurring ones are often highly overlapping. A straightforward modi\ufb01cation would be\nto treat highly overlapping patches as identical. This identi\ufb01cation would still admit a submodular\ncover model as in Equation (1). But, in our case, the candidate patches are very densely packed in\nthe image, and, by transitivity, we would have to make all of them identical. In consequence, this\nwould completely rule out the selection of more than one patch in an image and thereby prohibit the\ndiscovery of any co-occurring con\ufb01gurations.\nInstead, we directly constrain our selection such that no two patches b, b\u2032 \u2208 V can be picked whose\nneighborhoods overlap by more than a fraction \u03b8. By overlap, we mean that the patches in the\nneighborhoods of b, b\u2032 overlap signi\ufb01cantly (they need not be identical). This notion of diversity is\nreminiscent of NMS and similar to that in [5], but we here phrase and analyze it as a constrained\nsubmodular optimization problem. Our constraint can be expressed in terms of a different graph\nGC = (V,EC) with nodes V. In GC, there is an edge between b and b\u2032 if their neighborhoods overlap\nprohibitively, as illustrated in Figure 1. Our family of feasible solutions is\n\nM = {S \u2286 V | \u2200 b, b\u2032 \u2208 S there is no edge (b, b\u2032) \u2208 EC}.    (2)\n\nIn other words, M is the family of all independent sets in GC. We aim to maximize\n\nmax_{S\u2286V} F(S)  s.t.  S \u2208 M.    (3)\n\nFigure 1: Left: bipartite graph G that de\ufb01nes the utility function F and identi\ufb01es discriminative\npatches; right: graph GC that de\ufb01nes the diversifying independence constraints M. We may pick\nC1 (yellow) and C3 (green) together, but not C2 (red) with any of those.\n\nThis problem is NP-hard. We solve it approximately via the following greedy algorithm. Begin with\nS0 = \u2205, and, in iteration t, add b \u2208 argmax_{b\u2208V\\S} |\u0393(b) \\ \u0393(S_{t\u22121})|. As we add b, we delete all of\nb\u2019s neighbors in GC from V. We continue until V = \u2205. If the neighborhoods of any b, b\u2032 are disjoint\nbut contain overlapping elements (\u0393(b) \u2229 \u0393(b\u2032) = \u2205 but there exist u \u2208 \u0393(b) and u\u2032 \u2208 \u0393(b\u2032) that\noverlap), then this algorithm amounts to the following simpli\ufb01ed scheme: we \ufb01rst sort all b \u2208 V in\nnon-increasing order by their degree \u0393(b), i.e., their number of neighbors in B(P), and visit them in\nthis order. We always add the currently highest b in the list to S, then delete it from the list, and with\nit all its immediate (overlapping) neighbors in GC. The following lemma states an approximation\nfactor for the greedy algorithm, where \u2206 is the maximum degree of any node in GC.\nLemma 1. The solution Sg returned by the greedy algorithm is a 1/(\u2206 + 2) approximation for\nProblem (3): F(Sg) \u2265 1/(\u2206 + 2) F(S\u2217). If \u0393(b) \u2229 \u0393(b\u2032) = \u2205 for all b, b\u2032 \u2208 V, then the worst-case\napproximation factor is 1/(\u2206 + 1).\nThe proof relies on phrasing M as an intersection of matroids.\nDe\ufb01nition 1 (Matroid). A matroid (V,Ik) consists of a ground set V and a family Ik \u2286 2^V of\n\u201cindependent sets\u201d that satisfy three axioms: (1) \u2205 \u2208 Ik; (2) downward closedness: if S \u2208 Ik then\nT \u2208 Ik for all T \u2286 S; and (3) the exchange property: if S, T \u2208 Ik and |S| < |T|, then there is an\nelement v \u2208 T \\ S such that S \u222a {v} \u2208 Ik.\nProof. (Lemma 1) We will argue that Problem (3) is the problem of maximizing a monotone sub-\nmodular function subject to the constraint that the solution lies in the intersection of \u2206 + 1 matroids.\nWith this insight, the approximation factor of the greedy algorithm for submodular F follows from\n[12] and that for non-intersecting \u0393(b) from [15], since in the latter case the problem is that of\n\ufb01nding a maximum weight vector in the intersection of \u2206 + 1 matroids.\nIt remains to argue that M is an intersection of matroids. 
Our matroids will be partition matroids\n(over the ground set V) whose independent sets are of the form Ik = {S | |S \u2229 e| \u2264 1, for all e \u2208\nEk}. To de\ufb01ne those, we partition the edges in GC into disjoint sets Ek, i.e., no two edges in Ek\nshare a common node. The Ek can be found by an edge coloring \u2013 one Ek and Ik for each color k.\nBy Vizing\u2019s theorem [31], we need at most \u2206+1 colors. The matroid Ik demands that for each edge\ne \u2208 Ek, we may only select one of its adjacent nodes. All matroids together say that for any edge\ne \u2208 E, we may only select one of the adjacent nodes, and that is the constraint in Equation (2), i.e.\nM = \u22c2_{k=1}^{\u2206+1} Ik. We do not ever need to explicitly compute Ek and Ik; all we need to do is check\nmembership in the intersection, and this is equivalent to checking whether a set S is an independent\nset in GC, which is achieved implicitly via the deletions in the algorithm.\nFrom the constrained greedy algorithm, we obtain a set S \u2282 V of discriminative patches. Together\nwith its neighborhood \u0393(b), each patch b \u2208 V forms a representative cluster. Figure 2 shows some\nexample patches derived from the labels \u201caeroplane\u201d and \u201cmotorbike\u201d. The discovered patches\nintuitively look like \u201cparts\u201d of the objects, and are frequent but suf\ufb01ciently different.\nFinding frequent con\ufb01gurations. The next step is to \ufb01nd frequent con\ufb01gurations of co-occurring\nclusters, e.g., the head patch of a person on top of the torso patch, or a bicycle with visible wheels.\n\nFigure 2: Examples of discovered patch \u201cclusters\u201d for aeroplane, motorbike, and cat. The discovered\npatches intuitively look like object parts, and are frequent but suf\ufb01ciently different.\n\nA \u201ccon\ufb01guration\u201d consists of patches from two clusters Ci, Cj, their relative location, and their\nviewpoint and scale. 
In practice, we give preference to pairs that by themselves are very relevant\nand maximize a weighted combination of co-occurrence count and coverage max{\u0393(Ci), \u0393(Cj)}.\nAll possible con\ufb01gurations of all pairs of patches amount to too many to explicitly write down and\ncount. Instead, we follow an ef\ufb01cient procedure for \ufb01nding frequent con\ufb01gurations. Our approach\nis inspired by [19], but does not require any supervision. We \ufb01rst \ufb01nd con\ufb01gurations that occur in at\nleast two images. To do so, we consider each pair of images I1, I2 that have at least two co-occurring\nclusters. For each correspondence of cluster patches across the images, we \ufb01nd a corresponding\ntransform operation (translation, scale, viewpoint change). This results in a point in a 4D transform\nspace, for each cluster correspondence. We quantize this space into B bins. Our candidate con\ufb01gu-\nrations will be pairs of cluster correspondences ((bI1,1, bI2,1), (bI1,2, bI2,2)) \u2208 (Ci\u00d7Ci)\u00d7(Cj\u00d7Cj)\nthat fall in the same bin, i.e., share the same transform and have the same relative location. Between\na given pair of images, there can be multiple such pairs of correspondences. We keep track of those\nvia a multi-graph GP = (P,EP ) that has a node for each image I \u2208 P. For each correspondence\n((bI1,1, bI2,1), (bI1,2, bI2,2)), we draw an edge (I1, I2) and label it by the clusters Ci, Cj and the\ncommon relative position. As a result, there can be multiple edges (I1, Ij) in GP with different edge\nlabels.\nThe most frequently occurring con\ufb01guration can now be read out by \ufb01nding the largest connected\ncomponent in GP induced by retaining only edges with the same label. We use the largest compo-\nnent(s) as the characteristic con\ufb01gurations for a given image label (object class). 
If the component\nis very small, then there is not enough information to determine co-occurrences, and we simply use\nthe most frequent single cluster. The \ufb01nal single \u201ccorrect\u201d localization will be the smallest bounding\nbox that contains the full con\ufb01guration.\nDiscovering mislocalized hard negatives. Discovered frequent con\ufb01gurations not only lead\nto better localization estimates of the full object, but can also be used to generate mislocalized\nestimates as \u201chard negatives\u201d when training the object detector. We exploit this idea as follows.\nLet b1, b2 be a discovered con\ufb01guration within a given image. These patches typically constitute\nco-occurring parts or a part and the full object. Our foreground estimate is the smallest box that\nincludes both b1 and b2. Hence, any region within the foreground estimate that does not overlap\nsimultaneously with both b1 and b2 will capture only a fragment of the foreground object. We extract\nthe four largest such rectangular regions (see white boxes in Figure 3) as hard negative examples.\nSpeci\ufb01cally, we parameterize any rectangular region with [x^l, x^r, y^t, y^b], i.e., its x-left, x-right,\ny-top, and y-bottom coordinate values. Let the bounding box of b_i (i = 1, 2) be [x^l_i, x^r_i, y^t_i, y^b_i],\nthe foreground estimate be [x^l_f, x^r_f, y^t_f, y^b_f], and let x^l = max(x^l_1, x^l_2), x^r = min(x^r_1, x^r_2),\ny^t = max(y^t_1, y^t_2), y^b = min(y^b_1, y^b_2). We generate four hard negatives: [x^l_f, x^l, y^t_f, y^b_f],\n[x^r, x^r_f, y^t_f, y^b_f], [x^l_f, x^r_f, y^t_f, y^t], [x^l_f, x^r_f, y^b, y^b_f]. If either b1 or b2 is very small in size relative to the foreground, the\nresulting hard negatives can have high overlap with the foreground, which will introduce undesirable\nnoise (false negatives) when training the detector. Thus, we shrink any hard negative that overlaps\nwith the foreground estimate by more than 50%, until its overlap is 50% (we adjust the boundary\nthat does not coincide with any of the foreground estimation boundaries).\n\nFigure 3: Automatically discovered foreground estimation box (magenta), hard negative (white),\nand the patch (yellow) that induced the hard negative. Note that we are only showing the largest one\nout of (up to) four hard negatives per image.\n\nNote that simply taking arbitrary rectangular regions that overlap with the foreground estimation box\nby some threshold will not always generate useful hard negatives (as we show in the experiments).\nIf the overlap threshold is too low, the selected regions will be uninformative, and if the overlap\nthreshold is too high, the selected regions will cover too much of the foreground. Our approach\nselects informative hard negatives more robustly by ruling out the overlapping region between the\ncon\ufb01guration patches, which is very likely to be part of the foreground object but not the full object.\nMining positives and training the detector. While the discovered con\ufb01gurations typically lead\nto better foreground localization, their absolute count can be relatively low compared to the total\nnumber of positive images. This is due to inaccuracies in the initial patch discovery stage: for a\nfrequent con\ufb01guration to be discovered, both of its patches must be found accurately. 
Thus, we also\nmine additional positives from the set of remaining positive images P\u2032 that did not produce any of\nthe discovered con\ufb01gurations.\nTo do so, we train an initial object detector, using the foreground estimates derived from our discov-\nered con\ufb01gurations as positive examples, and the corresponding discovered hard negative regions as\nnegatives. In addition, we mine negative examples in negative images as in [10]. We run the detector\non all selective search regions in P\u2032 and retain the region in each image with the highest detection\nscore as an additional positive training example. Our \ufb01nal detector is trained on this augmented\ntraining data, and iteratively improved by latent SVM (LSVM) updates (see [10, 29] for details).\n\n4 Experiments\nIn this section, we analyze: (1) detection performance of the models trained with the discovered\ncon\ufb01gurations, and (2) impact of the discovered hard negatives on detection performance.\nImplementation details. We employ a recent region based detection framework [13, 29] and use the\nsame fc7 features from the CNN model [6] on region proposals [30] throughout the experiments. For\ndiscriminative patch discovery, we use K = |P|/2, \u03b8 = K/20. For correspondence detection, we\ndiscretize the 4D transform space of {x: relative horizontal shift, y: relative vertical shift, s: relative\nscale, p: relative aspect ratio} with \u2206x = 30 px, \u2206y = 30 px, \u2206s = 1 px/px, \u2206p = 1 px/px.\nWe chose this binning scheme by examining a few qualitative examples so that scale and aspect\nratio agreement between the two paired instances is stricter, while their translation agreement\nis looser, in order to handle deformable objects. More details regarding the transform space\nbinning can be found in [22].\nDiscovered con\ufb01gurations. 
Figure 5 shows the discovered con\ufb01gurations (solid green and yellow\nboxes) and foreground estimates (dashed magenta boxes) that have high degree in graph GP for all\n20 classes in the PASCAL dataset. Our method consistently \ufb01nds meaningful combinations such\nas a wheel and body of bicycles, face and torso of people, locomotive basement and upper body\nparts of trains/buses, and window and body frame of cars. Some failures include cases where the\nalgorithm latches onto different objects co-occurring in consistent con\ufb01gurations such as the lamp\nand sofa combination (right column, second row from the bottom in Figure 5).\nWeakly-supervised object detection. Following the evaluation protocol of the PASCAL VOC\ndataset, we report detection results on the PASCAL test set using detection average precision. For a\ndirect comparison with the state-of-the-art weakly-supervised object detection method [29], we do\nnot use the extra instance level annotations such as pose, dif\ufb01cult, truncated and restrict the supervi-\nsion to the image-level object presence annotations. Table 1 compares our detection results against\ntwo baseline methods [25, 29] on the full dataset. Our method improves detection performance on\n15 of the 20 classes. It is worth noting that our method yields signi\ufb01cant improvement on the person\nclass, which is arguably the most important category in the PASCAL dataset. Figure 4 shows some\nexample high scoring detections on the test set. Our method produces more complete detections\nsince it is trained on better localized instances of the object-of-interest.\n\nmethod aero bike bird boat btl bus car cat chr cow tble dog horse mbk pson plnt shp sofa train tv mAP\n[25] 13.4 44.0 3.1 3.1 0.0 31.2 43.9 7.1 0.1 9.3 9.9 1.5 29.4 38.3 4.6 0.1 0.4 3.8 34.2 0.0 13.9\n[29] 27.6 41.9 19.7 9.1 10.4 35.8 39.1 33.6 0.6 20.9 10.0 27.7 29.4 39.2 9.1 19.3 20.5 17.1 35.6 7.1 22.7\nours1 31.9 47.0 21.9 8.7 4.9 34.4 41.8 25.6 0.3 19.5 14.2 23.0 27.8 38.7 21.2 17.6 26.9 12.8 40.1 9.2 23.4\nours2 36.3 47.6 23.3 12.3 11.1 36.0 46.6 25.4 0.7 23.5 12.5 23.5 27.9 40.9 14.8 19.2 24.2 17.1 37.7 11.6 24.6\n\nTable 1: Detection average precision (%) on full PASCAL VOC 2007 test set. ours1: before latent\nupdates. ours2: after latent updates.\n\nmethod | w/o hard negatives | neighboring hard negatives | discovered hard negatives\nours + SVM | 22.5 | 22.2 | 23.4\nours + LSVM | 23.7 | 23.9 | 24.6\n\nTable 2: Effect of our hard negative examples on full PASCAL VOC 2007 test set (mAP, %).\n\nFigure 4: Example detections on test set. Green: our method, red: [29]\n\nImpact of discovered hard negatives. To analyze the effect of our discovered hard negatives, we\ncompare to two baselines: (1) not adding any negative examples from positive images, and (2)\nadding image regions around the foreground estimate, as conventionally implemented in fully su-\npervised object detection algorithms [9, 13]. For the latter, we use the criterion from [13], where\nall image regions in positive images with overlap score (intersection over union with respect to any\nforeground region) less than 0.3 are used as \u201cneighboring\u201d negative image regions on positive im-\nages. Table 2 shows the effect of our hard negative examples on detection mean average precision for\nall classes (mAP). We also added neighboring negative examples to [29], but this decreases its mAP\nfrom 20.3% to 20.2% (before latent updates) and from 22.7% to 21.8% (after latent updates). These\nexperiments show that adding neighboring negative regions does not lead to noticeable improve-\nment over not adding any negative regions from positive images, while adding our automatically\ndiscovered hard negative regions improves detection performance more substantially.\nConclusion. We developed a weakly-supervised object detection method that discovers frequent\ncon\ufb01gurations of discriminative visual patterns. 
We showed that the discovered con\ufb01gurations pro-\nvide more accurate spatial coverage of the full object and provide a way to generate useful hard\nnegatives. Together, these lead to state-of-the-art weakly-supervised detection results on the chal-\nlenging PASCAL VOC dataset.\n\nFigure 5: Example con\ufb01gurations that have high degree in graph GP . The solid green and yel-\nlow boxes show the discovered discriminative visual parts, and the dashed magenta box shows the\nbounding box that tightly \ufb01ts their con\ufb01guration.\n\nAcknowledgement. This work was supported in part by DARPA\u2019s MSEE and SMISC programs, by NSF awards IIS-1427425, IIS-1212798, IIS-1116411, and by\nsupport from Toyota. We thank the NVIDIA Corporation for generously providing GPUs through their academic program.\nReferences\n[1] O. Barinova, V. Lempitsky, and P. Kohli. On detection of multiple object instances using hough transforms. IEEE TPAMI, 2012.\n\n[2] Y. Chen, H. Shioi, C. Fuentes-Montesinos, L. Koh, S. Wich, and A. Krause. Active detection via adaptive submodularity. In ICML, 2014.\n\n[3] T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, 2010.\n\n[4] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. IJCV, 2012.\n\n[5] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros. What Makes Paris Look like Paris? In SIGGRAPH, 2012.\n\n[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. arXiv e-prints, 2013.\n\n[7] A. Faktor and M. Irani. Clustering by Composition Unsupervised Discovery of Image Categories. In ECCV, 2012.\n\n[8] A. Farhadi and A. Sadeghi. Recognition Using Visual Phrases. In CVPR, 2011.\n\n[9] P. Felzenszwalb, D. McAllester, and D. Ramanan. 
A Discriminatively Trained, Multiscale, Deformable Part Model. In CVPR, 2008.\n\n[10] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models. TPAMI, 32(9), 2010.\n\n[11] R. Fergus, P. Perona, and A. Zisserman. Object Class Recognition by Unsupervised Scale-Invariant Learning. In CVPR, 2003.\n\n[12] M. Fisher, G. Nemhauser, and L. Wolsey. An analysis of approximations for maximizing submodular set functions - II. Math. Prog. Study, 8:73\u201387, 1978.\n\n[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv e-prints, 2013.\n\n[14] K. Grauman and T. Darrell. Unsupervised learning of categories from sets of partially matching image features. In CVPR, 2006.\n\n[15] T. Jenkyns. The efficacy of the \u201cgreedy\u201d algorithm. In Proc. of 7th South Eastern Conference on Combinatorics, Graph Theory and Computing, pages 341\u2013350, 1976.\n\n[16] M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Blocks that Shout: Distinctive Parts for Scene Classification. In CVPR, 2013.\n\n[17] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.\n\n[18] Y. J. Lee and K. Grauman. Foreground Focus: Unsupervised Learning From Partially Matching Images. IJCV, 85, 2009.\n\n[19] C. Li, D. Parikh, and T. Chen. Automatic Discovery of Groups of Objects for Scene Understanding. In CVPR, 2012.\n\n[20] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1):265\u2013294, 1978.\n\n[21] M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, 2011.\n\n[22] D. Parikh, C. L. Zitnick, and T. Chen. 
From Appearance to Context-Based Recognition: Dense Labeling in Small Images. In CVPR, 2008.\n\n[23] T. Quack, V. Ferrari, B. Leibe, and L. V. Gool. Efficient Mining of Frequent and Distinctive Feature Configurations. In ICCV, 2007.\n\n[24] S. Singh, A. Gupta, and A. A. Efros. Unsupervised Discovery of Mid-level Discriminative Patches. In ECCV, 2012.\n\n[25] P. Siva and T. Xiang. Weakly supervised object detector learning with model drift detection. In ICCV, 2011.\n\n[26] P. Siva, C. Russell, and T. Xiang. In defence of negative mining for annotating weakly labelled data. In ECCV, 2012.\n\n[27] J. Sivic and A. Zisserman. Video Data Mining Using Configurations of Viewpoint Invariant Regions. In CVPR, 2004.\n\n[28] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering object categories in image collections. In ICCV, 2005.\n\n[29] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In ICML, 2014.\n\n[30] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.\n\n[31] V. Vizing. On an estimate of the chromatic class of a p-graph. Diskret. Analiz., 3:25\u201330, 1964.\n\n[32] M. Weber, M. Welling, and P. Perona. Unsupervised Learning of Models for Recognition. In ECCV, 2000.\n\n[33] Y. Zhang and T. Chen. Efficient Kernels for Identifying Unbounded-order Spatial Features. In CVPR, 2009.\n\n[34] J. Zou, D. Hsu, D. Parkes, and R. Adams. Contrastive learning using spectral methods. In NIPS, 2013.\n", "award": [], "sourceid": 863, "authors": [{"given_name": "Hyun Oh", "family_name": "Song", "institution": "UC Berkeley"}, {"given_name": "Yong Jae", "family_name": "Lee", "institution": "UC Davis"}, {"given_name": "Stefanie", "family_name": "Jegelka", "institution": "UC Berkeley"}, {"given_name": "Trevor", "family_name": "Darrell", "institution": "UC Berkeley"}]}