{"title": "Object Detection with Grammar Models", "book": "Advances in Neural Information Processing Systems", "page_first": 442, "page_last": 450, "abstract": "Compositional models provide an elegant formalism for representing the visual appearance of highly variable objects. While such models are appealing from a theoretical point of view, it has been difficult to demonstrate that they lead to performance advantages on challenging datasets. Here we develop a grammar model for person detection and show that it outperforms previous high-performance systems on the PASCAL benchmark. Our model represents people using a hierarchy of deformable parts, variable structure and an explicit model of occlusion for partially visible objects. To train the model, we introduce a new discriminative framework for learning structured prediction models from weakly-labeled data.", "full_text": "Object Detection with Grammar Models\n\nRoss B. Girshick\n\nDept. of Computer Science\n\nUniversity of Chicago\n\nChicago, IL 60637\n\nrbg@cs.uchicago.edu\n\nPedro F. Felzenszwalb\nSchool of Engineering and\nDept. of Computer Science\n\nBrown University\n\nProvidence, RI 02912\npff@brown.edu\n\nDavid McAllester\n\nTTI-Chicago\n\nChicago, IL 60637\n\nmcallester@ttic.edu\n\nAbstract\n\nCompositional models provide an elegant formalism for representing the visual\nappearance of highly variable objects. While such models are appealing from a\ntheoretical point of view, it has been dif\ufb01cult to demonstrate that they lead to per-\nformance advantages on challenging datasets. Here we develop a grammar model\nfor person detection and show that it outperforms previous high-performance sys-\ntems on the PASCAL benchmark. Our model represents people using a hierar-\nchy of deformable parts, variable structure and an explicit model of occlusion for\npartially visible objects. 
To train the model, we introduce a new discriminative\nframework for learning structured prediction models from weakly-labeled data.\n\n1\n\nIntroduction\n\nThe idea that images can be hierarchically parsed into objects and their parts has a long history in\ncomputer vision, see for example [15]. Image parsing has also been of considerable recent interest\n[11, 13, 21, 22, 24]. However, it has been dif\ufb01cult to demonstrate that sophisticated compositional\nmodels lead to performance advantages on challenging metrics such as the PASCAL object detection\nbenchmark [9]. In this paper we achieve new levels of performance for person detection using a\ngrammar model that is richer than previous models used in high-performance systems. We also\nintroduce a general framework for learning discriminative models from weakly-labeled data.\nOur models are based on the object detection grammar formalism in [11]. Objects are represented\nin terms of other objects through compositional rules. Deformation rules allow for the parts of an\nobject to move relative to each other, leading to hierarchical deformable part models. Structural\nvariability provides choice between multiple part subtypes \u2014 effectively creating mixture models\nthroughout the compositional hierarchy \u2014 and also enables optional parts. In this formalism parts\nmay be reused both within an object category and across object categories.\nOur baseline and departure point is the UoC-TTI object detector [10, 12]. This system represents a\nclass of objects with three different pictorial structure models. Although these models are learned\nautomatically, making semantic interpretation unclear, it seems that the three components for the\nperson class differ in how much of the person is taken to be visible \u2014 just the head and shoulders,\nthe head and shoulders together with the upper body, or the whole standing person. Each of the three\ncomponents has independently trained parts. 
For example, each component has a head part trained\nindependently from the head part of the other components.\nHere we construct a single grammar model that allows more \ufb02exibility in describing the amount of\nthe person that is visible. The grammar model avoids dividing the training data between different\ncomponents and thus uses the training data more ef\ufb01ciently. The parts in the model, such as the\nhead part, are shared across different interpretations of the degree of visibility of the person. The\ngrammar model also includes subtype choice at the part level to accommodate greater appearance\n\n1\n\n\fvariability across object instances. We use parts with subparts to bene\ufb01t from high-resolution image\ndata, while also allowing for deformations. Unlike previous approaches, we explicitly model the\nsource of occlusion for partially visible objects.\nOur approach differs from that of Jin and Geman [13] in that theirs focuses on whole scene inter-\npretation with generative models, while we focus on discriminatively trained models of individual\nobjects. We also make Markovian restrictions not made in [13]. Our work is more similar to that of\nZhu et al. [21] who impose similar Markovian restrictions. However, our training method, image\nfeatures, and grammar design are substantially different.\nThe model presented here is designed to accurately capture the visible portion of a person. There\nhas been recent related work on occlusion modeling in pedestrian and person images [7, 18]. In\n[7], Enzweiler et al. assume access to depth and motion information in order to estimate occlusion\nboundaries. In [18], Wang et al. rely on the observation that the scores of individual \ufb01lter cells (using\nthe Dalal and Triggs detector [5]) can reliably predict occlusion in the INRIA pedestrian data. 
This does not hold for the harder PASCAL person data.\nIn addition to developing a grammar model for detecting people, we develop new training methods which contribute to our boost in performance. Training data for vision is often assigned weak labels such as bounding boxes or just the names of objects occurring in the image. In contrast, an image analysis system will often produce strong predictions such as a segmentation or a pose. Existing structured prediction methods, such as structural SVM [16, 17] and latent structural SVM [19], do not directly support weak labels together with strong predictions. We develop the notion of a weak-label structural SVM which generalizes structural SVMs and latent structural SVMs. The key idea is to introduce a loss L(y, s) for making a strong prediction s when the weak training label is y. A formalism for learning from weak labels was also developed in [2]. One important difference is that [2] generalizes ranking SVMs.1 Our framework also allows for softer relations between weak labels and strong predictions.\n\n2 Grammar models\n\nObject detection grammars [11] represent objects recursively in terms of other objects. Let N be a set of nonterminal symbols and T be a set of terminal symbols. We can think of the terminals as the basic building blocks that can be found in an image. The nonterminals define abstract objects whose appearance is defined in terms of expansions into terminals.\nLet \u2126 be a set of possible locations for a symbol within an image. A placed symbol, Y(\u03c9), specifies a placement of Y \u2208 N \u222a T at a location \u03c9 \u2208 \u2126. The structure of a grammar model is defined by a set, R, of weighted productions of the form\n\nX(\u03c90) \u2212s\u2192 { Y1(\u03c91), . . . , Yn(\u03c9n) },   (1)\n\nwhere X \u2208 N, Yi \u2208 N \u222a T, \u03c9i \u2208 \u2126 and s \u2208 R is a score. 
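To make the formalism concrete, the placed symbols and weighted productions of the form (1) can be sketched as plain data structures. This is an illustrative sketch, not the authors' implementation; all class and field names here are our own:

```python
from dataclasses import dataclass
from typing import List, Tuple

# A location omega in Omega: (x, y, scale index) in a feature map pyramid.
Location = Tuple[int, int, int]

@dataclass(frozen=True)
class PlacedSymbol:
    name: str      # a symbol from N (nonterminal) or T (terminal)
    loc: Location  # its placement omega in Omega

@dataclass
class Production:
    lhs: PlacedSymbol        # X(omega0), with X a nonterminal
    rhs: List[PlacedSymbol]  # { Y1(omega1), ..., Yn(omegan) }
    score: float             # the score s of this production

# Toy example: expand a placed start symbol into two part symbols.
q = PlacedSymbol("Q", (10, 10, 0))
rule = Production(
    lhs=q,
    rhs=[PlacedSymbol("Y1", (10, 8, 0)), PlacedSymbol("Y2", (10, 14, 0))],
    score=0.5,
)
assert rule.lhs.name == "Q" and len(rule.rhs) == 2
```

A derivation tree is then built by repeatedly choosing such productions for the placed nonterminals they expand.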
We denote the score of r \u2208 R by s(r).\nWe can expand a placed nonterminal to a bag of placed terminals by repeatedly applying productions. An expansion of X(\u03c9) leads to a derivation tree T rooted at X(\u03c9). The leaves of T are labeled with placed terminals, and the internal nodes of T are labeled with placed nonterminals and with the productions used to replace those symbols.\nWe define appearance models for the terminals using a function score(A, \u03c9) that computes a score for placing the terminal A at location \u03c9. This score depends implicitly on the image data. We define the score of a derivation tree T to be the sum of the scores of the productions used to generate T, plus the score of placing the terminals associated with T\u2019s leaves in their respective locations.\n\nscore(T) = \u2211_{r \u2208 internal(T)} s(r) + \u2211_{A(\u03c9) \u2208 leaves(T)} score(A, \u03c9)   (2)\n\nTo generalize the models from [10] we let \u2126 be positions and scales within a feature map pyramid H. We define the appearance models for terminals by associating a filter FA with each terminal A.\n1 [2] claims the ranking framework overcomes a loss in performance when the number of background examples is increased. In contrast, we don\u2019t use a ranking framework but always observed a performance improvement when increasing the number of background examples.\n\nFigure 1: Shallow grammar model. This figure illustrates a shallow version of our grammar model (Section 2.1). This model has six person parts and an occlusion model (\u201coccluder\u201d), each of which comes in one of two subtypes. A detection places one subtype of each visible part at a location and scale in the image. If the derivation does not place all parts it must place the occluder. 
Parts\nare allowed to move relative to each other, but their displacements are constrained by deformation\npenalties.\nThen score(A, \u03c9) = FA \u00b7 \u03c6(H, \u03c9) is the dot product between the \ufb01lter coef\ufb01cients and the features\nin a subwindow of the feature map pyramid, \u03c6(H, \u03c9). We use the variant of histogram of oriented\ngradient (HOG [5]) features described in [10].\nWe consider models with productions speci\ufb01ed by two kinds of schemas (a schema is a template for\ngenerating productions). A structure schema speci\ufb01es one production for each placement \u03c9 \u2208 \u2126,\n\nX(\u03c9)\n\ns\u2212\u2192 { Y1(\u03c9 \u2295 \u03b41), . . . , Yn(\u03c9 \u2295 \u03b4n) }.\n\n(3)\nHere the \u03b4i specify constant displacements within the feature map pyramid. Structure schemas can\nbe used to de\ufb01ne decompositions of objects into other objects.\nLet \u2206 be the set of possible displacements within a single scale of a feature map pyramid. A\ndeformation schema speci\ufb01es one production for each placement \u03c9 \u2208 \u2126 and displacement \u03b4 \u2208 \u2206,\n\n\u03b1\u00b7\u03c6(\u03b4)\u2212\u2192 { Y (\u03c9 \u2295 \u03b4) }.\n\nX(\u03c9)\n\n(4)\nHere \u03c6(\u03b4) is a feature vector and \u03b1 is a vector of deformation parameters. Deformation schemas\ncan be used to de\ufb01ne deformable models. We de\ufb01ne \u03c6(\u03b4) = (dx, dy, dx2, dy2) so that deformation\nscores are quadratic functions of the displacements.\nThe parameters of our models are de\ufb01ned by a weight vector w with entries for the score of each\nstructure schema, the deformation parameters of each deformation schema and the \ufb01lter coef\ufb01cients\nassociated with each terminal. 
Then score(T) = w \u00b7 \u03a6(T), where \u03a6(T) is the sum of (sparse) feature vectors associated with each placed terminal and production in T.\n\n2.1 A grammar model for detecting people\n\nEach component in the person model learned by the voc-release4 system [12] is tuned to detect people under a prototypical visibility pattern. Based on this observation we designed, by hand, the structure of a grammar that models visibility by using structural variability and optional parts. For clarity, we begin by describing a shallow model (Figure 1) that places all filters at the same resolution in the feature map pyramid. After explaining this model, we describe a deeper model that includes deformable subparts at higher resolutions.\nFine-grained occlusion Our grammar model has a start symbol Q that can be expanded using one of six possible structure schemas. These choices model different degrees of visibility ranging from heavy occlusion (only the head and shoulders are visible) to no occlusion at all.\nBeyond modeling fine-grained occlusion patterns when compared to the mixture models from [7] and [12], our grammar model is also richer in a number of ways. In Section 5 we show that each of the following modeling choices improves detection performance.\nOcclusion model If a person is occluded, then there must be some cause of the occlusion \u2014 either the edge of the image or an occluding object, such as a desk or dinner table. We use a nontrivial model to capture the appearance of the stuff that occludes people.\nPart subtypes The mixture model from [12] has two subtypes for each mixture component. The subtypes are forced to be mirror images of each other and correspond roughly to left-facing versus right-facing people. 
Our grammar model has two subtypes for each part, which are also forced to be mirror images of each other. But in the case of our grammar model, the decision of which part subtype to instantiate at detection time is independent for each part.\nThe shallow person grammar model is defined by the following grammar. The indices p (for part), t (for subtype), and k have the following ranges: p \u2208 {1, . . . , 6}, t \u2208 {L, R} and k \u2208 {1, . . . , 5}.\n\nQ(\u03c9) \u2212sk\u2192 { Y1(\u03c9 \u2295 \u03b41), . . . , Yk(\u03c9 \u2295 \u03b4k), O(\u03c9 \u2295 \u03b4k+1) }\nQ(\u03c9) \u2212s6\u2192 { Y1(\u03c9 \u2295 \u03b41), . . . , Y6(\u03c9 \u2295 \u03b46) }\nYp(\u03c9) \u22120\u2192 { Yp,t(\u03c9) }\nO(\u03c9) \u22120\u2192 { Ot(\u03c9) }\nYp,t(\u03c9) \u2212\u03b1p,t\u00b7\u03c6(\u03b4)\u2192 { Ap,t(\u03c9 \u2295 \u03b4) }\nOt(\u03c9) \u2212\u03b1t\u00b7\u03c6(\u03b4)\u2192 { At(\u03c9 \u2295 \u03b4) }\n\nThe grammar has a start symbol Q with six alternate choices that derive people under varying degrees of visibility (occlusion). Each part has a corresponding nonterminal Yp that is placed at some ideal position relative to Q. Derivations with occlusion include the occlusion symbol O. A derivation selects a subtype and displacement for each visible part. The parameters of the grammar (production scores, deformation parameters and filters) are learned with the discriminative procedure described in Section 4. Figure 1 illustrates the filters in the resulting model and some example detections.\nDeeper model We extend the shallow model by adding deformable subparts at two scales: (1) the same as, and (2) twice the resolution of the start symbol Q. When detecting large objects, high-resolution subparts capture fine image details. 
However, when detecting small objects, high-resolution subparts cannot be used because they \u201cfall off the bottom\u201d of the feature map pyramid. The model uses derivations with low-resolution subparts when detecting small objects.\nWe begin by replacing the productions from Yp,t in the grammar above, and then adding new productions. Recall that p indexes the top-level parts and t indexes subtypes. In the following schemas, the indices r (for resolution) and u (for subpart) have the ranges: r \u2208 {H, L}, u \u2208 {1, . . . , Np}, where Np is the number of subparts in a top-level part Yp.\n\nYp,t(\u03c9) \u2212\u03b1p,t\u00b7\u03c6(\u03b4)\u2192 { Zp,t(\u03c9 \u2295 \u03b4) }\nZp,t(\u03c9) \u22120\u2192 { Ap,t(\u03c9), Wp,t,r,1(\u03c9 \u2295 \u03b4p,t,r,1), . . . , Wp,t,r,Np(\u03c9 \u2295 \u03b4p,t,r,Np) }\nWp,t,r,u(\u03c9) \u2212\u03b1p,t,r,u\u00b7\u03c6(\u03b4)\u2192 { Ap,t,r,u(\u03c9 \u2295 \u03b4) }\n\nWe note that as in [23] our model has hierarchical deformations. The part terminal Ap,t can move relative to Q and the subpart terminal Ap,t,r,u can move relative to Ap,t.\nThe displacements \u03b4p,t,H,u place the symbols Wp,t,H,u one octave below Zp,t in the feature map pyramid. The displacements \u03b4p,t,L,u place the symbols Wp,t,L,u at the same scale as Zp,t. We add subparts to the first two top-level parts (p = 1 and 2), with the number of subparts set to N1 = 3 and N2 = 2. We find that adding additional subparts does not improve detection performance.\n\n2.2 Inference and test time detection\n\nInference involves finding high scoring derivations. At test time, because images may contain multiple instances of an object class, we compute the maximum scoring derivation rooted at Q(\u03c9), for each \u03c9 \u2208 \u2126. 
This can be done ef\ufb01ciently using a standard dynamic programming algorithm [11].\nWe retain only those derivations that score above a threshold, which we set low enough to ensure\nhigh recall. We use box(T ) to denote a detection window associated with a derivation T . Given a\nset of candidate detections, we apply nonmaximal suppression to produce a \ufb01nal set of detections.\nWe de\ufb01ne box(T ) by assigning a detection window size, in feature map coordinates, to each struc-\nture schema that can be applied to Q. This leads to detections with one of six possible aspect ratios,\ndepending on which production was used in the \ufb01rst step of the derivation. The absolute location\nand size of a detection depends on the placement of Q. For the \ufb01rst \ufb01ve production schemas, the\nideal location of the occlusion part, O, is outside of box(T ).\n\n4\n\n\f3 Learning from weakly-labeled data\nHere we de\ufb01ne a general formalism for learning functions from weakly-labeled data. Let X be an\ninput space, Y be a label space, and S be an output space. We are interested in learning functions\nf : X \u2192 S based on a set of training examples {(x1, y1), . . . , (xn, yn)} where (xi, yi) \u2208 X \u00d7 Y.\nIn contrast to the usual supervised learning setting, we do not assume that the label space and the\noutput space are the same. In particular there may be many output values that are compatible with\na label, and we can think of each example as being only weakly labeled. It will also be useful to\nassociate a subset of possible outputs, S(x) \u2286 S, with an example x. In this case f (x) \u2208 S(x).\nA connection between labels and outputs can be made using a loss function L : Y \u00d7S \u2192 R. L(y, s)\nassociates a cost with the prediction s \u2208 S on an example labeled y \u2208 Y. Let D be a distribution\nover X \u00d7 Y. 
Then a natural goal is to find a function f with low expected loss E_D[L(y, f(x))].\nA simple example of a weakly-labeled training problem comes from learning sliding window classifiers in the PASCAL object detection dataset. The training data specifies pixel-accurate bounding boxes for the target objects while a sliding window classifier reports boxes with a fixed aspect ratio and at a finite number of scales. The output space is, therefore, a subset of the label space.\nAs usual, we assume f is parameterized by a vector of model parameters w and generates predictions by maximizing a linear function of a joint feature map \u03a6(x, s), f(x) = argmax_{s \u2208 S(x)} w \u00b7 \u03a6(x, s). We can train w by minimizing a regularized risk on the training set. We define a weak-label structural SVM (WL-SSVM) by the following training equation,\n\nE(w) = (1/2)||w||\u00b2 + C \u2211_{i=1}^{n} L\u2032(w, xi, yi).   (5)\n\nThe surrogate training loss L\u2032 is defined in terms of two different loss augmented predictions,\n\nL\u2032(w, x, y) = max_{s \u2208 S(x)} [w \u00b7 \u03a6(x, s) + Lmargin(y, s)] \u2212 max_{s \u2208 S(x)} [w \u00b7 \u03a6(x, s) \u2212 Loutput(y, s)],   (6)\n\nwhere the first maximization is term (6a) and the second is term (6b).\nLmargin encourages high-loss outputs to \u201cpop out\u201d of (6a), so that their scores get pushed down. Loutput suppresses high-loss outputs in (6b), so the score of a low-loss prediction gets pulled up.\nIt is natural to take Lmargin = Loutput = L. In this case L\u2032 becomes a type of ramp loss [4, 6, 14]. Alternatively, taking Lmargin = L and Loutput = 0 gives the ramp loss that has been shown to be consistent in [14]. As we discuss below, the choice of Loutput can have a significant effect on the computational difficulty of the training problem. 
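For a finite set of candidate outputs S(x), the surrogate loss in (6) is just a difference of two maximizations. The following sketch is ours, with a toy feature map and a 0-1 loss standing in for Lmargin and Loutput; it is not the paper's implementation:

```python
import numpy as np

def surrogate_loss(w, phi, outputs, y, L_margin, L_output):
    """WL-SSVM surrogate loss L'(w, x, y) for a finite output set S(x).

    phi(s) returns the joint feature vector Phi(x, s); L_margin and
    L_output map (y, s) to loss values, as in equation (6).
    """
    term_a = max(w @ phi(s) + L_margin(y, s) for s in outputs)  # (6a)
    term_b = max(w @ phi(s) - L_output(y, s) for s in outputs)  # (6b)
    return term_a - term_b

# Toy problem: three outputs with 2-d features and a 0-1 loss.
feats = {"a": np.array([1.0, 0.0]),
         "b": np.array([0.0, 1.0]),
         "c": np.array([0.5, 0.5])}
phi = lambda s: feats[s]
zero_one = lambda y, s: 0.0 if s == y else 1.0

w = np.array([1.0, 0.2])
loss = surrogate_loss(w, phi, feats.keys(), "a",
                      L_margin=zero_one, L_output=zero_one)
# With L_output <= L_margin pointwise, L' is nonnegative.
assert loss >= 0.0
```

Setting `L_output = lambda y, s: 0.0` instead recovers the consistent ramp loss variant mentioned above.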
Several popular learning frameworks can be derived as special cases of WL-SSVM. For the examples below, let I(a, b) = 0 when a = b, and I(a, b) = \u221e when a \u2260 b.\nStructural SVM Let S = Y, Lmargin = L and Loutput(y, \u02c6y) = I(y, \u02c6y). Then L\u2032(w, x, y) is the hinge loss used in a structural SVM [17]. In this case L\u2032 is convex in w because the maximization in (6b) disappears. We note, however, that this choice of Loutput may be problematic and lead to inconsistent training problems. Consider the following situation. A training example (x, y) may be compatible with a different label \u02c6y \u2260 y, in the sense that L(y, \u02c6y) = 0. But even in this case a structural SVM pushes the score w \u00b7 \u03a6(x, y) to be above w \u00b7 \u03a6(x, \u02c6y). This issue can be addressed by relaxing Loutput to include a maximization over labels in (6b).\nLatent structural SVM Now let Z be a space of latent values, S = Y \u00d7 Z, Lmargin = L and Loutput(y, (\u02c6y, \u02c6z)) = I(y, \u02c6y). Then L\u2032(w, x, y) is the hinge loss used in a latent structural SVM [19]. In this case L\u2032 is not convex in w due to the maximization over latent values in (6b). As in the previous example, this choice of Loutput can be problematic because it \u201crequires\u201d that the training labels be predicted exactly. This can be addressed by relaxing Loutput, as in the previous example.\n\n4 Training grammar models\n\nNow we consider learning the parameters of an object detection grammar using the training data in the PASCAL VOC datasets with the WL-SSVM framework. For two rectangles a and b let overlap(a, b) = area(a \u2229 b) / area(a \u222a b). We will use this measure of overlap in our loss functions.\nFor training, we augment our model\u2019s output space (the set of all derivation trees) with the background output \u22a5. We define \u03a6(x, \u22a5) to be the zero vector, as was done in [1]. 
Thus the score of a background hypothesis is zero independent of the model parameters w.\nThe training data specifies a bounding box for each instance of an object in a set of training images. We construct a set of weakly-labeled examples {(x1, y1), . . . , (xn, yn)} as follows. For each training image I, and for each bounding box B in I, we define a foreground example (x, y), where y = B, x specifies the image I, and the set of valid predictions S(x) includes:\n\n1. Derivations T with overlap(box(T), B) \u2265 0.1 and overlap(box(T), B\u2032) < 0.5 for all B\u2032 in I such that B\u2032 \u2260 B.\n2. The background output \u22a5.\n\nThe overlap requirements in (1) ensure that we consider only predictions that are relevant for a particular object instance, while avoiding interactions with other objects in the image.\nWe also define a very large set of background examples. For simplicity, we use images that do not contain any bounding boxes. For each background image I, we define a different example (x, y) for each position and scale \u03c9 within I. In this case y = \u22a5, x specifies the image I, and S(x) includes derivations T rooted at Q(\u03c9) and the background output \u22a5. The set of background examples is very large because the number of positions and scales within each image is typically around 250K.\n\n4.1 Loss functions\n\nThe PASCAL benchmark requires a correct detection to have at least 50% overlap with a ground-truth bounding box. We use this rule to define our loss functions. First, define Ll,\u03c4(y, s) as follows\n\nLl,\u03c4(y, s) =\n  l  if y = \u22a5 and s \u2260 \u22a5\n  0  if y = \u22a5 and s = \u22a5\n  l  if y \u2260 \u22a5 and overlap(y, s) < \u03c4\n  0  if y \u2260 \u22a5 and overlap(y, s) \u2265 \u03c4.   (7)\n\nFollowing the PASCAL VOC protocol we use Lmargin = L1,0.5. 
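The overlap measure and the piecewise loss in (7) can be sketched as follows. This is our own illustrative reading, representing boxes as (x1, y1, x2, y2) tuples and the background output as `None`; we additionally treat a non-background label paired with a background prediction as overlap below tau, which equation (7) leaves implicit:

```python
def overlap(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

BACKGROUND = None  # stands in for the background output

def loss_l_tau(l, tau, y, s):
    """The loss L_{l,tau}(y, s) from equation (7)."""
    if y is BACKGROUND:
        return 0.0 if s is BACKGROUND else l  # false positive costs l
    if s is BACKGROUND or overlap(y, s) < tau:
        return l  # missed or poorly localized detection costs l
    return 0.0

# L_margin = L_{1,0.5}, per the PASCAL 50% overlap rule.
y = (0, 0, 10, 10)
assert loss_l_tau(1.0, 0.5, y, (0, 0, 10, 10)) == 0.0   # perfect overlap
assert loss_l_tau(1.0, 0.5, y, (20, 20, 30, 30)) == 1.0  # disjoint boxes
```

Under this reading, Loutput = L∞,0.7 assigns infinite loss to any output overlapping the label by less than 70%, which is what restricts the maximizer of (6b) to high-overlap detections.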
For a foreground example this pushes down the score of detections that don\u2019t overlap with the bounding box label by at least 50%.\nInstead of using Loutput = Lmargin, we let Loutput = L\u221e,0.7. For a foreground example this ensures that the maximizer of (6b) is a detection with high overlap with the bounding box label. For a background example, the maximizer of (6b) is always \u22a5. Later we discuss how this simplifies our optimization algorithm. While our choice of Loutput does not produce a convex objective, it does tightly limit the range of outputs, making our optimization less prone to reaching bad local optima.\n\n4.2 Optimization\n\nSince L\u2032 is not convex, the WL-SSVM objective (5) leads to a nonconvex optimization problem. We follow [19] in which the CCCP procedure [20] was used to find a local optimum of a similar objective. CCCP is an iterative algorithm that uses a decomposition of the objective into a sum of convex and concave parts E(w) = Econvex(w) + Econcave(w).\n\nEconvex(w) = (1/2)||w||\u00b2 + C \u2211_{i=1}^{n} max_{s \u2208 S(xi)} [w \u00b7 \u03a6(xi, s) + Lmargin(yi, s)]   (8)\nEconcave(w) = \u2212C \u2211_{i=1}^{n} max_{s \u2208 S(xi)} [w \u00b7 \u03a6(xi, s) \u2212 Loutput(yi, s)]   (9)\n\nIn each iteration, CCCP computes a linear upper bound to Econcave based on a current weight vector wt. The bound depends on subgradients of the summands in (9). For each summand, we take the subgradient \u03a6(xi, si(wt)), where si(w) = argmax_{s \u2208 S(xi)} [w \u00b7 \u03a6(xi, s) \u2212 Loutput(yi, s)] is a loss augmented prediction.\nWe note that computing si(wt) for each training example can be costly. But from our definition of Loutput, we have that si(w) = \u22a5 for a background example independent of w. Therefore, for a background example \u03a6(xi, si(wt)) = 0.\n\nTable 1: PASCAL 2010 results. UoC-TTI and our method compete in comp3. Poselets competes in comp4 due to its use of detailed pose and visibility annotations and non-PASCAL images.\n\n     | Grammar | +bbox | +context | UoC-TTI [9] | +bbox | +context | Poselets [9]\nAP   | 47.5    | 47.6  | 49.5     | 44.4        | 45.2  | 47.5     | 48.5\n\nTable 2: Training objective and model structure evaluation on PASCAL 2007.\n\n     | Grammar LSVM | Grammar WL-SSVM | Mixture LSVM | Mixture WL-SSVM\nAP   | 45.3         | 46.7            | 42.6         | 43.2\n\nAfter computing si(wt) and \u03a6(xi, si(wt)) for all examples (implicitly for background examples), the weight vector is updated by minimizing a convex upper bound on the objective E(w):\n\nwt+1 = argmin_w (1/2)||w||\u00b2 + C \u2211_{i=1}^{n} [ max_{s \u2208 S(xi)} [w \u00b7 \u03a6(xi, s) + Lmargin(yi, s)] \u2212 w \u00b7 \u03a6(xi, si(wt)) ].   (10)\n\nThe optimization subproblem defined by equation (10) is similar in form to a structural SVM optimization. Given the size and nature of our training dataset we opt to solve this subproblem using stochastic subgradient descent and a modified form of the data mining procedure from [10]. As in [10], we data mine over background images to collect support vectors for background examples. However, unlike in the binary LSVM setting considered in [10], we also need to apply data mining to foreground examples. This would be slow because it requires performing relatively expensive inference (more than 1 second per image) on thousands of images. Instead of applying data mining to the foreground examples, each time we compute si(wt) for a foreground example, we also compute the top M scoring outputs s \u2208 S(xi) of wt \u00b7 \u03a6(xi, s) + Lmargin(yi, s), and place the corresponding feature vectors in the data mining cache. This is efficient since much of the required computation is shared with computation already necessary for computing si(wt). 
While this is only a heuris-\ntic approximation to true data mining, it leads to an improvement over training with binary LSVM\n(see Section 5). In practice, we \ufb01nd that M = 1 is suf\ufb01cient for improved performance and that\nincreasing M beyond 1 does not improve our results.\n\n4.3\n\nInitialization\n\nUsing CCCP requires an initial model or heuristic for selecting the initial outputs si(w0). Inspired\nby the methods in [10, 12], we train a single \ufb01lter for fully visible people using a standard binary\nSVM. To de\ufb01ne the SVM\u2019s training data, we select vertically elongated examples. We apply the ori-\nentation clustering method in [12] to further divide these examples into two sets that approximately\ncorrespond to left-facing versus right-facing orientations. Examples from one of these two sets are\nthen anisotropically rescaled so their HOG feature maps match the dimensions of the \ufb01lter. These\nform the positive examples. For negative examples, random patches are extracted from background\nimages. After training the initial \ufb01lter, we slice it into sub\ufb01lters (one 8 \u00d7 8 and \ufb01ve 3 \u00d7 8) that form\nthe building blocks of the grammar model. We mirror these six \ufb01lters to get subtypes, and then add\nsubparts using the energy covering heuristic in [10, 12].\n\n5 Experimental results\n\nWe evaluated the performance of our person grammar and training framework on the PASCAL VOC\n2007 and 2010 datasets [8, 9]. We used the standard PASCAL VOC comp3 test protocol, which\nmeasures detection performance by average precision (AP) over different recall levels. Figure 2\nshows some qualitative results, including failure cases.\nPASCAL VOC 2010 Our results on the 2010 dataset are presented in Table 1 in the context of\ntwo strong baselines. The \ufb01rst, UoC-TTI, won the person category in the comp3 track of the 2010\ncompetition [9]. 
The 2010 entry of the UoC-TTI method extended [12] by adding an extra octave\nto the HOG feature map pyramid, which allows the detector to \ufb01nd smaller objects. We report the\nAP score of the UoC-TTI \u201craw\u201d person detector, as well as the scores after applying the bounding\n\n7\n\n\f(a) Full visibility\n\n(b) Occlusion boundaries\n\n(c) Early termination\n\n(d) Mistakes\n\nFigure 2: Example detections. Parts are blue. The occlusion part, if used, is dashed cyan. (a) Detec-\ntions of fully visible people. (b) Examples where the occlusion part detects an occlusion boundary.\n(c) Detections where there is no occlusion, but a partial person is appropriate. (d) Mistakes where\nthe model did not detect occlusion properly.\n\nbox prediction and context rescoring methods described in [10]. Comparing raw detector outputs\nour grammar model signi\ufb01cantly outperforms the mixture model: 47.5 vs. 44.4.\nWe also applied the two post-processing steps to the grammar model, and found that unlike with\nthe mixture model, the grammar model does not bene\ufb01t from bounding box prediction. This is\nlikely because our \ufb01ne-grained occlusion model reduces the number of near misses that are \ufb01xed\nby bounding box prediction. To test context rescoring, we used the UoC-TTI detection data for the\nother 19 object classes. Context rescoring boosts our \ufb01nal score to 49.5.\nThe second baseline is the poselets system described in [3]. Their system requires detailed pose and\nvisibility annotations, in contrast to our grammar model which was trained only with bounding box\nlabels. Prior to context rescoring, our model scores one point lower than the poselets model, and\nafter rescoring it scores one point higher.\nStructure and training We evaluated several aspects of our model structure and training objective\non the PASCAL VOC 2007 dataset. In all of our experiments we set the regularization constant\nto C = 0.006. 
In Table 2 we compare the WL-SSVM framework developed here with the binary LSVM framework from [10]. WL-SSVM improves performance of the grammar model by 1.4 AP points over binary LSVM training. WL-SSVM also improves results obtained using a mixture of part-based models by 0.6 points. To investigate model structure, we evaluated the effect of part subtypes and occlusion modeling. Removing subtypes reduces the score of the grammar model from 46.7 to 45.5. Removing the occlusion part also decreases the score from 46.7 to 45.5. The shallow model (no subparts) achieves a score of 40.6.

6 Discussion

Our results establish grammar-based methods as a high-performance approach to object detection by demonstrating their effectiveness on the challenging task of detecting people in the PASCAL VOC datasets. To do this, we carefully designed a flexible grammar model that can detect people under a wide range of partial occlusion, pose, and appearance variability. Automatically learning the structure of grammar models remains a significant challenge for future work. We hope that our empirical success will provide motivation for pursuing this goal, and that the structure of our handcrafted grammar will yield insights into the properties that an automatically learned grammar might require. We also develop a structured training framework, weak-label structural SVM, that naturally handles learning a model with strong outputs, such as derivation trees, from data with weak labels, such as bounding boxes. Our training objective is nonconvex, and we use a strong loss function to avoid bad local optima. We plan to explore making this loss softer, in an effort to make learning more robust to outliers.

Acknowledgments This research has been supported by NSF grant IIS-0746569.

References

[1] M. Blaschko and C. Lampert. Learning to localize objects with structured output regression. In ECCV, 2008.

[2] M. Blaschko, A. Vedaldi, and A. Zisserman.
Simultaneous object detection and ranking with weak supervision. In NIPS, 2010.

[3] L. Bourdev, S. Maji, T. Brox, and J. Malik. Detecting people using mutually consistent poselet activations. In ECCV, 2010.

[4] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In ICML, 2006.

[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[6] C. Do, Q. Le, C. Teo, O. Chapelle, and A. Smola. Tighter bounds for structured estimation. In NIPS, 2008.

[7] M. Enzweiler, A. Eigenstetter, B. Schiele, and D. M. Gavrila. Multi-cue pedestrian classification with partial occlusion handling. In CVPR, 2010.

[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.

[10] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 2009.

[11] P. Felzenszwalb and D. McAllester. Object detection grammars. University of Chicago, CS Dept., Tech. Rep. 2010-02.

[12] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.

[13] Y. Jin and S. Geman. Context and hierarchy in a probabilistic image model. In CVPR, 2006.

[14] D. McAllester and J. Keshet. Generalization bounds and consistency for latent structural probit and ramp loss. In NIPS, 2011.

[15] Y. Ohta, T. Kanade, and T. Sakai. An analysis system for scenes containing objects with substructures.
In ICPR, 1978.

[16] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.

[17] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2006.

[18] X. Wang, T. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In ICCV, 2009.

[19] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.

[20] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 2003.

[21] L. Zhu, Y. Chen, A. Torralba, W. Freeman, and A. Yuille. Part and appearance sharing: Recursive compositional models for multi-view multi-object detection. In CVPR, 2010.

[22] L. Zhu, Y. Chen, and A. Yuille. Unsupervised learning of probabilistic grammar-Markov models for object categories. PAMI, 2009.

[23] L. Zhu, Y. Chen, A. Yuille, and W. Freeman. Latent hierarchical structural learning for object detection. In CVPR, 2010.

[24] S. Zhu and D. Mumford. A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2006.