{"title": "3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model", "book": "Advances in Neural Information Processing Systems", "page_first": 611, "page_last": 619, "abstract": "This paper addresses the problem of category-level 3D object detection. Given a monocular image, our aim is to localize the objects in 3D by enclosing them with tight oriented 3D bounding boxes. We propose a novel approach that extends the well-acclaimed deformable part-based model [1] to reason in 3D. Our model represents an object class as a deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect to their anchors on the 3D box. We model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out the appearance variation induced by viewpoint. Our model reasons about face visibility patterns called aspects. We train the cuboid model jointly and discriminatively and share weights across all aspects to attain efficiency. Inference then entails sliding and rotating the box in 3D and scoring object hypotheses. While for inference we discretize the search space, the variables are continuous in our model. We demonstrate the effectiveness of our approach in indoor and outdoor scenarios, and show that our approach outperforms the state-of-the-art in both 2D [1] and 3D object detection [2].", "full_text": "3D Object Detection and Viewpoint Estimation with a\n\nDeformable 3D Cuboid Model\n\nSanja Fidler\nTTI Chicago\n\nfidler@ttic.edu\n\nSven Dickinson\n\nUniversity of Toronto\n\nsven@cs.toronto.edu\n\nRaquel Urtasun\n\nTTI Chicago\n\nrurtasun@ttic.edu\n\nAbstract\n\nThis paper addresses the problem of category-level 3D object detection. Given\na monocular image, our aim is to localize the objects in 3D by enclosing them\nwith tight oriented 3D bounding boxes. We propose a novel approach that extends\nthe well-acclaimed deformable part-based model [1] to reason in 3D. 
Our model\nrepresents an object class as a deformable 3D cuboid composed of faces and parts,\nwhich are both allowed to deform with respect to their anchors on the 3D box. We\nmodel the appearance of each face in fronto-parallel coordinates, thus effectively\nfactoring out the appearance variation induced by viewpoint. Our model reasons\nabout face visibility patterns called aspects. We train the cuboid model jointly and\ndiscriminatively and share weights across all aspects to attain ef\ufb01ciency. Inference\nthen entails sliding and rotating the box in 3D and scoring object hypotheses.\nWhile for inference we discretize the search space, the variables are continuous\nin our model. We demonstrate the effectiveness of our approach in indoor and\noutdoor scenarios, and show that our approach signi\ufb01cantly outperforms the state-of-the-art\nin both 2D [1] and 3D object detection [2].\n\n1 Introduction\n\nEstimating semantic 3D information from monocular images is an important task in applications\nsuch as autonomous driving and personal robotics. Consider, for example, the case of an autonomous\nagent driving around a city. In order to properly react to dynamic situations, such an agent\nneeds to reason about which objects are present in the scene, as well as their 3D location, orientation,\nand 3D extent. Likewise, a home robot requires accurate 3D information in order to navigate in\ncluttered environments as well as grasp and manipulate objects.\nWhile impressive performance has been achieved for instance-level 3D object recognition [3],\ncategory-level 3D object detection has proven to be a much harder task, due to intra-class variation\nas well as appearance variation due to viewpoint changes. The most common approach to\n3D detection is to discretize the viewing sphere into bins and train a 2D detector for each viewpoint [4, 5, 1, 6]. 
However, these approaches output rather weak 3D information, where typically a\n2D bounding box around the object is returned along with an estimated discretized viewpoint.\nIn contrast, object-centered approaches represent and reason about objects using more sophisticated\n3D models. The main idea is to index (or vote) into a parameterized pose space with local geometric [7] or appearance features that bear only weak viewpoint dependencies [8, 9, 10, 11]. The main\nadvantage of this line of work is that it enables a continuous pose representation [10, 11, 12, 8], 3D\nbounding box prediction [8], and potentially requires fewer training examples due to its more compact visual representation.\n\n\fFigure 1: Left: Our deformable 3D cuboid model. Right: Viewpoint angle \u03b8.\n\nUnfortunately, these approaches work with weaker appearance models\nthat cannot compete with current discriminative approaches [1, 6, 13]. Recently, Hedau et al. [2]\nproposed to extend the 2D HOG-based template detector of [14] to predict 3D cuboids. However,\nsince the model represents the object\u2019s appearance as a rigid template in 3D, its performance has been\nshown to be inferior to (2D) deformable part-based models (DPMs) [1].\nIn contrast, in this paper we extend DPM to reason in 3D. Our model represents an object class with\na deformable 3D cuboid composed of faces and parts, which are both allowed to deform with respect\nto their anchors on the 3D box (see Fig. 1). Towards this goal, we introduce the notion of a stitching\npoint, which enables the deformation between the faces and the cuboid to be encoded ef\ufb01ciently.\nWe model the appearance of each face in fronto-parallel coordinates, thus effectively factoring out\nthe appearance variation due to viewpoint. We reason about different face visibility patterns called\naspects [15]. We train the cuboid model jointly and discriminatively and share weights across all\naspects to attain ef\ufb01ciency. 
In inference, our model outputs 2D as well as oriented 3D bounding\nboxes around the objects. This enables the estimation of the object\u2019s viewpoint, which is a continuous\nvariable in our representation. We demonstrate the effectiveness of our approach in indoor [2] and\noutdoor scenarios [16], and show that our approach signi\ufb01cantly outperforms the state-of-the-art in\nboth 2D [1] and 3D object detection [2].\n2 Related work\n\nThe most common way to tackle 3D detection is to represent a 3D object by a collection of independent 2D appearance models [4, 5, 1, 6, 13], one for each viewpoint. Several authors augmented\nthe multi-view representation with weak 3D information by linking the features or parts across\nviews [17, 18, 19, 20, 21]. This allows for a dense representation of the viewing sphere by morphing\nrelated nearby views [12]. Since these methods usually require a signi\ufb01cant amount of training\ndata, renderings of synthetic CAD models have been used to supplement under-represented views\nor provide supervision for training object parts or object geometry [22, 13, 8].\nObject-centered approaches represent object classes with a 3D model typically equipped with view-invariant\ngeometry and appearance [7, 23, 24, 8, 9, 10, 11, 25]. While these types of models are\nattractive as they enable continuous viewpoint representations, their detection performance has typically\nbeen inferior to 2D deformable models.\nDeformable part-based models (DPMs) [1] are nowadays arguably the most successful approach\nto category-level 2D detection. Towards 3D, DPMs have been extended to reason about object\nviewpoint by training the mixture model with viewpoint supervision [6, 13]. Pepik et al. [13] took\na step further by incorporating supervision also at the part level. Consistency was enforced by\nforcing the parts for different 2D viewpoint models to belong to the same set of 3D parts in the\nphysical space. 
However, all these approaches base their representation in 2D and thus output only\n2D bounding boxes along with a discretized viewpoint.\nThe closest work to ours is [2], which models an object with a rigid 3D cuboid, composed of independently\ntrained faces without deformations or parts. Our model shares certain similarities with\nthis work, but has a set of important differences. First, our model is hierarchical and deformable:\nwe allow deformations of the faces, while the faces themselves are composed of deformable parts.\nWe also explicitly reason about the visibility patterns of the cuboid model and train the model accordingly.\nFurthermore, all the parameters in our model are trained jointly using a latent SVM\nformulation. These differences are important, as our approach outperforms [2] by a signi\ufb01cant margin.\n\n\fFigure 2: Aspects, together with the range of \u03b8 that they cover, for (left) cars and (right) beds.\n\nFinally, in concurrent work, Xiang and Savarese [26] introduced a deformable 3D aspect model,\nwhere an object is represented as a set of planar parts in 3D. This model shares many similarities\nwith our approach; however, unlike ours, it requires a collection of CAD models in training.\n\n3 A Deformable 3D Cuboid Model\n\nIn this paper, we are interested in the problem of estimating, from a single image, the 3D location\nand orientation of the objects present in the scene. We parameterize the problem as that of\nestimating a tight 3D bounding box around each object. Our 3D box is oriented, as we reason about\nthe correspondences between the faces in the estimated bounding box and the faces of our model\n(i.e., which face is the top face, front face, etc.). Towards this goal, we represent an object class as\na deformable 3D cuboid, which is composed of 6 deformable faces, i.e., their locations and scales\ncan deviate from their anchors on the cuboid. 
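To make the box parameterization concrete, the following minimal sketch generates the corners of an oriented 3D box, rotates it by an azimuth, and projects it with a pinhole camera. The intrinsics, axis conventions, and all function names are our own illustrative assumptions, not the paper's implementation:

```python
import math

def cuboid_corners(dims):
    """8 corners of a box centered at the origin; dims = (width, height, length)."""
    w, h, l = dims
    return [(sx * w / 2, sy * h / 2, sz * l / 2)
            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]

def rotate_y(p, theta):
    """Rotate a 3D point about the vertical (y) axis by azimuth theta (radians)."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)

def project(p, f=500.0, cx=320.0, cy=240.0):
    """Pinhole projection with made-up intrinsics (focal length f, principal point)."""
    x, y, z = p
    return (f * x / z + cx, f * y / z + cy)

def project_cuboid(center, dims, theta):
    """Image coordinates of an oriented 3D box: rotate, translate, project."""
    tx, ty, tz = center
    return [project((x + tx, y + ty, z + tz))
            for (x, y, z) in (rotate_y(p, theta) for p in cuboid_corners(dims))]

# A bed-sized box 6 units in front of the camera, rotated by 30 degrees.
corners_2d = project_cuboid(center=(0.0, 0.5, 6.0), dims=(2.0, 1.0, 4.0),
                            theta=math.radians(30.0))
```

Sliding and rotating the box during inference then amounts to scanning over the center and the angle and scoring each projected hypothesis.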
The model for each cuboid\u2019s face is a 2D template\nthat represents the appearance of the object in view-recti\ufb01ed coordinates, i.e., where the face is\nfrontal. Additionally, we augment each face with parts, and employ a deformation model between\nthe locations of the parts and the anchor points on the face they belong to. We assume that any\nviewpoint of an object in the image domain can be modeled by rotating our cuboid in 3D, followed\nby perspective projection onto the image plane. Thus, inference involves sliding and rotating the\ndeformable cuboid in 3D and scoring the hypotheses.\nA necessary component of any 3D model is to properly reason about the face visibility of the object\n(in our case, the cuboid). Assuming a perspective camera, for any given viewpoint, at most 3 faces\nare visible in an image. Topologically different visibility patterns de\ufb01ne different aspects [15] of\nthe object. Note that a cuboid can have up to 26 aspects; however, not all necessarily occur for\neach object class. For example, for objects supported by the \ufb02oor, the bottom face will never be\nvisible. For cars, typically the top face is not visible either. Our model only reasons about the\noccurring aspects of the object class of interest, which we estimate from the training data. Note\nthat the visibility, and thus the aspect, is a function of the 3D orientation and position of a cuboid\nhypothesis with respect to the camera. We de\ufb01ne \u03b8 to be the angle between the outer normal to the\nfront face of the cuboid hypothesis and the vector connecting the camera and the center of the 3D\nbox. We refer the reader to Fig. 1 for a visualization. Assuming a camera overlooking the center of\nthe cuboid, Fig. 2 shows the range of the cuboid orientation angle on the viewing sphere for which\neach aspect occurs in the datasets of [2, 16], which we employ for our experiments. 
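The visibility reasoning above reduces to a sign test on rotated face normals; in the sketch below, the face names, axis conventions, and camera-direction argument are our own assumptions (a face is visible when its outward normal points toward the camera):

```python
import math

# Outward face normals in the object frame; theta = 0 means the front
# face's normal points straight at the camera (axis conventions assumed).
FACE_NORMALS = {"front": (0.0, 0.0, -1.0), "back": (0.0, 0.0, 1.0),
                "left": (-1.0, 0.0, 0.0), "right": (1.0, 0.0, 0.0),
                "top": (0.0, -1.0, 0.0), "bottom": (0.0, 1.0, 0.0)}

def visible_faces(theta, cam_dir=(0.0, 0.0, -1.0), floor_object=True):
    """Faces of a cuboid rotated by azimuth theta whose outward normal
    faces the camera; at most 3 can pass the test at once."""
    c, s = math.cos(theta), math.sin(theta)
    vis = []
    for name, (x, y, z) in FACE_NORMALS.items():
        nx, nz = c * x + s * z, -s * x + c * z  # rotate normal about y
        if nx * cam_dir[0] + y * cam_dir[1] + nz * cam_dir[2] > 1e-9:
            vis.append(name)
    if floor_object:  # e.g. beds: the bottom face is never visible
        vis = [f for f in vis if f != "bottom"]
    return vis

# The aspect is the visibility pattern, e.g. "front-right" at theta = 30 deg.
aspect = "-".join(sorted(visible_faces(math.radians(30.0))))
```

With the camera level with the box only two faces pass the test; giving `cam_dir` a vertical component (camera above the box) adds the top face and yields three-face aspects such as F-R-T.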
Note, however, that in inference we do not assume that the object\u2019s center lies on the camera\u2019s principal axis.\nIn order to make the cuboid deformable, we introduce the notion of a stitching point, which is a point\non the box that is common to all visible faces for a particular aspect. We incorporate a quadratic\ndeformation cost between the locations of the faces and the stitching point to encourage the cuboid\nto be as rigid as possible. We impose an additional deformation cost between the visible faces,\nensuring that their sizes match when we stitch them into a cuboid hypothesis. Our model represents\neach aspect with its own set of weights. To reduce the computational complexity and impose\nregularization, we share the face and part templates across all aspects, as well as the deformations\nbetween them. However, the deformations between the faces and the cuboid are aspect-speci\ufb01c, as\nthey depend on the stitching point.\nWe formally de\ufb01ne the model by a (6 \u00b7 (n + 1) + 1)-tuple ({(Pi, Pi,1, . . . , Pi,n)}i=1,..,6, b), where Pi\nmodels the i-th face, Pi,j is a model for the j-th part belonging to face i, and b is a real-valued bias\nterm. For ease of exposition, we assume each face to have the same number of parts, n; however,\nthe framework is general and allows the numbers of parts to vary across faces.\n\n\fFigure 3: Dataset [2] statistics for training our cuboid model (left and middle) and DPM [1] (right).\n\nFor each aspect a, we de\ufb01ne each of its visible faces by a 4-tuple (Fi, ra,i, dstitch_a,i, ba), where Fi is a \ufb01lter for the i-th\nface, ra,i is a two-dimensional vector specifying the position of the i-th face relative to the position\nof the stitching point in the recti\ufb01ed view, and dstitch_a,i is a four-dimensional vector specifying coef\ufb01cients\nof a quadratic function de\ufb01ning a deformation cost for each possible placement of the face relative\nto the position of the stitching point. 
Here, ba is a bias term that is aspect-speci\ufb01c and allows us to\ncalibrate the scores across aspects with different numbers of visible faces. Note that Fi will be shared\nacross aspects and thus we omit the index a.\nThe model representing each part is face-speci\ufb01c, and is de\ufb01ned by a 3-tuple (Fi,j, ri,j, di,j), where\nFi,j is a \ufb01lter for the j-th part of the i-th face, ri,j is a two-dimensional vector specifying an \u201canchor\u201d\nposition for part j relative to the root position of face i, and di,j is a four-dimensional vector\nspecifying coef\ufb01cients of a quadratic function de\ufb01ning a deformation cost for each possible placement\nof the part relative to the anchor position on the face. Note that the parts are de\ufb01ned relative to\nthe face and are thus independent of the aspects; we therefore share them across aspects.\nThe appearance templates as well as the deformation parameters in the model are de\ufb01ned for each\nface in a canonical view where that face is frontal. We thus score a face hypothesis in the recti\ufb01ed\nview that makes the hypothesis frontal. Each pair of parallel faces shares a homography, and thus\nat most three recti\ufb01cations are needed for each viewpoint hypothesis \u03b8. In indoor scenarios, we\nestimate the 3 orthogonal vanishing points and assume a Manhattan world. As a consequence, only\n3 recti\ufb01cations are necessary altogether. In the outdoor scenario, we assume that at least the vertical\nvanishing point is given, or equivalently, that the orientation (but not position) of the ground plane\nis known. As a consequence, we only need to search for a 1-D angle \u03b8, i.e., the azimuth, in order\nto estimate the rotation of the 3D box. 
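Each rectification above is a plane homography. A minimal pure-Python sketch of estimating one from the four projected corners of a face (a direct four-point solve with the bottom-right entry fixed to 1; the helper names are ours):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def homography(src, dst):
    """3x3 homography H mapping 4 src points to 4 dst points (h33 = 1)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        A.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve(A, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]

def warp(H, p):
    """Apply homography H to a 2D point p."""
    x, y = p
    d = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / d,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / d)

# Rectify a hypothetical projected face quad to a frontal 200 x 160 rectangle.
quad = [(100.0, 80.0), (300.0, 90.0), (310.0, 260.0), (95.0, 250.0)]
rect = [(0.0, 0.0), (200.0, 0.0), (200.0, 160.0), (0.0, 160.0)]
H = homography(quad, rect)
```

In practice one would warp the HOG pyramid (or the image) with such an H; the quad and rectangle here are illustrative numbers only.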
A sliding window approach is then used to score the cuboid\nhypotheses, by scoring the parts, faces and their deformations in their own recti\ufb01ed view, as well as\nthe deformations of the faces with respect to the stitching point.\nFollowing 2D deformable part-based models [1], we use a pyramid of HOG features to describe\neach face-speci\ufb01c recti\ufb01ed view, H(i, \u03b8), and score a template for a face as follows:\n\nscore(pi, \u03b8) = \u2211_{u',v'} Fi(u', v') \u00b7 H[ui + u'; vi + v'; i, \u03b8],    (1)\n\nwhere pi = (ui, vi, li) speci\ufb01es the position (ui, vi) and level li of the face \ufb01lters in the face-recti\ufb01ed\nfeature pyramids. We score each part pi,j = (ui,j, vi,j, li,j) in a similar fashion, but the pyramid is\nindexed at twice the resolution of the face. We de\ufb01ne the compatibility score between the parts and\nthe corresponding face, denoted as pi = {pi, {pi,j}_{j=1,...,n}}, as the sum over the part scores and\ntheir deformations with respect to the anchor positions on the face:\n\nscoreparts(pi, \u03b8) = \u2211_{j=1}^{n} (score(pi,j, \u03b8) \u2212 di,j \u00b7 \u03c6d(pi, pi,j)).    (2)\n\nWe thus de\ufb01ne the score of a 3D cuboid hypothesis to be the sum of the scores of each face and its\nparts, as well as the deformation of each face with respect to the stitching point and the deformation\nof the faces with respect to each other, as follows:\n\nscore(x, \u03b8, s, p) = \u2211_{i=1}^{6} V(i, a) (score(pi, \u03b8) \u2212 dstitch_a,i \u00b7 \u03c6stitch_d(pi, s, \u03b8)) \u2212 \u2211_{i>ref} V(i, a) \u00b7 dface_i,ref \u00b7 \u03c6face_d(pi, pref, \u03b8) + \u2211_{i=1}^{6} V(i, a) \u00b7 scoreparts(pi, \u03b8) + ba,\n\n\fFigure 4: Learned models for (left) bed, (right) car.\n\nwhere p = (p1, ..., p6) and V(i, a) is a binary variable encoding whether face i is visible under\naspect a. Note that a = a(\u03b8, s) can be deterministically computed from the rotation angle \u03b8 and the\nposition of the stitching point s (which we assume to always be visible), which in turn determines\nthe face visibility V. We use ref to index the \ufb01rst visible face in the aspect model, and\n\n\u03c6d(pi, pi,j, \u03b8) = \u03c6d(du, dv) = (du, dv, du^2, dv^2)    (3)\n\nare the part deformation features, computed in the recti\ufb01ed image of face i implied by the 3D angle\n\u03b8. As in [1], we employ a quadratic deformation cost to model the relationships between the parts\nand the anchor points on the face, and de\ufb01ne (dui,j, dvi,j) = (ui,j, vi,j) \u2212 (2 \u00b7 (ui, vi) + ri,j) as\nthe displacement of the j-th part with respect to its anchor (ui, vi) in the recti\ufb01ed i-th face. The\ndeformation features \u03c6stitch_d(pi, s, \u03b8) between the face pi and the stitching point s are de\ufb01ned as\n(dui, dvi) = (ui, vi) \u2212 ((u(s, i), v(s, i)) + ra,i). Here, (u(s, i), v(s, i)) is the position of the stitching\npoint in the recti\ufb01ed coordinates corresponding to face i and level li.\nWe de\ufb01ne the deformation cost between the faces to be a function of their relative dimensions:\n\n\u03c6face_d(pi, pk, \u03b8) = 0 if max(ei, ek) / min(ei, ek) < 1 + \u03b5, and \u221e otherwise,    (4)\n\nwith ei and ek the lengths of the common edge between faces i and k. We de\ufb01ne the deformation of\na face with respect to the stitching point to also be quadratic. It is de\ufb01ned in the recti\ufb01ed view, and\nthus depends on \u03b8. 
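Spelled out in code, the two deformation terms of Eqs. (3)-(4) look as follows (a toy sketch; the function names and the eps default are our own, with eps standing in for \u03b5):

```python
def part_deformation_cost(d, part_uv, face_uv, anchor):
    """Quadratic cost d . (du, dv, du^2, dv^2) of Eqs. (2)-(3); the part
    level is indexed at twice the resolution of the face, hence the 2*."""
    du = part_uv[0] - (2 * face_uv[0] + anchor[0])
    dv = part_uv[1] - (2 * face_uv[1] + anchor[1])
    return d[0] * du + d[1] * dv + d[2] * du ** 2 + d[3] * dv ** 2

def face_pair_cost(e_i, e_k, eps=0.1):
    """Eq. (4): zero cost when the common-edge lengths of two faces agree
    to within a factor of (1 + eps), infinite otherwise."""
    return 0.0 if max(e_i, e_k) / min(e_i, e_k) < 1 + eps else float("inf")
```

The infinite face-pair cost acts as a hard constraint: hypotheses whose faces cannot be stitched into a consistent cuboid are pruned outright.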
We additionally incorporate a bias term for each aspect, ba, to make the scores of\nmultiple aspects comparable when we combine them into a full cuboid model.\nGiven an image x, the score of a hypothesized 3D cuboid can be obtained as the dot product between\nthe model\u2019s parameters and a feature vector, i.e., score(x, \u03b8, s, p) = wa \u00b7 \u03a6(x, a(\u03b8, s), p), with\n\nwa = (F'_1, ..., F'_6, F'_1,1, ..., F'_6,n, d_1,1, ..., d_6,n, dstitch_a,1, ..., dstitch_a,6, dface_1,2, ..., dface_5,6, ba),\n\nand the feature vector:\n\n\u03a6(x, a(\u03b8, s), p) = (\u02c6H(p1, 1, \u03b8), ..., \u02c6H(p6, 6, \u03b8), \u02c6H(p1,1, 1, \u03b8), ..., \u02c6H(p6,n, 6, \u03b8), \u2212\u02c6\u03c6d(p1, p1,1), ..., \u2212\u02c6\u03c6d(p6, p6,n), \u2212\u02c6\u03c6stitch_d(p1, s, \u03b8), ..., \u2212\u02c6\u03c6stitch_d(p6, s, \u03b8), \u2212\u02c6\u03c6face_d(p1, p2), ..., \u2212\u02c6\u03c6face_d(p5, p6), 1),    (5)\n\nwhere \u02c6\u03c6 includes the visibility score in the feature vector, e.g., \u02c6\u03c6(i, \u00b7) = V(i, a) \u00b7 \u03c6(i, \u00b7).\n\nInference: Inference in this model can be done by computing\n\nfw(x) = max_{\u03b8,s,p} wa \u00b7 \u03a6(x, a(\u03b8, s), p).\n\nThis can be solved exactly via dynamic programming, where the score is \ufb01rst computed for each \u03b8,\ni.e., max_{s,p} wa \u00b7 \u03a6(x, a(\u03b8, s), p), and then a max is taken over the angles \u03b8. We use a discretization\nof 20 deg for the angles. To get the score for each \u03b8, we \ufb01rst compute the feature responses for\nthe part and face templates (Eq. (1)) using a sliding window approach in the corresponding feature\npyramids. As in [1], distance transforms are used to compute the deformation scores of the parts\nef\ufb01ciently, that is, Eq. (2). 
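The quantity that the distance transform accelerates is, per image row or column, the following maximization; below is a naive O(n^2) reference version (our own toy illustration; the generalized distance transform used in [1] produces the same array in O(n)):

```python
def deformation_scores(resp, d):
    """out[u] = max over u' of resp[u'] - d1*(u - u') - d2*(u - u')**2,
    i.e. the best part response reachable from anchor u under a
    quadratic displacement penalty d = (d1, d2)."""
    n = len(resp)
    return [max(resp[v] - d[0] * (u - v) - d[1] * (u - v) ** 2
                for v in range(n))
            for u in range(n)]
```

Applying it once along rows and once along columns gives the 2D deformation score map used for both parts and faces.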
The score for each face simply sums the response of the face template\nand the scores of the parts. We again use distance transforms to compute the deformation scores\nfor each face and the stitching point, which is carried out in the recti\ufb01ed coordinates for each face.\nWe then compute the deformation scores between the faces in Eq. (4), which can be performed\nef\ufb01ciently due to the fact that sides of the same length along one dimension (horizontal or vertical)\nin the coordinates of face i will also be constant along the corresponding line when projected to the\ncoordinate system of face j. Thus, computing the side length ratios of two faces is not quadratic in\nthe number of pixels but only in the number of horizontal or vertical lines. Finally, we reproject the\nscores to the image coordinate system and sum them to get the score for each \u03b8.\n\n\f                 |  Detectors\u2019 performance    |  Layout rescoring\n                 | DPM [1]  3D det.  combined | DPM [1]  3D det.  combined\nHedau et al. [2] | 54.2%    51.3%    59.6%    | -        -        62.8%\nours             | 55.6%    59.4%    60.5%    | 60.0%    64.6%    63.8%\n\nTable 1: Detection performance (measured in AP at 0.5 IOU overlap) for the bed dataset of [2].\n\n3D measure   | DPM \ufb01t3D  BBOX3D  combined | BBOX3D + layout  comb. + layout\nconvex hull  | 48.2%     53.9%   53.9%    | 57.1%            57.8%\nface overlap | 16.3%     33.0%   34.4%    | 33.6%            33.5%\n\nTable 2: 3D detection performance in AP (50% IOU overlap of convex hulls and faces).\n\nFigure 5: Precision-recall curves for (left) 2D detection, (middle) convex hull, (right) face overlap.\n\nLearning: Given a set of training samples D = (\u27e8x1, y1, bb1\u27e9, \u00b7\u00b7\u00b7, \u27e8xN, yN, bbN\u27e9), where x is an\nimage, yi \u2208 {\u22121, 1}, and bb \u2208 R^{8\u00d72} are the eight coordinates of the 3D bounding box in the\nimage, our goal is to learn the weights w = [wa1, \u00b7\u00b7\u00b7, waP] for all P aspects in Eq. (5). 
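One way to learn such weights from partially labeled data is to alternate between choosing the best-scoring latent placement for each positive example and taking a subgradient step on a hinge loss. The caricature below operates on toy feature vectors; every name, constant, and the tiny SGD scheme are our own illustration, not the actual trainer:

```python
def latent_svm_epoch(w, positives, negatives, lr=0.1, C=1.0):
    """One alternation of latent-SVM-style training on toy feature vectors.

    positives: for each positive example, a list of candidate feature
    vectors (one per latent placement); the best-scoring candidate is
    fixed, then a subgradient step is taken on the hinge loss."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    examples = [(max(cands, key=lambda f: dot(w, f)), 1) for cands in positives]
    examples += [(f, -1) for f in negatives]
    for f, y in examples:
        if y * dot(w, f) < 1:                   # margin violated
            w = [wi + lr * C * y * fi for wi, fi in zip(w, f)]
        w = [wi * (1 - lr * 1e-3) for wi in w]  # small L2 shrinkage
    return w

# Toy run: one positive with two candidate placements, one negative.
w = [0.0, 0.0]
for _ in range(5):
    w = latent_svm_epoch(w, positives=[[[1.0, 0.0], [0.2, 0.0]]],
                         negatives=[[-1.0, 0.0]])
```

Because the latent choice depends on the current weights, the objective is non-convex and only a local optimum is guaranteed, mirroring the remark below.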
To train\nour model using partially labeled data, we use a latent SVM formulation [1]; however, frameworks\nsuch as latent structural SVMs [27] are also possible. To initialize the full model, we \ufb01rst learn a\ndeformable face+parts model for each face independently, where the faces of the training examples\nare recti\ufb01ed to be frontal prior to training. We estimate the different aspects of our 3D model from\nthe statistics of the training data, and compute for each training cuboid the relative positions ra,i of\nface i and the stitching point in the recti\ufb01ed view of each face. We then perform joint training of the\nfull model, treating the training cuboid and the stitching point as latent, while requiring that each\nface \ufb01lter and the face annotation overlap more than 70%. Following [1], we utilize a stochastic\ngradient descent approach which alternates between solving for the latent variables and updating the\nweights w. Note that this algorithm is only guaranteed to converge to a local optimum, as the latent\nvariables make the problem non-convex.\n\n4 Experiments\n\nWe evaluate our approach on two datasets: the dataset of [2] as well as KITTI [16], an autonomous\ndriving dataset. To our knowledge, these are the only datasets that have been labeled with 3D\nbounding boxes. We begin our experiments with the indoor scenario [2]. The bedroom dataset\ncontains 181 train and 128 test images. To enable a comparison with the DPM detector [1], we\ntrained a model with 6 mixtures and 8 parts using the same training instances but employing 2D\nbounding boxes. Our 3D bed model was trained with two parts per face. Fig. 3 shows the statistics\nof the dataset in terms of the number of training examples for each aspect (where F-R-T denotes an\naspect for which the front, right, and top faces are visible), as well as per face. 
Note that the fact\nthat the dataset is unbalanced (fewer examples for aspects with two faces) does not affect our approach\nmuch, as only the face-stitching point deformation parameters are aspect-speci\ufb01c. As we\nshare the weights among the aspects, the number of training instances for each face is signi\ufb01cantly\nhigher (Fig. 3, middle). We compare this to DPM in Fig. 3, right. Our method can better exploit the\ntraining data by factoring out the viewpoint dependence of the training examples.\nWe begin our quantitative evaluation by using our model to reason about 2D detection. The 2D\nbounding boxes for our model are computed by \ufb01tting a 2D box around the convex hull of the\nprojection of the predicted 3D box. We report average precision (AP), where we require that the\noutput 2D boxes overlap with the ground-truth boxes at least 50% using the intersection-over-union\n(IOU) criterion. The precision-recall curves are shown in Fig. 5. We compare our approach to the\ndeformable part model (DPM) [1] and the cuboid model of Hedau et al. [2]. As shown in Table 1,\nwe outperform the cuboid model of [2] by 8.1% and DPM by 3.8%. This is notable, as to the best\nof our knowledge, this is the \ufb01rst time that a 3D approach outperforms the DPM.\u00b9\n\n\fFigure 6: Detection examples obtained with our model on the bed dataset [2].\n\nFigure 7: Detections in 3D + layout.\n\nExamples of\ndetections of our model are shown in Fig. 
6.\nA standard way to improve the detector\u2019s performance has been to rescore object detections using\ncontextual information [1]. Following [2], we use two types of context. We \ufb01rst combine our\ndetector with the 2D-DPM [1] to see whether the two sources of information complement each\nother. The second type of context is at the scene level, where we exploit the fact that the objects in\nindoor environments do not penetrate the walls and usually respect certain size ratios in 3D.\nWe combine the 3D and 2D detectors using a two-step process, where \ufb01rst the 2D detector is run\ninside the bounding boxes produced by our cuboid model. A linear SVM that utilizes both scores\nas input is then employed to produce a score for the combined detection. While we observe a slight\nimprovement in performance (1.1%), it seems that our cuboid model is already scoring the correct\nboxes well. This is in contrast to the cuboid model of [2], where the increase in performance is more\nsigni\ufb01cant due to the poorer accuracy of their 3D approach.\nFollowing [2], we use an estimate of the room layout to rescore the object hypotheses at the scene\nlevel. We use the approach by Schwing et al. [28] to estimate the layout. To train the re-scoring\nclassi\ufb01er, we use the image-relative width and height features as in [1], the footprint overlap between\nthe 3D box and the \ufb02oor as in [2], as well as 3D statistics such as the distance between the object 3D\nbox and the wall relative to the room height and the ratio between the object and room height in 3D.\nThis further increases our performance by 5.2% (Table 1). Examples of 3D reconstruction of the\nroom and our predicted 3D object hypotheses are shown in Fig. 7.\nTo evaluate the 3D performance of our detector we use the convex hull overlap measure introduced in [2]. 
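This convex hull overlap can be computed as the IOU of two convex polygons: clip one hull by the other and compare shoelace areas. The implementation below is our own minimal sketch and assumes counter-clockwise vertex order:

```python
def area(poly):
    """Shoelace area of a polygon given as a list of (x, y) vertices."""
    if len(poly) < 3:
        return 0.0
    return 0.5 * abs(sum(x1 * y2 - x2 * y1
                         for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1])))

def clip(subject, clipper):
    """Sutherland-Hodgman: clip a polygon by a convex CCW clipper."""
    def inside(p, a, b):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0
    def cross_pt(p, q, a, b):
        # Intersection of the lines through pq and ab.
        x1, y1 = p; x2, y2 = q; x3, y3 = a; x4, y4 = b
        den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / den
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    out = subject
    for a, b in zip(clipper, clipper[1:] + clipper[:1]):
        if not out:
            break
        inp, out = out, []
        for p, q in zip(inp, inp[1:] + inp[:1]):
            if inside(q, a, b):
                if not inside(p, a, b):
                    out.append(cross_pt(p, q, a, b))
                out.append(q)
            elif inside(p, a, b):
                out.append(cross_pt(p, q, a, b))
    return out

def convex_hull_iou(h1, h2):
    """IOU of two convex polygons (e.g. hulls of projected cuboid corners)."""
    inter = area(clip(h1, h2))
    union = area(h1) + area(h2) - inter
    return inter / union if union > 0 else 0.0
```

Feeding in the convex hulls of the projected predicted and ground-truth boxes and thresholding at 0.5 reproduces the evaluation criterion described here.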
Here, instead of computing the overlap between the predicted boxes, we require that\nthe convex hulls of our 3D hypotheses projected to the image plane and the ground-truth annotations\noverlap at least 50% in the IOU measure. Table 2 reports the results and shows that little is lost in\nperformance due to the stricter overlap measure.\n\n\u00b9Note that the numbers for our and [2]\u2019s version of DPM slightly differ. The difference is likely due to how\nthe negative examples are sampled during training (the dataset has a positive example in each training image).\n\n\fFigure 8: KITTI: examples of car detections. (top) Ground truth; (bottom) our 3D detections,\naugmented with best-\ufb01tting CAD models to visualize inferred 3D box orientations.\n\nSince our model also predicts the locations of the dominant object faces (and thus the 3D object\norientation), we would like to quantify its accuracy. We introduce an even stricter measure, where\nwe also require that the predicted cuboid faces overlap with the faces of the ground-truth cuboids. In\nparticular, a hypothesis is correct if the average of the overlaps between the top faces and the vertical\nfaces exceeds 50% IOU. We compare the results of our approach to DPM [1]. Note, however, that [1]\nreturns only 2D boxes and hence a direct comparison is not possible. We thus augment the original\nDPM with 3D information in the following way. Since the three dominant orientations of the room,\nand thus the objects, are known (estimated via the vanishing points), we can \ufb01nd a 3D box whose\nprojection best overlaps with the output of the 2D detector. This can be done by sliding a cuboid\n(whose dimensions match our cuboid model) in 3D to best \ufb01t the 2D bounding box. Our approach\noutperforms the 3D augmented DPM by a signi\ufb01cant margin of 16.7%. 
We attribute this to the fact\nthat our cuboid is deformable and thus the faces localize more accurately on the faces of the object.\nWe also conducted preliminary tests of our model on the autonomous driving dataset KITTI [16].\nWe trained our model with 8 aspects (estimated from the data) and 4 parts per face. An example of a\nlearned aspect model is shown in Fig. 4. Note that the rectangular patches on the faces represent the\nparts, and color coding is used to depict the learned part and face deformation weights. We can observe\nthat the model effectively and compactly factors out the appearance changes due to changes in\nviewpoint. Examples of detections are shown in Fig. 8. The top rows show ground-truth annotations,\nwhile the bottom rows depict our predicted 3D boxes. To also showcase the viewpoint prediction\nof our detector, we insert a CAD model inside each estimated 3D box, matching its orientation in\n3D. In particular, for each detection we automatically choose a CAD model out of a collection of 80\nmodels whose 3D bounding box best matches the dimensions of the predicted box. One can see that\nour 3D detector is able to predict the viewpoints of the objects well, as well as the type of car.\n5 Conclusion\nWe proposed a novel approach to 3D object detection, which extends the well-acclaimed DPM to\nreason in 3D by means of a deformable 3D cuboid. Our cuboid allows for deformations at the face\nlevel via a stitching point as well as deformations between the faces and the parts. We demonstrated\nthe effectiveness of our approach in indoor and outdoor scenarios and showed that our approach\noutperforms [1] and [2] in terms of 2D and 3D estimation. In future work, we plan to reason jointly\nabout the 3D scene layout and the objects in order to improve the performance in both tasks.\n\nAcknowledgements. S.F. has been supported in part by DARPA, contract number W911NF-10-2-0060. 
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either express or implied, of the Army Research Laboratory or the U.S. Government.

References

[1] Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. (2010) Object detection with discriminatively trained part based models. IEEE TPAMI, 32, 1627-1645.

[2] Hedau, V., Hoiem, D., and Forsyth, D. (2010) Thinking inside the box: Using appearance models and context based on room geometry. ECCV, vol. 6, pp. 224-237.

[3] Hinterstoisser, S., Lepetit, V., Ilic, S., Fua, P., and Navab, N. (2010) Dominant orientation templates for real-time detection of texture-less objects. CVPR.

[4] Schneiderman, H. and Kanade, T. (2000) A statistical method for 3d object detection applied to faces and cars. CVPR, pp. 1746-1759.

[5] Torralba, A., Murphy, K. P., and Freeman, W. T. (2007) Sharing visual features for multiclass and multiview object detection. IEEE TPAMI, 29, 854-869.

[6] Gu, C. and Ren, X. (2010) Discriminative mixture-of-templates for viewpoint classification. ECCV, pp. 408-421.

[7] Lowe, D. (1991) Fitting parameterized three-dimensional models to images. IEEE TPAMI, 13, 441-450.

[8] Liebelt, J., Schmid, C., and Schertler, K. (2008) Viewpoint-independent object class detection using 3d feature maps. CVPR.

[9] Yan, P., Khan, S. M., and Shah, M. (2007) 3d model based object class detection in an arbitrary view. ICCV.

[10] Glasner, D., Galun, M., Alpert, S., Basri, R., and Shakhnarovich, G. (2011) Viewpoint-aware object detection and pose estimation. ICCV.

[11] Savarese, S. and Fei-Fei, L. (2007) 3d generic object categorization, localization and pose estimation. ICCV.

[12] Su, H., Sun, M., Fei-Fei, L., and Savarese, S.
(2009) Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories. ICCV.

[13] Pepik, B., Stark, M., Gehler, P., and Schiele, B. (2012) Teaching 3d geometry to deformable part models. CVPR.

[14] Dalal, N. and Triggs, B. (2005) Histograms of oriented gradients for human detection. CVPR.

[15] Koenderink, J. and van Doorn, A. (1976) The singularities of the visual mappings. Bio. Cyber., 24, 51-59.

[16] Geiger, A., Lenz, P., and Urtasun, R. (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. CVPR.

[17] Kushal, A., Schmid, C., and Ponce, J. (2007) Flexible object models for category-level 3d object recognition. CVPR.

[18] Thomas, A., Ferrari, V., Leibe, B., Tuytelaars, T., Schiele, B., and Gool, L. V. (2006) Toward multi-view object class detection. CVPR.

[19] Hoiem, D., Rother, C., and Winn, J. (2007) 3d layoutcrf for multi-view object class recognition and segmentation. CVPR.

[20] Sun, M., Su, H., Savarese, S., and Fei-Fei, L. (2009) A multi-view probabilistic model for 3d object classes. CVPR.

[21] Payet, N. and Todorovic, S. (2011) Probabilistic pose recovery using learned hierarchical object models. ICCV.

[22] Stark, M., Goesele, M., and Schiele, B. (2010) Back to the future: Learning shape models from 3d cad data. British Machine Vision Conference.

[23] Brooks, R. A. (1983) Model-based three-dimensional interpretations of two-dimensional images. IEEE TPAMI, 5, 140-150.

[24] Dickinson, S. J., Pentland, A. P., and Rosenfeld, A. (1992) 3-d shape recovery using distributed aspect matching. IEEE TPAMI, 14, 174-198.

[25] Sun, M., Bradski, G., Xu, B.-X., and Savarese, S. (2010) Depth-encoded hough voting for coherent object detection, pose estimation, and shape recovery. ECCV.

[26] Xiang, Y. and Savarese, S. (2012) Estimating the aspect layout of object categories.
CVPR.

[27] Yu, C.-N. and Joachims, T. (2009) Learning structural svms with latent variables. ICML.

[28] Schwing, A., Hazan, T., Pollefeys, M., and Urtasun, R. (2012) Efficient structured prediction for 3d indoor scene understanding. CVPR.