{"title": "3D Object Proposals for Accurate Object Class Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 424, "page_last": 432, "abstract": "The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on all three KITTI object classes.", "full_text": "3D Object Proposals for Accurate Object Class\n\nDetection\n\nXiaozhi Chen\u2217,1 Kaustav Kundu \u2217,2\n\nYukun Zhu2\n\nAndrew Berneshawi2\n\nHuimin Ma1\n\nSanja Fidler2\n\nRaquel Urtasun2\n\n1Department of Electronic Engineering\nTsinghua University\n\n2Department of Computer Science\nUniversity of Toronto\n\nchenxz12@mails.tsinghua.edu.cn, {kkundu, yukun}@cs.toronto.edu,\n\nandrew.berneshawi@mail.utoronto.ca, mhmpub@tsinghua.edu.cn,\n\n{fidler, urtasun}@cs.toronto.edu\n\nAbstract\n\nThe goal of this paper is to generate high-quality 3D object proposals in the con-\ntext of autonomous driving. Our method exploits stereo imagery to place propos-\nals in the form of 3D bounding boxes. We formulate the problem as minimizing an\nenergy function encoding object size priors, ground plane as well as several depth\ninformed features that reason about free space, point cloud densities and distance\nto the ground. 
Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on all three KITTI object classes.\n\n1 Introduction\n\nDue to the development of advanced warning systems, cameras are available on board almost every new car produced in the last few years. Computer vision provides a very cost-effective solution not only to improve safety, but also to one of the holy grails of AI, fully autonomous self-driving cars. In this paper we are interested in 2D and 3D object detection for autonomous driving.\nWith the large success of deep learning in the past years, the object detection community has shifted from simple appearance scoring on exhaustive sliding windows [1] to more powerful, multi-layer visual representations [2, 3] extracted from a smaller set of object/region proposals [4, 5]. This resulted in over 20% absolute performance gains [6, 7] on the PASCAL VOC benchmark [8].\nThe motivation behind these bottom-up grouping approaches is to provide a moderate number of region proposals among which at least a few accurately cover the ground-truth objects. These approaches typically over-segment an image into superpixels and group them based on several similarity measures [4, 5]. This is the strategy behind Selective Search [4], which is used in most state-of-the-art detectors these days. Contours in the image have also been exploited in order to locate object proposal boxes [9]. Another successful approach is to frame the problem as energy minimization, where a parametrized family of energies represents various biases for grouping, thus yielding multiple diverse solutions [10].\nInterestingly, the state-of-the-art R-CNN approach [6] does not work well on the autonomous driving benchmark KITTI [11], falling significantly behind the current top performers [12, 13]. 
This is due to the low achievable recall of the underlying box proposals on this benchmark. KITTI images contain many small objects, severe occlusion, highly saturated areas and shadows. Furthermore, KITTI's evaluation requires a much higher overlap with ground-truth for cars in order for a detection to count as correct. Since most existing object/region proposal methods rely on grouping superpixels based on intensity and texture, they fail in these challenging conditions.\n\n* Denotes equal contribution\n\nFigure 1: Features. From left to right: original image, stereo reconstruction, depth-based features, and our prior. In the third image, purple is free space (F in Eq. (2)) and occupancy is yellow (S in Eq. (1)). In the prior, the ground plane is green, and red to blue indicates distance to the ground.\n\nIn this paper, we propose a new object proposal approach that exploits stereo information as well as contextual models specific to the domain of autonomous driving. Our method reasons in 3D and places proposals in the form of 3D bounding boxes. We exploit object size priors, the ground plane, as well as several depth informed features such as free space, point densities inside the box, visibility and distance to the ground. Our experiments show a significant improvement in achievable recall over the state-of-the-art at all overlap thresholds and object occlusion levels, demonstrating that our approach produces highly accurate object proposals. In particular, we achieve a 25% higher recall for 2K proposals than the state-of-the-art RGB-D method MCG-D [14]. Combined with CNN scoring, our method outperforms all published results on object detection for Car, Cyclist and Pedestrian on KITTI [11]. 
Our code and data are online: http://www.cs.toronto.edu/~objprop3d.\n\n2 Related Work\n\nWith the wide success of deep networks [2, 3], which typically operate on a fixed spatial scope, there has been increased interest in object proposal generation. Existing approaches range from purely RGB [4, 9, 10, 5, 15, 16], RGB-D [17, 14, 18, 19], to video [20]. In RGB, most approaches combine superpixels into larger regions based on color and texture similarity [4, 5]. These approaches produce around 2,000 proposals per image, achieving nearly perfect achievable recall on the PASCAL VOC benchmark [8]. In [10], regions are proposed by defining parametric affinities between pixels and solving the energy using parametric min-cut. The proposed solutions are then scored using simple Gestalt-like features, and typically only 150 top-ranked proposals are needed to succeed in subsequent recognition tasks [21, 22, 7]. [16] introduces learning into proposal generation with parametric energies. Exhaustively sampled bounding boxes are scored in [23] using several "objectness" features. BING [15] also scores windows based on an object closure measure as a proxy for "objectness". Edgeboxes [9] scores millions of windows based on contour information inside and on the boundary of each window. A detailed comparison is done in [24].\nFewer approaches exist that exploit RGB-D. [17, 18] extend CPMC [10] with additional affinities that encourage the proposals to respect occlusion boundaries. [14] extends MCG [5] to 3D with an additional set of depth-informed features, showing significant improvements in performance with respect to past work. In [19], RGB-D videos are used to propose boxes around very accurate point clouds. Relevant to our work is Sliding Shapes [25], which exhaustively evaluates 3D cuboids in RGB-D scenes. 
This approach, however, utilizes an object scoring function trained on a large number of rendered views of CAD models, and uses complex class-based potentials that make the method slow in both training and inference. Our work advances over prior work by exploiting the typical sizes of objects in 3D, the ground plane, and very efficient depth-informed scoring functions.\nRelated to our work are also detection approaches for autonomous driving. In [26], objects are pre-detected via a poselet-like approach and a deformable wireframe model is then fit using the image information inside the box. Pepik et al. [27] extend the Deformable Part-based Model [1] to 3D by linking parts across different viewpoints and using a 3D-aware loss function. In [28], an ensemble of models derived from visual and geometrical clusters of object instances is employed. In [13], Selective Search boxes are re-localized using top-down, object-level information. [29] proposes a holistic model that re-reasons about DPM detections based on priors from cartographic maps. In KITTI, the best performing method so far is the recently proposed 3DVP [12], which uses the ACF detector [30] and learned occlusion patterns in order to improve performance on occluded cars.\n\nFigure 2: Proposal recall: We use a 0.7 overlap threshold for Car, and 0.5 for Pedestrian and Cyclist. Rows: Car, Pedestrian, Cyclist; columns: (a) Easy, (b) Moderate, (c) Hard.\n\n3 3D Object Proposals\n\nThe goal of our approach is to output a diverse set of object proposals in the context of autonomous driving. Since 3D reasoning is of crucial importance in this domain, we place our proposals in 3D and represent them as cuboids. We assume a stereo image pair as input and compute depth via the state-of-the-art approach by Yamaguchi et al. [31]. We use depth to compute a point cloud x and conduct all our reasoning in this domain. 
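As a concrete illustration of the point-cloud construction step, the following sketch backprojects a dense depth map into 3D points under a standard pinhole camera model. The intrinsics here (fx, fy, cx, cy) are illustrative placeholders, not KITTI's actual calibration; in the paper the depth itself comes from the stereo method of Yamaguchi et al. [31].

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Backproject a dense depth map (meters) into an Nx3 point cloud.

    Standard pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Pixels with invalid depth (<= 0) are dropped.
    """
    v, u = np.indices(depth.shape)  # pixel row (v) and column (u) coordinates
    z = depth
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=1)
```

Each reasoning step in the paper then operates on this point cloud after voxelizing it into the 0.2m grid of Sec. 3.2.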
We next describe our notation and present our framework.\n\n3.1 Proposal Generation as Energy Minimization\n\nWe represent each object proposal with a 3D bounding box, denoted by y, which is parametrized by a tuple (x, y, z, θ, c, t), where (x, y, z) denotes the center of the 3D box and θ represents its azimuth angle. Note that each box y in principle lives in a continuous space; however, for efficiency we reason in a discretized space (details in Sec. 3.2). Here, c denotes the object class of the box and t ∈ {1, ..., T_c} indexes the set of 3D box "templates", which represent the physical size variations of each object class c. The templates are learned from the training data.\nWe formulate the proposal generation problem as inference in a Markov Random Field (MRF) which encodes the fact that the proposal y should enclose a high-density region in the point cloud. Furthermore, since the point cloud represents only the visible portion of the 3D space, y should not overlap with the free space that lies within the rays between the points in the point cloud and the camera. If that were the case, the box would in fact occlude the point cloud, which is not possible. We also encode the fact that the point cloud should not extend vertically beyond our placed 3D box, and that the height of the point cloud in the immediate vicinity of the box should be lower than the box. Our MRF energy thus takes the following form:\n\nE(x, y) = w_{c,pcd}^T φ_pcd(x, y) + w_{c,fs}^T φ_fs(x, y) + w_{c,ht}^T φ_ht(x, y) + w_{c,ht−contr}^T φ_ht−contr(x, y)\n\nNote that our energy depends on the object class via class-specific weights w_c, which are trained using structured SVM [32] (details in Sec. 3.4). 
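To make the structure of the energy concrete, here is a minimal sketch that scores candidate boxes as a weighted sum of the four potentials. The feature values and weights below are made-up numbers for illustration only; in the paper each potential is depth-informed and the per-class weights are learned with a structured SVM (Sec. 3.4).

```python
def energy(weights, features):
    # E(x, y) = sum over potentials k of w_{c,k} * phi_k(x, y)
    return sum(weights[k] * features[k] for k in weights)

# Hypothetical learned weights for one class (signs chosen so that dense,
# non-free-space boxes receive low energy).
w_car = {"pcd": -1.0, "fs": 0.5, "ht": -0.8, "ht_contr": -0.3}

candidates = [
    {"pcd": 0.9, "fs": 0.05, "ht": 0.8, "ht_contr": 0.4},  # dense, little free space
    {"pcd": 0.2, "fs": 0.60, "ht": 0.1, "ht_contr": 0.0},  # mostly free space
]
scores = [energy(w_car, f) for f in candidates]
best = min(range(len(scores)), key=scores.__getitem__)  # energy is minimized
```

Because every potential reduces to box sums over precomputed grids, each candidate's energy is a handful of lookups plus this dot product.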
We now explain each potential in more detail.\nPoint Cloud Density: This potential encodes the density of the point cloud within the box:\n\nφ_pcd(x, y) = ( Σ_{p∈Ω(y)} S(p) ) / |Ω(y)|    (1)\n\nwhere S(p) indicates whether the voxel p is occupied or not (contains point cloud points), and Ω(y) denotes the set of voxels inside the box defined by y. Fig. 1 visualizes the potential. This potential simply counts the fraction of occupied voxels inside the box. It can be efficiently computed in constant time via integral accumulators, a generalization of integral images to 3D.\nFree Space: This potential encodes the constraint that the free space between the point cloud and the camera cannot be occupied by the box. Let F represent a free space grid, where F(p) = 1 means that the ray from the camera to the voxel p does not hit an occupied voxel, i.e., voxel p lies in the free space. 
We define the potential as follows:\n\nφ_fs(x, y) = ( Σ_{p∈Ω(y)} (1 − F(p)) ) / |Ω(y)|    (2)\n\nThis potential thus tries to minimize the free space inside the box, and can also be computed efficiently using integral accumulators.\nHeight Prior: This potential encodes the fact that the height of the point cloud inside the box should be close to the mean height of the object class c. This is encoded in the following way:\n\nφ_ht(x, y) = (1 / |Ω(y)|) Σ_{p∈Ω(y)} H_c(p)    (3)\n\nwith\n\nH_c(p) = exp[ −(1/2) ((d_p − μ_{c,ht}) / σ_{c,ht})^2 ] if S(p) = 1, and 0 otherwise,    (4)\n\nwhere d_p indicates the height of the voxel p above the road plane lying below it. Here, μ_{c,ht} and σ_{c,ht} are the MLE estimates of the mean height and standard deviation, obtained by assuming a Gaussian distribution of the data. Integral accumulators can be used to efficiently compute these features.\nHeight Contrast: This potential encodes the fact that the point cloud that surrounds the bounding box should have a lower height than the height of the point cloud inside the box. This is encoded as:\n\nφ_ht−contr(x, y) = ( φ_ht(x, y+) − φ_ht(x, y) ) / φ_ht(x, y)    (5)\n\nwhere y+ represents the cuboid obtained by extending y by 0.6m in the direction of each face.\n\n3.2 Discretization and Accumulators\n\nOur point cloud is defined with respect to a left-handed coordinate system, where the positive Z-axis is along the viewing direction of the camera and the Y-axis is along the direction of gravity. We discretize the continuous space such that the width of each voxel is 0.2m in each dimension. We compute the occupancy, free space and height prior grids in this discretized space. 
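The integral accumulators used throughout this section can be sketched as follows: a 3D cumulative-sum table over the voxel grid, from which the sum of any quantity over the voxels Ω(y) of a box (and hence, e.g., φ_pcd) falls out in eight lookups. The grid below is a toy example.

```python
import numpy as np

def integral_3d(grid):
    """Summed-volume table with a zero border: sat[i, j, k] = sum of grid[:i, :j, :k]."""
    sat = grid.cumsum(0).cumsum(1).cumsum(2)
    return np.pad(sat, ((1, 0), (1, 0), (1, 0)))

def box_sum(sat, lo, hi):
    """Sum of grid values in the half-open voxel box [lo, hi): 8 lookups, O(1)."""
    (x0, y0, z0), (x1, y1, z1) = lo, hi
    return (sat[x1, y1, z1] - sat[x0, y1, z1] - sat[x1, y0, z1] - sat[x1, y1, z0]
            + sat[x0, y0, z1] + sat[x0, y1, z0] + sat[x1, y0, z0] - sat[x0, y0, z0])

# Toy occupancy grid S: a 2x2x2 block of occupied voxels inside a 4x4x4 grid.
S = np.zeros((4, 4, 4))
S[1:3, 1:3, 1:3] = 1.0
sat = integral_3d(S)

lo, hi = (0, 0, 0), (3, 3, 3)
phi_pcd = box_sum(sat, lo, hi) / float(np.prod(np.subtract(hi, lo)))  # occupied fraction
```

The free-space and height-prior grids admit the same trick, which is what makes scoring every candidate box constant-time.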
Following the idea of integral images, we compute our accumulators in 3D.\n\n3.3 Inference\n\nInference in our model is performed by minimizing the energy:\n\ny* = argmin_y E(x, y)\n\nDue to the efficient computation of the features using integral accumulators, evaluating each configuration y takes constant time. Still, evaluating exhaustively over the entire grid would be slow. In order to reduce the search space, we carve out certain regions of the grid by skipping configurations which do not overlap with the point cloud. We further reduce the search space along the vertical dimension by placing all our bounding boxes on the road plane, y = y_road. We estimate the road by partitioning the image into superpixels, and train a road classifier using a neural net with several 2D and 3D features. We then use RANSAC on the predicted road pixels to fit the ground plane. Using the ground plane considerably reduces the search space along the vertical dimension. However, since the points are noisy at large distances from the camera, we sample additional proposal boxes at locations farther than 20m from the camera. We sample these boxes at heights y = y_road ± σ_road, where σ_road is the MLE estimate of the standard deviation, obtained by assuming a Gaussian distribution of the distance between objects and the estimated ground plane. Using our sampling strategy, scoring all possible configurations takes only a fraction of a second.\n\nFigure 3: Recall vs IoU for 500 proposals, for Car, Pedestrian and Cyclist in the (a) Easy, (b) Moderate and (c) Hard regimes. The number next to the labels indicates the average recall (AR).\n\nNote that by minimizing our energy we only get the single best object candidate. In order to generate N diverse proposals, we sort the values of E(x, y) for all y, and perform greedy inference: we pick the top scoring proposal, perform NMS, and iterate. 
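The greedy loop just described (sort by energy, keep the best, suppress overlapping boxes, repeat) can be sketched as below. The pairwise-overlap matrix is assumed to be precomputed with whatever overlap measure is appropriate (3D IoU in this paper), and the 0.7 suppression threshold is illustrative.

```python
def greedy_diverse_topk(energies, overlap, k, thresh=0.7):
    """Pick up to k diverse proposals: repeatedly take the lowest-energy live box,
    then suppress (NMS) every box whose overlap with it exceeds `thresh`."""
    order = sorted(range(len(energies)), key=energies.__getitem__)
    keep = []
    alive = [True] * len(energies)
    for i in order:
        if not alive[i]:
            continue
        keep.append(i)
        if len(keep) == k:
            break
        for j in range(len(energies)):
            if overlap[i][j] > thresh:  # also suppresses i itself (overlap 1.0)
                alive[j] = False
    return keep
```

Since the candidates are scored once up front, diversification is just this single pass over the sorted list.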
The entire inference process and feature computation takes on average 1.2s per image for N = 2000 proposals.\n\n3.4 Learning\n\nWe learn the weights {w_c,pcd, w_c,fs, w_c,ht, w_c,ht−contr} of the model using structured SVM [32]. Given N ground truth input-output pairs, {x^(i), y^(i)}_{i=1,···,N}, the parameters are learnt by solving the following optimization problem:\n\nmin_{w∈R^D} (1/2) ||w||^2 + (C/N) Σ_{i=1}^N ξ_i\ns.t.: w^T (φ(x^(i), y) − φ(x^(i), y^(i))) ≥ Δ(y^(i), y) − ξ_i, ∀y \ y^(i)\n\nWe use the parallel cutting plane implementation of [33] to solve this minimization problem. We use the Intersection-over-Union (IoU) between the set of GT boxes, y^(i), and candidates y as the task loss Δ(y^(i), y). We compute IoU in 3D as the volume of intersection of two 3D boxes divided by the volume of their union. This is a very strict measure that encourages accurate 3D placement of the proposals.\n\n3.5 Object Detection and Orientation Estimation Network\n\nWe use our object proposal method for the task of object detection and orientation estimation. We score bounding box proposals using a CNN. 
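The 3D IoU used as the task loss above can be sketched for the axis-aligned case as follows (the paper's boxes additionally carry an azimuth angle, which this simplification ignores).

```python
import numpy as np

def iou_3d(a_min, a_max, b_min, b_max):
    """IoU of two axis-aligned 3D boxes, each given by min/max corners:
    intersection volume divided by union volume."""
    lo = np.maximum(a_min, b_min)
    hi = np.minimum(a_max, b_max)
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if the boxes are disjoint
    vol_a = np.prod(a_max - a_min)
    vol_b = np.prod(b_max - b_min)
    return inter / (vol_a + vol_b - inter)
```

Because volume is shared across all three dimensions, a small placement error in any one axis already drags the score down, which is what makes this loss so strict.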
[Figure 3 average recall (AR) values per panel — Car: Easy BING 12.3, SS 26.7, EB 37.4, MCG 45.1, MCG-D 49.6, Ours 65.8; Moderate 7.7, 18, 26.4, 36.1, 38.8, 57.5; Hard 7, 16.5, 23, 31.3, 32.8, 57. Pedestrian: Easy 7.6, 5.4, 9.2, 15, 19.6, 49.3; Moderate 6.6, 5.1, 7.7, 13.2, 16.1, 43.6; Hard 6.1, 5, 6.9, 12.2, 14, 38.6. Cyclist: Easy 6.1, 7.4, 5, 10.9, 10.8, 52.7; Moderate 4.1, 6, 4.4, 8, 10.2, 37.6; Hard 4.3, 6.1, 4.4, 8.2, 10.7, 37.7.]\n\nTable 1: Average Precision (AP) (in %) on the test set of the KITTI Object Detection Benchmark.\n\nMethod | Cars (Easy / Moderate / Hard) | Pedestrians (Easy / Moderate / Hard) | Cyclists (Easy / Moderate / Hard)\nLSVM-MDPM-sv [35, 1] | 68.02 / 56.48 / 44.18 | 47.74 / 39.36 / 35.95 | 35.04 / 27.50 / 26.21\nSquaresICF [36] | - | 57.33 / 44.42 / 40.08 | -\nDPM-C8B1 [37] | 74.33 / 60.99 / 47.16 | 38.96 / 29.03 / 25.61 | 43.49 / 29.04 / 26.20\nMDPM-un-BB [1] | 71.19 / 62.16 / 48.43 | - | -\nDPM-VOC+VP [27] | 74.95 / 64.71 / 48.76 | 59.48 / 44.86 / 40.37 | 42.43 / 31.08 / 28.23\nOC-DPM [38] | 74.94 / 65.95 / 53.86 | - | -\nAOG [39] | 84.36 / 71.88 / 59.27 | - | -\nSubCat [28] | 84.14 / 75.46 / 59.71 | 54.67 / 42.34 / 37.95 | -\nDA-DPM [40] | - | 56.36 / 45.51 / 41.08 | -\nFusion-DPM [41] | - | 59.51 / 46.67 / 42.05 | -\nR-CNN [42] | - | 61.61 / 50.13 / 44.79 | -\nFilteredICF [43] | - | 61.14 / 53.98 / 49.29 | -\npAUCEnsT [44] | - | 65.26 / 54.49 / 48.60 | 51.62 / 38.03 / 33.38\nMV-RGBD-RF [45] | - | 70.21 / 54.56 / 51.25 | 54.02 / 39.72 / 34.82\n3DVP [12] | 87.46 / 75.77 / 65.38 | - | -\nRegionlets [13] | 84.75 / 76.45 / 59.70 | 73.14 / 61.15 / 55.21 | 70.41 / 58.72 / 51.83\nOurs | 93.04 / 88.64 / 79.10 | 81.78 / 67.47 / 64.70 | 78.39 / 68.94 / 61.37\n\nTable 2: AOS scores (in %) on the test set of KITTI's Object Detection and Orientation Estimation Benchmark.\n\nMethod | Cars (Easy / Moderate / Hard) | Pedestrians (Easy / Moderate / Hard) | Cyclists (Easy / Moderate / Hard)\nAOG [39] | 43.81 / 38.21 / 31.53 | - | -\nDPM-C8B1 [37] | 59.51 / 50.32 / 39.22 | 31.08 / 23.37 / 20.72 | 27.25 / 19.25 / 17.95\nLSVM-MDPM-sv [35, 1] | 67.27 / 55.77 / 43.59 | 43.58 / 35.49 / 32.42 | 27.54 / 22.07 / 21.45\nDPM-VOC+VP [27] | 72.28 / 61.84 / 46.54 | 53.55 / 39.83 / 35.73 | 30.52 / 23.17 / 21.58\nOC-DPM [38] | 73.50 / 64.42 / 52.40 | - | -\nSubCat [28] | 83.41 / 74.42 / 58.83 | 44.32 / 34.18 / 30.76 | -\n3DVP [12] | 86.92 / 74.59 / 64.11 | - | -\nOurs | 91.44 / 86.10 / 76.52 | 72.94 / 59.80 / 57.03 | 70.13 / 58.68 / 52.35\n\nOur network is built on Fast R-CNN [34], which shares convolutional features across all proposals and uses a ROI pooling layer to compute proposal-specific features. We extend this basic network by adding a context branch after the last convolutional layer, and an orientation regression loss to jointly learn object location and orientation. Features output from the original and the context branches are concatenated and fed to the prediction layers. The context regions are obtained by enlarging the candidate boxes by a factor of 1.5. 
We use the smooth L1 loss [34] for orientation regression. We use OxfordNet [3] trained on ImageNet to initialize the weights of the convolutional layers and the branch for candidate boxes. The parameters of the context branch are initialized by copying the weights from the original branch. We then fine-tune the network end to end on the KITTI training set.\n\n4 Experimental Evaluation\n\nWe evaluate our approach on the challenging KITTI autonomous driving dataset [11], which contains three object classes: Car, Pedestrian, and Cyclist. KITTI's object detection benchmark has 7,481 training and 7,518 test images. Evaluation is done in three regimes: easy, moderate and hard, containing objects at different occlusion and truncation levels. The moderate regime is used to rank the competing methods in the benchmark. Since the test ground-truth labels are not available, we split the KITTI training set into train and validation sets (each containing half of the images). We ensure that our training and validation sets do not come from the same video sequences, and evaluate the performance of our bounding box proposals on the validation set.\nFollowing [4, 24], we use oracle recall as the metric. For each ground-truth (GT) object we find the proposal that overlaps the most in IoU (i.e., the "best proposal"). We say that a GT instance has been recalled if the IoU exceeds 70% for cars, and 50% for pedestrians and cyclists, following the standard KITTI setup. Oracle recall thus computes the percentage of recalled GT objects, i.e., the best achievable recall. We also show how different numbers of generated proposals affect recall.\nComparison to the State-of-the-art: We compare our approach to several baselines: MCG-D [14], MCG [5], Selective Search (SS) [4], BING [15], and Edge Boxes (EB) [9]. Fig. 2 shows recall as a function of the number of candidates. 
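The oracle-recall metric just described can be sketched as follows: for each GT box take its best-overlapping proposal, and count the box as recalled if that IoU meets the class threshold (0.7 for cars, 0.5 for pedestrians and cyclists). The toy IoU matrix below is made up for illustration.

```python
import numpy as np

def oracle_recall(iou, thresh):
    """iou[i, j] = IoU between GT box i and proposal j.
    Returns the fraction of GT boxes whose best proposal reaches `thresh`."""
    if iou.size == 0:
        return 0.0
    best = iou.max(axis=1)  # "best proposal" per GT box
    return float((best >= thresh).mean())
```

Sweeping the number of proposals (or the threshold) while recomputing this quantity yields exactly the curves reported in Figs. 2 and 3.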
We can see that by using 1,000 proposals we achieve around 90% recall for Cars in the moderate and hard regimes, while for easy we need only 200 candidates to get the same recall. Notice that other methods saturate or require orders of magnitude more candidates to reach 90% recall. For Pedestrians and Cyclists our results show similar improvements over the baselines. Note that while we use depth-based features, MCG-D uses both depth and appearance based features, and all other methods use only appearance features. This shows the importance of 3D information in the autonomous driving scenario. Furthermore, the other methods use class-agnostic proposals to generate the candidates, whereas we generate them based on the object class. This allows us to achieve higher recall values by exploiting size priors tailored to each class. Fig. 3 shows recall for 500 proposals as a function of the IoU overlap. Our approach significantly outperforms the baselines, particularly for Cyclists.\n\nFigure 4: Qualitative results for the Car class. We show the original image, 100 top scoring proposals, ground-truth 3D boxes, and our best set of proposals that cover the ground-truth.\n\nFigure 5: Qualitative examples for the Pedestrian class.\n\nRunning Time: Table 3 shows the running time of different proposal methods. Our approach is fairly efficient and can compute all features and proposals in 1.2s on a single core.\n\nTable 3: Running time of different proposal methods.\n\nMethod | BING | Selective Search | Edge Boxes (EB) | MCG | MCG-D | Ours\nTime (seconds) | 0.01 | 15 | 1.5 | 100 | 160 | 1.2\n\nQualitative Results: Figs. 4 and 5 show qualitative results for cars and pedestrians. 
We show the input RGB image, the top 100 proposals, the GT boxes in 3D, as well as the proposals from our method with the best 3D IoU (chosen among 2000 proposals). Our method produces very precise proposals even for the more difficult (far away or occluded) objects.\nObject Detection: To evaluate our full object detection pipeline, we report results on the test set of the KITTI benchmark. The results are presented in Table 1. Our approach outperforms all the competitors significantly across all categories. In particular, we achieve 12.19%, 6.32% and 10.22% improvements in AP for Cars, Pedestrians, and Cyclists in the moderate setting.\nObject Orientation Estimation: Average Orientation Similarity (AOS) [11] is used as the evaluation metric in the object detection and orientation estimation task. Results on the KITTI test set are shown in Table 2. Our approach again outperforms all approaches by a large margin. In particular, our approach achieves ~12% higher scores than 3DVP [12] on Cars in the moderate and hard regimes. The improvements on Pedestrians and Cyclists are even more significant, as they are more than 20% higher than the second best method.\nSuppl. material: We refer the reader to the supplementary material for many additional results.\n\n5 Conclusion\n\nWe have presented a novel approach to object proposal generation in the context of autonomous driving. In contrast to most existing work, we take advantage of stereo imagery and reason directly in 3D. We formulate the problem as inference in a Markov random field encoding object size priors, the ground plane and a variety of depth informed features. Our approach significantly outperforms existing state-of-the-art object proposal methods on the challenging KITTI benchmark. In particular, for 2K proposals our approach achieves a 25% higher recall than the state-of-the-art RGB-D method MCG-D [14]. 
Combined with CNN scoring, our method significantly outperforms all previously published object detection results for all three object classes on the KITTI benchmark [11].\n\nAcknowledgements: The work was partially supported by NSFC 61171113, NSERC and Toyota Motor Corporation.\n\nReferences\n\n[1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 2010.\n[2] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.\n[3] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In arXiv:1409.1556, 2014.\n[4] K. van de Sande, J. Uijlings, T. Gevers, and A. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.\n[5] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.\n[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.\n[7] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. SegDeepM: Exploiting segmentation and context in deep neural networks for object detection. In CVPR, 2015.\n[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results.\n[9] L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.\n[10] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 34(7):1312–1328, 2012.\n[11] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.\n[12] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. 
Data-driven 3d voxel patterns for object category recognition.\n\nIn CVPR, 2015.\n\n[13] C. Long, X. Wang, G. Hua, M. Yang, and Y. Lin. Accurate object detection with location relaxation and\n\nregionlets relocalization. In ACCV, 2014.\n\n[14] S. Gupta, R. Girshick, P. Arbelaez, and J. Malik. Learning rich features from RGB-D images for object\n\ndetection and segmentation. In ECCV. 2014.\n\n8\n\n\f[15] M. Cheng, Z. Zhang, M. Lin, and P. Torr. BING: Binarized normed gradients for objectness estimation at\n\n300fps. In CVPR, 2014.\n\n[16] T. Lee, S. Fidler, and S. Dickinson. A learning framework for generating region proposals with mid-level\n\ncues. In ICCV, 2015.\n\n[17] D. Banica and C Sminchisescu. Cpmc-3d-o2p: Semantic segmentation of rgb-d images using cpmc and\n\nsecond order pooling. In CoRR abs/1312.7715, 2013.\n\n[18] D. Lin, S. Fidler, and R. Urtasun. Holistic scene understanding for 3d object detection with rgbd cameras.\n\nIn ICCV, 2013.\n\n[19] A. Karpathy, S. Miller, and Li Fei-Fei. Object discovery in 3d scenes via shape analysis. In ICRA, 2013.\n[20] D. Oneata, J. Revaud, J. Verbeek, and C. Schmid. Spatio-temporal object detection proposals. In ECCV,\n\n2014.\n\n[21] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pool-\n\ning. In ECCV. 2012.\n\n[22] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun. Bottom-up segmentation for top-down detection. In\n\nCVPR, 2013.\n\n[23] B. Alexe, T. Deselares, and V. Ferrari. Measuring the objectness of image windows. PAMI, 2012.\n[24] J. Hosang, R. Benenson, P. Doll\u00b4ar, and B. Schiele. What makes for effective detection proposals?\n\narXiv:1502.05082, 2015.\n\n[25] S. Song and J. Xiao. Sliding shapes for 3d object detection in depth images. In ECCV. 2014.\n[26] M. Zia, M. Stark, and K. Schindler. Towards scene understanding with detailed 3d object representations.\n\nIJCV, 2015.\n\n[27] B. Pepik, M. Stark, P. Gehler, and B. Schiele. 
Multi-view and 3d deformable part models. PAMI, 2015.\n[28] E. Ohn-Bar and M. M. Trivedi. Learning to detect vehicles by clustering appearance patterns.\n\nIEEE\n\nTransactions on Intelligent Transportation Systems, 2015.\n\n[29] S. Wang, S. Fidler, and R. Urtasun. Holistic 3d scene understanding from a single geo-tagged image. In\n\nCVPR, 2015.\n\n[30] P. Doll\u00b4ar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. PAMI, 2014.\n[31] K. Yamaguchi, D. McAllester, and R. Urtasun. Ef\ufb01cient joint segmentation, occlusion labeling, stereo\n\nand \ufb02ow estimation. In ECCV, 2014.\n\n[32] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Learning for Interdependent\n\nand Structured Output Spaces. In ICML, 2004.\n\n[33] A. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun. Box in the box: Joint 3d layout and object reasoning\n\nfrom single images. In ICCV, 2013.\n\n[34] Ross Girshick. Fast R-CNN. In ICCV, 2015.\n[35] A. Geiger, C. Wojek, and R. Urtasun. Joint 3d estimation of objects and scene layout. In NIPS, 2011.\n[36] R. Benenson, M. Mathias, T. Tuytelaars, and L. Van Gool. Seeking the strongest rigid detector. In CVPR,\n\n2013.\n\n[37] J. Yebes, L. Bergasa, R. Arroyo, and A. Lzaro. Supervised learning and evaluation of KITTI\u2019s cars\n\ndetector with DPM. In IV, 2014.\n\n[38] B. Pepik, M. Stark, P. Gehler, and B. Schiele. Occlusion patterns for object class detection. In CVPR,\n\n2013.\n\n[39] B. Li, T. Wu, and S. Zhu. Integrating context and occlusion for car detection by hierarchical and-or model.\n\nIn ECCV, 2014.\n\n[40] J. Xu, S. Ramos, D. Vozquez, and A. Lopez. Hierarchical Adaptive Structural SVM for Domain Adapta-\n\ntion. In arXiv:1408.5400, 2014.\n\n[41] C. Premebida, J. Carreira, J. Batista, and U. Nunes. Pedestrian detection combining rgb and dense lidar\n\ndata. In IROS, 2014.\n\n[42] J. Hosang, M. Omran, R. Benenson, and B. Schiele. Taking a deeper look at pedestrians. 
In arXiv, 2015.\n[43] S. Zhang, R. Benenson, and B. Schiele. Filtered channel features for pedestrian detection. In arXiv:1501.05759, 2015.\n[44] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Pedestrian detection with spatially pooled features and structured ensemble learning. In arXiv:1409.5209, 2014.\n[45] A. Gonzalez, G. Villalonga, J. Xu, D. Vazquez, J. Amores, and A. Lopez. Multiview random forest of local experts combining rgb and lidar data for pedestrian detection. In IV, 2015.\n", "award": [], "sourceid": 331, "authors": [{"given_name": "Xiaozhi", "family_name": "Chen", "institution": "Tsinghua University"}, {"given_name": "Kaustav", "family_name": "Kundu", "institution": "University of Toronto"}, {"given_name": "Yukun", "family_name": "Zhu", "institution": "University of Toronto"}, {"given_name": "Andrew", "family_name": "Berneshawi", "institution": "University of Toronto"}, {"given_name": "Huimin", "family_name": "Ma", "institution": "Tsinghua University"}, {"given_name": "Sanja", "family_name": "Fidler", "institution": "University of Toronto"}, {"given_name": "Raquel", "family_name": "Urtasun", "institution": "University of Toronto"}]}