{"title": "PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points", "book": "Advances in Neural Information Processing Systems", "page_first": 8905, "page_last": 8917, "abstract": "Detecting 3D objects from a single RGB image is intrinsically ambiguous, thus requiring appropriate prior knowledge and intermediate representations as constraints to reduce the uncertainties and improve the consistencies between the 2D image plane and the 3D world coordinate. To address this challenge, we propose to adopt perspective points as a new intermediate representation for 3D object detection, defined as the 2D projections of local Manhattan 3D keypoints to locate an object; these perspective points satisfy geometric constraints imposed by the perspective projection. We further devise PerspectiveNet, an end-to-end trainable model that simultaneously detects the 2D bounding box, 2D perspective points, and 3D object bounding box for each object from a single RGB image. PerspectiveNet yields three unique advantages: (i) 3D object bounding boxes are estimated based on perspective points, bridging the gap between 2D and 3D bounding boxes without the need of category-specific 3D shape priors. (ii) It predicts the perspective points by a template-based method, and a perspective loss is formulated to maintain the perspective constraints. (iii) It maintains the consistency between the 2D perspective points and 3D bounding boxes via a differentiable projective function. 
Experiments on SUN RGB-D dataset show that the proposed method significantly outperforms existing RGB-based approaches for 3D object detection.", "full_text": "PerspectiveNet: 3D Object Detection from\na Single RGB Image via Perspective Points\n\nSiyuan Huang\n\nDepartment of Statistics\nhuangsiyuan@ucla.edu\n\nYixin Chen\n\nDepartment of Statistics\nethanchen@ucla.edu\n\nTao Yuan\n\nDepartment of Statistics\n\ntaoyuan@ucla.edu\n\nSiyuan Qi\n\nDepartment of Computer Science\n\nsyqi@cs.ucla.edu\n\nYixin Zhu\n\nDepartment of Statistics\nyixin.zhu@ucla.edu\n\nSong-Chun Zhu\n\nDepartment of Statistics\nsczhu@stat.ucla.edu\n\nAbstract\n\nDetecting 3D objects from a single RGB image is intrinsically ambiguous, thus re-\nquiring appropriate prior knowledge and intermediate representations as constraints\nto reduce the uncertainties and improve the consistencies between the 2D image\nplane and the 3D world coordinate. To address this challenge, we propose to adopt\nperspective points as a new intermediate representation for 3D object detection,\nde\ufb01ned as the 2D projections of local Manhattan 3D keypoints to locate an object;\nthese perspective points satisfy geometric constraints imposed by the perspective\nprojection. We further devise PerspectiveNet, an end-to-end trainable model that\nsimultaneously detects the 2D bounding box, 2D perspective points, and 3D object\nbounding box for each object from a single RGB image. PerspectiveNet yields\nthree unique advantages: (i) 3D object bounding boxes are estimated based on\nperspective points, bridging the gap between 2D and 3D bounding boxes without\nthe need of category-speci\ufb01c 3D shape priors. (ii) It predicts the perspective points\nby a template-based method, and a perspective loss is formulated to maintain\nthe perspective constraints. 
(iii) It maintains the consistency between the 2D per-\nspective points and 3D bounding boxes via a differentiable projective function.\nExperiments on SUN RGB-D dataset show that the proposed method signi\ufb01cantly\noutperforms existing RGB-based approaches for 3D object detection.\n\n1 Introduction\n\nIf one hopes to achieve a full understanding of a system as complicated as a\nnervous system, . . . , or even a large computer program, then one must be prepared\nto contemplate different kinds of explanation at different levels of description that\nare linked, at least in principle, into a cohesive whole, even if linking the levels in\ncomplete details is impractical.\n\u2014 David Marr [1], pp. 20\u201321\n\nIn a classic view of computer vision, David Marr [1] conjectured that the perception of a 2D image\nis an explicit multi-phase information process, involving (i) an early vision system of perceiving\ntextures [2, 3] and textons [4, 5] to form a primal sketch as a perceptually lossless conversion from\nthe raw image [6, 7], (ii) a mid-level vision system to construct 2.1D (multiple layers with partial\nocclusion) [8\u201310] and 2.5D [11] sketches, and (iii) a high-level vision system that recovers the full\n3D [12\u201314]. In particular, he highlighted the importance of different levels of organization and the\ninternal representation [15].\nIn parallel, the school of Gestalt Laws [16\u201323] and perceptual organization [24, 25] aims to resolve\nthe 3D reconstruction problem from a single RGB image without forming the depth cues; but rather,\nthey often use some sorts of priors\u2014groupings and structural cues [26, 27] that are likely to be\ninvariant over wide ranges of viewpoints [28], resulting in the birth of the SIFT feature [29]. 
Later,\nfrom a Bayesian perspective at a scene level, such priors, independent of any 3D scene structures, were found in human-made scenes, known as the Manhattan World assumption [30]. Importantly, further studies found that such priors help to improve object detection [31].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Traditional 3D object detection methods directly estimate (c) the 3D object bounding boxes from (a) the 2D bounding boxes, which suffer from the uncertainties between the 2D image plane and the 3D world. The proposed PerspectiveNet utilizes (b) the 2D perspective points as the intermediate representation to bridge the gap. The perspective points are the 2D perspective projections of the 3D bounding box corners, containing rich 3D information (e.g., positions, orientations). The red dots indicate the perspective points of the bed that are difficult to infer from visual features alone, but can be inferred from the context (correlations and topology) among the other perspective points.\nIn this paper, inspired by these two classic schools in computer vision, we seek to test the following two hypotheses using modern computer vision methods: (i) Could an intermediate representation facilitate modern computer vision tasks? (ii) Is such an intermediate representation a better and more invariant prior compared to the priors obtained directly from specific tasks?\nIn particular, we tackle the challenging task of 3D object detection from a single RGB image. Despite the recent success in 2D scene understanding (e.g., [32, 33]), there is still a significant performance gap for 3D computer vision tasks based on a single 2D image. 
Recent approaches directly regress the 3D bounding boxes [34\u201336] or reconstruct the 3D objects with specific 3D object priors [37\u201340]. In contrast, we propose an end-to-end trainable framework, PerspectiveNet, that sequentially estimates the 2D bounding box, 2D perspective points, and 3D bounding box for each object under a local Manhattan assumption [41], in which the perspective points serve as the intermediate representation, defined as the 2D projections of local Manhattan 3D keypoints that locate an object.\nThe proposed method offers three unique advantages. First, the use of perspective points as the intermediate representation bridges the gap between 2D and 3D bounding boxes without utilizing any extra category-specific 3D shape priors. As shown in Figure 1, it is often challenging for learning-based methods to estimate the 3D bounding boxes from 2D images directly; regressing 3D bounding boxes from 2D input is a highly under-constrained problem and can be easily influenced by appearance variations in shape, texture, lighting, and background. To alleviate this issue, we adopt the perspective points as an intermediate representation of the local Manhattan frame that each 3D object aligns with. Intuitively, the perspective points of an object are 3D geometric constraints in the 2D space. More specifically, the 2D perspective points for each object are defined as the perspective projection of the 3D object bounding box (concatenated with its center), and each 3D box aligns with a 3D local Manhattan frame. 
These perspective points are fused into the 3D branch to predict the 3D attributes of the 3D bounding boxes.\nSecond, we devise a template-based method to efficiently and robustly estimate the perspective points. Existing methods [42\u201344, 33, 45] usually exploit heatmaps or probability distribution maps as the representation to learn the locations of visual points (e.g., object keypoints, human skeletons, room layouts), relying heavily on view-dependent visual features and thus insufficient to resolve occlusions or large rotation/viewpoint changes in complex scenes; see an example in Figure 1 (b), where the five perspective points (in red) are difficult to infer from visual features alone but can be inferred from the correlations and topology among the other perspective points. To tackle this problem, we treat each set of 2D perspective points as a low-dimensional embedding of its corresponding set of 3D points with a constant topology; such an embedding is learned by predicting the perspective points as a mixture of sparse templates. A perspective loss is formulated to impose the perspective constraints; the details are described in \u00a7 3.2.\nThird, the consistency between the 2D perspective points and 3D bounding boxes can be maintained by a differentiable projective function; the model is end-to-end trainable, from the 2D region proposals, to the 2D bounding boxes, to the 2D perspective points, and to the 3D bounding boxes.\nIn the experiments, we show that the proposed PerspectiveNet outperforms previous methods by a large margin on the SUN RGB-D dataset [46], demonstrating its efficacy on 3D object detection.\n\n\f2 Related Work\n\n3D object detection from a single image Detecting 3D objects from a single RGB image is a challenging problem, particularly due to its intrinsic ambiguity. 
Existing methods can be categorized into three streams: (i) geometry-based methods that estimate the 3D bounding boxes with geometry and 3D world priors [47\u201351]; (ii) learning-based methods that incorporate category-specific 3D shape priors [52, 38, 40] or extra 2.5D information (depth, surface normals, and segmentation) [37, 39, 53] to detect 3D bounding boxes or reconstruct the 3D object shape; and (iii) deep learning methods that directly estimate the 3D object bounding boxes from 2D bounding boxes [54, 34\u201336]. To make better estimations, various techniques have been devised to enforce consistencies between the estimated 3D and the input 2D image. Huang et al. [36] proposed a two-stage method to learn the 3D objects and 3D layout cooperatively. Kundu et al. [37] proposed a 3D object detection and reconstruction method using category-specific object shape priors by render-and-compare. Different from these methods, the proposed PerspectiveNet is a one-stage end-to-end trainable 3D object detection framework using perspective points as an intermediate representation; the perspective points naturally bridge the gap between the 2D and 3D bounding boxes without any extra annotations, category-specific 3D shape priors, or 2.5D maps.\n\nManhattan World assumption Human-made environments, from the layout of a city to structures such as buildings, rooms, furniture, and many other objects, can be viewed as sets of parallel and orthogonal planes, known as the Manhattan World (MW) assumption [31]. Formally, it indicates that most human-made structures can be approximated by planar surfaces that are parallel to one of the three principal planes of a common orthogonal coordinate system. This strict Manhattan World assumption was later extended to a Mixture of Manhattan Frames (MMF) [55] to represent more complex real-world scenes (e.g., city layouts, rotated objects). 
In the literature, MW and MMF have been adopted in vanishing point (VP) estimation and camera calibration [56, 57], orientation estimation [58\u201360], layout estimation [61\u201364, 44], and 3D scene reconstruction [65\u201367, 41, 68, 69]. In this work, we extend the MW to a local Manhattan assumption, where the cuboids are aligned with the vertical (gravity) direction but have arbitrary horizontal orientations (also see Xiao and Furukawa [41]), and perspective points are adopted as the intermediate representation for 3D object detection.\n\nIntermediate 3D representation Intermediate 3D representations are bridges that narrow the gap and maintain the consistency between the 2D image plane and the 3D world. Among them, 2.5D sketches have been broadly used in reconstructing 3D shapes [70\u201372] and 3D scenes [73, 38]. Other recent alternative intermediate 3D representations include: (i) Wu et al. [74] use pre-annotated and category-specific object keypoints as an intermediate representation, and (ii) Tekin et al. [75] use the projected corners of 3D bounding boxes in learning the 6D object pose. In this paper, we explore the perspective points as an intermediate representation between 2D and 3D bounding boxes, and provide an efficient learning framework for 3D object detection.\n\n3 Learning Perspective Points for 3D Object Detection\n\n3.1 Overall Architecture\n\nAs shown in Figure 2, the proposed PerspectiveNet contains a backbone architecture for feature extraction over the entire image, a region proposal network (RPN) [32] that proposes regions of interest (RoIs), and a network head including three region-wise parallel branches. 
For each proposed box, its RoI feature is fed into the three network branches to predict: (i) the object class and the 2D bounding box offset, (ii) the 2D perspective points (projected 3D box corners and object center) as a weighted sum of predicted perspective templates, and (iii) the 3D box size, orientation, and its distance from the camera. Detected 3D boxes are reconstructed from the projected object center, distance, box size, and rotation. The overall architecture of the PerspectiveNet resembles the R-CNN structure, and we refer readers to [32, 76, 33] for more details of training R-CNN detectors.\nDuring training, we define a multi-task loss on each proposed RoI as\n\nL = Lcls + L2D + Lpp + Lp + L3D + Lproj,\n\n(1)\n\nwhere the classification loss Lcls and 2D bounding box loss L2D belong to the 2D bounding box branch and are identical to those defined in 2D R-CNNs [32, 33]. Lpp and Lp are defined on the perspective point branch (\u00a7 3.2), L3D is defined on the 3D bounding box branch (see \u00a7 3.3), and Lproj maintains the 2D-3D projection consistency (see \u00a7 3.4).\n\n\fFigure 2: The framework of the proposed PerspectiveNet. Given an RGB image, the backbone of PerspectiveNet extracts global features and proposes candidate 2D bounding boxes (RoIs). For each proposed box, its RoI feature is fed into three network branches to predict: (i) the object class and the 2D box offset, (ii) 2D perspective templates (projected 3D box corners and object center) and the corresponding coefficients, and (iii) the 3D box size, orientation, and its distance from the camera. Detected 3D boxes are reconstructed from the projected object center, distance, box size, and rotation. 
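As a minimal sketch of the multi-task loss in Eq. (1), the per-RoI objective is a sum of the six terms; the function and argument names below are illustrative, and any per-term weighting a practical implementation might apply is omitted:

```python
def multitask_loss(l_cls, l_2d, l_pp, l_p, l_3d, l_proj):
    """Per-RoI multi-task loss of Eq. (1): the classification, 2D box,
    perspective-point, perspective, 3D box, and 2D-3D projection
    consistency terms, combined as an unweighted sum."""
    return l_cls + l_2d + l_pp + l_p + l_3d + l_proj
```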
By projecting the detected 3D boxes to 2D and comparing them with the 2D perspective points, the network imposes and learns a consistency between the 2D inputs and 3D estimations.\n\n3.2 Perspective Point Estimation\n\nThe perspective point branch estimates the set of 2D perspective points for each RoI. Formally, the 2D perspective points of an object are the 2D projections of the local Manhattan 3D keypoints that locate that object, and they satisfy certain geometric constraints imposed by the perspective projection. In our case, the perspective points (Figure 1(b)) include the 2D projections of the 3D bounding box corners and the 3D object center. The perspective points are predicted using a template-based regression and learned with a mean squared error and a perspective loss, detailed below.\n\n3.2.1 Template-based Regression\n\nMost of the existing methods [42\u201344, 33, 45] estimate visual keypoints with heatmaps, where each map predicts the location of a certain keypoint. However, predicting perspective points by heatmaps has two major problems: (i) Heatmap prediction for different keypoints is independent, thus failing to capture the topological correlations among the perspective points. (ii) Heatmap prediction for each keypoint relies heavily on visual features such as corners, which may be difficult to detect (see an example in Figure 1(b)). In contrast, each set of 2D perspective points can be treated as a low-dimensional embedding of a set of 3D points with a particular topology; inferring such points thus relies more on the relations and topology among the points than on visual features alone.\nTo tackle these problems, we avoid dense per-pixel predictions. Instead, we estimate the perspective points by a mixture of sparse templates [77, 78]. The sparse templates are more robust when facing unfamiliar scenes or objects. 
Ablative experiments show that the proposed template-based method provides a more accurate estimation of perspective points than heatmap-based methods; see \u00a7 5.1.\nSpecifically, we project both the 3D object center and the eight 3D bounding box corners to 2D with the camera parameters to generate the ground-truth 2D perspective points Pgt \u2208 R2\u00d79. Since a portion of the perspective points usually lies outside the RoI, we calculate the locations of the perspective points within an RoI extended to double its size and normalize the locations to [0, 1].\nWe predict the perspective points by a linear combination of templates; see Figure 3. The perspective point branch has a C \u00d7 K \u00d7 2 \u00d7 9 dimensional output for the templates T , and a C \u00d7 K dimensional output for the coefficients w, where K denotes the number of templates for each class and C denotes the number of object classes. The templates T are scaled to [0, 1] by a sigmoid nonlinearity, and the coefficients w are normalized by a softmax function. 
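The ground-truth construction above can be sketched as follows, assuming a pinhole camera with intrinsics K and boxes given in the camera frame; the helper name and argument layout are assumptions for illustration, not the authors' code:

```python
import numpy as np

def gt_perspective_points(corners3d, center3d, K, roi):
    """Project the 3D box center and its 8 corners to 2D, then normalize
    inside an RoI extended to double its size (as described in Sec. 3.2.1).

    corners3d: (8, 3) box corners in the camera frame
    center3d:  (3,) box center in the camera frame
    K:         3x3 pinhole intrinsics
    roi:       (x0, y0, x1, y1) 2D box in pixels
    Returns a (2, 9) array of normalized perspective points.
    """
    pts3d = np.vstack([center3d, corners3d])   # center first, then 8 corners
    uvw = K @ pts3d.T                          # homogeneous projection, (3, 9)
    uv = uvw[:2] / uvw[2]                      # pixel coordinates, (2, 9)
    x0, y0, x1, y1 = roi
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = (x1 - x0) * 2.0, (y1 - y0) * 2.0    # doubled RoI extent
    uv[0] = (uv[0] - (cx - w / 2)) / w         # normalize to [0, 1]
    uv[1] = (uv[1] - (cy - h / 2)) / h
    return uv
```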
The estimated perspective points \u02c6P \u2208 RC\u00d72\u00d79 can be computed by a linear combination:\n\n\u02c6Pi = \u2211_{k=1}^{K} wik Tik, \u2200i = 1, \u00b7\u00b7\u00b7 , C.\n\n(2)\n\n\fFigure 3: Perspective point estimation. (a) The perspective points are estimated by a mixture of templates through a linear combination. Each template encodes geometric cues including orientations and viewpoints. (b) The perspective loss enforces each set of 2D perspective points to be the perspective projection of a (vertical) 3D cuboid. For a vertical cuboid, the projected vertical edges (i.e., ae, bf, cg, and dh) should be parallel or near parallel (under small camera tilting angles). For 3D parallel lines that are perpendicular to the gravity direction, the vanishing points of their 2D projections should coincide (e.g., u1 and u2).\nThe template design is both class-specific and instance-specific: (i) Class-specific: we decouple the prediction of the perspective points and the object class, allowing the network to learn perspective points for every class without competition among classes. (ii) Instance-specific: the templates are inferred for each RoI; hence, they are specific to each object instance. The templates are automatically learned for each object instance from data with the end-to-end learning framework; thus, both the templates and coefficients for each instance are optimizable and can better fit the training data.\nThe average mean squared error (MSE) loss is defined as Lpp = MSE(\u02c6Pc, Pgt). For an RoI associated with ground-truth class c, Lpp is only defined on class c's perspective points during training; perspective point outputs from other classes do not contribute to the loss. 
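With the shapes above (templates T of size C \u00d7 K \u00d7 2 \u00d7 9, coefficients w of size C \u00d7 K), the linear combination of Eq. (2) reduces to a per-class weighted sum; a sketch assuming raw network logits as inputs:

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mix_perspective_points(raw_templates, raw_coeffs):
    """Eq. (2): P_hat[i] = sum_k w[i, k] * T[i, k] for each class i.

    raw_templates: (C, K, 2, 9) unbounded network output
    raw_coeffs:    (C, K)       unbounded network output
    Returns (C, 2, 9) perspective points, each coordinate in [0, 1].
    """
    T = 1.0 / (1.0 + np.exp(-raw_templates))   # sigmoid: templates in [0, 1]
    w = _softmax(raw_coeffs)                   # coefficients sum to 1 per class
    return np.einsum('ck,ckij->cij', w, T)
```

Because the coefficients form a convex combination of sigmoid outputs, the mixed points stay in [0, 1] by construction.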
In inference, we rely on the dedicated classification branch to predict the class label, which selects the output perspective points.\n\n3.2.2 Perspective Loss\n\nUnder the assumption that each 3D bounding box aligns with a local Manhattan frame, we regularize the estimation of the perspective points to satisfy the constraints of perspective projection. Each set of mutually parallel lines in 3D is projected into 2D as intersecting lines; see Figure 3 (b). These intersecting lines should converge at the same vanishing point. Therefore, the desired algorithm should penalize the distance between the intersection points of the two sets of intersecting lines. For the example in Figure 3 (b), we select line ad and line eh as one pair of lines, bc and fg as another, and compute the distance between their intersection points u1 and u2. Additionally, since we assume each 3D local Manhattan frame aligns with the vertical (gravity) direction, we enforce the edges along the gravity direction (i.e., ae, bf, cg, and dh) to be parallel by penalizing large slope variance.\nThe perspective loss is computed as Lp = Ld1 + Ld2 + Lgrav, where Lgrav penalizes the slope variance along the gravity direction, and Ld1 and Ld2 penalize the intersection-point distances for the two horizontal directions perpendicular to the gravity direction.\n\n3.3 3D Bounding Box Estimation\n\nEstimating 3D bounding boxes is a two-step process. In the first step, the 3D branch estimates the 3D attributes, including the distance between the camera center and the 3D object center, as well as the 3D size and orientation, following Huang et al. [36]. Since the perspective point branch encodes rich 3D geometric features, the 3D attribute estimator aggregates the features from the perspective point branch with a soft gating function valued in [0, 1] to improve the prediction. 
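The perspective loss of \u00a7 3.2.2 can be sketched as follows; the corner labeling follows Figure 3 (b), while the exact penalty forms (squared vanishing-point distances and the variance of edge slope angles) are assumptions, since the paper specifies only the roles of Ld1, Ld2, and Lgrav:

```python
import numpy as np

def _intersect(p1, p2, p3, p4):
    # Intersection of lines (p1, p2) and (p3, p4) via homogeneous coordinates.
    l1 = np.cross([*p1, 1.0], [*p2, 1.0])
    l2 = np.cross([*p3, 1.0], [*p4, 1.0])
    x = np.cross(l1, l2)
    return x[:2] / x[2]

def perspective_loss(pts):
    """pts maps corner names a..h (Figure 3(b)) to 2D points.
    Lp = Ld1 + Ld2 + Lgrav: vanishing-point consistency for the two
    horizontal directions, plus slope variance of the vertical edges."""
    a, b, c, d, e, f, g, h = (np.asarray(pts[k], float) for k in 'abcdefgh')
    u1, u2 = _intersect(a, d, e, h), _intersect(b, c, f, g)   # direction 1
    v1, v2 = _intersect(a, b, e, f), _intersect(d, c, h, g)   # direction 2
    l_d1 = np.sum((u1 - u2) ** 2)
    l_d2 = np.sum((v1 - v2) ** 2)
    # vertical edges ae, bf, cg, dh should stay (near) parallel
    slopes = [np.arctan2(*(q - p)[::-1]) for p, q in ((a, e), (b, f), (c, g), (d, h))]
    return l_d1 + l_d2 + np.var(slopes)
```

For the exact projection of an upright cuboid, both intersection points of each pair coincide at the direction's vanishing point and the vertical edges share one slope, so the loss vanishes.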
The gating function serves as a soft-attention mechanism that decides how much information from the perspective points should contribute to the 3D prediction.\nIn the second step, with the estimated projected 3D bounding box center (i.e., the first estimated perspective point) and the 3D attributes, we compose the 3D bounding boxes by the inverse projection from the 2D image plane to the 3D world, following Huang et al. [36], given the camera parameters.\nThe 3D loss is computed as the sum of the individual losses on the 3D attributes and a joint loss on the 3D bounding box: L3D = Ldis + Lsize + Lori + Lbox3d.\n
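The second-step composition can be sketched as below, assuming a pinhole camera and a yaw-only rotation about the gravity (y) axis; this parameterization is an illustrative assumption consistent with the local Manhattan setting, not the exact formulation of Huang et al. [36]:

```python
import numpy as np

def compose_box3d(center_2d, distance, size, yaw, K):
    """Compose a 3D box from the branch outputs: back-project the predicted
    2D center along its camera ray to the predicted distance, then place
    size- and yaw-parameterized corners around it.

    center_2d: (u, v) projected 3D box center in pixels
    distance:  predicted camera-to-center distance
    size:      (w, h, l) box dimensions
    yaw:       rotation about the gravity (y) axis
    K:         3x3 camera intrinsics
    Returns an (8, 3) array of box corners in the camera frame.
    """
    ray = np.linalg.inv(K) @ np.array([center_2d[0], center_2d[1], 1.0])
    center = distance * ray / np.linalg.norm(ray)   # inverse projection
    w, h, l = size
    dx, dy, dz = w / 2, h / 2, l / 2
    # local Manhattan-aligned corners, rotated by yaw and translated
    local = np.array([[sx * dx, sy * dy, sz * dz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return center + local @ R.T
```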
M5KH/1hzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx6oxcWGVIwljbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uw2s/EypJkSu2XBSmkmBM5n+TodCcoZxaQpkW9lbCxlRThjadkg3BW315nbQva55b8+6uKo1qHkcRzuAcquBBHRpwC01oAYMRPMMrvDnSeXHenY9la8HJZ07hD5zPH17MjcI=AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBahp5KIUI8FLx4r2g9oQ9lsN+3SzSbsTpQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IJHCoOt+O4WNza3tneJuaW//4PCofHzSNnGqGW+xWMa6G1DDpVC8hQIl7yaa0yiQvBNMbuZ+55FrI2L1gNOE+xEdKREKRtFK908DNShX3Jq7AFknXk4qkKM5KH/1hzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx6oxcWGVIwljbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uw2s/EypJkSu2XBSmkmBM5n+TodCcoZxaQpkW9lbCxlRThjadkg3BW315nbQva55b8+6uKo1qHkcRzuAcquBBHRpwC01oAYMRPMMrvDnSeXHenY9la8HJZ07hD5zPH17MjcI=AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBahp5KIUI8FLx4r2g9oQ9lsN+3SzSbsTpQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IJHCoOt+O4WNza3tneJuaW//4PCofHzSNnGqGW+xWMa6G1DDpVC8hQIl7yaa0yiQvBNMbuZ+55FrI2L1gNOE+xEdKREKRtFK908DNShX3Jq7AFknXk4qkKM5KH/1hzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx6oxcWGVIwljbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uw2s/EypJkSu2XBSmkmBM5n+TodCcoZxaQpkW9lbCxlRThjadkg3BW315nbQva55b8+6uKo1qHkcRzuAcquBBHRpwC01oAYMRPMMrvDnSeXHenY9la8HJZ07hD5zPH17MjcI=AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBahp5KIUI8FLx4r2g9oQ9lsN+3SzSbsTpQS+hO8eFDEq7/Im//GbZuDtj4YeLw3w8y8IJHCoOt+O4WNza3tneJuaW//4PCofHzSNnGqGW+xWMa6G1DDpVC8hQIl7yaa0yiQvBNMbuZ+55FrI2L1gNOE+xEdKREKRtFK908DNShX3Jq7AFknXk4qkKM5KH/1hzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx6oxcWGVIwljbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uw2s/EypJkSu2XBSmkmBM5n+TodCcoZxaQpkW9lbCxlRThjadkg3BW315nbQva55b8+6uKo1qHkcRzuAcquBBHRpwC01oAYMRPMMrvDnSeXHenY9la8HJZ07hD5zPH17MjcI=+AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHa
bOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==+AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5
qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==+AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabO
MlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==+AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz
+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBahIJREBD0WvHhswX5AG8pmO2nXbjZhdyOU0F/gxYMiXv1J3vw3btsctPXBwOO9GWbmBYng2rjut1PY2Nza3inulvb2Dw6PyscnbR2nimGLxSJW3YBqFFxiy3AjsJsopFEgsBNM7uZ+5wmV5rF8MNME/YiOJA85o8ZKzctBueLW3AXIOvFyUoEcjUH5qz+MWRqhNExQrXuemxg/o8pwJnBW6qcaE8omdIQ9SyWNUPvZ4tAZubDKkISxsiUNWai/JzIaaT2NAtsZUTPWq95c/M/rpSa89TMuk9SgZMtFYSqIicn8azLkCpkRU0soU9zeStiYKsqMzaZkQ/BWX14n7aua59a85nWlXs3jKMIZnEMVPLiBOtxDA1rAAOEZXuHNeXRenHfnY9lacPKZU/gD5/MHabOMlQ==...AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBahp5CIoMeCF48V7Qe0oWy2m3bpZhN2J0IJ/QlePCji1V/kzX/jts1BWx8MPN6bYWZemEph0PO+ndLG5tb2Tnm3srd/cHhUPT5pmyTTjLdYIhPdDanhUijeQoGSd1PNaRxK3gknt3O/88S1EYl6xGnKg5iOlIgEo2ilB9d1B9Wa53oLkHXiF6QGBZqD6ld/mLAs5gqZpMb0fC/FIKcaBZN8VulnhqeUTeiI9yxVNOYmyBenzsiFVYYkSrQthWSh/p7IaWzMNA5tZ0xxbFa9ufif18swuglyodIMuWLLRVEmCSZk/jcZCs0ZyqkllGlhbyVsTDVlaNOp2BD81ZfXSfvS9T3Xv7+qNepFHGU4g3Oogw/X0IA7aEILGIzgGV7hzZHOi/PufCxbS04xcwp/4Hz+AEQSjQg=AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBahp5CIoMeCF48V7Qe0oWy2m3bpZhN2J0IJ/QlePCji1V/kzX/jts1BWx8MPN6bYWZemEph0PO+ndLG5tb2Tnm3srd/cHhUPT5pmyTTjLdYIhPdDanhUijeQoGSd1PNaRxK3gknt3O/88S1EYl6xGnKg5iOlIgEo2ilB9d1B9Wa53oLkHXiF6QGBZqD6ld/mLAs5gqZpMb0fC/FIKcaBZN8VulnhqeUTeiI9yxVNOYmyBenzsiFVYYkSrQthWSh/p7IaWzMNA5tZ0xxbFa9ufif18swuglyodIMuWLLRVEmCSZk/jcZCs0ZyqkllGlhbyVsTDVlaNOp2BD81ZfXSfvS9T3Xv7+qNepFHGU4g3Oogw/X0IA7aEILGIzgGV7hzZHOi/PufCxbS04xcwp/4Hz+AEQSjQg=AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBahp5CIoMeCF48V7Qe0oWy2m3bpZhN2J0IJ/QlePCji1V/kzX/jts1BWx8MPN6bYWZemEph0PO+ndLG5tb2Tnm3srd/cHhUPT5pmyTTjLdYIhPdDanhUijeQoGSd1PNaRxK3gknt3O/88S1EYl6xGnKg5iOlIgEo2ilB9d1B9Wa53oLkHXiF6QGBZqD6ld/mLAs5gqZpMb0fC/FIKcaBZN8VulnhqeUTeiI9yxVNOYmyBenzsiFVYYkSrQthWSh/p7IaWzMNA5tZ0xxbFa9ufif18swuglyodIMuWLLRVEmCSZk/jcZCs0ZyqkllGlhbyVsTDVlaNOp2BD81ZfXSfvS9T3Xv7+qNepFHGU4g3Oogw/X0IA7aEILGIzgGV7hzZHOi/PufCxbS04xcwp/4Hz+AEQ
Figure 4: Qualitative results (top 50%). For every three columns as a group: (Left) The RGB image with 2D detection results. (Middle) The RGB image with estimated perspective points. (Right) The results in 3D point cloud; the point cloud is used for visualization only.

3.4 2D-3D Consistency

In contrast to prior work [74, 79, 80, 35, 70, 36] that enforces the consistency between estimated 3D objects and the 2D image, we devise a new way to impose a re-projection consistency loss between 3D bounding boxes and perspective points. Specifically, we compute the 2D projected perspective points P_proj by projecting the corners of the 3D bounding box back to the 2D image plane, and compute the distance to the ground-truth perspective points: L_proj = MSE(P_proj, P_gt). Compared with prior work that maintains the consistency between 2D and 3D bounding boxes by approximating the 2D projection of the 3D bounding boxes [35, 36], the proposed method uses the exact projection of the 3D boxes to establish the consistency, capturing a more precise 2D-3D relationship.

4 Implementation Details

Network Backbone  Inspired by He et al. [33], we use a combination of a residual network (ResNet) [81] and a feature pyramid network (FPN) [82] to extract features from the entire image. A region proposal network (RPN) [32] is used to produce object proposals (i.e., RoIs).
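The re-projection consistency of § 3.4 reduces to projecting the 3D corners through the camera and taking an MSE. A minimal sketch follows, written in NumPy for brevity (the model uses differentiable tensor operations); the pinhole intrinsic matrix and the corner layout are our assumptions, not details given in the text:

```python
# Minimal sketch of the re-projection consistency loss L_proj:
# project 3D box corners onto the image plane and compare them with
# the ground-truth perspective points via MSE.
import numpy as np

def reprojection_loss(corners_3d, points_2d_gt, intrinsics):
    """corners_3d: (N, 8, 3) box corners in the camera frame (assumption).
    points_2d_gt: (N, 8, 2) ground-truth perspective points in pixels.
    intrinsics: (3, 3) pinhole camera matrix (assumption)."""
    proj = corners_3d @ intrinsics.T                            # homogeneous image coords
    proj = proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None)  # perspective divide
    return float(np.mean((proj - points_2d_gt) ** 2))           # MSE
```

Implemented with differentiable ops, the same function back-propagates through the predicted 3D box parameters, which is what ties the 3D branch to the 2D evidence.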
A RoIAlign [33] module is adopted to extract a smaller feature map (256 × 7 × 7) for each proposal.

Network Head  The network head consists of three branches, and each branch has its own feature extractor and predictor. The three feature extractors share the same architecture of two fully connected (FC) layers, each followed by a ReLU. The feature extractors take the 256 × 7 × 7 RoI features as input and output a 1024-dimensional vector.

The predictor in the 2D branch has two separate FC layers to predict C-dimensional object class probabilities and a C × 4 dimensional 2D bounding box offset. The predictor in the perspective point branch predicts C × K × 2 × 9 dimensional templates and C × K dimensional coefficients with two FC layers and their corresponding nonlinear activation functions (i.e., sigmoid and softmax). The soft gate in the 3D branch consists of an FC layer (1024-1) and a sigmoid function to generate the weight for feature aggregation. The predictor in the 3D branch consists of three FC layers to predict the size, the distance from the camera, and the orientation of the 3D bounding box.

Figure 5: Precision-Recall (PR) curves for 3D object detection on SUN RGB-D

5 Experiments

Dataset  We conduct comprehensive experiments on the SUN RGB-D [46] dataset, which has a total of 10,335 images, of which 5,050 are test images. It provides rich annotations of scene categories, camera poses, and 3D bounding boxes.
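The output dimensionalities of the network head described in § 4 can be sketched at a shape level as follows. This is an illustrative sketch only: the weights are random placeholders, each branch in the model has its own two-FC extractor, and the value of K (templates per category) is our assumption since it is not fixed in the text:

```python
# Shape-level sketch of the network head outputs (random placeholder
# weights; K is an assumed number of templates per category).
import numpy as np

rng = np.random.default_rng(0)
C, K = 30, 16                                    # 30 categories; K assumed
fc = lambda d_in, d_out: rng.standard_normal((d_in, d_out)) * 0.01

roi_feat = rng.standard_normal(256 * 7 * 7)      # flattened RoI feature
feat = np.maximum(roi_feat @ fc(256 * 7 * 7, 1024), 0.0)    # FC + ReLU

cls_scores = feat @ fc(1024, C)                  # object class scores
box_offset = feat @ fc(1024, C * 4)              # per-class 2D box offsets
templates = (feat @ fc(1024, C * K * 2 * 9)).reshape(C, K, 2, 9)
coeffs = feat @ fc(1024, C * K)                  # template coefficients
gate = 1.0 / (1.0 + np.exp(-(feat @ fc(1024, 1))))  # soft gate weight in (0, 1)
```

The soft gate produces a scalar in (0, 1) per RoI, which the model uses to weight the feature aggregation between the perspective point branch and the 3D branch.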
We evaluate the 3D object detection results of the proposed PerspectiveNet, compare with the state-of-the-art methods, and further examine the contribution of each module in ablative experiments.

Experimental Setup  To prepare valid data for training the proposed model, we discard images with no 3D objects or with incorrect correspondences between 2D and 3D bounding boxes, resulting in 4,783 training images and 4,220 test images. We detect 30 categories of objects following Huang et al. [36].

Reproducibility Details  During training, an RoI is considered positive if its IoU with a ground-truth box is at least 0.5. L_pp, L_p, L_3D, and L_proj are only defined on positive RoIs. Each image has N sampled RoIs, with a positive-to-negative ratio of 1:3 following the protocol presented in Girshick [76].

We resize the images so that the shorter edges are all 800 pixels. To avoid over-fitting, data augmentation is performed during training by randomly flipping the images or randomly shifting the 2D bounding boxes together with the corresponding labels. We use SGD for optimization with a batch size of 32 on a desktop with 4 Nvidia TITAN RTX cards (8 images per card). The learning rate starts at 0.01 and decays by 0.1 at 30,000 and 35,000 iterations. We implement our framework based on the code of Massa and Girshick [83]. Training takes 6 hours, and the trained PerspectiveNet runs inference in real time (20 FPS) on a single GPU.

Since the consistency loss and perspective loss can be substantial during the early stage of training, we add them to the joint loss once the learning rate has decayed twice. The hyper-parameters (e.g., the loss weights, the architecture of the network head) are tuned empirically by a local search.

Evaluation Metric  We evaluate the performance of 3D object detection using the metric presented in Song et al. [46].
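The step schedule stated in the training details above (start at 0.01, decay by 0.1 at 30,000 and 35,000 iterations) can be written as a one-line step function; the function name is ours:

```python
# Sketch of the stated schedule: lr = 0.01, multiplied by 0.1 at
# iterations 30,000 and 35,000.
def learning_rate(iteration, base_lr=0.01, gamma=0.1, steps=(30_000, 35_000)):
    return base_lr * gamma ** sum(iteration >= s for s in steps)
```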
Specifically, we first calculate the 3D Intersection over Union (IoU) between the predicted and ground-truth 3D bounding boxes, and then compute the mean average precision (mAP). Following Huang et al. [36], we set the 3D IoU threshold to 0.15 in the absence of depth information.

Qualitative Results  The qualitative results of 2D object detection, 2D perspective point estimation, and 3D object detection are shown in Figure 4. Note that the proposed method performs accurate 3D object detection in some challenging scenes. For the perspective point estimation, even when some of the perspective points are not aligned with image features, the proposed method can still localize their positions robustly.

Quantitative Results  Since the state-of-the-art method [36] learns the camera extrinsic parameters jointly, we provide two protocols for evaluation for a fair comparison: (i) PerspectiveNet given the ground-truth camera extrinsic parameters (full), and (ii) PerspectiveNet without the ground-truth camera extrinsic parameters, learning them jointly following [36] (w/o cam).

We learn the detector for 30 object categories and report the precision-recall (PR) curves of 10 main categories in Figure 5. We calculate the area under each curve to compute AP; Table 1 compares the APs of the proposed models with existing approaches (see supplementary materials for the APs of all 30 categories).

Figure 6: Heatmaps vs. templates for perspective point prediction. (Left) Estimated by the heatmap-based method. (Right) Estimated by the proposed template-based method.

Note that the critical difference between the proposed model and the state-of-the-art method [36] is the intermediate representation used to learn the 2D-3D consistency. Huang et al.
[36] use 2D bounding boxes to enforce 2D-3D consistency by minimizing the differences between projected 3D boxes and detected 2D boxes. In contrast, the proposed intermediate representation has a clear advantage: projected 3D boxes are often not 2D rectangles, and perspective points eliminate such errors. Quantitatively, our full model improves the mAP of the state-of-the-art method [36] by 14.71%, and the model without the camera extrinsic parameters improves it by 10.91%. This significant improvement in mAP demonstrates the efficacy of the proposed intermediate representation. We defer the analysis of how each component contributes to the overall performance to § 5.1.

5.1 Ablative Analysis

In this section, we analyze each major component of the model to examine its contribution to the overall performance gain. Specifically, we design six variants of the proposed model.
• S1: The model trained without the perspective point branch, using the 2D offset to predict the 3D center of the object following Huang et al. [36].
• S2: The model that aggregates the features from the perspective point branch and the 3D branch directly, without the gate function.
• S3: The model that aggregates the features from the perspective point branch and the 3D branch with a gate function that only outputs 0 or 1 (hard gate).
• S4: The model trained without the perspective loss.
• S5: The model trained without the consistency loss.
• S6: The model trained without the perspective branch, perspective loss, or consistency loss.
Table 2 shows the mAP for each variant of the proposed model. The mAP drops by 3.86% without the perspective point branch (S1) and by 1.66% without the consistency loss (S5), indicating that the perspective points and the re-projection consistency contribute the most to the proposed framework. In addition, the
In addition, the\nswitch of gate function (S2, S3) and perspective loss (S4) contribute less to the \ufb01nal performance.\nSince S6 is still higher than the state-of-the-art result [36] with 9.32%, we conjecture this performance\ngain may come from the one-stage (vs. two-stage) end-to-end training framework and the usage of\nground-truth camera parameter; we will further investigate this in future work.\n\n5.2 Heatmaps vs. Templates\n\nAs discussed in \u00a7 3.2, we test two different methods for the perspective point estimation: (i) dense\nprediction as heatmaps following the human pose estimation mechanism in He et al. [33] by adding\na parallel heatmap prediction branch, and (ii) template-based regression by the proposed method.\nThe qualitative results (see Figure 6) show that the heatmap-based estimation suffers severely from\nocclusion and topology change among the perspective points, whereas the proposed template-based\nregression eases the problem signi\ufb01cantly by learning robust sparse templates, capturing consistent\ntopological relations. We also evaluate the quantitative results by computing the average absolute\ndistance between the ground-truth and estimated perspective points. The heatmap-based method has\na 10.25 pix error, while the proposed method only has a 6.37 pix error, which further demonstrates\nthe ef\ufb01cacy of the proposed template-based perspective point estimation.\n\nTable 1: Comparisons of 3D object detection on SUN RGB-D (AP).\n\nsofa\n3.24\n\nchair\n2.31\n\ntable\nbed\n5.62\n1.23\n58.29 13.56 28.37 12.12\n63.58 17.12 41.22 26.21\n\n3DGP [49]\n1.29\nHoPR [38]\nCooP [36]\n3.01\nOurs (w/o. 
cam) 71.39 34.94 55.63 34.10 14.23 73.73 17.47 34.41 4.21\nOurs (full)\n\n14.01\n23.65\n34.96\n79.69 40.42 62.35 44.12 20.19 81.22 22.42 41.35 8.29 13.14 39.09\n\n16.50\n0.63\n58.55 10.19\n\n2.41\n1.75\n9.54\n\n2.18\n5.34\n\nsink\n\nshelf\n\nlamp mAP\n\ndesk\n\ntoilet\n\n-\n\n4.79\n9.55\n\nbin\n-\n\n-\n\n-\n\n-\n\n-\n\n-\n\nTable 2: Ablative analysis of the proposed model on SUN RGB-D. We evaluate the mAP for 3D object detection.\n\nSetting\nmAP\n\nS1\n35.23\n\nS2\n38.63\n\nS3\n38.87\n\nS4\n39.01\n\nS5\n37.43\n\nS6\n32.97\n\nFull\n39.09\n\n8\n\n\fFigure 7: Some failure cases. The perspective point estimation and the 3D box estimation fail at the same time.\n\n5.3 Failure Cases\n\nIn a large portion of the failure cases, the perspective point estimation and the 3D box estimation\nfail at the same time; see Figure 7. It implies that the perspective point estimation and the 3D box\nestimation are highly coupled, which supports the assumptions that the perspective points encode\nricher 3D information, and the 3D branch learns meaningful knowledge from the 2D branch. In\nfuture work, we may need a more sophisticated and general 3D prior to infer the 3D locations of\nobjects for such challenging cases.\n\n5.4 Discussions and Future Work\n\nComparison with optimization-based methods. Assume the estimated 3D size or distance is\ngiven, it is possible to compute the 3D bounding box with an optimization-based method like ef\ufb01cient\nPnP. However, the optimization-based methods are sensitive to the accuracy of the given known\nvariables. It is more suitable for tasks with smaller solution spaces (e.g., 6-DoF pose estimation where\nthe 3D shapes of objects are \ufb01xed). 
However, it would be difficult for tasks with larger solution spaces (e.g., 3D object detection, where the 3D size, distance, and object pose can vary significantly). Therefore, we argue that directly estimating each variable, with constraints imposed among them, is a more natural and straightforward solution.

Potential incorporation of depth information.  PerspectiveNet estimates the distance between the 3D object center and the camera center from the color image alone (pure RGB without any depth information). If depth information were also provided, the proposed method should be able to make much more accurate distance predictions.

Potential application to outdoor environments.  It would be interesting to see how the proposed method performs on outdoor 3D object detection datasets such as KITTI [84]. The differences between indoor and outdoor datasets for 3D object detection lie in various aspects, including the diversity of object categories, the variety of object dimensions, the severity of occlusion, the range of camera angles, and the range of distances (depth). We hope to adapt PerspectiveNet to outdoor scenarios in future work.

6 Conclusion

We propose PerspectiveNet, an end-to-end differentiable framework for 3D object detection from a single RGB image. It uses perspective points as an intermediate representation between the 2D input and the 3D estimations. PerspectiveNet adopts an R-CNN structure, in which region-wise branches predict 2D boxes, perspective points, and 3D boxes. Instead of using a direct regression
Instead of using a direct regression\nof 2D-3D relations, we further propose a template-based regression for estimating the perspective\npoints, which enforces a better consistency between the predicted 3D boxes and the 2D image input.\nThe experiments show that the proposed method signi\ufb01cantly improves existing RGB-based methods.\n\nAcknowledgments This work reported herein is supported by MURI ONR N00014-16-1-2007,\nDARPA XAI N66001-17-2-4029, ONR N00014-19-1-2153, and an NVIDIA GPU donation grant.\n\n9\n\n\fReferences\n\n[1] David Marr. Vision: A computational investigation into the human representation and processing\n\nof visual information. WH Freeman, 1982.\n\n[2] Bela Julesz. Visual pattern discrimination. IRE transactions on Information Theory, 8(2):84\u201392,\n\n1962.\n\n[3] Song Chun Zhu, Yingnian Wu, and David Mumford. Filters, random \ufb01elds and maximum\nInternational Journal of\n\nentropy (frame): Towards a uni\ufb01ed theory for texture modeling.\nComputer Vision (IJCV), 27(2):107\u2013126, 1998.\n\n[4] Bela Julesz. Textons, the elements of texture perception, and their interactions. Nature, 290\n\n(5802):91, 1981.\n\n[5] Song-Chun Zhu, Cheng-En Guo, Yizhou Wang, and Zijian Xu. What are textons? International\n\nJournal of Computer Vision (IJCV), 62(1-2):121\u2013143, 2005.\n\n[6] Cheng-en Guo, Song-Chun Zhu, and Ying Nian Wu. Towards a mathematical theory of primal\n\nsketch and sketchability. In International Conference on Computer Vision (ICCV), 2003.\n\n[7] Cheng-en Guo, Song-Chun Zhu, and Ying Nian Wu. Primal sketch: Integrating structure and\n\ntexture. Computer Vision and Image Understanding (CVIU), 106(1):5\u201319, 2007.\n\n[8] Mark Nitzberg and David Mumford. The 2.1-d sketch. In ICCV, 1990.\n[9] John YA Wang and Edward H Adelson. Layered representation for motion analysis.\n\nConference on Computer Vision and Pattern Recognition (CVPR), 1993.\n\nIn\n\n[10] John YA Wang and Edward H Adelson. Representing moving images with layers. 
Transactions\n\non Image Processing (TIP), 3(5):625\u2013638, 1994.\n\n[11] David Marr and Herbert Keith Nishihara. Representation and recognition of the spatial orga-\nnization of three-dimensional shapes. Proceedings of the Royal Society of London. Series B.\nBiological Sciences, 200(1140):269\u2013294, 1978.\n\n[12] I Binford. Visual perception by computer. In IEEE Conference of Systems and Control, 1971.\n[13] Rodney A Brooks. Symbolic reasoning among 3-d models and 2-d images. Arti\ufb01cial Intelligence,\n\n17(1-3):285\u2013348, 1981.\n\n[14] Takeo Kanade. Recovery of the three-dimensional shape of an object from a single view.\n\nArti\ufb01cial intelligence, 17(1-3):409\u2013460, 1981.\n\n[15] Donald Broadbent. A question of levels: Comment on McClelland and Rumelhart. American\n\nPsychological Association, 1985.\n\n[16] Max Wertheimer. Experimentelle studien uber das sehen von bewegung [experimental studies\n\non the seeing of motion]. Zeitschrift fur Psychologie, 61:161\u2013265, 1912.\n\n[17] Johan Wagemans, James H Elder, Michael Kubovy, Stephen E Palmer, Mary A Peterson,\nManish Singh, and R\u00fcdiger von der Heydt. A century of gestalt psychology in visual perception:\nI. perceptual grouping and \ufb01gure\u2013ground organization. Psychological bulletin, 138(6):1172,\n2012.\n\n[18] Johan Wagemans, Jacob Feldman, Sergei Gepshtein, Ruth Kimchi, James R Pomerantz, Peter A\nVan der Helm, and Cees Van Leeuwen. A century of gestalt psychology in visual perception: Ii.\nconceptual and theoretical foundations. Psychological bulletin, 138(6):1218, 2012.\n\n[19] Wolfgang K\u00f6hler. Die physischen Gestalten in Ruhe und im station\u00e4renZustand. Eine natur-\nphilosophische Untersuchung [The physical Gestalten at rest and in steady state]. Braunschweig,\nGermany: Vieweg und Sohn., 1920.\n\n[20] Wolfgang K\u00f6hler. Physical gestalten. 
In A source book of Gestalt psychology, pages 17–54. London, England: Routledge & Kegan Paul, 1938.

[21] Max Wertheimer. Untersuchungen zur lehre von der gestalt, ii. [Investigations in Gestalt theory: II. Laws of organization in perceptual forms]. Psychologische Forschung, 4:301–350, 1923.

[22] Max Wertheimer. Laws of organization in perceptual forms. In A source book of Gestalt psychology, pages 71–94. London, England: Routledge & Kegan Paul, 1938.

[23] Kurt Koffka. Principles of Gestalt psychology. Routledge, 2013.

[24] David Lowe. Perceptual organization and visual recognition, volume 5. Springer Science & Business Media, 2012.

[25] Alex P Pentland. Perceptual organization and the representation of natural form. In Readings in Computer Vision, pages 680–699. Elsevier, 1987.

[26] David Waltz. Understanding line drawings of scenes with shadows. In The psychology of computer vision, 1975.

[27] Harry G Barrow and Jay M Tenenbaum. Interpreting line drawings as three-dimensional surfaces. Artificial Intelligence, 17(1-3):75–116, 1981.

[28] David G Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31(3):355–395, 1987.

[29] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.

[30] James M Coughlan and Alan L Yuille. Manhattan world: Orientation and outlier detection by bayesian inference. Neural Computation, 2003.

[31] James M Coughlan and Alan L Yuille. Manhattan world: Compass direction from a single image by bayesian inference. In Conference on Computer Vision and Pattern Recognition (CVPR), 1999.

[32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks.
In Advances in Neural Information Processing\nSystems (NeurIPS), 2015.\n\n[33] Kaiming He, Georgia Gkioxari, Piotr Doll\u00e1r, and Ross Girshick. Mask r-cnn. In International\n\nConference on Computer Vision (ICCV), 2017.\n\n[34] Xiaozhi Chen, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel Urtasun.\nMonocular 3d object detection for autonomous driving. In Conference on Computer Vision and\nPattern Recognition (CVPR), 2016.\n\n[35] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Ko\u0161eck\u00e1. 3d bounding box\nestimation using deep learning and geometry. In Conference on Computer Vision and Pattern\nRecognition (CVPR), 2017.\n\n[36] Siyuan Huang, Siyuan Qi, Yinxue Xiao, Yixin Zhu, Ying Nian Wu, and Song-Chun Zhu. Coop-\nerative holistic scene understanding: Unifying 3d object, layout, and camera pose estimation.\nIn Advances in Neural Information Processing Systems (NeurIPS), 2018.\n\n[37] Abhijit Kundu, Yin Li, and James M Rehg. 3d-rcnn: Instance-level 3d object reconstruction\nvia render-and-compare. In Conference on Computer Vision and Pattern Recognition (CVPR),\n2018.\n\n[38] Siyuan Huang, Siyuan Qi, Yixin Zhu, Yinxue Xiao, Yuanlu Xu, and Song-Chun Zhu. Holistic\n3d scene parsing and reconstruction from a single rgb image. In European Conference on\nComputer Vision (ECCV), 2018.\n\n[39] Shunyu Yao, Tzu Ming Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, Bill Freeman, and\nJosh Tenenbaum. 3d-aware scene manipulation via inverse graphics. In Advances in Neural\nInformation Processing Systems (NeurIPS), 2018.\n\n[40] Tong He and Stefano Soatto. Mono3d++: Monocular 3d vehicle detection with two-scale 3d\n\nhypotheses and task priors. arXiv preprint arXiv:1901.03446, 2019.\n\n[41] Jianxiong Xiao and Yasutaka Furukawa. Reconstructing the world\u2019s museums. International\n\nJournal of Computer Vision (IJCV), 2014.\n\n[42] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose\n\nestimation. 
In European Conference on Computer Vision (ECCV), 2016.

[43] Chen-Yu Lee, Vijay Badrinarayanan, Tomasz Malisiewicz, and Andrew Rabinovich. RoomNet: End-to-end room layout estimation. In International Conference on Computer Vision (ICCV), 2017.

[44] Chuhang Zou, Alex Colburn, Qi Shan, and Derek Hoiem. LayoutNet: Reconstructing the 3D room layout from a single RGB image. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[45] Supasorn Suwajanakorn, Noah Snavely, Jonathan J Tompson, and Mohammad Norouzi. Discovery of latent 3D keypoints via end-to-end geometric reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[46] Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[47] Yibiao Zhao and Song-Chun Zhu. Image parsing with stochastic scene grammar. In Advances in Neural Information Processing Systems (NeurIPS), 2011.

[48] Yibiao Zhao and Song-Chun Zhu. Scene parsing by integrating function, geometry and appearance models. In Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[49] Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. Understanding indoor scenes using 3D geometric phrases. In Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[50] Dahua Lin, Sanja Fidler, and Raquel Urtasun. Holistic scene understanding for 3D object detection with RGB-D cameras. In International Conference on Computer Vision (ICCV), 2013.

[51] Yinda Zhang, Shuran Song, Ping Tan, and Jianxiong Xiao. PanoContext: A whole-room 3D context model for panoramic scene understanding. In European Conference on Computer Vision (ECCV), 2014.

[52] Hamid Izadinia, Qi Shan, and Steven M Seitz. IM2CAD. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[53] Bin Xu and Zhenzhong Chen.
Multi-level fusion based 3D object detection from monocular images. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[54] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3D object proposals for accurate object class detection. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

[55] Julian Straub, Guy Rosman, Oren Freifeld, John J Leonard, and John W Fisher. A mixture of Manhattan frames: Beyond the Manhattan world. In International Conference on Computer Vision (ICCV), 2014.

[56] Grant Schindler and Frank Dellaert. Atlanta world: An expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments. In Conference on Computer Vision and Pattern Recognition (CVPR), 2004.

[57] Till Kroeger, Dengxin Dai, and Luc Van Gool. Joint vanishing point extraction and tracking. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[58] Michael Bosse, Richard Rikoski, John Leonard, and Seth Teller. Vanishing points and three-dimensional lines from omni-directional video. The Visual Computer, 2003.

[59] Julian Straub, Nishchal Bhandari, John J Leonard, and John W Fisher. Real-time Manhattan world rotation estimation in 3D. In International Conference on Intelligent Robots and Systems (IROS), 2015.

[60] Bernard Ghanem, Ali Thabet, Juan Carlos Niebles, and Fabian Caba Heilbron. Robust Manhattan frame estimation from a single RGB-D image. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[61] Varsha Hedau, Derek Hoiem, and David Forsyth. Recovering the spatial layout of cluttered rooms. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[62] David C Lee, Martial Hebert, and Takeo Kanade. Geometric reasoning for single image structure recovery.
In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[63] Varsha Hedau, Derek Hoiem, and David Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In European Conference on Computer Vision (ECCV), 2010.

[64] Alexander G Schwing, Tamir Hazan, Marc Pollefeys, and Raquel Urtasun. Efficient structured prediction for 3D indoor scene understanding. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[65] Erick Delage, Honglak Lee, and Andrew Y Ng. Automatic single-image 3D reconstructions of indoor Manhattan world scenes. In Robotics Research, pages 305–321. Springer, 2007.

[66] Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. Manhattan-world stereo. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[67] Jianxiong Xiao, James Hays, Bryan C Russell, Genevieve Patterson, Krista Ehinger, Antonio Torralba, and Aude Oliva. Basic level scene understanding: Categories, attributes and structures. Frontiers in Psychology, 4:506, 2013.

[68] Zhile Ren and Erik B Sudderth. Three-dimensional object detection and layout prediction using clouds of oriented gradients. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[69] Xiaobai Liu, Yibiao Zhao, and Song-Chun Zhu. Single-view 3D scene reconstruction and parsing by attribute grammar. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(3):710–725, 2017.

[70] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[71] Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum, and William T. Freeman. Visual object networks: Image generation with disentangled 3D representations.
In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[72] Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Joshua B Tenenbaum, William T Freeman, and Jiajun Wu. Learning to reconstruct shapes from unseen classes. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[73] Shubham Tulsiani, Saurabh Gupta, David Fouhey, Alexei A Efros, and Jitendra Malik. Factoring shape, pose, and layout from the 2D image of a 3D scene. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[74] Jiajun Wu, Tianfan Xue, Joseph J Lim, Yuandong Tian, Joshua B Tenenbaum, Antonio Torralba, and William T Freeman. Single image 3D interpreter network. In European Conference on Computer Vision (ECCV), 2016.

[75] Bugra Tekin, Sudipta N Sinha, and Pascal Fua. Real-time seamless single shot 6D object pose prediction. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[76] Ross Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.

[77] Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, 1996.

[78] Ying Nian Wu, Zhangzhang Si, Haifeng Gong, and Song-Chun Zhu. Learning active basis model for object detection and recognition. International Journal of Computer Vision (IJCV), 90(2):198–235, 2010.

[79] Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3D structure from images. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[80] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[81] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[82] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[83] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.

[84] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 32(11):1231–1237, 2013.