Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
- The novelty of this work lies in the insight that perspective points can serve as a better geometric constraint for 3D object detection when used as an intermediate representation, as opposed to the 2D bounding boxes used in prior work. Although such a representation has been used in applications such as pose estimation, it is still interesting to see it applied to 3D detection.
- Also, in this work, the perspective points are inferred as a mixture of sparse templates instead of dense heatmap-based predictions for each point independently. An ablation study shows this to be more effective. The perspective loss is defined under the assumption that each 3D bounding box aligns with a local Manhattan frame. These design choices have been presented well, with clear writing and related work, and are supported by experiments.
The template-based prediction is novel, the writing is clear and the overall performance is significant. Other detailed comments:
(1) It would be better to clarify the design details of the perspective templates. Do templates in the same class have different poses? If so, what are the poses?
(2) Sec. 5.2, heatmaps vs. templates. The authors compare with the keypoint prediction baseline of Mask-RCNN, which uses the same backbone as the proposed method. I'm wondering whether the worse performance of the heatmap-based approach is due to the backbone network design. For example, what is the performance if an hourglass network [A] (which is commonly used for heatmap prediction) is used instead?
[A] Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." European Conference on Computer Vision. Springer, Cham, 2016.
Originality: To the best of my knowledge, using projected 3D bounding box corners as an intermediate representation is a novel idea. Moreover, it is much more intuitive and natural than previous approaches. The related work is very well cited, which makes the paper more informative.
Quality: The paper is technically sound. By introducing projected perspective points, this work achieves state-of-the-art 3D detection results on a challenging dataset. However, several ambiguities arise in the experiment section, which leave some important details unclear.
1. How are the templates defined? The templates appear to be defined per class, but throughout the paper there is little description of how they are derived. Are they hand-crafted or derived from the statistics of the dataset? Or are they optimizable, so that they are learnt during training?
2. Although it is nice to introduce intermediate representations, it seems less intuitive for such a representation to be class-conditioned. According to David Marr, as cited in the paper, an intermediate representation should come from lower-level signals and then form a higher-level representation or reasoning. For example, edges and depth come from images and then form the notion of objects or classes. Adding a higher-level constraint to such a representation (a hard constraint, as the templates are class-conditioned) seems less aligned with Marr's theory. Therefore, the question is: what would happen if this representation were class-agnostic?
3. Is the 3D bbox branch really necessary? If the perspective points are given, then it is possible to directly obtain an MSE estimate of the 3D bbox that minimizes the projection error. For example, as the 3D projection is template-based, the correspondence between the projection and a 3D cube is automatically given. It is then possible to formulate the problem as perspective-n-point (PnP) and solve the linear system via efficient PnP. Or, one can directly treat the rotation angles, distance and scales as unknown variables and solve a nonlinear least-squares problem to estimate the 3D bbox. Therefore, it is not clear why the 3D bbox branch needs to be introduced in the first place.
Clarity: The paper is well written and easy to read, though it would be better if the details mentioned above were made clearer.
Significance: I think predicting 3D properties from their projections is the right way to go and is a direction that the 3D vision community needs to hear more about. I believe future researchers should become more comfortable with this concept and use it as their default setup.
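As a rough illustration of point 3 under Quality (recovering the 3D box directly from the perspective points), the sketch below solves a deliberately reduced version of that problem. It is not the authors' method, and every name in it (`box_corners`, `yaw_matrix`, `project`, `solve_translation`, the intrinsics `K`) is hypothetical. It assumes a pinhole camera, a yaw-only rotation, and known box dimensions and yaw, so that only the translation is unknown and the reprojection constraints become linear. The dimensions are fixed because, under a pinhole model, jointly scaling the box and its distance leaves the projection unchanged, so a direct solve would presumably need some size prior.

```python
import numpy as np

def box_corners(dims):
    """Eight corners of a box centred at the origin; dims = (size_x, size_y, size_z)."""
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                     dtype=float)
    return 0.5 * signs * np.asarray(dims, dtype=float)

def yaw_matrix(yaw):
    """Rotation about the camera y-axis (assumed to be the vertical axis)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def project(points_cam, K):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixel coordinates."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def solve_translation(points_2d, corners_rot, K):
    """Least-squares translation given 2D corners and rotated 3D corners.

    With normalised image coordinates (x, y) and rotated corner (X, Y, Z):
        x * (Z + tz) = X + tx   ->   tx - x * tz = x * Z - X
        y * (Z + tz) = Y + ty   ->   ty - y * tz = y * Z - Y
    """
    n = len(points_2d)
    homog = np.hstack([points_2d, np.ones((n, 1))])
    rays = homog @ np.linalg.inv(K).T          # rows are (x, y, 1)
    x, y = rays[:, 0], rays[:, 1]
    X, Y, Z = corners_rot.T
    A = np.zeros((2 * n, 3))
    b = np.zeros(2 * n)
    A[:n, 0], A[:n, 2], b[:n] = 1.0, -x, x * Z - X
    A[n:, 1], A[n:, 2], b[n:] = 1.0, -y, y * Z - Y
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t

# Synthetic check: project a known box, then recover its translation.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])          # assumed intrinsics
dims = (2.0, 1.0, 1.0)                   # assumed known box size
yaw = 0.3                                # assumed known orientation
t_true = np.array([0.5, 0.2, 4.0])

corners_rot = box_corners(dims) @ yaw_matrix(yaw).T
points_2d = project(corners_rot + t_true, K)
t_est = solve_translation(points_2d, corners_rot, K)
reproj_err = np.abs(project(corners_rot + t_est, K) - points_2d).max()
print("estimated translation:", t_est, "max reprojection error:", reproj_err)
```

With noisy predicted corners, or with yaw treated as unknown, the problem is no longer linear and would need an iterative solver. That may be one practical argument for keeping a learned 3D branch, although the review leaves the question open.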