NeurIPS 2020

RepPoints v2: Verification Meets Regression for Object Detection


Review 1

Summary and Contributions: This paper introduces verification (classification) tasks into the RepPoints framework. The specific contributions include modeling the verification tasks with auxiliary side branches and combining them with the main detector at both the feature level and the inference stage. The final single model obtains 52.1 mAP on COCO test-dev, and the approach also shows effectiveness on instance segmentation.

Strengths: The paper provides a detailed description of the proposed methods, including implementation details and evaluation code. Extensive experiments also demonstrate the effectiveness of the methods. The paper addresses object detection, a fundamental task in computer vision with high relevance to the NeurIPS community.

Weaknesses: The novelty of this paper is limited. The idea of incorporating auxiliary tasks into an object detection framework to enjoy the improvements brought by multi-task learning is not new. Although the authors state that their method does not require additional annotations while Mask R-CNN does, this difference is rather subtle and not very innovative. There are also several works that use keypoints to assist object detection, such as CentripetalNet [1] and RetinaFace [2]. References: [1] Dong Z, Li G, Liao Y, et al. CentripetalNet: Pursuing high-quality keypoint pairs for object detection. CVPR 2020. [2] Deng J, Guo J, Zhou Y, et al. RetinaFace: Single-stage dense face localisation in the wild.

Correctness: The proposed method sounds effective, and the experiments support it well.

Clarity: The paper is well written and easy to read.

Relation to Prior Work: The paper clearly discusses the difference between RepPoints v1 and v2, but the differences from other similar works need to be described in more detail.

Reproducibility: Yes

Additional Feedback: The rebuttal partly addresses my concerns; I would like to raise my score to 6.


Review 2

Summary and Contributions: This paper proposes introducing verification tasks into regression-based object detectors. The authors discuss two kinds of verification tasks, corner point verification and within-box foreground verification, and introduce several fusion methods to combine the advantages of verification and regression.

Strengths:
1. The paper is well written and easy to follow for readers who are familiar with RepPoints.
2. The analysis of the differences between verification-based and regression-based methods is very interesting.
3. The paper introduces a joint inference method that combines the advantages of verification and regression, which yields a significant performance improvement.

Weaknesses:
1. It would be better to add more description of RepPoints. It is hard for readers to follow the implementation details in Sec. 3.4 if they are not familiar with RepPoints. What is the meaning of "..., such that the first two points explicitly represent the top-left and bottom-right corner points"? In my understanding, the authors still predict n sample points following RepPoints, but they adopt the first two points for the point-to-bbox transformation. I suggest the authors replace "the first two points" with "the first two points in point sets R and R'". Defining the RepPoints loss in the main paper instead of the appendix would also make the implementation easier to understand.
2. Within-box foreground verification plays a similar role to the centerness branch proposed by FCOS. The authors should add a discussion and comparison.
3. The major improvement at higher IoU thresholds is obtained by the corner point head. It would be better if the authors could explain why the corner-point-based method yields more accurate localization.
4. Corner point verification and joint inference bring significant improvements at higher IoU thresholds. However, as shown in Table 1, CornerNet still outperforms RepPoints v2 by 4.9 mAP at AP90. I guess this is because CornerNet adopts high-resolution feature maps to obtain more accurate results. Could the authors explain why adopting a high-resolution feature map for corner point verification does not yield better performance?
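To make my reading of point 1 concrete, here is a rough sketch of adopting only the first two points of a predicted point set for the point-to-bbox transformation. This is my own interpretation, not the paper's code; the function name and data layout are hypothetical:

```python
def points_to_bbox(points):
    """Convert a RepPoints point set to a box using only its first two
    points, which (in my reading of the explicit-corners variant) are
    trained to be the top-left and bottom-right corners.

    points: list of (x, y) pairs; the remaining n - 2 points are
    ignored by the box transform under this interpretation.
    """
    (x1, y1), (x2, y2) = points[0], points[1]
    # Order the coordinates so the box stays valid even if the two
    # predicted points are not perfectly ordered.
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
```

Under this reading, the other n - 2 points still shape the features but do not enter the box transform, which is exactly what I would like the authors to state explicitly.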

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: See above. ========================= The rebuttal addresses most of my concerns; I will keep my rating as weak accept.


Review 3

Summary and Contributions: The paper proposes to use verification branches (i.e., heatmap regression/classification) to improve the performance of regression-based object detectors such as RepPoints and FCOS. The knowledge learned by the verification branches is fused into the detectors at both the feature level and the result level. The proposed method consistently improves RepPoints by about 2 mAP and FCOS by more than 1 AP on COCO.

Strengths:
1) The proposed method of using verification to improve regression-based detectors sounds reasonable. I agree that the verification and regression tasks are complementary, as shown in the paper.
2) The proposed method consistently improves the performance of previous regression-based detectors, which shows its effectiveness.
3) It also achieves state-of-the-art performance on MS-COCO.

Weaknesses:
1) It is more common and accurate to use the term "heatmap classification/regression" rather than "verification".
2) It is not suitable to call the work RepPoints v2, since the main contribution is adding extra heatmap classification tasks to improve the precision of regression. Moreover, if I understand the paper correctly, in L201 the paper only makes use of two corners to represent an object (i.e., the explicit-corners variant). Using explicit corners conflicts with the motivation of the original RepPoints paper, which attempts to use multiple points to represent an object.
3) In addition, the hyperparameter r controls the search range for refinement. Please provide detailed ablations on this hyperparameter.

Correctness: Yes.

Clarity: The writing is good but it can be improved.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: The authors address my questions in the rebuttal, so I keep my score.


Review 4

Summary and Contributions: This paper analyzes the verification and regression methodologies in current object detection systems and proposes RepPoints v2. The authors incorporate anchor-based and point-based verification tasks as auxiliary tasks to facilitate the training of RepPoints. Joint inference with the auxiliary tasks boosts performance further. The model obtains about 2.0 mAP improvement on the COCO test-dev benchmark. The authors also demonstrate generalization to other detectors and to instance segmentation.

Strengths:
1. The paper is well written and easy to follow.
2. The experiments are thorough and verify the effectiveness of the proposed method.
3. Overall, I think this work is a great complement to RepPoints. RepPoints v2 remedies the drawbacks of direct regression in RepPoints and boosts performance. As noted in the literature, almost all other detectors regress the localization targets over several steps, while RepPoints does one-shot regression, which is inferior to the multi-step methods; this paper proposes methods to remedy that drawback. The motivation is reasonable and the problem is tackled with good methods.

Weaknesses:
1. The description of joint inference is not very clear. I cannot work out how the refinement process operates according to Eq. 2. It would be great if the authors could clarify this in the rebuttal and polish this part in the final version.
2. I have some doubts about the definitions in Table 1. What is the difference between anchor-based regression and the regression in RepPoints? In RetinaNet there is also only one-shot regression, and ATSS has shown that the regression parameterization does not matter much: directly regressing [w, h] relative to the center point is good enough, while RepPoints regresses distances from feature-map locations. I think there is no obvious difference between the two methods, and I hope the authors can clarify this. If not, the motivation here is not solid enough.
3. It would be great if the authors could analyze the computational cost and inference speed of the proposed method.
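To make point 1 concrete, here is my guess at what the refinement in Eq. 2 amounts to: a sketch under my own assumption that the regressed corner is snapped to the best verification score within radius r. The paper's actual equation may combine the two branches differently, and the function below is entirely hypothetical:

```python
import numpy as np

def refine_corner(corner_xy, heatmap, r):
    """Snap a regressed corner to the highest-scoring location of the
    corner-verification heatmap within a (2r + 1) x (2r + 1) window.

    corner_xy: integer (x, y) from the regression branch.
    heatmap:   2D array of corner scores, indexed heatmap[y, x].
    r:         search radius (the hyperparameter r in the paper).
    """
    x, y = corner_xy
    h, w = heatmap.shape
    # Clip the search window to the feature-map boundaries.
    x0, x1 = max(x - r, 0), min(x + r + 1, w)
    y0, y1 = max(y - r, 0), min(y + r + 1, h)
    window = heatmap[y0:y1, x0:x1]
    # Pick the argmax inside the window and map it back to map coords.
    dy, dx = np.unravel_index(np.argmax(window), window.shape)
    return (int(x0 + dx), int(y0 + dy))
```

If this reading is correct, r trades off trusting the regressed location against trusting the heatmap peak, which is why a detailed ablation on it would be informative.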

Correctness: Yes.

Clarity: Yes, the paper is well written.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Comments after author feedback: The paper receives positive comments and the rebuttal addresses them, so I keep my original rating to accept the paper. However, I hope the authors can polish the academic writing of the paper, especially the technical details.